Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:

> > Right, so if we are ignoring the node then the simplest thing to do is
> > to not deactivate the current cpu slab but to take an object from it.
>
> Ok, that's what Anton's patch does, I believe. Are you ok with that
> patch as it is?

No. Again, his patch only works if the node is memoryless, not if there
are other issues that prevent allocation from that node.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason) then I
> > would recommend for simplicity's sake to change the target node to
> > NUMA_NO_NODE and just take whatever is in the current cpu slab. A
> > more complex solution would be to look through partial lists in
> > increasing distance to find a partially used slab that is reasonably
> > close to the current node. Slab has logic like that in
> > fallback_alloc(). Slub's get_any_partial() function does something
> > close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fall back to the current cpu slab, but I'm
> not sure any NUMA_NO_NODE change is necessary there. At the point we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere, it's just not on the node we
> expected to be on.

Right, so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This is not an easy thing to do. If the current slab is not on the right
node but is on the node from which the page allocator would be returning
memory, then the current slab can still be allocated from. If the
fallback is to another node, then the current cpu slab needs to be
deactivated and the allocation from that node needs to proceed.

Have a look at fallback_alloc() in the slab allocator. An allocation
attempt from the page allocator can be restricted to a specific node
through GFP_THISNODE.
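The distance-ordered search Christoph describes can be modeled in a few
lines of user-space C. The distance table and partial-list counts below
are invented stand-ins for the kernel's node_distance() and per-node
partial lists, so this is a sketch of the idea, not the slub code:

#include <stdio.h>

#define NR_NODES 4

/* Assumed SLIT-style distances: lower means closer. */
static const int node_distance_tbl[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/* Assumed number of partial slabs currently sitting on each node. */
static int nr_partial[NR_NODES] = { 0, 3, 0, 1 };

/* Pick the node closest to the requested one that has a partial slab. */
static int get_any_partial_model(int node)
{
	int best = -1, best_dist = 0, nid;

	for (nid = 0; nid < NR_NODES; nid++) {
		if (!nr_partial[nid])
			continue;
		if (best < 0 || node_distance_tbl[node][nid] < best_dist) {
			best = nid;
			best_dist = node_distance_tbl[node][nid];
		}
	}
	return best;	/* -1: nothing partial anywhere, allocate a new slab */
}

int main(void)
{
	printf("fallback for node 0 -> node %d\n", get_any_partial_model(0));
	printf("fallback for node 2 -> node %d\n", get_any_partial_model(2));
	return 0;
}

Run on the table above, a request for node 0 lands on node 1 (distance
20) rather than node 3 (distance 40), which is the kind of ordering
fallback_alloc() implements for slab.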
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > Thank you for clarifying and providing a test patch. I ran with this
> > on the system showing the original problem, configured to have 15GB
> > of memory.
> >
> > With your patch after boot:
> >
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> >
> > With Anton's patch after boot:
> >
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> >
> > I know that's fairly unscientific, but the numbers are reproducible.
>
> I don't think the goal of the discussion is to reduce the amount of
> slab allocated, but rather get the most local slab memory possible by
> use of kmalloc_node(). When a memoryless node is being passed to
> kmalloc_node(), which is probably cpu_to_node() for a cpu bound to a
> node without memory, my patch is allocating it on the most local node;
> Anton's patch is allocating it on whatever happened to be the cpu slab.
>
> diff --git a/mm/slub.c b/mm/slub.c
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2278,10 +2278,14 @@ redo:
>
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		if (unlikely(!node_present_pages(node)))
> +			node = numa_mem_id();
> +		if (!node_match(page, node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}
>
> > Semantically, and please correct me if I'm wrong, this patch is
> > saying if we have a memoryless node, we expect the page's locality to
> > be that of numa_mem_id(), and we still deactivate the slab if that
> > isn't true. Just wanting to make sure I understand the intent.
>
> Yeah, the default policy should be to fallback to local memory if the
> node passed is memoryless.
>
> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every
> > page should be coming from node 1 (thus node_match() should always be
> > true?)
>
> The nice thing about slub is its debugging ability, what is
> /sys/kernel/slab/<cache>/objects showing in comparison between the two
> patches?

Ok, I finally got around to writing a script that compares the objects
output from both kernels. log1 is with CONFIG_HAVE_MEMORYLESS_NODES on,
my kthread locality patch and Joonsoo's patch. log2 is with
CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch and Anton's
patch.

slab                 objects   objects   percent
                     log1      log2      change
------------------------------------------------
:t-104               71190     85680     20.353982 %
UDP                  4352      3392      22.058824 %
inode_cache          54302     41923     22.796582 %
fscache_cookie_jar   3276      2457      25.00 %
:t-896               438       292       33.33 %
:t-080               310401    195323    37.073978 %
ext4_inode_cache     335       201       40.00 %
:t-192               89408     128898    44.168307 %
:t-184               151300    81880     45.882353 %
:t-512               49698     73648     48.191074 %
:at-192              242867    120948    50.199904 %
xfs_inode            34350     15221     55.688501 %
:t-0016384           11005     17257     56.810541 %
proc_inode_cache     103868    34717     66.575846 %
tw_sock_TCP          768       256       66.67 %
:t-0004096           15240     25672     68.451444 %
nfs_inode_cache      1008      315       68.75 %
:t-0001024           14528     24720     70.154185 %
:t-0032768           655       1312      100.305344 %
:t-0002048           14242     30720     115.700042 %
:t-640               1020      2550      150.00 %
:t-0008192           10005     27905     178.910545 %

FWIW, the configuration of this LPAR has slightly changed. It is now
configured for maximally 400 CPUs, of which 200 are present. The result
is that even with Joonsoo's patch (log1 above), we OOM pretty easily and
Anton's slab usage script reports:

slab                  mem      objs    slabs
                      used     active  active
----------------------------------------------
kmalloc-512           1182 MB   2.03%  100.00%
kmalloc-192           1182 MB
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 05.02.2014 [13:28:03 -0600], Christoph Lameter wrote:
> On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:
> > > If the target node allocation fails (for whatever reason) then I
> > > would recommend for simplicity's sake to change the target node to
> > > NUMA_NO_NODE and just take whatever is in the current cpu slab. A
> > > more complex solution would be to look through partial lists in
> > > increasing distance to find a partially used slab that is
> > > reasonably close to the current node. Slab has logic like that in
> > > fallback_alloc(). Slub's get_any_partial() function does something
> > > close to what you want.
> >
> > I apologize for my own ignorance, but I'm having trouble following.
> > Anton's original patch did fall back to the current cpu slab, but I'm
> > not sure any NUMA_NO_NODE change is necessary there. At the point
> > we're deactivating the slab (in the current code, in __slab_alloc()),
> > we have successfully allocated from somewhere, it's just not on the
> > node we expected to be on.
>
> Right, so if we are ignoring the node then the simplest thing to do is
> to not deactivate the current cpu slab but to take an object from it.

Ok, that's what Anton's patch does, I believe. Are you ok with that
patch as it is?

> > So perhaps you are saying to make a change lower in the code? I'm not
> > sure where it makes sense to change the target node in that case. I'd
> > appreciate any guidance you can give.
>
> This is not an easy thing to do. If the current slab is not on the
> right node but is on the node from which the page allocator would be
> returning memory, then the current slab can still be allocated from. If
> the fallback is to another node, then the current cpu slab needs to be
> deactivated and the allocation from that node needs to proceed.
>
> Have a look at fallback_alloc() in the slab allocator. An allocation
> attempt from the page allocator can be restricted to a specific node
> through GFP_THISNODE.

Thanks for the pointers, I will try and take a look.

Thanks,
Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> the $SUBJECT issue.

Hmmm... I am not sure that this is a general solution. The fallback to
other nodes can occur not only because a node has no memory, as his
patch assumes.

If the target node allocation fails (for whatever reason) then I would
recommend for simplicity's sake to change the target node to
NUMA_NO_NODE and just take whatever is in the current cpu slab.

A more complex solution would be to look through partial lists in
increasing distance to find a partially used slab that is reasonably
close to the current node. Slab has logic like that in fallback_alloc().
Slub's get_any_partial() function does something close to what you want.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 04.02.2014 [14:39:32 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
> > Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> > the $SUBJECT issue.
>
> Hmmm... I am not sure that this is a general solution. The fallback to
> other nodes can occur not only because a node has no memory, as his
> patch assumes.

Thanks, Christoph. I see your point. Something in this area would be
nice, though, as it does produce a fairly significant improvement in the
slab usage on our test system.

> If the target node allocation fails (for whatever reason) then I would
> recommend for simplicity's sake to change the target node to
> NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> complex solution would be to look through partial lists in increasing
> distance to find a partially used slab that is reasonably close to the
> current node. Slab has logic like that in fallback_alloc(). Slub's
> get_any_partial() function does something close to what you want.

I apologize for my own ignorance, but I'm having trouble following.
Anton's original patch did fall back to the current cpu slab, but I'm
not sure any NUMA_NO_NODE change is necessary there. At the point we're
deactivating the slab (in the current code, in __slab_alloc()), we have
successfully allocated from somewhere, it's just not on the node we
expected to be on.

So perhaps you are saying to make a change lower in the code? I'm not
sure where it makes sense to change the target node in that case. I'd
appreciate any guidance you can give.

Thanks,
Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> > On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > > Thank you for clarifying and providing a test patch. I ran with
> > > > > this on the system showing the original problem, configured to
> > > > > have 15GB of memory.
> > > > >
> > > > > With your patch after boot:
> > > > >
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > >
> > > > > With Anton's patch after boot:
> > > > >
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > >
> > > > > I know that's fairly unscientific, but the numbers are
> > > > > reproducible.
> >
> > Hello,
> >
> > I think that there is one mistake in David's patch, although I'm not
> > sure that it is the reason for this result.
> >
> > With David's patch, get_partial() in new_slab_objects() doesn't work
> > properly, because we only change the node id in the !node_match()
> > case. If we just meet the !freelist case, we pass the node id
> > directly to new_slab_objects(), so we always try to allocate a new
> > slab page regardless of the existence of partial pages. We should
> > solve it.
> >
> > Could you try this one?
>
> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.
>
> This patch on its own seems to help on our test system by saving
> around 1.5GB of slab.
>
> Tested-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Acked-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
>
> with the caveat below.

So what's the status of this patch? Christoph, do you think this is fine
as it is?

Thanks,
Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> So what's the status of this patch? Christoph, do you think this is
> fine as it is?

Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to do
and I already acked the patch.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 03.02.2014 [21:38:36 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
> > So what's the status of this patch? Christoph, do you think this is
> > fine as it is?
>
> Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to
> do and I already acked the patch.

Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
the $SUBJECT issue.

Thanks,
Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Wed, 29 Jan 2014, Nishanth Aravamudan wrote:

> > exactly what the caller intends.
>
> 	int searchnode = node;
> 	if (node == NUMA_NO_NODE)
> 		searchnode = numa_mem_id();
> 	if (!node_present_pages(node))
> 		searchnode = local_memory_node(node);
>
> The difference in semantics from the previous is that here, if we have
> a memoryless node, rather than using the CPU's nearest NUMA node, we
> use the NUMA node closest to the requested one?

The idea here is that the page allocator will do the fallback to other
nodes. This check for !node_present should not be necessary. SLUB needs
to accept the page from whatever node the page allocator returned and
work with that.

The problem is that the check for having a slab from the right node may
fail again after another attempt to allocate from the same node. SLUB
will then push the slab from the *wrong* node back to the partial lists
and may attempt another allocation that will again be successful but
return memory from another node. That way the partial lists of a
particular node grow uselessly.

One way to solve this may be to check if memory was actually allocated
from the requested node and fall back to NUMA_NO_NODE (which will use
the last allocated slab) for future allocs if the page allocator
returned memory from a different node (unless GFP_THISNODE is set, of
course). Otherwise we end up replicating the page allocator logic in
slub like in slab. That is what I wanted to avoid.
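The "remember the mismatch" policy in that last paragraph can be
sketched in user-space C. Everything below (the stub page allocator, the
flag name) is an assumption for illustration, not an implementation
Christoph posted:

#include <stdio.h>

#define NUMA_NO_NODE (-1)

struct model_cache {
	int remote_fallback_seen; /* a node request was satisfied remotely */
};

/* Stand-in for the page allocator: node 0 is memoryless in this model. */
static int alloc_page_node_model(int node)
{
	return node == 0 ? 1 : node;	/* falls back to node 1 */
}

static int pick_request_node(struct model_cache *s, int wanted)
{
	/* Once a mismatch has been seen, stop insisting on the node. */
	return s->remote_fallback_seen ? NUMA_NO_NODE : wanted;
}

int main(void)
{
	struct model_cache c = { 0 };
	int req = pick_request_node(&c, 0);
	int got = alloc_page_node_model(req);

	if (req != NUMA_NO_NODE && got != req)
		c.remote_fallback_seen = 1;

	printf("first alloc: asked %d, got %d\n", req, got);
	printf("next request node: %d\n", pick_request_node(&c, 0));
	return 0;
}

After the first mismatched allocation the model passes NUMA_NO_NODE, so
subsequent allocations keep using whatever the current cpu slab is
instead of uselessly growing a remote partial list.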
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.

Oww...
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> > On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > > Thank you for clarifying and providing a test patch. I ran with
> > > > > this on the system showing the original problem, configured to
> > > > > have 15GB of memory.
> > > > >
> > > > > With your patch after boot:
> > > > >
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > >
> > > > > With Anton's patch after boot:
> > > > >
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > >
> > > > > I know that's fairly unscientific, but the numbers are
> > > > > reproducible.
> >
> > Hello,
> >
> > I think that there is one mistake in David's patch, although I'm not
> > sure that it is the reason for this result.
> >
> > With David's patch, get_partial() in new_slab_objects() doesn't work
> > properly, because we only change the node id in the !node_match()
> > case. If we just meet the !freelist case, we pass the node id
> > directly to new_slab_objects(), so we always try to allocate a new
> > slab page regardless of the existence of partial pages. We should
> > solve it.
> >
> > Could you try this one?
>
> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.
>
> This patch on its own seems to help on our test system by saving
> around 1.5GB of slab.
>
> Tested-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Acked-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
>
> with the caveat below.
>
> Thanks,
> Nish
>
> > Thanks.
> >
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  		struct kmem_cache_cpu *c)
> >  {
> >  	void *object;
> > -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> > +	if (node != NUMA_NO_NODE && !node_present_pages(node))
> > +		searchnode = numa_mem_id();
>
> This might be clearer as:
>
> 	int searchnode = node;
> 	if (node == NUMA_NO_NODE || !node_present_pages(node))
> 		searchnode = numa_mem_id();

Cody Schafer mentioned to me on IRC that this may not always reflect
exactly what the caller intends.

	int searchnode = node;
	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	if (!node_present_pages(node))
		searchnode = local_memory_node(node);

The difference in semantics from the previous is that here, if we have a
memoryless node, rather than using the CPU's nearest NUMA node, we use
the NUMA node closest to the requested one?

> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
> >
> > @@ -2278,10 +2280,14 @@ redo:
> >
> >  	if (unlikely(!node_match(page, node))) {
> >  		stat(s, ALLOC_NODE_MISMATCH);
> > -		deactivate_slab(s, page, c->freelist);
> > -		c->page = NULL;
> > -		c->freelist = NULL;
> > -		goto new_slab;
> > +		if (unlikely(!node_present_pages(node)))
> > +			node = numa_mem_id();
>
> Similarly here?

-Nish

> > +		if (!node_match(page, node)) {
> > +			deactivate_slab(s, page, c->freelist);
> > +			c->page = NULL;
> > +			c->freelist = NULL;
> > +			goto new_slab;
> > +		}
> >  	}
> >
> >  	/*
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > Thank you for clarifying and providing a test patch. I ran with
> > > > this on the system showing the original problem, configured to
> > > > have 15GB of memory.
> > > >
> > > > With your patch after boot:
> > > >
> > > > MemTotal:       15604736 kB
> > > > MemFree:         8768192 kB
> > > > Slab:            3882560 kB
> > > > SReclaimable:     105408 kB
> > > > SUnreclaim:      3777152 kB
> > > >
> > > > With Anton's patch after boot:
> > > >
> > > > MemTotal:       15604736 kB
> > > > MemFree:        11195008 kB
> > > > Slab:            1427968 kB
> > > > SReclaimable:     109184 kB
> > > > SUnreclaim:      1318784 kB
> > > >
> > > > I know that's fairly unscientific, but the numbers are
> > > > reproducible.
>
> Hello,
>
> I think that there is one mistake in David's patch, although I'm not
> sure that it is the reason for this result.
>
> With David's patch, get_partial() in new_slab_objects() doesn't work
> properly, because we only change the node id in the !node_match() case.
> If we just meet the !freelist case, we pass the node id directly to
> new_slab_objects(), so we always try to allocate a new slab page
> regardless of the existence of partial pages. We should solve it.
>
> Could you try this one?

This helps about the same as David's patch -- but I found the reason
why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
shortly for that and one other case I found.

This patch on its own seems to help on our test system by saving around
1.5GB of slab.

Tested-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

with the caveat below.

Thanks,
Nish

> Thanks.
>
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
>
> +	if (node != NUMA_NO_NODE && !node_present_pages(node))
> +		searchnode = numa_mem_id();

This might be clearer as:

	int searchnode = node;
	if (node == NUMA_NO_NODE || !node_present_pages(node))
		searchnode = numa_mem_id();

>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
>  		return object;
>
> @@ -2278,10 +2280,14 @@ redo:
>
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		if (unlikely(!node_present_pages(node)))
> +			node = numa_mem_id();
> +		if (!node_match(page, node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}
>  	}
>
>  	/*
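Since the two spellings of the searchnode computation keep coming up,
here is a quick user-space check that they pick the same node on the
two-node layout discussed in this thread. The _model stubs are
assumptions standing in for the kernel helpers of similar names:

#include <assert.h>
#include <stdio.h>

#define NUMA_NO_NODE (-1)
#define NR_NODES 2

static int numa_mem_id_model(void) { return 1; }	/* nearest node w/ memory */
static int node_present_pages_model(int n) { return n == 1; }

/* Joonsoo's two-branch version. */
static int searchnode_joonsoo(int node)
{
	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id_model() : node;

	if (node != NUMA_NO_NODE && !node_present_pages_model(node))
		searchnode = numa_mem_id_model();
	return searchnode;
}

/* The merged condition Nishanth suggests. */
static int searchnode_merged(int node)
{
	int searchnode = node;

	if (node == NUMA_NO_NODE || !node_present_pages_model(node))
		searchnode = numa_mem_id_model();
	return searchnode;
}

int main(void)
{
	int n;

	for (n = NUMA_NO_NODE; n < NR_NODES; n++)
		assert(searchnode_joonsoo(n) == searchnode_merged(n));
	printf("both formulations agree for all nodes\n");
	return 0;
}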
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, David Rientjes wrote:

> kmalloc_node(nid) and kmem_cache_alloc_node(nid) should fall back to
> nodes other than nid when memory can't be allocated; these functions
> only indicate a preference.

The nid passed indicates a preference unless __GFP_THISNODE is
specified. Then the allocation must occur on that node.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> As to cpu_to_node() being passed to kmalloc_node(), I think an
> appropriate fix is to change that to cpu_to_mem()?

Yup.

> > Yeah, the default policy should be to fallback to local memory if the
> > node passed is memoryless.
>
> Thanks!

I would suggest to use NUMA_NO_NODE instead. That will fit any slab that
we may be currently allocating from or can get a hold of and is mostly
efficient.
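The cpu_to_node()/cpu_to_mem() distinction is easy to see in a toy
model. The stubs below hard-code the two-node layout from this thread
(node 0 memoryless with the CPUs, node 1 populated) and are assumptions,
not the kernel implementations of those helpers:

#include <stdio.h>

/* Node of the CPU itself; may be memoryless. */
static int cpu_to_node_model(int cpu)
{
	(void)cpu;
	return 0;
}

/*
 * Nearest node with memory, which is what CONFIG_HAVE_MEMORYLESS_NODES
 * makes cpu_to_mem()/numa_mem_id() return.
 */
static int cpu_to_mem_model(int cpu)
{
	(void)cpu;
	return 1;
}

int main(void)
{
	int cpu = 5;

	printf("cpu_to_node(%d) = %d (memoryless)\n", cpu,
	       cpu_to_node_model(cpu));
	printf("cpu_to_mem(%d)  = %d (has memory)\n", cpu,
	       cpu_to_mem_model(cpu));
	return 0;
}

A caller that passes cpu_to_node() into kmalloc_node() on such a box
always requests the empty node; switching to cpu_to_mem() requests the
nearest node that can actually satisfy the allocation.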
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > Thank you for clarifying and providing a test patch. I ran with
> > > this on the system showing the original problem, configured to have
> > > 15GB of memory.
> > >
> > > With your patch after boot:
> > >
> > > MemTotal:       15604736 kB
> > > MemFree:         8768192 kB
> > > Slab:            3882560 kB
> > > SReclaimable:     105408 kB
> > > SUnreclaim:      3777152 kB
> > >
> > > With Anton's patch after boot:
> > >
> > > MemTotal:       15604736 kB
> > > MemFree:        11195008 kB
> > > Slab:            1427968 kB
> > > SReclaimable:     109184 kB
> > > SUnreclaim:      1318784 kB
> > >
> > > I know that's fairly unscientific, but the numbers are
> > > reproducible.

Hello,

I think that there is one mistake in David's patch, although I'm not
sure that it is the reason for this result.

With David's patch, get_partial() in new_slab_objects() doesn't work
properly, because we only change the node id in the !node_match() case.
If we just meet the !freelist case, we pass the node id directly to
new_slab_objects(), so we always try to allocate a new slab page
regardless of the existence of partial pages. We should solve it.

Could you try this one?

Thanks.

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
+	if (node != NUMA_NO_NODE && !node_present_pages(node))
+		searchnode = numa_mem_id();
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
@@ -2278,10 +2280,14 @@ redo:
 
 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+		if (unlikely(!node_present_pages(node)))
+			node = numa_mem_id();
+		if (!node_match(page, node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, 24 Jan 2014, Wanpeng Li wrote:

> > diff --git a/mm/slub.c b/mm/slub.c
> > index 545a170..a1c6040 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  	void *object;
> >  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

This needs to be numa_mem_id() and numa_mem_id would need to be
consistently used.

> > +	if (!node_present_pages(searchnode))
> > +		searchnode = numa_mem_id();

Probably won't need that?

> > +
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
>
> The bug still can't be fixed w/ this patch.

Some more detail would be good. If memory is requested from a particular
node then it would be best to use one that has memory. Callers also may
have used numa_node_id() and that also would need to be fixed.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 24.01.2014 [13:03:13 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Christoph Lameter wrote:
> > On Fri, 24 Jan 2014, Wanpeng Li wrote:
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 545a170..a1c6040 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > >  	void *object;
> > >  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> >
> > This needs to be numa_mem_id() and numa_mem_id would need to be
> > consistently used.
> >
> > > +	if (!node_present_pages(searchnode))
> > > +		searchnode = numa_mem_id();
> >
> > Probably won't need that?
>
> I think the problem is a memoryless node being used for kmalloc_node()
> so we need to decide where to enforce node_present_pages().
> __slab_alloc() seems like the best candidate when !node_match().

Yep, I'm looking through callers and such right now and came to a
similar conclusion. I should have a patch soon.

Thanks,
Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 24.01.2014 [13:03:13 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Christoph Lameter wrote:
> > On Fri, 24 Jan 2014, Wanpeng Li wrote:
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 545a170..a1c6040 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > >  	void *object;
> > >  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> >
> > This needs to be numa_mem_id() and numa_mem_id would need to be
> > consistently used.
> >
> > > +	if (!node_present_pages(searchnode))
> > > +		searchnode = numa_mem_id();
> >
> > Probably won't need that?
>
> I think the problem is a memoryless node being used for kmalloc_node()
> so we need to decide where to enforce node_present_pages().
> __slab_alloc() seems like the best candidate when !node_match().

Actually, this is effectively what Anton's patch does, except with
Wanpeng's adjustment to use node_present_pages(). Does that seem
sufficient to you? It only covers the memoryless node case (not the
exhausted node case), but I think that shouldn't block the fix (and it
does fix the issue we've run across in our testing).

-Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 24.01.2014 [15:49:33 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > I think the problem is a memoryless node being used for
> > > kmalloc_node() so we need to decide where to enforce
> > > node_present_pages(). __slab_alloc() seems like the best candidate
> > > when !node_match().
> >
> > Actually, this is effectively what Anton's patch does, except with
> > Wanpeng's adjustment to use node_present_pages(). Does that seem
> > sufficient to you?
>
> I don't see that as being the effect of Anton's patch. We need to use
> numa_mem_id() as Christoph mentioned when a memoryless node is passed
> for the best NUMA locality. Something like this:

Thank you for clarifying and providing a test patch. I ran with this on
the system showing the original problem, configured to have 15GB of
memory.

With your patch after boot:

MemTotal:       15604736 kB
MemFree:         8768192 kB
Slab:            3882560 kB
SReclaimable:     105408 kB
SUnreclaim:      3777152 kB

With Anton's patch after boot:

MemTotal:       15604736 kB
MemFree:        11195008 kB
Slab:            1427968 kB
SReclaimable:     109184 kB
SUnreclaim:      1318784 kB

I know that's fairly unscientific, but the numbers are reproducible.

For what it's worth, a sample of the unmodified numbers:

MemTotal:       15317632 kB
MemFree:         5023424 kB
Slab:            7176064 kB
SReclaimable:     106816 kB
SUnreclaim:      7069248 kB

So it's an improvement, but something is still causing us to (it seems)
be pretty inefficient with the slabs.

> diff --git a/mm/slub.c b/mm/slub.c
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2278,10 +2278,14 @@ redo:
>
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		if (unlikely(!node_present_pages(node)))
> +			node = numa_mem_id();
> +		if (!node_match(page, node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}

Semantically, and please correct me if I'm wrong, this patch is saying
if we have a memoryless node, we expect the page's locality to be that
of numa_mem_id(), and we still deactivate the slab if that isn't true.
Just wanting to make sure I understand the intent.

What I find odd is that there are only 2 nodes on this system, node 0
(empty) and node 1. So won't numa_mem_id() always be 1? And every page
should be coming from node 1 (thus node_match() should always be true?)

Thanks,
Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > Thank you for clarifying and providing a test patch. I ran with this
> > on the system showing the original problem, configured to have 15GB
> > of memory.
> >
> > With your patch after boot:
> >
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> >
> > With Anton's patch after boot:
> >
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> >
> > I know that's fairly unscientific, but the numbers are reproducible.
>
> I don't think the goal of the discussion is to reduce the amount of
> slab allocated, but rather get the most local slab memory possible by
> use of kmalloc_node(). When a memoryless node is being passed to
> kmalloc_node(), which is probably cpu_to_node() for a cpu bound to a
> node without memory, my patch is allocating it on the most local node;
> Anton's patch is allocating it on whatever happened to be the cpu slab.

Well, the issue we're trying to resolve, based upon our analysis, is
that we're seeing incredibly inefficient slab usage with memoryless
nodes. To the point where we are OOM'ing an 8GB system without doing
anything particularly stressful.

As to cpu_to_node() being passed to kmalloc_node(), I think an
appropriate fix is to change that to cpu_to_mem()?

> diff --git a/mm/slub.c b/mm/slub.c
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2278,10 +2278,14 @@ redo:
>
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		if (unlikely(!node_present_pages(node)))
> +			node = numa_mem_id();
> +		if (!node_match(page, node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}

Semantically, and please correct me if I'm wrong, this patch is saying
if we have a memoryless node, we expect the page's locality to be that
of numa_mem_id(), and we still deactivate the slab if that isn't true.
Just wanting to make sure I understand the intent.

> Yeah, the default policy should be to fallback to local memory if the
> node passed is memoryless.

Thanks!

> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every
> > page should be coming from node 1 (thus node_match() should always be
> > true?)
>
> The nice thing about slub is its debugging ability, what is
> /sys/kernel/slab/<cache>/objects showing in comparison between the two
> patches?

Do you mean kmem_cache or kmem_cache_node?

-Nish
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Fri, Jan 24, 2014 at 11:09:07AM +0800, Wanpeng Li wrote:
> Hi Christoph,
>
> On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
> > On Mon, 20 Jan 2014, Wanpeng Li wrote:
> > > +	enum zone_type high_zoneidx = gfp_zone(flags);
> > >
> > > +	if (!node_present_pages(searchnode)) {
> > > +		zonelist = node_zonelist(searchnode, flags);
> > > +		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > > +			searchnode = zone_to_nid(zone);
> > > +			if (node_present_pages(searchnode))
> > > +				break;
> > > +		}
> > > +	}
> > >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > >  	if (object || node != NUMA_NO_NODE)
> > >  		return object;
> > >
> > > The patch fixes the bug. However, the kernel crashed very quickly
> > > after running stress tests for a short while:
> >
> > This is not a good way of fixing it. How about not asking for memory
> > from nodes that are memoryless? Use numa_mem_id() which gives you the
> > next node that has memory instead of numa_node_id() (gives you the
> > current node regardless if it has memory or not).
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 545a170..a1c6040 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  	void *object;
> >  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> >
> > +	if (!node_present_pages(searchnode))
> > +		searchnode = numa_mem_id();
> > +
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
>
> The bug still can't be fixed w/ this patch.
>
> Regards,
> Wanpeng Li
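A worked example of what Christoph's one-line numa_mem_id() correction
computes on the two-node box described earlier in the thread (node 0
memoryless, node 1 populated, CPUs on node 0). The _model stubs below
are assumptions standing in for the kernel helpers:

#include <stdio.h>

#define NUMA_NO_NODE (-1)

/* Model of the two-node box: node 0 empty, node 1 populated. */
static int numa_node_id_model(void) { return 0; }	/* current CPU's node  */
static int numa_mem_id_model(void)  { return 1; }	/* nearest node w/ mem */
static int node_present_pages_model(int n) { return n == 1; }

static int searchnode_model(int node)
{
	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id_model() : node;

	if (!node_present_pages_model(searchnode))
		searchnode = numa_mem_id_model();
	return searchnode;
}

int main(void)
{
	printf("%d %d %d\n", searchnode_model(NUMA_NO_NODE),
	       searchnode_model(0), searchnode_model(1));
	/* prints: 1 1 1 -- every search lands on the populated node */
	return 0;
}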
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Hi Joonsoo,

On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
[...]
> ------8<------
> diff --git a/mm/slub.c b/mm/slub.c
> index c3eb3d3..a1f6dfa 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  {
>  	void *object;
>  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	struct zonelist *zonelist;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	enum zone_type high_zoneidx = gfp_zone(flags);
>
> +	if (!node_present_pages(searchnode)) {
> +		zonelist = node_zonelist(searchnode, flags);
> +		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> +			searchnode = zone_to_nid(zone);
> +			if (node_present_pages(searchnode))
> +				break;
> +		}
> +	}
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
>  		return object;

The patch fixes the bug. However, the kernel crashed very quickly after
running stress tests for a short while:

[ 287.464285] Unable to handle kernel paging request for data at address 0x0001
[ 287.464289] Faulting instruction address: 0xc0445af8
[ 287.464294] Oops: Kernel access of bad area, sig: 11 [#1]
[ 287.464296] SMP NR_CPUS=2048 NUMA pSeries
[ 287.464301] Modules linked in: btrfs raid6_pq xor dm_service_time sg nfsv3 arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs fscache dns_resolver nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables ext4 mbcache jbd2 ibmvfc scsi_transport_fc ibmveth nx_crypto pseries_rng nfsd auth_rpcgss nfs_acl lockd binfmt_misc sunrpc uinput dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 287.464374] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-71.el7.91831.ppc64 #1
[ 287.464378] task: c0fde590 ti: c001fffd task.ti: c10a4000
[ 287.464382] NIP: c0445af8 LR: c0445bcc CTR: c0445b90
[ 287.464385] REGS: c001fffd38e0 TRAP: 0300 Not tainted (3.10.0-71.el7.91831.ppc64)
[ 287.464388] MSR: 80009032 <SF,EE,ME,IR,DR,RI> CR: 88002084 XER: 0001
[ 287.464397] SOFTE: 0
[ 287.464398] CFAR: c000908c
[ 287.464401] DAR: 0001, DSISR: 4000
[ 287.464403] GPR00: d3649a04 c001fffd3b60 c10a94d0 0003
GPR04: c0018d841048 c001fffd3bd0 0012 d364eff0
GPR08: c001fffd3bd0 0001 d364d688 c0445b90
GPR12: d364b960 c7e0 042ac510 0060
GPR16: 0020 fb19 c1122100
GPR20: c0a94680 c1122180 c0a94680 000a
GPR24: 0100 0001 c001ef90
GPR28: c001d6c066f0 c001aea03520 c001bc9a2640 c0018d841680
[ 287.464447] NIP [c0445af8] .__dev_printk+0x28/0xc0
[ 287.464450] LR [c0445bcc] .dev_printk+0x3c/0x50
[ 287.464453] PACATMSCRATCH [80009032]
[ 287.464455] Call Trace:
[ 287.464458] [c001fffd3b60] [c001fffd3c00] 0xc001fffd3c00 (unreliable)
[ 287.464467] [c001fffd3bf0] [d3649a04] .ibmvfc_scsi_done+0x334/0x3e0 [ibmvfc]
[ 287.464474] [c001fffd3cb0] [d36495b8] .ibmvfc_handle_crq+0x2e8/0x320 [ibmvfc]
[ 287.464488] [c001fffd3d30] [d3649fe4] .ibmvfc_tasklet+0xd4/0x250 [ibmvfc]
[ 287.464494] [c001fffd3de0] [c009b46c] .tasklet_action+0xcc/0x1b0
[ 287.464498] [c001fffd3e90] [c009a668] .__do_softirq+0x148/0x360
[ 287.464503] [c001fffd3f90] [c00218a8] .call_do_softirq+0x14/0x24
[ 287.464507] [c001fffcfdf0] [c00107e0] .do_softirq+0xd0/0x100
[ 287.464511] [c001fffcfe80] [c009aba8] .irq_exit+0x1b8/0x1d0
[ 287.464514] [c001fffcff10] [c0010410] .__do_irq+0xc0/0x1e0
[ 287.464518] [c001fffcff90] [c00218cc] .call_do_irq+0x14/0x24
[ 287.464522] [c10a76d0] [c00105bc] .do_IRQ+0x8c/0x100
[ 287.464527] --- Exception: 501 at 0x
[ 287.464527]     LR = .arch_local_irq_restore+0x74/0x90
[ 287.464533] [c10a7770] [c0002494] hardware_interrupt_common+0x114/0x180 (unreliable)
[ 287.464540] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[ 287.464540]     LR = .check_and_cede_processor+0x24/0x40
[ 287.464546]
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, 20 Jan 2014, Wanpeng Li wrote:

> > +	enum zone_type high_zoneidx = gfp_zone(flags);
> >
> > +	if (!node_present_pages(searchnode)) {
> > +		zonelist = node_zonelist(searchnode, flags);
> > +		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > +			searchnode = zone_to_nid(zone);
> > +			if (node_present_pages(searchnode))
> > +				break;
> > +		}
> > +	}
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
>
> The patch fixes the bug. However, the kernel crashed very quickly after
> running stress tests for a short while:

This is not a good way of fixing it. How about not asking for memory
from nodes that are memoryless? Use numa_mem_id() which gives you the
next node that has memory instead of numa_node_id() (gives you the
current node regardless if it has memory or not).

> [ 287.464285] Unable to handle kernel paging request for data at address 0x0001
> [ 287.464289] Faulting instruction address: 0xc0445af8
> [ 287.464294] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 287.464296] SMP NR_CPUS=2048 NUMA pSeries
> [ 287.464301] Modules linked in: btrfs raid6_pq xor dm_service_time sg nfsv3 arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs fscache dns_resolver nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables ext4 mbcache jbd2 ibmvfc scsi_transport_fc ibmveth nx_crypto pseries_rng nfsd auth_rpcgss nfs_acl lockd binfmt_misc sunrpc uinput dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
> [ 287.464374] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-71.el7.91831.ppc64 #1
> [ 287.464378] task: c0fde590 ti: c001fffd task.ti: c10a4000
> [ 287.464382] NIP: c0445af8 LR: c0445bcc CTR: c0445b90
> [ 287.464385] REGS: c001fffd38e0 TRAP: 0300 Not tainted (3.10.0-71.el7.91831.ppc64)
> [ 287.464388] MSR: 80009032 <SF,EE,ME,IR,DR,RI> CR: 88002084 XER: 0001
> [ 287.464397] SOFTE: 0
> [ 287.464398] CFAR: c000908c
> [ 287.464401] DAR: 0001, DSISR: 4000
> [ 287.464403] GPR00: d3649a04 c001fffd3b60 c10a94d0 0003
> GPR04: c0018d841048 c001fffd3bd0 0012 d364eff0
> GPR08: c001fffd3bd0 0001 d364d688 c0445b90
> GPR12: d364b960 c7e0 042ac510 0060
> GPR16: 0020 fb19 c1122100
> GPR20: c0a94680 c1122180 c0a94680 000a
> GPR24: 0100 0001 c001ef90
> GPR28: c001d6c066f0 c001aea03520 c001bc9a2640 c0018d841680
> [ 287.464447] NIP [c0445af8] .__dev_printk+0x28/0xc0
> [ 287.464450] LR [c0445bcc] .dev_printk+0x3c/0x50
> [ 287.464453] PACATMSCRATCH [80009032]
> [ 287.464455] Call Trace:
> [ 287.464458] [c001fffd3b60] [c001fffd3c00] 0xc001fffd3c00 (unreliable)
> [ 287.464467] [c001fffd3bf0] [d3649a04] .ibmvfc_scsi_done+0x334/0x3e0 [ibmvfc]
> [ 287.464474] [c001fffd3cb0] [d36495b8] .ibmvfc_handle_crq+0x2e8/0x320 [ibmvfc]
> [ 287.464488] [c001fffd3d30] [d3649fe4] .ibmvfc_tasklet+0xd4/0x250 [ibmvfc]
> [ 287.464494] [c001fffd3de0] [c009b46c] .tasklet_action+0xcc/0x1b0
> [ 287.464498] [c001fffd3e90] [c009a668] .__do_softirq+0x148/0x360
> [ 287.464503] [c001fffd3f90] [c00218a8] .call_do_softirq+0x14/0x24
> [ 287.464507] [c001fffcfdf0] [c00107e0] .do_softirq+0xd0/0x100
> [ 287.464511] [c001fffcfe80] [c009aba8] .irq_exit+0x1b8/0x1d0
> [ 287.464514] [c001fffcff10] [c0010410] .__do_irq+0xc0/0x1e0
> [ 287.464518] [c001fffcff90] [c00218cc] .call_do_irq+0x14/0x24
> [ 287.464522] [c10a76d0] [c00105bc] .do_IRQ+0x8c/0x100
> [ 287.464527] --- Exception: 501 at 0x
> [ 287.464527]     LR = .arch_local_irq_restore+0x74/0x90
> [ 287.464533] [c10a7770] [c0002494] hardware_interrupt_common+0x114/0x180 (unreliable)
> [ 287.464540] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
> [ 287.464540]     LR = .check_and_cede_processor+0x24/0x40
> [ 287.464546] [c10a7a60] [0001] 0x1 (unreliable)
> [ 287.464550] [c10a7ad0] [c0074ecc] .shared_cede_loop+0x2c/0x70
> [ 287.464555] [c10a7b50] [c05538f4]
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
> On Mon, 20 Jan 2014, Wanpeng Li wrote:
> > +	enum zone_type high_zoneidx = gfp_zone(flags);
> >
> > +	if (!node_present_pages(searchnode)) {
> > +		zonelist = node_zonelist(searchnode, flags);
> > +		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > +			searchnode = zone_to_nid(zone);
> > +			if (node_present_pages(searchnode))
> > +				break;
> > +		}
> > +	}
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
> >
> > The patch fixes the bug. However, the kernel crashed very quickly
> > after running stress tests for a short while:
>
> This is not a good way of fixing it. How about not asking for memory
> from nodes that are memoryless? Use numa_mem_id() which gives you the
> next node that has memory instead of numa_node_id() (gives you the
> current node regardless if it has memory or not).

Thanks for pointing that out, I will do it and retest later.

Regards,
Wanpeng Li
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Hi Andi,

> > Thoughts? It seems like we could hit a similar situation if a machine
> > is balanced but we run out of memory on a single node.
>
> Yes I agree, but your patch doesn't seem to attempt to handle this?

It doesn't. I was hoping someone with more mm knowledge than I could
suggest a lightweight way of doing this.

Anton
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Hi David,

> Why not just delete the entire test? Presumably some time a little
> earlier no local memory was available. Even if there is some available
> now, it is very likely that some won't be available again in the near
> future.

I agree, the current behaviour seems strange, but it has been around
since the initial slub commit.

Anton
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Hi Wanpeng,

> > +	if (node_spanned_pages(node)) {
>
> s/node_spanned_pages/node_present_pages

Thanks, I hadn't come across node_present_pages() before.

Anton
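The distinction matters on machines with holes in their physical address
map. Roughly, paraphrased from the struct pglist_data definition in
include/linux/mmzone.h of kernels of this era (so treat the exact
comments as approximate):

typedef struct pglist_data {
	/* ... */
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page
					     range, including holes */
	/* ... */
} pg_data_t;

A node can span a nonzero range that is entirely holes, so
node_spanned_pages() can be true while node_present_pages() is zero,
which is why the substitution matters for detecting memoryless nodes.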
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 05:52:31PM +0800, Wanpeng Li wrote:
> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
> > On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
> > > Index: b/mm/slub.c
> > > ===================================================================
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2278,10 +2278,17 @@ redo:
> > >
> > >  	if (unlikely(!node_match(page, node))) {
> > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > -		deactivate_slab(s, page, c->freelist);
> > > -		c->page = NULL;
> > > -		c->freelist = NULL;
> > > -		goto new_slab;
> > > +
> > > +		/*
> > > +		 * If the node contains no memory there is no point in trying
> > > +		 * to allocate a new node local slab
> > > +		 */
> > > +		if (node_spanned_pages(node)) {
> > > +			deactivate_slab(s, page, c->freelist);
> > > +			c->page = NULL;
> > > +			c->freelist = NULL;
> > > +			goto new_slab;
> > > +		}
> > >  	}
> > >
> > >  	/*
> >
> > Hello,
> >
> > I think that we need more efforts to solve the unbalanced node
> > problem.
> >
> > With this patch, even if the node of the current cpu slab is not
> > favorable to the unbalanced node, allocation would proceed and we
> > would get unintended memory.
> >
> > And there is one more problem. Even if we have some partial slabs on
> > a compatible node, we would allocate a new slab, because
> > get_partial() cannot handle this unbalanced node case. To fix this
> > correctly, how about the following patch?
> >
> > Thanks.
> >
> > ------8<------
> > diff --git a/mm/slub.c b/mm/slub.c
> > index c3eb3d3..a1f6dfa 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  {
> >  	void *object;
> >  	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +	struct zonelist *zonelist;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	enum zone_type high_zoneidx = gfp_zone(flags);
> >
> > +	if (!node_present_pages(searchnode)) {
> > +		zonelist = node_zonelist(searchnode, flags);
> > +		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > +			searchnode = zone_to_nid(zone);
> > +			if (node_present_pages(searchnode))
> > +				break;
> > +		}
> > +	}
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
> >  		return object;
>
> Why change searchnode instead of depending on fallback zones/nodes in
> get_any_partial() to allocate partial slabs?

If node != NUMA_NO_NODE, get_any_partial() isn't called. That's why I
change searchnode here instead of get_any_partial().

Thanks.
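For readers following the control flow, this is the shape of
get_partial() that Joonsoo is describing, simplified from mm/slub.c of
this era (details elided), and it shows why an explicit node request
never reaches the cross-node search:

static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
		struct kmem_cache_cpu *c)
{
	void *object;
	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	/*
	 * A request for a specific node returns here whether or not an
	 * object was found, so get_any_partial()'s distance-ordered
	 * fallback only runs for node == NUMA_NO_NODE.
	 */
	if (object || node != NUMA_NO_NODE)
		return object;

	return get_any_partial(s, flags, c);
}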
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Hi Joonsoo,

On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
[...]

Hello,

I think that we need more effort to solve the unbalanced node problem. With this patch, even if the node of the current cpu slab is not favorable to the unbalanced node, allocation will proceed and we will get unintended memory.

We have a machine:

[0.00] Node 0 Memory:
[0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 0x8000-0xc000
[0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
[0.00] Node 10 Memory: 0xc000-0x18000
[0.041486] Node 0 CPUs: 0-19
[0.041490] Node 4 CPUs:
[0.041492] Node 6 CPUs:
[0.041495] Node 10 CPUs:

The pages of the current cpu slab should be allocated from the fallback zones/nodes of the memoryless node in the buddy system, so how can the "not favorable" case happen?

And there is one more problem. Even if we have some partial slabs on a compatible node, we will allocate a new slab, because get_partial() cannot handle this unbalanced node case. To fix this correctly, how about the following patch?

So I think we should fold your two patches into one.

Regards,
Wanpeng Li

Thanks.

---8<---
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);

+	if (!node_present_pages(searchnode)) {
+		zonelist = node_zonelist(searchnode, flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			searchnode = zone_to_nid(zone);
+			if (node_present_pages(searchnode))
+				break;
+		}
+	}

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:

Hi Joonsoo,

On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
[...]

Hello,

I think that we need more effort to solve the unbalanced node problem. With this patch, even if the node of the current cpu slab is not favorable to the unbalanced node, allocation will proceed and we will get unintended memory.

We have a machine:

[0.00] Node 0 Memory:
[0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 0x8000-0xc000
[0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
[0.00] Node 10 Memory: 0xc000-0x18000
[0.041486] Node 0 CPUs: 0-19
[0.041490] Node 4 CPUs:
[0.041492] Node 6 CPUs:
[0.041495] Node 10 CPUs:

The pages of the current cpu slab should be allocated from the fallback zones/nodes of the memoryless node in the buddy system, so how can the "not favorable" case happen?

Hi, Wanpeng.

IIRC, if we call kmem_cache_alloc_node() with a certain node #, we try to allocate the page from the fallback zones/nodes of that node #. So that fallback list isn't related to the fallback list of the memoryless node #. Am I wrong?

Anton added the node_spanned_pages(node) check, so the current cpu slab mentioned above is tested against the memoryless node. Am I missing something?

Regards,
Wanpeng Li

Thanks.

And there is one more problem. Even if we have some partial slabs on a compatible node, we will allocate a new slab, because get_partial() cannot handle this unbalanced node case. To fix this correctly, how about the following patch?

So I think we should fold your two patches into one.

Regards,
Wanpeng Li

Thanks.

---8<---
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);

+	if (!node_present_pages(searchnode)) {
+		zonelist = node_zonelist(searchnode, flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			searchnode = zone_to_nid(zone);
+			if (node_present_pages(searchnode))
+				break;
+		}
+	}

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab, sees it isn't node local, deactivates it and tries to allocate a new slab. On empty nodes we will allocate a new remote slab and use the first slot, but as explained above when we get called a second time we will just deactivate that slab and retry.

As such we end up only using 1 entry in each slab:

slab               mem    objects used/active
kmalloc-16384   1404 MB     4.90%
task_struct      668 MB     2.90%
kmalloc-128      193 MB     3.61%
kmalloc-192      152 MB     5.23%
kmalloc-8192      72 MB    23.40%
kmalloc-16        64 MB     7.43%
kmalloc-512       33 MB    22.41%

The patch below checks that a node is not empty before deactivating a slab and trying to allocate it again. With this patch applied we now use about 352MB:

Slab:             360192 kB

And our efficiency is much better:

slab               mem    objects used/active
kmalloc-16384     92 MB    74.27%
task_struct       23 MB    83.46%
idr_layer_cache   18 MB   100.00%
pgtable-2^12      17 MB   100.00%
kmalloc-65536     15 MB   100.00%
inode_cache       14 MB   100.00%
kmalloc-256       14 MB    97.81%
kmalloc-8192      14 MB    85.71%

Signed-off-by: Anton Blanchard an...@samba.org
---

Thoughts? It seems like we could hit a similar situation if a machine is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*

Hello,

I think that we need more effort to solve the unbalanced node problem. With this patch, even if the node of the current cpu slab is not favorable to the unbalanced node, allocation will proceed and we will get unintended memory.

And there is one more problem. Even if we have some partial slabs on a compatible node, we will allocate a new slab, because get_partial() cannot handle this unbalanced node case. To fix this correctly, how about the following patch?

Thanks.

---8<---
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);

+	if (!node_present_pages(searchnode)) {
+		zonelist = node_zonelist(searchnode, flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			searchnode = zone_to_nid(zone);
+			if (node_present_pages(searchnode))
+				break;
+		}
+	}

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
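[Editor's note: to make the failure mode Anton describes concrete, here is an illustrative call sequence (the cache name, size, and node number are hypothetical, not from the thread) showing how an unpatched kernel burns roughly one slab page per object when the requested node is memoryless:]

	/* CPU is bound to node 0, which has no local memory. */
	struct kmem_cache *cache = kmem_cache_create("example", 128, 0, 0, NULL);
	void *obj;
	int i;

	for (i = 0; i < 1000; i++) {
		obj = kmem_cache_alloc_node(cache, GFP_KERNEL, 0);
		/*
		 * Unpatched: no page can ever belong to node 0, so
		 * node_match() fails, the cpu slab is deactivated after a
		 * single object has been taken from it, and a fresh remote
		 * slab is allocated -- which is where the ~3-5% active
		 * figures above come from.
		 */
	}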
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:

Hi Joonsoo,

On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
[...]

Hello,

I think that we need more effort to solve the unbalanced node problem. With this patch, even if the node of the current cpu slab is not favorable to the unbalanced node, allocation will proceed and we will get unintended memory.

We have a machine:

[0.00] Node 0 Memory:
[0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 0x8000-0xc000
[0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
[0.00] Node 10 Memory: 0xc000-0x18000
[0.041486] Node 0 CPUs: 0-19
[0.041490] Node 4 CPUs:
[0.041492] Node 6 CPUs:
[0.041495] Node 10 CPUs:

The pages of the current cpu slab should be allocated from the fallback zones/nodes of the memoryless node in the buddy system, so how can the "not favorable" case happen?

Hi, Wanpeng.

IIRC, if we call kmem_cache_alloc_node() with a certain node #, we try to allocate the page from the fallback zones/nodes of that node #. So that fallback list isn't related to the fallback list of the memoryless node #. Am I wrong?

Thanks.

And there is one more problem. Even if we have some partial slabs on a compatible node, we will allocate a new slab, because get_partial() cannot handle this unbalanced node case. To fix this correctly, how about the following patch?

So I think we should fold your two patches into one.

Regards,
Wanpeng Li

Thanks.

---8<---
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);

+	if (!node_present_pages(searchnode)) {
+		zonelist = node_zonelist(searchnode, flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			searchnode = zone_to_nid(zone);
+			if (node_present_pages(searchnode))
+				break;
+		}
+	}

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 05:21:45PM +0800, Wanpeng Li wrote:
On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:

Hi Joonsoo,

On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
[...]

Hello,

I think that we need more effort to solve the unbalanced node problem. With this patch, even if the node of the current cpu slab is not favorable to the unbalanced node, allocation will proceed and we will get unintended memory.

We have a machine:

[0.00] Node 0 Memory:
[0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 0x8000-0xc000
[0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
[0.00] Node 10 Memory: 0xc000-0x18000
[0.041486] Node 0 CPUs: 0-19
[0.041490] Node 4 CPUs:
[0.041492] Node 6 CPUs:
[0.041495] Node 10 CPUs:

The pages of the current cpu slab should be allocated from the fallback zones/nodes of the memoryless node in the buddy system, so how can the "not favorable" case happen?

Hi, Wanpeng.

IIRC, if we call kmem_cache_alloc_node() with a certain node #, we try to allocate the page from the fallback zones/nodes of that node #. So that fallback list isn't related to the fallback list of the memoryless node #. Am I wrong?

Anton added the node_spanned_pages(node) check, so the current cpu slab mentioned above is tested against the memoryless node. Am I missing something?

I thought of the following scenario:

memoryless node #: 1
node 1's fallback node #: 0

On node 1's cpu:

1. kmem_cache_alloc_node (node 2)
2. allocate the page on node 2 for the slab; the cpu slab is now that one.
3. kmem_cache_alloc_node (local node, that is, node 1)
4. It checks node_spanned_pages() and finds node 1 is a memoryless node, so it returns node 2's memory.

Is this an impossible scenario?

Thanks.
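[Editor's note: Joonsoo's scenario, restated as a hedged code walkthrough; the node numbers are his, and "s" stands for an arbitrary kmem_cache:]

	/*
	 * Running on a CPU of memoryless node 1, whose buddy fallback
	 * node is node 0.
	 */
	void *a = kmem_cache_alloc_node(s, GFP_KERNEL, 2);
	/* The cpu slab is now backed by a node 2 page. */
	void *b = kmem_cache_alloc_node(s, GFP_KERNEL, 1);
	/*
	 * With the node_spanned_pages()/node_present_pages() check, the
	 * fast path sees node 1 is memoryless, keeps the node 2 cpu slab,
	 * and returns node 2 memory -- even though node 1's proper
	 * fallback would have been node 0.
	 */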
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 06:31:56PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 05:21:45PM +0800, Wanpeng Li wrote:
On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:

Hi Joonsoo,

[...]

I thought of the following scenario:

memoryless node #: 1
node 1's fallback node #: 0

On node 1's cpu:

1. kmem_cache_alloc_node (node 2)
2. allocate the page on node 2 for the slab; the cpu slab is now that one.
3. kmem_cache_alloc_node (local node, that is, node 1)
4. It checks node_spanned_pages() and finds node 1 is a memoryless node, so it returns node 2's memory.

Is this an impossible scenario?

Indeed, it can happen.

Regards,
Wanpeng Li

Thanks.
RE: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Anton Blanchard

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab, sees it isn't node local, deactivates it and tries to allocate a new slab.

...

	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}

Why not just delete the entire test? Presumably some time a little earlier no local memory was available. Even if there is some available now, it is very likely that some won't be available again in the near future.

	David.
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab, sees it isn't node local, deactivates it and tries to allocate a new slab. On empty nodes we will allocate a new remote slab and use the first slot, but as explained above when we get called a second time we will just deactivate that slab and retry.

As such we end up only using 1 entry in each slab:

slab               mem    objects used/active
kmalloc-16384   1404 MB     4.90%
task_struct      668 MB     2.90%
kmalloc-128      193 MB     3.61%
kmalloc-192      152 MB     5.23%
kmalloc-8192      72 MB    23.40%
kmalloc-16        64 MB     7.43%
kmalloc-512       33 MB    22.41%

The patch below checks that a node is not empty before deactivating a slab and trying to allocate it again. With this patch applied we now use about 352MB:

Slab:             360192 kB

And our efficiency is much better:

slab               mem    objects used/active
kmalloc-16384     92 MB    74.27%
task_struct       23 MB    83.46%
idr_layer_cache   18 MB   100.00%
pgtable-2^12      17 MB   100.00%
kmalloc-65536     15 MB   100.00%
inode_cache       14 MB   100.00%
kmalloc-256       14 MB    97.81%
kmalloc-8192      14 MB    85.71%

Signed-off-by: Anton Blanchard an...@samba.org
---

Thoughts? It seems like we could hit a similar situation if a machine is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*

Hello,

I think that we need more effort to solve the unbalanced node problem. With this patch, even if the node of the current cpu slab is not favorable to the unbalanced node, allocation will proceed and we will get unintended memory.

And there is one more problem. Even if we have some partial slabs on a compatible node, we will allocate a new slab, because get_partial() cannot handle this unbalanced node case. To fix this correctly, how about the following patch?

Thanks.

---8<---
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);

+	if (!node_present_pages(searchnode)) {
+		zonelist = node_zonelist(searchnode, flags);
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+			searchnode = zone_to_nid(zone);
+			if (node_present_pages(searchnode))
+				break;
+		}
+	}

Why change searchnode instead of depending on the fallback zones/nodes in get_any_partial() to allocate partial slabs?

Regards,
Wanpeng Li

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab, sees it isn't node local, deactivates it and tries to allocate a new slab. On empty nodes we will allocate a new remote slab and use the first slot, but as explained above when we get called a second time we will just deactivate that slab and retry.

Deactivating the cpu slab doesn't always mean freeing it to the buddy system; the slab may instead be put back on the remote node's partial list if some of its objects are still in use in this unbalanced situation. In that case the slub slow path can freeze that partial slab on the remote node again. So why does the slab memory end up as fragmented as shown below?

Regards,
Wanpeng Li

As such we end up only using 1 entry in each slab:

slab               mem    objects used/active
kmalloc-16384   1404 MB     4.90%
task_struct      668 MB     2.90%
kmalloc-128      193 MB     3.61%
kmalloc-192      152 MB     5.23%
kmalloc-8192      72 MB    23.40%
kmalloc-16        64 MB     7.43%
kmalloc-512       33 MB    22.41%

The patch below checks that a node is not empty before deactivating a slab and trying to allocate it again. With this patch applied we now use about 352MB:

Slab:             360192 kB

And our efficiency is much better:

slab               mem    objects used/active
kmalloc-16384     92 MB    74.27%
task_struct       23 MB    83.46%
idr_layer_cache   18 MB   100.00%
pgtable-2^12      17 MB   100.00%
kmalloc-65536     15 MB   100.00%
inode_cache       14 MB   100.00%
kmalloc-256       14 MB    97.81%
kmalloc-8192      14 MB    85.71%

Signed-off-by: Anton Blanchard an...@samba.org
---

Thoughts? It seems like we could hit a similar situation if a machine is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*
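[Editor's note: the putback Wanpeng describes can be sketched as follows. The predicate name is hypothetical; add_partial() and discard_slab() are real slub helpers, but the actual logic lives in deactivate_slab() in mm/slub.c and is more involved than this:]

	/* Fate of a deactivated cpu slab, roughly: */
	if (slab_has_inuse_objects(page)) {	/* hypothetical predicate */
		/* put back on the partial list of the node the page came from */
		add_partial(get_node(s, page_to_nid(page)), page, DEACTIVATE_TO_TAIL);
	} else {
		/* nothing in use: hand the page back to the buddy allocator */
		discard_slab(s, page);
	}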
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab, sees it isn't node local, deactivates it and tries to allocate a new slab. On empty nodes we will allocate a new remote slab and use the first slot, but as explained above when we get called a second time we will just deactivate that slab and retry.

As such we end up only using 1 entry in each slab:

slab               mem    objects used/active
kmalloc-16384   1404 MB     4.90%
task_struct      668 MB     2.90%
kmalloc-128      193 MB     3.61%
kmalloc-192      152 MB     5.23%
kmalloc-8192      72 MB    23.40%
kmalloc-16        64 MB     7.43%
kmalloc-512       33 MB    22.41%

The patch below checks that a node is not empty before deactivating a slab and trying to allocate it again. With this patch applied we now use about 352MB:

Slab:             360192 kB

And our efficiency is much better:

slab               mem    objects used/active
kmalloc-16384     92 MB    74.27%
task_struct       23 MB    83.46%
idr_layer_cache   18 MB   100.00%
pgtable-2^12      17 MB   100.00%
kmalloc-65536     15 MB   100.00%
inode_cache       14 MB   100.00%
kmalloc-256       14 MB    97.81%
kmalloc-8192      14 MB    85.71%

Signed-off-by: Anton Blanchard an...@samba.org

Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com

---

Thoughts? It seems like we could hit a similar situation if a machine is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {

s/node_spanned_pages/node_present_pages

+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*
Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
Anton Blanchard an...@samba.org writes:

Thoughts? It seems like we could hit a similar situation if a machine is balanced but we run out of memory on a single node.

Yes I agree, but your patch doesn't seem to attempt to handle this?

-Andi

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*

--
a...@linux.intel.com -- Speaking for myself only