Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-02 Thread Peter Zijlstra
On Mon, 2007-10-01 at 14:30 -0700, Andrew Morton wrote:
 On Mon, 1 Oct 2007 13:55:29 -0700 (PDT)
 Christoph Lameter [EMAIL PROTECTED] wrote:
 
  On Sat, 29 Sep 2007, Andrew Morton wrote:
  
atomic allocations. And with SLUB using higher order pages, atomic !0
order allocations will be very very common.
   
   Oh OK.
   
   I thought we'd already fixed slub so that it didn't do that.  Maybe that
   fix is in -mm but I don't think so.
   
   Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
   just won't fly - this is a significant degradation in kernel reliability,
   as you've very easily demonstrated.
  
  Ummm... SLAB also does order 1 allocations. We have always done them.
  
  See mm/slab.c
  
  /*
   * Do not go above this order unless 0 objects fit into the slab.
   */
  #define BREAK_GFP_ORDER_HI  1
  #define BREAK_GFP_ORDER_LO  0
  static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;
 
 Do slab and slub use the same underlying page size for each slab?
 
 Single data point: the CONFIG_SLAB boxes which I have access to here are
 using order-0 for radix_tree_node, so they won't be failing in the way in
 which Peter's machine is.
 
 I've never ever before seen reports of page allocation failures in the
 radix-tree node allocation code, and that's the bottom line.  This is just
 a drop-dead must-fix show-stopping bug.  We cannot rely upon atomic order-1
 allocations succeeding so we cannot use them for radix-tree nodes.  Nor for
 lots of other things which we have no chance of identifying.
 
 Peter, is this bug -mm only, or is 2.6.23 similarly failing?

I'm mainly using -mm (so you have at least one tester :-). I think the
-mm-specific SLUB patch that ups slub_min_order makes the problem -mm
specific; I would have to test .23.




Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-02 Thread Nick Piggin
On Tuesday 02 October 2007 07:01, Christoph Lameter wrote:
 On Sat, 29 Sep 2007, Peter Zijlstra wrote:
  On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
   Really? That means we can no longer even allocate stacks for forking.
 
  I think I'm running with 4k stacks...

 4k stacks will never fly on an SGI x86_64 NUMA configuration given the
 additional data that may be kept on the stack. We are currently
 considering going from 8k to 16k (or even 32k) to make things work. So
 having the ability to put the stacks in vmalloc space may be something to
 look at.

i386 and x86-64 have used 8K stacks for years, and they have never
really been much of a problem before.

They only started failing when contiguous memory started getting used up
by other things, _even with_ those anti-frag patches in there.

Bottom line is that you do not use higher order allocations when you do
not need them.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Christoph Lameter
On Fri, 28 Sep 2007, Nick Piggin wrote:

 I thought it was slower. Have you fixed the performance regression?
 (OK, I read further down that you are still working on it but not confirmed
 yet...)

The problem is the weird way Intel does testing and communication.
Every 3-6 months or so they will tell you the system is X% up or down on
arch Y (and they won't give you details because it's somehow secret). And
then there are conflicting statements by the two or so performance test
departments. One of them repeatedly assured me that they do not see any
regressions.

 OK, so long as it isn't going to depend on using higher order pages, that's
 fine. (if they help even further as an optional thing, that's fine too. You
 can turn them on your huge systems and not even bother about adding
 this vmap fallback -- you won't have me to nag you about these
 purely theoretical issues).

Well the vmap fallback is generally useful AFAICT. Higher order 
allocations are common on some of our platforms. Order 1 failures even 
affect essential things like stacks that have nothing to do with SLUB and 
the LBS patchset.




Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Christoph Lameter
On Sat, 29 Sep 2007, Andrew Morton wrote:

  atomic allocations. And with SLUB using higher order pages, atomic !0
  order allocations will be very very common.
 
 Oh OK.
 
 I thought we'd already fixed slub so that it didn't do that.  Maybe that
 fix is in -mm but I don't think so.
 
 Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
 just won't fly - this is a significant degradation in kernel reliability,
 as you've very easily demonstrated.

Ummm... SLAB also does order 1 allocations. We have always done them.

See mm/slab.c

/*
 * Do not go above this order unless 0 objects fit into the slab.
 */
#define BREAK_GFP_ORDER_HI  1
#define BREAK_GFP_ORDER_LO  0
static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Christoph Lameter
On Fri, 28 Sep 2007, Mel Gorman wrote:

 Minimally, SLUB by default should continue to use order-0 pages. Peter has
 managed to bust order-1 pages with mem=128MB. Admittedly, it was a really
 hostile workload but the point remains. It was artificially worked around
 with min_free_kbytes (value set based on pageblock_order, could also have
 been artificially worked around by dropping pageblock_order) and he eventually
 caused order-0 failures so the workload is pretty damn hostile to everything.

SLAB's default is order 1, and so is SLUB's default upstream.

SLAB does runtime detection of the amount of memory and configures the max 
order correspondingly:

from mm/slab.c:

/*
 * Fragmentation resistance on low memory - only use bigger
 * page orders on machines with more than 32MB of memory.
 */
if (num_physpages > (32 << 20) >> PAGE_SHIFT)
slab_break_gfp_order = BREAK_GFP_ORDER_HI;


We could duplicate something like that for SLUB.
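
A minimal sketch of what duplicating it might look like (hypothetical, not
an actual patch; slub_max_order and num_physpages are existing symbols, the
rest is assumed):

/*
 * Hypothetical sketch: mirror SLAB's slab_break_gfp_order heuristic
 * by dropping SLUB to order-0 slabs on machines with 32MB or less.
 * This would need to run before the boot caches are created, e.g.
 * from kmem_cache_init().
 */
static void __init slub_break_order(void)
{
	if (num_physpages <= (32 << 20) >> PAGE_SHIFT)
		slub_max_order = 0;
}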




Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Christoph Lameter
On Sat, 29 Sep 2007, Peter Zijlstra wrote:

 
 On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
 
  Really? That means we can no longer even allocate stacks for forking.
 
 I think I'm running with 4k stacks...

4k stacks will never fly on an SGI x86_64 NUMA configuration given the 
additional data that may be kept on the stack. We are currently 
considering going from 8k to 16k (or even 32k) to make things work. So 
having the ability to put the stacks in vmalloc space may be something to 
look at.



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Andrew Morton
On Mon, 1 Oct 2007 13:55:29 -0700 (PDT)
Christoph Lameter [EMAIL PROTECTED] wrote:

 On Sat, 29 Sep 2007, Andrew Morton wrote:
 
   atomic allocations. And with SLUB using higher order pages, atomic !0
   order allocations will be very very common.
  
  Oh OK.
  
  I thought we'd already fixed slub so that it didn't do that.  Maybe that
  fix is in -mm but I don't think so.
  
  Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
  just won't fly - this is a significant degradation in kernel reliability,
  as you've very easily demonstrated.
 
 Ummm... SLAB also does order 1 allocations. We have always done them.
 
 See mm/slab.c
 
 /*
  * Do not go above this order unless 0 objects fit into the slab.
  */
 #define BREAK_GFP_ORDER_HI  1
 #define BREAK_GFP_ORDER_LO  0
 static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;

Do slab and slub use the same underlying page size for each slab?

Single data point: the CONFIG_SLAB boxes which I have access to here are
using order-0 for radix_tree_node, so they won't be failing in the way in
which Peter's machine is.

I've never ever before seen reports of page allocation failures in the
radix-tree node allocation code, and that's the bottom line.  This is just
a drop-dead must-fix show-stopping bug.  We cannot rely upon atomic order-1
allocations succeeding so we cannot use them for radix-tree nodes.  Nor for
lots of other things which we have no chance of identifying.

Peter, is this bug -mm only, or is 2.6.23 similarly failing?


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Christoph Lameter
On Mon, 1 Oct 2007, Andrew Morton wrote:

 Do slab and slub use the same underlying page size for each slab?

SLAB cannot pack objects as densely as SLUB, and the two use different 
algorithms to choose the order. Thus the number of objects per slab 
may vary between SLAB and SLUB, and therefore also the order chosen to 
store these objects.

 Single data point: the CONFIG_SLAB boxes which I have access to here are
 using order-0 for radix_tree_node, so they won't be failing in the way in
 which Peter's machine is.

Upstream SLUB uses order 0 allocations for the radix tree. -mm varies 
because its use of higher-order allocs is looser when the mobility 
algorithms are found to be active:

2.6.23-rc8:

Name            Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
radix_tree_node   14281     552  9.9M    2432/948/17     0  38  79



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Andrew Morton
On Mon, 1 Oct 2007 14:38:55 -0700 (PDT)
Christoph Lameter [EMAIL PROTECTED] wrote:

 On Mon, 1 Oct 2007, Andrew Morton wrote:
 
  Do slab and slub use the same underlying page size for each slab?
 
 SLAB cannot pack objects as densely as SLUB, and the two use different 
 algorithms to choose the order. Thus the number of objects per slab 
 may vary between SLAB and SLUB, and therefore also the order chosen to 
 store these objects.
 
  Single data point: the CONFIG_SLAB boxes which I have access to here are
  using order-0 for radix_tree_node, so they won't be failing in the way in
  which Peter's machine is.
 
 Upstream SLUB uses order 0 allocations for the radix tree.

OK, that's a relief.

 -mm varies 
 because its use of higher-order allocs is looser when the mobility 
 algorithms are found to be active:
 
 2.6.23-rc8:
 
 Name            Objects Objsize Space Slabs/Part/Cpu O/S O %Fr %Ef Flg
 radix_tree_node   14281     552  9.9M    2432/948/17     0  38  79

Ah.  So the already-dropped
slub-exploit-page-mobility-to-increase-allocation-order.patch was the
culprit?


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Christoph Lameter
On Mon, 1 Oct 2007, Andrew Morton wrote:

 Ah.  So the already-dropped
 slub-exploit-page-mobility-to-increase-allocation-order.patch was the
 culprit?

Yes. Without that patch SLUB no longer takes special action if antifrag 
is around.



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-30 Thread Nick Piggin
On Sunday 30 September 2007 05:20, Andrew Morton wrote:
 On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] 
wrote:
  On Saturday 29 September 2007 19:27, Andrew Morton wrote:
   On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra
   [EMAIL PROTECTED]
 
  wrote:
 oom-killings, or page allocation failures?  The latter, one hopes.
   
Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 
4.1.2
(Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
   
...
   
   
mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
Call Trace:
611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
611b3888:  [60052ed4] out_of_memory+0x80/0x100
611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
611b3948:  [6006c608] allocate_slab+0x5b/0xb0
611b3968:  [6006c705] new_slab+0x7e/0x183
611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
611b3b98:  [60056f00] read_pages+0x37/0x9b
611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
611b3c78:  [60050ab4] filemap_fault+0x145/0x278
611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
611b3da8:  [60120a7b] __up_read+0x73/0x7b
611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
611b3e48:  [60013419] segv+0xac/0x297
611b3f28:  [60013367] segv_handler+0x68/0x6e
611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
611b3f68:  [60023853] userspace+0x13a/0x19d
611b3fc8:  [60010d58] fork_handler+0x86/0x8d
  
   OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
   allocations aren't supposed to fail.
  
   I'm suspecting that did_some_progress thing.
 
  The allocation didn't fail -- it invoked the OOM killer because the
  kernel ran out of unfragmented memory.

 We can't run out of unfragmented memory for an order-2 GFP_KERNEL
 allocation in this workload.  We go and synchronously free stuff up to make
 it work.

 How did this get broken?

Either no more order-2 pages could be freed, or the ones that were being
freed were being used by something else (eg. other order-2 slab allocations).


  Probably because higher order
  allocations are the new vogue in -mm at the moment ;)

 That's a different bug.

 bug 1: We shouldn't be doing higher-order allocations in slub because of
 the considerable damage this does to atomic allocations.

 bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.

I think one causes 2 as well -- it isn't just considerable damage to atomic
allocations but to GFP_KERNEL allocations too.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-30 Thread Andrew Morton
On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 On Sunday 30 September 2007 05:20, Andrew Morton wrote:
  On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] 
 wrote:
   On Saturday 29 September 2007 19:27, Andrew Morton wrote:
On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra
[EMAIL PROTECTED]
  
   wrote:
  oom-killings, or page allocation failures?  The latter, one hopes.

 Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 
 4.1.2
 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007

 ...


 mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
 Call Trace:
 611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
 611b3888:  [60052ed4] out_of_memory+0x80/0x100
 611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
 611b3948:  [6006c608] allocate_slab+0x5b/0xb0
 611b3968:  [6006c705] new_slab+0x7e/0x183
 611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
 611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
 611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
 611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
 611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
 611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
 611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
 611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
 611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
 611b3b98:  [60056f00] read_pages+0x37/0x9b
 611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
 611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
 611b3c78:  [60050ab4] filemap_fault+0x145/0x278
 611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
 611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
 611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
 611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
 611b3da8:  [60120a7b] __up_read+0x73/0x7b
 611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
 611b3e48:  [60013419] segv+0xac/0x297
 611b3f28:  [60013367] segv_handler+0x68/0x6e
 611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
 611b3f68:  [60023853] userspace+0x13a/0x19d
 611b3fc8:  [60010d58] fork_handler+0x86/0x8d
   
OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.
   
I'm suspecting that did_some_progress thing.
  
   The allocation didn't fail -- it invoked the OOM killer because the
   kernel ran out of unfragmented memory.
 
  We can't run out of unfragmented memory for an order-2 GFP_KERNEL
  allocation in this workload.  We go and synchronously free stuff up to make
  it work.
 
  How did this get broken?
 
 Either no more order-2 pages could be freed, or the ones that were being
 freed were being used by something else (eg. other order-2 slab allocations).

No.  The current design of reclaim (for better or for worse) is that for
order 0,1,2 and 3 allocations we just keep on trying until it works.  That
got broken and I think it got broken at a design level when that
did_some_progress logic went in.  Perhaps something else we did later
worsened things.


   Probably because higher order
   allocations are the new vogue in -mm at the moment ;)
 
  That's a different bug.
 
  bug 1: We shouldn't be doing higher-order allocations in slub because of
  the considerable damage this does to atomic allocations.
 
  bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.
 
 I think one causes 2 as well -- it isn't just considerable damage to atomic
 allocations but to GFP_KERNEL allocations too.

Well sure, because we already broke GFP_KERNEL allocations.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-30 Thread Nick Piggin
On Monday 01 October 2007 06:12, Andrew Morton wrote:
 On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin [EMAIL PROTECTED] 
wrote:
  On Sunday 30 September 2007 05:20, Andrew Morton wrote:

   We can't run out of unfragmented memory for an order-2 GFP_KERNEL
   allocation in this workload.  We go and synchronously free stuff up to
   make it work.
  
   How did this get broken?
 
  Either no more order-2 pages could be freed, or the ones that were being
  freed were being used by something else (eg. other order-2 slab
  allocations).

 No.  The current design of reclaim (for better or for worse) is that for
 order 0,1,2 and 3 allocations we just keep on trying until it works.  That
 got broken and I think it got broken at a design level when that
 did_some_progress logic went in.  Perhaps something else we did later
 worsened things.

It will keep trying until it works. It won't have stopped trying (unless
I'm very mistaken?), it's just oom killing things merrily along the way.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
 
   start 2 processes that each mmap a separate 64M file, and which do
   sequential writes on them. start a 3rd process that does the same with
   64M anonymous.
   
   wait for a while, and you'll see order=1 failures.
  
  Really? That means we can no longer even allocate stacks for forking.
  
  It's surprising that neither lumpy reclaim nor the mobility patches can 
  deal with it? Lumpy reclaim should be able to free neighboring pages to 
  avoid the order 1 failure unless there are lots of pinned pages.
  
  I guess then that lots of pages are pinned through I/O?
 
 memory got massively fragmented, as anti-frag gets easily defeated.
 setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
 order blocks to stay available, so we don't mix types. however 12M on
 128M is rather a lot.
 
 it's still on my todo list to look at it further..
 

That would be really really bad (as in: patch-dropping time) if those
order-1 allocations are not atomic.

What's the callsite? 


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:

 Really? That means we can no longer even allocate stacks for forking.

I think I'm running with 4k stacks...



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
 On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  
  On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
  
start 2 processes that each mmap a separate 64M file, and which do
sequential writes on them. start a 3rd process that does the same with
64M anonymous.

wait for a while, and you'll see order=1 failures.
   
   Really? That means we can no longer even allocate stacks for forking.
   
   It's surprising that neither lumpy reclaim nor the mobility patches can 
   deal with it? Lumpy reclaim should be able to free neighboring pages to 
   avoid the order 1 failure unless there are lots of pinned pages.
   
   I guess then that lots of pages are pinned through I/O?
  
  memory got massively fragmented, as anti-frag gets easily defeated.
  setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
  order blocks to stay available, so we don't mix types. however 12M on
  128M is rather a lot.
  
  it's still on my todo list to look at it further..
  
 
 That would be really really bad (as in: patch-dropping time) if those
 order-1 allocations are not atomic.
 
 What's the callsite? 

Ah, right, that was the detail... all this lumpy reclaim is useless for
atomic allocations. And with SLUB using higher order pages, atomic !0
order allocations will be very very common.

One I can remember was:

  add_to_page_cache()
radix_tree_insert()
  radix_tree_node_alloc()
kmem_cache_alloc()

which is an atomic callsite.

Which leaves us in a situation where we can load pages, because there is
free memory, but can't manage to allocate memory to track them.. 
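
For reference, mainline already softens this particular callsite with the
radix-tree preload pattern; roughly (a from-memory sketch of what
add_to_page_cache() does via lib/radix-tree.c, not a verbatim quote):

/*
 * Fill the per-cpu node pool with a sleeping allocation, so the
 * insert under the spinlock can draw from the pool instead of
 * hitting the slab allocator atomically.
 */
error = radix_tree_preload(GFP_KERNEL);
if (!error) {
	spin_lock_irq(&mapping->tree_lock);
	error = radix_tree_insert(&mapping->page_tree, offset, page);
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();
}

The traces elsewhere in this thread show radix_tree_preload() itself going
OOM at order 2, though: once the slab behind the pool needs higher-order
pages, even the sleeping preload inherits the fragmentation problem.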



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:

 Ah, right, that was the detail... all this lumpy reclaim is useless for
 atomic allocations. And with SLUB using higher order pages, atomic !0
 order allocations will be very very common.
 
 One I can remember was:
 
   add_to_page_cache()
 radix_tree_insert()
   radix_tree_node_alloc()
 kmem_cache_alloc()
 
 which is an atomic callsite.
 
 Which leaves us in a situation where we can load pages, because there is
 free memory, but can't manage to allocate memory to track them.. 

Ah, I found a boot log of one of these sessions, it's also full of
order-2 OOMs.. :-/



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 10:47:12 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
  On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
  
   
   On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
   
 start 2 processes that each mmap a separate 64M file, and which do
 sequential writes on them. start a 3rd process that does the same with
 64M anonymous.
 
 wait for a while, and you'll see order=1 failures.

Really? That means we can no longer even allocate stacks for forking.

It's surprising that neither lumpy reclaim nor the mobility patches can 
deal with it? Lumpy reclaim should be able to free neighboring pages to 
avoid the order 1 failure unless there are lots of pinned pages.

I guess then that lots of pages are pinned through I/O?
   
   memory got massively fragmented, as anti-frag gets easily defeated.
   setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
   order blocks to stay available, so we don't mix types. however 12M on
   128M is rather a lot.
   
   it's still on my todo list to look at it further..
   
  
  That would be really really bad (as in: patch-dropping time) if those
  order-1 allocations are not atomic.
  
  What's the callsite? 
 
 Ah, right, that was the detail... all this lumpy reclaim is useless for
 atomic allocations. And with SLUB using higher order pages, atomic !0
 order allocations will be very very common.

Oh OK.

I thought we'd already fixed slub so that it didn't do that.  Maybe that
fix is in -mm but I don't think so.

Trying to do atomic order-1 allocations on behalf of arbitrary slab caches
just won't fly - this is a significant degradation in kernel reliability,
as you've very easily demonstrated.

 One I can remember was:
 
   add_to_page_cache()
 radix_tree_insert()
   radix_tree_node_alloc()
 kmem_cache_alloc()
 
 which is an atomic callsite.
 
 Which leaves us in a situation where we can load pages, because there is
 free memory, but can't manage to allocate memory to track them.. 

Right.  Leading to application failure which for many is equivalent to a
complete system outage.



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
 
  Ah, right, that was the detail... all this lumpy reclaim is useless for
  atomic allocations. And with SLUB using higher order pages, atomic !0
  order allocations will be very very common.
  
  One I can remember was:
  
add_to_page_cache()
  radix_tree_insert()
radix_tree_node_alloc()
  kmem_cache_alloc()
  
  which is an atomic callsite.
  
  Which leaves us in a situation where we can load pages, because there is
  free memory, but can't manage to allocate memory to track them.. 
 
 Ah, I found a boot log of one of these sessions, it's also full of
 order-2 OOMs.. :-/

oom-killings, or page allocation failures?  The latter, one hopes.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Peter Zijlstra

On Sat, 2007-09-29 at 02:01 -0700, Andrew Morton wrote:
 On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
 
  
  On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
  
   Ah, right, that was the detail... all this lumpy reclaim is useless for
   atomic allocations. And with SLUB using higher order pages, atomic !0
   order allocations will be very very common.
   
   One I can remember was:
   
 add_to_page_cache()
   radix_tree_insert()
 radix_tree_node_alloc()
   kmem_cache_alloc()
   
   which is an atomic callsite.
   
   Which leaves us in a situation where we can load pages, because there is
   free memory, but can't manage to allocate memory to track them.. 
  
  Ah, I found a boot log of one of these sessions, it's also full of
  order-2 OOMs.. :-/
 
 oom-killings, or page allocation failures?  The latter, one hopes.


Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
(Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007

...


mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
Call Trace:
611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
611b3888:  [60052ed4] out_of_memory+0x80/0x100
611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
611b3948:  [6006c608] allocate_slab+0x5b/0xb0
611b3968:  [6006c705] new_slab+0x7e/0x183
611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
611b3b98:  [60056f00] read_pages+0x37/0x9b
611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
611b3c78:  [60050ab4] filemap_fault+0x145/0x278
611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
611b3da8:  [60120a7b] __up_read+0x73/0x7b
611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
611b3e48:  [60013419] segv+0xac/0x297
611b3f28:  [60013367] segv_handler+0x68/0x6e
611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
611b3f68:  [60023853] userspace+0x13a/0x19d
611b3fc8:  [60010d58] fork_handler+0x86/0x8d

Mem-info:
Normal per-cpu:
CPU0: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   3 usd:   0
Active:11 inactive:9 dirty:0 writeback:1 unstable:0
 free:19533 slab:10587 mapped:0 pagetables:260 bounce:0
Normal free:78132kB min:4096kB low:5120kB high:6144kB active:44kB inactive:36kB 
present:129280kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0
Normal: 7503*4kB 5977*8kB 19*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 
0*1024kB 0*2048kB 0*4096kB = 78132kB
Swap cache: add 1192822, delete 1192790, find 491441/626861, race 0+1
Free swap  = 455300kB
Total swap = 524280kB
Free swap:   455300kB
32768 pages of RAM
0 pages of HIGHMEM
1948 reserved pages
11 pages shared
32 pages swap cached
Out of memory: kill process 2647 (portmap) score 2233 or a child
Killed process 2647 (portmap)




Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

  oom-killings, or page allocation failures?  The latter, one hopes.
 
 
 Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
 
 ...
 
 
 mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
 Call Trace:
 611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
 611b3888:  [60052ed4] out_of_memory+0x80/0x100
 611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
 611b3948:  [6006c608] allocate_slab+0x5b/0xb0
 611b3968:  [6006c705] new_slab+0x7e/0x183
 611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
 611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
 611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
 611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
 611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
 611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
 611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
 611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
 611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
 611b3b98:  [60056f00] read_pages+0x37/0x9b
 611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
 611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
 611b3c78:  [60050ab4] filemap_fault+0x145/0x278
 611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
 611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
 611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
 611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
 611b3da8:  [60120a7b] __up_read+0x73/0x7b
 611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
 611b3e48:  [60013419] segv+0xac/0x297
 611b3f28:  [60013367] segv_handler+0x68/0x6e
 611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
 611b3f68:  [60023853] userspace+0x13a/0x19d
 611b3fc8:  [60010d58] fork_handler+0x86/0x8d

OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.

I'm suspecting that did_some_progress thing.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Nick Piggin
On Saturday 29 September 2007 19:27, Andrew Morton wrote:
 On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] 
wrote:
   oom-killings, or page allocation failures?  The latter, one hopes.
 
  Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
  (Ubuntu
  4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
 
  ...
 
 
  mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
  Call Trace:
  611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
  611b3888:  [60052ed4] out_of_memory+0x80/0x100
  611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
  611b3948:  [6006c608] allocate_slab+0x5b/0xb0
  611b3968:  [6006c705] new_slab+0x7e/0x183
  611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
  611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
  611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
  611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
  611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
  611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
  611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
  611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
  611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
  611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
  611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
  611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
  611b3b98:  [60056f00] read_pages+0x37/0x9b
  611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
  611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
  611b3c78:  [60050ab4] filemap_fault+0x145/0x278
  611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
  611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
  611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
  611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
  611b3da8:  [60120a7b] __up_read+0x73/0x7b
  611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
  611b3e48:  [60013419] segv+0xac/0x297
  611b3f28:  [60013367] segv_handler+0x68/0x6e
  611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
  611b3f68:  [60023853] userspace+0x13a/0x19d
  611b3fc8:  [60010d58] fork_handler+0x86/0x8d

 OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
 allocations aren't supposed to fail.

 I'm suspecting that did_some_progress thing.

The allocation didn't fail -- it invoked the OOM killer because the kernel
ran out of unfragmented memory. Probably because higher order
allocations are the new vogue in -mm at the moment ;)


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Nick Piggin
On Saturday 29 September 2007 04:41, Christoph Lameter wrote:
 On Fri, 28 Sep 2007, Peter Zijlstra wrote:
  memory got massively fragmented, as anti-frag gets easily defeated.
  setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
  order blocks to stay available, so we don't mix types. however 12M on
  128M is rather a lot.

 Yes, strict ordering would be much better. On NUMA it may be possible to
 completely forbid merging. We can fall back to other nodes if necessary.
 12M is not much on a NUMA system.

 But this shows that (unsurprisingly) we may have issues on systems with a
 small amounts of memory and we may not want to use higher orders on such
 systems.

 The case you got may be good to use as a testcase for the virtual
 fallback. Hmmm... Maybe it is possible to allocate the stack as a virtual
 compound page. Got some script/code to produce that problem?

Yeah, you could do that, but we generally don't have big problems allocating
stacks in mainline, because we have very few users of higher order pages, and
the few that are there don't seem to be a problem.


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton
On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 On Saturday 29 September 2007 19:27, Andrew Morton wrote:
  On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] 
 wrote:
oom-killings, or page allocation failures?  The latter, one hopes.
  
   Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
   (Ubuntu
   4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
  
   ...
  
  
   mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
   Call Trace:
   611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
   611b3888:  [60052ed4] out_of_memory+0x80/0x100
   611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
   611b3948:  [6006c608] allocate_slab+0x5b/0xb0
   611b3968:  [6006c705] new_slab+0x7e/0x183
   611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
   611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
   611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
   611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
   611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
   611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
   611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
   611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
   611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
   611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
   611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
   611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
   611b3b98:  [60056f00] read_pages+0x37/0x9b
   611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
   611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
   611b3c78:  [60050ab4] filemap_fault+0x145/0x278
   611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
   611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
   611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
   611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
   611b3da8:  [60120a7b] __up_read+0x73/0x7b
   611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
   611b3e48:  [60013419] segv+0xac/0x297
   611b3f28:  [60013367] segv_handler+0x68/0x6e
   611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
   611b3f68:  [60023853] userspace+0x13a/0x19d
   611b3fc8:  [60010d58] fork_handler+0x86/0x8d
 
  OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
  allocations aren't supposed to fail.
 
  I'm suspecting that did_some_progress thing.
 
 The allocation didn't fail -- it invoked the OOM killer because the kernel
 ran out of unfragmented memory.

We can't run out of unfragmented memory for an order-2 GFP_KERNEL
allocation in this workload.  We go and synchronously free stuff up to make
it work.

How did this get broken?

 Probably because higher order
 allocations are the new vogue in -mm at the moment ;)

That's a different bug.

bug 1: We shouldn't be doing higher-order allocations in slub because of
the considerable damage this does to atomic allocations.

bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.




Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Nick Piggin
On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
 SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
 available then the conservative settings for higher order allocations are
 overridden. We then request an order that can accommodate at minimum
 100 objects. The size of an individual slab allocation is allowed to reach
 up to 256k (order 6 on i386, order 4 on IA64).

How come SLUB wants such a big amount of objects? I thought the
unqueued nature of it made it better than slab because it minimised
the amount of cache hot memory lying around in slabs...

vmalloc is incredibly slow and unscalable at the moment. I'm still working
on making it more scalable and faster -- hopefully to a point where it would
actually be usable for this... but you still get moved off large TLBs, and
also have to inevitably do tlb flushing.

Or do you have SLUB at a point where performance is comparable to SLAB,
and this is just a possible idea for more performance?
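
For a sense of scale, a hypothetical helper showing what the quoted
100-object minimum implies (slab_order_for() is invented for illustration;
552 bytes is the radix_tree_node size reported elsewhere in this thread):

#include <stddef.h>

#define PAGE_SIZE 4096UL	/* assuming 4K pages, as on i386 */

/* Smallest order whose slab holds at least min_objects objects. */
static int slab_order_for(size_t obj_size, unsigned int min_objects,
			  int max_order)
{
	int order;

	for (order = 0; order <= max_order; order++)
		if ((PAGE_SIZE << order) / obj_size >= min_objects)
			return order;
	return max_order;
}

/* slab_order_for(552, 100, 6) == 4: a 64K slab holds 118 objects,
 * while an order-3 (32K) slab holds only 59. */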


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Christoph Lameter
On Fri, 28 Sep 2007, Nick Piggin wrote:

 On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
  SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
  available then the conservative settings for higher order allocations are
  overridden. We then request an order that can accommodate at minimum
  100 objects. The size of an individual slab allocation is allowed to reach
  up to 256k (order 6 on i386, order 4 on IA64).
 
 How come SLUB wants such a big amount of objects? I thought the
 unqueued nature of it made it better than slab because it minimised
 the amount of cache hot memory lying around in slabs...

The more objects in a page the more the fast path runs. The more the fast 
path runs the lower the cache footprint and the faster the overall 
allocations etc.

SLAB can be configured for large queues holding lots of objects. 
SLUB can only reach the same through large pages because it does not 
have queues. One could add the ability to manage pools of cpu slabs but 
that would be adding yet another layer to compensate for the problem of 
the small pages. Reliable large page allocations means that we can get rid 
of these layers and the many workarounds that we have in place right now.

The unqueued nature of SLUB reduces memory requirements and in general the 
more efficient code paths of SLUB offset the advantage that SLAB can reach 
by being able to put more objects onto its queues. SLAB necessarily 
introduces complexity and cache line use through the need to manage those 
queues.

 vmalloc is incredibly slow and unscalable at the moment. I'm still working
 on making it more scalable and faster -- hopefully to a point where it would
 actually be usable for this... but you still get moved off large TLBs, and
 also have to inevitably do tlb flushing.

Again I have not seen any fallbacks to vmalloc in my testing. What we are 
doing here is mainly to address your theoretical cases that we so far have 
never seen to be a problem and increase the reliability of allocations of
page orders larger than 3 to a usable level. So far I have not 
dared to enable orders larger than 3 by default.

AFAICT the performance of vmalloc is not really relevant. If this would 
become an issue then it would be possible to reduce the orders used to 
avoid fallbacks.

 Or do you have SLUB at a point where performance is comparable to SLAB,
 and this is just a possible idea for more performance?

AFAICT SLUBs performance is superior to SLAB in most cases and it was like 
that from the beginning. I am still concerned about several corner cases 
though (I think most of them are going to be addressed by the per cpu 
patches in mm). Having a comparable or larger amount of per cpu objects as 
SLAB is something that also could address some of these concerns and could 
increase performance much further.
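
For readers new to the idea, the open-coded form of the fallback that
SLAB_VFALLBACK folds into the allocator looks roughly like this (a sketch
of the general technique, not the patch's actual code):

/*
 * Try physically contiguous memory first; if the higher-order
 * allocation fails, fall back to virtually contiguous memory.
 * The caller must track which one it got and free with kfree()
 * or vfree() accordingly.
 */
void *buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!buf)
	buf = vmalloc(size);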


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Peter Zijlstra

On Fri, 2007-09-28 at 10:33 -0700, Christoph Lameter wrote:

 Again I have not seen any fallbacks to vmalloc in my testing. What we are 
 doing here is mainly to address your theoretical cases that we so far have 
 never seen to be a problem and increase the reliability of allocations of
 page orders larger than 3 to a usable level. So far I have not 
 dared to enable orders larger than 3 by default.

take a recent -mm kernel, boot with mem=128M.

start 2 processes that each mmap a separate 64M file, and which do
sequential writes on them. start a 3rd process that does the same with
64M anonymous.

wait for a while, and you'll see order=1 failures.
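
A minimal sketch of that reproducer (reconstructed from the description
above, not Peter's actual code; run two instances with a file argument and
one without, after booting with mem=128M):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (64UL << 20)			/* 64M, per the description */

int main(int argc, char **argv)
{
	char *p;
	size_t i;

	if (argc > 1) {			/* file-backed writer */
		int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
		if (fd < 0 || ftruncate(fd, SZ) < 0)
			return 1;
		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	} else {			/* anonymous writer */
		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}
	if (p == MAP_FAILED)
		return 1;
	for (;;)			/* sequential writes, forever */
		for (i = 0; i < SZ; i += 4096)
			p[i] = 1;
}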





Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Christoph Lameter
On Fri, 28 Sep 2007, Peter Zijlstra wrote:

 
 On Fri, 2007-09-28 at 10:33 -0700, Christoph Lameter wrote:
 
  Again I have not seen any fallbacks to vmalloc in my testing. What we are 
  doing here is mainly to address your theoretical cases that we so far have 
  never seen to be a problem and increase the reliability of allocations of
   page orders larger than 3 to a usable level. So far I have not 
  dared to enable orders larger than 3 by default.
 
 take a recent -mm kernel, boot with mem=128M.

Ok so only 32k pages to play with? I have tried parallel kernel compiles 
with mem=256m and they seemed to be fine.

 start 2 processes that each mmap a separate 64M file, and which do
 sequential writes on them. start a 3rd process that does the same with
 64M anonymous.
 
 wait for a while, and you'll see order=1 failures.

Really? That means we can no longer even allocate stacks for forking.

It's surprising that neither lumpy reclaim nor the mobility patches can 
deal with it? Lumpy reclaim should be able to free neighboring pages to 
avoid the order 1 failure unless there are lots of pinned pages.

I guess then that lots of pages are pinned through I/O?


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Peter Zijlstra

On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:

  start 2 processes that each mmap a separate 64M file, and which do
  sequential writes on them. start a 3rd process that does the same with
  64M anonymous.
  
  wait for a while, and you'll see order=1 failures.
 
 Really? That means we can no longer even allocate stacks for forking.
 
 It's surprising that neither lumpy reclaim nor the mobility patches can 
 deal with it? Lumpy reclaim should be able to free neighboring pages to 
 avoid the order 1 failure unless there are lots of pinned pages.
 
 I guess then that lots of pages are pinned through I/O?

memory got massively fragmented, as anti-frag gets easily defeated.
setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
order blocks to stay available, so we don't mix types. however 12M on
128M is rather a lot.

it's still on my todo list to look at it further..
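
(For anyone reproducing this: the workaround is the standard sysctl,
equivalent to echo 12288 > /proc/sys/vm/min_free_kbytes for the 12M figure
above. A trivial C sketch of applying it:)

#include <stdio.h>

/* Raise min_free_kbytes to ~12M so two MAX_ORDER blocks stay free
 * and page types don't get mixed, per the discussion above. */
int main(void)
{
	FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "w");

	if (!f || fprintf(f, "12288\n") < 0)
		return 1;
	return fclose(f) != 0;
}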



Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Christoph Lameter
On Fri, 28 Sep 2007, Peter Zijlstra wrote:

 memory got massively fragmented, as anti-frag gets easily defeated.
 setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
 order blocks to stay available, so we don't mix types. however 12M on
 128M is rather a lot.

Yes, strict ordering would be much better. On NUMA it may be possible to 
completely forbid merging. We can fall back to other nodes if necessary. 
12M is not much on a NUMA system.

But this shows that (unsurprisingly) we may have issues on systems with a 
small amounts of memory and we may not want to use higher orders on such 
systems.

The case you got may be good to use as a testcase for the virtual 
fallback. Hmmm... Maybe it is possible to allocate the stack as a virtual 
compound page. Got some script/code to produce that problem?


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Mel Gorman
On (28/09/07 20:25), Peter Zijlstra didst pronounce:
 
 On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
 
   start 2 processes that each mmap a separate 64M file, and which do
   sequential writes on them. start a 3rd process that does the same with
   64M anonymous.
   
   wait for a while, and you'll see order=1 failures.
  
  Really? That means we can no longer even allocate stacks for forking.
  
  It's surprising that neither lumpy reclaim nor the mobility patches can 
  deal with it? Lumpy reclaim should be able to free neighboring pages to 
  avoid the order 1 failure unless there are lots of pinned pages.
  
  I guess then that lots of pages are pinned through I/O?
 
 memory got massively fragmented, as anti-frag gets easily defeated.
 setting min_free_kbytes to 12M does seem to solve it - it forces 2 max

The 12MB is related to the size of pageblock_order. I strongly suspect
that if you forced pageblock_order to be something like 4 or 5, the
min_free_kbytes would not need to be raised. The current values are
selected based on the hugepage size.

 order blocks to stay available, so we don't mix types. however 12M on
 128M is rather a lot.
 
 it's still on my todo list to look at it further..
 

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Mel Gorman
On (28/09/07 10:33), Christoph Lameter didst pronounce:
 On Fri, 28 Sep 2007, Nick Piggin wrote:
 
  On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
   SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
   available then the conservative settings for higher order allocations are
   overridden. We then request an order that can accommodate at minimum
   100 objects. The size of an individual slab allocation is allowed to reach
   up to 256k (order 6 on i386, order 4 on IA64).
  
  How come SLUB wants such a big amount of objects? I thought the
  unqueued nature of it made it better than slab because it minimised
  the amount of cache hot memory lying around in slabs...
 
 The more objects in a page the more the fast path runs. The more the fast 
 path runs the lower the cache footprint and the faster the overall 
 allocations etc.
 
 SLAB can be configured for large queues holding lots of objects. 
 SLUB can only reach the same through large pages because it does not 
 have queues.

Large pages, flood gates etc. Be wary.

SLUB has to run 100% reliably or things go whoops. SLUB regularly depends on
atomic allocations and cannot take the necessary steps to get the contiguous
pages if it gets into trouble. This means that something like lumpy reclaim
cannot help you in its current state.

We currently do not take the pre-emptive steps with kswapd to ensure the
high-order pages are free. We also don't do something like have users that
can sleep keep the watermarks high. I had considered the possibility but
didn't have the justification for the complexity.

Minimally, SLUB by default should continue to use order-0 pages. Peter has
managed to bust order-1 pages with mem=128MB. Admittedly, it was a really
hostile workload but the point remains. It was artificially worked around
with min_free_kbytes (value set based on pageblock_order, could also have
been artificially worked around by dropping pageblock_order) and he eventually
caused order-0 failures so the workload is pretty damn hostile to everything.

 One could add the ability to manage pools of cpu slabs but 
 that would be adding yet another layer to compensate for the problem of 
 the small pages.

A compromise may be to have per-cpu lists for higher-order pages in the page
allocator itself as they can be easily drained unlike the SLAB queues. The
thing to watch for would be excessive IPI calls which would offset any
performance gained by SLUB using larger pages.
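
A rough sketch of the shape of that idea (entirely hypothetical: names
invented, locking and the buddy-allocator hooks omitted):

#define PCP_HIGH_ORDER	2	/* e.g. cache order-2 pages */
#define PCP_HIGH_BATCH	4

struct pcp_high {
	struct page *pages[PCP_HIGH_BATCH];
	int count;
};
static DEFINE_PER_CPU(struct pcp_high, pcp_high);

/*
 * Runs on each cpu (via IPI, e.g. on_each_cpu()) and hands the
 * cached pages back to the buddy allocator; these cross-cpu drains
 * are the IPI cost mentioned above.
 */
static void pcp_high_drain(void *unused)
{
	struct pcp_high *p = &__get_cpu_var(pcp_high);

	while (p->count)
		__free_pages(p->pages[--p->count], PCP_HIGH_ORDER);
}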

 Reliable large page allocations means that we can get rid 
 of these layers and the many workarounds that we have in place right now.
 

They are not reliable yet, particularly for atomic allocs.

 The unqueued nature of SLUB reduces memory requirements and in general the 
 more efficient code paths of SLUB offset the advantage that SLAB can reach 
 by being able to put more objects onto its queues. SLAB necessarily 
 introduces complexity and cache line use through the need to manage those 
 queues.
 
  vmalloc is incredibly slow and unscalable at the moment. I'm still working
  on making it more scalable and faster -- hopefully to a point where it would
  actually be usable for this... but you still get moved off large TLBs, and
  also have to inevitably do tlb flushing.
 
 Again I have not seen any fallbacks to vmalloc in my testing. What we are 
 doing here is mainly to address your theoretical cases that we so far have 
 never seen to be a problem and increase the reliability of allocations of
 page orders larger than 3 to a usable level. So far I have not 
 dared to enable orders larger than 3 by default.
 
 AFAICT the performance of vmalloc is not really relevant. If this would 
 become an issue then it would be possible to reduce the orders used to 
 avoid fallbacks.
 

If we're ever falling back to vmalloc, there is a danger that the problem
is merely postponed until vmalloc space is consumed. That is more of an
issue on 32 bit, where the vmalloc area is small to begin with (128MB by
default on i386).

  Or do you have SLUB at a point where performance is comparable to SLAB,
  and this is just a possible idea for more performance?
 
 AFAICT SLUB's performance is superior to SLAB in most cases and it was like 
 that from the beginning. I am still concerned about several corner cases 
 though (I think most of them are going to be addressed by the per cpu 
 patches in mm). Having a comparable or larger amount of per cpu objects as 
 SLAB is something that also could address some of these concerns and could 
 increase performance much further.
 

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Mel Gorman
On (28/09/07 11:41), Christoph Lameter didst pronounce:
 On Fri, 28 Sep 2007, Peter Zijlstra wrote:
 
  memory got massively fragmented, as anti-frag gets easily defeated.
  setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
  order blocks to stay available, so we don't mix types. however 12M on
  128M is rather a lot.
 
 Yes, strict ordering would be much better. On NUMA it may be possible to 
 completely forbid merging.

The forbidding of merging is trivial and the code is isolated to one
function, __rmqueue_fallback(). We don't do it because the decision at
development time was that it was better to allow fragmentation than to take
a reclaim step, for example, and slow things up[1]. This is based on my
initial assumption that anti-frag is mainly of interest to hugepages, which
are happy to wait long periods during startup or to simply fail.
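
For reference, the strict variant really is that small; it amounts to
refusing to steal from foreign free lists altogether, something like this
sketch against the anti-frag code (untested, illustrative only):

#include <linux/mmzone.h>

/*
 * Strict anti-frag: never mix migrate types. The caller then sees an
 * ordinary allocation failure and goes into reclaim or falls back to
 * another node instead of fragmenting a foreign pageblock.
 */
static struct page *__rmqueue_fallback(struct zone *zone, int order,
					int start_migratetype)
{
	return NULL;
}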

 We can fall back to other nodes if necessary. 
 12M is not much on a NUMA system.
 
 But this shows that (unsurprisingly) we may have issues on systems with 
 small amounts of memory and we may not want to use higher orders on such 
 systems.
 

This is another option if you want to use a higher order for SLUB by
default. Use order-0 unless you are sure there is enough memory. At boot
if there is loads of memory, set the higher order and up min_free_kbytes on
each node to reduce mixing[2]. We can test with Peter's uber-hostile
case to see if it works[3].
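
As a sketch of that boot-time decision (the function, the threshold and
the externs are placeholders, not tuned or final):

#include <linux/init.h>
#include <linux/mm.h>

extern int min_free_kbytes;	/* mm/page_alloc.c */
extern int slub_min_order;	/* assuming mm/slub.c exposed its knob */

static int __init slub_pick_default_order(void)
{
	/* Only use higher orders when there is clearly memory to
	 * spare; the 1GB cutoff is a placeholder. */
	if (num_physpages >= (1UL << (30 - PAGE_SHIFT))) {
		slub_min_order = 2;
		/* keep roughly two MAX_ORDER blocks free, as in the
		 * min_free_kbytes workaround above; a real version
		 * would recompute the zone watermarks afterwards */
		if (min_free_kbytes < 12 * 1024)
			min_free_kbytes = 12 * 1024;
	}
	return 0;
}
__initcall(slub_pick_default_order);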

 The case you got may be good to use as a testcase for the virtual 
 fallback. H...

For sure.

 Maybe it is possible to allocate the stack as a virtual 
 compound page. Got some script/code to produce that problem?
 

[1] It might be tunnel vision but I still keep hugepages in mind as the
principal user of anti-frag. Andy used to have patches that forced eviction
of pages of the foreign type when mixing occurred, so the end result was
no mixing. We never fully completed them because it was too costly
for hugepages.

[2] This would require the identification of mixed blocks to be a
statistic available in mainline. Right now, it's only available in -mm
when PAGE_OWNER is set.

[3] The definition of working in this case being that order-0
allocations fail, which he has produced.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab


Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-28 Thread Nick Piggin
On Saturday 29 September 2007 03:33, Christoph Lameter wrote:
 On Fri, 28 Sep 2007, Nick Piggin wrote:
  On Wednesday 19 September 2007 13:36, Christoph Lameter wrote:
   SLAB_VFALLBACK can be specified for selected slab caches. If fallback
   is available then the conservative settings for higher order
   allocations are overridden. We then request an order that can
    accommodate at minimum 100 objects. The size of an individual slab
   allocation is allowed to reach up to 256k (order 6 on i386, order 4 on
   IA64).
 
  How come SLUB wants such a big amount of objects? I thought the
  unqueued nature of it made it better than slab because it minimised
  the amount of cache hot memory lying around in slabs...

 The more objects in a page the more the fast path runs. The more the fast
 path runs the lower the cache footprint and the faster the overall
 allocations etc.

 SLAB can be configured for large queues holding lots of objects.
 SLUB can only reach the same through large pages because it does not
 have queues. One could add the ability to manage pools of cpu slabs but
 that would be adding yet another layer to compensate for the problem of
 the small pages. Reliable large page allocations mean that we can get rid
 of these layers and the many workarounds that we have in place right now.

That doesn't sound very nice because you don't actually want to use up
higher order allocations if you can avoid it, and you definitely don't want
to be increasing your slab page size unit if you can help it, because it
compounds the problem of slab fragmentation.


 The unqueued nature of SLUB reduces memory requirements and in general the
 more efficient code paths of SLUB offset the advantage that SLAB can reach
 by being able to put more objects onto its queues. SLAB necessarily
 introduces complexity and cache line use through the need to manage those
 queues.

I thought it was slower. Have you fixed the performance regression?
(OK, I read further down that you are still working on it but not confirmed
yet...)


  vmalloc is incredibly slow and unscalable at the moment. I'm still
  working on making it more scalable and faster -- hopefully to a point
  where it would actually be usable for this... but you still get moved off
  large TLBs, and also have to inevitably do tlb flushing.

 Again I have not seen any fallbacks to vmalloc in my testing. What we are
 doing here is mainly to address your theoretical cases that we so far have
 never seen to be a problem and increase the reliability of allocations of
  page orders larger than 3 to a usable level. So far I have not
 dared to enable orders larger than 3 by default.

Basically, all that shows is that your testing isn't very thorough. 128MB
is an order of magnitude *more* memory than some users have. They
probably wouldn't be happy with a regression in slab allocator performance
either.


  Or do you have SLUB at a point where performance is comparable to SLAB,
  and this is just a possible idea for more performance?

  AFAICT SLUB's performance is superior to SLAB in most cases and it was like
 that from the beginning. I am still concerned about several corner cases
 though (I think most of them are going to be addressed by the per cpu
 patches in mm). Having a comparable or larger amount of per cpu objects as
 SLAB is something that also could address some of these concerns and could
 increase performance much further.

OK, so long as it isn't going to depend on using higher order pages, that's
fine. (If they help even further as an optional thing, that's fine too. You
can turn them on on your huge systems and not even bother about adding
this vmap fallback -- you won't have me to nag you about these
purely theoretical issues.)


[15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-25 Thread Christoph Lameter
SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
available then the conservative settings for higher order allocations are
overridden. We then request an order that can accommodate at minimum
100 objects. The size of an individual slab allocation is allowed to reach
up to 256k (order 6 with 4k pages on i386, order 4 with 16k pages on IA64).

Implementing fallback requires special handling of virtual mappings in
the free path. However, the impact is minimal since we already check
whether the address is NULL or ZERO_SIZE_PTR. No additional cachelines are
touched if we do not fall back. However, if we need to handle a virtual
compound page then walk the kernel page table in the free paths to
determine the page struct.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/slab.h     |    1 
 include/linux/slub_def.h |    1 
 mm/slub.c                |   52 +++
 3 files changed, 32 insertions(+), 22 deletions(-)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h 2007-09-24 20:34:14.0 -0700
+++ linux-2.6/include/linux/slab.h  2007-09-24 20:35:09.0 -0700
@@ -19,6 +19,7 @@
  * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set.
  */
 #define SLAB_DEBUG_FREE		0x00000100UL	/* DEBUG: Perform (expensive) checks on free */
+#define SLAB_VFALLBACK		0x00000200UL	/* May fall back to vmalloc */
 #define SLAB_RED_ZONE		0x00000400UL	/* DEBUG: Red zone objs in a cache */
 #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
 #define SLAB_HWCACHE_ALIGN	0x00002000UL	/* Align objs on cache lines */
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-09-24 20:34:14.0 -0700
+++ linux-2.6/mm/slub.c 2007-09-24 20:35:09.0 -0700
@@ -285,7 +285,7 @@ static inline int check_valid_pointer(st
if (!object)
return 1;
 
-   base = page_address(page);
+   base = page_to_addr(page);
	if (object < base || object >= base + s->objects * s->size ||
		(object - base) % s->size) {
return 0;
@@ -470,7 +470,7 @@ static void slab_fix(struct kmem_cache *
 static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 {
unsigned int off;   /* Offset of last byte */
-   u8 *addr = page_address(page);
+   u8 *addr = page_to_addr(page);
 
print_tracking(s, p);
 
@@ -648,7 +648,7 @@ static int slab_pad_check(struct kmem_ca
	if (!(s->flags & SLAB_POISON))
return 1;
 
-   start = page_address(page);
+   start = page_to_addr(page);
	end = start + (PAGE_SIZE << s->order);
	length = s->objects * s->size;
remainder = end - (start + length);
@@ -1049,11 +1049,7 @@ static struct page *allocate_slab(struct
struct page * page;
	int pages = 1 << s->order;
 
-	if (s->order)
-		flags |= __GFP_COMP;
-
-	if (s->flags & SLAB_CACHE_DMA)
-		flags |= SLUB_DMA;
+	flags |= s->gfpflags;
 
if (node == -1)
	page = alloc_pages(flags, s->order);
@@ -1107,7 +1103,7 @@ static struct page *new_slab(struct kmem
SLAB_STORE_USER | SLAB_TRACE))
SetSlabDebug(page);
 
-   start = page_address(page);
+   start = page_to_addr(page);
	end = start + s->objects * s->size;

	if (unlikely(s->flags & SLAB_POISON))
@@ -1139,7 +1135,7 @@ static void __free_slab(struct kmem_cach
void *p;
 
slab_pad_check(s, page);
-   for_each_object(p, s, page_address(page))
+   for_each_object(p, s, page_to_addr(page))
check_object(s, page, p, 0);
ClearSlabDebug(page);
}
@@ -1789,10 +1785,9 @@ static inline int slab_order(int size, i
return order;
 }
 
-static inline int calculate_order(int size)
+static inline int calculate_order(int size, int min_objects, int max_order)
 {
int order;
-   int min_objects;
int fraction;
 
/*
@@ -1803,13 +1798,12 @@ static inline int calculate_order(int si
 * First we reduce the acceptable waste in a slab. Then
 * we reduce the minimum objects required in a slab.
 */
-   min_objects = slub_min_objects;
	while (min_objects > 1) {
		fraction = 8;
		while (fraction >= 4) {
			order = slab_order(size, min_objects,
-				slub_max_order, fraction);
-			if (order <= slub_max_order)
+				max_order, fraction);
+			if (order <= max_order)
return order;
fraction /= 2;

[15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-18 Thread Christoph Lameter
SLAB_VFALLBACK can be specified for selected slab caches. If fallback is
available then the conservative settings for higher order allocations are
overridden. We then request an order that can accommodate at minimum
100 objects. The size of an individual slab allocation is allowed to reach
up to 256k (order 6 on i386, order 4 on IA64).

Implementing fallback requires special handling of virtual mappings in
the free path. However, the impact is minimal since we already check
whether the address is NULL or ZERO_SIZE_PTR. No additional cachelines are
touched if we do not fall back. However, if we need to handle a virtual
compound page then walk the kernel page table in the free paths to
determine the page struct.

We also need special handling in the allocation paths since the virtual
addresses cannot be obtained via page_address(). SLUB exploits that
page->private is set to the vmalloc address to avoid a costly
vmalloc_address().

However, for diagnostics there is still the need to determine the
vmalloc address from the page struct. There we must use the costly
vmalloc_address().
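
vmalloc_address() itself is not visible in the hunks below. One plausible
implementation of such a costly reverse lookup -- purely illustrative, not
part of this patch -- scans the vmalloc areas for the page:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Illustrative reverse lookup: find the virtual address a page is
 * mapped at by scanning every vmalloc area. Linear in the number of
 * vmalloc'ed pages, which is why it is only used for diagnostics.
 */
static void *vmalloc_address_sketch(struct page *page)
{
	struct vm_struct *area;
	void *addr = NULL;
	unsigned int i;

	read_lock(&vmlist_lock);
	for (area = vmlist; area && !addr; area = area->next)
		for (i = 0; i < area->nr_pages; i++)
			if (area->pages[i] == page) {
				addr = area->addr + i * PAGE_SIZE;
				break;
			}
	read_unlock(&vmlist_lock);
	return addr;
}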

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/slab.h     |    1 
 include/linux/slub_def.h |    1 
 mm/slub.c                |   83 ---
 3 files changed, 60 insertions(+), 25 deletions(-)

Index: linux-2.6/include/linux/slab.h
===
--- linux-2.6.orig/include/linux/slab.h 2007-09-18 17:03:30.0 -0700
+++ linux-2.6/include/linux/slab.h  2007-09-18 17:07:39.0 -0700
@@ -19,6 +19,7 @@
  * The ones marked DEBUG are only valid if CONFIG_SLAB_DEBUG is set.
  */
 #define SLAB_DEBUG_FREE		0x00000100UL	/* DEBUG: Perform (expensive) checks on free */
+#define SLAB_VFALLBACK		0x00000200UL	/* May fall back to vmalloc */
 #define SLAB_RED_ZONE		0x00000400UL	/* DEBUG: Red zone objs in a cache */
 #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
 #define SLAB_HWCACHE_ALIGN	0x00002000UL	/* Align objs on cache lines */
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c2007-09-18 17:03:30.0 -0700
+++ linux-2.6/mm/slub.c 2007-09-18 18:13:38.0 -0700
@@ -20,6 +20,7 @@
 #include <linux/mempolicy.h>
 #include <linux/ctype.h>
 #include <linux/kallsyms.h>
+#include <linux/vmalloc.h>
 
 /*
  * Lock order:
@@ -277,6 +278,26 @@ static inline struct kmem_cache_node *ge
 #endif
 }
 
+static inline void *slab_address(struct page *page)
+{
+   if (unlikely(PageVcompound(page)))
+   return vmalloc_address(page);
+   else
+   return page_address(page);
+}
+
+static inline struct page *virt_to_slab(const void *addr)
+{
+   struct page *page;
+
+   if (unlikely(is_vmalloc_addr(addr)))
+   page = vmalloc_to_page(addr);
+   else
+   page = virt_to_page(addr);
+
+   return compound_head(page);
+}
+
 static inline int check_valid_pointer(struct kmem_cache *s,
struct page *page, const void *object)
 {
@@ -285,7 +306,7 @@ static inline int check_valid_pointer(st
if (!object)
return 1;
 
-   base = page_address(page);
+   base = slab_address(page);
	if (object < base || object >= base + s->objects * s->size ||
		(object - base) % s->size) {
return 0;
@@ -470,7 +491,7 @@ static void slab_fix(struct kmem_cache *
 static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
 {
unsigned int off;   /* Offset of last byte */
-   u8 *addr = page_address(page);
+   u8 *addr = slab_address(page);
 
print_tracking(s, p);
 
@@ -648,7 +669,7 @@ static int slab_pad_check(struct kmem_ca
	if (!(s->flags & SLAB_POISON))
return 1;
 
-   start = page_address(page);
+   start = slab_address(page);
	end = start + (PAGE_SIZE << s->order);
	length = s->objects * s->size;
remainder = end - (start + length);
@@ -1040,11 +1061,7 @@ static struct page *allocate_slab(struct
struct page * page;
	int pages = 1 << s->order;
 
-	if (s->order)
-		flags |= __GFP_COMP;
-
-	if (s->flags & SLAB_CACHE_DMA)
-		flags |= SLUB_DMA;
+	flags |= s->gfpflags;
 
if (node == -1)
	page = alloc_pages(flags, s->order);
@@ -1098,7 +1115,11 @@ static struct page *new_slab(struct kmem
SLAB_STORE_USER | SLAB_TRACE))
SetSlabDebug(page);
 
-   start = page_address(page);
+   if (!PageVcompound(page))
+   start = slab_address(page);
+   else
+		start = (void *)page->private;
+
	end = start + s->objects * s->size;
 
	if (unlikely(s->flags & SLAB_POISON))
@@ -1130,7 +1151,7 @@ static void __free_slab(struct kmem_cach