Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-07 Thread Michael Ellerman
Christoph Lameter  writes:

> On Thu, 1 Jun 2017, Hugh Dickins wrote:
>
>> Thanks a lot for working that out.  Makes sense, fully understood now,
>> nothing to worry about (though makes one wonder whether it's efficient
>> to use ctors on high-alignment caches; or whether an internal "zero-me"
>> ctor would be useful).
>
> Use kzalloc to zero it.

But that's changing a per-slab-creation memset into a per-object-allocation
memset, isn't it?
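
To make the comparison concrete, here is a rough sketch of the two approaches
(cache names, OBJ_SIZE and the module boilerplate are invented for
illustration; this is not the actual powerpc code):

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/string.h>

#define OBJ_SIZE 32768	/* same size as the pgtable-2^12 objects in this thread */

/* With a ctor, SLUB zeroes each object once, when a new slab page is
 * populated; allocations skip the memset, but callers must return objects
 * to the cache already zeroed. */
static void zero_ctor(void *obj)
{
	memset(obj, 0, OBJ_SIZE);
}

static int __init ctor_vs_zalloc_demo(void)
{
	struct kmem_cache *ctor_cache, *plain_cache;
	void *a, *b;

	ctor_cache  = kmem_cache_create("demo-ctor", OBJ_SIZE, OBJ_SIZE,
					0, zero_ctor);
	plain_cache = kmem_cache_create("demo-plain", OBJ_SIZE, OBJ_SIZE,
					0, NULL);
	if (!ctor_cache || !plain_cache)
		goto out;

	a = kmem_cache_alloc(ctor_cache, GFP_KERNEL);	/* no memset on this path */
	b = kmem_cache_zalloc(plain_cache, GFP_KERNEL);	/* memset(0) on every allocation */

	if (a)
		kmem_cache_free(ctor_cache, a);		/* must go back zeroed */
	if (b)
		kmem_cache_free(plain_cache, b);
out:
	kmem_cache_destroy(ctor_cache);		/* NULL-safe */
	kmem_cache_destroy(plain_cache);
	return 0;
}
module_init(ctor_vs_zalloc_demo);
MODULE_LICENSE("GPL");

So the ctor variant pays one memset per object per slab-page lifetime, while
the zalloc variant pays one memset per allocation.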

> And here is another example of using slab allocations for page frames.
> Use the page allocator for this? The page allocator is there for
> allocating page frames. The slab allocator's main purpose is to allocate
> small objects.

Well, usually they are small (< PAGE_SIZE), because we have 64K pages.

But we could rework the code to use the page allocator on 4K configs.
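
Something along these lines, perhaps; just a sketch, with PGD_ORDER and the
function names made up for illustration rather than taken from our code:

#include <linux/gfp.h>
#include <linux/mm.h>

#define PGD_ORDER	3	/* hypothetical: a 32K top-level table on a 4K-page config */

static pgd_t *pgd_alloc_pages(struct mm_struct *mm)
{
	/* __GFP_ZERO takes over the job the zeroing ctor does today */
	return (pgd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, PGD_ORDER);
}

static void pgd_free_pages(struct mm_struct *mm, pgd_t *pgd)
{
	free_pages((unsigned long)pgd, PGD_ORDER);
}

That avoids the SLUB free-pointer overhead entirely, though a multi-page
allocation can still fail or go to the OOM killer under fragmentation, which
is the underlying problem Hugh is hitting.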

cheers


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-07 Thread Michael Ellerman
Hugh Dickins  writes:
> On Fri, 2 Jun 2017, Michael Ellerman wrote:
>> Hugh Dickins  writes:
>> > On Thu, 1 Jun 2017, Christoph Lameter wrote:
>> >> 
>> >> Ok so debugging was off but the slab cache has a ctor callback which
>> >> mandates that the free pointer cannot use the free object space when
>> >> the object is not in use. Thus the size of the object must be increased to
>> >> accommodate the free pointer.
>> >
>> > Thanks a lot for working that out.  Makes sense, fully understood now,
>> > nothing to worry about (though makes one wonder whether it's efficient
>> > to use ctors on high-alignment caches; or whether an internal "zero-me"
>> > ctor would be useful).
>> 
>> Or should we just be using kmem_cache_zalloc() when we allocate from
>> those slabs?
>> 
>> Given all the ctors do is memset to 0.
>
> I'm not sure.  From a memory-utilization point of view, with SLUB,
> using kmem_cache_zalloc() there would certainly be better.
>
> But you may be forgetting that the constructor is applied only when a
> new slab of objects is allocated, not each time an object is allocated
> from that slab (and the user of those objects agrees to free objects
> back to the cache in a reusable state: zeroed in this case).

Ah yes, I was "forgetting" that :) - i.e. I didn't know it.

> So from a cpu-utilization point of view, it's better to use the ctor:
> it's saving you lots of redundant memsets.

OK. Presumably we guarantee (somewhere) that the page tables are zeroed
before we free them, which is a natural result of tearing down all
mappings?

But then I see other arches (x86 and arm64 at least) which don't use a
constructor, and instead use __GFP_ZERO (via PGALLOC_GFP) at allocation time.

eg. arm64:

pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_SIZE,
  SLAB_PANIC, NULL);
...
return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);


So that's a bit puzzling.
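
For comparison, the ctor-less variant on our side would presumably look
something like this (just a sketch; pgtable_cache and PGD_TABLE_SIZE are
stand-in names, not necessarily what our code uses):

pgtable_cache = kmem_cache_create("pgtable-2^12", PGD_TABLE_SIZE,
  PGD_TABLE_SIZE, 0, NULL);	/* no ctor */
...
pgd = kmem_cache_alloc(pgtable_cache, GFP_KERNEL | __GFP_ZERO);

Without a ctor SLUB could also keep its free pointer inside the free object,
so the buffer size would stay equal to the object size (32K rather than 64K
here).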

cheers


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-02 Thread Christoph Lameter
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory?  Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change.  But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm, how about using alloc_pages() for page frames?



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-02 Thread Christoph Lameter
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it. And here is another example of using slab
allocations for page frames. Use the page allocator for this? The page
allocator is there for allocating page frames. The slab allocator's main
purpose is to allocate small objects.




Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-01 Thread Michael Ellerman
Hugh Dickins  writes:

> On Thu, 1 Jun 2017, Christoph Lameter wrote:
>> 
>> Ok so debugging was off but the slab cache has a ctor callback which
>> mandates that the free pointer cannot use the free object space when
>> the object is not in use. Thus the size of the object must be increased to
>> accommodate the free pointer.
>
> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Or should we just be using kmem_cache_zalloc() when we allocate from
those slabs?

Given all the ctors do is memset to 0.

cheers


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-01 Thread Hugh Dickins
On Thu, 1 Jun 2017, Christoph Lameter wrote:
> 
> Ok so debugging was off but the slab cache has a ctor callback which
> mandates that the free pointer cannot use the free object space when
> the object is not in use. Thus the size of the object must be increased to
> accommodate the free pointer.

Thanks a lot for working that out.  Makes sense, fully understood now,
nothing to worry about (though makes one wonder whether it's efficient
to use ctors on high-alignment caches; or whether an internal "zero-me"
ctor would be useful).

Hugh


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-01 Thread Christoph Lameter
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y.  My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

That's fine.

> But I think you are now surprised, when I say no slub_debug options
> were on.  Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

Ok so debugging was off but the slab cache has a ctor callback which
mandates that the free pointer cannot use the free object space when
the object is not in use. Thus the size of the object must be increased to
accommodate the free pointer.
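
Putting numbers on that, using the sizes reported earlier in the thread (and
assuming the cache is created with its 32K table-size alignment, which the
32768/65536 pair implies):

  object size                            32768
  ctor -> free pointer stored after it:  32768 + 8 = 32776
  rounded up to the 32K alignment:       65536  -> order-4 slab (16 x 4K pages)

  no ctor -> free pointer overlays the free object:
  buffer size stays at                   32768  -> order-3 slab (8 x 4K pages)

Which lines up with the order 3 (SLAB) vs order 4 (SLUB) difference seen
earlier.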


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-01 Thread Christoph Lameter


> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?

The output for the failing allocations when you disable debugging. For
that I would think that you need to remove(!) the slub_debug statement on the
kernel command line. You can verify that debug is off by inspecting the values
in /sys/kernel/slab/<cache>/

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.

No, it no longer does. I don't think slub_debug=O disables debugging
(frankly, I am not sure what it does). Please do not specify any debug options.



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Aneesh Kumar K.V
Hugh Dickins  writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c090b5c0] [c04f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c090b650] [c00eb194] .warn_alloc+0xf0/0x178
> [c090b710] [c00ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c090b8b0] [c013921c] .new_slab+0x234/0x608
> [c090b980] [c013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c090bad0] [c04f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c090bb70] [c013b864] .kmem_cache_alloc+0x140/0x288
> [c090bc30] [c004d934] .mm_init.isra.65+0x128/0x1c0
> [c090bcc0] [c0157810] .do_execveat_common.isra.39+0x294/0x690
> [c090bdb0] [c0157e70] .SyS_execve+0x28/0x38
> [c090be30] [c000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...
>

Can you try this patch.

commit fc55c0dc8b23446f937c1315aa61e74673de5ee6
Author: Aneesh Kumar K.V 
Date:   Thu Jun 1 08:06:40 2017 +0530

powerpc/mm/4k: Limit 4k page size to 64TB

Supporting 512TB requires us to do an order 3 allocation for the level 1 page
table (pgd). Limit 4k to 64TB for now.

Signed-off-by: Aneesh Kumar K.V 

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index b4b5e6b671ca..0c4e470571ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index a2123f291ab0..5de3271026f1 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -110,13 +110,15 @@ void release_thread(struct task_struct *);
 #define TASK_SIZE_128TB (0x8000UL)
 #define TASK_SIZE_512TB (0x0002UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */
-#define TASK_SIZE_USER64   TASK_SIZE_512TB
+#define TASK_SIZE_USER64   TASK_SIZE_512TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
 #else
-#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
 #endif
 
 /*
@@ -132,7 +134,7 @@ void release_thread(struct task_struct *);
  * space during mmap's.
  */
 #define TASK_UNMAPPED_BASE_USER32 (PAGE_ALIGN(TASK_SIZE_USER32 / 4))
-#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(TASK_SIZE_128TB / 4))
+#define TASK_UNMAPPED_BASE_USER64 (PAGE_ALIGN(DEFAULT_MAP_WINDOW_USER64 / 4))
 
 #define TASK_UNMAPPED_BASE ((is_32bit_task()) ? \
TASK_UNMAPPED_BASE_USER32 : TASK_UNMAPPED_BASE_USER64 )
@@ -143,8 +145,8 @@ void release_thread(struct task_struct *);
  * with 128TB and conditionally enable upto 512TB
  */
 #ifdef CONFIG_PPC_BOOK3S_64
-#define DEFAULT_MAP_WINDOW ((is_32bit_task()) ? \
-TASK_SIZE_USER32 : TASK_SIZE_128TB)
+#define 

Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Mathieu Malaterre
On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins  wrote:
> [ Merging two mails into one response ]
>
> On Wed, 31 May 2017, Christoph Lameter wrote:
>> On Tue, 30 May 2017, Hugh Dickins wrote:
>> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>>
>> > I did try booting with slub_debug=O as the message suggested, but that
>> > made no difference: it still hoped for but failed on order:4 allocations.
>>
>> I am curious as to what is going on there. Do you have the output from
>> these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?
>
>>
>> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
>> > it seemed to be a hard requirement for something, but I didn't find what.
>>
>> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
>> be able to enable it at runtime.
>
> Yes, I thought so.
>
>>
>> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
>> > the expected order:3, which then results in OOM-killing rather than direct
>> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
>> > makes no real difference to the outcome: swapping loads still abort early.
>>
>> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>>
>> Ahh. Ok debugging increased the object size to an order 4. This should be
>> order 3 without debugging.
>
> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.
>
>>
>> Why are the slab allocators used to create slab caches for large object
>> sizes?
>
> There may be more optimal ways to allocate, but I expect that when
> the ppc guys were writing the code to handle both 4k and 64k page sizes,
> kmem caches offered the best span of possibility without complication.
>
>>
>> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
>> > with 4k pages would do better not to expect to support a 128TB userspace.
>>
>> I thought you had these huge 64k page sizes?
>
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:

https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Hugh Dickins
[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:
> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
> 
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
> 
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail.  I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented.  What more output are you looking for?

> 
> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
> 
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

> 
> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> > makes no real difference to the outcome: swapping loads still abort early.
> 
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
> 
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too?  If so, then we ought to dig into it further.

> 
> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys were writing the code to handle both 4k and 64k page sizes,
kmem caches offered the best span of possibility without complication.

> 
> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
> 
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Christoph Lameter
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok debugging increased the object size to an order 4. This should be
order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Christoph Lameter
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Michael Ellerman
Hugh Dickins  writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c090b5c0] [c04f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c090b650] [c00eb194] .warn_alloc+0xf0/0x178
> [c090b710] [c00ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c090b8b0] [c013921c] .new_slab+0x234/0x608
> [c090b980] [c013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c090bad0] [c04f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c090bb70] [c013b864] .kmem_cache_alloc+0x140/0x288
> [c090bc30] [c004d934] .mm_init.isra.65+0x128/0x1c0
> [c090bcc0] [c0157810] .do_execveat_common.isra.39+0x294/0x690
> [c090bdb0] [c0157e70] .SyS_execve+0x28/0x38
> [c090be30] [c000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...

Ouch, sorry.

I boot test a G5 with 4K pages, but I don't stress test it much so I
didn't notice this.

I think making 128TB depend on 64K pages makes sense; Aneesh is going to
try and do a patch for that.
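
For the record, the arithmetic behind that, taking the 4K-page index sizes
from Aneesh's patch above and 8-byte table entries:

  VA bits = PAGE_SHIFT + PTE + PMD + PUD + PGD index sizes
          = 12 + 9 + 7 + 9 + 12 = 49 bits = 512TB
            -> pgd = 2^12 entries x 8 bytes = 32K, an order-3 allocation
          = 12 + 9 + 7 + 9 +  9 = 46 bits =  64TB
            -> pgd = 2^9 entries x 8 bytes = 4K, a single page

So limiting 4K-page configs to 64TB keeps the top-level table at a single
page, and the 128TB/512TB ranges stay tied to 64K pages.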

cheers