Re: [PATCH V2] mm/page_alloc: Ensure that HUGETLB_PAGE_ORDER is less than MAX_ORDER

2021-04-19 Thread Christoph Lameter
On Mon, 19 Apr 2021, Anshuman Khandual wrote:

> >> Unfortunately the build test fails on both the platforms (powerpc and ia64)
> >> which subscribe to HUGETLB_PAGE_SIZE_VARIABLE and where this check would
> >> make sense. I somehow overlooked the cross-compile build failure that
> >> actually detected this problem.
> >>
> >> But wondering why this assert is not holding true? And how do these
> >> platforms not see the warning during boot (or do they?) at mm/vmstat.c:1092
> >> like arm64 did.
> >>
> >> static int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> >> {
> >> 	unsigned long requested = 1UL << order;
> >>
> >> 	if (WARN_ON_ONCE(order >= MAX_ORDER))
> >> 		return 0;
> >>
> >>
> >> Can pageblock_order really exceed MAX_ORDER - 1?

You can have larger blocks but you would need to allocate multiple
contiguous max order blocks or do it at boot time before the buddy
allocator is active.

What IA64 did was to do this at boot time thereby avoiding the buddy
lists. And it had a separate virtual address range and page table for the
huge pages.

Looks like the current code does these allocations via CMA, which should
also bypass the buddy allocator.
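For illustration, a minimal sketch of the boot-time route (illustrative
only: memblock_alloc() is the generic early allocator; the real hugetlb
gigantic-page code uses its own memblock-based helper):

#include <linux/memblock.h>

/* Before the buddy allocator takes over, memblock can hand out
 * physically contiguous ranges of any size, including blocks larger
 * than (1UL << (MAX_ORDER - 1)) pages. */
static void * __init reserve_gigantic_block(phys_addr_t size)
{
	return memblock_alloc(size, size);	/* naturally aligned */
}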


> > }
> >
> >
> > But it's kind of weird, isn't it? Let's assume we have MAX_ORDER - 1
> > corresponding to 4 MiB and pageblock_order corresponding to 8 MiB.
> >
> > Sure, we'd be grouping pages in 8 MiB chunks, however, we cannot even
> > allocate 8 MiB chunks via the buddy. So only alloc_contig_range()
> > could really grab them (IOW: gigantic pages).
>
> Right.

But then you can avoid the buddy allocator.

> > Further, we have code like deferred_free_range(), where we end up
> > calling __free_pages_core()->...->__free_one_page() with
> > pageblock_order. Wouldn't we end up setting the buddy order to
> > something > MAX_ORDER -1 on that path?
>
> Agreed.

We would need to return the supersized block to the huge page pool and not
to the buddy allocator. There is a special callback in the compound page
so that you can call an alternate free function that is not the buddy
allocator.
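A minimal sketch of that callback, assuming the compound-page destructor
API of that era:

	/* hugetlb tags the compound head with its destructor id, so the
	 * final put_page() ends up calling free_huge_page(), returning
	 * the block to the huge page pool instead of the buddy lists. */
	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);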

>
> >
> > Having pageblock_order > MAX_ORDER feels wrong and looks shaky.
> >
> Agreed, definitely does not look right. Let's see what other folks
> might have to say on this.
>
> + Christoph Lameter 
>

It was done for a long time successfully and is running in numerous
configurations.


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-02 Thread Christoph Lameter
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory?  Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change.  But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm how about using alloc_pages() for pageframes?



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-02 Thread Christoph Lameter
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out.  Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it. And here is another example of using slab
allocations for page frames. Use the page allocator for this? The page
allocator is there for allocating page frames. The slab allocator's main
purpose is to allocate small objects.
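A side-by-side sketch of the two interfaces (illustrative only; struct foo
is a stand-in):

	/* Small object: the slab allocator's job. */
	struct foo *f = kzalloc(sizeof(*f), GFP_KERNEL);	/* zeroed */
	kfree(f);

	/* Page-frame-sized and zeroed: ask the page allocator directly. */
	unsigned long frame = __get_free_pages(GFP_KERNEL | __GFP_ZERO, 2);
	free_pages(frame, 2);		/* order-2 block, i.e. 4 page frames */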




Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-01 Thread Christoph Lameter
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y.  My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

That's fine.

> But I think you are now surprised, when I say no slub_debug options
> were on.  Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

Ok, so debugging was off, but the slab cache has a ctor callback, which
mandates that the free pointer cannot reuse the object space while the
object is free. Thus the size of the object must be increased to
accommodate the free pointer.
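A sketch of how such a cache comes about (the cache name and ctor are
hypothetical stand-ins for the ppc64 pgtable-2^12 cache):

	/* Because free objects must stay in their constructed state,
	 * SLUB cannot overlay its freelist pointer on the object; the
	 * per-object footprint grows beyond the nominal object size. */
	static void pgtable_ctor(void *obj)
	{
		memset(obj, 0, 32768);	/* keep the table zeroed */
	}

	cache = kmem_cache_create("pgtable-demo", 32768, 32768,
				  0, pgtable_ctor);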


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-06-01 Thread Christoph Lameter


> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail.  I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented.  What more output are you looking for?

The output for the failing allocations when you disable debugging. For
that I would think that you need to remove(!) the slub_debug statement on
the kernel command line. You can verify that debug is off by inspecting
the values in /sys/kernel/slab/<cache>/.

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too?  If so, then we ought to dig into it further.

No, it no longer does. I don't think slub_debug=O disables debugging
(frankly, I am not sure what it does). Please do not specify any debug options.



Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Christoph Lameter
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536,
> >   default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok, debugging increased the allocation to order 4. This should be
order 3 without debugging.
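The quoted numbers are consistent with that: a 32768-byte object fits
exactly into an order-3 slab (8 pages x 4 KiB = 32 KiB), but once
debugging appends the free pointer and red zones the object no longer
fits, so the buffer size is rounded up to 65536 bytes and the minimum
slab order becomes 4 (64 KiB).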

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?


Re: 4.12-rc ppc64 4k-page needs costly allocations

2017-05-31 Thread Christoph Lameter
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff.  But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?



Re: [PATCH] percpu: improve generic percpu modify-return implementation

2016-09-21 Thread Christoph Lameter
On Wed, 21 Sep 2016, Tejun Heo wrote:

> Hello, Nick.
>
> How have you been? :)
>

He is baack. Are we getting SL!B? ;-)



Re: [kernel-hardening] Re: [PATCH 9/9] mm: SLUB hardened usercopy support

2016-07-08 Thread Christoph Lameter
On Fri, 8 Jul 2016, Kees Cook wrote:

> Is check_valid_pointer() making sure the pointer is within the usable
> size? It seemed like it was checking that it was within the slub
> object (checks against s->size, wants it above base after moving
> pointer to include redzone, etc).

check_valid_pointer verifies that a pointer is pointing to the start of an
object. It is used to verify the internal pointers that SLUB uses and
should not be modified to do anything different.
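A paraphrase of that check (a sketch of the mm/slub.c logic of that era,
not the verbatim function):

	static int valid_object_pointer(struct kmem_cache *s,
					struct page *page, const void *object)
	{
		void *base = page_address(page);

		if ((void *)object < base ||
		    (void *)object >= base + page->objects * s->size ||
		    ((void *)object - base) % s->size)
			return 0;	/* not the start of an object */
		return 1;
	}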


Re: [kernel-hardening] Re: [PATCH 9/9] mm: SLUB hardened usercopy support

2016-07-08 Thread Christoph Lameter
On Fri, 8 Jul 2016, Michael Ellerman wrote:

> > I wonder if this code should be using size_from_object() instead of s->size?
>
> Hmm, not sure. Who's SLUB maintainer? :)

Me.

s->size is the size of the whole object including debugging info etc.
ksize() gives you the actual usable size of an object.
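In other words (a sketch; the 128 is just a plausible kmalloc rounding):

	void *p = kmalloc(100, GFP_KERNEL);
	size_t usable = ksize(p);	/* e.g. 128: the usable size;
					 * s->size would additionally count
					 * red zones, the free pointer and
					 * other debug metadata */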


Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

  NAK. This is changing slob behavior. With no node specified it must use
  alloc_pages because that obeys NUMA memory policies etc etc. It should not
  force allocation from the current node like what is happening here after
  the patch. See the code in slub.c that is similar.

 Doh, somehow I convinced myself that there's #else and alloc_pages() is only
 used for !CONFIG_NUMA so it doesn't matter. Here's a fixed version.

Acked-by: Christoph Lameter c...@linux.com

Re: [PATCH v3 3/3] mm: use numa_mem_id() in alloc_pages_node()

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

 numa_mem_id() is able to handle allocation from CPUs on memory-less nodes,
 so it's a more robust fallback than the currently used numa_node_id().

 Suggested-by: Christoph Lameter c...@linux.com
 Signed-off-by: Vlastimil Babka vba...@suse.cz
 Acked-by: David Rientjes rient...@google.com
 Acked-by: Mel Gorman mgor...@techsingularity.net

You can add my ack too if it helps.

Acked-by: Christoph Lameter c...@linux.com

Re: [PATCH v3 2/3] mm: unify checks in alloc_pages_node() and __alloc_pages_node()

2015-07-30 Thread Christoph Lameter

Acked-by: Christoph Lameter c...@linux.com


Re: [PATCH v3 1/3] mm: rename alloc_pages_exact_node to __alloc_pages_node

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Vlastimil Babka wrote:

 --- a/mm/slob.c
 +++ b/mm/slob.c
 	void *page;

 -#ifdef CONFIG_NUMA
 -	if (node != NUMA_NO_NODE)
 -		page = alloc_pages_exact_node(node, gfp, order);
 -	else
 -#endif
 -		page = alloc_pages(gfp, order);
 +	page = alloc_pages_node(node, gfp, order);

NAK. This is changing slob behavior. With no node specified it must use
alloc_pages because that obeys NUMA memory policies etc etc. It should not
force allocation from the current node like what is happening here after
the patch. See the code in slub.c that is similar.
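That is, the shape of the pre-patch slob code (and of the similar code in
slub.c) being defended:

#ifdef CONFIG_NUMA
	if (node != NUMA_NO_NODE)
		/* an explicit node was requested */
		page = alloc_pages_exact_node(node, gfp, order);
	else
#endif
		/* no node given: obey NUMA memory policies etc. */
		page = alloc_pages(gfp, order);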


Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-23 Thread Christoph Lameter
On Wed, 22 Jul 2015, David Rientjes wrote:

 Eek, yeah, that does look bad.  I'm not even sure the

  	if (nid < 0)
  		nid = numa_node_id();

 is correct; I think this should be comparing to NUMA_NO_NODE rather than
 all negative numbers, otherwise we silently ignore overflow and nobody
 ever knows.

Comparing to NUMA_NO_NODE would be better. Also use numa_mem_id() instead
to support memoryless nodes better?

 The only possible downside would be existing users of
 alloc_pages_node() that are calling it with an offline node.  Since it's a
 VM_BUG_ON() that would catch that, I think it should be changed to a
 VM_WARN_ON() and eventually fixed up because it's nonsensical.
 VM_BUG_ON() here should be avoided.

The offline node thing could be addressed by using numa_mem_id()?
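Putting the two suggestions together, the entry check would look something
like this (a sketch, not the code that was merged):

	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();	/* nearest node with memory */
	VM_WARN_ON(!node_online(nid));	/* catch overflow and offline
					 * nodes without a BUG() */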


Re: [PATCH] mm: rename and document alloc_pages_exact_node

2015-07-21 Thread Christoph Lameter
On Tue, 21 Jul 2015, Vlastimil Babka wrote:

 The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 (page
 allocator: do not check NUMA node ID when the caller knows the node is valid)
 as an optimized variant of alloc_pages_node(), that doesn't allow the node id
 to be -1. Unfortunately the name of the function can easily suggest that the
 allocation is restricted to the given node. In truth, the node is only
 preferred, unless __GFP_THISNODE is among the gfp flags.

Yup. I complained about this when this was introduced. Glad to see this
fixed. Initially this was alloc_pages_node() which just means that a node
is specified. The exact behavior of the allocation is determined by flags
such as GFP_THISNODE. I'd rather have that restored because otherwise we
get into weird code like the one below. And such an arrangement also
leaves the way open to add more flags in the future that may change the
allocation behavior.


 	area->nid = nid;
 	area->order = order;
 -	area->pages = alloc_pages_exact_node(area->nid,
 +	area->pages = alloc_pages_prefer_node(area->nid,
 					GFP_KERNEL|__GFP_THISNODE,
 					area->order);

This is not preferring a node but requiring allocation on that node.

Re: powerpc: Replace __get_cpu_var uses

2014-10-29 Thread Christoph Lameter
On Wed, 29 Oct 2014, Michael Ellerman wrote:

  #define __ARCH_IRQ_STAT

 -#define local_softirq_pending()	__get_cpu_var(irq_stat).__softirq_pending
 +#define local_softirq_pending()	__this_cpu_read(irq_stat.__softirq_pending)
 +#define set_softirq_pending(x)	__this_cpu_write(irq_stat._softirq_pending, (x))
 +#define or_softirq_pending(x)	__this_cpu_or(irq_stat._softirq_pending, (x))

 This breaks the build, because we also get the version of set_ and or_ from
 include/linux/interrupt.h, and then because it's __softirq_pending.

 Fixed by adding:

 #define __ARCH_SET_SOFTIRQ_PENDING

 And fixing the typo.

Ok.

 
   void __set_breakpoint(struct arch_hw_breakpoint *brk)
   {
  -   __get_cpu_var(current_brk) = *brk;
  +   __this_cpu_write(current_brk, *brk);

 This breaks the build because we're trying to do a structure assignment but
 __this_cpu_write() only supports certain sizes.

 I replaced it with this which I think is right?

   memcpy(this_cpu_ptr(&current_brk), brk, sizeof(*brk));



Yes that is right. Thank you.


Re: powerpc: Replace __get_cpu_var uses

2014-10-27 Thread Christoph Lameter
Ping? We are planning to remove support for __get_cpu_var in the
3.19 merge period. I can move the definition for __get_cpu_var into the
powerpc per cpu definition instead if we cannot get this merged?

On Tue, 21 Oct 2014, Christoph Lameter wrote:


 This still has not been merged and now powerpc is the only arch that does
 not have this change. Sorry about missing linuxppc-dev before.

 [snip -- the full patch is reproduced in the original 2014-10-21 posting below]

Re: powerpc: Replace __get_cpu_var uses

2014-10-27 Thread Christoph Lameter
On Tue, 28 Oct 2014, Michael Ellerman wrote:

 I'm happy to put it in a topic branch for 3.19, or move the definition or
 whatever, your choice Christoph.


Get the patch merged please.


powerpc: Replace __get_cpu_var uses

2014-10-21 Thread Christoph Lameter

This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.


V2-V2
  - Fix up to work against 3.18-rc1

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form __get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :


#define __get_cpu_var(var) (*this_cpu_ptr((var)))



__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.


This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and fewer
registers are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e.  using a global
register that may be set to the per cpu base.




Transformations done to __get_cpu_var()


1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);


2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

Converts to

int *x = this_cpu_ptr(y);


3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

   Converts to

int x = __this_cpu_read(y);


4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

   Converts to

memcpy(x, this_cpu_ptr(y), sizeof(x));


5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

   Converts to

__this_cpu_write(y, x);


6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

   Converts to

__this_cpu_inc(y)


Cc: Benjamin Herrenschmidt b...@kernel.crashing.org
CC: Paul Mackerras pau...@samba.org
Signed-off-by: Christoph Lameter c...@linux.com
---
 arch/powerpc/include/asm/hardirq.h   |  4 +++-
 arch/powerpc/include/asm/tlbflush.h  |  4 ++--
 arch/powerpc/include/asm/xics.h  |  8 
 arch/powerpc/kernel/dbell.c  |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c  |  6 +++---
 arch/powerpc/kernel/iommu.c  |  2 +-
 arch/powerpc/kernel/irq.c|  4 ++--
 arch/powerpc/kernel/kgdb.c   |  2 +-
 arch/powerpc/kernel/kprobes.c|  6 +++---
 arch/powerpc/kernel/mce.c| 24 
 arch/powerpc/kernel/process.c| 10 +-
 arch/powerpc/kernel/smp.c|  6 +++---
 arch/powerpc/kernel/sysfs.c  |  4 ++--
 arch/powerpc/kernel/time.c   | 22 +++---
 arch/powerpc/kernel/traps.c  |  6 +++---
 arch/powerpc/kvm/e500.c  | 14 +++---
 arch/powerpc/kvm/e500mc.c|  4 ++--
 arch/powerpc/mm/hash_native_64.c |  2 +-
 arch/powerpc/mm/hash_utils_64.c  |  2 +-
 arch/powerpc/mm/hugetlbpage-book3e.c |  6 +++---
 arch/powerpc/mm/hugetlbpage.c|  2 +-
 arch/powerpc/mm/stab.c   | 12 ++--
 arch/powerpc/perf/core-book3s.c  | 22 +++---
 arch/powerpc/perf/core-fsl-emb.c |  6 +++---
 arch/powerpc/platforms/cell/interrupt.c  |  6 +++---
 arch/powerpc/platforms/ps3/interrupt.c   |  2 +-
 arch/powerpc/platforms/pseries/dtl.c |  2 +-
 arch/powerpc/platforms/pseries/hvCall_inst.c |  4 ++--
 arch/powerpc/platforms/pseries/iommu.c   |  8 
 arch/powerpc/platforms/pseries/lpar.c|  6 +++---
 arch/powerpc/platforms/pseries/ras.c |  4 ++--
 arch/powerpc/sysdev/xics/xics-common.c   |  2 +-
 32 files changed, 108 insertions(+), 106 deletions(-)

Index: linux/arch/powerpc/include/asm/hardirq.h

Re: [RFC PATCH v3 1/4] topology: add support for node_to_mem_node() to determine the fallback node

2014-08-14 Thread Christoph Lameter
On Wed, 13 Aug 2014, Nishanth Aravamudan wrote:

 +++ b/include/linux/topology.h
 @@ -119,11 +119,20 @@ static inline int numa_node_id(void)
   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
   */
  DECLARE_PER_CPU(int, _numa_mem_);
 +extern int _node_numa_mem_[MAX_NUMNODES];

Why are these variables starting with an _?
Maybe _numa_mem_ was defined that way because it is typically not defined.
We don't do this in other situations.


RE: Kernel build issues after yesterdays merge by Linus

2014-06-12 Thread Christoph Lameter
Gobbledygook due to missing MIME header.

On Thu, 12 Jun 2014, David Laight wrote:

 From: Anton Blanchard
 ...
 > diff --git a/arch/powerpc/boot/install.sh b/arch/powerpc/boot/install.sh
 > index b6a256b..e096e5a 100644
 > --- a/arch/powerpc/boot/install.sh
 > +++ b/arch/powerpc/boot/install.sh
 > @@ -23,8 +23,8 @@ set -e
 >
 >  # User may have a custom install script
 >
 > -if [ -x ~/bin/${INSTALLKERNEL} ]; then exec ~/bin/${INSTALLKERNEL} "$@"; fi
 > -if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} "$@"; fi
 > +if [ -x ~/bin/${INSTALLKERNEL} ]; then exec ~/bin/${INSTALLKERNEL} $1 $2 $3 $4; fi
 > +if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} $1 $2 $3 $4; fi

 You probably want to enclose the $1 in " as:

 > +if [ -x /sbin/${INSTALLKERNEL} ]; then exec /sbin/${INSTALLKERNEL} "$1" "$2" "$3" "$4"; fi

 	David


Kernel build issues after yesterdays merge by Linus

2014-06-11 Thread Christoph Lameter
This is under Ubuntu Utopic Unicorn on a Power 8 system while simply
trying to build with the Ubuntu standard kernel config. It could be that
these issues come about because we do not have an rc1 yet but I wanted to
give some early notice. Also this is a new arch to me so I may not be
aware of how things work.


1. Bad relocation while building:

root@rd-power8:/rdhome/clameter/linux# make
  CHK include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  CALLscripts/checksyscalls.sh
  CHK include/generated/compile.h
  SKIPPED include/generated/compile.h
  CALLarch/powerpc/kernel/systbl_chk.sh
  CALLarch/powerpc/kernel/prom_init_check.sh
  CHK kernel/config_data.h
  CALLarch/powerpc/relocs_check.pl
WARNING: 1 bad relocations
c0cc7df0 R_PPC64_ADDR64 __crc_TOC.



2. make install fails

root@rd-power8:/rdhome/clameter/linux# make install
sh -x /rdhome/clameter/linux/arch/powerpc/boot/install.sh 3.15.0+
vmlinux System.map /boot arch/powerpc/boot/zImage.pseries
arch/powerpc/boot/zImage.epapr
+ set -e
+ [ -x /home/clameter/bin/installkernel ]
+ [ -x /sbin/installkernel ]
+ exec /sbin/installkernel 3.15.0+ vmlinux System.map /boot
arch/powerpc/boot/zImage.pseries arch/powerpc/boot/zImage.epapr
Usage: installkernel version image System.map directory
/rdhome/clameter/linux/arch/powerpc/boot/Makefile:393: recipe for target
'install' failed
make[1]: *** [install] Error 1
/rdhome/clameter/linux/arch/powerpc/Makefile:294: recipe for target
'install' failed
make: *** [install] Error 2



3. Ubuntu make-kpkg fails

clameter@rd-power8:~/linux$ fakeroot make-kpkg --initrd --revision 1
kernel_image
exec make kpkg_version=13.013 -f
/usr/share/kernel-package/ruleset/minimal.mk debian DEBIAN_REVISION=1
INITRD=YES
== making target debian/stamp/conf/minimal_debian [new prereqs:
]==
This is kernel package version 13.013.
test -d debian || mkdir debian
test ! -e stamp-building || rm -f stamp-building
install -p -m 755 /usr/share/kernel-package/rules debian/rules
for file in ChangeLog  Control  Control.bin86 config templates.in rules;
do  \
cp -f  /usr/share/kernel-package/$file ./debian/;
\
done
cp: cannot stat ‘/usr/share/kernel-package/ChangeLog’: No such file or
directory
for dir  in Config docs examples ruleset scripts pkg po;  do
\
  cp -af /usr/share/kernel-package/$dir  ./debian/;
\
done
test -f debian/control || sed -e 's/=V/../g'  \
-e 's/=D/1/g' -e 's/=A/ppc64el/g'  \
-e 's/=SA//g'  \
-e 's/=I//g'\
-e 's/=CV/./g'  \
-e 's/=M/Unknown Kernel Package Maintainer
unkn...@unconfigured.in.etc.kernel-pkg.conf/g'
\
-e 's/=ST/linux/g'  -e 's/=B/ppc64el/g'\
-e 's/=R//g'/usr/share/kernel-package/Control 
debian/control
test -f debian/changelog ||  sed -e 's/=V/../g'   \
-e 's/=D/1/g'-e 's/=A/ppc64el/g'   \
-e 's/=ST/linux/g' -e 's/=B/ppc64el/g' \
-e 's/=M/Unknown Kernel Package Maintainer
unkn...@unconfigured.in.etc.kernel-pkg.conf/g'
\
 /usr/share/kernel-package/changelog  debian/changelog
chmod 0644 debian/control debian/changelog
test -d ./debian/stamp || mkdir debian/stamp
make -f debian/rules debian/stamp/conf/kernel-conf
make[1]: Entering directory '/rdhome/clameter/linux'
debian/ruleset/misc/checks.mk:36: *** Error. I do not know where the
kernel image goes to [kimagedest undefined] The usual case for this is
that I could not determine which arch or subarch this machine belongs to.
Please specify a subarch, and try again..  Stop.
make[1]: Leaving directory '/rdhome/clameter/linux'
/usr/share/kernel-package/ruleset/minimal.mk:93: recipe for target
'debian/stamp/conf/minimal_debian' failed
make: *** [debian/stamp/conf/minimal_debian] Error 2
Failed to create a ./debian directory:  at /usr/bin/make-kpkg line 966.




4. Errors during build:

Lots of integer to different pointer size conversions?

power and percpu: Could we move the paca into the percpu area?

2014-06-11 Thread Christoph Lameter

Looking at arch/powerpc/include/asm/percpu.h I see that the per cpu offset
comes from a local_paca field and local_paca is in r13. That means that
for all percpu operations we first have to determine the address through a
memory access.

Would it be possible to put the paca at the beginning of the percpu data
area and then have r31 point to the percpu area?

power has these nice instructions that fetch from an offset relative to a
base register which could be used throughout for percpu operations in the
kernel (similar to x86 segment registers).

With that we may also be able to use the atomic ops for fast percpu access
so that we can avoid the irq enable/disable sequence that is now required
for percpu atomics. Would result in fast and reliable percpu
counters for powerpc.

I.e. powerpc atomic inc:

static __inline__ void atomic_inc(atomic_t *v)
{
	int t;

	__asm__ __volatile__(
"1:	lwarx	%0,0,%2		# atomic_inc\n\
	addic	%0,%0,1\n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,0,%2 \n\
	bne-	1b"
	: "=r" (t), "+m" (v->counter)
	: "r" (&v->counter)
	: "cc", "xer");
}

Could be used as a template to get:

static __inline__ void raw_cpu_inc_4(__percpu void *v)
{
	int t;

	__asm__ __volatile__(
"1:	lwarx	%0,r31,%2	# percpu_inc\n\
	addic	%0,%0,1\n"
	PPC405_ERR77(0,%2)
"	stwcx.	%0,r31,%2 \n\
	bne-	1b"
	: "=r" (t), "+m" (v)
	: "r" (&v->counter)
	: "cc", "xer");
}


Re: Node 0 not necessary for powerpc?

2014-05-21 Thread Christoph Lameter
On Mon, 19 May 2014, Nishanth Aravamudan wrote:

 I'm seeing a panic at boot with this change on an LPAR which actually
 has no Node 0. Here's what I think is happening:

 start_kernel
 ...
 -> setup_per_cpu_areas
    -> pcpu_embed_first_chunk
       -> pcpu_fc_alloc
          -> ___alloc_bootmem_node(NODE_DATA(cpu_to_node(cpu), ...
 -> smp_prepare_boot_cpu
    -> set_numa_node(boot_cpuid)

 So we panic on the NODE_DATA call. It seems that ia64, at least, uses
 pcpu_alloc_first_chunk rather than embed. x86 has some code to handle
 early calls of cpu_to_node (early_cpu_to_node) and sets the mapping for
 all CPUs in setup_per_cpu_areas().

Maybe we can switch ia64 to embed too? Tejun: Why are there these
dependencies?

 Thoughts? Does that mean we need something similar to x86 for powerpc?

Tejun is the expert in this area. CCing him.


Re: Bug in reclaim logic with exhausted nodes?

2014-04-03 Thread Christoph Lameter
On Mon, 31 Mar 2014, Nishanth Aravamudan wrote:

 Yep. The node exists, it's just fully exhausted at boot (due to the
 presence of 16GB pages reserved at boot-time).

Well if you want us to support that then I guess you need to propose
patches to address this issue.

 I'd appreciate a bit more guidance? I'm suggesting that in this case the
 node functionally has no memory. So the page allocator should not allow
 allocations from it -- except (I need to investigate this still)
 userspace accessing the 16GB pages on that node, but that, I believe,
 doesn't go through the page allocator at all, it's all from hugetlb
 interfaces. It seems to me there is a bug in SLUB that we are noting
 that we have a useless per-node structure for a given nid, but not
 actually preventing requests to that node or reclaim because of those
 allocations.

Well if you can address that without impacting the fastpath then we could
do this. Otherwise we would need a fake structure here to avoid adding
checks to the fastpath.

 I think there is a logical bug (even if it only occurs in this
 particular corner case) where if reclaim progresses for a THISNODE
 allocation, we don't check *where* the reclaim is progressing, and thus
 may falsely be indicating that we have done some progress when in fact
 the allocation that is causing reclaim will not possibly make any more
 progress.

Ok maybe we could address this corner case. How would you do this?


Re: Bug in reclaim logic with exhausted nodes?

2014-03-28 Thread Christoph Lameter
On Thu, 27 Mar 2014, Nishanth Aravamudan wrote:

  That looks to be the correct way to handle things. Maybe mark the node as
  offline or somehow not present so that the kernel ignores it.

 This is a SLUB condition:

 mm/slub.c::early_kmem_cache_node_alloc():
 ...
 page = new_slab(kmem_cache_node, GFP_NOWAIT, node);
 ...

So the page allocation from the node failed. We have a strange boot
condition where the OS is aware of a node but allocations on that node
fail.

 	if (page_to_nid(page) != node) {
 		printk(KERN_ERR "SLUB: Unable to allocate memory from node %d\n", node);
 		printk(KERN_ERR "SLUB: Allocating a useless per node structure "
 			"in order to be able to continue\n");
 	}
 ...

 Since this is quite early, and we have not set up the nodemasks yet,
 does it make sense to perhaps have a temporary init-time nodemask that
 we set bits in here, and fix-up those nodes when we setup the
 nodemasks?

Please take care of this earlier than this. The page allocator in general
should allow allocations from all nodes with memory during boot.





Re: Bug in reclaim logic with exhausted nodes?

2014-03-25 Thread Christoph Lameter
On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:

 Anyone have any ideas here?

Don't do that? Check on boot to not allow exhausting a node with huge
pages?


Re: Bug in reclaim logic with exhausted nodes?

2014-03-25 Thread Christoph Lameter
On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:

 On 25.03.2014 [11:17:57 -0500], Christoph Lameter wrote:
  On Mon, 24 Mar 2014, Nishanth Aravamudan wrote:
 
   Anyone have any ideas here?
 
  Dont do that? Check on boot to not allow exhausting a node with huge
  pages?

 Gigantic hugepages are allocated by the hypervisor (not the Linux VM),

Ok so the kernel starts booting up and then suddenly the hypervisor takes
the two 16GB pages before even the slab allocator is working?

Not sure if I understand that correctly.


Re: Bug in reclaim logic with exhausted nodes?

2014-03-25 Thread Christoph Lameter
On Tue, 25 Mar 2014, Nishanth Aravamudan wrote:

 On power, very early, we find the 16G pages (gpages in the powerpc arch
 code) in the device-tree:

 early_setup ->
 	early_init_mmu ->
 		htab_initialize ->
 			htab_init_page_sizes ->
 				htab_dt_scan_hugepage_blocks ->
 					memblock_reserve
 						which marks the memory as reserved
 					add_gpage
 						which saves the address off for future
 						calls to alloc_bootmem_huge_page()

 hugetlb_init ->
 	hugetlb_init_hstates ->
 		hugetlb_hstate_alloc_pages ->
 			alloc_bootmem_huge_page

  Not sure if I understand that correctly.

 Basically this is present memory that is reserved for the 16GB usage
 per the LPAR configuration. We honor that configuration in Linux based
 upon the contents of the device-tree. It just so happens in the
 configuration from my original e-mail that a consequence of this is that
 a NUMA node has memory (topologically), but none of that memory is free,
 nor will it ever be free.

Well, don't do that.

 Perhaps, in this case, we could just remove that node from the N_MEMORY
 mask? Memory allocations will never succeed from the node, and we can
 never free these 16GB pages. It is really not any different than a
 memoryless node *except* when you are using the 16GB pages.

That looks to be the correct way to handle things. Maybe mark the node as
offline or somehow not present so that the kernel ignores it.


Re: Node 0 not necessary for powerpc?

2014-03-12 Thread Christoph Lameter
On Tue, 11 Mar 2014, Nishanth Aravamudan wrote:
 I have a P7 system that has no node0, but a node0 shows up in numactl
 --hardware, which has no cpus and no memory (and no PCI devices):

Well, as you see from the code, there has so far been the assumption that
node 0 has memory. I have never run a machine that has no node 0 memory.


Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup

2014-02-24 Thread Christoph Lameter
On Fri, 21 Feb 2014, Nishanth Aravamudan wrote:

 I added two calls to local_memory_node(), I *think* both are necessary,
 but am willing to be corrected.

 One is in map_cpu_to_node() and one is in start_secondary(). The
 start_secondary() path is fine, AFAICT, as we are up  running at that
 point. But in [the renamed function] update_numa_cpu_node() which is
 used by hotplug, we get called from do_init_bootmem(), which is before
 the zonelists are setup.

 I think both calls are necessary because I believe the
 arch_update_cpu_topology() is used for supporting firmware-driven
 home-noding, which does not invoke start_secondary() again (the
 processor is already running, we're just updating the topology in that
 situation).

 Then again, I could special-case the do_init_bootmem callpath, which is
 only called at kernel init time?

Well, that looks to be simpler.

  I do agree that calling local_memory_node() too early then trying to
  fudge around the consequences seems rather wrong.

 If the answer is to simply not call local_memory_node() early, I'll
 submit a patch to at least add a comment, as there's nothing in the code
 itself to prevent this from happening and is guaranteed to oops.

Ok.


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-24 Thread Christoph Lameter
On Mon, 24 Feb 2014, Joonsoo Kim wrote:

  It will not commonly get there because of the tracking. Instead a per cpu
  object will be used.
   get_partial_node() always fails even if there are some partial slab on
   memoryless node's nearest node.
 
  Correct and that leads to a page allocator action whereupon the node will
  be marked as empty.

 Why do we need to request from the page allocator if there is a partial slab?
 Checking whether a node is memoryless or not is really easy, so we don't need
 to skip this. To skip this is a suboptimal solution.

The page allocator action is also used to determine to which other node we
should fall back if the node is empty. So we need to call the page
allocator, when the per cpu slab is exhausted, with the node of the
memoryless node to get memory from the proper fallback node.

Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-20 Thread Christoph Lameter
On Wed, 19 Feb 2014, David Rientjes wrote:

 On Tue, 18 Feb 2014, Christoph Lameter wrote:

  Its an optimization to avoid calling the page allocator to figure out if
  there is memory available on a particular node.
 Thus this patch breaks with memory hot-add for a memoryless node.

As soon as the per cpu slab is exhausted the node number of the so far
empty node will be used for allocation. That will be successful and the
node will no longer be marked as empty.



Re: [PATCH 1/3] mm: return NUMA_NO_NODE in local_memory_node if zonelists are not setup

2014-02-20 Thread Christoph Lameter
On Wed, 19 Feb 2014, Nishanth Aravamudan wrote:

 We can call local_memory_node() before the zonelists are setup. In that
 case, first_zones_zonelist() will not set zone and the reference to
 zone-node will Oops. Catch this case, and, since we presumably running
 very early, just return that any node will do.

Really? Isn't there some way to avoid this call if zonelists are not setup
yet?


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-19 Thread Christoph Lameter
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

 the performance impact of the underlying NUMA configuration. I guess we
 could special-case memoryless/cpuless configurations somewhat, but I
 don't think there's any reason to do that if we can make memoryless-node
 support work in-kernel?

Well we can make it work in-kernel but it always has been a bit wacky (as
is the idea of numa memory nodes without memory).


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter
On Mon, 17 Feb 2014, Joonsoo Kim wrote:

 On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
  Here is another patch with some fixes. The additional logic is only
  compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
 
  Subject: slub: Memoryless node support
 
  Support memoryless nodes by tracking which allocations are failing.

 I still don't understand why this tracking is needed.

It's an optimization to avoid calling the page allocator to figure out if
there is memory available on a particular node.

 All we need for allocation targeted to a memoryless node is to fall back to
 the proper node, that is, the numa_mem_id() node of the targeted node. My
 previous patch implements it and uses the proper fallback node on every
 allocation code path. Why is this tracking needed? Please elaborate more on this.

It's too slow to do that on every alloc. One needs to be able to satisfy
most allocations without switching percpu slabs for optimal performance.

  Allocations targeted to the nodes without memory fall back to the
  current available per cpu objects and if that is not available will
  create a new slab using the page allocator to fallback from the
  memoryless node to some other node.

And what about the next alloc? Assume there are N allocs from a memoryless
node: this means we push back the partial slab on each alloc and then fall
back?

   {
  void *object;
  -   int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
  +   int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
  object = get_partial_node(s, get_node(s, searchnode), c, flags);
  if (object || node != NUMA_NO_NODE)

 This isn't enough.
 Consider an allocation targeted to a memoryless node.

It will not commonly get there because of the tracking. Instead a per cpu
object will be used.

 get_partial_node() always fails even if there are some partial slab on
 memoryless node's nearest node.

Correct and that leads to a page allocator action whereupon the node will
be marked as empty.

 We should fallback to some proper node in this case, since there is no slab
 on memoryless node.

NUMA is about optimization of memory allocations. It is often *not* about
correctness but heuristics are used in many cases. F.e. see the zone
reclaim logic, zone reclaim mode, fallback scenarios in the page allocator
etc etc.



Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter
On Mon, 17 Feb 2014, Joonsoo Kim wrote:

 On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
  Hi Joonsoo,
  Also, given that only ia64 and (hopefully soon) ppc64 can set
  CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
  memoryless nodes present? Even with fakenuma? Just curious.

x86_64 currently does not support memoryless nodes; otherwise it would
have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
memory) with processors. MEMORYLESS nodes mean that we have a fake NUMA
node without memory but just processors. Not very efficient. Not sure why
people use these configurations.

 I don't know, because I'm not expert on NUMA system :)
 At first glance, fakenuma can't be used for testing
 CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Well yeah. You'd have to do some mods to enable that testing.



Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:


 Well, on powerpc, with the hypervisor providing the resources and the
 topology, you can have cpuless and memoryless nodes. I'm not sure how
 fake the NUMA is -- as I think since the resources are virtualized to
 be one system, it's logically possible that the actual topology of the
 resources can be CPUs from physical node 0 and memory from physical node
 2. I would think with KVM on a sufficiently large (physically NUMA
 x86_64) and loaded system, one could cause the same sort of
 configuration to occur for a guest?

Ok but since you have a virtualized environment: Why not provide a fake
home node with fake memory that could be anywhere? This would avoid the
whole problem of supporting such a config at the kernel level.

Do not have a fake node that has no memory.

 In any case, these configurations happen fairly often on long-running
 (not rebooted) systems as LPARs are created/destroyed, resources are
 DLPAR'd in and out of LPARs, etc.

Ok then also move the memory of the local node somewhere?

 I might look into it, as it might have sped up testing these changes.

I guess that will be necessary in order to support the memoryless nodes
long term.



Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-18 Thread Christoph Lameter
On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

 We use the topology provided by the hypervisor, it does actually reflect
 where CPUs and memory are, and their corresponding performance/NUMA
 characteristics.

And so there are actually nodes without memory that have processors?
Can the hypervisor or the Linux arch code be convinced to ignore nodes
without memory or assign a sane default node to processors?

  Ok then also move the memory of the local node somewhere?

 This happens below the OS, we don't control the hypervisor's decisions.
 I'm not sure if that's what you are suggesting.

You could also do this from the powerpc arch code by sanitizing the
processor / node information that is then used by Linux.



Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-12 Thread Christoph Lameter
Here is another patch with some fixes. The additional logic is only
compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.

Subject: slub: Memoryless node support

Support memoryless nodes by tracking which allocations are failing.
Allocations targeted to the nodes without memory fall back to the
current available per cpu objects and if that is not available will
create a new slab using the page allocator to fallback from the
memoryless node to some other node.

Signed-off-by: Christoph Lameter c...@linux.com

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c2014-02-12 16:07:48.957869570 -0600
+++ linux/mm/slub.c 2014-02-12 16:09:22.198928260 -0600
@@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+static nodemask_t empty_nodes;
+#endif
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
void *last;
void *p;
int order;
+   int alloc_node;

	BUG_ON(flags & GFP_SLAB_BUG_MASK);

	page = allocate_slab(s,
		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+		if (node != NUMA_NO_NODE)
+			node_set(node, empty_nodes);
+#endif
 		goto out;
+	}

	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+	node_clear(alloc_node, empty_nodes);
+	if (node != NUMA_NO_NODE && alloc_node != node)
+		node_set(node, empty_nodes);
+#endif
+	inc_slabs_node(s, alloc_node, page->objects);
	memcg_bind_pages(s, order);
	page->slab_cache = s;
__SetPageSlab(page);
@@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
struct kmem_cache_cpu *c)
 {
void *object;
-   int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+   int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
@@ -2117,8 +2133,19 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node = page_to_nid(page);
+
+	if (!page)
 		return 0;
+
+	if (node != NUMA_NO_NODE) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+		if (node_isset(node, empty_nodes))
+			return 1;
+#endif
+		if (page_node != node)
+			return 0;
+	}
 #endif
return 1;
 }


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-11 Thread Christoph Lameter
On Mon, 10 Feb 2014, Joonsoo Kim wrote:

 On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
  Here is a draft of a patch to make this work with memoryless nodes.
 
  The first thing is that we modify node_match to also match if we hit an
  empty node. In that case we simply take the current slab if its there.

 Why not inspecting whether we can get the page on the best node such as
 numa_mem_id() node?

It's expensive to do so.

 empty_node cannot be set on memoryless node, since page allocation would
 succeed on different node.

Ok then we need to add a check for being on the right node there too.



Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node

2014-02-07 Thread Christoph Lameter
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

  This check would need to be something that checks for other contingencies
  in the page allocator as well. A simple solution would be to actually run
  a GFP_THISNODE alloc to see if you can grab a page from the proper node.
  If that fails then fall back. See how fallback_alloc() does it in slab.
 

 Hello, Christoph.

 This !node_present_pages() ensure that allocation on this node cannot succeed.
 So we can directly use numa_mem_id() here.

Yes of course we can use numa_mem_id().

But the check is only for not having any memory at all on a node. There
are other reasons for allocations to fail on a certain node. The node could
have memory that cannot be reclaimed, all dirty, beyond certain
thresholds, not in the current set of allowed nodes etc etc.
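A sketch of the probe described above (assuming GFP_NOWAIT is acceptable
at that point):

	/* Ask the page allocator itself; this covers all of the above
	 * contingencies, not just node_present_pages() == 0. */
	struct page *probe = alloc_pages_node(node,
				GFP_NOWAIT | __GFP_THISNODE, 0);
	if (!probe)
		node = numa_mem_id();	/* fall back, as fallback_alloc()
					 * does in slab */
	else
		__free_pages(probe, 0);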


Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-07 Thread Christoph Lameter
Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an
empty node. In that case we simply take the current slab if its there.

If there is no current slab then a regular allocation occurs with the
memoryless node. The page allocator will fallback to a possible node and
that will become the current slab. Next alloc from a memoryless node
will then use that slab.

For that we also add some tracking of allocations on nodes that were not
satisfied using the empty_node[] array. A successful alloc on a node
clears that flag.

I would rather avoid the empty_node[] array since it is global and there may
be thread specific allocation restrictions, but it would be expensive to do
an allocation attempt via the page allocator to make sure that there is
really no page available from the page allocator.

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c 2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
void *last;
void *p;
int order;
+   int alloc_node;

	BUG_ON(flags & GFP_SLAB_BUG_MASK);

	page = allocate_slab(s,
		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
	memcg_bind_pages(s, order);
	page->slab_cache = s;
__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
struct kmem_cache_cpu *c)
 {
void *object;
-   int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+   int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+   int page_node;
+
+   /* No data means no match */
+   if (!page)
return 0;
+
+   /* Node does not matter. Therefore anything is a match */
+   if (node == NUMA_NO_NODE)
+   return 1;
+
+   /* Did we hit the requested node ? */
+   page_node = page_to_nid(page);
+   if (page_node == node)
+   return 1;
+
+   /* If the node has available data then we can use it. Mismatch */
+   return !empty_node[page_node];
+
+   /* Target node empty so just take anything */
 #endif
return 1;
 }



Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node

2014-02-07 Thread Christoph Lameter
On Fri, 7 Feb 2014, Joonsoo Kim wrote:

 
  It seems like a better approach would be to do this when a node is brought
  online and determine the fallback node based not on the zonelists as you
  do here but rather on locality (such as through a SLIT if provided, see
  node_distance()).

 Hmm...
 I guess that zonelist is based on locality. Zonelist is generated using
 node_distance(), so I think that it reflects locality. But, I'm not expert
 on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one node to another.
The building of the fallback list for every node in build_zonelists()
relies on that.
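
A minimal sketch of picking the closest node with memory by distance (an
illustration of the mechanism, not code from this thread):

#include <linux/kernel.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/* Find the node with memory that is closest to "node" per node_distance(). */
static int nearest_node_with_memory(int node)
{
	int n, best = NUMA_NO_NODE, best_dist = INT_MAX;

	for_each_node_state(n, N_MEMORY) {
		int dist = node_distance(node, n);

		if (dist < best_dist) {
			best_dist = dist;
			best = n;
		}
	}
	return best;
}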



Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()

2014-02-06 Thread Christoph Lameter
On Thu, 6 Feb 2014, Joonsoo Kim wrote:

 Currently, if the allocation constraint to node is NUMA_NO_NODE, we search
 a partial slab on the numa_node_id() node. This doesn't work properly on a
 system having memoryless nodes, since it can have no memory on that node and
 there must be no partial slab on that node.

 On that node, page allocation always falls back to numa_mem_id() first. So
 searching a partial slab on numa_mem_id() in that case is the proper solution
 for the memoryless node case.

Acked-by: Christoph Lameter c...@linux.com


Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-02-06 Thread Christoph Lameter
On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:

  Right so if we are ignoring the node then the simplest thing to do is to
  not deactivate the current cpu slab but to take an object from it.

 Ok, that's what Anton's patch does, I believe. Are you ok with that
 patch as it is?

No. Again, his patch only works if the node is memoryless, not if there are
other issues that prevent allocation from that node.



Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()

2014-02-06 Thread Christoph Lameter
On Thu, 6 Feb 2014, David Rientjes wrote:

 I think you'll need to send these to Andrew since he appears to be picking
 up slub patches these days.

I can start managing merges again if Pekka no longer has the time.


Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node

2014-02-06 Thread Christoph Lameter
On Thu, 6 Feb 2014, Joonsoo Kim wrote:

 diff --git a/mm/slub.c b/mm/slub.c
 index cc1f995..c851f82 100644
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
 flags, int node,
   void *object;
   int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 + if (node == NUMA_NO_NODE)
 + searchnode = numa_mem_id();
 + else {
 + searchnode = node;
 + if (!node_present_pages(node))

This check would need to be something that checks for other contingencies
in the page allocator as well. A simple solution would be to actually run
a GFP_THIS_NODE alloc to see if you can grab a page from the proper node.
If that fails then fallback. See how fallback_alloc() does it in slab.

 + searchnode = get_numa_mem(node);
 + }

 @@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t 
 gfpflags, int node,
  redo:

   if (unlikely(!node_match(page, node))) {
 - stat(s, ALLOC_NODE_MISMATCH);
 - deactivate_slab(s, page, c->freelist);
 - c->page = NULL;
 - c->freelist = NULL;
 - goto new_slab;
 + int searchnode = node;
 +
 + if (node != NUMA_NO_NODE && !node_present_pages(node))

Same issue here. I would suggest not deactivating the slab and first checking
if the node has no pages. If so then just take an object from the current
cpu slab. If that is not available do an allocation from the indicated
node and take whatever the page allocator gave you.



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-02-05 Thread Christoph Lameter
On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

  If the target node allocation fails (for whatever reason) then I would
  recommend for simplicities sake to change the target node to
  NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
  complex solution would be to look through partial lists in increasing
  distance to find a partially used slab that is reasonable close to the
  current node. Slab has logic like that in fallback_alloc(). Slubs
  get_any_partial() function does something close to what you want.

 I apologize for my own ignorance, but I'm having trouble following.
 Anton's original patch did fallback to the current cpu slab, but I'm not
 sure any NUMA_NO_NODE change is necessary there. At the point we're
 deactivating the slab (in the current code, in __slab_alloc()), we have
 successfully allocated from somewhere, it's just not on the node we
 expected to be on.

Right so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

 So perhaps you are saying to make a change lower in the code? I'm not
 sure where it makes sense to change the target node in that case. I'd
 appreciate any guidance you can give.

This is not an easy thing to do. If the current slab is not the right node
but would be the node from which the page allocator would be returning
memory then the current slab can still be allocated from. If the fallback
is to another node then the current cpu slab needs to be deactivated and
the allocation from that node needs to proceed. Have a look at
fallback_alloc() in the slab allocator.

An allocation attempt from the page allocator can be restricted to a
specific node through GFP_THIS_NODE.
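
For example (a sketch; __GFP_THISNODE is the flag's in-tree spelling):

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, order);
	/* a NULL page means node "nid" itself could not satisfy the request;
	   no fallback to other nodes was attempted */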


Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-02-04 Thread Christoph Lameter
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

 Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
 the $SUBJECT issue.

Hmmm... I am not sure that this is a general solution. The fallback to
other nodes can not only occur because a node has no memory as his patch
assumes.

If the target node allocation fails (for whatever reason) then I would
recommend for simplicities sake to change the target node to NUMA_NO_NODE
and just take whatever is in the current cpu slab. A more complex solution
would be to look through partial lists in increasing distance to find a
partially used slab that is reasonable close to the current node. Slab has
logic like that in fallback_alloc(). Slubs get_any_partial() function does
something close to what you want.




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-02-03 Thread Christoph Lameter
On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

 So what's the status of this patch? Christoph, do you think this is fine
 as it is?

Certainly enabling CONFIG_MEMORYLESS_NODES is the right thing to do and I
already acked the patch.



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-30 Thread Christoph Lameter
On Wed, 29 Jan 2014, Nishanth Aravamudan wrote:

 exactly what the caller intends.

 int searchnode = node;
 if (node == NUMA_NO_NODE)
   searchnode = numa_mem_id();
 if (!node_present_pages(node))
   searchnode = local_memory_node(node);

 The difference in semantics from the previous is that here, if we have a
 memoryless node, rather than using the CPU's nearest NUMA node, we use
 the NUMA node closest to the requested one?

The idea here is that the page allocator will do the fallback to other
nodes. This check for !node_present should not be necessary. SLUB needs to
accept the page from whatever node the page allocator returned and work
with that.

The problem is that the check for having a slab from the right node may fail
again after another attempt to allocate from the same node. SLUB will then
push the slab from the *wrong* node back to the partial lists and may
attempt another allocation that will again be successful but return memory
from another node. That way the partial lists from a particular node are
growing uselessly.

One way to solve this may be to check if memory is actually allocated
from the requested node and fall back to NUMA_NO_NODE (which will use the
last allocated slab) for future allocs if the page allocator returned
memory from a different node (unless GFP_THIS_NODE is set of course).
Otherwise we end up replicating the page allocator logic in slub like in
slab. That is what I wanted to avoid.
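
Roughly, the idea sketched above (an illustration only, not an actual
patch; allocate_slab() is slub's internal page-level allocation helper):

	page = allocate_slab(s, flags, node);
	if (page && node != NUMA_NO_NODE && page_to_nid(page) != node &&
	    !(flags & __GFP_THISNODE))
		/* the page allocator fell back: stop insisting on "node" so
		   the cpu slab gets reused instead of growing remote partial
		   lists */
		node = NUMA_NO_NODE;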



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-29 Thread Christoph Lameter
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

 This helps about the same as David's patch -- but I found the reason
 why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
 shortly for that and one other case I found.

Oww...



Re: [PATCH] powerpc: enable CONFIG_HAVE_MEMORYLESS_NODES

2014-01-29 Thread Christoph Lameter
On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

 Anton Blanchard found an issue with an LPAR that had no memory in Node
 0. Christoph Lameter recommended, as one possible solution, to use
 numa_mem_id() for locality of the nearest memory node-wise. However,
 numa_mem_id() [and the other related APIs] are only useful if
 CONFIG_HAVE_MEMORYLESS_NODES is set. This is only the case for ia64
 currently, but clearly we can have memoryless nodes on ppc64. Add the
 Kconfig option and define it to be the same value as CONFIG_NUMA.

Well this is trivial but if you need encouragement:

Reviewed-by: Christoph Lameter c...@linux.com


Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-27 Thread Christoph Lameter
On Fri, 24 Jan 2014, David Rientjes wrote:

 kmalloc_node(nid) and kmem_cache_alloc_node(nid) should fallback to nodes
 other than nid when memory can't be allocated, these functions only
 indicate a preference.

The nid passed indicates a preference unless __GFP_THIS_NODE is specified.
Then the allocation must occur on that node.




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-27 Thread Christoph Lameter
On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

 As to cpu_to_node() being passed to kmalloc_node(), I think an
 appropriate fix is to change that to cpu_to_mem()?

Yup.

  Yeah, the default policy should be to fallback to local memory if the node
  passed is memoryless.

 Thanks!

I would suggest using NUMA_NO_NODE instead. That will fit any slab that
we may be currently allocating from or can get a hold of and is mostly
efficient.
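
That is, such callers would simply do (illustration):

	obj = kmem_cache_alloc_node(cachep, GFP_KERNEL, NUMA_NO_NODE);

which behaves like a plain kmem_cache_alloc() and lets the allocator hand
out whatever slab it already has at hand.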


Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-24 Thread Christoph Lameter
On Fri, 24 Jan 2014, Wanpeng Li wrote:

 
 diff --git a/mm/slub.c b/mm/slub.c
 index 545a170..a1c6040 100644
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
 flags, int node,
  void *object;
  int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

This needs to be numa_mem_id() and numa_mem_id would need to be
consistently used.

 
 +if (!node_present_pages(searchnode))
 +searchnode = numa_mem_id();

Probably won't need that?

 +
  object = get_partial_node(s, get_node(s, searchnode), c, flags);
  if (object || node != NUMA_NO_NODE)
  return object;
 

 The bug still can't be fixed w/ this patch.

Some more detail would be good. If memory is requested from a particular
node then it would be best to use one that has memory. Callers also may
have used numa_node_id() and that also would need to be fixed.




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-20 Thread Christoph Lameter
On Mon, 20 Jan 2014, Wanpeng Li wrote:

 +   enum zone_type high_zoneidx = gfp_zone(flags);
 
 +   if (!node_present_pages(searchnode)) {
 +   zonelist = node_zonelist(searchnode, flags);
 +   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 +   searchnode = zone_to_nid(zone);
 +   if (node_present_pages(searchnode))
 +   break;
 +   }
 +   }
 object = get_partial_node(s, get_node(s, searchnode), c, flags);
 if (object || node != NUMA_NO_NODE)
 return object;
 

 The patch fixes the bug. However, the kernel crashed very quickly after running
 stress tests for a short while:

This is not a good way of fixing it. How about not asking for memory from
nodes that are memoryless? Use numa_mem_id() which gives you the next node
that has memory instead of numa_node_id() (which gives you the current node
regardless of whether it has memory or not).
[  287.464285] Unable to handle kernel paging request for data at address 
0x0001
[  287.464289] Faulting instruction address: 0xc0445af8
[  287.464294] Oops: Kernel access of bad area, sig: 11 [#1]
[  287.464296] SMP NR_CPUS=2048 NUMA pSeries
[  287.464301] Modules linked in: btrfs raid6_pq xor dm_service_time sg nfsv3 
arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs fscache dns_resolver 
nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT 
ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc 
ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 
nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter 
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter 
ip_tables ext4 mbcache jbd2 ibmvfc scsi_transport_fc ibmveth nx_crypto 
pseries_rng nfsd auth_rpcgss nfs_acl lockd binfmt_misc sunrpc uinput 
dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ibmvscsi 
scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[  287.464374] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
3.10.0-71.el7.91831.ppc64 #1
[  287.464378] task: c0fde590 ti: c001fffd task.ti: 
c10a4000
[  287.464382] NIP: c0445af8 LR: c0445bcc CTR: c0445b90
[  287.464385] REGS: c001fffd38e0 TRAP: 0300   Not tainted  
(3.10.0-71.el7.91831.ppc64)
[  287.464388] MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 88002084  XER: 
0001
[  287.464397] SOFTE: 0
[  287.464398] CFAR: c000908c
[  287.464401] DAR: 0001, DSISR: 4000
[  287.464403]
GPR00: d3649a04 c001fffd3b60 c10a94d0 0003
GPR04: c0018d841048 c001fffd3bd0 0012 d364eff0
GPR08: c001fffd3bd0 0001 d364d688 c0445b90
GPR12: d364b960 c7e0 042ac510 0060
GPR16: 0020 fb19 c1122100 
GPR20: c0a94680 c1122180 c0a94680 000a
GPR24: 0100  0001 c001ef90
GPR28: c001d6c066f0 c001aea03520 c001bc9a2640 c0018d841680
[  287.464447] NIP [c0445af8] .__dev_printk+0x28/0xc0
[  287.464450] LR [c0445bcc] .dev_printk+0x3c/0x50
[  287.464453] PACATMSCRATCH [80009032]
[  287.464455] Call Trace:
[  287.464458] [c001fffd3b60] [c001fffd3c00] 0xc001fffd3c00 
(unreliable)
[  287.464467] [c001fffd3bf0] [d3649a04] 
.ibmvfc_scsi_done+0x334/0x3e0 [ibmvfc]
[  287.464474] [c001fffd3cb0] [d36495b8] 
.ibmvfc_handle_crq+0x2e8/0x320 [ibmvfc]
[  287.464488] [c001fffd3d30] [d3649fe4] .ibmvfc_tasklet+0xd4/0x250 
[ibmvfc]
[  287.464494] [c001fffd3de0] [c009b46c] .tasklet_action+0xcc/0x1b0
[  287.464498] [c001fffd3e90] [c009a668] .__do_softirq+0x148/0x360
[  287.464503] [c001fffd3f90] [c00218a8] .call_do_softirq+0x14/0x24
[  287.464507] [c001fffcfdf0] [c00107e0] .do_softirq+0xd0/0x100
[  287.464511] [c001fffcfe80] [c009aba8] .irq_exit+0x1b8/0x1d0
[  287.464514] [c001fffcff10] [c0010410] .__do_irq+0xc0/0x1e0
[  287.464518] [c001fffcff90] [c00218cc] .call_do_irq+0x14/0x24
[  287.464522] [c10a76d0] [c00105bc] .do_IRQ+0x8c/0x100
[  287.464527] --- Exception: 501 at 0x
[  287.464527] LR = .arch_local_irq_restore+0x74/0x90
[  287.464533] [c10a7770] [c0002494] 
hardware_interrupt_common+0x114/0x180 (unreliable)
[  287.464540] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[  287.464540] LR = .check_and_cede_processor+0x24/0x40
[  287.464546] [c10a7a60] [0001] 0x1 (unreliable)
[  287.464550] [c10a7ad0] [c0074ecc] .shared_cede_loop+0x2c/0x70
[  287.464555] [c10a7b50] [c05538f4] 

Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver

2013-08-06 Thread Christoph Lameter
On Tue, 6 Aug 2013, Wladislav Wiebe wrote:

 ok, just saw in the slab/for-linus branch that that stuff is reverted again..

No, that was only for the 3.11 merge by Linus. The 3.12 patches have not
been put into Pekka's tree.


Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver

2013-07-31 Thread Christoph Lameter
On Wed, 31 Jul 2013, Wladislav Wiebe wrote:

 on a PPC 32-bit board with a Linux kernel v3.10.0 I see trouble with
 kmalloc_slab.
 Basically at system startup, something requests a size of 8388608 bytes,
 but KMALLOC_MAX_SIZE is 4194304 bytes in our case. It triggers a WARNING at:

 ..
 NIP [c0099fec] kmalloc_slab+0x60/0xe8
 LR [c0099fd4] kmalloc_slab+0x48/0xe8
 Call Trace:
 [ccd3be60] [c0099fd4] kmalloc_slab+0x48/0xe8 (unreliable)
 [ccd3be70] [c00ae650] __kmalloc+0x20/0x1b4
 [ccd3be90] [c00d46f4] seq_read+0x2a4/0x540
 [ccd3bee0] [c00fe09c] proc_reg_read+0x5c/0x90
 [ccd3bef0] [c00b4e1c] vfs_read+0xa4/0x150
 [ccd3bf10] [c00b500c] SyS_read+0x4c/0x84
 [ccd3bf40] [c000be80] ret_from_syscall+0x0/0x3c
 ..

 Do you have any idea how I can analyze where these 8388608 bytes are coming from?

It comes from the kmalloc in seq_read(). And an 8M read from the proc
filesystem? Wow. Maybe switch the kmalloc to vmalloc()?
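
A sketch of such a switch (the PAGE_SIZE threshold is an assumption; later
kernels grew kvmalloc()/kvfree() for exactly this pattern):

	if (size <= PAGE_SIZE)
		buf = kmalloc(size, GFP_KERNEL);
	else
		buf = vmalloc(size);	/* no high-order contiguous alloc */

The free side then has to check is_vmalloc_addr(buf) and call vfree() or
kfree() accordingly.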


Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver

2013-07-31 Thread Christoph Lameter
This patch will suppress the warnings by using the page allocator wrappers
of the slab allocators. These are page sized allocs after all.


Subject: seq_file: Use kmalloc_large for page sized allocation

There is no point in using the slab allocation functions for large page
order allocation. Use the kmalloc_large() wrappers which will cause calls
to the page allocator instead.

This fixes the warning about large allocs but it will still cause
high order allocs to occur that could fail because of memory
fragmentation. Maybe switch to vmalloc if we really want to allocate multi
megabyte buffers for proc fs?

Signed-off-by: Christoph Lameter c...@linux.com

Index: linux/fs/seq_file.c
===
--- linux.orig/fs/seq_file.c	2013-07-10 14:03:15.367134544 -0500
+++ linux/fs/seq_file.c	2013-07-31 10:11:42.671736131 -0500
@@ -96,7 +96,7 @@ static int traverse(struct seq_file *m,
return 0;
}
if (!m->buf) {
-   m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+   m->buf = kmalloc_large(m->size = PAGE_SIZE, GFP_KERNEL);
if (!m->buf)
return -ENOMEM;
}
@@ -136,7 +136,7 @@ static int traverse(struct seq_file *m,
 Eoverflow:
m->op->stop(m, p);
kfree(m->buf);
-   m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+   m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
return !m->buf ? -ENOMEM : -EAGAIN;
 }

@@ -191,7 +191,7 @@ ssize_t seq_read(struct file *file, char

/* grab buffer if we didn't have one */
if (!m->buf) {
-   m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
+   m->buf = kmalloc_large(m->size = PAGE_SIZE, GFP_KERNEL);
if (!m->buf)
goto Enomem;
}
@@ -232,7 +232,7 @@ ssize_t seq_read(struct file *file, char
goto Fill;
m->op->stop(m, p);
kfree(m->buf);
-   m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+   m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
if (!m->buf)
goto Enomem;
m->count = 0;


Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver

2013-07-31 Thread Christoph Lameter
Crap you cannot do PAGE_SIZE allocations with kmalloc_large. Fails when
freeing pages. Need to only do the multiple page allocs with
kmalloc_large.

Subject: seq_file: Use kmalloc_large for page sized allocation

There is no point in using the slab allocation functions for
large page order allocation. Use kmalloc_large().

This fixes the warning about large allocs but it will still cause
large contiguous allocs that could fail because of memory fragmentation.

Signed-off-by: Christoph Lameter c...@linux.com

Index: linux/fs/seq_file.c
===
--- linux.orig/fs/seq_file.c	2013-07-31 10:39:03.050472030 -0500
+++ linux/fs/seq_file.c	2013-07-31 10:39:03.050472030 -0500
@@ -136,7 +136,7 @@ static int traverse(struct seq_file *m,
 Eoverflow:
m->op->stop(m, p);
kfree(m->buf);
-   m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+   m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
return !m->buf ? -ENOMEM : -EAGAIN;
 }

@@ -232,7 +232,7 @@ ssize_t seq_read(struct file *file, char
goto Fill;
m->op->stop(m, p);
kfree(m->buf);
-   m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
+   m->buf = kmalloc_large(m->size <<= 1, GFP_KERNEL);
if (!m->buf)
goto Enomem;
m->count = 0;


Re: mm/slab: ppc: ubi: kmalloc_slab WARNING / PPC + UBI driver

2013-07-31 Thread Christoph Lameter
On Wed, 31 Jul 2013, Wladislav Wiebe wrote:

 Thanks for the pointer, do you plan to make kmalloc_large available for
 external access in a separate mainline patch?
 Since kmalloc_large is statically defined in slub_def.h, when including it
 in seq_file.c we have a lot of conflicting types:

You cannot separately include slub_def.h. slab.h includes slub_def.h for
you. What problem did you try to fix by doing so?

There is a patch pending that moves kmalloc_large to slab.h. So maybe we
have to wait a merge period in order to be able to use it with other
allocators than slub.




Re: [PATCH v5 04/14] memory-hotplug: remove /sys/firmware/memmap/X sysfs

2013-01-02 Thread Christoph Lameter
On Thu, 27 Dec 2012, Tang Chen wrote:

 On 12/26/2012 11:30 AM, Kamezawa Hiroyuki wrote:
  @@ -41,6 +42,7 @@ struct firmware_map_entry {
 const char  *type;  /* type of the memory range */
 struct list_headlist;   /* entry for the linked list */
 struct kobject  kobj;   /* kobject for each entry */
  +  unsigned int bootmem:1; /* allocated from bootmem */
 };
 
  Can't we detect from which the object is allocated from, slab or bootmem ?
 
  Hm, for example,
 
   PageReserved(virt_to_page(address_of_obj)) ?
   PageSlab(virt_to_page(address_of_obj)) ?
 

 Hi Kamezawa-san,

 I think we can detect it without a new member. I think bootmem:1 member
 is just for convenience. I think I can remove it. :)

Larger size slab allocations may fall back to the page allocator but then
the slabs do not track this allocation. That memory can be freed using the
page allocator.

If you see PageSlab then you can always free it using the slab allocator.
Otherwise the page allocator should work (unless it was some
special case bootmem allocation).
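
In code, that detection would look something like this (a sketch; the
special case bootmem allocations are left out):

	struct page *page = virt_to_page(addr);

	if (PageSlab(page))
		kfree(addr);		/* object came from the slab allocator */
	else if (!PageReserved(page))
		free_pages((unsigned long)addr, 0);	/* page allocator page */
	/* PageReserved: bootmem or otherwise special, handle separately */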



Re: [PATCH SLAB 1/2 v3] duplicate the cache name in SLUB's saved_alias list, SLAB, and SLOB

2012-07-09 Thread Christoph Lameter

 I was pointed by Glauber to the slab common code patches. I need some
 more time to read the patches. Now I think the slab/slob changes in this
 v3 are not needed, and can be ignored.

That may take some kernel cycles. You have a current issue here that needs
to be fixed.

  down_write(&slub_lock);
  -   s = find_mergeable(size, align, flags, name, ctor);
  +   s = find_mergeable(size, align, flags, n, ctor);
  if (s) {
  s->refcount++;
  /*

   ..
   up_write(&slub_lock);
   return s;
   }

 Here, the function returns without the name string n being kfreed.

That is intentional since the string n is still referenced by the entry
that sysfs_slab_alias has created.

 But we couldn't kfree n here, because in sysfs_slab_alias(), if
 (slab_state < SYSFS), the name needs to be kept valid until
 slab_sysfs_init() has finished adding the entry into sysfs.

Right that is why it is not freed and that is what fixes the issue you
see.


Re: [RFC PATCH v3 0/13] memory-hotplug : hot-remove physical memory

2012-07-09 Thread Christoph Lameter

On Mon, 9 Jul 2012, Yasuaki Ishimatsu wrote:

 Even if you apply these patches, you cannot remove the physical memory
 completely since these patches are still under development. I want you to
 cooperate to improve the physical memory hot-remove. So please review these
 patches and give your comment/idea.

Could you at least give a method on how you want to do physical memory
removal? You would have to remove all objects from the range you want to
physically remove. That is only possible under special circumstances and
with a limited set of objects. Even if you exclusively use ZONE_MOVABLE
you still may get cases where pages are pinned for a long time.

I am not sure that these patches are useful unless we know where you are
going with this. If we end up with a situation where we still cannot
remove physical memory then this patchset is not helpful.




Re: [PATCH SLAB 1/2 v3] duplicate the cache name in SLUB's saved_alias list, SLAB, and SLOB

2012-07-06 Thread Christoph Lameter
I thought I posted this a couple of days ago. Would this not fix things
without having to change all the allocators?


Subject: slub: Dup name earlier in kmem_cache_create

Dup the name earlier in kmem_cache_create so that alias
processing is done using the copy of the string and not
the string itself.

Signed-off-by: Christoph Lameter c...@linux.com

---
 mm/slub.c |   29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c	2012-06-11 08:49:56.0 -0500
+++ linux-2.6/mm/slub.c	2012-07-03 15:17:37.0 -0500
@@ -3933,8 +3933,12 @@ struct kmem_cache *kmem_cache_create(con
if (WARN_ON(!name))
return NULL;

+   n = kstrdup(name, GFP_KERNEL);
+   if (!n)
+   goto out;
+
down_write(&slub_lock);
-   s = find_mergeable(size, align, flags, name, ctor);
+   s = find_mergeable(size, align, flags, n, ctor);
if (s) {
s->refcount++;
/*
@@ -3944,7 +3948,7 @@ struct kmem_cache *kmem_cache_create(con
s->objsize = max(s->objsize, (int)size);
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));

-   if (sysfs_slab_alias(s, name)) {
+   if (sysfs_slab_alias(s, n)) {
s->refcount--;
goto err;
}
@@ -3952,31 +3956,26 @@ struct kmem_cache *kmem_cache_create(con
return s;
}

-   n = kstrdup(name, GFP_KERNEL);
-   if (!n)
-   goto err;

s = kmalloc(kmem_size, GFP_KERNEL);
if (s) {
if (kmem_cache_open(s, n,
size, align, flags, ctor)) {
list_add(&s->list, &slab_caches);
up_write(&slub_lock);
-   if (sysfs_slab_add(s)) {
-   down_write(&slub_lock);
-   list_del(&s->list);
-   kfree(n);
-   kfree(s);
-   goto err;
-   }
-   return s;
+   if (!sysfs_slab_add(s))
+   return s;
+
+   down_write(&slub_lock);
+   list_del(&s->list);
}
kfree(s);
}
-   kfree(n);
+
 err:
+   kfree(n);
up_write(&slub_lock);

+out:
if (flags & SLAB_PANIC)
panic("Cannot create slabcache %s\n", name);
else


Re: [PATCH powerpc 2/2] kfree the cache name of pgtable cache if SLUB is used

2012-07-03 Thread Christoph Lameter
On Mon, 25 Jun 2012, Li Zhong wrote:

 This patch tries to kfree the cache name of pgtables cache if SLUB is
 used, as SLUB duplicates the cache name, and the original one is leaked.

SLAB also does not free the name. Why would you have an #ifdef in there?


Re: [PATCH powerpc 2/2] kfree the cache name of pgtable cache if SLUB is used

2012-07-03 Thread Christoph Lameter
Looking through the emails it seems that there is an issue with alias
strings. That can be solved by duping the name of the slab earlier in 
kmem_cache_create().
Does this patch fix the issue?

Subject: slub: Dup name earlier in kmem_cache_create

Dup the name earlier in kmem_cache_create so that alias
processing is done using the copy of the string and not
the string itself.

Signed-off-by: Christoph Lameter c...@linux.com

---
 mm/slub.c |   29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c	2012-06-11 08:49:56.0 -0500
+++ linux-2.6/mm/slub.c	2012-07-03 15:17:37.0 -0500
@@ -3933,8 +3933,12 @@ struct kmem_cache *kmem_cache_create(con
if (WARN_ON(!name))
return NULL;

+   n = kstrdup(name, GFP_KERNEL);
+   if (!n)
+   goto out;
+
down_write(&slub_lock);
-   s = find_mergeable(size, align, flags, name, ctor);
+   s = find_mergeable(size, align, flags, n, ctor);
if (s) {
s->refcount++;
/*
@@ -3944,7 +3948,7 @@ struct kmem_cache *kmem_cache_create(con
s->objsize = max(s->objsize, (int)size);
s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));

-   if (sysfs_slab_alias(s, name)) {
+   if (sysfs_slab_alias(s, n)) {
s->refcount--;
goto err;
}
@@ -3952,31 +3956,26 @@ struct kmem_cache *kmem_cache_create(con
return s;
}

-   n = kstrdup(name, GFP_KERNEL);
-   if (!n)
-   goto err;

s = kmalloc(kmem_size, GFP_KERNEL);
if (s) {
if (kmem_cache_open(s, n,
size, align, flags, ctor)) {
list_add(&s->list, &slab_caches);
up_write(&slub_lock);
-   if (sysfs_slab_add(s)) {
-   down_write(&slub_lock);
-   list_del(&s->list);
-   kfree(n);
-   kfree(s);
-   goto err;
-   }
-   return s;
+   if (!sysfs_slab_add(s))
+   return s;
+
+   down_write(&slub_lock);
+   list_del(&s->list);
}
kfree(s);
}
-   kfree(n);
+
 err:
+   kfree(n);
up_write(&slub_lock);

+out:
if (flags & SLAB_PANIC)
panic("Cannot create slabcache %s\n", name);
else


Re: [PATCH] slub: fix kernel BUG at mm/slub.c:1950!

2011-06-13 Thread Christoph Lameter
On Sun, 12 Jun 2011, Hugh Dickins wrote:

 3.0-rc won't boot with SLUB on my PowerPC G5: kernel BUG at mm/slub.c:1950!
 Bisected to 1759415e630e slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery.

 After giving myself a medal for finding the BUG on line 1950 of mm/slub.c
 (it's actually the
   VM_BUG_ON((unsigned long)(&(pcp1)) % (2 * sizeof(pcp1)));
 on line 268 of the morass that is include/linux/percpu.h)
 I tried the following alignment patch and found it to work.

Hmmm.. The allocpercpu in alloc_kmem_cache_cpus should take care of the
alignment. Uhh.. I see that a patch that removes the #ifdef CMPXCHG_LOCAL
was not applied? Pekka?




Re: [PATCH] slub: fix kernel BUG at mm/slub.c:1950!

2011-06-13 Thread Christoph Lameter
On Mon, 13 Jun 2011, Pekka Enberg wrote:

  Hmmm.. The allocpercpu in alloc_kmem_cache_cpus should take care of the
  alignment. Uhh.. I see that a patch that removes the #ifdef CMPXCHG_LOCAL
  was not applied? Pekka?

 This patch?

 http://git.kernel.org/?p=linux/kernel/git/penberg/slab-2.6.git;a=commitdiff;h=d4d84fef6d0366b585b7de13527a0faeca84d9ce

 It's queued and will be sent to Linus soon.

Ok it will also fix Hugh's problem then.



Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Christoph Lameter
On Thu, 23 Sep 2010, Christian Riesch wrote:

   It implies clock tuning in userspace for a potential sub microsecond
   accurate clock. The clock accuracy will be limited by user space
   latencies and noise. You wont be able to discipline the system clock
   accurately.
 
  Noise matters, latency doesn't.

 Well put! That's why we need hardware support for PTP timestamping to reduce
 the noise, but get along well with the clock servo that is steering the PHC in
 user space.

Even if I buy into the catch phrase above: User space is subject to noise
that the in kernel code is not. If you do the tuning over long intervals
then it hopefully averages out but it still causes jitter effects that
affect the degree of accuracy (or sync) that you can reach. And the noise
varies with the load on the system.





Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Christoph Lameter
On Thu, 23 Sep 2010, john stultz wrote:

   3) Further, the PTP hardware counter can be simply set to a new offset
   to put it in line with the network time. This could cause trouble with
   timekeeping much like unsynced TSCs do.
 
  You can do the same for system time.

 Settimeofday does allow CLOCK_REALTIME to jump, but the CLOCK_MONOTONIC
 time cannot jump around. Having a clocksource that is non-monotonic
 would break this.

Currently time runs at the same speed. CLOCK_MONOTONIC runs at an offset
to CLOCK_REALTIME. We are creating APIs here that allow time to run at
different speeds.

 The design actually avoids most userland induced latency.

 1) On the PTP hardware syncing point, the reference packet gets
 timestamped with the PTP hardware time on arrival. This allows the
 offset calculation to be done in userland without introducing latency.

The timestamps allow the calculation of the network transmission time I
guess, and therefore it's more accurate to calculate that effect out. Ok, but
then the overhead of getting to code in user space (that does the proper
clock adjustments) results in the addition of a relatively long time
that is subject to OS scheduling latencies and noise.

 2) On the system syncing side, the proposal for the PPS interrupt allows
 the PTP hardware to trigger an interrupt on the second boundary that
 would take a timestamp of the system time. Then the pps interface allows
 for the timestamp to be read from userland allowing the offset to be
 calculated without introducing additional latency.

Sorry, I don't really get the whole picture here, it seems. Sounds like one is
going through additional unnecessary layers. Why would the PTP hardware
trigger an interrupt? I thought the PTP messages came in via
timestamping and are then processed by software. Then the software is
issuing a hardware interrupt that then triggers the PPS subsystem. And
that is supposed to be better than directly interfacing with the PTP?


 Additionally, even just in userland, it would be easy to bracket two
 reads of the system time around one read of the PTP clock to bound any
 userland latency fairly well. It may not be as good as the PPS interface
 (although that depends on the interrupt latency), but if the accesses
 are all local, it probably could get fairly close.

That sounds hacky.
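
(For reference, the bracketing described in the quoted text amounts to
something like this in user space; phc_clkid is assumed to come from the
proposed clock interface, and <time.h> provides clock_gettime():)

	struct timespec t1, t2, tp;

	clock_gettime(CLOCK_REALTIME, &t1);
	clock_gettime(phc_clkid, &tp);		/* one PHC read ... */
	clock_gettime(CLOCK_REALTIME, &t2);	/* ... bracketed by two system
						   reads; a small t2 - t1 bounds
						   the error of the PHC sample */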

  Ok maybe we need some sort of control interface to manage the clock like
  the others have.

 That's what the clock_adjtime call provides.

Ummm... You are managing a hardware device with hardware (driver) specific
settings. That is currently being done via ioctls. Why generalize it?

  The posix clocks today assume one notion of real time in the kernel.
  All clocks increase in lockstep (aside from offset updates).

 Not true. The cputime clockids do not increment at the same rate (as the
 apps don't always run). Further CLOCK_MONOTONIC_RAW provides a non-freq
 corrected view of CLOCK_MONOTONIC, so it increments at a slightly
 different rate.

cputime clockids are not tracking time but cpu resource use.

 Re-using the fairly nice (Alan of course disagrees :) posix interface
 seems at least a little better for application developers who actually
 have to use the hardware.

Well it may also be confusing for others. The application developers also
will have a hard time using a generic clock interface to control PTP
device specific things like frequencies, rates etc. So you always need
an ioctl/device specific control interface regardless.





Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-27 Thread Christoph Lameter

On Fri, 24 Sep 2010, Alan Cox wrote:

 Whether you add new syscalls or do the fd passing using flags and hide
 the ugly bits in glibc is another question.

Use device specific ioctls instead of syscalls?



Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, Richard Cochran wrote:

   Support for obtaining timestamps from a PHC already exists via the
   SO_TIMESTAMPING socket option, integrated in kernel version 2.6.30.
   This patch set completes the picture by allow user space programs to
   adjust the PHC and to control its ancillary features.

Is there a way to use the PHC as a system clock? I think the main benefit
of PTP is to have synchronized time on multiple machines in a cluster. That
may mean getting rid of ntp and using an in kernel PHC based way to sync time.

So as far as the POSIX standard is concerned, offering a clock id
to represent the PHC would be acceptable.

Sure but what would you do with it? HPET timer support has no such need.

 3.2.1 Using the POSIX Clock API
 

 Looking at the mapping from PHC operation to the POSIX clock API,
 we see that two of the basic clock operations, marked with *, have
 no POSIX equivalent. The items marked NA are peculiar to PHCs and
 will be discussed separately, below.

   Clock Operation               POSIX function
  -----------------------------+------------------------------
   Set time                      clock_settime
   Get time                      clock_gettime
   Shift the clock               *
   Adjust clock frequency        *
  -----------------------------+------------------------------
   Time stamp external events    NA
   Enable PPS events             NA
   Periodic output signals       NA
   One shot or periodic alarms   timer_create, timer_settime

 In contrast to the standard Linux system clock, a PHC is
 adjustable in hardware, for example using frequency compensation
 registers or a VCO. The ability to directly tune the PHC is
 essential to reap the benefit of hardware timestamping.

There is a reason for not being able to shift posix clocks: The system has
one time base. The various clocks are contributing to maintaining that
system wide time.

I do not understand why you want to maintain different clocks running at
different speeds. Certainly interesting for some uses I guess that I
do not have the energy to imagine right now. But can we get the PTP killer
feature of synchronized accurate system time first?

 3.3 Synchronizing the Linux System Time
 

One could offer a PHC as a combined clock source and clock event
device. The advantage of this approach would be that it obviates
the need for synchronization when the PHC is selected as the system
timer. However, some PHCs, namely the PHY based clocks, cannot be
used in this way.

Why not? Do PHY based clocks not at least provide a counter that increments
in synchronized intervals throughout the network?

Instead, the patch set provides a way to offer a Pulse Per Second
(PPS) event from the PHC to the Linux PPS subsystem. A user space
application can read the PPS events and tune the system clock, just
like when using other external time sources like radio clocks or
GPS.

User space is subject to various latencies created by the OS etc. I would
think that in order to have fine grained (read: microsecond) accuracy we
would have to run the portions that are relevant to obtaining the desired
accuracy in the kernel.



Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, Jacob Keller wrote:

  There is a reason for not being able to shift posix clocks: The system has
  one time base. The various clocks are contributing to maintaining that
  sytem wide time.
 
  Adjusting clocks is absolutely essential for proper functioning of the PTP
 protocol. The slave obtains and calculates the offset from the master and uses
 that in order to adjust the clock properly. The problem is that the
 timestamps are done via the hardware. We need a method to expose that
 hardware so that the ptp software can properly adjust those clocks.

There is no way to use that clock directly to avoid all the user space
tuning etc? There are already tuning mechanisms in the kernel that do this
with system time based on periodic clocks. If you calculate the
nanoseconds since the epoch then you should be able to use that to tune
system time.

  I do not understand why you want to maintain different clocks running at
  different speeds. Certainly interesting for some uses I guess that I
  do not have the energy to imagine right now. But can we get the PTP killer
  feature of synchronized accurate system time first?
 

 The problem is maintaining a hardware clock at the correct speed/frequency
 and time. The timestamping is done via hardware, and that hardware clock
 needs to be accurate. We need to be able to modify that clock. Yes, having
 the system time be the same value would be nice, but the problem comes
 because we don't want to jump through hoops to keep that hardware clock
 accurate to the ptp protocol running on the network.

Then allow system time == hardware clock?

 All of the necessary features for microsecond or better accuracy are done
 via the hardware. You can get accuracy to within 10 microseconds while only
 sending sync packets and such once per second. The reason is because the
 hardware timestamps are very accurate. But if we can't properly adjust the
 clocks time and frequency, we cannot maintain the accuracy of the
 timestamps.

You can already adjust the system time with the existing APIs. Tuning
hardware clocks is currently done using device specific controls. But I
would think that you do not need to expose this to user space if you can
do it all in kernel.
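
The existing API referred to here is adjtimex(2); tuning the system clock
frequency from user space looks roughly like this (a sketch):

	#include <sys/timex.h>

	static int slew_ppm(long ppm)
	{
		struct timex tx = {
			.modes = ADJ_FREQUENCY,
			/* .freq is in scaled ppm: ppm * 2^16 */
			.freq  = ppm << 16,
		};

		return adjtimex(&tx);
	}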


Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, john stultz wrote:

 This was my initial gut reaction as well, but in the end, I agree with
 Richard that in the case of one or multiple PTP hardware clocks, we
 really can't abstract over the different time domains.

My (arguably still superficial) review of the source does not show
anything that would make me reach that conclusion.

 I really don't think the PTP clock can be used as a clocksource sanely.

 First, the hardware access is much to slow for system timekeeping.

The HPET or PIT time sources are also quite slow these days. You only need
access periodically to essentially tune the TSC ratio.

 Second, there is the problem that the system time is a software clock,
 and adjustments made (like freq) are made in the layer that interprets
 the underlying hardware cycle counter. Adjustments made in PTP (in order
 to sync the network timestamps) are made at the hardware level.

From what I can see the PTP clocks are periodic hardware cycle counters
like any other clock that we currently support. If it's configurable enough
then set up a hardware cycle counter that mimics nanoseconds since the
epoch as closely as possible and use that to sync the TSC rate to. Makes
it very easy.

 This would cause a disconnect between the hardware freq understood by
 the system time management code and the actual hardware freq.

We can switch underlying clocks for system time already. We can adapt to a
different hw frequency. But then I do not know why adjust the freq? I
thought the point was that the periodic clock was network synchronized and
can be used as the master clock for multiple machines?

 Richard, I'd actually strike this paragraph from the rational, as I feel
 it has the tendency to confuse as it suggests having the PHC as a
 clocksource is feasible when really it isn't. Or alternatively, maybe
 express more clearly why its not feasible, so it doesn't just seem like
 a minor design choice.

Sorry but I still feel that this is pretty much a misguided approach that
creates unnecessary layers in the kernel. The trivial easy approach was
not done (copy a driver from drivers/clocksource, modify so that it
programs access to a centralized periodic ptp signal and uses it for
system sync).


Re: [PATCH 6/8] ptp: Added a clock that uses the eTSEC found on the MPC85xx.

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, Richard Cochran wrote:

 +* Gianfar PTP clock nodes
 +
 +General Properties:
 +
  - compatible   Should be "fsl,etsec-ptp"
 +  - reg  Offset and length of the register set for the device
 +  - interrupts   There should be at least two interrupts. Some devices
 + have as many as four PTP related interrupts.
 +
 +Clock Properties:
 +
 +  - tclk-period  Timer reference clock period in nanoseconds.
 +  - tmr-prsc Prescaler, divides the output clock.
 +  - tmr-add  Frequency compensation value.
  - cksel        0= external clock, 1= eTSEC system clock, 3= RTC clock input.
                 Currently the driver only supports choice 1.
 +  - tmr-fiper1   Fixed interval period pulse generator.
 +  - tmr-fiper2   Fixed interval period pulse generator.
 +  - max-adj  Maximum frequency adjustment in parts per billion.
 +
 +  These properties set the operational parameters for the PTP
 +  clock. You must choose these carefully for the clock to work right.
 +  Here is how to figure good values:
 +
 +  TimerOsc     = system clock                    MHz
 +  tclk_period  = desired clock period            nanoseconds
 +  NominalFreq  = 1000 / tclk_period              MHz
 +  FreqDivRatio = TimerOsc / NominalFreq          (must be greater than 1.0)
 +  tmr_add      = ceil(2^32 / FreqDivRatio)
 +  OutputClock  = NominalFreq / tmr_prsc          MHz
 +  PulseWidth   = 1 / OutputClock                 microseconds
 +  FiperFreq1   = desired frequency in Hz
 +  FiperDiv1    = 1000000 * OutputClock / FiperFreq1
 +  tmr_fiper1   = tmr_prsc * tclk_period * FiperDiv1 - tclk_period
 +  max_adj      = 1000000000 * (FreqDivRatio - 1.0) - 1

Great stuff for clock synchronization...
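
As a worked example of the formulas above (numbers assumed for
illustration, not taken from the mail): with TimerOsc = 400 MHz and
tclk_period = 10 ns, NominalFreq = 1000 / 10 = 100 MHz, FreqDivRatio =
400 / 100 = 4.0, and tmr_add = ceil(2^32 / 4.0) = 0x40000000. Choosing
tmr_prsc = 2 gives OutputClock = 50 MHz and PulseWidth = 0.02
microseconds. For a 1 Hz fiper, FiperDiv1 = 1000000 * 50 / 1 = 50000000
and tmr_fiper1 = 2 * 10 * 50000000 - 10 = 999999990, i.e. one second
minus one clock period, and max_adj = 1000000000 * (4.0 - 1.0) - 1 =
2999999999.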

 +  The calculation for tmr_fiper2 is the same as for tmr_fiper1. The
 +  driver expects that tmr_fiper1 will be correctly set to produce a 1
 +  Pulse Per Second (PPS) signal, since this will be offered to the PPS
 +  subsystem to synchronize the Linux clock.

Argh. And conceptually completely screwed up. Why go through the PPS
subsystem if you can directly tune the system clock based on a number of
the cool periodic clock features that you have above? See how the other
clocks do that easily? Look into drivers/clocksource. Add it there.

Please do not introduce useless additional layers for clock sync. Load
these ptp clocks like the other regular clock modules and make them sync
system time like any other clock.

Really guys: I want a PTP solution! Now! And not some idiotic additional
kernel layers that just pass bits around because it's so much fun and
screw up clock accuracy due to the latency noise introduced while
having so much fun with the bits.


Re: [PATCH 6/8] ptp: Added a clock that uses the eTSEC found on the MPC85xx.

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, Alan Cox wrote:

  Please do not introduce useless additional layers for clock sync. Load
  these ptp clocks like the other regular clock modules and make them sync
  system time like any other clock.

 I don't think you understand PTP. PTP has masters, a system can need to
 be honouring multiple conflicting masters at once.

The upshot of it all has to be some synchronized notion of time regardless
of how many other things are going on under the hood. And the spec here
suggests a hardware able to generate periodic accurate events that can be
used to sync system time.

  Really guys: I want a PTP solution! Now! And not some idiotic additional
  kernel layers that just pass bits around because its so much fun and
  screws up clock accurary in due to the latency noise introduced while
  having so much fun with the bits.

 There are some interesting complications in putting a PTP sync
 interface in kernel.

If the PTP logic internally has to juggle multiple clocks then that is a
complication for the driver ok. In any case the driver ultimately has to
provide *one* source of time for the system to sync to.


Re: [PATCH v6 0/8] ptp: IEEE 1588 hardware clock support

2010-09-23 Thread Christoph Lameter
On Thu, 23 Sep 2010, john stultz wrote:

  The HPET or pit timesource are also quite slow these days. You only need
  access periodically to essentially tune the TSC ratio.

 If we're using the TSC, then we're not using the PTP clock as you
 suggest. Further the HPET and PIT aren't used to steer the system time
 when we are using the TSC as a clocksource. Its only used to calibrate
 the initial constant freq used by the timekeeping code (and if its
 non-constant, we throw it out).

There is no other scalable time source available for fast timer access
than the time stamp counter in the cpu. Other time sources require
memory accesses, which are inherently slower.

Another accurate time source is used to adjust this clock. NTP does that
via the clock interfaces from user space, which has its problems with
accuracy. PTP can provide the network synced time access
that would allow a more accurate calibration of the time.

 2) The way PTP clocks are steered to sync with network time causes their
 hardware freq to actually change. Since these adjustments are done on
 the hardware clock level, and not on the system time level, the
 adjustments to sync the system time/freq would then be made incorrect by
 PTP hardware adjustments.

Right. So use these as a way to fine tune the TSC clock (and thereby the
system time).

 3) Further, the PTP hardware counter can be simply set to a new offset
 to put it in line with the network time. This could cause trouble with
 timekeeping much like unsynced TSCs do.

You can do the same for system time.

 Now, what you seem to be suggesting is to use the TSC (or whatever
 clocksource the system time is using) but to steer the system time using
 the PTP clock. This is actually what is being proposed, however, the
 steering is done in userland. This is due to the fact that there are two
 components to the steering, 1) adjusting the PTP clock hardware to
 network time and 2) adjusting the system time to the PTP hardware. By
 exposing the PTP clock to userland via the posix clocks interface, we
 allow this to easily be done.

Userland code would introduce latencies that would make sub microsecond
time sync very difficult.

  We can switch underlying clocks for system time already. We can adapt to a
  different hw frequency.

 Actually no. The timekeeping code requires a fixed freq counter. Dealing
 with hardware freq changes is difficult, because error is introduced by
 the latency between when the freq changes and when the timekeeping code
 is notified of it. So the system treats the hardware counters as fixed
 freq. Now, hardware does vary freq ever so slightly as thermal
 conditions change, but this is addressed in userland and corrected via
 adjtimex.

Academic hair splitting? I have repeatedly switched between different
clocks on various systems. So it's difficult, but we do it?

 Unnecessary layers? Where? This approach has less in-kernel layers, as
 it exposes the PTP clock to userland, instead of trying to layer things
 on top of it and stretching the system time abstraction to cover it.

You don't need the user APIs if you directly use the PTP time source to
steer the system clock. In fact I think you have to do it in kernel space
since user space latencies will degrade accuracy otherwise.

 I've argued through the approach trying to keep it all internal to the
 kernel, but to do so would be anything but trivial. Further, there's the
 case of master-clocks, where the PTP hardware must be synced to system
 time, instead of the other way around. And then there's the case of
 boundary-clocks, which may have multiple PTP hardware clocks that have
 to be synced.

Ok maybe we need some sort of control interface to manage the clock like
the others have.

 I think exposing this through the posix clock interface is really the
 best approach. Its not a static clockid, so its not something most apps
 will ever have to deal with, but it allows the few apps that really need
 to have access to the PTP clock hardware can do so in a clean way.

It implies clock tuning in userspace for a potential sub microsecond
accurate clock. The clock accuracy will be limited by user space
latencies and noise. You wont be able to discipline the system clock
accurately.

The posix clocks today assume one notion of real time in the kernel.
All clocks increase in lockstep (aside from offset updates). This approach
here results in multiple notions of time increasing at various speeds.
And it implies that someone in user space is trying to tinker around with
extremely low latencies using system call APIs that take much longer than
these intervals to process the data.




Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim

2010-03-01 Thread Christoph Lameter
On Mon, 1 Mar 2010, Mel Gorman wrote:

 Christoph, how feasible would it be to allow parallel reclaimers in
 __zone_reclaim() that back off at a rate depending on the number of
 reclaimers?

Not too hard. Zone locking is there but there may be a lot of bouncing
cachelines if you run it concurrently.





Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim

2010-02-19 Thread Christoph Lameter
On Fri, 19 Feb 2010, Mel Gorman wrote:

   The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
   zone reclaim.
 

 I've no problem with the patch anyway.

Nor do I.

  - We seem to end up racing between zone_watermark_ok, zone_reclaim and
buffered_rmqueue. Since everyone is in here the memory one thread reclaims
may be stolen by another thread.
 

 You're pretty much on the button here. Only one thread at a time enters
 zone_reclaim. The others back off and try the next zone in the zonelist
 instead. I'm not sure what the original intention was but most likely it
 was to prevent too many parallel reclaimers in the same zone potentially
 dumping out way more data than necessary.

Yes it was to prevent concurrency slowing down reclaim. At that time the
number of processors per NUMA node was 2 or so. The number of pages that
are reclaimed is limited to avoid tossing too many page cache pages.

 You could experiment with waiting on the bit if the GFP flags allow it? The
 expectation would be that the reclaim operation does not take long. Wait
 on the bit; if you are making forward progress, recheck the
 watermarks before continuing.

You could reclaim more pages during a zone reclaim pass? Increase the
nr_to_reclaim in __zone_reclaim() and see if that helps. One zone reclaim
pass should reclaim enough local pages to keep the processors on a node
happy for a reasonable interval. Maybe do a fraction of a zone? 1/16th?
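
A minimal sketch of that experiment in __zone_reclaim() (the 1/16th divisor
is the hypothetical fraction from above; the scan_control field names are
from the kernels of that era):

	/* Hypothetical: reclaim a fixed fraction of the zone per pass
	 * instead of just the number of pages that were requested.
	 */
	struct scan_control sc = {
		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
		.may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP),
		.nr_to_reclaim = max(zone->present_pages / 16,
				     (unsigned long)SWAP_CLUSTER_MAX),
		.gfp_mask = gfp_mask,
	};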





Re: [PATCH] powerpc: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim

2010-02-19 Thread Christoph Lameter
On Fri, 19 Feb 2010, Balbir Singh wrote:

  zone_reclaim. The others back off and try the next zone in the zonelist
  instead. I'm not sure what the original intention was but most likely it
  was to prevent too many parallel reclaimers in the same zone potentially
  dumping out way more data than necessary.
 
  Yes it was to prevent concurrency slowing down reclaim. At that time the
  number of processors per NUMA node was 2 or so. The number of pages that
  are reclaimed is limited to avoid tossing too many page cache pages.
 

 That is interesting, I always thought it was to try and free page
 cache first. For example with zone->min_unmapped_pages, if
 zone_pagecache_reclaimable is greater than unmapped pages, we start
 reclaiming the cached pages first. The min_unmapped_pages almost sounds
 like a higher level watermark - or am I misreading the code?

Indeed the purpose is to free *old* page cache pages.

The min_unmapped_pages is to protect a minimum of the page cache pages /
fs metadata from zone reclaim so that ongoing file I/O is not impacted.
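
Roughly how that protection looks in zone_reclaim() of this era, as a sketch
(field and helper names as discussed above):

	/* Do not reclaim if the reclaimable page cache is already at or
	 * below the protected minimum for this zone.
	 */
	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
		return ZONE_RECLAIM_FULL;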


Re: [PATCH 2/2][v2] powerpc: Make the CMM memory hotplug aware

2009-10-16 Thread Christoph Lameter
On Thu, 15 Oct 2009, Gerald Schaefer wrote:

  The pages allocated as __GFP_MOVABLE are used to store the list of pages
  allocated by the balloon.  They reference virtual addresses and it would
  be fine for the kernel to migrate the physical pages for those; the
  balloon would not notice this.

 Does page migration really work for kernel pages that were allocated
 with __get_free_page()? I was wondering if we can do this on s390, where
 we have a 1:1 mapping of kernel virtual to physical addresses, but
 looking at migrate_pages() and friends, it seems that kernel pages
 w/o mapping and rmap should not be migratable at all. Any thoughts from
 the memory migration experts?

page migration only works for pages where we have some way of accounting
for all the references to a page. This usually means using reverse mappings
(anon list, radix trees and page tables).
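
As a sketch of what that accounting requirement means in code (the helpers
are real, but the check itself is illustrative rather than the actual
migrate_pages() logic):

	/* A page is only a plausible migration candidate if its users
	 * can be found again: file pages via page->mapping and the
	 * radix tree, anonymous pages via the anon_vma rmap chains.
	 */
	if (!page_mapping(page) && !PageAnon(page))
		return -EAGAIN;	/* raw kernel page: no rmap to fix up */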




Re: [PATCH 6/6] Add support for __read_mostly to linux/cache.h

2009-05-01 Thread Christoph Lameter
On Fri, 1 May 2009, Sam Ravnborg wrote:

 Is there any specific reason why we do not support read_mostly on all
 architectures?

Not that I know of.

 read_mostly is about grouping rarely written data together,
 so what is needed is to introduce this section in the remaining
 architectures.

 Christoph - git log says you did the initial implementation.
 Do you agree?

Yes.

There is some concern that __read_mostly is needlessly applied to
numerous variables that are not used in hot code paths. This may make
__read_mostly ineffective and actually increase the cache footprint of a
function since global variables are no longer in the same cacheline. If
such a function is called and the caches are cold then two cacheline
fetches have to be done instead of one.
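
For illustration, the mechanism is just a section attribute plus a matching
section in the architecture's linker script; a simplified sketch of the
include/linux/cache.h side (the per-arch CONFIG list is elided):

/* Architectures that provide a .data.read_mostly output section get
 * the attribute; everyone else falls back to an empty definition.
 */
#ifndef __read_mostly
#define __read_mostly __attribute__((__section__(".data.read_mostly")))
#endif

/* Usage: a global read in hot paths but rarely written. */
static int example_tunable __read_mostly = 1;	/* hypothetical variable */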




Re: [RFC] [PATCH 0/5 V2] Huge page backed user-space stacks

2008-07-30 Thread Christoph Lameter
Mel Gorman wrote:

 With Erics patch and libhugetlbfs, we can automatically back text/data[1],
 malloc[2] and stacks without source modification. Fairly soon, libhugetlbfs
 will also be able to override shmget() to add SHM_HUGETLB. That should cover
 a lot of the memory-intensive apps without source modification.

So we are quite far down the road to having a VM that supports two page
sizes, 4k and 2M?
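
For context, the shmget() override mentioned above essentially amounts to
adding SHM_HUGETLB to the request; a minimal user-space sketch (the size
must be a multiple of the huge page size):

	#include <sys/ipc.h>
	#include <sys/shm.h>

	/* Ask for the segment from the hugetlb pool directly. */
	int id = shmget(IPC_PRIVATE, 16 * 1024 * 1024,
			IPC_CREAT | SHM_HUGETLB | 0600);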



Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()

2008-03-04 Thread Christoph Lameter
On Tue, 4 Mar 2008, Pekka Enberg wrote:

   I suspect the WARN_ON() is bogus although I really don't know that part
   of the code all too well. Mel?

  The warn-on is valid. A situation should not exist that allows both flags
  to be set. I suspect if remove-set_migrateflags.patch was reverted from -mm
  the warning would not trigger. Christoph, would it be reasonable to always
  clear __GFP_MOVABLE when __GFP_RECLAIMABLE is set for SLAB_RECLAIM_ACCOUNT?

Slab allocations should never be passed these flags since the slabs do 
their own thing there.

The following patch would clear these in slub:

---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25-rc3-mm1/mm/slub.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/slub.c	2008-03-04 11:53:47.600342756 -0800
+++ linux-2.6.25-rc3-mm1/mm/slub.c	2008-03-04 11:55:40.153855150 -0800
@@ -1033,8 +1033,8 @@ static struct page *allocate_slab(struct
 	struct page *page;
 	int pages = 1 << s->order;
 
+	flags &= ~GFP_MOVABLE_MASK;
 	flags |= s->allocflags;
-
 	page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY,
 			node, s->order);
 	if (unlikely(!page)) {


Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()

2008-03-04 Thread Christoph Lameter
On Tue, 4 Mar 2008, Pekka Enberg wrote:

 [c9edf5f0] [c00b56e4] .__alloc_pages_internal+0xf8/0x470
 [c9edf6e0] [c00e0458] .kmem_getpages+0x8c/0x194
 [c9edf770] [c00e1050] .fallback_alloc+0x194/0x254
 [c9edf820] [c00e14b0] .kmem_cache_alloc+0xd8/0x144

Ahh! This is SLAB. SLUB does not suffer from this problem since new_slab()
masks the bits correctly.

So we need to fix SLAB.
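
For comparison, a sketch of the masking meant here; the mask name is from
the era's gfp.h and the line is illustrative of new_slab(), not a verbatim
quote of the function:

	/* Only allocation-level bits are forwarded, so caller bits like
	 * __GFP_MOVABLE and __GFP_RECLAIMABLE never reach the page
	 * allocator.
	 */
	page = allocate_slab(s, flags & GFP_LEVEL_MASK, node);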



Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()

2008-03-04 Thread Christoph Lameter
On Tue, 4 Mar 2008, Pekka J Enberg wrote:

 On Tue, 4 Mar 2008, Christoph Lameter wrote:
  Slab allocations should never be passed these flags since the slabs do 
  their own thing there.
  
  The following patch would clear these in slub:
 
 Here's the same fix for SLAB:

That is an immediate fix, OK. But there must be some location where SLAB
does the masking of the gfp bits and where things go wrong. Looking for that.


Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()

2008-03-04 Thread Christoph Lameter
I think this is the correct fix.

The NUMA fallback logic should be passing local_flags to kmem_getpages()
and not simply the flags.

Maybe a stable candidate since we are now simply 
passing on flags to the page allocator on the fallback path.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/slab.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.25-rc3-mm1/mm/slab.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/slab.c	2008-03-04 12:01:07.430911920 -0800
+++ linux-2.6.25-rc3-mm1/mm/slab.c	2008-03-04 12:04:54.449857145 -0800
@@ -3277,7 +3277,7 @@ retry:
 	if (local_flags & __GFP_WAIT)
 		local_irq_enable();
 	kmem_flagcheck(cache, flags);
-	obj = kmem_getpages(cache, flags, -1);
+	obj = kmem_getpages(cache, local_flags, -1);
 	if (local_flags & __GFP_WAIT)
 		local_irq_disable();
 	if (obj) {



Re: [BUG] 2.6.25-rc3-mm1 kernel panic while bootup on powerpc ()

2008-03-04 Thread Christoph Lameter
On Tue, 4 Mar 2008, Pekka Enberg wrote:

 Looking at the code, it's triggerable in 2.6.24.3 at least. Why we don't have
 a report yet, probably because (1) the default allocator is SLUB which doesn't
 suffer from this and (2) you need a big honkin' NUMA box that causes fallback
 allocations to happen to trigger it.

Plus the issue only became a problem after the antifrag stuff went in. 
That came with SLUB as the default.



Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

2008-01-23 Thread Christoph Lameter
On Wed, 23 Jan 2008, Pekka J Enberg wrote:

 I still think Christoph's kmem_getpages() patch is correct (to fix 
 cache_grow() oops) but I overlooked the fact that none the callers of 
 cache_alloc_node() deal with bootstrapping (with the exception of 
 __cache_alloc_node() that even has a comment about it).

My patch is useless. kmem_getpages() called with nodeid == -1 falls back
correctly to the available node. The problem is that the node structures
for the page do not exist.
 
 But what I am really wondering about is, why wasn't the 
 N_NORMAL_MEMORY revert enough? I assume this used to work before so what 
 more do we need to revert for 2.6.24?

I think that is because SLUB relaxed the requirements on having regular 
memory on the boot node. Now the expectation is that SLAB can do the same.




Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

2008-01-23 Thread Christoph Lameter
On Wed, 23 Jan 2008, Pekka J Enberg wrote:

 Furthermore, don't let kmem_getpages() call alloc_pages_node() if the
 nodeid passed to it is -1, as the latter will always translate that to
 numa_node_id(), which might not have its ->nodelists set up - which is
 what caused the invocation of fallback_alloc() in the first place (for
 example, during bootstrap).

kmem_getpages() is called without GFP_THISNODE here. This
alloc_pages_node(numa_node_id(), ...) will fall back to the next node with
memory.
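
A sketch of the difference (the flag and helper are real; the snippet is
illustrative):

	/* With __GFP_THISNODE the zonelist is restricted to the target
	 * node and the allocation fails on a memoryless node; without
	 * it, the zonelist falls back to the nearest node with memory.
	 */
	page = alloc_pages_node(numa_node_id(),
				flags & ~__GFP_THISNODE, order);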



Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

2008-01-23 Thread Christoph Lameter
On Wed, 23 Jan 2008, Mel Gorman wrote:

 This patch adds the necessary checks to make sure a kmem_list3 exists for
 the preferred node used when growing the cache. If the preferred node has
 no nodelist then the currently running node is used instead. This
 problem only affects the SLAB allocator, SLUB appears to work fine.

That is a dangerous thing to do. SLAB per cpu queues will contain foreign
objects which may cause trouble when pushing the objects back. I think we
may be lucky that these objects are consumed at boot. If all of the
foreign objects are consumed at boot then we are fine. At least an
explanation of this issue should be added to the patch.



Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

2008-01-23 Thread Christoph Lameter
On Wed, 23 Jan 2008, Pekka J Enberg wrote:

 Fine. But, why are we hitting fallback_alloc() in the first place? It's
 definitely not because of missing ->nodelists as we do:

 	cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE];

 before attempting to set up kmalloc caches. Now, if I understood
 correctly, we're booting off a memoryless node so kmem_getpages() will
 return NULL thus forcing us to fallback_alloc() which is unavailable at
 this point.
 
 As far as I can tell, there are two ways to fix this:
 
   (1) don't boot off a memoryless node (why are we doing this in the first 
   place?)

Right. That is the solution that I would prefer.

   (2) initialize cache_cache.nodelists with initkmem_list3 equivalents
   for *each node that has normal memory*

Or simply do it for all. SLAB bootstrap is a very complex thing though.
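
A minimal sketch of the "do it for all" variant, reusing the names from the
quote above (the placement inside kmem_cache_init() is an assumption):

	/* Point every node's list at the static bootstrap structure so
	 * fallback_alloc() always finds a nodelist during early boot.
	 */
	for_each_node(node)
		cache_cache.nodelists[node] = &initkmem_list3[CACHE_CACHE];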

 
 I am still wondering why this worked before, though.

I doubt it ever worked for SLAB.



Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

2008-01-23 Thread Christoph Lameter
On Wed, 23 Jan 2008, Pekka Enberg wrote:

 I think Mel said that their configuration did work with 2.6.23
 although I also wonder how that's possible. AFAIK there have been some
 changes in the page allocator that might explain this. That is, if
 kmem_getpages() returned pages for memoryless node before, bootstrap
 would have worked.

Regular kmem_getpages is called with GFP_THISNODE set. There was some
breakage in 2.6.22 and before with GFP_THISNODE returning pages from the
wrong node if a node had no memory. So it may have worked accidentally and
in an unsafe manner, because the pages would have been associated with the
wrong node, which could trigger BUG_ONs and locking trouble.




Re: [PATCH] Fix boot problem in situations where the boot CPU is running on a memoryless node

2008-01-23 Thread Christoph Lameter
On Wed, 23 Jan 2008, Nishanth Aravamudan wrote:

 Right, so it might have functioned before, but the correctness was
 wobbly at best... Certainly the memoryless patch series has tightened
 that up, but we missed these SLAB issues.
 
 I see that your patch fixed Olaf's machine, Pekka. Nice work on
 everyone's part tracking this stuff down.

Another important result is that I found that GFP_THISNODE is actually
required for proper SLAB operation and not only an optimization. Fallback
can lead to very bad results. I have two customer-reported instances of
SLAB corruption here that can now be explained by fallback to another
node. Foreign objects enter the per cpu queue. The wrong node lock is
taken during cache_flusharray(). Fields in the struct slab can become
corrupted. It typically hits the list field and the inuse field.
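
A sketch of the failure mode, following the shape of the era's
cache_flusharray() (names from mm/slab.c, simplified):

	/* The free fast path assumes every object in the per-cpu array
	 * is local, so it always takes the local node's list lock; a
	 * foreign object is thus freed onto the wrong node's lists.
	 */
	int node = numa_node_id();
	struct kmem_list3 *l3 = cachep->nodelists[node];

	spin_lock(&l3->list_lock);
	free_block(cachep, ac->entry, batchcount, node);
	spin_unlock(&l3->list_lock);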





Re: crash in kmem_cache_init

2008-01-22 Thread Christoph Lameter
On Tue, 22 Jan 2008, Mel Gorman wrote:

  After you reverted the slab memoryless node patch there should be per node 
  structures created for node 0 unless the node is marked offline. Is it? If 
  so then you are booting a cpu that is associated with an offline node. 
  
 
 I'll roll a patch that prints out the online states before startup and
 see what it looks like.

Ok. Great.

 
   Can you see a better solution than this?
  
  Well this means that bootstrap will work by introducing foreign objects 
  into the per cpu queue (should only hold per cpu objects). They will 
  later be consumed and then the queues will contain the right objects so 
  the effect of the patch is minimal.
  
 
 By minimal, do you mean that you expect it to break in some other
 respect later, or minimal as in this is bad but should have no
 adverse impact?

Should not have any adverse impact after the objects from the cpu queue 
have been consumed. If the cache_reaper tries to shift objects back 
from the per cpu queue into slabs then BUG_ONs may be triggered. Make sure 
you run the tests with full debugging please.

 Whether this was a problem fixed in the past or not, it's broken again now
 :( . It's possible that there is a __GFP_THISNODE that can be dropped early
 at boot-time that would also fix this problem in a way that doesn't
 affect runtime (like altering cache_grow in my patch does).

The dropping of GFP_THISNODE has the same effect as your patch. 
Objects from another node get into the per cpu queue. And on free we 
assume that per cpu queue objects are from the local node. If debug is on 
then we check that with BUG_ONs.




Re: crash in kmem_cache_init

2008-01-22 Thread Christoph Lameter
On Tue, 22 Jan 2008, Olaf Hering wrote:

 It crashes now in a different way if the patch below is applied:

Yup, no l3 structure for the current node. We are early in bootstrap. You
could just check if the l3 is there and if not just skip starting the
reaper? This will be redone later anyway. Not sure if this will solve all
your issues though. An l3 for the current node that we are booting on
needs to be created early on for SLAB bootstrap to succeed. AFAICT SLUB
doesn't care and simply uses whatever the page allocator gives it for the
cpu slab. We may have gotten there because you only tested with SLUB
recently and thus changes got in that broke SLAB boot assumptions.


 0xc00fe018 is in setup_cpu_cache
 (/home/olaf/kernel/git/linux-2.6-numa/mm/slab.c:2111).
 2106			BUG_ON(!cachep->nodelists[node]);
 2107			kmem_list3_init(cachep->nodelists[node]);
 2108		}
 2109	}
 2110	}

	if (cachep->nodelists[numa_node_id()])
		return;

 2111	cachep->nodelists[numa_node_id()]->next_reap =
 2112		jiffies + REAPTIMEOUT_LIST3 +
 2113		((unsigned long)cachep) % REAPTIMEOUT_LIST3;
 2114
 2115	cpu_cache_get(cachep)->avail = 0;
