Re: [PATCH V2] mm/page_alloc: Ensure that HUGETLB_PAGE_ORDER is less than MAX_ORDER

2021-04-19 Thread Christoph Lameter
On Mon, 19 Apr 2021, Anshuman Khandual wrote:

> >> Unfortunately the build test fails on both the platforms (powerpc and ia64)
> >> which subscribe HUGETLB_PAGE_SIZE_VARIABLE and where this check would make
> >> sense. I some how overlooked the cross compile build failure that actually
> >> detected this problem.
> >>
> >> But wondering why this assert is not holding true ? and how these platforms
> >> do not see the warning during boot (or do they ?) at mm/vmscan.c:1092 like
> >> arm64 did.
> >>
> >> static int __fragmentation_index(unsigned int order, struct 
> >> contig_page_info *info)
> >> {
> >>  unsigned long requested = 1UL << order;
> >>
> >>  if (WARN_ON_ONCE(order >= MAX_ORDER))
> >>  return 0;
> >> 
> >>
> >> Can pageblock_order really exceed MAX_ORDER - 1 ?

You can have larger blocks but you would need to allocate multiple
contiguous max order blocks or do it at boot time before the buddy
allocator is active.

What IA64 did was to do this at boot time thereby avoiding the buddy
lists. And it had a separate virtual address range and page table for the
huge pages.

Looks like the current code does these allocations via CMA which should
also bypass the buddy allocator.
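
As a rough illustration (a sketch only, not code from this thread, and the
helper name is made up), such an over-MAX_ORDER allocation goes through a
reserved CMA area rather than the buddy lists:

        #include <linux/cma.h>
        #include <linux/mm.h>

        /*
         * Grab a physically contiguous block larger than MAX_ORDER pages from
         * a previously reserved CMA area, bypassing the buddy allocator.
         * Roughly what the gigantic-page path does; error handling omitted.
         */
        static struct page *grab_gigantic_block(struct cma *cma, unsigned int order)
        {
                /* count is in base pages, the alignment is given as an order */
                return cma_alloc(cma, 1UL << order, order, false);
        }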


> > }
> >
> >
> > But it's kind of weird, isn't it? Let's assume we have MAX_ORDER - 1 
> > correspond to 4 MiB and pageblock_order correspond to 8 MiB.
> >
> > Sure, we'd be grouping pages in 8 MiB chunks, however, we cannot even
> > allocate 8 MiB chunks via the buddy. So only alloc_contig_range()
> > could really grab them (IOW: gigantic pages).
>
> Right.

But then you can avoid the buddy allocator.

> > Further, we have code like deferred_free_range(), where we end up
> > calling __free_pages_core()->...->__free_one_page() with
> > pageblock_order. Wouldn't we end up setting the buddy order to
> > something > MAX_ORDER -1 on that path?
>
> Agreed.

We would need to return the supersized block to the huge page pool and not
to the buddy allocator. There is a special callback in the compound page
so that you can call an alternate free function that is not the buddy
allocator.
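
A minimal sketch of that mechanism (not code from this thread; this is how
hugetlb, with CONFIG_HUGETLB_PAGE, routes the free of a compound page back to
its pool instead of the buddy allocator):

        #include <linux/mm.h>

        /*
         * Install the hugetlb destructor on the head page of an already
         * prepared compound page. The final put_page() then calls
         * free_huge_page() rather than returning the pages to the buddy lists.
         */
        static void route_free_to_huge_pool(struct page *head)
        {
                set_compound_page_dtor(head, HUGETLB_PAGE_DTOR);
        }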

>
> >
> > Having pageblock_order > MAX_ORDER feels wrong and looks shaky.
> >
> Agreed, definitely does not look right. Lets see what other folks
> might have to say on this.
>
> + Christoph Lameter 
>

It was done for a long time successfully and is running in numerous
configurations.


Re: [PATCH] infiniband: ulp: Remove unnecessary struct declaration

2021-04-15 Thread Christoph Lameter
On Thu, 15 Apr 2021, Wan Jiabing wrote:

> struct ipoib_cm_tx is defined at 245th line.
> And the definition is independent on the MACRO.
> The declaration here is unnecessary. Remove it.

Correct.

Reviewed-by: Christoph Lameter 



Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access

2021-04-15 Thread Christoph Lameter
On Wed, 14 Apr 2021, Masayoshi Mizuma wrote:

> Please feel free to add:
>
> Tested-by: Masayoshi Mizuma 
>
> Thanks!
> Masa
>

Would you please stop quoting the whole patch when you have nothing to say
about the details? It is enough to just respond without quoting. I was
looking through this trying to find something you said about individual
sections of code but there was nothing.





Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem

2021-03-16 Thread Christoph Lameter
On Mon, 15 Mar 2021, Yang Shi wrote:

> > It seems like CONFIG_SLUB_DEBUG is a more popular option than 
> > CONFIG_SLUB_STATS.
> > CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is 
> > off.
> > I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.
>
> I think CONFIG_SLUB_DEBUG is enabled by default on most distros since
> it is supposed not incur too much overhead unless specific debug (i.e.
> red_zone) is turned on on demand.

Correct. CONFIG_SLUB_DEBUG includes the code so the debugging can be
enabled on distro kernels with a kernel command line option. So you don't
have to recompile the kernel to find weird memory corruption issues from
strange device drivers.
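
For example (a hedged illustration of the boot parameter, not taken from this
thread), booting a CONFIG_SLUB_DEBUG kernel with

        slub_debug=FZP,kmalloc-64

turns on sanity checks (F), red zoning (Z) and poisoning (P) for just the
kmalloc-64 cache, without recompiling.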

Somehow my email address dropped off this thread.



Re: [mm, slub] 8ff60eb052: stress-ng.rawpkt.ops_per_sec -47.9% regression

2021-03-09 Thread Christoph Lameter
On Tue, 9 Mar 2021, Linus Torvalds wrote:

> So when you wrote:
>
> However, the current code accidentally stops looking at the partial list
> completely in that case.  Especially on kernels without CONFIG_NUMA set,
> this means that get_partial() fails and new_slab_objects() falls back to
> new_slab(), allocating new pages.  This could lead to an unnecessary
> increase in memory fragmentation.
>
> it really looks like this might well have been very intentional
> indeed. Or at least very beneficial for _some_ loads.
>
> Comments?

Yes, the thought was that adding an additional page when there is contention
on the page's objects will increase the possible concurrency while avoiding
locks, and increase the ability to allocate / free concurrently from a
multitude of objects.



Re: [PATCH] mm/slub: Add slub_debug option to panic on memory corruption

2021-03-09 Thread Christoph Lameter
On Tue, 9 Mar 2021, Georgi Djakov wrote:

> Being able to stop the system immediately when a memory corruption
> is detected is crucial to finding the source of it. This is very
> useful when the memory can be inspected with kdump or other tools.

Hmmm ok.

>  static void restore_bytes(struct kmem_cache *s, char *message, u8 data,
>   void *from, void *to)
>  {
> + if (slub_debug & SLAB_CORRUPTION_PANIC)
> + panic("slab: object overwritten\n");
>   slab_fix(s, "Restoring 0x%p-0x%p=0x%x\n", from, to - 1, data);
>   memset(from, data, to - from);
>  }

Why panic here? This should only be called late in the bug reporting when
an error has already been printed.



Re: [PATCH v2 3/3] mm/slub: Use percpu partial free counter

2021-03-03 Thread Christoph Lameter
On Wed, 3 Mar 2021, Matthew Wilcox wrote:

> > Can this be allocated in an interrupt context?
> >
> > And I am not sure how local_t relates to that? Percpu counters can be used
> > in an interrupt context without the overhead of the address calculations
> > that are required by a local_t.
>
> As I understand the patch, this counts the number of partially free slabs.
> So if we start to free an object from a completely full slab in process
> context, as "load x, add one to x, store x" and take an interrupt
> between loading x and adding one to x, that interrupt handler might
> free a different object from another completely full slab.  that would
> also load the same x, add one to it and store x, but then the process
> context would add one to the old x, overwriting the updated value from
> interrupt context.

this_cpu operations are "atomic" vs. preemption but on some platforms not
vs interrupts. That could be an issue in kmem_cache_free(). This would
need a modification to the relevant this_cpu ops so that interrupts are
disabled on those platforms.

Like this_cpu_inc_irq() or so?


> it's not the likeliest of races, and i don't know how important it is
> that these counters remain accurate.  but using a local_t instead of
> a percpu long would fix the problem.  i don't know why you think that
> a local_t needs "address calculations".  perhaps you've misremembered
> what a local_t is?

local_t does not include the address arithmetic that the this_cpu
operations can perform implicitly on x86, e.g. with a segment register, or
with another register on other platforms, thereby avoiding the need to
disable preemption or interrupts.

Therefore a manual calculation of the target address for a local_t
operation needs to be done beforehand, which usually means disabling
interrupts and/or preemption for the code segment. Otherwise we may end up
on a different processor due to the scheduler or other interruptions and use
the percpu counter value of a different processor, which could be racy.
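
To make the difference concrete, a minimal sketch (the counter names are
invented for illustration):

        #include <linux/percpu.h>
        #include <linux/preempt.h>
        #include <asm/local.h>

        DEFINE_PER_CPU(unsigned long, pfree_count);     /* plain per-cpu counter */
        DEFINE_PER_CPU(local_t, pfree_local);           /* local_t variant */

        static void count_with_this_cpu(void)
        {
                /*
                 * On x86 this is a single segment-prefixed add: no address
                 * setup and no need to disable preemption around it.
                 */
                this_cpu_inc(pfree_count);
        }

        static void count_with_local_t(void)
        {
                /*
                 * The per-cpu address has to be computed first, so the section
                 * must stay pinned to one CPU or another CPU's counter may be
                 * incremented.
                 */
                preempt_disable();
                local_inc(this_cpu_ptr(&pfree_local));
                preempt_enable();
        }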


Re: [PATCH v2 3/3] mm/slub: Use percpu partial free counter

2021-03-03 Thread Christoph Lameter
On Wed, 3 Mar 2021, Matthew Wilcox wrote:

> On Tue, Mar 02, 2021 at 10:14:53AM +0100, Christoph Lameter wrote:
> > On Mon, 10 Aug 2020, Xunlei Pang wrote:
> > > - atomic_long_t partial_free_objs;
> > > + atomic_long_t __percpu *partial_free_objs;
> >
> > A percpu counter is never atomic. Just use unsigned long and use this_cpu
> > operations for this thing. That should cut down further on the overhead.
>
> What about allocations from interrupt context?  Should this be a local_t
> instead?

Can this be allocated in an interrupt context?

And I am not sure how local_t relates to that? Percpu counters can be used
in an interrupt context without the overhead of the address calculations
that are required by a local_t.


Re: [PATCH] include/linux/slab.h: use for() and left shift to calculate

2021-03-02 Thread Christoph Lameter
On Tue, 2 Mar 2021, Yejune Deng wrote:

> use for() and left shift to calculate the value that compared with size.

There is a reason for the madness...

The current code was written so compilers can do proper constant folding
and eliminate the whole function entirely.

If you want this change then please verify that all compilers currently in
use with this code do proper constant folding and never generate code for
the for loop.
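
A hedged illustration of the point (not the actual include/linux/slab.h code):
with an if-chain and a compile-time-constant size the whole function folds to
a constant, which a for loop only achieves if the compiler fully unrolls it.

        static __always_inline unsigned int size_to_index(size_t size)
        {
                if (size <= 8)
                        return 3;
                if (size <= 16)
                        return 4;
                if (size <= 32)
                        return 5;
                if (size <= 64)
                        return 6;
                /* ... */
                return 0;
        }

        /*
         * size_to_index(32) reduces to the constant 5 at compile time, so a
         * kmalloc()-style caller becomes a direct cache lookup with no code
         * generated for the index calculation at all.
         */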


Re: [PATCH v2 3/3] mm/slub: Use percpu partial free counter

2021-03-02 Thread Christoph Lameter
On Mon, 10 Aug 2020, Xunlei Pang wrote:

>
> diff --git a/mm/slab.h b/mm/slab.h
> index c85e2fa..a709a70 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -616,7 +616,7 @@ struct kmem_cache_node {
>  #ifdef CONFIG_SLUB
>   unsigned long nr_partial;
>   struct list_head partial;
> - atomic_long_t partial_free_objs;
> + atomic_long_t __percpu *partial_free_objs;

A percpu counter is never atomic. Just use unsigned long and use this_cpu
operations for this thing. That should cut down further on the overhead.

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1775,11 +1775,21 @@ static void discard_slab(struct kmem_cache *s, struct 
> page *page)
>  /*
>   * Management of partially allocated slabs.
>   */
> +static inline long get_partial_free(struct kmem_cache_node *n)
> +{
> + long nr = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + nr += atomic_long_read(per_cpu_ptr(n->partial_free_objs, cpu));

this_cpu_read(*n->partial_free_objs)

>  static inline void
>  __update_partial_free(struct kmem_cache_node *n, long delta)
>  {
> - atomic_long_add(delta, &n->partial_free_objs);
> + atomic_long_add(delta, this_cpu_ptr(n->partial_free_objs));

this_cpu_add()

and so on.
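
Spelled out, the suggested conversion could look roughly like this (a sketch
only; the field layout is assumed from the patch above and <linux/percpu.h>
provides the this_cpu/per_cpu_ptr helpers):

        struct kmem_cache_node {
                /* ... */
                unsigned long __percpu *partial_free_objs;      /* no atomics */
        };

        static inline unsigned long get_partial_free(struct kmem_cache_node *n)
        {
                unsigned long nr = 0;
                int cpu;

                /* sum the per-cpu contributions when a total is needed */
                for_each_possible_cpu(cpu)
                        nr += *per_cpu_ptr(n->partial_free_objs, cpu);
                return nr;
        }

        static inline void __update_partial_free(struct kmem_cache_node *n,
                                                 long delta)
        {
                this_cpu_add(*n->partial_free_objs, delta);
        }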


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-02-04 Thread Christoph Lameter
On Thu, 4 Feb 2021, Vincent Guittot wrote:

> > So what is preferrable here now? Above or other quick fix or reverting
> > the original commit?
>
> I'm fine with whatever the solution as long as we can use keep using
> nr_cpu_ids when other values like num_present_cpus, don't reflect
> correctly the system

AFAICT they are correctly reflecting the current state of the system.

The problem here is the bringup of the system and the tuning done during it.

One additional thing that may help: The slab caches can work in a degraded
mode where no fastpath allocations can occur. That mode is used primarily
for debugging but maybe that mode can also help during bootstrap to avoid
having to deal with the per cpu data and so on.

In degraded mode SLUB will take a lock for each operation on an object.

In this mode the following is true

  kmem_cache_cpu->page == NULL
  kmem_cache_cpu->freelist == NULL

   kmem_cache_debug(s) == true

So if you define a new debug mode and include it in SLAB_DEBUG_FLAGS then
you can force SLUB to fallback to operations where a lock is taken and
where slab allocation can be stopped. This may be ok for bring up.

The debug flags are also tied to some wizardry that can patch the code at
runtime to optimize for debugging or fast operations. You would tie into
that one as well. Start in debug mode by default and switch to fast
operations after all processors are up.
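
A hedged sketch of how such a bootstrap mode could hook into that runtime
patching (static keys / jump labels; the key and the two helpers below are
hypothetical):

        #include <linux/jump_label.h>
        #include <linux/init.h>

        static DEFINE_STATIC_KEY_TRUE(slub_bootstrap_mode);     /* hypothetical */

        void *allocate_object(struct kmem_cache *s)             /* hypothetical */
        {
                if (static_branch_unlikely(&slub_bootstrap_mode))
                        return slow_locked_alloc(s);    /* per-node list, lock */
                return fast_percpu_alloc(s);            /* lockless fast path */
        }

        static int __init slub_leave_bootstrap(void)
        {
                /* patch the branch once all processors are up */
                static_branch_disable(&slub_bootstrap_mode);
                return 0;
        }
        late_initcall(slub_leave_bootstrap);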




Re: [PATCH] mm/slub: embed __slab_alloc to its caller

2021-02-02 Thread Christoph Lameter
On Tue, 2 Feb 2021, Abel Wu wrote:

> Since slab_alloc_node() is the only caller of __slab_alloc(), embed
> __slab_alloc() to its caller to save function call overhead. This
> will also expand the caller's code block size a bit, but hackbench
> tests on both host and guest didn't show a difference w/ or w/o
> this patch.

slab_alloc_node is an always_inline function. It is intentional that only
the fast path was inlined and not the slow path.


Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-01-28 Thread Christoph Lameter
On Thu, 28 Jan 2021, Michal Hocko wrote:

> > > If you kill the allocating process then yes, it would work, but your
> > > process might be the very last to be selected.
> >
> > OOMs are different if you have a "constrained allocation". In that case it
> > is the fault of the process who wanted memory with certain conditions.
> > That memory is not available. General memory is available though. In that
> > case the allocating process is killed.
>
> I do not see this implementation would do anything like that. Neither
> anything like that implemented in the oom killer. Constrained
> allocations (cpusets/memcg/mempolicy) only do restrict their selection
> to processes which belong to the same domain. So I am not really sure
> what you are referring to. The is only a global knob to _always_ kill
> the allocating process on OOM.

Constrained allocations refer to allocations where the NUMA nodes are
restricted or something else does not allow the use of arbitrary memory.
The OOM killer changes its behavior. In the past we fell back to killing
the calling process.

See constrained_alloc() in mm/oom_kill.c

static const char * const oom_constraint_text[] = {
[CONSTRAINT_NONE] = "CONSTRAINT_NONE",
[CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET",
[CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY",
[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
};

/*
 * Determine the type of allocation constraint.
 */
static enum oom_constraint constrained_alloc(struct oom_control *oc)
{



Re: [PATCH v16 07/11] secretmem: use PMD-size pages to amortize direct map fragmentation

2021-01-28 Thread Christoph Lameter
On Thu, 28 Jan 2021, Michal Hocko wrote:

> > So, if I understand your concerns correct this implementation has two
> > issues:
> > 1) allocation failure at page fault that causes unrecoverable OOM and
> > 2) a possibility for an unprivileged user to deplete secretmem pool and
> > cause (1) to others
> >
> > I'm not really familiar with OOM internals, but when I simulated an
> > allocation failure in my testing only the allocating process and it's
> > parent were OOM-killed and then the system continued normally.
>
> If you kill the allocating process then yes, it would work, but your
> process might be the very last to be selected.

OOMs are different if you have a "constrained allocation". In that case it
is the fault of the process who wanted memory with certain conditions.
That memory is not available. General memory is available though. In that
case the allocating process is killed.


Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-27 Thread Christoph Lameter
On Tue, 26 Jan 2021, Will Deacon wrote:

> > Hm, but booting the secondaries is just a software (kernel) action? They are
> > already physically there, so it seems to me as if the cpu_present_mask is 
> > not
> > populated correctly on arm64, and it's just a mirror of cpu_online_mask?
>
> I think the present_mask retains CPUs if they are hotplugged off, whereas
> the online mask does not. We can't really do any better on arm64, as there's
> no way of telling that a CPU is present until we've seen it.

The order of each page in a kmem cache (and therefore also the number
of objects in a slab page) can be different because that information is
stored in the page struct.

Therefore it is possible to retune the order while the cache is in operation.

This means you can run an initcall after all cpus have been brought up to
set the order and number of objects in a slab page differently.

The older slab pages will continue to exist with the old orders until they
are freed.
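
A sketch of that idea (the recalculation helper is hypothetical; a real patch
would reuse SLUB's calculate_sizes()/oo_make() internals, and slab_mutex and
slab_caches come from mm/slab.h):

        static int __init slub_retune_orders(void)
        {
                struct kmem_cache *s;

                mutex_lock(&slab_mutex);
                list_for_each_entry(s, &slab_caches, list)
                        recalc_slab_order(s, num_online_cpus());  /* hypothetical */
                mutex_unlock(&slab_mutex);
                return 0;
        }
        late_initcall(slub_retune_orders);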



Re: [PATCH 1/1] arm64/sparsemem: reduce SECTION_SIZE_BITS

2021-01-21 Thread Christoph Lameter
On Wed, 20 Jan 2021, Sudarshan Rajagopalan wrote:

> But there are other problems in reducing SECTION_SIZE_BIT. Reducing it by too
> much would over populate /sys/devices/system/memory/ and also consume too many
> page->flags bits in the !vmemmap case. Also section size needs to be multiple
> of 128MB to have PMD based vmemmap mapping with CONFIG_ARM64_4K_PAGES.

There is also the issue of requiring more space in the TLB cache with
smaller page sizes. Or does ARM resolve these into smaller TLB entries
anyway (going on my x86 know-how here)? Anyway, if there are only a few
TLB entries then the effect could be significant.



Re: [RFC PATCH v0] mm/slub: Let number of online CPUs determine the slub page order

2021-01-21 Thread Christoph Lameter
On Thu, 21 Jan 2021, Bharata B Rao wrote:

> > The problem is that calculate_order() is called a number of times
> > before secondaries CPUs are booted and it returns 1 instead of 224.
> > This makes the use of num_online_cpus() irrelevant for those cases
> >
> > After adding in my command line "slub_min_objects=36" which equals to
> > 4 * (fls(num_online_cpus()) + 1) with a correct num_online_cpus == 224
> > , the regression diseapears:
> >
> > 9 iterations of hackbench -l 16000 -g 16: 3.201sec (+/- 0.90%)
>
> Should we have switched to num_present_cpus() rather than
> num_online_cpus()? If so, the below patch should address the
> above problem.

There is certainly an initcall after secondaries are booted where we could
redo the calculate_order?

Or the num_online_cpus needs to be up to date earlier. Why does this issue
not occur on x86? Does x86 have an up to date num_online_cpus earlier?




Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-18 Thread Christoph Lameter
On Mon, 18 Jan 2021, Michal Hocko wrote:

> > Hm this would be similar to recommending a periodical echo > drop_caches
> > operation. We actually discourage from that (and yeah, some tools do that, 
> > and
> > we now report those in dmesg). I believe the kernel should respond to memory
> > pressure and not OOM prematurely by itself, including SLUB.
>
> Absolutely agreed! Partial caches are a very deep internal
> implementation detail of the allocator and admin has no bussiness into
> fiddling with that. This would only lead to more harm than good.
> Comparision to drop_caches is really exact!

Really? The maximum allocation here has an upper boundary that depends on
the number of possible partial per cpu slabs. There is a worst case
scenario that is not nice and wastes some memory but it is not an OOM
situation and the system easily recovers from it.

The slab shrinking is not needed but if you are concerned about reclaiming
more memory right now then I guess you may want to run the slab shrink
operation.

Dropping the page cache is bad? Well sometimes you want more free memory
due to a certain operation that needs to be started and where you do not
want the overhead of page cache processing.

You can go crazy and expect magical things from either operation. True.







Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-14 Thread Christoph Lameter
On Wed, 13 Jan 2021, Jann Horn wrote:

> Some brainstorming:
>
> Maybe you could have an atomic counter in kmem_cache_cpu that tracks
> the number of empty frozen pages that are associated with a specific
> CPU? So the freeing slowpath would do its cmpxchg_double, and if the


The latencies of these functions are so low that any additional counter
will have significant performance impacts. An atomic counter would be way
out there.

> You could additionally have a plain percpu counter, not tied to the

The performance critical counters are already all per cpu. I enhanced the
percpu subsystem specifically to support latency critical operations in
the fast path of the slab allocators.


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-12 Thread Christoph Lameter
On Tue, 12 Jan 2021, Jann Horn wrote:

> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Well yeah all true. There is however a slabinfo tool that has an -s option
to shrink all slabs.

slabinfo -s

So you could put that somewhere that executes if the system is
idle or put it into cron or so.

This is a heavy-handed operation though. You could switch off partial cpu
slabs completely to avoid the issue. Nothing came to mind in the past on
how to solve this without sacrificing significant performance or causing
some system processing at random times while the shrinking runs. No one
wants any of that.

Being able to do it from userspace cleanly shifts the burden to userspace  ;-)
You can even do random heavy system processing from user space if you put
a sleep there for random seconds between the shrinking runs.



Re: [PATCH] MAINTAINERS: add myself as slab allocators maintainer

2021-01-08 Thread Christoph Lameter
I am ok with you as a slab maintainer. I have seen some good work from
you.

Acked-by: Christoph Lameter 



Re: [RFC 0/3] mm, slab, slub: remove cpu and memory hotplug locks

2021-01-06 Thread Christoph Lameter
On Wed, 6 Jan 2021, Vlastimil Babka wrote:

> rather accept some wasted memory in scenarios that should be rare anyway (full
> memory hot remove), as we do the same in other contexts already. It's all RFC
> for now, as I might have missed some reason why it's not safe.

Looks good to me. My only concern is a kernel that has hotplug disabled.
The current code allows the online/offline checks to be optimized away.

Can this patch be enhanced to do the same?



Re: [PATCH] mm/slub: fix -Wunused-function compiler warnings

2019-09-17 Thread Christoph Lameter
On Tue, 17 Sep 2019, David Rientjes wrote:

> Acked-by: David Rientjes 

Ditto



Re: [PATCH] mm/slub: fix -Wunused-function compiler warnings

2019-09-17 Thread Christoph Lameter
On Tue, 17 Sep 2019, David Rientjes wrote:

> On Tue, 17 Sep 2019, Qian Cai wrote:
>
> > tid_to_cpu() and tid_to_event() are only used in note_cmpxchg_failure()
> > when SLUB_DEBUG_CMPXCHG=y, so when SLUB_DEBUG_CMPXCHG=n by default,
> > Clang will complain that those unused functions.
> >
> > Signed-off-by: Qian Cai 
>
> Acked-by: David Rientjes 

Ditto



[RFC 1/7] slub: Replace ctor field with ops field in /sys/slab/*

2018-12-20 Thread Christoph Lameter
Create an ops field in /sys/slab/*/ops to contain all the callback
operations defined for a slab cache. This will be used to display
the additional callbacks that will be defined soon to enable
defragmentation.

Display the existing ctor callback in the ops field's contents.

Signed-off-by: Christoph Lameter 

---
 mm/slub.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -4994,13 +4994,18 @@ static ssize_t cpu_partial_store(struct
 }
 SLAB_ATTR(cpu_partial);
 
-static ssize_t ctor_show(struct kmem_cache *s, char *buf)
+static ssize_t ops_show(struct kmem_cache *s, char *buf)
 {
+   int x = 0;
+
if (!s->ctor)
return 0;
-   return sprintf(buf, "%pS\n", s->ctor);
+
+   if (s->ctor)
+   x += sprintf(buf + x, "ctor : %pS\n", s->ctor);
+   return x;
 }
-SLAB_ATTR_RO(ctor);
+SLAB_ATTR_RO(ops);
 
 static ssize_t aliases_show(struct kmem_cache *s, char *buf)
 {
@@ -5413,7 +5418,7 @@ static struct attribute *slab_attrs[] =
&objects_partial_attr.attr,
&partial_attr.attr,
&cpu_slabs_attr.attr,
-   &ctor_attr.attr,
+   &ops_attr.attr,
&aliases_attr.attr,
&align_attr.attr,
&hwcache_align_attr.attr,



[RFC 6/7] slub: Extend slabinfo to support -D and -F options

2018-12-20 Thread Christoph Lameter
-F lists caches that support moving and defragmentation

-C lists caches that use a ctor.

Change field names for defrag_ratio and remote_node_defrag_ratio.

Add determination of the allocation ratio for a slab. The allocation ratio
is the percentage of the available object slots that are in use.

Signed-off-by: Christoph Lameter 

---
 Documentation/vm/slabinfo.c |   48 +++-
 1 file changed, 43 insertions(+), 5 deletions(-)

Index: linux/tools/vm/slabinfo.c
===
--- linux.orig/tools/vm/slabinfo.c
+++ linux/tools/vm/slabinfo.c
@@ -33,6 +33,8 @@ struct slabinfo {
unsigned int hwcache_align, object_size, objs_per_slab;
unsigned int sanity_checks, slab_size, store_user, trace;
int order, poison, reclaim_account, red_zone;
+   int movable, ctor;
+   int defrag_ratio, remote_node_defrag_ratio;
unsigned long partial, objects, slabs, objects_partial, objects_total;
unsigned long alloc_fastpath, alloc_slowpath;
unsigned long free_fastpath, free_slowpath;
@@ -67,6 +69,8 @@ int show_report;
 int show_alias;
 int show_slab;
 int skip_zero = 1;
+int show_movable;
+int show_ctor;
 int show_numa;
 int show_track;
 int show_first_alias;
@@ -109,14 +113,16 @@ static void fatal(const char *x, ...)
 
 static void usage(void)
 {
-   printf("slabinfo 4/15/2011. (c) 2007 sgi/(c) 2011 Linux Foundation.\n\n"
-   "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
+   printf("slabinfo 4/15/2017. (c) 2007 sgi/(c) 2011 Linux Foundation/(c) 
2017 Jump Trading LLC.\n\n"
+   "slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n"
"-a|--aliases   Show aliases\n"
"-A|--activity  Most active slabs first\n"
"-d|--debug= Set/Clear Debug options\n"
+   "-C|--ctor  Show slabs with ctors\n"
"-D|--display-activeSwitch line format to activity\n"
"-e|--empty Show empty slabs\n"
"-f|--first-alias   Show first alias\n"
+   "-F|--movable   Show caches that support movable 
objects\n"
"-h|--help  Show usage information\n"
"-i|--inverted  Inverted list\n"
"-l|--slabs Show slabs\n"
@@ -369,7 +375,7 @@ static void slab_numa(struct slabinfo *s
return;
 
if (!line) {
-   printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
+   printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab");
for(node = 0; node <= highest_node; node++)
printf(" %4d", node);
printf("\n--");
@@ -378,6 +384,7 @@ static void slab_numa(struct slabinfo *s
printf("\n");
}
printf("%-21s ", mode ? "All slabs" : s->name);
+   printf("%3d ", s->remote_node_defrag_ratio);
for(node = 0; node <= highest_node; node++) {
char b[20];
 
@@ -535,6 +542,8 @@ static void report(struct slabinfo *s)
printf("** Slabs are destroyed via RCU\n");
if (s->reclaim_account)
printf("** Reclaim accounting active\n");
+   if (s->movable)
+   printf("** Defragmentation at %d%%\n", s->defrag_ratio);
 
printf("\nSizes (bytes) Slabs  Debug
Memory\n");

printf("\n");
@@ -585,6 +594,12 @@ static void slabcache(struct slabinfo *s
if (show_empty && s->slabs)
return;
 
+   if (show_movable && !s->movable)
+   return;
+
+   if (show_ctor && !s->ctor)
+   return;
+
if (sort_loss == 0)
store_size(size_str, slab_size(s));
else
@@ -599,6 +614,10 @@ static void slabcache(struct slabinfo *s
*p++ = '*';
if (s->cache_dma)
*p++ = 'd';
+   if (s->movable)
+   *p++ = 'F';
+   if (s->ctor)
+   *p++ = 'C';
if (s->hwcache_align)
*p++ = 'A';
if (s->poison)
@@ -633,7 +652,8 @@ static void slabcache(struct slabinfo *s
printf("%-21s %8ld %7d %15s %14s %4d %1d %3ld %3ld %s\n",
s->name, s->objects, s->object_size, size_str, dist_str,
s->objs_per_slab, s->order,
-   s

[RFC 7/7] xarray: Implement migration function for objects

2018-12-20 Thread Christoph Lameter
Implement functions to migrate objects. This is based on
initial code by Matthew Wilcox and was modified to work with
slab object migration.

Signed-off-by: Christoph Lameter 

Index: linux/lib/radix-tree.c
===
--- linux.orig/lib/radix-tree.c
+++ linux/lib/radix-tree.c
@@ -1613,6 +1613,18 @@ static int radix_tree_cpu_dead(unsigned
return 0;
 }
 
+
+extern void xa_object_migrate(void *tree_node, int numa_node);
+
+static void radix_tree_migrate(struct kmem_cache *s, void **objects, int nr,
+   int node, void *private)
+{
+   int i;
+
+   for (i = 0; i < nr; i++)
+   xa_object_migrate(objects[i], node);
+}
+
+void xa_object_migrate(struct xa_node *node, int numa_node)
+{
+   struct xarray *xa = READ_ONCE(node->array);
+   void __rcu **slot;
+   struct xa_node *new_node;
+   int i;
+
+   /* Freed or not yet in tree then skip */
+   if (!xa || xa == XA_FREE_MARK)
+   return;
+
+   new_node = kmem_cache_alloc_node(radix_tree_node_cachep, GFP_KERNEL, 
numa_node);
+
+   xa_lock_irq(xa);
+
+   /* Check again. */
+   if (xa != node->array || !list_empty(&node->private_list)) {
+   node = new_node;
+   goto unlock;
+   }
+
+   memcpy(new_node, node, sizeof(struct xa_node));
+
+   /* Move pointers to new node */
+   INIT_LIST_HEAD(&new_node->private_list);
+   for (i = 0; i < XA_CHUNK_SIZE; i++) {
+   void *x = xa_entry_locked(xa, new_node, i);
+
+   if (xa_is_node(x))
+   rcu_assign_pointer(xa_to_node(x)->parent, new_node);
+   }
+   if (!new_node->parent)
+   slot = &xa->xa_head;
+   else
+   slot = &xa_parent_locked(xa, new_node)->slots[new_node->offset];
+   rcu_assign_pointer(*slot, xa_mk_node(new_node));
+
+unlock:
+   xa_unlock_irq(xa);
+   xa_node_free(node);
+   rcu_barrier();
+   return;
+
+}
+
 #ifdef XA_DEBUG
 void xa_dump_node(const struct xa_node *node)
 {



[RFC 4/7] slub: Sort slab cache list and establish maximum objects for defrag slabs

2018-12-20 Thread Christoph Lameter
It is advantageous to have all defragmentable slabs together at the
beginning of the list of slabs so that there is no need to scan the
complete list. Put defragmentable caches first when adding a slab cache
and others last.

Determine the maximum number of objects in defragmentable slabs. This allows
the sizing of the array holding refs to objects in a slab later.

Signed-off-by: Christoph Lameter 

---
 mm/slub.c |   26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -196,6 +196,9 @@ static inline bool kmem_cache_has_cpu_pa
 /* Use cmpxchg_double */
 #define __CMPXCHG_DOUBLE   ((slab_flags_t __force)0x4000U)
 
+/* Maximum objects in defragmentable slabs */
+static unsigned int max_defrag_slab_objects;
+
 /*
  * Tracking user of a slab.
  */
@@ -4310,22 +4313,45 @@ int __kmem_cache_create(struct kmem_cach
return err;
 }
 
+/*
+ * Allocate a slab scratch space that is sufficient to keep at least
+ * max_defrag_slab_objects pointers to individual objects and also a bitmap
+ * for max_defrag_slab_objects.
+ */
+static inline void *alloc_scratch(void)
+{
+   return kmalloc(max_defrag_slab_objects * sizeof(void *) +
+   BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
+   GFP_KERNEL);
+}
+
 void kmem_cache_setup_mobility(struct kmem_cache *s,
kmem_isolate_func isolate, kmem_migrate_func migrate)
 {
+   int max_objects = oo_objects(s->max);
+
/*
 * Defragmentable slabs must have a ctor otherwise objects may be
 * in an undetermined state after they are allocated.
 */
BUG_ON(!s->ctor);
+
+   mutex_lock(&slab_mutex);
+
s->isolate = isolate;
s->migrate = migrate;
+
/*
 * Sadly serialization requirements currently mean that we have
 * to disable fast cmpxchg based processing.
 */
s->flags &= ~__CMPXCHG_DOUBLE;
 
+   list_move(&s->list, &slab_caches);  /* Move to top */
+   if (max_objects > max_defrag_slab_objects)
+   max_defrag_slab_objects = max_objects;
+
+   mutex_unlock(&slab_mutex);
 }
 EXPORT_SYMBOL(kmem_cache_setup_mobility);
 
Index: linux/mm/slab_common.c
===
--- linux.orig/mm/slab_common.c
+++ linux/mm/slab_common.c
@@ -393,7 +393,7 @@ static struct kmem_cache *create_cache(c
goto out_free_cache;
 
s->refcount = 1;
-   list_add(&s->list, &slab_caches);
+   list_add_tail(&s->list, &slab_caches);
memcg_link_cache(s);
 out:
if (err)



[RFC 3/7] slub: Add isolate() and migrate() methods

2018-12-20 Thread Christoph Lameter
Add the two methods needed for moving objects and enable the
display of the callbacks via the /sys/kernel/slab interface.

Add documentation explaining the use of these methods and the prototypes
for slab.h. Add functions to setup the callbacks method for a slab cache.

Add empty functions for SLAB/SLOB. The API is generic so it
could be theoretically implemented for these allocators as well.

Signed-off-by: Christoph Lameter 

---
 include/linux/slab.h |   50 +++
 include/linux/slub_def.h |3 ++
 mm/slub.c|   29 ++-
 3 files changed, 81 insertions(+), 1 deletion(-)

Index: linux/include/linux/slub_def.h
===
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
gfp_t allocflags;   /* gfp flags to use on each alloc */
int refcount;   /* Refcount for slab cache destroy */
void (*ctor)(void *);
+   kmem_isolate_func *isolate;
+   kmem_migrate_func *migrate;
+
unsigned int inuse; /* Offset to metadata */
unsigned int align; /* Alignment */
unsigned int red_left_pad;  /* Left redzone padding size */
Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3498,7 +3498,6 @@ static int calculate_sizes(struct kmem_c
else
s->flags &= ~__OBJECT_POISON;
 
-
/*
 * If we are Redzoning then check if there is some space between the
 * end of the object and the free pointer. If not then add an
@@ -4311,6 +4310,25 @@ int __kmem_cache_create(struct kmem_cach
return err;
 }
 
+void kmem_cache_setup_mobility(struct kmem_cache *s,
+   kmem_isolate_func isolate, kmem_migrate_func migrate)
+{
+   /*
+* Defragmentable slabs must have a ctor otherwise objects may be
+* in an undetermined state after they are allocated.
+*/
+   BUG_ON(!s->ctor);
+   s->isolate = isolate;
+   s->migrate = migrate;
+   /*
+* Sadly serialization requirements currently mean that we have
+* to disable fast cmpxchg based processing.
+*/
+   s->flags &= ~__CMPXCHG_DOUBLE;
+
+}
+EXPORT_SYMBOL(kmem_cache_setup_mobility);
+
 void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 {
struct kmem_cache *s;
@@ -5004,6 +5022,20 @@ static ssize_t ops_show(struct kmem_cach
 
if (s->ctor)
x += sprintf(buf + x, "ctor : %pS\n", s->ctor);
+
+   if (s->isolate) {
+   x += sprintf(buf + x, "isolate : ");
+   x += sprint_symbol(buf + x,
+   (unsigned long)s->isolate);
+   x += sprintf(buf + x, "\n");
+   }
+
+   if (s->migrate) {
+   x += sprintf(buf + x, "migrate : ");
+   x += sprint_symbol(buf + x,
+   (unsigned long)s->migrate);
+   x += sprintf(buf + x, "\n");
+   }
return x;
 }
 SLAB_ATTR_RO(ops);
Index: linux/include/linux/slab.h
===
--- linux.orig/include/linux/slab.h
+++ linux/include/linux/slab.h
@@ -153,6 +153,68 @@ void memcg_deactivate_kmem_caches(struct
 void memcg_destroy_kmem_caches(struct mem_cgroup *);
 
 /*
+ * Function prototypes passed to kmem_cache_setup_mobility() to enable mobile
+ * objects and targeted reclaim in slab caches.
+ */
+
+/*
+ * kmem_cache_isolate_func() is called with locks held so that the slab
+ * objects cannot be freed. We are in an atomic context and no slab
+ * operations may be performed. The purpose of kmem_cache_isolate_func()
+ * is to pin the objects so that they cannot be freed until
+ * kmem_cache_migrate_func() has processed them. This may be accomplished
+ * by increasing the refcount or setting a flag.
+ *
+ * Parameters passed are the number of objects to process and an array of
+ * pointers to objects which are intended to be moved.
+ *
+ * Returns a pointer that is passed to the migrate function. If any objects
+ * cannot be touched at this point then the pointer may indicate a
+ * failure and then the migration function can simply remove the references
+ * that were already obtained. The private data could be used to track
+ * the objects that were already pinned.
+ *
+ * The object pointer array passed is also passed to kmem_cache_migrate().
+ * The function may remove objects from the array by setting pointers to
+ * NULL. This is useful if we can determine that an object is being freed
+ * because kmem_cache_isolate_func() was called when the subsystem
+ * was calling kmem_cache_free().
+ * In that case it is not necessary to increase the r

[RFC 5/7] slub: Slab defrag core

2018-12-20 Thread Christoph Lameter
Slab defragmentation may occur:

1. Unconditionally when kmem_cache_shrink is called on a slab cache by the
   kernel calling kmem_cache_shrink.

2. Through the use of the slabinfo command.

3. Per node defrag conditionally when kmem_cache_defrag() is called
   (can be called from reclaim code with a later patch).

   Defragmentation is only performed if the fragmentation of the slab
   is lower than the specified percentage. Fragmentation ratios are measured
   by calculating the percentage of objects in use compared to the total
   number of objects that the slab page can accommodate.

   The scanning of slab caches is optimized because the
   defragmentable slabs come first on the list. Thus we can terminate scans
   on the first slab encountered that does not support defragmentation.

   kmem_cache_defrag() takes a node parameter. This can either be -1 if
   defragmentation should be performed on all nodes, or a node number.

A couple of functions must be setup via a call to kmem_cache_setup_defrag()
in order for a slabcache to support defragmentation. These are

kmem_defrag_isolate_func (void *isolate(struct kmem_cache *s, void **objects, 
int nr))

Must stabilize the objects and ensure that they will not be freed until
the migration function is complete. SLUB guarantees that
the objects are still allocated. However, other threads may be blocked
in slab_free() attempting to free objects in the slab. These may succeed
as soon as isolate() returns to the slab allocator. The function must
be able to detect such situations and void the attempts to free such
objects (by for example voiding the corresponding entry in the objects
array).

No slab operations may be performed in isolate(). Interrupts
are disabled. What can be done is very limited. The slab lock
for the page that contains the object is taken. Any attempt to perform
a slab operation may lead to a deadlock.

kmem_defrag_isolate_func returns a private pointer that is passed to
kmem_defrag_kick_func(). Should we be unable to obtain all references
then that pointer may indicate to the kick() function that it should
not attempt any object removal or move but simply undo the measure
that were used to stabilize the object.

kmem_defrag_migrate_func (void migrate(struct kmem_cache *, void **objects, int 
nr,
int node, void *get_result))

After SLUB has stabilized the objects in a
slab it will then drop all locks and use migrate() to move objects out
of the slab. The existence of the object is guaranteed by virtue of
the earlier obtained references via kmem_defrag_isolate_func(). The
callback may perform any slab operation since no locks are held at
the time of call.

The callback should remove the object from the slab in some way. This
may be accomplished by reclaiming the object and then running
kmem_cache_free() or reallocating it and then running
kmem_cache_free(). Reallocation is advantageous because the partial
slabs were just sorted to have the partial slabs with the most objects
first. Reallocation is likely to result in filling up a slab in
addition to freeing up one slab. A filled up slab can also be removed
from the partial list. So there could be a double effect.

kmem_defrag_migrate_func() does not return a result. SLUB will check
the number of remaining objects in the slab. If all objects were
removed then the slab is freed and we have reduced the overall
fragmentation of the slab cache.
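
A hedged sketch of what a cache using these callbacks might look like (the
object type and the relocation helper are invented; only the callback
signatures follow the description above, and registration uses
kmem_cache_setup_mobility() from earlier in this series):

        struct my_object {
                atomic_t refcount;
                /* ... payload ... */
        };

        static void *my_isolate(struct kmem_cache *s, void **objects, int nr)
        {
                int i;

                /* pin every object so it cannot vanish before migrate() runs */
                for (i = 0; i < nr; i++) {
                        struct my_object *o = objects[i];

                        if (!atomic_inc_not_zero(&o->refcount))
                                objects[i] = NULL;      /* already being freed */
                }
                return NULL;                            /* no private state */
        }

        static void my_migrate(struct kmem_cache *s, void **objects, int nr,
                               int node, void *private)
        {
                int i;

                /* locks are dropped here, so reallocating/freeing is allowed */
                for (i = 0; i < nr; i++)
                        if (objects[i])
                                my_object_relocate(objects[i], node);   /* hypothetical */
        }

        static int __init my_cache_mobility_init(void)
        {
                kmem_cache_setup_mobility(my_cache, my_isolate, my_migrate);
                return 0;
        }
        late_initcall(my_cache_mobility_init);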

Signed-off-by: Christoph Lameter 

---
 include/linux/slab.h |3 
 mm/slub.c|  265 ---
 2 files changed, 215 insertions(+), 53 deletions(-)

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -351,6 +351,12 @@ static __always_inline void slab_lock(st
bit_spin_lock(PG_locked, &page->flags);
 }
 
+static __always_inline int slab_trylock(struct page *page)
+{
+   VM_BUG_ON_PAGE(PageTail(page), page);
+   return bit_spin_trylock(PG_locked, &page->flags);
+}
+
 static __always_inline void slab_unlock(struct page *page)
 {
VM_BUG_ON_PAGE(PageTail(page), page);
@@ -3946,79 +3952,6 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
-#define SHRINK_PROMOTE_MAX 32
-
-/*
- * kmem_cache_shrink discards empty slabs and promotes the slabs filled
- * up most to the head of the partial lists. New allocations will then
- * fill those up and thus they can be removed from the partial lists.
- *
- * The slabs with the least items are placed last. This results in them
- * being allocated from last increasing the chance that the last objects
- * are freed in them.
- */
-int __kmem_cache_shrink(struct kmem

[RFC 2/7] slub: Add defrag_ratio field and sysfs support

2018-12-20 Thread Christoph Lameter
"defrag_ratio" is used to set the threshold at which defragmentation
should be attempted on a slab page.

"defrag_ratio" is percentage in the range of 1 - 100. If more than
that percentage of slots in a slab page are unused the the slab page
will become subject to defragmentation.

Add a defrag ratio field and set it to 30% by default. A limit of 30% specifies
that less than 3 out of 10 available slots for objects need to be leftover
before slab defragmentation will be attempted on the remaining objects.

Signed-off-by: Christoph Lameter 

---
 Documentation/ABI/testing/sysfs-kernel-slab |   13 +
 include/linux/slub_def.h|6 ++
 mm/slub.c   |   23 +++
 3 files changed, 42 insertions(+)

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3628,6 +3628,7 @@ static int kmem_cache_open(struct kmem_c
 
set_cpu_partial(s);
 
+   s->defrag_ratio = 30;
 #ifdef CONFIG_NUMA
s->remote_node_defrag_ratio = 1000;
 #endif
@@ -5113,6 +5114,27 @@ static ssize_t destroy_by_rcu_show(struc
 }
 SLAB_ATTR_RO(destroy_by_rcu);
 
+static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf)
+{
+   return sprintf(buf, "%d\n", s->defrag_ratio);
+}
+
+static ssize_t defrag_ratio_store(struct kmem_cache *s,
+   const char *buf, size_t length)
+{
+   unsigned long ratio;
+   int err;
+
+   err = kstrtoul(buf, 10, &ratio);
+   if (err)
+   return err;
+
+   if (ratio < 100)
+   s->defrag_ratio = ratio;
+   return length;
+}
+SLAB_ATTR(defrag_ratio);
+
 #ifdef CONFIG_SLUB_DEBUG
 static ssize_t slabs_show(struct kmem_cache *s, char *buf)
 {
@@ -5437,6 +5459,7 @@ static struct attribute *slab_attrs[] =
&validate_attr.attr,
&alloc_calls_attr.attr,
&free_calls_attr.attr,
+   &defrag_ratio_attr.attr,
 #endif
 #ifdef CONFIG_ZONE_DMA
&cache_dma_attr.attr,
Index: linux/Documentation/ABI/testing/sysfs-kernel-slab
===
--- linux.orig/Documentation/ABI/testing/sysfs-kernel-slab
+++ linux/Documentation/ABI/testing/sysfs-kernel-slab
@@ -180,6 +180,19 @@ Description:
list.  It can be written to clear the current count.
Available when CONFIG_SLUB_STATS is enabled.
 
+What:  /sys/kernel/slab/cache/defrag_ratio
+Date:  December 2018
+KernelVersion: 4.18
+Contact:   Christoph Lameter 
+   Pekka Enberg ,
+Description:
+   The defrag_ratio file allows control of how aggressively
+   slab fragmentation reduction works at reclaiming objects from
+   sparsely populated slabs. This is a percentage. If a slab
+   has more than this percentage of available objects then reclaim
+   will attempt to reclaim objects so that the whole slab
+   page can be freed. The default is 30%.
+
 What:  /sys/kernel/slab/cache/deactivate_to_tail
 Date:  February 2008
 KernelVersion: 2.6.25
Index: linux/include/linux/slub_def.h
===
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -104,6 +104,12 @@ struct kmem_cache {
unsigned int red_left_pad;  /* Left redzone padding size */
const char *name;   /* Name (only for display!) */
struct list_head list;  /* List of slab caches */
+   int defrag_ratio;   /*
+* Ratio used to check the percentage of
+* objects allocate in a slab page.
+* If less than this ratio is allocated
+* then reclaim attempts are made.
+*/
 #ifdef CONFIG_SYSFS
struct kobject kobj;/* For sysfs */
struct work_struct kobj_remove_work;



[RFC 0/7] Slab object migration for xarray V2

2018-12-20 Thread Christoph Lameter
V1->V2:
- Works now on top of 4.20-rc7
- A couple of bug fixes


This is a patchset on top of Matthew Wilcox's Xarray code and implements
slab object migration for xarray nodes. The migration is integrated into
the defragmentation and shrinking logic of the slab allocator.

Defragmentation will ensure that all xarray slab pages have
less objects available than specified by the slab defrag ratio.

Slab shrinking will create a slab cache with optimal object
density. Only one slab page will have available objects per node.

To test apply this patchset and run a workload that uses lots of radix tree 
objects


Then go to

/sys/kernel/slab/radix_tree_node

Inspect the number of total objects that the slab can handle

cat total_objects

qmdr:/sys/kernel/slab/radix_tree_node# cat objects
868 N0=448 N1=168 N2=56 N3=196

And the number of slab pages used for those

cat slabs

qmdr:/sys/kernel/slab/radix_tree_node# cat slabs
31 N0=16 N1=6 N2=2 N3=7


Perform a cache shrink operation

echo 1 >shrink


Now see how the slab has changed:

qmdr:/sys/kernel/slab/radix_tree_node# cat slabs
30 N0=15 N1=6 N2=2 N3=7
qmdr:/sys/kernel/slab/radix_tree_node# cat objects
713 N0=349 N1=141 N2=52 N3=171


This is just a barebones approach using a special mode
of the slab migration patchset that does not require refcounts.



If this is acceptable then additional functionality can be added:

1. Migration of objects to a specific node. Not sure how to implement
   that. Using another sysfs field?

2. Dispersion of objects across all nodes (MPOL_INTERLEAVE)

3. Subsystems can request to move an object to a specific node.
   How to design such functionality best?

4. Tying into the page migration and page defragmentation logic so
   that so far unmovable pages that are in the way of creating a
   contiguous block of memory will become movable.
   This would mean checking for slab pages in the migration logic
   and calling slab to see if it can move the page by migrating
   all objects.

This is only possible for xarray for now but it would be worthwhile
to extend this to dentries and inodes.



Re: [PATCH] vmemmap, memory_hotplug: fallback to base pages for vmmap

2017-07-11 Thread Christoph Lameter
On Tue, 11 Jul 2017, Johannes Weiner wrote:

> Hi Michael,
>
> On Tue, Jul 11, 2017 at 04:25:58PM +0200, Michal Hocko wrote:
> > Ohh, scratch that. The patch is bogus. I have completely missed that
> > vmemmap_populate_hugepages already falls back to
> > vmemmap_populate_basepages. I have to revisit the bug report I have
> > received to see what happened apart from the allocation warning. Maybe
> > we just want to silent that warning.
>
> Yep, this should be fixed in 8e2cdbcb86b0 ("x86-64: fall back to
> regular page vmemmap on allocation failure").
>
> I figure it's good to keep some sort of warning there, though, as it
> could have performance implications when we fall back to base pages.

If someone gets to work on it then maybe also add giant page support?

We already have systems with terabytes of memory and one 1G vmemmap page
would map 128G of memory leading to a significant reduction in the use of
TLBs.



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-11 Thread Christoph Lameter
On Tue, 11 Jul 2017, Frederic Weisbecker wrote:

> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -787,6 +787,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct 
> > tick_sched *ts,
> > if (!ts->tick_stopped) {
> > calc_load_nohz_start();
> > cpu_load_update_nohz_start();
> > +   quiet_vmstat();
>
> This patch seems to make sense. Christoph?

Ok makes sense to me too. Was never entirely sure where the proper place
would be to call it.


Re: [PATCH] slub: make sure struct kmem_cache_node is initialized before publication

2017-07-10 Thread Christoph Lameter
On Mon, 10 Jul 2017, Alexander Potapenko wrote:

> >> Could the slab maintainers please take a look at these and also have a
> >> think about Alexander's READ_ONCE/WRITE_ONCE question?
> >
> > Was I cced on these?
> I've asked Andrew about READ_ONCE privately.

Please post to a mailing list and cc the maintainers?

> Since unfreeze_partials() sees uninitialized value of n->list_lock, I
> was suspecting there's a data race between unfreeze_partials() and
> init_kmem_cache_nodes().

I have not seen the details but I would suspect that this is related to
early boot issues? The list lock is initialized upon slab creation and at
that time no one can get to the kmem_cache structure.

There are a couple of boot time slabs that will temporarily be available
and released upon boot completion.

> If so, reads and writes to s->node[node] must be acquire/release
> atomics (not actually READ_ONCE/WRITE_ONCE, but
> smp_load_acquire/smp_store_release).

Can we figure the reason for these out before proposing fixes?



Re: [PATCH] slub: make sure struct kmem_cache_node is initialized before publication

2017-07-07 Thread Christoph Lameter
On Fri, 7 Jul 2017, Andrew Morton wrote:

> On Fri,  7 Jul 2017 10:34:08 +0200 Alexander Potapenko  
> wrote:
>
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3389,8 +3389,8 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> > return 0;
> > }
> >
> > -   s->node[node] = n;
> > init_kmem_cache_node(n);
> > +   s->node[node] = n;
> > }
> > return 1;
> >  }
>
> If this matters then I have bad feelings about free_kmem_cache_nodes():

At creation time the kmem_cache structure is private and no one can run a
free operation.

> Inviting a use-after-free?  I guess not, as there should be no way
> to look up these items at this stage.

Right.

> Could the slab maintainers please take a look at these and also have a
> think about Alexander's READ_ONCE/WRITE_ONCE question?

Was I cced on these?



Re: [PATCH v3] mm: Add SLUB free list pointer obfuscation

2017-07-07 Thread Christoph Lameter
On Fri, 7 Jul 2017, Kees Cook wrote:

> If we also added a >0 offset, that would make things even less
> deterministic. Though I wonder if it would make the performance impact
> higher. The XOR patch right now is very light.

There would be barely any performance impact if you keep the offset within
a cacheline since most objects start on a cacheline boundary. The
processor has to fetch the cacheline anyway.
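
A hedged sketch of that suggestion (not code from this thread): constrain a
randomized free-pointer offset to the object's first cache line so no extra
line has to be touched.

        /* pick a pointer-aligned offset within the first cache line */
        static unsigned int random_fp_offset(struct kmem_cache *s)
        {
                unsigned int slots = min_t(unsigned int, s->object_size,
                                           cache_line_size()) / sizeof(void *);

                return (get_random_u32() % slots) * sizeof(void *);
        }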


Re: [PATCH v3] mm: Add SLUB free list pointer obfuscation

2017-07-07 Thread Christoph Lameter
On Thu, 6 Jul 2017, Kees Cook wrote:

> Right. This is about blocking the escalation of attack capability. For
> slab object overflow flaws, there are mainly two exploitation methods:
> adjacent allocated object overwrite and adjacent freed object
> overwrite (i.e. a freelist pointer overwrite). The first attack
> depends heavily on which slab cache (and therefore which structures)
> has been exposed by the bug. It's a very narrow and specific attack
> method. The freelist attack is entirely general purpose since it
> provides a reliable way to gain arbitrary write capabilities.
> Protecting against that attack greatly narrows the options for an
> attacker which makes attacks more expensive to create and possibly
> less reliable (and reliability is crucial to successful attacks).


The simplest thing here is to vary the location of the freelist pointer.
That way you cannot hit the freepointer in a deterministic way.

The freepointer is put at offset 0 right now. But you could put it
anywhere in the object.

Index: linux/mm/slub.c
===
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3467,7 +3467,8 @@ static int calculate_sizes(struct kmem_c
 */
s->offset = size;
size += sizeof(void *);
-   }
+   } else
+   s->offset = s->size / sizeof(void *) * 

 #ifdef CONFIG_SLUB_DEBUG
if (flags & SLAB_STORE_USER)


Re: [PATCH v3] mm: Add SLUB free list pointer obfuscation

2017-07-06 Thread Christoph Lameter
On Thu, 6 Jul 2017, Kees Cook wrote:

> On Thu, Jul 6, 2017 at 6:43 AM, Christoph Lameter <c...@linux.com> wrote:
> > On Wed, 5 Jul 2017, Kees Cook wrote:
> >
> >> @@ -3536,6 +3565,9 @@ static int kmem_cache_open(struct kmem_cache *s, 
> >> unsigned long flags)
> >>  {
> >>   s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
> >>   s->reserved = 0;
> >> +#ifdef CONFIG_SLAB_FREELIST_HARDENED
> >> + s->random = get_random_long();
> >> +#endif
> >>
> >>   if (need_reserve_slab_rcu && (s->flags & SLAB_TYPESAFE_BY_RCU))
> >>   s->reserved = sizeof(struct rcu_head);
> >>
> >
> > So if an attacker knows the internal structure of data then he can simply
> > dereference page->kmem_cache->random to decode the freepointer.
>
> That requires a series of arbitrary reads. This is protecting against
> attacks that use an adjacent slab object write overflow to write the
> freelist pointer. This internal structure is very reliable, and has
> been the basis of freelist attacks against the kernel for a decade.

These reads are not arbitrary. You can usually calculate the page struct
address easily from the address and then do a couple of loads to get
there.

Ok so you get rid of the old attacks because we did not have that
hardening in effect when they designed their approaches?

> It is a probabilistic defense, but then so is the stack protector.
> This is a similar defense; while not perfect it makes the class of
> attack much more difficult to mount.

Na I am not convinced of the "much more difficult". Maybe they will just
have to upgrade their approaches to fetch the proper values to decode.



Re: [PATCH] mm: make allocation counters per-order

2017-07-06 Thread Christoph Lameter
On Thu, 6 Jul 2017, Roman Gushchin wrote:

> +#define PGALLOC_EVENTS_SIZE (MAX_NR_ZONES * MAX_ORDER)
> +#define PGALLOC_EVENTS_CUT_SIZE (MAX_NR_ZONES * (MAX_ORDER - 1))
> +#define PGALLOC_FIRST_ZONE (PGALLOC_NORMAL - ZONE_NORMAL)


You are significantly increasing the per cpu counters (ZONES *
MAX_ORDER * cpus!!!). This will increase the cache footprint of critical
functions significantly and thus lead to regressions.

Typically counters for zones are placed in the zone structures but
you would also significantly increase the per zone counters ...
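
For a rough sense of scale (the constants below are assumptions for
illustration, not values taken from the patch), the per-cpu footprint of such
an event array can be estimated like this:

#include <stdio.h>

int main(void)
{
	long zones = 4;		/* assumed MAX_NR_ZONES  */
	long orders = 11;	/* assumed MAX_ORDER     */
	long cpus = 64;		/* assumed online CPUs   */
	long counter = 8;	/* assumed bytes/counter */

	long per_cpu = zones * orders * counter;
	printf("per cpu: %ld bytes, all cpus: %ld bytes\n",
	       per_cpu, per_cpu * cpus);
	return 0;
}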



Re: [PATCH v3] mm: Add SLUB free list pointer obfuscation

2017-07-06 Thread Christoph Lameter
On Wed, 5 Jul 2017, Kees Cook wrote:

> @@ -3536,6 +3565,9 @@ static int kmem_cache_open(struct kmem_cache *s, 
> unsigned long flags)
>  {
>   s->flags = kmem_cache_flags(s->size, flags, s->name, s->ctor);
>   s->reserved = 0;
> +#ifdef CONFIG_SLAB_FREELIST_HARDENED
> + s->random = get_random_long();
> +#endif
>
>   if (need_reserve_slab_rcu && (s->flags & SLAB_TYPESAFE_BY_RCU))
>   s->reserved = sizeof(struct rcu_head);
>

So if an attacker knows the internal structure of data then he can simply
dereference page->kmem_cache->random to decode the freepointer.

Assuming someone is already targeting a freelist pointer (which indicates
detailed knowledge of the internal structure) then I would think that
someone like that will also figure out how to follow the pointer links to
get to the random value.

Not seeing the point of all of this.
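
A tiny demo of that point (purely illustrative; the constants are made up and
this is not the kernel's exact encoding): if the stored value is just the
free pointer XORed with the cache's random value, one read of that random
value is enough to decode it.

#include <stdio.h>

int main(void)
{
	unsigned long long freeptr = 0xffff888012345678ULL; /* made-up object address */
	unsigned long long random  = 0x5aa5c33c0f0f1234ULL; /* made-up s->random      */

	unsigned long long stored  = freeptr ^ random;  /* what lands in the slab     */
	unsigned long long decoded = stored ^ random;   /* attacker read of s->random */

	printf("stored=%#llx decoded=%#llx match=%d\n",
	       stored, decoded, decoded == freeptr);
	return 0;
}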



Re: [PATCH v2] mm: Add SLUB free list pointer obfuscation

2017-06-29 Thread Christoph Lameter
On Sun, 25 Jun 2017, Kees Cook wrote:

> The difference gets lost in the noise, but if the above is sensible,
> it's 0.07% slower. ;)

Hmmm... These differences add up. Also in a repetitive benchmark like that
you do not see the impact that the additional cacheline use in the cpu
cache has on larger workloads. Those may be pushed over the edge of l1 or
l2 capacity at some point which then causes drastic regressions.



Re: [PATCH] slub: make sysfs file removal asynchronous

2017-06-28 Thread Christoph Lameter
On Tue, 20 Jun 2017, Tejun Heo wrote:

> And we have to weight that against the possibility of breakage from
> the backport, however low it may be, right?  I'm not strongly
> convinced either way on this one and AFAICS the slub sysfs files there
> are mostly for debugging, so we'd be risking breakage in a way more
> common path (kmem_cache destruction) to avoid unlikely deadlock with a
> debug facility.  I think -stable backports should be conservative and
> justified as breaking things through -stable undermines the whole
> thing.

The sysfs files are mainly used for reporting (the "slabinfo" tool
accesses these files f.e.)

But given the high rate of breakage of sysfs related patches: Lets just
skip stable for now.
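
For context, "accessing these files" just means reading plain text attributes
under /sys/kernel/slab/; a minimal sketch (the cache and attribute names are
examples and depend on the running kernel and its config):

#include <stdio.h>

int main(void)
{
	char buf[64];
	/* example attribute; present when SLUB sysfs support is enabled */
	FILE *f = fopen("/sys/kernel/slab/kmalloc-64/objects", "r");

	if (!f) {
		perror("slab sysfs");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("kmalloc-64 objects: %s", buf);
	fclose(f);
	return 0;
}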



Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-30 Thread Christoph Lameter
On Fri, 26 May 2017, Marcelo Tosatti wrote:

> > interrupts and scheduler ticks. But what does this have to do with vmstat?
> >
> > Show me your dpdk code running and trace the tick on / off events  as well
> > as the vmstat invocations. Also show all system calls occurring on the cpu
> > that runs dpdk. That is necessary to see what triggers vmstat and how the
> > system reacts to the changes to the differentials.
>
> Sure, i can get that to you. The question remains: Are you arguing
> its not valid for a realtime application to use any system call
> which changes a vmstat counter?

A true realtime app would be conscientious of its use of the OS services
because the use of the services may cause additional latencies and also
cause timers etc to fire later. A realtime app that is willing to use
these services is therefore willing to tolerate larger latencies. A
realtime app that is using OS service may cause the timer tick to be
enabled which also causes additional latencies.

I have seen completely OS noise free processing for extended time periods
when not using OS services and using RDMA for I/O. This fits my use case
well.

If there are really these high latencies because of kworker processing for
vmstat then maybe we need a different mechanism there (bh? or other
triggers) and maybe we are using far too many counters so that the
processing becomes a heavy user of resources.

> Because if they are allowed, then its obvious something like
> this is needed.

I am still wondering what benefit there is. Lets get clear on the test
load and see if this actually makes sense.



Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-25 Thread Christoph Lameter
On Thu, 25 May 2017, Marcelo Tosatti wrote:

> Argument? We're showing you the data that this is causing a latency
> problem for us.

Sorry I am not sure where the data shows a latency problem. There are
interrupts and scheduler ticks. But what does this have to do with vmstat?

Show me your dpdk code running and trace the tick on / off events  as well
as the vmstat invocations. Also show all system calls occurring on the cpu
that runs dpdk. That is necessary to see what triggers vmstat and how the
system reacts to the changes to the differentials.

Then please rerun the test by setting the vmstat_interval to 60.

Do another run with your modifications and show the difference.

> > Something that crossed my mind was to add a new tunable to set
> > the vmstat_interval for each CPU, this way we could essentially
> > disable it to the CPUs where DPDK is running. What's the implications
> > of doing this besides not getting up to date stats in /proc/vmstat
> > (which I still have to confirm would be OK)? Can this break anything
> > in the kernel for example?
>
> Well, you get incorrect statistics.

The statistics are never completely accurate. You will get less accurate
statistics but they will be correct. The differentials may not be
reflected in the counts shown via /proc but there is a cap on how
inaccurate those can become.
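
A minimal single-threaded sketch of that idea (not the kernel's vmstat code;
NR_CPUS and THRESHOLD are made up): each CPU accumulates a small differential
and only folds it into the global counter once it crosses a threshold, so the
reported value can lag behind by at most threshold * number of CPUs.

#include <stdio.h>

#define NR_CPUS   4
#define THRESHOLD 32	/* assumed allowed per-cpu drift */

static long global_count;
static int  diff[NR_CPUS];

static void inc_counter(int cpu)
{
	if (++diff[cpu] > THRESHOLD) {	/* fold the differential back */
		global_count += diff[cpu];
		diff[cpu] = 0;
	}
}

int main(void)
{
	for (int i = 0; i < 1000; i++)
		inc_counter(i % NR_CPUS);
	printf("reported %ld of 1000, worst-case lag %d\n",
	       global_count, NR_CPUS * THRESHOLD);
	return 0;
}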



Re: [PATCH 0/6] refine and rename slub sysfs

2017-05-24 Thread Christoph Lameter
On Wed, 24 May 2017, Wei Yang wrote:

> >
> >Who is going to use those new entries and for what purpose? Why do we
> >want to expose even more details of the slab allocator to the userspace.
> >Is the missing information something fundamental that some user space
> >cannot work without it? Seriously these are essential questions you
> >should have answer for _before_ posting the patch and mention all those
> >reasons in the changelog.
>
> It is me who wants to get more details of the slub behavior.
> AFAIK, no one else is expecting this.

I would appreciate some clearer structured statistics. These are important
for diagnostics and for debugging. Do not go overboard with this. Respin
it and also provide a cleanup of the slabinfo tool? I would appreciate it.

> Hmm, if we really don't want to export these entries, why not remove related
> code? Looks we are sure they will not be touched.

Please have a look at the slabinfo code which depends on those fields in
order to display slab information. I have patchsets here that will add
more functionality to slab and those will also add additional fields to
sysfs.



Re: [PATCH 0/6] refine and rename slub sysfs

2017-05-23 Thread Christoph Lameter
On Tue, 23 May 2017, Michal Hocko wrote:

> > >_Why_ do we need all this?
> >
> > To have a clear statistics for each slab level.
>
> Is this worth risking breakage of the userspace which consume this data
> now? Do you have any user space code which will greatly benefit from the
> new data and which couldn't do the same with the current format/output?
>
> If yes this all should be in the changelog.

And the patchset would also need to update the user space tool that is in
the kernel tree...

Again Wei please do not use "level". Type?



Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-22 Thread Christoph Lameter
On Sat, 20 May 2017, Marcelo Tosatti wrote:

> > And you can configure the interval of vmstat updates freely Set
> > the vmstat_interval to 60 seconds instead of 2 for a try? Is that rare
> > enough?
>
> Not rare enough. Never is rare enough.

Ok what about the other stuff that must be going on if you allow OS
activity like f.e. the tick, scheduler etc etc.


Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-22 Thread Christoph Lameter
On Fri, 19 May 2017, Luiz Capitulino wrote:

> Something that crossed my mind was to add a new tunable to set
> the vmstat_interval for each CPU, this way we could essentially
> disable it to the CPUs where DPDK is running. What's the implications
> of doing this besides not getting up to date stats in /proc/vmstat
> (which I still have to confirm would be OK)? Can this break anything
> in the kernel for example?

The data is still going to be updated when the differential gets too big.

Increasing the vmstat interval and reducing the differential threshold
would get you there.



Re: [PATCH 0/3] mm/slub: Fix unused function warnings

2017-05-22 Thread Christoph Lameter
On Fri, 19 May 2017, Matthias Kaehlcke wrote:

> This series fixes a bunch of warnings about unused functions in SLUB

Looks ok to me.

Acked-by: Christoph Lameter <c...@linux.com>



Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-19 Thread Christoph Lameter
On Fri, 19 May 2017, Marcelo Tosatti wrote:

> Use-case: realtime application on an isolated core which for some reason
> updates vmstatistics.

Ok that is already only happening every 2 seconds by default and that
interval is configurable via the vmstat_interval proc setting.

> > Just a measurement of vmstat_worker. Pointless.
>
> Shouldnt the focus be on general scenarios rather than particular
> usecases, so that the solution covers a wider range of usecases?

Yes indeed and as far as I can tell the wider usecases are covered. Not
sure that there is anything required here.

> The situation as i see is as follows:
>
> Your point of view is: an "isolated CPU" with a set of applications
> cannot update vm statistics, otherwise they pay the vmstat_update cost:
>
>  kworker/5:1-245   [005] 1..   673.454295: workqueue_execute_start: 
> work struct a0cf6e493e20: function vmstat_update
>  kworker/5:1-245   [005] 1..   673.454305: workqueue_execute_end: 
> work struct a0cf6e493e20
>
> Thats 10us for example.

Well with a decent cpu that is 3 usec and it occurs infrequently on the
order of once per multiple seconds.

> So if want to customize a realtime setup whose code updates vmstatistic,
> you are dead. You have to avoid any systemcall which possibly updates
> vmstatistics (now and in the future kernel versions).

You are already dead because you allow IPIs and other kernel processing
which creates far more overhead. I still fail to see the point.

> The point is that these vmstat updates are rare. From
> http://www.7-cpu.com/cpu/Haswell.html:
>
> RAM Latency = 36 cycles + 57 ns (3.4 GHz i7-4770)
> RAM Latency = 62 cycles + 100 ns (3.6 GHz E5-2699 dual)
>
> Lets round to 100ns = 0.1us.

That depends on the kernel functionality used.

> You need 100 vmstat updates (all misses to RAM, the worst possible case)
> to have equivalent amount of time of the batching version.

The batching version occurs every couple of seconds if at all.

> But thats not the point. The point is the 10us interruption
> to execution of the realtime app (which can either mean
> your current deadline requirements are not met, or that
> another application with lowest latency requirement can't
> be used).

Ok then you need to get rid of the IPIs and the other stuff that you have
going on with the OS first I think.

> So why are you against integrating this simple, isolated patch which
> does not affect how current logic works?

Frankly the argument does not make sense. Vmstat updates occur very
infrequently (probably even less than your IPIs and the other OS stuff that
also causes additional latencies that you seem to be willing to tolerate).

And you can configure the interval of vmstat updates freely. Set
the vmstat_interval to 60 seconds instead of 2 for a try? Is that rare
enough?
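
For reference, a minimal way to try that from userspace (the path is assumed
to be the usual procfs location of the vm.stat_interval sysctl; this needs
sufficient privileges):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/stat_interval", "w");

	if (!f) {
		perror("stat_interval");
		return 1;
	}
	fprintf(f, "60\n");	/* seconds between vmstat folding runs */
	fclose(f);
	return 0;
}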



Re: Virtually Mapped Stacks: Do not disable interrupts

2017-05-19 Thread Christoph Lameter
On Thu, 18 May 2017, Andy Lutomirski wrote:

> On Wed, May 17, 2017 at 8:58 AM, Christoph Lameter <c...@linux.com> wrote:
> > The reason to disable interrupts seems to be to avoid switching
> > to a different processor while handling per cpu data using
> > individual loads and stores. If we use per cpu RMV primitives
> > we will not have to disable interrupts.
>
> I like this, except that those primitives can be quite expensive, I
> think, and they're being called in a loop.  What if you first did a
> this_cpu_read() to see if the value in the cache slot might be useful
> before doing the heavyweight exchange?

These operations are not expensive because they are unlocked operations
(in contrast to the usual "lock cmpxchg") and do not
require coherency to be guaranteed between processors. That is why they
were made available because they are so much cheaper.




Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-18 Thread Christoph Lameter
On Thu, 18 May 2017, Michal Hocko wrote:

> > See above. OOM Kill in a cpuset does not kill an innocent task but a task
> > that does an allocation in that specific context meaning a task in that
> > cpuset that also has a memory policty.
>
> No, the oom killer will chose the largest task in the specific NUMA
> domain. If you just fail such an allocation then a page fault would get
> VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
> the cpusets.

Ok someone screwed up that code. There still is the determination that we
have a constrained alloc:

oom_kill:
/*
 * Check if there were limitations on the allocation (only relevant for
 * NUMA and memcg) that may require different handling.
 */
constraint = constrained_alloc(oc);
if (constraint != CONSTRAINT_MEMORY_POLICY)
oc->nodemask = NULL;
check_panic_on_oom(oc, constraint);

-- Ok. A constrained failing alloc used to terminate the allocating
process here. But it falls through to selecting a "bad process"


if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
oc->chosen = current;
                oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
return true;
}

--  A constrained allocation should not get here but fail the process that
attempts the alloc.

select_bad_process(oc);


Can we restore the old behavior? If I just specify the right memory policy
I can cause other processes to just be terminated?


> > Regardless of that the point earlier was that the moving logic can avoid
> > creating temporary situations of empty sets of nodes by analysing the
> > memory policies etc and only performing moves when doing so is safe.
>
> How are you going to do that in a raceless way? Moreover the whole
> discussion is about _failing_ allocations on an empty cpuset and
> mempolicy intersection.

Again this is only working for processes that are well behaved and it
never worked in a different way before. There was always the assumption
that a process does not allocate in the areas that have allocation
constraints and that the process does not change memory policies nor
store them somewhere for late etc etc. HPC apps typically allocate memory
on startup and then go through long times of processing and I/O.

The idea that cpuset node to node migration will work with a running
process that does arbitrary activity is a pipe dream that we should give
up. There must be constraints on a process in order to allow this to work
and as far as I can tell this is best done in userspace with a library and
by putting requirements on the applications that desire to be movable that
way.

F.e. an application that does not use memory policies or other allocation
constraints should be fine. That has been working.


Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-18 Thread Christoph Lameter
On Thu, 18 May 2017, Vlastimil Babka wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> No, that expand/shrink by itself doesn't work against parallel

Parallel? I think we are clear that this is inherently racy against the
app changing policies etc etc? There is a huge issue there already. The
app needs to be well behaved in some heretofore undefined way in order to
make moves clean.

> get_page_from_freelist going through a zonelist. Moving from node 0 to
> 1, with zonelist containing nodes 1 and 0 in that order:
>
> - mempolicy mask is 0
> - zonelist iteration checks node 1, it's not allowed, skip

There is an allocation from node 1? This is not allowed before the move.
So it should fail. Not skipping to another node.

> - mempolicy mask is 0,1 (expand)
> - mempolicy mask is 1 (shrink)
> - zonelist iteration checks node 0, it's not allowed, skip
> - OOM

Are you talking about a race here between zonelist scanning and the
moving? That has been there forever.

And frankly there are gazillions of these races. The best thing to do is
to get the cpuset moving logic out of the kernel and into user space.

Understand that this is a heuristic and maybe come up with a list of
restrictions that make an app safe. A safe app that can be moved must f.e.

1. Not allocate new memory while it's being moved.
2. Not change memory policies after its initialization and while it's being
moved.
3. Not save memory policy state in some variable (because the logic to
translate the memory policies for the new context cannot find it).

...

Again cpuset process migration is a huge mess that you do not want to
have in the kernel and AFAICT this is a corner case with difficult
semantics. Better have that in user space...




Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-18 Thread Christoph Lameter
On Thu, 18 May 2017, Michal Hocko wrote:

> > Nope. The OOM in a cpuset gets the process doing the alloc killed. Or what
> > that changed?

!

> >
> > At this point you have messed up royally and nothing is going to rescue
> > you anyways. OOM or not does not matter anymore. The app will fail.
>
> Not really. If you can trick the system to _think_ that the intersection
> between mempolicy and the cpuset is empty then the OOM killer might
> trigger an innocent task rather than the one which tricked it into that
> situation.

See above. OOM Kill in a cpuset does not kill an innocent task but a task
that does an allocation in that specific context meaning a task in that
cpuset that also has a memory policy.

Regardless of that the point earlier was that the moving logic can avoid
creating temporary situations of empty sets of nodes by analysing the
memory policies etc and only performing moves when doing so is safe.



Virtually Mapped Stacks: Do not disable interrupts

2017-05-17 Thread Christoph Lameter
The reason to disable interrupts seems to be to avoid switching
to a different processor while handling per cpu data using
individual loads and stores. If we use per cpu RMV primitives
we will not have to disable interrupts.

Signed-off-by: Christoph Lameter <c...@linux.com>

Index: linux/kernel/fork.c
===
--- linux.orig/kernel/fork.c
+++ linux/kernel/fork.c
@@ -205,19 +205,17 @@ static unsigned long *alloc_thread_stack
void *stack;
int i;

-   local_irq_disable();
for (i = 0; i < NR_CACHED_STACKS; i++) {
-   struct vm_struct *s = this_cpu_read(cached_stacks[i]);
+   struct vm_struct *s;
+
+   s = this_cpu_xchg(cached_stacks[i], NULL);

if (!s)
continue;
-   this_cpu_write(cached_stacks[i], NULL);

tsk->stack_vm_area = s;
-   local_irq_enable();
return s->addr;
}
-   local_irq_enable();

stack = __vmalloc_node_range(THREAD_SIZE, THREAD_SIZE,
 VMALLOC_START, VMALLOC_END,
@@ -245,19 +243,15 @@ static inline void free_thread_stack(str
 {
 #ifdef CONFIG_VMAP_STACK
if (task_stack_vm_area(tsk)) {
-   unsigned long flags;
int i;

-   local_irq_save(flags);
for (i = 0; i < NR_CACHED_STACKS; i++) {
-   if (this_cpu_read(cached_stacks[i]))
+   if (this_cpu_cmpxchg(cached_stacks[i],
+   NULL, tsk->stack_vm_area) != NULL)
continue;

-   this_cpu_write(cached_stacks[i], tsk->stack_vm_area);
-   local_irq_restore(flags);
return;
}
-   local_irq_restore(flags);

vfree_atomic(tsk->stack);
return;
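
As a userspace analogy of the pattern above (C11 atomics standing in for the
kernel per-cpu ops; not kernel code, and the per-cpu variants are cheaper
unlocked operations): when claiming or returning a cached slot is a single
atomic read-modify-write, there is no separate read/modify/write window that
interrupt disabling would need to protect.

#include <stdatomic.h>
#include <stdio.h>

#define NR_CACHED 2

static _Atomic(void *) cached[NR_CACHED];

static void *take_cached(void)
{
	for (int i = 0; i < NR_CACHED; i++) {
		void *s = atomic_exchange(&cached[i], NULL);	/* claim slot */
		if (s)
			return s;
	}
	return NULL;			/* caller falls back to a real allocation */
}

static int put_cached(void *s)
{
	for (int i = 0; i < NR_CACHED; i++) {
		void *expected = NULL;

		if (atomic_compare_exchange_strong(&cached[i], &expected, s))
			return 1;	/* stored into an empty slot */
	}
	return 0;			/* cache full, caller frees */
}

int main(void)
{
	static int stack_area;

	put_cached(&stack_area);
	printf("got %p back\n", take_cached());
	return 0;
}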


Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Michal Hocko wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> I am pretty sure it is describe in those changelogs and I won't repeat
> it here.

I cannot figure out what you are referring to. There are numerous
patches and discussions about OOM scenarios in this context.



Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Michal Hocko wrote:

> > If you have screwy things like static mbinds in there then you are
> > hopelessly lost anyways. You may have moved the process to another set
> > of nodes but the static bindings may refer to a node no longer
> > available. Thus the OOM is legitimate.
>
> The point is that you do _not_ want such a process to trigger the OOM
> because it can cause other processes being killed.

Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
that changed?

At this point you have messed up royally and nothing is going to rescue
you anyways. OOM or not does not matter anymore. The app will fail.

> > At least a user space app could inspect
> > the situation and come up with custom ways of dealing with the mess.
>
> I do not really see how would this help to prevent a malicious user from
> playing tricks.

How did a malicious user come into this? Of course you can mess up in
significant ways if you can overflow nodes and cause an app that has
restrictions to fail but nothing is going to change that.



Re: [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Vlastimil Babka wrote:

>  struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -struct zonelist *zonelist, nodemask_t *nodemask);
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> + nodemask_t *nodemask);
>
>  static inline struct page *
> -__alloc_pages(gfp_t gfp_mask, unsigned int order,
> - struct zonelist *zonelist)
> +__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
>  {
> - return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
> + return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
>  }

Maybe use nid instead of preferred_nid like in __alloc_pages? Otherwise
there may be confusion with the MPOL_PREFER policy.

> @@ -1963,8 +1960,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct 
> vm_area_struct *vma,
>  {
>   struct mempolicy *pol;
>   struct page *page;
> + int preferred_nid;
>   unsigned int cpuset_mems_cookie;
> - struct zonelist *zl;
>   nodemask_t *nmask;

Same here.

> @@ -4012,8 +4012,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
>   * This is the 'heart' of the zoned buddy allocator.
>   */
>  struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> - struct zonelist *zonelist, nodemask_t *nodemask)
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> + nodemask_t *nodemask)
>  {

and here

This looks clean to me. Still feel a bit uneasy about this since I do
remember that we had a reason to use zonelists instead of nodes back then
but cannot remember what that reason was.

CCing Dimitri at SGI. This may break a lot of legacy SGI apps. If you read
this, Dimitri, then please review this patchset and the discussions around
it.

Reviewed-by: Christoph Lameter <c...@linux.com>



Re: [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Vlastimil Babka wrote:

> The task->il_next variable stores the next allocation node id for task's
> MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
> bind mempolicies due to changing cpuset mems. Currently it also tries to
> make sure that current->il_next is valid within the updated nodemask. This is
> bogus, because 1) we are updating potentially any task's mempolicy, not just
> current, and 2) we might be updating a per-vma mempolicy, not task one.

Reviewed-by: Christoph Lameter <c...@linux.com>



Re: [PATCH 0/6] refine and rename slub sysfs

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Wei Yang wrote:

> This patch serial could be divided into two parts.
>
> First three patches refine and adds slab sysfs.
> Second three patches rename slab sysfs.

These changes will break the slabinfo tool in linux/tools/vm/slabinfo.c.
Please update it as well.

> 1. Refine slab sysfs
>
> There are four level slabs:

levels? Maybe types of slabs?

> CPU
> CPU_PARTIAL
> PARTIAL
> FULL
>
> And in sysfs, it use show_slab_objects() and cpu_partial_slabs_show() to
> reflect the statistics.
>
> In patch 2, it splits some function in show_slab_objects() which makes sure
> only cpu_partial_slabs_show() covers statistics for CPU_PARTIAL slabs.
>
> After doing so, it would be more clear that show_slab_objects() has totally 9
> statistic combinations for three level of slabs. Each slab has three cases
> statistic.
>
> slabs
> objects
> total_objects

That sounds good.

> which is a little bit hard for users to understand. The second three patches
> rename sysfs file in this pattern.
>
> xxx_slabs[[_total]_objects]
>
> Finally it looks Like
>
> slabs
> slabs_objects
> slabs_total_objects
> cpu_slabs
> cpu_slabs_objects
> cpu_slabs_total_objects
> partial_slabs
> partial_slabs_objects
> partial_slabs_total_objects
> cpu_partial_slabs

Aren't we missing:

cpu_partial_slabs_objects
cpu_partial_slabs_total_objects

And the partial slabs exclude the cpu slabs as well as the cpu_partial
slabs?

Could you add some documentation as well to explain the exact semantics?



Re: [PATCH 1/6] mm/slub: add total_objects_partial sysfs

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Wei Yang wrote:

> For partial slabs, show_slab_objects() could display their total objects.
>
> This patch just adds an entry to display it.

Acked-by: Christoph Lameter <c...@linux.com>
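
For illustration, such an entry typically looks something like the following
in mm/slub.c (the actual patch may differ): reuse show_slab_objects() with
the SO_PARTIAL and SO_TOTAL flags, register the file with SLAB_ATTR_RO(),
and add the attribute to the slab_attrs[] array:

static ssize_t total_objects_partial_show(struct kmem_cache *s, char *buf)
{
	return show_slab_objects(s, buf, SO_PARTIAL | SO_TOTAL);
}
SLAB_ATTR_RO(total_objects_partial);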


Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Michal Hocko wrote:

> > > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > > case in a raceless way?
> >
> > You don't have to do that if you do not create an empty mempolicy in the
> > first place. The current kernel code avoids that by first allowing access
> > to the new set of nodes and removing the old ones from the set when done.
>
> which is racy, as Vlastimil pointed out. If we simply fail such an
> allocation the failure will go up the call chain until we hit the OOM
> killer due to VM_FAULT_OOM. How would you want to handle that?

The race is where? If you expand the node set during the move of the
application then you are safe in terms of the legacy apps that did not
include static bindings.

If you have screwy things like static mbinds in there then you are
hopelessly lost anyways. You may have moved the process to another set
of nodes but the static bindings may refer to a node no longer
available. Thus the OOM is legitimate.

At least a user space app could inspect
the situation and come up with custom ways of dealing with the mess.
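
A userspace sketch of the static-binding case described above, illustrative
only: the program pins a range to an absolute node with MPOL_F_STATIC_NODES,
so a later cpuset migration cannot remap the mask, and if that node is
removed from the cpuset, allocations from the range can only fail. Node 2 is
an arbitrary example; the fallback define carries the uapi value from
<linux/mempolicy.h> in case the installed numaif.h lacks it.

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef MPOL_F_STATIC_NODES
#define MPOL_F_STATIC_NODES (1 << 15)	/* uapi value from <linux/mempolicy.h> */
#endif

int main(void)
{
	size_t len = 4UL << 20;			/* 4 MiB, multiple of the page size */
	void *buf = aligned_alloc(4096, len);
	unsigned long nodemask = 1UL << 2;	/* absolute node 2 */

	if (!buf)
		return 1;

	/* Static binding: the kernel will not remap this mask on cpuset moves. */
	if (mbind(buf, len, MPOL_BIND | MPOL_F_STATIC_NODES,
		  &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	/*
	 * If the cpuset later loses node 2, faults on buf hit a policy with
	 * no usable nodes, which is the OOM case discussed above.
	 */
	return 0;
}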



Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

2017-05-17 Thread Christoph Lameter
On Wed, 17 May 2017, Michal Hocko wrote:

> > We certainly can do that. The failure of the page faults is due to the
> > admin trying to move an application that is not aware of this and is using
> > mempols. That could be an error. Trying to move an application that
> > contains both absolute and relative node numbers is definitely something
> > that is potentially so screwed up that the kernel should not muck around
> > with such an app.
> >
> > Also user space can determine if the application is using memory policies
> > and can then take appropriate measures (message to the sysadmin to eval
> > the situation f.e.) or mess around with the process's memory policies on
> > its own.
> >
> > So this is certainly a way out of this mess.
>
> So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> case in a raceless way?

You don't have to do that if you do not create an empty mempolicy in the
first place. The current kernel code avoids that by first allowing access
to the new set of nodes and removing the old ones from the set when done.
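
A conceptual sketch of that two-step rebind, not the actual mm/mempolicy.c
code (the function name is illustrative, and in the kernel the two steps are
two separate rebind passes rather than back-to-back statements):

static void rebind_nodemask_two_step(nodemask_t *policy_nodes,
				     const nodemask_t *new_nodes)
{
	nodemask_t tmp;

	/* Step 1: allow both the old and the new nodes. */
	nodes_or(tmp, *policy_nodes, *new_nodes);
	*policy_nodes = tmp;

	/* Step 2 (a separate pass in reality): shrink to the new set only. */
	*policy_nodes = *new_nodes;
}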



Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-16 Thread Christoph Lameter
On Mon, 15 May 2017, Marcelo Tosatti wrote:

> > NOHZ already does that. I wanted to know what the problem is that you are
> > seeing. The latency issue has already been solved as far as I can tell.
> > Please tell me why the existing solutions are not sufficient for you.
>
> We don't want vmstat_worker to execute on a given CPU, even if the local
> CPU updates vm-statistics.

Instead of responding, you just keep describing what you want.

> Because:
>
> vmstat_worker increases latency of the application
> (I can measure it if you want on a given CPU,
> how many ns's the following takes:

That still is no use case. Just a measurement of vmstat_worker. Pointless.

If you move the latency from the vmstat worker into the code that's
updating the counters, then you will require increased use of atomics,
which will increase contention, which in turn will significantly
increase the overall latency.

> Why the existing solutions are not sufficient:
>
> 1) task-isolation patchset seems too heavy for our usecase (we do
> want IPIs, signals, etc).

Ok then minor delays from remote random events are tolerable?
Then you can also have a vmstat update.

> So this seems a little heavy for our usecase.

Sorry, all of this does not make sense to me. Maybe get some numbers for
an app with intensive OS access running with atomics vs the vmstat worker?

NOHZ currently disables the vmstat worker when no updates occur. This is
applicable to DPDK and will provide a quiet, vmstat-worker-free environment
if no statistics activity is occurring.
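
A conceptual sketch of the trade-off, not the mm/vmstat.c code (names are
illustrative): with batched per-cpu deltas the hot path usually touches only
a local counter and the shared atomic is hit rarely, while folding the
worker's job into the update path makes every update a contended atomic RMW.

static DEFINE_PER_CPU(long, stat_delta);
static atomic_long_t stat_global = ATOMIC_LONG_INIT(0);

/* Batched: assumes preemption is disabled by the caller, like __mod_*_state. */
static inline void mod_stat_batched(long delta, long threshold)
{
	long d = __this_cpu_read(stat_delta) + delta;

	if (d > threshold || d < -threshold) {
		atomic_long_add(d, &stat_global);	/* rare shared update */
		d = 0;
	}
	__this_cpu_write(stat_delta, d);
}

/* Inline/atomic-only: every update is a contended RMW on the shared counter. */
static inline void mod_stat_atomic(long delta)
{
	atomic_long_add(delta, &stat_global);
}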



Re: [patch 2/2] MM: allow per-cpu vmstat_threshold and vmstat_worker configuration

2017-05-12 Thread Christoph Lameter
On Fri, 12 May 2017, Marcelo Tosatti wrote:

> > What exactly is the issue you are seeing and want to address? I think we
> > have similar aims and as far as I know the current situation is already
> > good enough for what you may need. You may just not be aware of how to
> > configure this.
>
> I want to disable the vmstat worker thread completely on an isolated CPU,
> because it adds overhead to a latency target, and for that target
> the lower the better.

NOHZ already does that. I wanted to know what the problem is that you are
seeing. The latency issue has already been solved as far as I can tell.
Please tell me why the existing solutions are not sufficient for you.

> > I doubt that doing inline updates will do much good compared to what we
> > already have and what the dataplan mode can do.
>
> Can the dataplan mode disable the vmstat worker thread completely on a given
> CPU?

That already occurs when you call quiet_vmstat() and is used by the NOHZ
logic. Configure that correctly and you should be fine.
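
Roughly how those pieces fit together, sketched conceptually rather than as
the mm/vmstat.c code: quiet_vmstat() folds the local CPU's pending deltas
when the tick is stopped, and a shepherd only re-arms the per-cpu vmstat
work on CPUs that still have pending deltas, so a quiesced NOHZ CPU never
sees the worker run. has_pending_deltas() and stat_work are illustrative
stand-ins, not kernel symbols.

static DEFINE_PER_CPU(struct delayed_work, stat_work);

/* Re-arm the deferred per-cpu work only where counter deltas are pending. */
static void stat_shepherd_sketch(bool (*has_pending_deltas)(int cpu),
				 unsigned long interval)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (has_pending_deltas(cpu))
			schedule_delayed_work_on(cpu, &per_cpu(stat_work, cpu),
						 interval);
}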

