Re: [PATCH v2] mm/slub: fix panic in slab_alloc_node()

2020-10-28 Thread Christopher Lameter
On Tue, 27 Oct 2020, Laurent Dufour wrote:

> The issue is that object is not NULL while page is NULL which is odd but
> may happen if the cache flush happened after loading object but before
> loading page. Thus checking for the page pointer is required too.


Ok then let's revert commit 6159d0f5c03e? The situation may occur
elsewhere too.
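For reference, a minimal sketch of the kind of combined check being discussed
for the SLUB allocation fastpath (assuming SLUB's internal helpers and per-cpu
fields; this is a sketch, not the exact upstream diff):

    object = c->freelist;
    page = c->page;
    /*
     * A concurrent cpu-slab flush can leave object != NULL while
     * page == NULL, so both have to be tested before taking the
     * lockless fastpath.
     */
    if (unlikely(!object || !page || !node_match(page, node))) {
        /* slow path */
        object = __slab_alloc(s, gfpflags, node, addr, c);
    } else {
        /* lockless cmpxchg fastpath continues here */
        stat(s, ALLOC_FASTPATH);
    }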


Re: [PATCH v3 0/3] Actually fix freelist pointer vs redzoning

2020-10-15 Thread Christopher Lameter
On Wed, 14 Oct 2020, Kees Cook wrote:

> Note on patch 2: Christopher NAKed it, but I actually think this is a
> reasonable thing to add -- the "too small" check is only made when built
> with CONFIG_DEBUG_VM, so it *is* actually possible for someone to trip
> over this directly, even if it would never make it into a released
> kernel. I see no reason to just leave this foot-gun in place, though, so
> we might as well just fix it too. (Which seems to be what Longman was
> similarly supporting, IIUC.)

Well then remove the duplication of checks. The NAK was there because it
seems that you were not aware of the existing checks.

> Anyway, if patch 2 stays NAKed, that's fine. It's entirely separable,
> and the other 2 can land. :)

Just deal with the old checks too and it will be fine.



Re: [PATCH] mm: Make allocator take care of memoryless numa node

2020-10-12 Thread Christopher Lameter
On Mon, 12 Oct 2020, Xianting Tian wrote:

> In architectures like powerpc, we can have cpus without any local memory
> attached. In such cases the node does not have real memory.
>
> In many places in the current kernel code, there is no check whether the node
> is a memoryless numa node before calling the allocator interface.

That is intentional. SLUB relies on the page allocator to pick a node.

> This patch is to use local_memory_node(), which is guaranteed to have
> memory, in allocator interface. local_memory_node() is a noop in other
> architectures that don't support memoryless nodes.

The patch would destroy the support for memory policies in the
SLUB allocator and likely in the other ones as well.
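For context, a sketch of the behavior the objection refers to (roughly what
SLUB does today when it needs a new slab page; the fallback to nearby memory
comes from the memoryless node's zonelist, and this is a simplification):

    if (node == NUMA_NO_NODE)
        page = alloc_pages(flags, order);               /* mempolicy decides */
    else
        page = __alloc_pages_node(node, flags, order);  /* a memoryless node's
                                                           zonelist falls back
                                                           to nearby memory */
    /*
     * Rewriting the requested node with local_memory_node(node) before
     * this call would override the node asked for by the caller or by
     * the memory policy.
     */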



Re: [PATCH v2 2/3] mm/slub: Fix redzoning for small allocations

2020-10-12 Thread Christopher Lameter
On Fri, 9 Oct 2020, Kees Cook wrote:

> Store the freelist pointer out of line when object_size is smaller than
> sizeof(void *) and redzoning is enabled.
>
> (Note that no caches with such a size are known to exist in the kernel
> currently.)

Ummm... The smallest allowable cache size is sizeof(void *) as I recall.


mm/slab_common.c::kmem_cache_sanity_check() checks the sizes when caches are
created.

NAK.



Re: [RFC][PATCH 03/12] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks

2020-10-07 Thread Christopher Lameter
On Tue, 6 Oct 2020, Dave Hansen wrote:

> These zero checks are not great because it is not obvious what a zero
> mode *means* in the code.  Replace them with a helper which makes it
> more obvious: node_reclaim_enabled().

Well it uselessly checks bits. But whatever. It will prevent future code
removal.

Acked-by: Christoph Lameter 


Re: [RFC][PATCH 01/12] mm/vmscan: restore zone_reclaim_mode ABI

2020-10-07 Thread Christopher Lameter
On Tue, 6 Oct 2020, Dave Hansen wrote:

> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed.  That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel.  The end result is that
> if someone had a script that did:

Exactly right. Sorry, I must have missed reviewing that patch.

Acked-by: Christoph Lameter 



Re: [RFC][PATCH 02/12] mm/vmscan: move RECLAIM* bits to uapi header

2020-10-07 Thread Christopher Lameter
On Tue, 6 Oct 2020, Dave Hansen wrote:

> It is currently not obvious that the RECLAIM_* bits are part of the
> uapi since they are defined in vmscan.c.  Move them to a uapi header
> to make it obvious.

Acked-by: Christoph Lameter 



Re: [PATCH RFC v2 0/6] Break heap spraying needed for exploiting use-after-free

2020-10-06 Thread Christopher Lameter



On Tue, 6 Oct 2020, Matthew Wilcox wrote:

> On Tue, Oct 06, 2020 at 12:56:33AM +0200, Jann Horn wrote:
> > It seems to me like, if you want to make UAF exploitation harder at
> > the heap allocator layer, you could do somewhat more effective things
> > with a probably much smaller performance budget. Things like
> > preventing the reallocation of virtual kernel addresses with different
> > types, such that an attacker can only replace a UAF object with
> > another object of the same type. (That is not an idea I like very much
> > either, but I would like it more than this proposal.) (E.g. some
> > browsers implement things along those lines, I believe.)
>
> The slab allocator already has that functionality.  We call it
> TYPESAFE_BY_RCU, but if forcing that on by default would enhance security
> by a measurable amount, it wouldn't be a terribly hard sell ...

The TYPESAFE functionality switches a lot of debugging off because it also
allows speculative accesses to the object after it was freed (required
for RCU safety because the object may be freed in an RCU grace period while
it is still accessed). I do not think you would like that.
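For readers unfamiliar with the flag, a minimal sketch of opting a cache into
this behavior (struct foo and foo_cache are hypothetical names):

    struct foo {
        spinlock_t lock;
        int refcnt;
    };

    static struct kmem_cache *foo_cache;

    static int __init foo_cache_init(void)
    {
        /*
         * Objects keep their type until the whole slab page is returned,
         * so RCU readers may still inspect a just-freed object -- which
         * is also why several use-after-free debug checks have to be
         * relaxed for such caches.
         */
        foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
                                      SLAB_TYPESAFE_BY_RCU, NULL);
        return foo_cache ? 0 : -ENOMEM;
    }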



Re: [PATCH RFC v2 0/6] Break heap spraying needed for exploiting use-after-free

2020-10-06 Thread Christopher Lameter


On Mon, 5 Oct 2020, Kees Cook wrote:

> > TYPESAFE_BY_RCU, but if forcing that on by default would enhance security
> > by a measurable amount, it wouldn't be a terribly hard sell ...
>
> Isn't the "easy" version of this already controlled by slab_merge? (i.e.
> do not share same-sized/flagged kmem_caches between different caches)

Right.

> The large trouble are the kmalloc caches, which don't have types
> associated with them. Having implicit kmem caches based on the type
> being allocated there would need some pretty extensive plumbing, I
> think?

Actually typifying those accesses may get rid of a lot of kmalloc
allocations and could help to ease the management and control of objects.

It may be a big task though given the ubiquity of kmalloc and the need to
create a massive amount of new slab caches. This is going to reduce the
cache hit rate significantly.



Re: [PATCH v4 00/10] Independent per-CPU data section for nVHE

2020-09-24 Thread Christopher Lameter
On Tue, 22 Sep 2020, David Brazdil wrote:

> Introduce '.hyp.data..percpu' as part of ongoing effort to make nVHE
> hyp code self-contained and independent of the rest of the kernel.

The percpu subsystem's point is to enable the use of special hardware
instructions that can perform address calculation and a memory operation
in a single instruction that cannot be split by preemption or an interrupt.
This is in particular useful to avoid higher overhead for memory management
related counters because preempt disable/enable etc. can be avoided.

ARM cannot do that and thus has an LL/SC loop.

This is a patchset for ARM64, so it's not clear to me what kind of advantage
there would be over a simple implementation that does a regular fetch
from a base address with an offset.
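To illustrate the difference being discussed, a generic sketch (demo_counter
is a hypothetical per-cpu variable, not part of the nVHE patches):

    static DEFINE_PER_CPU(unsigned long, demo_counter);

    static void demo_count_event(void)
    {
        /*
         * On x86 this becomes a single %gs-relative increment, so no
         * preempt_disable()/preempt_enable() pair is needed.  On arm64
         * the per-cpu address has to be computed first and the update
         * done separately (the LL/SC-style sequence mentioned above),
         * so the benefit of the percpu machinery is smaller there.
         */
        this_cpu_inc(demo_counter);
    }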

> Main benefits:
>  * independent nVHE per-CPU data section that can be unmapped from host,
>  * more robust linking of nVHE hyp code,
>  * no need for hyp-specific macros to access per-CPU variables.

Maybe simply don't use percpu variables for your arm code? Those pointers
to data will be much more independent of the rest of the kernel and allow a
much higher degree of being self-contained.



Re: [PATCH v2 05/10] mm, kfence: insert KFENCE hooks for SLUB

2020-09-17 Thread Christopher Lameter
On Tue, 15 Sep 2020, Marco Elver wrote:

>  void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
>  {
> - void *ret = slab_alloc(s, gfpflags, _RET_IP_);
> + void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size);

The additional size parameter is a part of a struct kmem_cache that is
already passed to the function. Why does the parameter list need to be
expanded?



Re: [PATCH v2 04/10] mm, kfence: insert KFENCE hooks for SLAB

2020-09-17 Thread Christopher Lameter



On Tue, 15 Sep 2020, Marco Elver wrote:

> @@ -3206,7 +3207,7 @@ static void *cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
>  }
>
>  static __always_inline void *
> -slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> +slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
>  unsigned long caller)
>  {

The size of the object is available via a field in kmem_cache. And a
pointer to the current kmem_cache is already passed to the function. Why
is there a need to add an additional parameter?


Re: [PATCH] mm/slub: branch optimization in free slowpath

2020-08-13 Thread Christopher Lameter
On Thu, 13 Aug 2020, wuyun...@huawei.com wrote:

> The two conditions are mutually exclusive and gcc compiler will
> optimise this into if-else-like pattern. Given that the majority
> of free_slowpath is free_frozen, let's provide some hint to the
> compilers.

Acked-by: Christoph Lameter 


Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects

2020-08-11 Thread Christopher Lameter
On Fri, 7 Aug 2020, Pekka Enberg wrote:

> Why do you consider this to be a fast path? This is all partial list
> accounting when we allocate/deallocate a slab, no? Just like
> ___slab_alloc() says, I assumed this to be the slow path... What am I
> missing?

I thought these were per object counters? If you just want to count the
number of slabs then you do not need the lock at all. We already have a
counter for the number of slabs.
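A sketch of what is meant by the existing counter (assuming the SLUB-internal
helpers from mm/slab.h; nr_slabs is only present with CONFIG_SLUB_DEBUG):

    static unsigned long count_slabs_no_lock(struct kmem_cache *s)
    {
        struct kmem_cache_node *n;
        unsigned long slabs = 0;
        int node;

        /* atomic per-node counter, read without taking list_lock */
        for_each_kmem_cache_node(s, node, n)
            slabs += atomic_long_read(&n->nr_slabs);

        return slabs;
    }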

> No objections to alternative fixes, of course, but wrapping the
> counters under CONFIG_DEBUG seems like just hiding the actual issue...

CONFIG_DEBUG is on by default. It just compiles in the debug code and
disables it so we can enable it with a kernel boot option. This is because
we have had numerous issues in the past with "production" kernels that
could not be recompiled with debug options. So just running the production
kernel with another option will allow you to find hard-to-debug issues in
a full-scale production deployment with potentially proprietary modules
etc.



Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects

2020-08-07 Thread Christopher Lameter
On Fri, 7 Aug 2020, Pekka Enberg wrote:

> I think we can just default to the counters. After all, if I
> understood correctly, we're talking about up to 100 ms time period
> with IRQs disabled when count_partial() is called. As this is
> triggerable from user space, that's a performance bug whatever way you
> look at it.


Well yes, under extreme conditions, and this is only happening for sysfs
counter retrieval.

There could be other solutions to this. The solution here penalizes
every hotpath slab allocation for the sake of relatively infrequently
used counter monitoring. There is also the possibility of not traversing the
list and simply estimating the value based on the number of slab pages
allocated on that node.
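One possible shape of that estimation (a sketch only, assuming SLUB internals
such as oo_objects() and kmem_cache_node::nr_partial):

    static unsigned long approx_partial_free(struct kmem_cache *s,
                                             struct kmem_cache_node *n)
    {
        /*
         * Every slab on the partial list holds at least one and at most
         * oo_objects(s->oo) free objects; use the midpoint instead of
         * walking the list under list_lock.
         */
        return n->nr_partial * oo_objects(s->oo) / 2;
    }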

> Christoph, others, any objections?

Obviously  ;-)



Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects

2020-07-09 Thread Christopher Lameter
On Tue, 7 Jul 2020, Pekka Enberg wrote:

> On Fri, Jul 3, 2020 at 12:38 PM xunlei  wrote:
> >
> > On 2020/7/2 PM 7:59, Pekka Enberg wrote:
> > > On Thu, Jul 2, 2020 at 11:32 AM Xunlei Pang  wrote:
> > >> The node list_lock in count_partial() spends a long time iterating
> > >> in case of a large number of partial page lists, which can cause a
> > >> thundering herd effect on the list_lock contention, e.g. it causes
> > >> business response-time jitters when accessing "/proc/slabinfo"
> > >> in our production environments.
> > >
> > > Would you have any numbers to share to quantify this jitter? I have no
> >
> > We have HSF RT(High-speed Service Framework Response-Time) monitors, the
> > RT figures fluctuated randomly, then we deployed a tool detecting "irq
> > off" and "preempt off" to dump the culprit's calltrace, capturing the
> > list_lock cost up to 100ms with irq off issued by "ss", this also caused
> > network timeouts.
>
> Thanks for the follow up. This sounds like a good enough motivation
> for this patch, but please include it in the changelog.


Well, this is access via sysfs causing a holdoff. Another way of accessing
the same information without adding atomics and counters would be best.

> > I also have no idea what's the standard SLUB benchmark for the
> > regression test, any specific suggestion?
>
> I don't know what people use these days. When I did benchmarking in
> the past, hackbench and netperf were known to be slab-allocation
> intensive macro-benchmarks. Christoph also had some SLUB
> micro-benchmarks, but I don't think we ever merged them into the tree.

They are still where they have been for the last decade or so. In my git
tree on kernel.org. They were also reworked a couple of times and posted
to linux-mm. There are historical posts going back over the years where
individuals have modified them and used them to create multiple other
tests.



Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects

2020-07-07 Thread Christopher Lameter
On Thu, 2 Jul 2020, Xunlei Pang wrote:

> This patch introduces two counters to maintain the actual number
> of partial objects dynamically instead of iterating the partial
> page lists with list_lock held.
>
> New counters of kmem_cache_node are: pfree_objects, ptotal_objects.
> The main operations are under list_lock in slow path, its performance
> impact is minimal.


If at all then these counters need to be under CONFIG_SLUB_DEBUG.

> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -616,6 +616,8 @@ struct kmem_cache_node {
>  #ifdef CONFIG_SLUB
>   unsigned long nr_partial;
>   struct list_head partial;
> + atomic_long_t pfree_objects; /* partial free objects */
> + atomic_long_t ptotal_objects; /* partial total objects */

Please in the CONFIG_SLUB_DEBUG. Without CONFIG_SLUB_DEBUG we need to
build with minimal memory footprint.

>  #ifdef CONFIG_SLUB_DEBUG
>   atomic_long_t nr_slabs;
>   atomic_long_t total_objects;
> diff --git a/mm/slub.c b/mm/slub.c



Also this looks to be quite heavy on the cache and on execution time. Note
that the list_lock could be taken frequently in the performance sensitive
case of freeing an object that is not in the partial lists.



Re: [PATCH v1] mm:free unused pages in kmalloc_order

2020-07-01 Thread Christopher Lameter
On Mon, 29 Jun 2020, Matthew Wilcox wrote:

> Sounds like we need a test somewhere that checks this behaviour.
>
> > In order to make such allocations possible one would have to create yet
> > another kmalloc array for high memory.
>
> Not for this case because it goes straight to kmalloc_order().  What does
> make this particular case impossible is that we can't kmap() a compound
> page.  We could vmap it, but why are we bothering?

Well yes it will work if the slab allocator falls back to the page
allocator.  Higher order allocation through kmalloc ;-). How much fun
and uselessness 

Why not call the page allocator directly and play with all the bits you
want? Any regular small-object allocation with __GFP_HIGHMEM will lead to
strange effects if the bit is not checked.
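If highmem pages are really wanted, a sketch of going to the page allocator
directly as suggested above (error paths trimmed; the 64K size is only an
example, and the caller must vunmap() and __free_pages() later):

    static void *alloc_highmem_buffer(void)
    {
        unsigned int order = get_order(SZ_64K);
        unsigned int i, nr = 1 << order;
        struct page *pages[SZ_64K / PAGE_SIZE];
        struct page *page;

        page = alloc_pages(GFP_HIGHUSER, order);
        if (!page)
            return NULL;

        /* a high-order highmem page cannot simply be kmap()ed; vmap it */
        for (i = 0; i < nr; i++)
            pages[i] = page + i;

        return vmap(pages, nr, VM_MAP, PAGE_KERNEL);
    }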



Re: [PATCH v1] mm:free unused pages in kmalloc_order

2020-06-29 Thread Christopher Lameter
On Mon, 29 Jun 2020, Matthew Wilcox wrote:

> Slab used to disallow GFP_HIGHMEM allocations earlier than this,

It is still not allowed and not supported.


Re: [PATCH v1] mm:free unused pages in kmalloc_order

2020-06-29 Thread Christopher Lameter
On Sat, 27 Jun 2020, Long Li wrote:

> Environment using the slub allocator, 1G memory in my ARM32.
> kmalloc(1024, GFP_HIGHUSER) can allocate memory normally, but
> kmalloc(64*1024, GFP_HIGHUSER) will cause a memory leak, because
> alloc_pages returns highmem physical pages that cannot be directly
> converted into a virtual address, so kmalloc returns NULL but the
> pages have not been released. Usually driver developers will not use the
> GFP_HIGHUSER flag to allocate memory in kmalloc, but I think this
> memory leak should still be fixed. This is the first time I have
> posted a patch, so there may be something wrong.

Highmem is not supported by the slab allocators. Please ensure that there
is a warning generated if someone attempts to do such an allocation. We
used to check for that.

In order to make such allocations possible one would have to create yet
another kmalloc array for high memory.


Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-06-29 Thread Christopher Lameter
On Wed, 24 Jun 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot.  However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Maybe add something to explain why you are not simply mapping the
existing memory to NUMA node 0, which is after all just a numbering scheme
used by the kernel and can be assigned arbitrarily?

This could be seen more as a bug in the arch code during the setup of NUMA
nodes. The two nodes are created by the firmware / bootstrap code after
all. Just do not do it?



Re: [PATCH] mm: ksize() should silently accept a NULL pointer

2020-06-17 Thread Christopher Lameter
On Tue, 16 Jun 2020, William Kucharski wrote:

> Other mm routines such as kfree() and kzfree() silently do the right
> thing if passed a NULL pointer, so ksize() should do the same.

Ok, so the size of a NULL object pointer is zero? Ignoring the freeing
of a nonexistent object makes sense. But determining its size?
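For reference, the semantics being asked about amount to something like this
sketch (ZERO_OR_NULL_PTR() also covers the ZERO_SIZE_PTR case; the real
implementation additionally deals with KASAN):

    size_t ksize(const void *objp)
    {
        if (unlikely(ZERO_OR_NULL_PTR(objp)))
            return 0;       /* "size" of a nonexistent object */

        return __ksize(objp);
    }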


Re: [PATCH 0/3] mm/slub: Fix slabs_node return value

2020-06-17 Thread Christopher Lameter
On Sun, 14 Jun 2020, Muchun Song wrote:

> slabs_node() always returns zero when CONFIG_SLUB_DEBUG is disabled.
> But some code determines whether a slab is empty by checking the return
> value of slabs_node(). As you know, the result is not correct. We move
> nr_slabs of kmem_cache_node out of CONFIG_SLUB_DEBUG so we can
> get the correct value returned by slabs_node().

Note that all distribution kernels have CONFIG_SLUB_DEBUG enabled. This
does not enable runtime debugging but only compiles the debug code in that
can then be enabled at runtime.

Users of !CONFIG_SLUB_DEBUG do not want to have the debug code included
because they have extreme requirements on memory use.

This patch increases use of memory by enabling fields that were excluded
under !CONFIG_SLUB_DEBUG before!

There is nothing wrong with slabs_node()'s return value if one wants to
sacrifice debugging and consistency checks for a small code build.




Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()

2020-05-15 Thread Christopher Lameter
On Tue, 12 May 2020, Roman Gushchin wrote:

> > Add it to the metadata at the end of the object. Like the debugging
> > information or the pointer for RCU freeing.
>
> Enabling debugging metadata currently disables the cache merging.
> I doubt that it's acceptable to sacrifice the cache merging in order
> to embed the memcg pointer?

Well then keep the merging even if you have a memcg pointer.

The disabling for debugging is only there to simplify debugging. You don't have
to deal with multiple caches actually using the same storage structures.

> Figuring out all these details will likely take several weeks, so the whole
> thing will be delayed for one-two major releases (in the best case). Given that
> the current implementation saves ~40% of slab memory, I think there is some value
> in delivering it as it is. So I wonder if the idea of embedding the pointer
> should be considered a blocker, or whether it can be implemented on top of the
> proposed code (given it's not a user-facing api or something like this)?

Sorry no idea from my end here.



Re: [PATCH v4 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-12 Thread Christopher Lameter
On Tue, 12 May 2020, Srikar Dronamraju wrote:

> +#ifdef CONFIG_NUMA
> + [N_ONLINE] = NODE_MASK_NONE,

Again. Same issue as before. If you do this then you do a global change
for all architectures. You need to put something in the early boot
sequence (in a non architecture specific way) that sets the first node
online by default.

You have fixed the issue in your earlier patches for the powerpc
architecture. What about the other architectures?

Or did I miss something?



Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()

2020-05-08 Thread Christopher Lameter
On Mon, 4 May 2020, Roman Gushchin wrote:

> On Sat, May 02, 2020 at 11:54:09PM +, Christoph Lameter wrote:
> > On Thu, 30 Apr 2020, Roman Gushchin wrote:
> >
> > > Sorry, but what exactly do you mean?
> >
> > I think the right approach is to add a pointer to each slab object for
> > memcg support.
> >
>
> As I understand, embedding the memcg pointer will hopefully make allocations
> cheaper in terms of CPU, but will require more memory. And you think that
> it's worth it. Is it a correct understanding?

It definitely makes the code less complex. The additional memory is
minimal. In many cases you have already some space wasted at the end of
the object that could be used for the pointer.

> Can you, please, describe a bit more detailed how it should be done
> from your point of view?

Add it to the metadata at the end of the object. Like the debugging
information or the pointer for RCU freeing.

> I mean where to store the pointer, should it be SLAB/SLUB-specific code
> or a generic code, what do to with kmallocs alignments, should we
> merge slabs which had a different size before and now have the same
> because of the memcg pointer and aligment, etc.

Both SLAB and SLUB have the same capabilities there. Slabs that had
different sizes before will now have different sizes as well. So the
merging does not change.

> I'm happy to follow your advice and perform some tests to get an idea of
> how significant the memory overhead is and how big are CPU savings.
> I guess with these numbers it will be easy to make a decision.

Sure. The main issue is the power-of-two kmalloc caches and how to add
the pointer to these caches in order not to waste memory. SLAB has done
this in the past by creating additional structures in a page frame.
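A rough sketch of "add it to the metadata at the end of the object" at
cache-creation time (memcg_offset is a hypothetical field, not an existing
kmem_cache member; it would sit past the object like the debug/RCU metadata):

    /*
     * During cache setup (sketch):
     *     size = ALIGN(size, sizeof(void *));
     *     s->memcg_offset = size;                  (hypothetical field)
     *     size += sizeof(struct mem_cgroup *);
     */
    static struct mem_cgroup **obj_memcg_ptr(struct kmem_cache *s, void *obj)
    {
        /* memcg_offset is hypothetical, used only for illustration */
        return (struct mem_cgroup **)((char *)obj + s->memcg_offset);
    }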




Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics

2020-05-07 Thread Christopher Lameter
On Mon, 4 May 2020, Andrew Morton wrote:

> But I guess it's better than nothing at all, unless there are
> alternative ideas?

It is highly unusual to have such large partial lists. In a typical case
allocations would reduce the size of the lists. 1000s? That is scary.

Are these inodes or dentries by chance?

The defrag stuff that I had been trying to do for a long time would solve
that issue but then objects would need to be made movable



Re: [PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-07 Thread Christopher Lameter
On Sun, 3 May 2020, Rafael Aquini wrote:

> On Sat, May 02, 2020 at 11:16:30PM +0000, Christopher Lameter wrote:
> > On Fri, 1 May 2020, Rafael Aquini wrote:
> >
> > > Sometimes it is desirable to override SLUB's debug facilities
> > > default behavior upon stumbling on a cache or object error
> > > and just stop the execution in order to grab a coredump, at
> > > the error-spotting time, instead of trying to fix the issue
> > > and report in an attempt to keep the system rolling.
> >
> > The stopping of execution on an error is the default behavior. Usually
> > you get some OOPS somewhere when data is corrupted and that causes a core
> > dump.
> >
> > SLUB can fix the issue and continue if enabled by specifying special
> > options on boot. That is *not* the default.
> >
> It is the default behavior when slub_debug is turned on, which is what
> this patch is trying to override, when needed. We've been seeing the
> need for such a feature because, more often than not, letting the system
> run until it crashes somewhere else after hitting occurrences reported by
> slub_debug ends up clobbering clues to the original issue.

Ok. The backtrace is not sufficient in that case?




Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()

2020-05-02 Thread Christopher Lameter
On Thu, 30 Apr 2020, Roman Gushchin wrote:

> Sorry, but what exactly do you mean?

I think the right approach is to add a pointer to each slab object for
memcg support.



Re: [PATCH] mm: slub: add panic_on_error to the debug facilities

2020-05-02 Thread Christopher Lameter
On Fri, 1 May 2020, Rafael Aquini wrote:

> Sometimes it is desirable to override SLUB's debug facilities
> default behavior upon stumbling on a cache or object error
> and just stop the execution in order to grab a coredump, at
> the error-spotting time, instead of trying to fix the issue
> and report in an attempt to keep the system rolling.

The stopping of execution on an error is the default behavior. Usually
you get some OOPS somewhere when data is corrupted and that causes a core
dump.

SLUB can fix the issue and continue if enabled by specifying special
options on boot. That is *not* the default.



Re: [PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline

2020-05-02 Thread Christopher Lameter
On Fri, 1 May 2020, Srikar Dronamraju wrote:

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
>   */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>   [N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> + [N_ONLINE] = NODE_MASK_NONE,

Hmmm, I would have expected that you would have added something early
in boot that marks the current node (whatever it is) online instead?


Re: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus

2020-05-02 Thread Christopher Lameter
On Fri, 1 May 2020, Srikar Dronamraju wrote:

> - for_each_present_cpu(cpu)
> - numa_setup_cpu(cpu);
> + for_each_possible_cpu(cpu) {
> + /*
> +  * Powerpc with CONFIG_NUMA always used to have a node 0,
> +  * even if it was memoryless or cpuless. For all cpus that
> +  * are possible but not present, cpu_to_node() would point
> +  * to node 0. To remove a cpuless, memoryless dummy node,
> +  * powerpc need to make sure all possible but not present
> +  * cpu_to_node are set to a proper node.
> +  */
> + if (cpu_present(cpu))
> + numa_setup_cpu(cpu);
> + else
> + set_cpu_numa_node(cpu, first_online_node);
> + }
>  }


Can this be folded into numa_setup_cpu?

This looks more like numa_setup_cpu needs to change?



Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()

2020-04-30 Thread Christopher Lameter
On Mon, 27 Apr 2020, Roman Gushchin wrote:

> > Why do you need this? Just slap a pointer to the cgroup as additional
> > metadata onto the slab object. Is that not much simpler, safer and faster?
> >
>
> So, the problem is that not all slab objects are accounted, and sometimes
> we don't know in advance if they are accounted or not (with the current
> semantics of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to
> increase the size of ALL slab objects, or create a pair of slab caches
> for each size.

>
> The first option is not that cheap in terms of the memory overhead. Especially
> for those who disable cgroups using a boot-time option.


If the cgroups are disabled at boot time then you can switch back to the
compact version. Otherwise just add a pointer to each object. It will make
it consistent and there is not much memory wastage.

The problem comes about with the power of 2 caches in the kmalloc array.
If one keeps the "natural alignment" instead of going for the normal
alignment of slab caches then the alignment will cause a lot of memory
wastage and thus the scheme of off slab metadata is likely going to be
unavoidable.

But I think we are just stacking one bad idea onto another here making
things much more complex than they could be. Well at least this justifies
all our jobs  (not mine I am out of work... hehehe)



Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat

2019-10-21 Thread Christopher Lameter
On Mon, 21 Oct 2019, Roman Gushchin wrote:

> So far I haven't noticed any regression on the set of workloads where I did
> test the patchset, but if you know any benchmark or realistic test which can
> be affected by this check, I'll be happy to try.
>
> Also, less-than-word-sized operations can be slower on some platforms.

The main issue in the past was the cache footprint. Memory is slow.
Processors are fast.



Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat

2019-10-20 Thread Christopher Lameter
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
>
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

Actually this is inconsistent since the other counters are all used to
account for pages. Some of the functions in use for the page counters will
no longer make sense, e.g. inc/dec_node_state().


Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat

2019-10-20 Thread Christopher Lameter
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

This quadruples the cache footprint of the counters and likely has some
influence on performance.


Re: [PATCH 25/34] mm: Use CONFIG_PREEMPTION

2019-10-16 Thread Christopher Lameter


Acked-by: Christoph Lameter 


Re: [PATCH] mm, page_alloc: drop pointless static qualifier in build_zonelists()

2019-10-08 Thread Christopher Lameter
On Sat, 28 Sep 2019, Kaitao Cheng wrote:

> There is no need to make the 'node_order' variable static
> since a new value is always assigned before it is used.

In the past MAX_NUMNODES could become quite large, like 512 or 1k. Large
array allocations on the stack are problematic.

Maybe that is no longer the case?


Re: [PATCH v5 0/7] mm, slab: Make kmalloc_info[] contain all types of names

2019-09-16 Thread Christopher Lameter
On Mon, 16 Sep 2019, Pengfei Li wrote:

> The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
> but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
> generated by kmalloc_cache_name().
>
> Patch1 predefines the names of all types of kmalloc to save
> the time spent dynamically generating names.
>
> These changes make sense, and the time spent by new_kmalloc_cache()
> has been reduced by approximately 36.3%.

This is time spent during boot and does not affect later system
performance.

>  Time spent by new_kmalloc_cache()
>  (CPU cycles)
>  5.3-rc7              66264
>  5.3-rc7+patch_1-3    42188

Ok. 15k cycles during boot saved. So we save 5 microseconds during bootup?

The current approach was created with the view on future setups allowing a
dynamic configuration of kmalloc caches based on need. I.e. ZONE_DMA may
not be needed once the floppy driver no longer makes use of it.


Re: [PATCH v5 7/7] mm, slab_common: Modify kmalloc_caches[type][idx] to kmalloc_caches[idx][type]

2019-09-16 Thread Christopher Lameter
On Mon, 16 Sep 2019, Pengfei Li wrote:

> KMALLOC_NORMAL is the most frequently accessed, and kmalloc_caches[]
> is initialized by different types of the same size.
>
> So modifying kmalloc_caches[type][idx] to kmalloc_caches[idx][type]
> will benefit performance.


Why would that increase performance? Using your scheme means that the
KMALLOC_NORMAL pointers are spread over more cachelines. Since
KMALLOC_NORMAL is most frequently accessed this would cause
a performance regression.
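For reference, the array as currently declared and the cacheline argument (the
declaration below is the existing one from include/linux/slab.h; the rest is
a sketch of the reasoning):

    extern struct kmem_cache *
    kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

    /*
     * With [type][idx] all KMALLOC_NORMAL pointers sit next to each other
     * and share a few cachelines.  Swapping the indices to [idx][type]
     * interleaves NORMAL/RECLAIM/DMA pointers, spreading the hot
     * KMALLOC_NORMAL entries over more cachelines.
     */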



Re: [PATCH v2 4/4] mm: lock slub page when listing objects

2019-09-13 Thread Christopher Lameter
On Wed, 11 Sep 2019, Yu Zhao wrote:

> Though I have no idea what the side effect of such race would be,
> apparently we want to prevent the free list from being changed
> while debugging the objects.

process_slab() is called under the list_lock, which prevents any allocation
from the freelist in the slab page. This means that new objects can only be
added to the freelist, which happens by updating the freelist pointer in the
slab page with a pointer to the newly freed object, which in turn contains
the old freelist pointer.

It is therefore possible to safely traverse the objects on the freelist
after the pointer has been retrieved.
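A sketch of the traversal being described (assuming SLUB-internal helpers such
as get_freepointer() and slab_index(); the bitmap marking stands in for what
process_slab() actually does with each object):

    void *p = page->freelist;       /* snapshot of the current head */

    /*
     * Concurrent frees only push new objects in front of the head that
     * was already read; each freed object stores the previous head as
     * its next pointer, so the chain reachable from the snapshot never
     * changes underneath us.
     */
    for ( ; p; p = get_freepointer(s, p))
        set_bit(slab_index(p, s, addr), map);   /* mark object as free */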

NAK.



Re: [PATCH 0/5] mm, slab: Make kmalloc_info[] contain all types of names

2019-09-04 Thread Christopher Lameter
On Wed, 4 Sep 2019, Pengfei Li wrote:

> There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
> and KMALLOC_DMA.

I only got a few patches of this set. Can I see the complete patchset
somewhere?



Re: [PATCH v2 2/2] mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)

2019-09-03 Thread Christopher Lameter
On Sat, 31 Aug 2019, Matthew Wilcox wrote:

> > The current behavior without special alignment for these caches has been
> > in the wild for over a decade. And this is now coming up?
>
> In the wild ... and rarely enabled.  When it is enabled, it may or may
> not be noticed as data corruption, or tripping other debugging asserts.
> Users then turn off the rare debugging option.

It's enabled in all full debug sessions as far as I know. Fedora for
example has been running this for ages to find breakage in device drivers
etc. etc.

> > If there is an exceptional alignment requirement then that needs to be
> > communicated to the allocator. A special flag or create a special
> > kmem_cache or something.
>
> The only way I'd agree to that is if we deliberately misalign every
> allocation that doesn't have this special flag set.  Because right now,
> breakage happens everywhere when these debug options are enabled, and
> the very people who need to be helped are being hurt by the debugging.

That customarily occurs for testing by adding "slub_debug" to the
kernel command line (or adding debug kernel options), and since my
information is that this is done frequently (and has been for over a
decade now) I am having a hard time believing the stories of great
breakage here. These drivers were not tested with debugging on before?
Never ran with a debug kernel?


Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read

2019-07-18 Thread Christopher Lameter
On Wed, 17 Jul 2019, Waiman Long wrote:

> The show method of /sys/kernel/slab//shrink sysfs file currently
> returns nothing. This is now modified to show the time of the last
> cache shrink operation in us.

What is this useful for? Any use cases?

> CONFIG_SLUB_DEBUG depends on CONFIG_SYSFS. So the new shrink_us field
> is always available to the shrink methods.

Aside from minimal systems without CONFIG_SYSFS... Does this build without
CONFIG_SYSFS?


Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches

2019-07-18 Thread Christopher Lameter
On Wed, 17 Jul 2019, Waiman Long wrote:

> Currently, a value of '1" is written to /sys/kernel/slab//shrink
> file to shrink the slab by flushing out all the per-cpu slabs and free
> slabs in partial lists. This can be useful to squeeze out a bit more memory
> under extreme condition as well as making the active object counts in
> /proc/slabinfo more accurate.

Acked-by: Christoph Lameter 

>  # grep task_struct /proc/slabinfo
>  task_struct    53137  53192   4288     61    4 : tunables    0    0    0 : slabdata    872    872      0
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:3936832 kB
>  SReclaimable: 399104 kB
>  SUnreclaim:  3537728 kB
>
> After shrinking slabs:
>
>  # grep "^S[lRU]" /proc/meminfo
>  Slab:1356288 kB
>  SReclaimable: 263296 kB
>  SUnreclaim:  1092992 kB

Well, another indicator that it may not be a good decision to replicate the
whole set of slabs for each memcg. Migrating the memcg ownership into the
objects may allow the use of the same slab cache. In particular, together
with the slab migration patches, this may be a viable way to reduce memory
consumption.



Re: [PATCH v5 4/5] mm/slab: Refactor common ksize KASAN logic into slab_common.c

2019-07-08 Thread Christopher Lameter
On Mon, 8 Jul 2019, Marco Elver wrote:

> This refactors common code of ksize() between the various allocators
> into slab_common.c: __ksize() is the allocator-specific implementation
> without instrumentation, whereas ksize() includes the required KASAN
> logic.

Acked-by: Christoph Lameter 



Re: [PATCH] mm/slab: One function call less in verify_redzone_free()

2019-07-05 Thread Christopher Lameter
On Fri, 5 Jul 2019, Markus Elfring wrote:

> Avoid an extra function call by using a ternary operator instead of
> a conditional statement for a string literal selection.

Well, I thought the compiler does that on its own? And the ternary operator
makes the code difficult to read.



Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches

2019-07-03 Thread Christopher Lameter
On Wed, 3 Jul 2019, Waiman Long wrote:

> On 7/3/19 2:56 AM, Michal Hocko wrote:
> > On Tue 02-07-19 14:37:30, Waiman Long wrote:
> >> Currently, a value of '1" is written to /sys/kernel/slab//shrink
> >> file to shrink the slab by flushing all the per-cpu slabs and free
> >> slabs in partial lists. This applies only to the root caches, though.
> >>
> >> Extends this capability by shrinking all the child memcg caches and
> >> the root cache when a value of '2' is written to the shrink sysfs file.
> > Why do we need a new value for this functionality? I would tend to think
> > that skipping memcg caches is a bug/incomplete implementation. Or is it
> > a deliberate decision to cover root caches only?
>
> It is just that I don't want to change the existing behavior of the
> current code. It will definitely take longer to shrink both the root
> cache and the memcg caches. If we all agree that the only sensible
> operation is to shrink root cache and the memcg caches together. I am
> fine just adding memcg shrink without changing the sysfs interface
> definition and be done with it.

I think it's best and consistent behavior to shrink all memcg caches
together with the root cache. This looks like an oversight and thus a bugfix.



Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs

2019-06-28 Thread Christopher Lameter
On Thu, 27 Jun 2019, Roman Gushchin wrote:

> so that objects belonging to different memory cgroups can share the same page
> and kmem_caches.
>
> It's a fairly big change though.

Could this be done at another level? Put a cgroup pointer into the
corresponding structures and then go back to just a single kmem_cache for
the system as a whole? You can still account them per cgroup and there
will be no cleanup problem anymore. You could scan through a slab cache
to remove the objects of a certain cgroup, and then the fragmentation
problem that cgroups create here will be handled by the slab allocators in
the traditional way. The duplication of the kmem_cache was not designed
into the allocators but bolted on later.



Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT

2019-06-04 Thread Christopher Lameter
On Mon, 3 Jun 2019, Minchan Kim wrote:

> @@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
>   return 0;
>  }
>
> +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +{
> + pte_t *orig_pte, *pte, ptent;
> + spinlock_t *ptl;
> + LIST_HEAD(page_list);
> + struct page *page;
> + int isolated = 0;
> + struct vm_area_struct *vma = walk->vma;
> + unsigned long next;
> +
> + if (fatal_signal_pending(current))
> + return -EINTR;
> +
> + next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*pmd)) {
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (!ptl)
> + return 0;
> +
> + if (is_huge_zero_pmd(*pmd))
> + goto huge_unlock;
> +
> + page = pmd_page(*pmd);
> + if (page_mapcount(page) > 1)
> + goto huge_unlock;
> +
> + if (next - addr != HPAGE_PMD_SIZE) {
> + int err;
> +
> + get_page(page);
> + spin_unlock(ptl);
> + lock_page(page);
> + err = split_huge_page(page);
> + unlock_page(page);
> + put_page(page);
> + if (!err)
> + goto regular_page;
> + return 0;
> + }

I have seen this before multiple times. Is there a way to avoid
replicating the whole shebang?



Re: [PATCH] slab: remove /proc/slab_allocators

2019-05-21 Thread Christopher Lameter
On Thu, 16 May 2019, Qian Cai wrote:

> It turned out that DEBUG_SLAB_LEAK is still broken even after recent
> rescue efforts, in that when there is a large number of objects like
> kmemleak_object, which is normal on a debug kernel,

Acked-by: Christoph Lameter 


Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

2019-05-15 Thread Christopher Lameter
On Tue, 14 May 2019, Roman Gushchin wrote:

> To make this possible we need to introduce a new percpu refcounter
> for non-root kmem_caches. The counter is initialized to the percpu
> mode, and is switched to atomic mode after deactivation, so we never
> shutdown an active cache. The counter is bumped for every charged page
> and also for every running allocation. So the kmem_cache can't
> be released unless all allocations complete.

Increasing refcounts during each allocation? That looks to be quite heavy
processing.


Re: [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting

2019-05-13 Thread Christopher Lameter
On Wed, 8 May 2019, Roman Gushchin wrote:

> Currently the page accounting code is duplicated in SLAB and SLUB
> internals. Let's move it into new (un)charge_slab_page helpers
> in the slab_common.c file. These helpers will be responsible
> for statistics (global and memcg-aware) and memcg charging.
> So they are replacing direct memcg_(un)charge_slab() calls.

Looks good.

Acked-by: Christoph Lameter 


Re: DISCONTIGMEM is deprecated

2019-04-30 Thread Christopher Lameter
On Mon, 29 Apr 2019, Christoph Hellwig wrote:

> So maybe it it time to mark SN2 broken and see if anyone screams?
>
> Without SN2 the whole machvec mess could basically go away - the
> only real difference between the remaining machvecs is which iommu
> if any we set up.

SPARSEMEM with VMEMMAP was developed to address these
issues and allow one mapping scheme across the different platforms.

You do not need DISCONTIGMEM support for SN2. And as far as I know (from a
decade ago ok) the distributions were using VMEMMAP instead.




Re: [PATCH] mm: Allow userland to request that the kernel clear memory on release

2019-04-25 Thread Christopher Lameter
On Wed, 24 Apr 2019, Matthew Garrett wrote:

> Applications that hold secrets and wish to avoid them leaking can use
> mlock() to prevent the page from being pushed out to swap and
> MADV_DONTDUMP to prevent it from being included in core dumps. Applications
> can also use atexit() handlers to overwrite secrets on application exit.
> However, if an attacker can reboot the system into another OS, they can
> dump the contents of RAM and extract secrets. We can avoid this by setting

Well, nothing in this patchset deals with that issue. That hole still
exists afterwards. So is it worth having this functionality?

> Unfortunately, if an application exits uncleanly, its secrets may still be
> present in RAM. This can't be easily fixed in userland (eg, if the OOM
> killer decides to kill a process holding secrets, we're not going to be able
> to avoid that), so this patch adds a new flag to madvise() to allow userland
> to request that the kernel clear the covered pages whenever the page
> reference count hits zero. Since vm_flags is already full on 32-bit, it
> will only work on 64-bit systems.

But then the pages are cleared anyway when reallocated to another
process. This just clears them sooner, before reuse. So it will reduce the
time that a page contains the secret sauce in case the program is
aborted and cannot run its exit handling.

Is that really worth extending system calls and adding kernel handling for
this? Maybe the answer is yes given our current concern about anything
related to "security".



Re: DISCONTIGMEM is deprecated

2019-04-22 Thread Christopher Lameter
On Fri, 19 Apr 2019, Matthew Wilcox wrote:

> ia64 (looks complicated ...)

Well as far as I can tell it was not even used 12 or so years ago on
Itanium when I worked on that stuff.


Re: [PATCH 4/5] mm: rework non-root kmem_cache lifecycle management

2019-04-18 Thread Christopher Lameter
On Wed, 17 Apr 2019, Roman Gushchin wrote:

>  static __always_inline int memcg_charge_slab(struct page *page,
>gfp_t gfp, int order,
>struct kmem_cache *s)
>  {
> - if (is_root_cache(s))
> + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> + int ret;
> +
> + if (is_root_cache(s)) {
> + mod_node_page_state(page_pgdat(page), idx, 1 << order);

Hmmm... This is functionality that is not memcg specific being moved into
a memcg function??? Maybe rename the function to indicate that it is not
memcg specific and add the proper #ifdefs?

>  static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>   struct kmem_cache *s)
>  {
> - memcg_kmem_uncharge(page, order);
> + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ?
> + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE;
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec;
> +
> + if (is_root_cache(s)) {
> + mod_node_page_state(page_pgdat(page), idx, -(1 << order));
> + return;
> + }

And again.



Re: [PATCH 4/5] mm: rework non-root kmem_cache lifecycle management

2019-04-18 Thread Christopher Lameter
On Wed, 17 Apr 2019, Roman Gushchin wrote:

> Let's make every page to hold a reference to the kmem_cache (we
> already have a stable pointer), and make kmem_caches to hold a single
> reference to the memory cgroup.

Ok you are freeing one word in the page struct that can be used for other
purposes now?



Re: [External] Re: Basics : Memory Configuration

2019-04-10 Thread Christopher Lameter
Please respond to my comments in the way that everyone else communicates
here. I cannot distinguish what you said from what I said before.


Re: Basics : Memory Configuration

2019-04-09 Thread Christopher Lameter
On Tue, 9 Apr 2019, Pankaj Suryawanshi wrote:


> I am confused about memory configuration and I have the below questions

Hmmm... Yes some of the terminology that you use is a bit confusing.

> 1. If the 32-bit os maximum virtual address is 4GB, and I have 4 GB of ram
> for the 32-bit os, what about the virtual memory size? Is virtual
> memory (disk space) required, or can we directly use physical memory?

The virtual memory size is the maximum virtual size of a single process.
Multiple processes can run and each can use different amounts of physical
memory. So both are actually independent.

The size of the virtual memory space per process is configurable on x86 32
bit (2G, 3G, 4G). Thus the possible virtual process size may vary
depending on the hardware architecture and the configuration of the
kernel.

> 2. In a 32-bit os, 12 bits are the offset because page size=4k, i.e. 2^12, and
> 2^20 for page addresses.
>    What about a 64-bit os? What is the offset size? What is the page size? How
> is it calculated?

12 bits are passed through? That's what you mean?

The remainder of the bits are used to look up the physical frame
number (PFN) in the page tables.

64 bit is the same. However, the number of bits used for lookups in the
page tables are much higher.


> 3. What is PAE? If enabled how to decide size of PAE, what is maximum
> and minimum size of extended memory.

PAE increases the physical memory size that can be addressed through a
page table lookup. The number of bits that can be specified in the PFN is
increased and thus more than 4GB of physical memory can be used by the
operating system. However, the virtual memory size stays the same and an
individual process still cannot use more memory.
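A concrete worked example of the bit arithmetic, assuming 4 KiB pages:

    offset bits            = log2(4096) = 12
    PFN bits without PAE   = 32 - 12    = 20  ->  2^20 pages * 4 KiB = 4 GiB physical
    classic x86 PAE        : 64-bit page table entries, 36-bit physical addresses
                             ->  2^24 pages * 4 KiB = 64 GiB physical addressable
    virtual address space  : still 2^32 = 4 GiB per process, unchanged by PAE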



Re: [PATCH] slab: fix a crash by reading /proc/slab_allocators

2019-04-08 Thread Christopher Lameter
On Sun, 7 Apr 2019, Linus Torvalds wrote:

> On Sat, Apr 6, 2019 at 12:59 PM Qian Cai  wrote:
> >
> > The commit 510ded33e075 ("slab: implement slab_root_caches list")
> > changes the name of the list node within "struct kmem_cache" from
> > "list" to "root_caches_node", but leaks_show() still use the "list"
> > which causes a crash when reading /proc/slab_allocators.
>
> The patch does seem to be correct, and I have applied it.
>
> However, it does strike me that apparently this wasn't caught for two
> years. Which makes me wonder whether we should (once again) discuss
> just removing SLAB entirely, or at least removing the
> /proc/slab_allocators file. Apparently it has never been used in the
> last two years. At some point a "this can't have worked if  anybody
> ever tried to use it" situation means that the code should likely be
> excised.

This is only occurring with specially built kernels so that memory leaks
can be investigated. The same is done with other tools (KASAN and friends)
today I guess, and also the SLUB debugging tools are much more user
friendly. So this means that some esoteric debugging feature of SLAB was
broken.



Re: [PATCH] percpu/module resevation: change resevation size iff X86_VSMP is set

2019-04-04 Thread Christopher Lameter
On Wed, 13 Mar 2019, Barret Rhoden wrote:

> > It is very expensive. VSMP exchanges 4K segments via RDMA between servers
> > to build a large address space and run a kernel in the large address
> > space. Using smaller segments can cause a lot of
> > "cacheline" bouncing (meaning transfers of 4K segments back and forth
> > between servers).
> >
>
> Given that these are large machines, would it be OK to statically reserve 64K
> on them for modules' percpu data?

Likely.

> The bug that led me to here was from someone running on a non-VSMP machine but
> had that config set.  Perhaps we make it more clear in the Kconfig option to
> not set it on other machines.  That might make it less likely anyone on a
> non-VSMP machine pays the 64K overhead.

Right.

> Are there any other alternatives?  Not using static SRCU in any code that
> could be built as a module seems a little harsh.

Sorry this ended up in my spam folder somehow. Just fished it out.



Re: [RFC 2/2] mm, slub: add missing kmem_cache_debug() checks

2019-04-04 Thread Christopher Lameter
On Thu, 4 Apr 2019, Vlastimil Babka wrote:

> Some debugging checks in SLUB are not hidden behind kmem_cache_debug() check.
> Add the check so that those places can also benefit from reduced overhead
> thanks to the the static key added by the previous patch.

Hmmm... I would not expect too much of a benefit from these changes since
most of the stuff is actually not on the hot path.


Re: [RFC 0/2] add static key for slub_debug

2019-04-04 Thread Christopher Lameter
On Thu, 4 Apr 2019, Vlastimil Babka wrote:

> I looked a bit at SLUB debugging capabilities and first thing I noticed is
> there's no static key guarding the runtime enablement as is common for similar
> debugging functionalities, so here's a RFC to add it. Can be further improved
> if there's interest.

Well, the runtime enablement is per slab cache and static keys are global.

Adding a static key adds code to the critical paths. Since the flags of a
kmem_cache have to be inspected anyway, there may not be that much of a
benefit.

> It's true that in the alloc fast path the debugging check overhead is AFAICS
> amortized by the per-cpu cache, i.e. when the allocation is from there, no
> debugging functionality is performed. IMHO that's kinda a weakness, especially
> for SLAB_STORE_USER, so I might also look at doing something about it, and 
> then
> the static key might be more critical for overhead reduction.

Moving debugging out of the per cpu fastpath allows that fastpath to be
much simpler and faster.

SLAB_STORE_USER is mostly used only for debugging in which case we are
less concerned with performance.

If you want to use SLAB_STORE_USER in the fastpath then we have to do some
major redesign there.

> In the freeing fast path I quickly checked the stats and it seems that in
> do_slab_free(), the "if (likely(page == c->page))" is not as likely as it
> declares, as in the majority of cases, freeing doesn't happen on the object
> that belongs to the page currently cached. So the advantage of a static key in
> slow path __slab_free() should be more useful immediately.

Right. The freeing logic is actually a weakness in terms of performance for
SLUB due to the need to operate on a per-page queue immediately.


Re: [RFC PATCH v2 14/14] dcache: Implement object migration

2019-04-04 Thread Christopher Lameter
On Wed, 3 Apr 2019, Al Viro wrote:

> > This is an RFC and we want to know how to do this right.
>
> If by "how to do it right" you mean "expedit kicking out something with
> non-zero refcount" - there's no way to do that.  Nothing even remotely
> sane.

Sure we know that.

> If you mean "kick out everything in this page with zero refcount" - that
> can be done (see further in the thread).

Ok, that would already be progress. If we can use this to liberate some
slab pages with just a few dentry objects then it may be worthwhile.

> Look, dentries and inodes are really, really not relocatable.  If they
> can be evicted by memory pressure - sure, we can do that for a given
> set (e.g. "everything in that page").  But that's it - if memory
> pressure would _not_ get rid of that one, there's nothing to be done.
> Again, all VM can do is to simulate shrinker hitting hard on given
> bunch (rather than buggering the entire cache).  If filesystem (or
> something in VFS) says "it's busy", it bloody well _is_ busy and
> won't be going away until it ceases to be such.

Right. That's why the patch attempted to check for these things to avoid
touching such objects.



Re: [RFC PATCH v2 14/14] dcache: Implement object migration

2019-04-03 Thread Christopher Lameter
On Wed, 3 Apr 2019, Al Viro wrote:

> Let's do d_invalidate() on random dentries and hope they go away.
> With convoluted and brittle logics for deciding which ones to
> spare, which is actually wrong.  This will pick mountpoints
> and tear them out, to start with.
>
> NAKed-by: Al Viro 
>
> And this is a NAK for the entire approach; if it has a positive refcount,
> LEAVE IT ALONE.  Period.  Don't play this kind of games, they are wrong.
> d_invalidate() is not something that can be done to an arbitrary dentry.

Well, could you help us figure out how to do it the right way? We (the MM
guys) are having a hard time, not being familiar with the filesystem stuff.

This is an RFC and we want to know how to do this right.



Re: [PATCH v5 6/7] slab: Use slab_list instead of lru

2019-04-03 Thread Christopher Lameter


Acked-by: Christoph Lameter 




Re: [PATCH v5 3/7] slob: Use slab_list instead of lru

2019-04-03 Thread Christopher Lameter


Acked-by: Christoph Lameter 



Re: [PATCH v5 1/7] list: Add function list_rotate_to_front()

2019-04-03 Thread Christopher Lameter
On Wed, 3 Apr 2019, Tobin C. Harding wrote:

> Add function list_rotate_to_front() to rotate a list until the specified
> item is at the front of the list.

Reviewed-by: Christoph Lameter 


Re: [PATCH v5 2/7] slob: Respect list_head abstraction layer

2019-04-03 Thread Christopher Lameter
On Wed, 3 Apr 2019, Tobin C. Harding wrote:

> Currently we reach inside the list_head.  This is a violation of the
> layer of abstraction provided by the list_head.  It makes the code
> fragile.  More importantly it makes the code wicked hard to understand.

Great! It definitely makes it clearer. The boolean parameter is not
so nice but I have no idea how to avoid it in the brief time I spent
looking at it.

Acked-by: Christoph Lameter 



Re: [PATCH v3] kmemleak: survive in a low-memory situation

2019-03-26 Thread Christopher Lameter
On Tue, 26 Mar 2019, Qian Cai wrote:

> + if (!object) {
> + /*
> +  * The tracked memory was allocated successful, if the kmemleak
> +  * object failed to allocate for some reasons, it ends up with
> +  * the whole kmemleak disabled, so let it success at all cost.

"let it succeed at all costs"

> +  */
> + gfp = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC :
> +gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;
> + object = kmem_cache_alloc(object_cache, gfp);
> + }
> +
>   if (!object) {

If the alloc must succeed then this check is no longer necessary.


Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

2019-03-25 Thread Christopher Lameter
On Mon, 25 Mar 2019, Matthew Wilcox wrote:

> Options:
>
> 1. Dispense with this optimisation and always store the size of the
> object before the object.

I think that's how SLOB handled it at some point in the past. Let's go back
to that setup so it's compatible with the other allocators?
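
For illustration, this is roughly what the "size before the object" layout
looks like. A minimal userspace sketch with made-up names (sized_alloc()
and friends), not the actual SLOB code:

#include <stdlib.h>

/*
 * Store the size in a header word in front of the object so that a
 * size-agnostic free()/ksize()-style helper can recover it later.
 */
static void *sized_alloc(size_t size)
{
	size_t *p = malloc(sizeof(size_t) + size);

	if (!p)
		return NULL;
	*p = size;		/* header holds the object size */
	return p + 1;		/* the caller only sees the object */
}

static size_t sized_size(const void *obj)
{
	return ((const size_t *)obj)[-1];	/* read the header back */
}

static void sized_free(void *obj)
{
	if (obj)
		free((size_t *)obj - 1);	/* step back over the header */
}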


Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

2019-03-25 Thread Christopher Lameter
On Fri, 22 Mar 2019, Matthew Wilcox wrote:

> On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > On Fri, 22 Mar 2019, Waiman Long wrote:
> >
> > > >
> > > >> I am looking forward to it.
> > > > There is also already RCU being used in these paths. Would kfree_rcu()
> > > > not be enough? It is an established mechanism that is mature and well
> > > > understood.
> > > >
> > > In this case, the memory objects are from kmem caches, so they can't
> > > freed using kfree_rcu().
> >
> > Oh they can. kfree() can free memory from any slab cache.
>
> Only for SLAB and SLUB.  SLOB requires that you pass a pointer to the
> slab cache; it has no way to look up the slab cache from the object.

Well then we could either fix SLOB to conform to the others or add a
kmem_cache_free_rcu() variant.



Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

2019-03-22 Thread Christopher Lameter
On Fri, 22 Mar 2019, Waiman Long wrote:

> >
> >> I am looking forward to it.
> > There is also already RCU being used in these paths. Would kfree_rcu() not
> > be enough? It is an established mechanism that is mature and well
> > understood.
> >
> In this case, the memory objects are from kmem caches, so they can't
> freed using kfree_rcu().

Oh they can. kfree() can free memory from any slab cache.
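
To make that concrete, here is a minimal sketch (hypothetical cache and
struct, not taken from the signal code) of allocating from a dedicated
kmem cache and releasing the object with plain kfree():

#include <linux/slab.h>
#include <linux/errno.h>

struct foo {
	int a;
	long b;
};

static struct kmem_cache *foo_cachep;

static int foo_demo(void)
{
	struct foo *f;

	foo_cachep = kmem_cache_create("foo", sizeof(struct foo), 0,
				       SLAB_HWCACHE_ALIGN, NULL);
	if (!foo_cachep)
		return -ENOMEM;

	f = kmem_cache_alloc(foo_cachep, GFP_KERNEL);
	if (f)
		kfree(f);	/* no need to name the cache it came from */

	kmem_cache_destroy(foo_cachep);
	return 0;
}

As Matthew notes elsewhere in the thread, SLOB is the exception here and
needs the cache pointer.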

> There are certainly overhead using the kfree_rcu(), or a
> kfree_rcu()-like mechanism. Also I think the actual freeing is done at
> SoftIRQ context which can be a problem if there are too many memory
> objects to free.

No, there is none that I am aware of. And this has survived testing of HPC
loads with gazillions of objects that have to be freed from multiple
processors. We really should not rearchitect this stuff... It took us
quite a long time to get this to scale well under all loads.

Please use kfree_rcu().




Re: [PATCH 1/4] mm: Implement kmem objects freeing queue

2019-03-22 Thread Christopher Lameter
On Thu, 21 Mar 2019, Waiman Long wrote:

> When releasing kernel data structures, freeing up the memory
> occupied by those objects is usually the last step. To avoid races,
> the release operation is commonly done with a lock held. However, the
> freeing operations do not need to be under lock, but are in many cases.
>
> In some complex cases where the locks protect many different memory
> objects, that can be a problem especially if some memory debugging
> features like KASAN are enabled. In those cases, freeing memory objects
> under lock can greatly lengthen the lock hold time. This can even lead
> to soft/hard lockups in some extreme cases.
>
> To make it easer to defer freeing memory objects until after unlock,
> a kernel memory freeing queue mechanism is now added. It is modelled
> after the wake_q mechanism for waking up tasks without holding a lock.

It is already pretty easy. You just store the pointer to the slab object
in a local variable, finish all the unlocks and then free the objects.
This is done in numerous places in the kernel.
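
A minimal sketch of that pattern (hypothetical names, no particular
subsystem):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct item {
	struct list_head node;
	int payload;
};

static void remove_first_item(spinlock_t *lock, struct list_head *head)
{
	struct item *it = NULL;

	spin_lock(lock);
	if (!list_empty(head)) {
		it = list_first_entry(head, struct item, node);
		list_del(&it->node);	/* unpublish the object under the lock */
	}
	spin_unlock(lock);

	kfree(it);	/* kfree(NULL) is a no-op; no lock held here */
}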

I fear that the automated mechanism will make the code more difficult to
read and result in a loss of clarity of the sequencing of events in
releasing locks and objects.

Also there is already kfree_rcu which does a similar thing to what you are
proposing here and is used in numerous places.



Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

2019-03-22 Thread Christopher Lameter
On Fri, 22 Mar 2019, Waiman Long wrote:

> I am looking forward to it.

There is also already RCU being used in these paths. Would kfree_rcu() not
be enough? It is an established mechanism that is mature and well
understood.



Re: [PATCH] mm, slab: remove unneed check in cpuup_canceled

2019-03-22 Thread Christopher Lameter
On Thu, 21 Mar 2019, Li RongQing wrote:

> nc is a member of percpu allocation memory, and impossible NULL

Acked-by: Christoph Lameter 


Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-19 Thread Christopher Lameter
On Tue, 19 Mar 2019, John Hubbard wrote:

> >
> > My concerns do not affect this patchset which just marks the get/put for
> > the pagecache. The problem was that the description was making claims that
> > were a bit misleading and seemed to prescribe a solution.
> >
> > So let's get this merged. Whatever the solution will be, we will need this
> > markup.
> >
>
> Sounds good. Do you care to promote that thought into a formal ACK for me? :)

Reviewed-by: Christoph Lameter 



Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-19 Thread Christopher Lameter
On Wed, 20 Mar 2019, Dave Chinner wrote:

> So the plan for GUP vs writeback so far is "break fsync()"? :)

Well, if it's an anonymous page and not a file-backed page then the
semantics are preserved. Disallow GUP long-term pinning (marking stuff like
in this patchset may make that possible) and it's clean.


Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-19 Thread Christopher Lameter
On Fri, 8 Mar 2019, john.hubb...@gmail.com wrote:

> We seem to have pretty solid consensus on the concept and details of the
> put_user_pages() approach. Or at least, if we don't, someone please speak
> up now. Christopher Lameter, especially, since you had some concerns
> recently.

My concerns do not affect this patchset which just marks the get/put for
the pagecache. The problem was that the description was making claims that
were a bit misleading and seemed to prescribe a solution.

So let's get this merged. Whatever the solution will be, we will need this
markup.



Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-13 Thread Christopher Lameter
On Wed, 13 Mar 2019, Christoph Hellwig wrote:

> On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote:
> > On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote:
> > > IMHO I don't think that the copy_file_range() is going to carry us 
> > > through the
> > > next wave of user performance requirements.  RDMA, while the first, is 
> > > not the
> > > only technology which is looking to have direct access to files.  XDP is
> > > another.[1]
> >
> > Sure, all I doing here was demonstrating that people have been
> > trying to get local direct access to file mappings to DMA directly
> > into them for a long time. Direct Io games like these are now
> > largely unnecessary because we now have much better APIs to do
> > zero-copy data transfer between files (which can do hardware offload
> > if it is available!).
>
> And that is just the file to file case.  There are tons of other
> users of get_user_pages, including various drivers that do large
> amounts of I/O like video capture.  For them it makes tons of sense
> to transfer directly to/from a mmap()ed file.

That is very similar to the RDMA case and DAX etc. We need to have a way
to tell a filesystem that this is going to happen and that things need to
be set up for this to work properly.

But if that has not been done then I think it's proper to fail a long-term
pin operation on page cache pages. Meaning the regular filesystems
maintain control of what's happening with their pages.


Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-13 Thread Christopher Lameter
On Tue, 12 Mar 2019, Jerome Glisse wrote:

> > > This has been discuss extensively already. GUP usage is now widespread in
> > > multiple drivers, removing that would regress userspace ie break existing
> > > application. We all know what the rules for that is.

You are still misstating the issue. In RDMA land GUP is widely used for
anonymous memory and memory-based filesystems. *Not* for real filesystems.

> > That someone was able to get away with weird ways of abusing the system
> > is not an argument that we should continue to allow such things. In fact
> > we have repeatedly ensured that the kernel works reliably by improving the
> > kernel so that a proper failure occurs.
>
> Driver doing GUP on mmap of regular file is something that seems to
> already have widespread user (in the RDMA devices at least). So they
> are active users and they were never told that what they are doing
> was illegal.

Not true. Again, please differentiate the use cases between regular
filesystems and anonymous mappings.

> > Well swapout cannot occur if the page is pinned and those pages are also
> > often mlocked.
>
> I would need to check the swapout code but i believe the write to disk
> can happen before the pin checks happens. I believe the event flow is:
> map read only, allocate swap, write to disk, try to free page which
> checks for pin. So that you could write stale data to disk and the GUP
> going away before you perform the pin checks.

Swap allocation is a separate step that associates a swap entry with an
anonymous page.

> They are other thing to take into account and that need proper page
> dirtying, like soft dirtyness for instance.

RDMA mapped pages are all dirty all the time.

> Well RDMA driver maintainer seems to report that this has been a valid
> and working workload for their users.

No, they don't.

Could you please get up to date on the discussion before posting?


Re: [PATCH v2 5/5] mm: Remove stale comment from page struct

2019-03-13 Thread Christopher Lameter


Acked-by: Christoph Lameter 




Re: [PATCH v2 4/5] slob: Use slab_list instead of lru

2019-03-13 Thread Christopher Lameter
On Wed, 13 Mar 2019, Tobin C. Harding wrote:

> @@ -297,7 +297,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int 
> align, int node)
>   continue;
>
>   /* Attempt to alloc */
> - prev = sp->lru.prev;
> + prev = sp->slab_list.prev;
>   b = slob_page_alloc(sp, size, align);
>   if (!b)
>   continue;

Hmmm... Is there a way to use a macro or so to avoid referencing the field
within the slab_list?


Re: [PATCH v2 2/5] slub: Use slab_list instead of lru

2019-03-13 Thread Christopher Lameter


Acked-by: Christoph Lameter 




Re: [PATCH v2 1/5] slub: Add comments to endif pre-processor macros

2019-03-13 Thread Christopher Lameter


Acked-by: Christoph Lameter 




Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-11 Thread Christopher Lameter
On Mon, 11 Mar 2019, Dave Chinner wrote:

> > Direct IO on an mmapped, file-backed page doesn't make any sense.
>
> People have used it for many, many years as zero-copy data movement
> pattern. i.e. mmap the destination file, use direct IO to DMA direct
> into the destination file page cache pages, fdatasync() to force
> writeback of the destination file.

Well, we could make that safer through a special API that designates a
range of pages in a file in the same way as for RDMA. This is inherently
unreliable, as we found out.

> Now we have copy_file_range() to optimise this sort of data
> movement, the need for games with mmap+direct IO largely goes away.
> However, we still can't just remove that functionality as it will
> break lots of random userspace stuff...

It is already broken and unreliable. Are there really "lots" of these
things around? Can we test this by adding a warning in the kernel and
seeing where it actually crops up?


Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-11 Thread Christopher Lameter
On Fri, 8 Mar 2019, Jerome Glisse wrote:

> >
> > It would be good if that understanding were enforced somehow given the
> > problems that we see.
>
> This has been discuss extensively already. GUP usage is now widespread in
> multiple drivers, removing that would regress userspace ie break existing
> application. We all know what the rules for that is.

The applications that work are using anonymous memory and memory
filesystems. I have never seen use cases with a real filesystem and would
have objected if someone tried something crazy like that.

That someone was able to get away with weird ways of abusing the system
is not an argument that we should continue to allow such things. In fact
we have repeatedly ensured that the kernel works reliably by improving the
kernel so that a proper failure occurs.


> > > In fact, the GUP documentation even recommends that pattern.
> >
> > Isn't that pattern safe for anonymous memory and memory filesystems like
> > hugetlbfs etc.? That is the common use case.
>
> Still an issue in respect to swapout ie if anon/shmem page was map
> read only in preparation for swapout and we do not report the page
> as dirty what endup in swap might lack what was written last through
> GUP.

Well swapout cannot occur if the page is pinned and those pages are also
often mlocked.

> >
> > Yes you now have the filesystem as well as the GUP pinner claiming
> > authority over the contents of a single memory segment. Maybe better not
> > allow that?
>
> This goes back to regressing existing driver with existing users.

There is no regression if that behavior never really worked.

> > Two filesystem trying to sync one memory segment both believing to have
> > exclusive access and we want to sort this out. Why? Dont allow this.
>
> This is allowed, it always was, forbidding that case now would regress
> existing application and it would also means that we are modifying the
> API we expose to userspace. So again this is not something we can block
> without regressing existing user.

We have always stopped the user from doing obviously stupid and risky
things. It would be logical to do it here as well.



Re: [RFC 04/15] slub: Enable Slab Movable Objects (SMO)

2019-03-11 Thread Christopher Lameter
On Mon, 11 Mar 2019, Roman Gushchin wrote:

> > +static inline void *alloc_scratch(struct kmem_cache *s)
> > +{
> > +   unsigned int size = oo_objects(s->max);
> > +
> > +   return kmalloc(size * sizeof(void *) +
> > +  BITS_TO_LONGS(size) * sizeof(unsigned long),
> > +  GFP_KERNEL);
>
> I wonder how big this allocation can be?
> Given that the reason for migration is probably highly fragmented memory,
> we probably don't want to have a high-order allocation here. So maybe
> kvmalloc()?

The smallest object size is 8 bytes, i.e. one word, and such objects would
be placed in an order-0 page. So the scratch allocation comes out to about
a page again.
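
A quick back-of-the-envelope check of that, as plain C just for the
arithmetic (assuming 4 KiB pages and a 64-bit kernel):

#include <stdio.h>

int main(void)
{
	unsigned int page_size = 4096;			/* order-0 slab page */
	unsigned int object_size = 8;			/* smallest object: one word */
	unsigned int nr_objects = page_size / object_size;	/* 512 */
	unsigned int longs = (nr_objects + 63) / 64;	/* BITS_TO_LONGS(512) */
	size_t scratch = nr_objects * sizeof(void *) +
			 longs * sizeof(unsigned long);

	printf("%zu bytes of scratch space, i.e. about one page\n", scratch);
	return 0;
}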

Larger allocation orders are possible if the slab pages themselves can have
larger orders, of course. If you set the min_order to the huge page order
then we can have similarly sized orders for the allocation of the scratch
space. However, that is not a problem since the allocations for the slab
pages themselves are also already of that same order.



Re: [RFC 02/15] slub: Add isolate() and migrate() methods

2019-03-11 Thread Christopher Lameter
On Mon, 11 Mar 2019, Roman Gushchin wrote:

> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4325,6 +4325,34 @@ int __kmem_cache_create(struct kmem_cache *s, 
> > slab_flags_t flags)
> > return err;
> >  }
> >
> > +void kmem_cache_setup_mobility(struct kmem_cache *s,
> > +  kmem_cache_isolate_func isolate,
> > +  kmem_cache_migrate_func migrate)
> > +{
>
> I wonder if it's better to adapt kmem_cache_create() to take two additional
> argument? I suspect mobility is not a dynamic option, so it can be
> set on kmem_cache creation.

One other idea that prior versions of this patchset used was to change
kmem_cache_create() so that the ctor parameter becomes an ops vector.

However, in order to reduce the size of the patchset I dropped that. It
could be easily moved back to the way it was before.

> > +   /*
> > +* Sadly serialization requirements currently mean that we have
> > +* to disable fast cmpxchg based processing.
> > +*/
>
> Can you, please, elaborate a bit more here?

cmpxchg-based processing does not lock the struct page. SMO requires
ensuring that all changes on a slab page can be stopped. The page->lock
will accomplish that. I think we could avoid dealing with actually locking
the page with some more work.


Re: [RFC 02/15] slub: Add isolate() and migrate() methods

2019-03-08 Thread Christopher Lameter
On Fri, 8 Mar 2019, Tycho Andersen wrote:

> On Fri, Mar 08, 2019 at 03:14:13PM +1100, Tobin C. Harding wrote:
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index f9d89c1b5977..754acdb292e4 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -298,6 +298,10 @@ int slab_unmergeable(struct kmem_cache *s)
> > if (!is_root_cache(s))
> > return 1;
> >
> > +   /*
> > +* s->isolate and s->migrate imply s->ctor so no need to
> > +* check them explicitly.
> > +*/
>
> Shouldn't this implication go the other way, i.e.
> s->ctor => s->isolate & s->migrate

A cache can have a constructor but its objects may not be movable (i.e.
currently dentries and inodes).
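
As a sketch of that distinction, using the setup call from this RFC
(struct thing and its constructor are made up for illustration):

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/errno.h>

struct thing {
	int state;
};

static void thing_ctor(void *obj)
{
	memset(obj, 0, sizeof(struct thing));
}

static struct kmem_cache *thing_cachep;

static int thing_cache_init(void)
{
	thing_cachep = kmem_cache_create("thing", sizeof(struct thing), 0,
					 SLAB_HWCACHE_ALIGN, thing_ctor);
	if (!thing_cachep)
		return -ENOMEM;

	/*
	 * No kmem_cache_setup_mobility(thing_cachep, isolate, migrate)
	 * call here: having a ctor does not make the objects movable.
	 */
	return 0;
}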




Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions

2019-03-07 Thread Christopher Lameter
On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:


> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.

It would be good if that understanding were enforced somehow given the
problems that we see.

> Interactions with file systems
> ==
>
> File systems expect to be able to write back data, both to reclaim pages,

Regular filesystems do that. But usually those are not used with GUP
pinning AFAICT.

> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware. This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:

> Long term GUP
> =
>
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to filesystem corruption. What happens is that when a file-backed page is
> being written back, it is first mapped read-only in all of the CPU page
> tables; the file system then assumes that nobody can write to the page, and
> that the page content is therefore stable. Unfortunately, the GUP callers
> generally do not monitor changes to the CPU pages tables; they instead
> assume that the following pattern is safe (it's not):
>
> get_user_pages()
>
> Hardware can keep a reference to those pages for a very long time,
> and write to it at any time. Because "hardware" here means "devices
> that are not a CPU", this activity occurs without any interaction
> with the kernel's file system code.
>
> for each page
> set_page_dirty
> put_page()
>
> In fact, the GUP documentation even recommends that pattern.

Isn't that pattern safe for anonymous memory and memory filesystems like
hugetlbfs etc.? That is the common use case.
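
For reference, that pattern roughly looks like the sketch below in driver
code (hypothetical helper, error handling trimmed; the exact
get_user_pages*() variant and its signature vary across kernel versions):

#include <linux/mm.h>
#include <linux/slab.h>

static void gup_pattern_sketch(unsigned long user_addr, int nr)
{
	struct page **pages;
	long i, pinned;

	pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return;

	pinned = get_user_pages_fast(user_addr, nr, FOLL_WRITE, pages);

	/* ... hardware may DMA into these pages for a long time ... */

	for (i = 0; i < pinned; i++) {
		set_page_dirty_lock(pages[i]);
		put_page(pages[i]);
	}
	kfree(pages);
}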

> Anyway, the file system assumes that the page is stable (nothing is writing
> to the page), and that is a problem: stable page content is necessary for
> many filesystem actions during writeback, such as checksum, encryption,
> RAID striping, etc. Furthermore, filesystem features like COW (copy on
> write) or snapshot also rely on being able to use a new page for as memory
> for that memory range inside the file.
>
> Corruption during write back is clearly possible here. To solve that, one
> idea is to identify pages that have active GUP, so that we can use a bounce
> page to write stable data to the filesystem. The filesystem would work
> on the bounce page, while any of the active GUP might write to the
> original page. This would avoid the stable page violation problem, but note
> that it is only part of the overall solution, because other problems
> remain.

Yes, you now have the filesystem as well as the GUP pinner claiming
authority over the contents of a single memory segment. Maybe better not
to allow that?

> Direct IO
> =
>
> Direct IO can cause corruption, if userspace does Direct-IO that writes to
> a range of virtual addresses that are mmap'd to a file.  The pages written
> to are file-backed pages that can be under write back, while the Direct IO
> is taking place.  Here, Direct IO races with a write back: it calls
> GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> The race window is pretty small, which is probably why years have gone by
> before we noticed this problem: Direct IO is generally very quick, and
> tends to finish up before the filesystem gets around to do anything with
> the page contents.  However, it's still a real problem.  The solution is
> to never let GUP return pages that are under write back, but instead,
> force GUP to take a write fault on those pages.  That way, GUP will
> properly synchronize with the active write back.  This does not change the
> required GUP behavior, it just avoids that race.

Direct IO on an mmapped, file-backed page doesn't make any sense. The direct
I/O write syscall already specifies one file handle of a filesystem that
the data is to be written to. Plus mmap has already established a second
file handle and another filesystem that is also in charge of that memory
segment.

Two filesystems trying to sync one memory segment, both believing they have
exclusive access, and we want to sort this out. Why? Don't allow this.



Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions

2019-03-07 Thread Christopher Lameter
On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:

> Dave Chinner's description of this is very clear:
>
> "The fundamental issue is that ->page_mkwrite must be called on every
> write access to a clean file backed page, not just the first one.
> How long the GUP reference lasts is irrelevant, if the page is clean
> and you need to dirty it, you must call ->page_mkwrite before it is
> marked writeable and dirtied. Every. Time."
>
> This is just one symptom of the larger design problem: filesystems do not
> actually support get_user_pages() being called on their pages, and letting
> hardware write directly to those pages--even though that patter has been
> going on since about 2005 or so.

Can we distinguish between real filesystems that actually write to a
backing device and the special filesystems (like hugetlbfs, shm and
friends) that are like anonymous memory and do not require
->page_mkwrite() in the same way as regular filesystems?

The use that I have seen in my corner of the world has been restricted to
RDMA, with get_user_pages being limited to anonymous memory and those
special filesystems. And if the RDMA memory is of such a type then the use,
past and present, is safe.

So another logical approach would be to simply not allow the use of
long-term get_user_pages() on real filesystem pages. I hope this patch
supports that?

It is customary, after all, that a file read or write operation involves
one single file(!) and that what is written either comes from or goes to
memory (anonymous or a special memory filesystem).

If you have an mmapped memory segment with a regular device backed file
then you already have one file associated with a memory segment and a
filesystem that does take care of synchronizing the contents of the memory
segment to a backing device.

If you now perform RDMA or device I/O on such a memory segment then you
will have *two* different devices interacting with that memory segment. I
think that ought not to happen and not be supported out of the box. It
will be difficult to handle and the semantics will be hard for users to
understand.

What could happen is that the filesystem could agree on request to allow
third party I/O to go to such a memory segment. But that needs to be well
defined and clearly and explicitly handled by some mechanism in user space
that has well defined semantics for data integrity for the filesystem as
well as the RDMA or device I/O.





Re: [PATCH] percpu/module resevation: change resevation size iff X86_VSMP is set

2019-03-01 Thread Christopher Lameter
On Fri, 1 Mar 2019, Barret Rhoden wrote:

> I'm not familiar with VSMP - how bad is it to use L1 cache alignment instead
> of 4K page alignment?  Maybe some structures can use the smaller alignment?
> Or maybe have VSMP require SRCU-using modules to be built-in?

It is very expensive. VSMP exchanges 4K segments via RDMA between servers
to build a large address space and runs a kernel in that large address
space. Using smaller segments can cause a lot of "cacheline" bouncing
(meaning transfers of 4K segments back and forth between servers).


Re: [PATCH] cxgb4: fix undefined behavior in mem.c

2019-03-01 Thread Christopher Lameter
On Thu, 28 Feb 2019, Shaobo He wrote:

> I think maybe the more problematic issue is that the value of a freed pointer
> is intermediate.

The pointer is not affected by freeing the data it points to. Thus it
definitely has the same value as before and is not indeterminate.

The pointer now points to an area of memory that could be in use for
different purposes, so maybe it could be taken as a dangerous situation.

But situations like that are common in code.



Re: [PATCH 1/2] percpu: km: remove SMP check

2019-02-26 Thread Christopher Lameter
On Mon, 25 Feb 2019, Dennis Zhou wrote:

> > @@ -27,7 +27,7 @@
> >   *   chunk size is not aligned.  percpu-km code will whine about it.
> >   */
> >
> > -#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
> > +#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
> >  #error "contiguous percpu allocation is incompatible with paged first 
> > chunk"
> >  #endif
> >
> > --
> > 2.16.4
> >
>
> Hi,
>
> I think keeping CONFIG_SMP makes this easier to remember dependencies
> rather than having to dig into the config. So this is a NACK from me.

But it simplifies the code and makes it easier to read.




Re: [PATCH 2/2] percpu: km: no need to consider pcpu_group_offsets[0]

2019-02-26 Thread Christopher Lameter
On Mon, 25 Feb 2019, den...@kernel.org wrote:

> > @@ -67,7 +67,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> > pcpu_set_page_chunk(nth_page(pages, i), chunk);
> >
> > chunk->data = pages;
> > -   chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
> > +   chunk->base_addr = page_address(pages);
> >
> > spin_lock_irqsave(&pcpu_lock, flags);
> > pcpu_chunk_populated(chunk, 0, nr_pages, false);
> > --
> > 2.16.4
> >
>
> While I do think you're right, creating a chunk is not a part of the
> critical path and subtracting 0 is incredibly minor overhead. So I'd
> rather keep the code as is to maintain consistency between percpu-vm.c
> and percpu-km.c.

Well, it is confusing if the expression is there but never used. It
is clearer with the patch.



Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-16 Thread Christopher Lameter
On Fri, 15 Feb 2019, Ira Weiny wrote:

> > > > for filesystems and processes.  The only problems come in for the things
> > > > which bypass the page cache like O_DIRECT and DAX.
> > >
> > > It makes a lot of sense since the filesystems play COW etc games with the
> > > pages and RDMA is very much like O_DIRECT in that the pages are modified
> > > directly under I/O. It also bypasses the page cache in case you have
> > > not noticed yet.
> >
> > It is quite different, O_DIRECT modifies the physical blocks on the
> > storage, bypassing the memory copy.
> >
>
> Really?  I thought O_DIRECT allowed the block drivers to write to/from user
> space buffers.  But the _storage_ was still under the control of the block
> drivers?

It depends on what you see as the modification target. O_DIRECT uses
memory as a target and source like RDMA. The block device is at the other
end of the handling.

> > RDMA modifies the memory copy.
> >
> > pages are necessary to do RDMA, and those pages have to be flushed to
> > disk.. So I'm not seeing how it can be disconnected from the page
> > cache?
>
> I don't disagree with this.

RDMA does direct access to memory. If that memory is an mmap of a regular
block device then we have a problem (this has not been a standard use case
to my knowledge). The semantics are simply different. RDMA expects memory
to be pinned and to always be readable and writable. The block
device/filesystem expects memory access to be controllable via the page
permissions. In particular, access to a page needs to be able to be stopped.

This is fundamentally incompatible. RDMA access to such an mmapped section
must preserve the RDMA semantics while the pinning is done and can only
provide the access control after RDMA is finished. Pages in the RDMA range
cannot be handled like normal page cache pages.

This is in particular evident in the DAX case in which we have direct pass
through even to the storage medium. And in this case write through can
replace the page cache.


Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-15 Thread Christopher Lameter
On Fri, 15 Feb 2019, Matthew Wilcox wrote:

> > Since RDMA is something similar: Can we say that a file that is used for
> > RDMA should not use the page cache?
>
> That makes no sense.  The page cache is the standard synchronisation point
> for filesystems and processes.  The only problems come in for the things
> which bypass the page cache like O_DIRECT and DAX.

It makes a lot of sense since the filesystems play COW etc games with the
pages and RDMA is very much like O_DIRECT in that the pages are modified
directly under I/O. It also bypasses the page cache in case you have
not noticed yet.

Both filesystems and RDMA acting on the page cache at
the same time leads to the mess that we are trying to solve.




Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA

2019-02-15 Thread Christopher Lameter
On Fri, 15 Feb 2019, Dave Chinner wrote:

> Which tells us filesystem people that the applications are doing
> something that _will_ cause data corruption and hence not to spend
> any time triaging data corruption reports because it's not a
> filesystem bug that caused it.
>
> See open(2):
>
>   Applications should avoid mixing O_DIRECT and normal I/O to
>   the same file, and especially to overlapping byte regions in
>   the same file.  Even when the filesystem correctly handles
>   the coherency issues in this situation, overall I/O
>   throughput is likely to be slower than using either mode
>   alone.  Likewise, applications should avoid mixing mmap(2)
>   of files with direct I/O to the same files.

Since RDMA is something similar: Can we say that a file that is used for
RDMA should not use the page cache?

And can we enforce this in the future? I.e. have some file state that says
that this file is direct/RDMA or contains long-term pinning and thus
allows only certain types of operations to ensure data consistency?

If we cannot enforce it then we may want to spit out a warning?


