Re: [PATCH v2] mm/slub: fix panic in slab_alloc_node()
On Tue, 27 Oct 2020, Laurent Dufour wrote:

> The issue is that object is not NULL while page is NULL which is odd but
> may happen if the cache flush happened after loading object but before
> loading page. Thus checking for the page pointer is required too.

Ok, then let's revert commit 6159d0f5c03e? The situation may occur elsewhere too.
Re: [PATCH v3 0/3] Actually fix freelist pointer vs redzoning
On Wed, 14 Oct 2020, Kees Cook wrote:

> Note on patch 2: Christopher NAKed it, but I actually think this is a
> reasonable thing to add -- the "too small" check is only made when built
> with CONFIG_DEBUG_VM, so it *is* actually possible for someone to trip
> over this directly, even if it would never make it into a released
> kernel. I see no reason to just leave this foot-gun in place, though, so
> we might as well just fix it too. (Which seems to be what Longman was
> similarly supporting, IIUC.)

Well then remove the duplication of checks. The NAK was there because it
seems that you were not aware of the existing checks.

> Anyway, if patch 2 stays NAKed, that's fine. It's entirely separable,
> and the other 2 can land. :)

Just deal with the old checks too and it will be fine.
Re: [PATCH] mm: Make allocator take care of memoryless numa node
On Mon, 12 Oct 2020, Xianting Tian wrote:

> In architecture like powerpc, we can have cpus without any local memory
> attached to it. In such cases the node does not have real memory.
>
> In many places of current kernel code, it doesn't judge whether the node is
> memoryless numa node before calling allocator interface.

That is intentional. SLUB relies on the page allocator to pick a node.

> This patch is to use local_memory_node(), which is guaranteed to have
> memory, in allocator interface. local_memory_node() is a noop in other
> architectures that don't support memoryless nodes.

The patch would destroy the support for memory policies in the SLUB
allocator and likely in the other ones as well.
Re: [PATCH v2 2/3] mm/slub: Fix redzoning for small allocations
On Fri, 9 Oct 2020, Kees Cook wrote:

> Store the freelist pointer out of line when object_size is smaller than
> sizeof(void *) and redzoning is enabled.
>
> (Note that no caches with such a size are known to exist in the kernel
> currently.)

Ummm... The smallest allowable cache size is sizeof(void *) as I recall.
mm/slab_common.c::kmem_cache_sanity_check() checks the sizes when caches
are created.

NAK.
Re: [RFC][PATCH 03/12] mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks
On Tue, 6 Oct 2020, Dave Hansen wrote:

> These zero checks are not great because it is not obvious what a zero
> mode *means* in the code. Replace them with a helper which makes it
> more obvious: node_reclaim_enabled().

Well, it uselessly checks bits. But whatever. It will prevent future code
removal.

Acked-by: Christoph Lameter
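The helper being acked can be sketched roughly as follows. This is a hedged userspace sketch, not the kernel's actual code: the RECLAIM_* bit values follow the documented ABI discussed in this series, but the pure-function signature (taking the mode as a parameter) is an assumption made here for illustration. It also shows Christoph's point that checking individual bits is equivalent to a nonzero test as long as all defined bits are listed.

```c
#include <assert.h>
#include <stdbool.h>

/* Documented zone_reclaim_mode bits (values per the restored ABI). */
#define RECLAIM_ZONE  (1 << 0)  /* run node reclaim on allocation miss */
#define RECLAIM_WRITE (1 << 1)  /* allow writeout during node reclaim */
#define RECLAIM_UNMAP (1 << 2)  /* allow unmapping during node reclaim */

/* Named predicate: makes explicit what "mode != 0" means at call sites. */
static bool node_reclaim_enabled(unsigned int mode)
{
    /* Any documented bit set means node reclaim is enabled. */
    return mode & (RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_UNMAP);
}
```

A caller would write `if (node_reclaim_enabled(mode))` instead of the opaque `if (mode)`, which is the readability gain the patch is after.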
Re: [RFC][PATCH 01/12] mm/vmscan: restore zone_reclaim_mode ABI
On Tue, 6 Oct 2020, Dave Hansen wrote:

> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed. That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel. The end result is that
> if someone had a script that did:

Exactly right. Sorry, I must have missed reviewing that patch.

Acked-by: Christoph Lameter
Re: [RFC][PATCH 02/12] mm/vmscan: move RECLAIM* bits to uapi header
On Tue, 6 Oct 2020, Dave Hansen wrote:

> It is currently not obvious that the RECLAIM_* bits are part of the
> uapi since they are defined in vmscan.c. Move them to a uapi header
> to make it obvious.

Acked-by: Christoph Lameter
Re: [PATCH RFC v2 0/6] Break heap spraying needed for exploiting use-after-free
On Tue, 6 Oct 2020, Matthew Wilcox wrote:

> On Tue, Oct 06, 2020 at 12:56:33AM +0200, Jann Horn wrote:
> > It seems to me like, if you want to make UAF exploitation harder at
> > the heap allocator layer, you could do somewhat more effective things
> > with a probably much smaller performance budget. Things like
> > preventing the reallocation of virtual kernel addresses with different
> > types, such that an attacker can only replace a UAF object with
> > another object of the same type. (That is not an idea I like very much
> > either, but I would like it more than this proposal.) (E.g. some
> > browsers implement things along those lines, I believe.)
>
> The slab allocator already has that functionality. We call it
> TYPESAFE_BY_RCU, but if forcing that on by default would enhance security
> by a measurable amount, it wouldn't be a terribly hard sell ...

The TYPESAFE functionality switches a lot of debugging off because it also
allows speculative accesses to the object after it was freed (required for
RCU safety, because the object may be freed in an RCU period where it is
still being accessed). I do not think you would like that.
Re: [PATCH RFC v2 0/6] Break heap spraying needed for exploiting use-after-free
On Mon, 5 Oct 2020, Kees Cook wrote:

> > TYPESAFE_BY_RCU, but if forcing that on by default would enhance security
> > by a measurable amount, it wouldn't be a terribly hard sell ...
>
> Isn't the "easy" version of this already controlled by slab_merge? (i.e.
> do not share same-sized/flagged kmem_caches between different caches)

Right.

> The large trouble are the kmalloc caches, which don't have types
> associated with them. Having implicit kmem caches based on the type
> being allocated there would need some pretty extensive plumbing, I
> think?

Actually typifying those accesses may get rid of a lot of kmalloc
allocations and could help to ease the management and control of objects.

It may be a big task though, given the ubiquity of kmalloc and the need to
create a massive amount of new slab caches. This is going to reduce the
cache hit rate significantly.
Re: [PATCH v4 00/10] Independent per-CPU data section for nVHE
On Tue, 22 Sep 2020, David Brazdil wrote:

> Introduce '.hyp.data..percpu' as part of ongoing effort to make nVHE
> hyp code self-contained and independent of the rest of the kernel.

The percpu subsystem's point is to enable the use of special hardware
instructions that can perform address calculation and a memory operation
in one uninterruptible instruction. This is in particular useful to avoid
higher overhead for memory-management-related counters, because preempt
disable/enable etc. can be avoided.

ARM cannot do that and thus has an LL/SC loop. This is a patchset for
ARM64, so it's not clear to me what kind of advantage there would be over
a simple implementation that does a regular fetch from a base address
with an offset.

> Main benefits:
> * independent nVHE per-CPU data section that can be unmapped from host,
> * more robust linking of nVHE hyp code,
> * no need for hyp-specific macros to access per-CPU variables.

Maybe simply don't use percpu variables for your ARM code? Those pointers
to data will be much more independent of the rest of the kernel and allow
a much higher degree of being self-contained.
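The "simple implementation" Christoph alludes to can be sketched in userspace. This is an illustration only, not kernel code: each CPU gets a block at `base + cpu * stride`, and a variable is reached by a plain fetch from that base plus an offset. The names and sizes are assumptions. This contrasts with x86-style percpu instructions, which fold the address calculation and the memory operation into a single instruction that cannot be split by preemption.

```c
#include <stddef.h>

/* Assumed dimensions for the sketch. */
enum { NR_CPUS_DEMO = 4, PCPU_STRIDE = 64 };

/* One contiguous region holding every CPU's per-CPU block. */
static unsigned char pcpu_area[NR_CPUS_DEMO * PCPU_STRIDE];

/* Regular base + cpu * stride + offset address calculation. */
static void *per_cpu_ptr_demo(size_t var_offset, int cpu)
{
    return &pcpu_area[(size_t)cpu * PCPU_STRIDE + var_offset];
}
```

Since the hyp code owns its own base pointer, this kind of scheme keeps the nVHE data independent of the kernel's percpu machinery, which is the trade-off being debated above.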
Re: [PATCH v2 05/10] mm, kfence: insert KFENCE hooks for SLUB
On Tue, 15 Sep 2020, Marco Elver wrote:

> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> {
> -	void *ret = slab_alloc(s, gfpflags, _RET_IP_);
> +	void *ret = slab_alloc(s, gfpflags, _RET_IP_, s->object_size);

The additional size parameter is a part of a struct kmem_cache that is
already passed to the function. Why does the parameter list need to be
expanded?
Re: [PATCH v2 04/10] mm, kfence: insert KFENCE hooks for SLAB
On Tue, 15 Sep 2020, Marco Elver wrote:

> @@ -3206,7 +3207,7 @@ static void *cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
>  }
>
>  static __always_inline void *
> -slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
> +slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_size,
>  		unsigned long caller)
>  {

The size of the object is available via a field in kmem_cache, and a
pointer to the current kmem_cache is already passed to the function. Why
is there a need to add an additional parameter?
Re: [PATCH] mm/slub: branch optimization in free slowpath
On Thu, 13 Aug 2020, wuyun...@huawei.com wrote:

> The two conditions are mutually exclusive and gcc compiler will
> optimise this into if-else-like pattern. Given that the majority
> of free_slowpath is free_frozen, let's provide some hint to the
> compilers.

Acked-by: Christoph Lameter
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
On Fri, 7 Aug 2020, Pekka Enberg wrote:

> Why do you consider this to be a fast path? This is all partial list
> accounting when we allocate/deallocate a slab, no? Just like
> ___slab_alloc() says, I assumed this to be the slow path... What am I
> missing?

I thought these were per-object counters? If you just want to count the
number of slabs then you do not need the lock at all. We already have a
counter for the number of slabs.

> No objections to alternative fixes, of course, but wrapping the
> counters under CONFIG_DEBUG seems like just hiding the actual issue...

CONFIG_DEBUG is on by default. It just compiles in the debug code and
disables it, so we can enable it with a kernel boot option. This is
because we have had numerous issues in the past with "production" kernels
that could not be recompiled with debug options. So just running the prod
kernel with another option will allow you to find hard-to-debug issues in
a full-scale production deployment with potentially proprietary modules
etc.
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
On Fri, 7 Aug 2020, Pekka Enberg wrote:

> I think we can just default to the counters. After all, if I
> understood correctly, we're talking about up to 100 ms time period
> with IRQs disabled when count_partial() is called. As this is
> triggerable from user space, that's a performance bug whatever way you
> look at it.

Well, yes, under extreme conditions, and this is only happening for sysfs
counter retrieval.

There could be other solutions to this. The solution here penalizes every
hotpath slab allocation for the sake of relatively infrequently used
counter monitoring. There is the possibility of not traversing the list
and simply estimating the value based on the number of slab pages
allocated on that node.

> Christoph, others, any objections?

Obviously ;-)
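The estimation alternative floated here can be sketched as follows. This is a hedged illustration, not SLUB code: the function name, its signature, and the half-full assumption are all mine, chosen only to show how a partial-object count could be derived from existing per-node slab counts without taking list_lock.

```c
/* Estimate free objects on a node's partial lists without walking them.
 * A partial slab is by definition neither empty nor full; assume it is
 * on average half free. Inputs correspond to counters SLUB already
 * maintains (number of partial slabs, objects per slab). */
static unsigned long estimate_partial_free(unsigned long nr_partial_slabs,
                                           unsigned int objects_per_slab)
{
    return nr_partial_slabs * objects_per_slab / 2;
}
```

The trade-off is obvious: the estimate can be off by up to a factor of two for any one node, but it needs no locking and no new hot-path accounting, which is exactly the property argued for above.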
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
On Tue, 7 Jul 2020, Pekka Enberg wrote:

> On Fri, Jul 3, 2020 at 12:38 PM xunlei wrote:
> >
> > On 2020/7/2 PM 7:59, Pekka Enberg wrote:
> > > On Thu, Jul 2, 2020 at 11:32 AM Xunlei Pang wrote:
> > >> The node list_lock in count_partial() spend long time iterating
> > >> in case of large amount of partial page lists, which can cause
> > >> thunder herd effect to the list_lock contention, e.g. it cause
> > >> business response-time jitters when accessing "/proc/slabinfo"
> > >> in our production environments.
> > >
> > > Would you have any numbers to share to quantify this jitter? I have no
> >
> > We have HSF RT(High-speed Service Framework Response-Time) monitors, the
> > RT figures fluctuated randomly, then we deployed a tool detecting "irq
> > off" and "preempt off" to dump the culprit's calltrace, capturing the
> > list_lock cost up to 100ms with irq off issued by "ss", this also caused
> > network timeouts.
>
> Thanks for the follow up. This sounds like a good enough motivation
> for this patch, but please include it in the changelog.

Well, this is access via sysfs causing a holdoff. Another way of accessing
the same information, without adding atomics and counters, would be best.

> > I also have no idea what's the standard SLUB benchmark for the
> > regression test, any specific suggestion?
>
> I don't know what people use these days. When I did benchmarking in
> the past, hackbench and netperf were known to be slab-allocation
> intensive macro-benchmarks. Christoph also had some SLUB
> micro-benchmarks, but I don't think we ever merged them into the tree.

They are still where they have been for the last decade or so: in my git
tree on kernel.org. They were also reworked a couple of times and posted
to linux-mm. There are historical posts going back over the years where
individuals have modified them and used them to create multiple other
tests.
Re: [PATCH 1/2] mm/slub: Introduce two counters for the partial objects
On Thu, 2 Jul 2020, Xunlei Pang wrote:

> This patch introduces two counters to maintain the actual number
> of partial objects dynamically instead of iterating the partial
> page lists with list_lock held.
>
> New counters of kmem_cache_node are: pfree_objects, ptotal_objects.
> The main operations are under list_lock in slow path, its performance
> impact is minimal.

If at all, then these counters need to be under CONFIG_SLUB_DEBUG.

> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -616,6 +616,8 @@ struct kmem_cache_node {
>  #ifdef CONFIG_SLUB
>  	unsigned long nr_partial;
>  	struct list_head partial;
> +	atomic_long_t pfree_objects; /* partial free objects */
> +	atomic_long_t ptotal_objects; /* partial total objects */

Please put these under CONFIG_SLUB_DEBUG. Without CONFIG_SLUB_DEBUG we
need to build with minimal memory footprint.

>  #ifdef CONFIG_SLUB_DEBUG
>  	atomic_long_t nr_slabs;
>  	atomic_long_t total_objects;
> diff --git a/mm/slub.c b/mm/slub.c

Also this looks to be quite heavy on the cache and on execution time. Note
that the list_lock could be taken frequently in the performance-sensitive
case of freeing an object that is not in the partial lists.
Re: [PATCH v1] mm:free unused pages in kmalloc_order
On Mon, 29 Jun 2020, Matthew Wilcox wrote:

> Sounds like we need a test somewhere that checks this behaviour.
>
> > In order to make such allocations possible one would have to create yet
> > another kmalloc array for high memory.
>
> Not for this case because it goes straight to kmalloc_order(). What does
> make this particular case impossible is that we can't kmap() a compound
> page. We could vmap it, but why are we bothering?

Well, yes, it will work if the slab allocator falls back to the page
allocator. Higher-order allocation through kmalloc ;-). How much fun and
uselessness. Why not call the page allocator directly and play with all
the bits you want?

Any regular small-object allocation with GFP_HIGHMEM will lead to strange
effects if the bit is not checked.
Re: [PATCH v1] mm:free unused pages in kmalloc_order
On Mon, 29 Jun 2020, Matthew Wilcox wrote:

> Slab used to disallow GFP_HIGHMEM allocations earlier than this,

It is still not allowed and not supported.
Re: [PATCH v1] mm:free unused pages in kmalloc_order
On Sat, 27 Jun 2020, Long Li wrote:

> Environment using the slub allocator, 1G memory in my ARM32.
> kmalloc(1024, GFP_HIGHUSER) can allocate memory normally,
> kmalloc(64*1024, GFP_HIGHUSER) will cause a memory leak, because
> alloc_pages returns highmem physical pages, but it cannot be directly
> converted into a virtual address and return NULL, the pages has not
> been released. Usually driver developers will not use the
> GFP_HIGHUSER flag to allocate memory in kmalloc, but I think this
> memory leak is not perfect, it is best to be fixed. This is the
> first time I have posted a patch, there may be something wrong.

Highmem is not supported by the slab allocators. Please ensure that there
is a warning generated if someone attempts to do such an allocation. We
used to check for that.

In order to make such allocations possible one would have to create yet
another kmalloc array for high memory.
Re: [PATCH v5 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Wed, 24 Jun 2020, Srikar Dronamraju wrote:

> Currently Linux kernel with CONFIG_NUMA on a system with multiple
> possible nodes, marks node 0 as online at boot. However in practice,
> there are systems which have node 0 as memoryless and cpuless.

Maybe add something to explain why you are not simply mapping the existing
memory to NUMA node 0, which is after all just a numbering scheme used by
the kernel and can be used arbitrarily?

This could be seen more as a bug in the arch code during the setup of NUMA
nodes. The two nodes are created by the firmware / bootstrap code after
all. Just do not do it?
Re: [PATCH] mm: ksize() should silently accept a NULL pointer
On Tue, 16 Jun 2020, William Kucharski wrote:

> Other mm routines such as kfree() and kzfree() silently do the right
> thing if passed a NULL pointer, so ksize() should do the same.

Ok, so the size of a nonexistent object is zero? Ignoring the freeing of a
nonexistent object makes sense. But determining its size?
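The contract under debate can be shown in a small sketch. This is a hedged userspace stand-in, not the kernel's ksize(): the real function reads allocator metadata, which is modeled here by a passed-in size, purely so the NULL-handling question Christoph raises is visible in code.

```c
#include <stddef.h>

/* Stand-in for ksize(): report the allocated (usable) size of an object.
 * The patch's position: like kfree(NULL), ksize(NULL) should be a quiet
 * no-op, returning 0. Christoph's question: does "the size of no object
 * is zero" actually make sense as an answer, or is it hiding a bug? */
static size_t ksize_demo(const void *objp, size_t allocated_size)
{
    if (!objp)
        return 0;           /* the disputed case */
    return allocated_size;  /* stand-in for the metadata lookup */
}
```

The asymmetry Christoph points at: `kfree(NULL)` requests no action, while `ksize(NULL)` asks a question whose truthful answer ("there is no object") differs from the value 0 a caller will happily compute with.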
Re: [PATCH 0/3] mm/slub: Fix slabs_node return value
On Sun, 14 Jun 2020, Muchun Song wrote:

> The slabs_node() always return zero when CONFIG_SLUB_DEBUG is disabled.
> But some codes determine whether slab is empty by checking the return
> value of slabs_node(). As you know, the result is not correct. we move
> the nr_slabs of kmem_cache_node out of the CONFIG_SLUB_DEBUG. So we can
> get the correct value returned by the slabs_node().

Note that all distribution kernels have CONFIG_SLUB_DEBUG enabled. This
does not enable runtime debugging but only compiles in the debug code,
which can then be enabled at runtime.

Users of !CONFIG_SLUB_DEBUG do not want the debug code included because
they have extreme requirements on memory use. This patch increases memory
use by enabling fields that were excluded under !CONFIG_SLUB_DEBUG
before!

There is nothing wrong with slabs_node's return value if one wants to
sacrifice debugging and consistency checks for a small code build.
Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
On Tue, 12 May 2020, Roman Gushchin wrote:

> > Add it to the metadata at the end of the object. Like the debugging
> > information or the pointer for RCU freeing.
>
> Enabling debugging metadata currently disables the cache merging.
> I doubt that it's acceptable to sacrifice the cache merging in order
> to embed the memcg pointer?

Well, then keep the merging even if you have a memcg pointer. The
disabling for debugging is only there to simplify debugging: you don't
have to deal with multiple caches actually using the same storage
structures.

> Figuring out all these details will likely take several weeks, so the whole
> thing will be delayed for one-two major releases (in the best case). Given that
> the current implementation saves ~40% of slab memory, I think there is some
> value in delivering it as it is. So I wonder if the idea of embedding the
> pointer should be considered a blocker, or it can be implemented on top of the
> proposed code (given it's not a user-facing api or something like this)?

Sorry, no idea from my end here.
Re: [PATCH v4 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Tue, 12 May 2020, Srikar Dronamraju wrote:

> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,

Again, same issue as before. If you do this then you make a global change
for all architectures. You need to put something in the early boot
sequence (in a non-architecture-specific way) that sets the first node
online by default.

You have fixed the issue in your earlier patches for the powerpc
architecture. What about the other architectures? Or did I miss something?
Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
On Mon, 4 May 2020, Roman Gushchin wrote:

> On Sat, May 02, 2020 at 11:54:09PM +, Christoph Lameter wrote:
> > On Thu, 30 Apr 2020, Roman Gushchin wrote:
> >
> > > Sorry, but what exactly do you mean?
> >
> > I think the right approach is to add a pointer to each slab object for
> > memcg support.
>
> As I understand, embedding the memcg pointer will hopefully make allocations
> cheaper in terms of CPU, but will require more memory. And you think that
> it's worth it. Is it a correct understanding?

It definitely makes the code less complex. The additional memory is
minimal. In many cases you already have some space wasted at the end of
the object that could be used for the pointer.

> Can you, please, describe a bit more detailed how it should be done
> from your point of view?

Add it to the metadata at the end of the object, like the debugging
information or the pointer for RCU freeing.

> I mean where to store the pointer, should it be SLAB/SLUB-specific code
> or a generic code, what do to with kmallocs alignments, should we
> merge slabs which had a different size before and now have the same
> because of the memcg pointer and aligment, etc.

Both SLAB and SLUB have the same capabilities there. Slabs that had
different sizes before will have different sizes now as well, so the
merging does not change.

> I'm happy to follow your advice and perform some tests to get an idea of
> how significant the memory overhead is and how big are CPU savings.
> I guess with these numbers it will be easy to make a decision.

Sure. The main issue are the power-of-two kmalloc caches and how to add
the pointer to these caches in order not to waste memory. SLAB has done
this in the past by creating additional structures in a page frame.
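The "metadata at the end of the object" layout can be sketched as an offset computation. This is a hedged illustration of the proposal, not SLUB's actual layout code: the function name is mine, and a real implementation would also have to account for red zones, the freelist pointer, and debug metadata that compete for the same area.

```c
#include <stddef.h>

/* Where a per-object memcg pointer could live: directly after the
 * object proper, rounded up so the pointer store is aligned. This is
 * the same trick SLUB uses to append debug info or the RCU-free
 * pointer to each object. */
static size_t memcg_ptr_offset(size_t object_size)
{
    return (object_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
}
```

For objects whose size is not already pointer-aligned, the rounding consumes padding that was wasted anyway, which is the "space at the end of the object" Christoph refers to; the painful case is the power-of-two kmalloc sizes, where appending a pointer pushes the object over a size-class boundary.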
Re: [PATCH] slub: limit count of partial slabs scanned to gather statistics
On Mon, 4 May 2020, Andrew Morton wrote:

> But I guess it's better than nothing at all, unless there are
> alternative ideas?

It is highly unusual to have such large partial lists. In a typical case
allocations would reduce the size of the lists. 1000s? That is scary. Are
these inodes or dentries by chance?

The defrag work that I had been trying to do for a long time would solve
that issue, but then objects would need to be made movable.
Re: [PATCH] mm: slub: add panic_on_error to the debug facilities
On Sun, 3 May 2020, Rafael Aquini wrote:

> On Sat, May 02, 2020 at 11:16:30PM +0000, Christopher Lameter wrote:
> > On Fri, 1 May 2020, Rafael Aquini wrote:
> >
> > > Sometimes it is desirable to override SLUB's debug facilities
> > > default behavior upon stumbling on a cache or object error
> > > and just stop the execution in order to grab a coredump, at
> > > the error-spotting time, instead of trying to fix the issue
> > > and report in an attempt to keep the system rolling.
> >
> > The stopping of execution on an error is the default behavior. Usually
> > you get some OOPS somewhere when data is corrupted and that causes a core
> > dump.
> >
> > SLUB can fix the issue and continue if enabled by specifying special
> > options on boot. That is *not* the default.
>
> It is the default behavior when slub_debug is turned on, which is what
> this patch is trying to override, when needed. We've been seeing the
> need for such feature as, most often than not, by letting the system
> running to crash somewhere else after hitting occurrences reported by
> slub_debug ends up clobbering clues to the original issue.

Ok. The backtrace is not sufficient in that case?
Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
On Thu, 30 Apr 2020, Roman Gushchin wrote:

> Sorry, but what exactly do you mean?

I think the right approach is to add a pointer to each slab object for
memcg support.
Re: [PATCH] mm: slub: add panic_on_error to the debug facilities
On Fri, 1 May 2020, Rafael Aquini wrote:

> Sometimes it is desirable to override SLUB's debug facilities
> default behavior upon stumbling on a cache or object error
> and just stop the execution in order to grab a coredump, at
> the error-spotting time, instead of trying to fix the issue
> and report in an attempt to keep the system rolling.

The stopping of execution on an error is the default behavior. Usually
you get some OOPS somewhere when data is corrupted, and that causes a
core dump.

SLUB can fix the issue and continue if enabled by specifying special
options on boot. That is *not* the default.
Re: [PATCH v3 3/3] mm/page_alloc: Keep memoryless cpuless node 0 offline
On Fri, 1 May 2020, Srikar Dronamraju wrote:

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -116,8 +116,10 @@ EXPORT_SYMBOL(latent_entropy);
>  */
>  nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
>  	[N_POSSIBLE] = NODE_MASK_ALL,
> +#ifdef CONFIG_NUMA
> +	[N_ONLINE] = NODE_MASK_NONE,

Hmmm, I would have expected that you would instead add something early in
boot that marks the current node (whatever it is) online?
Re: [PATCH v3 1/3] powerpc/numa: Set numa_node for all possible cpus
On Fri, 1 May 2020, Srikar Dronamraju wrote:

> -	for_each_present_cpu(cpu)
> -		numa_setup_cpu(cpu);
> +	for_each_possible_cpu(cpu) {
> +		/*
> +		 * Powerpc with CONFIG_NUMA always used to have a node 0,
> +		 * even if it was memoryless or cpuless. For all cpus that
> +		 * are possible but not present, cpu_to_node() would point
> +		 * to node 0. To remove a cpuless, memoryless dummy node,
> +		 * powerpc need to make sure all possible but not present
> +		 * cpu_to_node are set to a proper node.
> +		 */
> +		if (cpu_present(cpu))
> +			numa_setup_cpu(cpu);
> +		else
> +			set_cpu_numa_node(cpu, first_online_node);
> +	}
>  }

Can this be folded into numa_setup_cpu()? This looks more like
numa_setup_cpu() needs to change.
Re: [PATCH v3 04/19] mm: slub: implement SLUB version of obj_to_index()
On Mon, 27 Apr 2020, Roman Gushchin wrote:

> > Why do you need this? Just slap a pointer to the cgroup as additional
> > metadata onto the slab object. Is that not much simpler, safer and faster?
>
> So, the problem is that not all slab objects are accounted, and sometimes
> we don't know in advance if they are accounted or not (with the current
> semantics of __GFP_ACCOUNT and SLAB_ACCOUNT flags). So we either have to
> increase the size of ALL slab objects, or create a pair of slab caches
> for each size.
>
> The first option is not that cheap in terms of the memory overhead.
> Especially for those who disable cgroups using a boot-time option.

If the cgroups are disabled at boot time then you can switch back to the
compact version. Otherwise just add a pointer to each object. It will make
it consistent and there is not much memory wastage.

The problem comes about with the power-of-two caches in the kmalloc
array. If one keeps the "natural alignment" instead of going for the
normal alignment of slab caches, then the alignment will cause a lot of
memory wastage, and thus the scheme of off-slab metadata is likely going
to be unavoidable.

But I think we are just stacking one bad idea onto another here, making
things much more complex than they could be. Well, at least this justifies
all our jobs (not mine, I am out of work... hehehe).
Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
On Mon, 21 Oct 2019, Roman Gushchin wrote:

> So far I haven't noticed any regression on the set of workloads where I
> did test the patchset, but if you know any benchmark or realistic test
> which can be affected by this check, I'll be happy to try.
>
> Also, less-than-word-sized operations can be slower on some platforms.

The main issue in the past was the cache footprint. Memory is slow.
Processors are fast.
Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> Currently s8 type is used for per-cpu caching of per-node statistics.
> It works fine because the overfill threshold can't exceed 125.
>
> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

Actually this is inconsistent, since the other counters are all used to
account for pages. Some of the functions in use for the page counters will
no longer make sense: inc/dec_node_state() etc.
Re: [PATCH 02/16] mm: vmstat: use s32 for vm_node_stat_diff in struct per_cpu_nodestat
On Thu, 17 Oct 2019, Roman Gushchin wrote:

> But if some counters are in bytes (and the next commit in the series
> will convert slab counters to bytes), it's not gonna work:
> value in bytes can easily exceed s8 without exceeding the threshold
> converted to bytes. So to avoid overfilling per-cpu caches and breaking
> vmstats correctness, let's use s32 instead.

This quadruples the cache footprint of the counters and likely has some
influence on performance.
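The arithmetic behind both sides of this exchange can be checked in a few lines. This sketch only restates the numbers from the thread: the page-based overfill threshold (at most 125) fits an s8 per-CPU diff, while a single byte-sized update such as 4096 overflows it immediately; the cost of the fix is that an s32 diff is four times the size of an s8 one.

```c
#include <limits.h>
#include <stddef.h>

/* Can a per-CPU stat delta be cached in an s8 field? */
static int fits_in_s8(long v)
{
    return v >= SCHAR_MIN && v <= SCHAR_MAX;
}

/* Factor by which widening the diff field grows its cache footprint. */
static size_t widen_factor(void)
{
    return sizeof(int) / sizeof(signed char); /* s32 vs s8 */
}
```

So the series is correct that byte counters cannot use s8, and Christoph is correct that the s32 fields cost 4x the cacheline space across every per-CPU per-node stat structure.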
Re: [PATCH 25/34] mm: Use CONFIG_PREEMPTION
Acked-by: Christoph Lameter
Re: [PATCH] mm, page_alloc: drop pointless static qualifier in build_zonelists()
On Sat, 28 Sep 2019, Kaitao Cheng wrote:

> There is no need to make the 'node_order' variable static
> since new value always be assigned before use it.

In the past MAX_NUMNODES could become quite large, like 512 or 1k. Large
array allocations on the stack are problematic. Maybe that is no longer
the case?
Re: [PATCH v5 0/7] mm, slab: Make kmalloc_info[] contain all types of names
On Mon, 16 Sep 2019, Pengfei Li wrote:

> The name of KMALLOC_NORMAL is contained in kmalloc_info[].name,
> but the names of KMALLOC_RECLAIM and KMALLOC_DMA are dynamically
> generated by kmalloc_cache_name().
>
> Patch1 predefines the names of all types of kmalloc to save
> the time spent dynamically generating names.
>
> These changes make sense, and the time spent by new_kmalloc_cache()
> has been reduced by approximately 36.3%.

This is time spent during boot and does not affect later system
performance.

> Time spent by new_kmalloc_cache()
>                     (CPU cycles)
> 5.3-rc7             66264
> 5.3-rc7+patch_1-3   42188

Ok. ~24k cycles during boot saved. So we save something like 10
microseconds during bootup?

The current approach was created with a view to future setups allowing a
dynamic configuration of kmalloc caches based on need. I.e. ZONE_DMA may
not be needed once the floppy driver no longer makes use of it.
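The patch's approach can be sketched as a build-time name table. This is a hedged illustration, not the actual kmalloc_info[] from the series: only two sizes are shown, and the table shape and names here are assumptions modeled on the existing "kmalloc-N" / "kmalloc-rcl-N" / "dma-kmalloc-N" naming convention.

```c
#include <string.h>

/* Predefine all three name variants per size at build time, instead of
 * generating the RECLAIM/DMA names with an allocation during boot. */
enum { TYPE_NORMAL, TYPE_RECLAIM, TYPE_DMA, NR_TYPES_DEMO };

static const char *const kmalloc_names_demo[][NR_TYPES_DEMO] = {
    { "kmalloc-8",  "kmalloc-rcl-8",  "dma-kmalloc-8"  },
    { "kmalloc-16", "kmalloc-rcl-16", "dma-kmalloc-16" },
};

static const char *kmalloc_name_demo(int idx, int type)
{
    return kmalloc_names_demo[idx][type];
}
```

The trade-off matches Christoph's comment: a static table costs a little data-section space in exchange for a one-time, microsecond-scale boot saving, while losing some flexibility if the set of cache types ever becomes dynamically configurable.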
Re: [PATCH v5 7/7] mm, slab_common: Modify kmalloc_caches[type][idx] to kmalloc_caches[idx][type]
On Mon, 16 Sep 2019, Pengfei Li wrote:

> KMALLOC_NORMAL is the most frequently accessed, and kmalloc_caches[]
> is initialized by different types of the same size.
>
> So modifying kmalloc_caches[type][idx] to kmalloc_caches[idx][type]
> will benefit performance.

Why would that increase performance? Using your scheme means that the
KMALLOC_NORMAL pointers are spread over more cachelines. Since
KMALLOC_NORMAL is most frequently accessed, this would cause a
performance regression.
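The cacheline argument is a statement about array strides, and can be checked directly. This sketch uses illustrative dimensions (3 types, 13 size indexes) rather than the kernel's actual constants: with `caches[type][idx]`, all KMALLOC_NORMAL entries form one contiguous row with a stride of one pointer; with `caches[idx][type]`, consecutive NORMAL entries are NR_TYPES pointers apart, interleaved with the rarely used RECLAIM and DMA entries.

```c
#include <stddef.h>

enum { NTYPES = 3, NSIZES = 13 };

static void *by_type[NTYPES][NSIZES]; /* current: caches[type][idx]   */
static void *by_size[NSIZES][NTYPES]; /* proposed: caches[idx][type]  */

/* Byte distance between two array slots. */
static size_t stride(void *a, void *b)
{
    return (size_t)((char *)b - (char *)a);
}
```

With 8-byte pointers and 64-byte cachelines, the current layout packs 8 NORMAL entries per line, while the proposed layout fits only 2 to 3, which is the spreading Christoph objects to.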
Re: [PATCH v2 4/4] mm: lock slub page when listing objects
On Wed, 11 Sep 2019, Yu Zhao wrote:

> Though I have no idea what the side effect of such race would be,
> apparently we want to prevent the free list from being changed
> while debugging the objects.

process_slab() is called under the list_lock, which prevents any
allocation from the free list in the slab page. This means that only new
objects can be added to the freelist, which occurs by updating the
freelist pointer in the slab page with a pointer to the newly freed
object, which in turn contains the old freelist pointer.

It is therefore possible to safely traverse the objects on the freelist
after the pointer has been retrieved.

NAK.
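The freelist structure described above can be modeled in a few lines. This is an illustrative userspace model, not SLUB code: each free object's first word points to the next free object, the page holds the head, and prepending a newly freed object rewrites only the head, so a traversal that starts from a previously fetched head remains valid.

```c
#include <stddef.h>

/* A free object stores the pointer to the next free object. */
struct free_obj { struct free_obj *next; };

static struct free_obj pool_demo[3];

/* Build a 3-object freelist: pool[0] -> pool[1] -> pool[2] -> NULL. */
static struct free_obj *build_freelist_demo(void)
{
    pool_demo[0].next = &pool_demo[1];
    pool_demo[1].next = &pool_demo[2];
    pool_demo[2].next = NULL;
    return &pool_demo[0];
}

/* Traverse from a fetched head; concurrent prepends rewrite only the
 * head, so everything reachable from this head stays linked. */
static int count_free_demo(const struct free_obj *head)
{
    int n = 0;
    for (; head; head = head->next)
        n++;
    return n;
}
```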
Re: [PATCH 0/5] mm, slab: Make kmalloc_info[] contain all types of names
On Wed, 4 Sep 2019, Pengfei Li wrote:

> There are three types of kmalloc, KMALLOC_NORMAL, KMALLOC_RECLAIM
> and KMALLOC_DMA.

I only got a few patches of this set. Can I see the complete patchset
somewhere?
Re: [PATCH v2 2/2] mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)
On Sat, 31 Aug 2019, Matthew Wilcox wrote:

> > The current behavior without special alignment for these caches has been
> > in the wild for over a decade. And this is now coming up?
>
> In the wild ... and rarely enabled. When it is enabled, it may or may
> not be noticed as data corruption, or tripping other debugging asserts.
> Users then turn off the rare debugging option.

It's enabled in all full debug sessions as far as I know. Fedora, for
example, has been running this for ages to find breakage in device
drivers etc.

> > If there is an exceptional alignment requirement then that needs to be
> > communicated to the allocator. A special flag or create a special
> > kmem_cache or something.
>
> The only way I'd agree to that is if we deliberately misalign every
> allocation that doesn't have this special flag set. Because right now,
> breakage happens everywhere when these debug options are enabled, and
> the very people who need to be helped are being hurt by the debugging.

Testing customarily occurs by adding "slub_debug" to the kernel command
line (or enabling debug kernel options), and since my information is that
this is done frequently (and has been for over a decade now), I am having
a hard time believing the stories of great breakage here. These drivers
were never tested with debugging on before? Never ran with a debug kernel?
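The "natural alignment" property in dispute can be stated as a one-line predicate. This is a userspace illustration only: a power-of-two-sized allocation is naturally aligned when its address is a multiple of its size. kmalloc() historically guarantees only the architecture's minimum alignment, and debug metadata can shift object placement, which is what breaks callers relying on natural alignment.

```c
#include <stdint.h>
#include <stddef.h>

/* True if pointer p is aligned to pow2_size (which must be a power of
 * two); this is the guarantee the patch wants for power-of-two
 * kmalloc() allocations. */
static int naturally_aligned(const void *p, size_t pow2_size)
{
    return ((uintptr_t)p & (pow2_size - 1)) == 0;
}
```

A driver doing DMA to a 4096-byte kmalloc buffer and assuming it never crosses a page boundary is implicitly relying on `naturally_aligned(p, 4096)`, which is exactly the assumption that red zones and other debug metadata can silently violate.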
Re: [PATCH v2 2/2] mm, slab: Show last shrink time in us when slab/shrink is read
On Wed, 17 Jul 2019, Waiman Long wrote: > The show method of /sys/kernel/slab//shrink sysfs file currently > returns nothing. This is now modified to show the time of the last > cache shrink operation in us. What is this useful for? Any use cases? > CONFIG_SLUB_DEBUG depends on CONFIG_SYSFS. So the new shrink_us field > is always available to the shrink methods. Aside from minimal systems without CONFIG_SYSFS... Does this build without CONFIG_SYSFS?
Re: [PATCH v2 1/2] mm, slab: Extend slab/shrink to shrink all memcg caches
On Wed, 17 Jul 2019, Waiman Long wrote: > Currently, a value of '1" is written to /sys/kernel/slab//shrink > file to shrink the slab by flushing out all the per-cpu slabs and free > slabs in partial lists. This can be useful to squeeze out a bit more memory > under extreme condition as well as making the active object counts in > /proc/slabinfo more accurate. Acked-by: Christoph Lameter > # grep task_struct /proc/slabinfo > task_struct53137 53192 4288 614 : tunables00 > 0 : slabdata872872 0 > # grep "^S[lRU]" /proc/meminfo > Slab:3936832 kB > SReclaimable: 399104 kB > SUnreclaim: 3537728 kB > > After shrinking slabs: > > # grep "^S[lRU]" /proc/meminfo > Slab:1356288 kB > SReclaimable: 263296 kB > SUnreclaim: 1092992 kB Well another indicator that it may not be a good decision to replicate the whole set of slabs for each memcg. Migrating the memcg ownership into the objects may allow the use of the same slab cache. In particular together with the slab migration patches this may be a viable way to reduce memory consumption.
Re: [PATCH v5 4/5] mm/slab: Refactor common ksize KASAN logic into slab_common.c
On Mon, 8 Jul 2019, Marco Elver wrote: > This refactors common code of ksize() between the various allocators > into slab_common.c: __ksize() is the allocator-specific implementation > without instrumentation, whereas ksize() includes the required KASAN > logic. Acked-by: Christoph Lameter
Re: [PATCH] mm/slab: One function call less in verify_redzone_free()
On Fri, 5 Jul 2019, Markus Elfring wrote: > Avoid an extra function call by using a ternary operator instead of > a conditional statement for a string literal selection. Well. I thought the compiler does that on its own? And the ternary operator makes the code difficult to read.
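For reference, the two forms under discussion look like this (hypothetical names, not the actual mm/slab.c code); neither involves a function call, and any current compiler emits the same code for both:

```c
#include <string.h>

/* Form 1: a conditional statement selecting the string literal. */
static const char *redzone_loc_branch(int after_obj)
{
	if (after_obj)
		return "end";
	return "start";
}

/* Form 2: the ternary operator — same semantics, denser notation. */
static const char *redzone_loc_ternary(int after_obj)
{
	return after_obj ? "end" : "start";
}
```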
Re: [PATCH] mm, slab: Extend slab/shrink to shrink all the memcg caches
On Wed, 3 Jul 2019, Waiman Long wrote: > On 7/3/19 2:56 AM, Michal Hocko wrote: > > On Tue 02-07-19 14:37:30, Waiman Long wrote: > >> Currently, a value of '1" is written to /sys/kernel/slab//shrink > >> file to shrink the slab by flushing all the per-cpu slabs and free > >> slabs in partial lists. This applies only to the root caches, though. > >> > >> Extends this capability by shrinking all the child memcg caches and > >> the root cache when a value of '2' is written to the shrink sysfs file. > > Why do we need a new value for this functionality? I would tend to think > > that skipping memcg caches is a bug/incomplete implementation. Or is it > > a deliberate decision to cover root caches only? > > It is just that I don't want to change the existing behavior of the > current code. It will definitely take longer to shrink both the root > cache and the memcg caches. If we all agree that the only sensible > operation is to shrink root cache and the memcg caches together. I am > fine just adding memcg shrink without changing the sysfs interface > definition and be done with it. I think it's best and consistent behavior to shrink all memcg caches with the root cache. This looks like an oversight and thus a bugfix.
Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs
On Thu, 27 Jun 2019, Roman Gushchin wrote: > so that objects belonging to different memory cgroups can share the same page > and kmem_caches. > > It's a fairly big change though. Could this be done at another level? Put a cgroup pointer into the corresponding structures and then go back to just a single kmem_cache for the system as a whole? You can still account them per cgroup and there will be no cleanup problem anymore. You could scan through a slab cache to remove the objects of a certain cgroup and then the fragmentation problem that cgroups create here will be handled by the slab allocators in the traditional way. The duplication of the kmem_cache was not designed into the allocators but bolted on later.
Re: [PATCH v1 4/4] mm: introduce MADV_PAGEOUT
On Mon, 3 Jun 2019, Minchan Kim wrote: > @@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma, > return 0; > } > > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr, > + unsigned long end, struct mm_walk *walk) > +{ > + pte_t *orig_pte, *pte, ptent; > + spinlock_t *ptl; > + LIST_HEAD(page_list); > + struct page *page; > + int isolated = 0; > + struct vm_area_struct *vma = walk->vma; > + unsigned long next; > + > + if (fatal_signal_pending(current)) > + return -EINTR; > + > + next = pmd_addr_end(addr, end); > + if (pmd_trans_huge(*pmd)) { > + ptl = pmd_trans_huge_lock(pmd, vma); > + if (!ptl) > + return 0; > + > + if (is_huge_zero_pmd(*pmd)) > + goto huge_unlock; > + > + page = pmd_page(*pmd); > + if (page_mapcount(page) > 1) > + goto huge_unlock; > + > + if (next - addr != HPAGE_PMD_SIZE) { > + int err; > + > + get_page(page); > + spin_unlock(ptl); > + lock_page(page); > + err = split_huge_page(page); > + unlock_page(page); > + put_page(page); > + if (!err) > + goto regular_page; > + return 0; > + } I have seen this before multiple times. Is there a way to avoid replicating the whole shebang?
Re: [PATCH] slab: remove /proc/slab_allocators
On Thu, 16 May 2019, Qian Cai wrote: > It turned out that DEBUG_SLAB_LEAK is still broken even after recent > recue efforts that when there is a large number of objects like > kmemleak_object which is normal on a debug kernel, Acked-by: Christoph Lameter
Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management
On Tue, 14 May 2019, Roman Gushchin wrote: > To make this possible we need to introduce a new percpu refcounter > for non-root kmem_caches. The counter is initialized to the percpu > mode, and is switched to atomic mode after deactivation, so we never > shutdown an active cache. The counter is bumped for every charged page > and also for every running allocation. So the kmem_cache can't > be released unless all allocations complete. Increase refcounts during each allocation? Looks to be quite heavy processing.
Re: [PATCH v3 4/7] mm: unify SLAB and SLUB page accounting
On Wed, 8 May 2019, Roman Gushchin wrote: > Currently the page accounting code is duplicated in SLAB and SLUB > internals. Let's move it into new (un)charge_slab_page helpers > in the slab_common.c file. These helpers will be responsible > for statistics (global and memcg-aware) and memcg charging. > So they are replacing direct memcg_(un)charge_slab() calls. Looks good. Acked-by: Christoph Lameter
Re: DISCONTIGMEM is deprecated
On Mon, 29 Apr 2019, Christoph Hellwig wrote: > So maybe it it time to mark SN2 broken and see if anyone screams? > > Without SN2 the whole machvec mess could basically go away - the > only real difference between the remaining machvecs is which iommu > if any we set up. SPARSEMEM with VMEMMAP was developed to address these issues and allow one mapping scheme across the different platforms. You do not need DISCONTIGMEM support for SN2. And as far as I know (from a decade ago ok) the distributions were using VMEMMAP instead.
Re: [PATCH] mm: Allow userland to request that the kernel clear memory on release
On Wed, 24 Apr 2019, Matthew Garrett wrote: > Applications that hold secrets and wish to avoid them leaking can use > mlock() to prevent the page from being pushed out to swap and > MADV_DONTDUMP to prevent it from being included in core dumps. Applications > can also use atexit() handlers to overwrite secrets on application exit. > However, if an attacker can reboot the system into another OS, they can > dump the contents of RAM and extract secrets. We can avoid this by setting Well nothing in this patchset deals with that issue. That hole still exists afterwards. So is it worth having this functionality? > Unfortunately, if an application exits uncleanly, its secrets may still be > present in RAM. This can't be easily fixed in userland (eg, if the OOM > killer decides to kill a process holding secrets, we're not going to be able > to avoid that), so this patch adds a new flag to madvise() to allow userland > to request that the kernel clear the covered pages whenever the page > reference count hits zero. Since vm_flags is already full on 32-bit, it > will only work on 64-bit systems. But then the pages are cleared anyways when reallocated to another process. This just clears it sooner before reuse. So it will reduce the time that a page contains the secret sauce in case the program is aborted and cannot run its exit handling. Is that really worth extending system calls and adding kernel handling for this? Maybe the answer is yes given our current concern about anything related to "security".
Re: DISCONTIGMEM is deprecated
On Fri, 19 Apr 2019, Matthew Wilcox wrote: > ia64 (looks complicated ...) Well as far as I can tell it was not even used 12 or so years ago on Itanium when I worked on that stuff.
Re: [PATCH 4/5] mm: rework non-root kmem_cache lifecycle management
On Wed, 17 Apr 2019, Roman Gushchin wrote: > static __always_inline int memcg_charge_slab(struct page *page, >gfp_t gfp, int order, >struct kmem_cache *s) > { > - if (is_root_cache(s)) > + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ? > + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE; > + struct mem_cgroup *memcg; > + struct lruvec *lruvec; > + int ret; > + > + if (is_root_cache(s)) { > + mod_node_page_state(page_pgdat(page), idx, 1 << order); Hmmm... This is functionality that is not memcg specific being moved into a memcg function??? Maybe rename the function to indicate that it is not memcg specific and add the proper #ifdefs? > static __always_inline void memcg_uncharge_slab(struct page *page, int order, > struct kmem_cache *s) > { > - memcg_kmem_uncharge(page, order); > + int idx = (s->flags & SLAB_RECLAIM_ACCOUNT) ? > + NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE; > + struct mem_cgroup *memcg; > + struct lruvec *lruvec; > + > + if (is_root_cache(s)) { > + mod_node_page_state(page_pgdat(page), idx, -(1 << order)); > + return; > + } And again.
Re: [PATCH 4/5] mm: rework non-root kmem_cache lifecycle management
On Wed, 17 Apr 2019, Roman Gushchin wrote: > Let's make every page to hold a reference to the kmem_cache (we > already have a stable pointer), and make kmem_caches to hold a single > reference to the memory cgroup. Ok you are freeing one word in the page struct that can be used for other purposes now?
Re: [External] Re: Basics : Memory Configuration
Please respond to my comments in the way that everyone else communicates here. I cannot distinguish what you said from what I said before.
Re: Basics : Memory Configuration
On Tue, 9 Apr 2019, Pankaj Suryawanshi wrote: > I am confuse about memory configuration and I have below questions Hmmm... Yes some of the terminology that you use is a bit confusing. > 1. if 32-bit os maximum virtual address is 4GB, When i have 4 gb of ram > for 32-bit os, What about the virtual memory size ? is it required > virtual memory(disk space) or we can directly use physical memory ? The virtual memory size is the maximum virtual size of a single process. Multiple processes can run and each can use different amounts of physical memory. So both are actually independent. The size of the virtual memory space per process is configurable on x86 32 bit (2G, 3G, 4G). Thus the possible virtual process size may vary depending on the hardware architecture and the configuration of the kernel. > 2. In 32-bit os 12 bits are offset because page size=4k i.e 2^12 and > 2^20 for page addresses >What about 64-bit os, What is offset size ? What is page size ? How it > calculated. 12 bits are passed through? That's what you mean? The remainder of the bits are used to look up the physical frame number (PFN) in the page tables. 64 bit is the same. However, the number of bits used for lookups in the page tables is much higher. > 3. What is PAE? If enabled how to decide size of PAE, what is maximum > and minimum size of extended memory. PAE increases the physical memory size that can be addressed through a page table lookup. The number of bits that can be specified in the PFN is increased and thus more than 4GB of physical memory can be used by the operating system. However, the virtual memory size stays the same and an individual process still cannot use more memory.
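The 12-bit offset / PFN split can be sketched in plain C. This assumes the common 4K page size; the helper names are made up for illustration, not kernel API.

```c
#include <stdint.h>

#define PAGE_SHIFT	12			/* 4K pages: 2^12 */
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* The low 12 bits are the offset within the page and pass through
 * address translation unchanged. */
static uint64_t page_offset(uint64_t vaddr)
{
	return vaddr & (PAGE_SIZE - 1);
}

/* The remaining upper bits are what the page tables translate into a
 * physical frame number; on 64-bit there are simply more of them. */
static uint64_t virt_page_number(uint64_t vaddr)
{
	return vaddr >> PAGE_SHIFT;
}
```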
Re: [PATCH] slab: fix a crash by reading /proc/slab_allocators
On Sun, 7 Apr 2019, Linus Torvalds wrote: > On Sat, Apr 6, 2019 at 12:59 PM Qian Cai wrote: > > > > The commit 510ded33e075 ("slab: implement slab_root_caches list") > > changes the name of the list node within "struct kmem_cache" from > > "list" to "root_caches_node", but leaks_show() still use the "list" > > which causes a crash when reading /proc/slab_allocators. > > The patch does seem to be correct, and I have applied it. > > However, it does strike me that apparently this wasn't caught for two > years. Which makes me wonder whether we should (once again) discuss > just removing SLAB entirely, or at least removing the > /proc/slab_allocators file. Apparently it has never been used in the > last two years. At some point a "this can't have worked if anybody > ever tried to use it" situation means that the code should likely be > excised. This is only occurring with specially built kernels so that memory leaks can be investigated. The same is done with other tools (kasan and friends) today, I guess, and the SLUB debugging tools are much more user friendly. So this means that some esoteric debugging feature of SLAB was broken.
Re: [PATCH] percpu/module resevation: change resevation size iff X86_VSMP is set
On Wed, 13 Mar 2019, Barret Rhoden wrote: > > It is very expensive. VMSP exchanges 4K segments via RDMA between servers > > to build a large address space and run a kernel in the large address > > space. Using smaller segments can cause a lot of > > "cacheline" bouncing (meaning transfers of 4K segments back and forth > > between servers). > > > > Given that these are large machines, would it be OK to statically reserve 64K > on them for modules' percpu data? Likely. > The bug that led me to here was from someone running on a non-VSMP machine but > had that config set. Perhaps we make it more clear in the Kconfig option to > not set it on other machines. That might make it less likely anyone on a > non-VSMP machine pays the 64K overhead. Right. > Are there any other alternatives? Not using static SRCU in any code that > could be built as a module seems a little harsh. Sorry this ended up in my spam folder somehow. Just fished it out.
Re: [RFC 2/2] mm, slub: add missing kmem_cache_debug() checks
On Thu, 4 Apr 2019, Vlastimil Babka wrote: > Some debugging checks in SLUB are not hidden behind kmem_cache_debug() check. > Add the check so that those places can also benefit from reduced overhead > thanks to the the static key added by the previous patch. Hmmm... I would not expect too much of a benefit from these changes since most of the stuff is actually not on the hot path.
Re: [RFC 0/2] add static key for slub_debug
On Thu, 4 Apr 2019, Vlastimil Babka wrote: > I looked a bit at SLUB debugging capabilities and first thing I noticed is > there's no static key guarding the runtime enablement as is common for similar > debugging functionalities, so here's a RFC to add it. Can be further improved > if there's interest. Well the runtime enablement is per slab cache and static keys are global. Adding a static key adds code to the critical paths. Since the flags for a kmem_cache have to be inspected anyways there may not be that much of a benefit. > It's true that in the alloc fast path the debugging check overhead is AFAICS > amortized by the per-cpu cache, i.e. when the allocation is from there, no > debugging functionality is performed. IMHO that's kinda a weakness, especially > for SLAB_STORE_USER, so I might also look at doing something about it, and > then > the static key might be more critical for overhead reduction. Moving debugging out of the per cpu fastpath allows that fastpath to be much simpler and faster. SLAB_STORE_USER is mostly used only for debugging in which case we are less concerned with performance. If you want to use SLAB_STORE_USER in the fastpath then we have to do some major redesign there. > In the freeing fast path I quickly checked the stats and it seems that in > do_slab_free(), the "if (likely(page == c->page))" is not as likely as it > declares, as in the majority of cases, freeing doesn't happen on the object > that belongs to the page currently cached. So the advantage of a static key in > slow path __slab_free() should be more useful immediately. Right. The freeing logic is actually a weakness in terms of performance for SLUB due to the need to operate on a per page queue immediately.
Re: [RFC PATCH v2 14/14] dcache: Implement object migration
On Wed, 3 Apr 2019, Al Viro wrote: > > This is an RFC and we want to know how to do this right. > > If by "how to do it right" you mean "expedit kicking out something with > non-zero refcount" - there's no way to do that. Nothing even remotely > sane. Sure we know that. > If you mean "kick out everything in this page with zero refcount" - that > can be done (see further in the thread). Ok that would already be progress. If we can use this to liberate some slab pages with just a few dentry objects then it may be worthwhile. > Look, dentries and inodes are really, really not relocatable. If they > can be evicted by memory pressure - sure, we can do that for a given > set (e.g. "everything in that page"). But that's it - if memory > pressure would _not_ get rid of that one, there's nothing to be done. > Again, all VM can do is to simulate shrinker hitting hard on given > bunch (rather than buggering the entire cache). If filesystem (or > something in VFS) says "it's busy", it bloody well _is_ busy and > won't be going away until it ceases to be such. Right. That's why the patch attempted to check for these things to avoid touching such objects.
Re: [RFC PATCH v2 14/14] dcache: Implement object migration
On Wed, 3 Apr 2019, Al Viro wrote: > Let's do d_invalidate() on random dentries and hope they go away. > With convoluted and brittle logics for deciding which ones to > spare, which is actually wrong. This will pick mountpoints > and tear them out, to start with. > > NAKed-by: Al Viro > > And this is a NAK for the entire approach; if it has a positive refcount, > LEAVE IT ALONE. Period. Don't play this kind of games, they are wrong. > d_invalidate() is not something that can be done to an arbitrary dentry. Well could you help us figure out how to do it the right way? We (the MM guys) are having a hard time not being familiar with the filesystem stuff. This is an RFC and we want to know how to do this right.
Re: [PATCH v5 6/7] slab: Use slab_list instead of lru
Acked-by: Christoph Lameter
Re: [PATCH v5 3/7] slob: Use slab_list instead of lru
Acked-by: Christoph Lameter
Re: [PATCH v5 1/7] list: Add function list_rotate_to_front()
On Wed, 3 Apr 2019, Tobin C. Harding wrote: > Add function list_rotate_to_front() to rotate a list until the specified > item is at the front of the list. Reviewed-by: Christoph Lameter
Re: [PATCH v5 2/7] slob: Respect list_head abstraction layer
On Wed, 3 Apr 2019, Tobin C. Harding wrote: > Currently we reach inside the list_head. This is a violation of the > layer of abstraction provided by the list_head. It makes the code > fragile. More importantly it makes the code wicked hard to understand. Great. It definitely makes it clearer. The boolean parameter is not so nice but I have no idea how to avoid it in the brief time I spent looking at it. Acked-by: Christoph Lameter
Re: [PATCH v3] kmemleaak: survive in a low-memory situation
On Tue, 26 Mar 2019, Qian Cai wrote: > + if (!object) { > + /* > + * The tracked memory was allocated successful, if the kmemleak > + * object failed to allocate for some reasons, it ends up with > + * the whole kmemleak disabled, so let it success at all cost. "let it succeed at all costs" > + */ > + gfp = (in_atomic() || irqs_disabled()) ? GFP_ATOMIC : > +gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM; > + object = kmem_cache_alloc(object_cache, gfp); > + } > + > if (!object) { If the alloc must succeed then this check is no longer necessary.
Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
On Mon, 25 Mar 2019, Matthew Wilcox wrote: > Options: > > 1. Dispense with this optimisation and always store the size of the > object before the object. I think that's how SLOB handled it at some point in the past. Let's go back to that setup so it's compatible with the other allocators?
Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
On Fri, 22 Mar 2019, Matthew Wilcox wrote: > On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote: > > On Fri, 22 Mar 2019, Waiman Long wrote: > > > > > > > > > >> I am looking forward to it. > > > > There is also alrady rcu being used in these paths. kfree_rcu() would > > > > not > > > > be enough? It is an estalished mechanism that is mature and well > > > > understood. > > > > > > > In this case, the memory objects are from kmem caches, so they can't > > > freed using kfree_rcu(). > > > > Oh they can. kfree() can free memory from any slab cache. > > Only for SLAB and SLUB. SLOB requires that you pass a pointer to the > slab cache; it has no way to look up the slab cache from the object. Well then we could either fix SLOB to conform to the others or add a kmem_cache_free_rcu() variant.
Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
On Fri, 22 Mar 2019, Waiman Long wrote: > > > >> I am looking forward to it. > > There is also alrady rcu being used in these paths. kfree_rcu() would not > > be enough? It is an estalished mechanism that is mature and well > > understood. > > > In this case, the memory objects are from kmem caches, so they can't > freed using kfree_rcu(). Oh they can. kfree() can free memory from any slab cache. > There are certainly overhead using the kfree_rcu(), or a > kfree_rcu()-like mechanism. Also I think the actual freeing is done at > SoftIRQ context which can be a problem if there are too many memory > objects to free. No there is none that I am aware of. And this has survived testing of HPC loads with gazillions of objects that have to be freed from multiple processors. We really should not rearchitect this stuff... It took us quite a long time to have this scale well under all loads. Please use kfree_rcu().
Re: [PATCH 1/4] mm: Implement kmem objects freeing queue
On Thu, 21 Mar 2019, Waiman Long wrote: > When releasing kernel data structures, freeing up the memory > occupied by those objects is usually the last step. To avoid races, > the release operation is commonly done with a lock held. However, the > freeing operations do not need to be under lock, but are in many cases. > > In some complex cases where the locks protect many different memory > objects, that can be a problem especially if some memory debugging > features like KASAN are enabled. In those cases, freeing memory objects > under lock can greatly lengthen the lock hold time. This can even lead > to soft/hard lockups in some extreme cases. > > To make it easer to defer freeing memory objects until after unlock, > a kernel memory freeing queue mechanism is now added. It is modelled > after the wake_q mechanism for waking up tasks without holding a lock. It is already pretty easy. You just store the pointer to the slab object in a local variable, finish all the unlocks and then free the objects. This is done in numerous places of the kernel. I fear that the automated mechanism will make the code more difficult to read and result in a loss of clarity of the sequencing of events in releasing locks and objects. Also there is already kfree_rcu which does a similar thing to what you are proposing here and is used in numerous places.
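The existing pattern referred to above — detach the object under the lock, drop the lock, then free — can be sketched in userspace with a pthread mutex standing in for the kernel spinlock. Names and structures are illustrative, not kernel code.

```c
#include <pthread.h>
#include <stddef.h>

struct node {
	struct node *next;
	int val;
};

/* Detach the object while holding the lock, then return it so the
 * caller can free it after the unlock.  No freeing-queue machinery is
 * needed: a local pointer carries the object out of the locked
 * section. */
static struct node *pop_locked(pthread_mutex_t *lock, struct node **head)
{
	struct node *n;

	pthread_mutex_lock(lock);
	n = *head;
	if (n)
		*head = n->next;
	pthread_mutex_unlock(lock);

	return n;	/* caller frees outside the lock */
}
```

The lock hold time covers only the list manipulation; any expensive free-side work (KASAN checks and the like, in the kernel case) happens after the unlock.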
Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory
On Fri, 22 Mar 2019, Waiman Long wrote: > I am looking forward to it. There is also already rcu being used in these paths. kfree_rcu() would not be enough? It is an established mechanism that is mature and well understood.
Re: [PATCH] mm, slab: remove unneed check in cpuup_canceled
On Thu, 21 Mar 2019, Li RongQing wrote: > nc is a member of percpu allocation memory, and impossible NULL Acked-by: Christoph Lameter
Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
On Tue, 19 Mar 2019, John Hubbard wrote: > > > > My concerns do not affect this patchset which just marks the get/put for > > the pagecache. The problem was that the description was making claims that > > were a bit misleading and seemed to prescribe a solution. > > > > So lets get this merged. Whatever the solution will be, we will need this > > markup. > > > > Sounds good. Do you care to promote that thought into a formal ACK for me? :) Reviewed-by: Christoph Lameter
Re: [PATCH v4 1/1] mm: introduce put_user_page*(), placeholder versions
On Wed, 20 Mar 2019, Dave Chinner wrote: > So the plan for GUP vs writeback so far is "break fsync()"? :) Well if it's an anonymous page and not a file backed page then the semantics are preserved. Disallow GUP long term pinning (marking stuff like in this patchset may make that possible) and it's clean.
Re: [PATCH v4 0/1] mm: introduce put_user_page*(), placeholder versions
On Fri, 8 Mar 2019, john.hubb...@gmail.com wrote: > We seem to have pretty solid consensus on the concept and details of the > put_user_pages() approach. Or at least, if we don't, someone please speak > up now. Christopher Lameter, especially, since you had some concerns > recently. My concerns do not affect this patchset which just marks the get/put for the pagecache. The problem was that the description was making claims that were a bit misleading and seemed to prescribe a solution. So lets get this merged. Whatever the solution will be, we will need this markup.
Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
On Wed, 13 Mar 2019, Christoph Hellwig wrote: > On Wed, Mar 13, 2019 at 09:11:13AM +1100, Dave Chinner wrote: > > On Tue, Mar 12, 2019 at 03:39:33AM -0700, Ira Weiny wrote: > > > IMHO I don't think that the copy_file_range() is going to carry us > > > through the > > > next wave of user performance requirements. RDMA, while the first, is > > > not the > > > only technology which is looking to have direct access to files. XDP is > > > another.[1] > > > > Sure, all I doing here was demonstrating that people have been > > trying to get local direct access to file mappings to DMA directly > > into them for a long time. Direct Io games like these are now > > largely unnecessary because we now have much better APIs to do > > zero-copy data transfer between files (which can do hardware offload > > if it is available!). > > And that is just the file to file case. There are tons of other > users of get_user_pages, including various drivers that do large > amounts of I/O like video capture. For them it makes tons of sense > to transfer directly to/from a mmap()ed file. That is very similar to the RDMA case and DAX etc. We need to have a way to tell a filesystem that this is going to happen and that things need to be set up for this to work properly. But if that has not been done then I think it's proper to fail a long term pin operation on page cache pages. Meaning the regular filesystems maintain control of what's happening with their pages.
Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
On Tue, 12 Mar 2019, Jerome Glisse wrote: > > > This has been discuss extensively already. GUP usage is now widespread in > > > multiple drivers, removing that would regress userspace ie break existing > > > application. We all know what the rules for that is. You are still misstating the issue. In RDMA land GUP is widely used for anonymous memory and memory based filesystems. *Not* for real filesystems. > > Because someone was able to get away with weird ways of abusing the system > > it not an argument that we should continue to allow such things. In fact > > we have repeatedly ensured that the kernel works reliably by improving the > > kernel so that a proper failure is occurring. > > Driver doing GUP on mmap of regular file is something that seems to > already have widespread user (in the RDMA devices at least). So they > are active users and they were never told that what they are doing > was illegal. Not true. Again please differentiate the use cases between regular filesystems and anonymous mappings. > > Well swapout cannot occur if the page is pinned and those pages are also > > often mlocked. > > I would need to check the swapout code but i believe the write to disk > can happen before the pin checks happens. I believe the event flow is: > map read only, allocate swap, write to disk, try to free page which > checks for pin. So that you could write stale data to disk and the GUP > going away before you perform the pin checks. Allocate swap is a separate step that associates a swap entry to an anonymous page. > They are other thing to take into account and that need proper page > dirtying, like soft dirtyness for instance. RDMA mapped pages are all dirty all the time. > Well RDMA driver maintainer seems to report that this has been a valid > and working workload for their users. No they don't. Could you please get up to date on the discussion before posting?
Re: [PATCH v2 5/5] mm: Remove stale comment from page struct
Acked-by: Christoph Lameter
Re: [PATCH v2 4/5] slob: Use slab_list instead of lru
On Wed, 13 Mar 2019, Tobin C. Harding wrote: > @@ -297,7 +297,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int > align, int node) > continue; > > /* Attempt to alloc */ > - prev = sp->lru.prev; > + prev = sp->slab_list.prev; > b = slob_page_alloc(sp, size, align); > if (!b) > continue; Hmmm... Is there a way to use a macro or so to avoid referencing the field within slab_list?
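One option — purely illustrative, no such helper exists in the tree — would be a small accessor so call sites never open-code the field access:

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel types, for illustration only. */
struct list_head {
	struct list_head *next, *prev;
};

struct page {
	struct list_head slab_list;
};

/* Hypothetical wrapper: callers ask for the neighboring entry instead
 * of reaching into slab_list themselves, keeping the raw field access
 * in one place. */
static inline struct list_head *slab_list_prev(struct page *sp)
{
	return sp->slab_list.prev;
}
```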
Re: [PATCH v2 2/5] slub: Use slab_list instead of lru
Acked-by: Christoph Lameter
Re: [PATCH v2 1/5] slub: Add comments to endif pre-processor macros
Acked-by: Christoph Lameter
Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
On Mon, 11 Mar 2019, Dave Chinner wrote: > > Direct IO on a mmapped file backed page doesnt make any sense. > > People have used it for many, many years as zero-copy data movement > pattern. i.e. mmap the destination file, use direct IO to DMA direct > into the destination file page cache pages, fdatasync() to force > writeback of the destination file. Well we could make that more safe through a special API that designates a range of pages in a file in the same way as for RDMA. This is inherently not reliable as we found out. > Now we have copy_file_range() to optimise this sort of data > movement, the need for games with mmap+direct IO largely goes away. > However, we still can't just remove that functionality as it will > break lots of random userspace stuff... It is already broken and unreliable. Are there really "lots" of these things around? Can we test this by adding a warning in the kernel and see where it actually crops up?
Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
On Fri, 8 Mar 2019, Jerome Glisse wrote: > > > > It would good if that understanding would be enforced somehow given the > problems > that we see. > > This has been discuss extensively already. GUP usage is now widespread in > multiple drivers, removing that would regress userspace ie break existing > application. We all know what the rules for that is. The applications that work are using anonymous memory and memory filesystems. I have never seen use cases with a real filesystem and would have objected if someone tried something crazy like that. Because someone was able to get away with weird ways of abusing the system is not an argument that we should continue to allow such things. In fact we have repeatedly ensured that the kernel works reliably by improving the kernel so that a proper failure is occurring. > > > In fact, the GUP documentation even recommends that pattern. > > > > Isnt that pattern safe for anonymous memory and memory filesystems like > > hugetlbfs etc? Which is the common use case. > > Still an issue in respect to swapout ie if anon/shmem page was map > read only in preparation for swapout and we do not report the page > as dirty what endup in swap might lack what was written last through > GUP. Well swapout cannot occur if the page is pinned and those pages are also often mlocked. > > > > Yes you now have the filesystem as well as the GUP pinner claiming > > authority over the contents of a single memory segment. Maybe better not > > allow that? > > This goes back to regressing existing driver with existing users. There is no regression if that behavior never really worked. > > Two filesystem trying to sync one memory segment both believing to have > > exclusive access and we want to sort this out. Why? Dont allow this. > > This is allowed, it always was, forbidding that case now would regress > existing application and it would also means that we are modifying the > API we expose to userspace.
> So again this is not something we can block without regressing existing
> users.

We have always stopped users from doing obviously stupid and risky things.
It would be logical to do it here as well.
Re: [RFC 04/15] slub: Enable Slab Movable Objects (SMO)
On Mon, 11 Mar 2019, Roman Gushchin wrote:

> > +static inline void *alloc_scratch(struct kmem_cache *s)
> > +{
> > +	unsigned int size = oo_objects(s->max);
> > +
> > +	return kmalloc(size * sizeof(void *) +
> > +		       BITS_TO_LONGS(size) * sizeof(unsigned long),
> > +		       GFP_KERNEL);
>
> I wonder how big this allocation can be?
> Given that the reason for migration is probably highly fragmented memory,
> we probably don't want to have a high-order allocation here. So maybe
> kvmalloc()?

The smallest object size is 8 bytes, which is one word, and such objects
would be placed in an order-0 page. So it comes out to about a page again.

Larger allocation orders are possible if the slab pages themselves can
have larger orders, of course. If you set the min_order to the huge page
order then we can have similarly sized orders for the allocation of the
scratch space. That is not a problem, though, since the allocations for
the slab pages themselves are already of that same order.
Re: [RFC 02/15] slub: Add isolate() and migrate() methods
On Mon, 11 Mar 2019, Roman Gushchin wrote:

> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4325,6 +4325,34 @@ int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
> >  	return err;
> >  }
> >
> > +void kmem_cache_setup_mobility(struct kmem_cache *s,
> > +			       kmem_cache_isolate_func isolate,
> > +			       kmem_cache_migrate_func migrate)
> > +{
>
> I wonder if it's better to adapt kmem_cache_create() to take two
> additional arguments? I suspect mobility is not a dynamic option, so it
> can be set on kmem_cache creation.

One other idea that prior versions of this patchset used was to change
kmem_cache_create() so that the ctor parameter becomes an ops vector.
However, in order to reduce the size of the patchset I dropped that. It
could easily be moved back to the way it was before.

> > +	/*
> > +	 * Sadly serialization requirements currently mean that we have
> > +	 * to disable fast cmpxchg based processing.
> > +	 */
>
> Can you, please, elaborate a bit more here?

cmpxchg based processing does not lock the struct page. SMO requires
ensuring that all changes on a slab page can be stopped. The page->lock
will accomplish that. I think we could avoid actually locking the page
with some more work.
Re: [RFC 02/15] slub: Add isolate() and migrate() methods
On Fri, 8 Mar 2019, Tycho Andersen wrote:

> On Fri, Mar 08, 2019 at 03:14:13PM +1100, Tobin C. Harding wrote:
> > diff --git a/mm/slab_common.c b/mm/slab_common.c
> > index f9d89c1b5977..754acdb292e4 100644
> > --- a/mm/slab_common.c
> > +++ b/mm/slab_common.c
> > @@ -298,6 +298,10 @@ int slab_unmergeable(struct kmem_cache *s)
> >  	if (!is_root_cache(s))
> >  		return 1;
> >
> > +	/*
> > +	 * s->isolate and s->migrate imply s->ctor so no need to
> > +	 * check them explicitly.
> > +	 */
>
> Shouldn't this implication go the other way, i.e.
> s->ctor => s->isolate & s->migrate

A cache can have a constructor while its objects are not movable (as is
currently the case for dentries and inodes).
Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:

> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
> code to get the struct page behind a virtual address and to let storage
> hardware perform a direct copy to or from that page. This is a
> short-lived access pattern, and as such, the window for a concurrent
> writeback of a GUP'd page was small enough that there were not (we
> think) any reported problems. Also, userspace was expected to understand
> and accept that Direct IO was not synchronized with memory-mapped access
> to that data, nor with any process address space changes such as
> munmap(), mremap(), etc.

It would be good if that understanding would be enforced somehow given
the problems that we see.

> Interactions with file systems
> ==============================
>
> File systems expect to be able to write back data, both to reclaim
> pages,

Regular filesystems do that. But usually those are not used with GUP
pinning AFAICT.

> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to
> gain write access to the file memory pages means that such hardware can
> dirty the pages, without the filesystem being aware. This can, in some
> cases (depending on filesystem, filesystem options, block device, block
> device options, and other variables), lead to data corruption, and also
> to kernel bugs of the form:
>
> Long term GUP
> =============
>
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can
> lead to filesystem corruption. What happens is that when a file-backed
> page is being written back, it is first mapped read-only in all of the
> CPU page tables; the file system then assumes that nobody can write to
> the page, and that the page content is therefore stable.
> Unfortunately, the GUP callers generally do not monitor changes to the
> CPU page tables; they instead assume that the following pattern is safe
> (it's not):
>
>     get_user_pages()
>
>     Hardware can keep a reference to those pages for a very long time,
>     and write to it at any time. Because "hardware" here means "devices
>     that are not a CPU", this activity occurs without any interaction
>     with the kernel's file system code.
>
>     for each page
>         set_page_dirty
>         put_page()
>
> In fact, the GUP documentation even recommends that pattern.

Isn't that pattern safe for anonymous memory and memory filesystems like
hugetlbfs etc? Which is the common use case.

> Anyway, the file system assumes that the page is stable (nothing is
> writing to the page), and that is a problem: stable page content is
> necessary for many filesystem actions during writeback, such as
> checksum, encryption, RAID striping, etc. Furthermore, filesystem
> features like COW (copy on write) or snapshot also rely on being able to
> use a new page as memory for that memory range inside the file.
>
> Corruption during write back is clearly possible here. To solve that,
> one idea is to identify pages that have active GUP, so that we can use a
> bounce page to write stable data to the filesystem. The filesystem would
> work on the bounce page, while any of the active GUP might write to the
> original page. This would avoid the stable page violation problem, but
> note that it is only part of the overall solution, because other
> problems remain.

Yes, you now have the filesystem as well as the GUP pinner claiming
authority over the contents of a single memory segment. Maybe better not
to allow that?

> Direct IO
> =========
>
> Direct IO can cause corruption, if userspace does Direct-IO that writes
> to a range of virtual addresses that are mmap'd to a file. The pages
> written to are file-backed pages that can be under write back, while the
> Direct IO is taking place.
> Here, Direct IO races with a write back: it calls GUP before
> page_mkclean() has replaced the CPU pte with a read-only entry. The race
> window is pretty small, which is probably why years have gone by before
> we noticed this problem: Direct IO is generally very quick, and tends to
> finish up before the filesystem gets around to do anything with the page
> contents. However, it's still a real problem. The solution is to never
> let GUP return pages that are under write back, but instead, force GUP
> to take a write fault on those pages. That way, GUP will properly
> synchronize with the active write back. This does not change the
> required GUP behavior, it just avoids that race.

Direct IO on an mmapped file-backed page doesn't make any sense. The
direct I/O write syscall already specifies one file handle of a
filesystem that the data is to be written to. And the mmap has already
established a second file handle and another filesystem that is also in
charge of that memory segment.

Two filesystems trying to sync one memory segment, both believing to have
exclusive access, and we want to sort this out. Why? Don't allow this.
Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
On Wed, 6 Mar 2019, john.hubb...@gmail.com wrote:

> Dave Chinner's description of this is very clear:
>
>     "The fundamental issue is that ->page_mkwrite must be called on
>     every write access to a clean file backed page, not just the first
>     one. How long the GUP reference lasts is irrelevant, if the page is
>     clean and you need to dirty it, you must call ->page_mkwrite before
>     it is marked writeable and dirtied. Every. Time."
>
> This is just one symptom of the larger design problem: filesystems do
> not actually support get_user_pages() being called on their pages, and
> letting hardware write directly to those pages--even though that pattern
> has been going on since about 2005 or so.

Can we distinguish between real filesystems that actually write to a
backing device and the special filesystems (like hugetlbfs, shm and
friends) that are like anonymous memory and do not require
->page_mkwrite() in the same way as regular filesystems?

The use that I have seen in my section of the world has been restricted to
RDMA, with get_user_pages() limited to anonymous memory and those special
filesystems. And if the RDMA memory is of such a type then the use in the
past and present is safe.

So a logical other approach would be to simply not allow the use of long
term get_user_pages() on real filesystem pages. I hope this patch
supports that?

It is customary after all that a file read or write operation involves one
single file(!) and that what is written either comes from or goes to
memory (anonymous or a special memory filesystem).

If you have an mmapped memory segment backed by a regular device-backed
file then you already have one file associated with a memory segment and a
filesystem that takes care of synchronizing the contents of the memory
segment to a backing device. If you now perform RDMA or device I/O on such
a memory segment then you will have *two* different devices interacting
with that memory segment.
I think that ought not to happen and should not be supported out of the
box. It will be difficult to handle and the semantics will be hard for
users to understand.

What could happen is that the filesystem could agree, on request, to allow
third-party I/O to go to such a memory segment. But that needs to be well
defined and clearly and explicitly handled by some mechanism in user space
that has well-defined semantics for data integrity for the filesystem as
well as for the RDMA or device I/O.
Re: [PATCH] percpu/module reservation: change reservation size iff X86_VSMP is set
On Fri, 1 Mar 2019, Barret Rhoden wrote:

> I'm not familiar with VSMP - how bad is it to use L1 cache alignment
> instead of 4K page alignment? Maybe some structures can use the smaller
> alignment? Or maybe have VSMP require SRCU-using modules to be built-in?

It is very expensive. VSMP exchanges 4K segments via RDMA between servers
to build a large address space and run a kernel in that large address
space. Using smaller segments can cause a lot of "cacheline" bouncing
(meaning transfers of 4K segments back and forth between servers).
Re: [PATCH] cxgb4: fix undefined behavior in mem.c
On Thu, 28 Feb 2019, Shaobo He wrote:

> I think maybe the more problematic issue is that the value of a freed
> pointer is indeterminate.

The pointer is not affected by freeing the data it points to. Thus it
definitely has the same value as before and is not indeterminate.

The pointer now points to an area of memory that could be in use for
different purposes, so it could be taken as a dangerous situation. But
situations like that are common in code.
Re: [PATCH 1/2] percpu: km: remove SMP check
On Mon, 25 Feb 2019, Dennis Zhou wrote:

> > @@ -27,7 +27,7 @@
> >   * chunk size is not aligned. percpu-km code will whine about it.
> >   */
> >
> > -#if defined(CONFIG_SMP) && defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
> > +#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK)
> >  #error "contiguous percpu allocation is incompatible with paged first chunk"
> >  #endif
>
> Hi,
>
> I think keeping CONFIG_SMP makes it easier to remember the dependencies
> rather than having to dig into the config. So this is a NACK from me.

But it simplifies the code and makes it easier to read.
Re: [PATCH 2/2] percpu: km: no need to consider pcpu_group_offsets[0]
On Mon, 25 Feb 2019, den...@kernel.org wrote:

> > @@ -67,7 +67,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
> >  		pcpu_set_page_chunk(nth_page(pages, i), chunk);
> >
> >  	chunk->data = pages;
> > -	chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
> > +	chunk->base_addr = page_address(pages);
> >
> >  	spin_lock_irqsave(&pcpu_lock, flags);
> >  	pcpu_chunk_populated(chunk, 0, nr_pages, false);
>
> While I do think you're right, creating a chunk is not a part of the
> critical path and subtracting 0 is incredibly minor overhead. So I'd
> rather keep the code as is to maintain consistency between percpu-vm.c
> and percpu-km.c.

Well, it is confusing if the expression is there but never used. It is
clearer with the patch.
Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
On Fri, 15 Feb 2019, Ira Weiny wrote:

> > > > for filesystems and processes. The only problems come in for the
> > > > things which bypass the page cache like O_DIRECT and DAX.
> > >
> > > It makes a lot of sense since the filesystems play COW etc games
> > > with the pages and RDMA is very much like O_DIRECT in that the pages
> > > are modified directly under I/O. It also bypasses the page cache in
> > > case you have not noticed yet.
> >
> > It is quite different, O_DIRECT modifies the physical blocks on the
> > storage, bypassing the memory copy.
>
> Really? I thought O_DIRECT allowed the block drivers to write to/from
> user space buffers. But the _storage_ was still under the control of the
> block drivers?

It depends on what you see as the modification target. O_DIRECT uses
memory as a target and source, like RDMA. The block device is at the
other end of the handling.

> > RDMA modifies the memory copy.
> >
> > pages are necessary to do RDMA, and those pages have to be flushed to
> > disk. So I'm not seeing how it can be disconnected from the page
> > cache?
>
> I don't disagree with this.

RDMA does direct access to memory. If that memory is an mmap of a regular
block device then we have a problem (this has not been a standard use case
to my knowledge).

The semantics are simply different. RDMA expects memory to be pinned and
always to be readable and writable. The block device/filesystem expects
memory access to be controllable via the page permissions. In particular,
access to a page needs to be able to be stopped. This is fundamentally
incompatible.

RDMA access to such an mmapped section must preserve the RDMA semantics
while the pinning is in place and can only provide the access control
after RDMA is finished. Pages in the RDMA range cannot be handled like
normal page cache pages.

This is particularly evident in the DAX case, in which we have direct
pass-through even to the storage medium, and in which write-through can
replace the page cache.
Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
On Fri, 15 Feb 2019, Matthew Wilcox wrote:

> > Since RDMA is something similar: Can we say that a file that is used
> > for RDMA should not use the page cache?
>
> That makes no sense. The page cache is the standard synchronisation
> point for filesystems and processes. The only problems come in for the
> things which bypass the page cache like O_DIRECT and DAX.

It makes a lot of sense, since the filesystems play COW etc games with the
pages and RDMA is very much like O_DIRECT in that the pages are modified
directly under I/O. It also bypasses the page cache, in case you have not
noticed yet.

Both filesystems and RDMA acting on the page cache at the same time lead
to the mess that we are trying to solve.
Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
On Fri, 15 Feb 2019, Dave Chinner wrote:

> Which tells us filesystem people that the applications are doing
> something that _will_ cause data corruption and hence not to spend any
> time triaging data corruption reports because it's not a filesystem bug
> that caused it.
>
> See open(2):
>
>     Applications should avoid mixing O_DIRECT and normal I/O to the same
>     file, and especially to overlapping byte regions in the same file.
>     Even when the filesystem correctly handles the coherency issues in
>     this situation, overall I/O throughput is likely to be slower than
>     using either mode alone. Likewise, applications should avoid mixing
>     mmap(2) of files with direct I/O to the same files.

Since RDMA is something similar: Can we say that a file that is used for
RDMA should not use the page cache? And can we enforce this in the future?

I.e. have some file state that says that this file is direct/RDMA or
contains long-term pinning and thus allows only a certain type of
operations to ensure data consistency? If we cannot enforce it then we may
want to spit out a warning?