Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On 1/12/21 12:12 AM, Jann Horn wrote:
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So userland processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.
>
> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on
> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are
> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

I'm going to send an RFC that adds a proper shrinker, and thus connects
this shrinking to page reclaim, as a reply to this e-mail.
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Mon 18-01-21 15:46:43, Christoph Lameter wrote:
> On Mon, 18 Jan 2021, Michal Hocko wrote:
> > > Hm this would be similar to recommending a periodical
> > > echo > drop_caches operation. We actually discourage from that
> > > (and yeah, some tools do that, and we now report those in dmesg).
> > > I believe the kernel should respond to memory pressure and not
> > > OOM prematurely by itself, including SLUB.
> >
> > Absolutely agreed! Partial caches are a very deep internal
> > implementation detail of the allocator and the admin has no business
> > fiddling with that. This would only lead to more harm than good.
> > The comparison to drop_caches is really exact!
>
> Really? The maximum allocation here has an upper boundary that depends
> on the number of possible partial per cpu slabs.

And the number of cpus and caches...

> There is a worst case scenario that is not nice and wastes some memory
> but it is not an OOM situation and the system easily recovers from it.

There is no pro-active shrinking of those when we are close to the OOM,
so we can still go and kill a task while there is quite some memory
sitting in freeable slub caches, unless I am missing something.

We have learned about this in a memcg environment on our distribution
kernels, where the problem is amplified by the use in memcgs with a
small limit. This is an older kernel and I would expect the current
upstream to behave better with Roman's accounting rework. But still it
would be great if the allocator could manage its caches depending on
the memory demand.

> The slab shrinking is not needed but if you are concerned about
> reclaiming more memory right now then I guess you may want to run the
> slab shrink operation.

Yes, you can do that. In the same way you can shrink the page cache.
Moreover, it is really hard to do that intelligently, because you would
need to watch the system very closely in order to shrink when it is
really needed. That requires a deep understanding of the allocator.

> Dropping the page cache is bad? Well sometimes you want more free
> memory due to a certain operation that needs to be started and where
> you do not want the overhead of page cache processing.

It is not bad if used properly. My experience is that people have
developed an instinct to drop caches whenever something is not quite
right, because the Internet has recommended that.
--
Michal Hocko
SUSE Labs
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Mon, 18 Jan 2021, Michal Hocko wrote:
> > Hm this would be similar to recommending a periodical
> > echo > drop_caches operation. We actually discourage from that (and
> > yeah, some tools do that, and we now report those in dmesg). I
> > believe the kernel should respond to memory pressure and not OOM
> > prematurely by itself, including SLUB.
>
> Absolutely agreed! Partial caches are a very deep internal
> implementation detail of the allocator and the admin has no business
> fiddling with that. This would only lead to more harm than good.
> The comparison to drop_caches is really exact!

Really? The maximum allocation here has an upper boundary that depends
on the number of possible partial per cpu slabs. There is a worst case
scenario that is not nice and wastes some memory, but it is not an OOM
situation and the system easily recovers from it.

The slab shrinking is not needed, but if you are concerned about
reclaiming more memory right now then I guess you may want to run the
slab shrink operation.

Dropping the page cache is bad? Well, sometimes you want more free
memory due to a certain operation that needs to be started and where
you do not want the overhead of page cache processing.

You can go crazy and expect magical things from either operation. True.
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Thu 14-01-21 10:27:40, Vlastimil Babka wrote:
> On 1/12/21 5:35 PM, Christoph Lameter wrote:
> > On Tue, 12 Jan 2021, Jann Horn wrote:
> >
> >> [This is not something I intend to work on myself. But since I
> >> stumbled over this issue, I figured I should at least document/report
> >> it, in case anyone is willing to pick it up.]
> >
> > Well yeah all true. There is however a slabinfo tool that has an -s
> > option to shrink all slabs.
> >
> > slabinfo -s
> >
> > So you could put that somewhere that executes if the system is idle
> > or put it into cron or so.
>
> Hm this would be similar to recommending a periodical
> echo > drop_caches operation. We actually discourage from that (and
> yeah, some tools do that, and we now report those in dmesg). I believe
> the kernel should respond to memory pressure and not OOM prematurely
> by itself, including SLUB.

Absolutely agreed! Partial caches are a very deep internal
implementation detail of the allocator and the admin has no business
fiddling with that. This would only lead to more harm than good.
The comparison to drop_caches is really exact!
--
Michal Hocko
SUSE Labs
SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On 1/12/21 5:35 PM, Christoph Lameter wrote:
> On Tue, 12 Jan 2021, Jann Horn wrote:
>
>> [This is not something I intend to work on myself. But since I
>> stumbled over this issue, I figured I should at least document/report
>> it, in case anyone is willing to pick it up.]
>
> Well yeah all true. There is however a slabinfo tool that has an -s
> option to shrink all slabs.
>
> slabinfo -s
>
> So you could put that somewhere that executes if the system is idle
> or put it into cron or so.

Hm this would be similar to recommending a periodical echo > drop_caches
operation. We actually discourage from that (and yeah, some tools do
that, and we now report those in dmesg). I believe the kernel should
respond to memory pressure and not OOM prematurely by itself, including
SLUB.
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Wed, 13 Jan 2021, Jann Horn wrote:
> Some brainstorming:
>
> Maybe you could have an atomic counter in kmem_cache_cpu that tracks
> the number of empty frozen pages that are associated with a specific
> CPU? So the freeing slowpath would do its cmpxchg_double, and if the

The latencies of these functions are so low that any additional counter
will have significant performance impacts. An atomic counter would be
way out there.

> You could additionally have a plain percpu counter, not tied to the

The performance critical counters are already all per cpu. I enhanced
the percpu subsystem specifically to support latency critical operations
in the fast path of the slab allocators.
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Wed, Jan 13, 2021 at 8:14 PM Vlastimil Babka wrote:
> On 1/12/21 12:12 AM, Jann Horn wrote:
> It doesn't help that slabinfo (global or per-memcg) is also inaccurate
> as it cannot count free objects on per-cpu partial slabs and thus
> reports them as active.

Maybe SLUB could be taught to track how many objects are in the percpu
machinery, and then print that number separately so that you can at
least know how much data you're missing without having to collect data
with IPIs...

> > It might be a good idea to figure out whether it is possible to
> > efficiently keep track of a more accurate count of the free objects on
>
> As long as there are some inuse objects, it shouldn't matter much if
> the slab is sitting on per-cpu partial list or per-node list, as it
> can't be freed anyway. It becomes a real problem only after the slab
> becomes fully free. If we detected that in __slab_free() also for
> already-frozen slabs, we would need to know which CPU this slab
> belongs to (currently that's not tracked afaik),

Yeah, but at least on 64-bit systems we still have 32 completely unused
bits in the counter field that's updated via cmpxchg_double on struct
page. (On 32-bit systems the bitfields are also wider than they strictly
need to be, I think, at least if the system has 4K page size.) So at
least on 64-bit systems, we could squeeze a CPU number in there, and
then you'd know to which CPU the page belonged at the time the object
was freed.

> and send it an IPI to do some light version of unfreeze_partials()
> that would only remove empty slabs. The trick would be not to cause
> too many IPIs by this, obviously :/

Some brainstorming:

Maybe you could have an atomic counter in kmem_cache_cpu that tracks
the number of empty frozen pages that are associated with a specific
CPU? So the freeing slowpath would do its cmpxchg_double, and if the
new state after a successful cmpxchg_double is "inuse==0 && frozen==1"
with a valid CPU number, you afterwards do
"atomic_long_inc(&per_cpu_ptr(cache->cpu_slab, cpu)->empty_partial_pages)".
I think it should be possible to implement that such that the
empty_partial_pages count, while not immediately completely accurate,
would be eventually consistent; and readers on the CPU owning the
kmem_cache_cpu should never see a number that is too large, only one
that is too small.

You could additionally have a plain percpu counter, not tied to the
kmem_cache, and increment it by 1<
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
>
> Hi!

Hi, thanks for saving me a lot of typing!

...

> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/20200623174037.3951353-1-g...@fb.com/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase of kmemcg usage of a workload
between an older kernel with SLAB and a 5.3-based kernel with SLUB led
us to find the same issue as you did. It doesn't help that slabinfo
(global or per-memcg) is also inaccurate as it cannot count free objects
on per-cpu partial slabs and thus reports them as active.

I was aware that some empty slab pages might linger on per-cpu lists,
but only seeing how many were freed after "echo 1 > .../shrink" made me
realize the extent of this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
>
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So userland processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be
registered and work like other shrinkers for reclaimable caches. Then we
would make it memcg-aware in our backport - upstream after v5.9 doesn't
need that, obviously.

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some inuse objects, it shouldn't matter much if the
slab is sitting on a per-cpu partial list or a per-node list, as it
can't be freed anyway. It becomes a real problem only after the slab
becomes fully free. If we detected that in __slab_free() also for
already-frozen slabs, we would need to know which CPU this slab belongs
to (currently that's not tracked afaik), and send it an IPI to do some
light version of unfreeze_partials() that would only remove empty slabs.
The trick would be not to cause too many IPIs by this, obviously :/

Actually I'm somewhat wrong above. If a CPU's and the per-node partial
list run out of free objects, it's wasteful to allocate new slabs if
almost-empty slabs sit on another CPU's per-cpu partial list.

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe sufficient upstream,
where the wastage only depends on the number of caches and not memcgs.
For pre-5.9 I also considered limiting the number of pages only for the
per-memcg clones :/ Currently writing to the /sys/...//cpu_partial file
is propagated to all the clones and the root cache.

> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker? BTW, SLAB does this by reaping
its per-cpu and shared arrays with timers (which works, but is not
ideal). They also can't grow that large like this.
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Tue, 12 Jan 2021, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Well yeah all true. There is however a slabinfo tool that has an -s
option to shrink all slabs.

slabinfo -s

So you could put that somewhere that executes if the system is idle or
put it into cron or so.

This is a heavy handed operation though. You could switch off partial
cpu slabs completely to avoid the issue.

Nothing came to mind in the past on how to solve this without
sacrificing significant performance or causing some system processing
at random times while the shrinking runs. No one wants any of that.
Being able to do it from userspace cleanly shifts the burden to
userspace ;-)

You can even do the random heavy system processing from user space if
you put a sleep there for random seconds between the shrinking runs.
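For concreteness, the cron suggestion could look like the following crontab fragment (a hedged sketch: the install path is an assumption, since slabinfo is built from the kernel source tree - tools/vm/slabinfo.c in kernels of this era - and is not typically packaged by distributions):

```shell
# Hypothetical crontab entry: shrink all slabs nightly during an
# expected idle window. /usr/local/bin/slabinfo is an assumed path;
# build the tool from tools/vm/slabinfo.c in the kernel source.
15 3 * * * /usr/local/bin/slabinfo -s
```

As the surrounding discussion notes, this is exactly the kind of userspace band-aid that the drop_caches comparison warns against; it trades the kernel responding to actual memory pressure for a fixed schedule chosen by the admin.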
SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
[This is not something I intend to work on myself. But since I
stumbled over this issue, I figured I should at least document/report
it, in case anyone is willing to pick it up.]

Hi!

I was poking around in SLUB internals and noticed that the estimate of
how many free objects exist on a percpu partial list (tracked in
page->pobjects of the first page on the list and exposed to userspace
via /sys/kernel/slab/*/slabs_cpu_partial) is highly inaccurate.

From a naive first look, it seems like SLUB tries to roughly keep up to
cache->cpu_partial free objects per slab and CPU around.
cache->cpu_partial depends on the object size; the maximum is 30 (for
objects <256 bytes).

However, the accounting of free objects into page->pobjects only
happens when a page is added to the percpu partial list; page->pobjects
is not updated when objects are freed to a page that is already on the
percpu partial list. Pages can be added to the percpu partial list in
two cases:

1. When an object is freed from a page which previously had zero free
   objects (via __slab_free()), meaning the page was not previously on
   any list.
2. When the percpu partial list was empty and get_partial_node(), after
   grabbing a partial page from the node, moves more partial pages onto
   the percpu partial list to make the percpu partial list contain
   around cache->cpu_partial/2 free objects.

In case 1, pages will almost always be counted as having one free
object, unless a race with a concurrent __slab_free() on the same page
happens, because the code path specifically handles the case where the
number of free objects just became 1.

In case 2, pages will probably often be counted as having many free
objects; however, this case doesn't appear to happen often in practice,
likely partly because pages outside of percpu partial lists get freed
immediately when they become empty.
This means that in practice, SLUB actually ends up keeping as many
**pages** on the percpu partial lists as it intends to keep **free
objects** there.

To see this, you can append the snippet at the end of this mail to
mm/slub.c; that will add a debugfs entry that lets you accurately dump
the percpu partial lists of all running CPUs (at the cost of an IPI
storm when you read it). Especially after running some forkbombs
multiple times on a VM with a bunch of CPUs, that will show you that
some of the percpu lists look like this (just showing a single slab's
percpu lists on three CPU cores here) - note the "inuse=0" everywhere:

task_delay_info on 10:
page=f9eb9b00 base=988bd2e6c000 order=0 partial_pages=24 partial_objects=24 objects=51 inuse=0
page=f9e44469f800 base=988bda7e order=0 partial_pages=23 partial_objects=23 objects=51 inuse=0
page=f9e444e01380 base=988bf804e000 order=0 partial_pages=22 partial_objects=22 objects=51 inuse=0
page=f9e444bdda40 base=988bef769000 order=0 partial_pages=21 partial_objects=21 objects=51 inuse=0
page=f9e444c59700 base=988bf165c000 order=0 partial_pages=20 partial_objects=20 objects=51 inuse=0
page=f9e44494d280 base=988be534a000 order=0 partial_pages=19 partial_objects=19 objects=51 inuse=0
page=f9e444fd2e80 base=988bff4ba000 order=0 partial_pages=18 partial_objects=18 objects=51 inuse=0
page=f9e444c47c80 base=988bf11f2000 order=0 partial_pages=17 partial_objects=17 objects=51 inuse=0
page=f9e4448ff780 base=988be3fde000 order=0 partial_pages=16 partial_objects=16 objects=51 inuse=0
page=f9e4443883c0 base=988bce20f000 order=0 partial_pages=15 partial_objects=15 objects=51 inuse=0
page=f9e444eede40 base=988bfbb79000 order=0 partial_pages=14 partial_objects=14 objects=51 inuse=0
page=f9e4442febc0 base=988bcbfaf000 order=0 partial_pages=13 partial_objects=13 objects=51 inuse=0
page=f9e444e44940 base=988bf9125000 order=0 partial_pages=12 partial_objects=12 objects=51 inuse=0
page=f9e4446f72c0 base=988bdbdcb000 order=0 partial_pages=11 partial_objects=11 objects=51 inuse=0
page=f9e444dba080 base=988bf6e82000 order=0 partial_pages=10 partial_objects=10 objects=51 inuse=0
page=f9e444a23c40 base=988be88f1000 order=0 partial_pages=9 partial_objects=9 objects=51 inuse=0
page=f9e444786cc0 base=988bde1b3000 order=0 partial_pages=8 partial_objects=8 objects=51 inuse=0
page=f9e444b2cf80 base=988becb3e000 order=0 partial_pages=7 partial_objects=7 objects=51 inuse=0
page=f9e444f19cc0 base=988bfc673000 order=0 partial_pages=6 partial_objects=6 objects=51 inuse=0
page=f9e444f08fc0 base=988bfc23f000 order=0 partial_pages=5 partial_objects=5 objects=51 inuse=0
page=f9e444a0e540 base=988be8395000 order=0 partial_pages=4 partial_objects=4 objects=51 inuse=0
page=f9e445127a00 base=988c049e8000 order=0 partial_pages=3 partial_objects=3 objects=51 inuse=0
page=f9e44468ae40 base=988bda2b9000 order=0 partial_pages=2 partial_objects=2 objects=51
Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?
On Tue, Jan 12, 2021 at 12:12:47AM +0100, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Hi, Jann!

Thank you very much for this report! It should be less of a problem
since 5.9, but still on a machine with many CPUs a non-trivial amount
of memory is wasted. I'll definitely take a look.

Thanks!