Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-21 Thread Vlastimil Babka
On 1/12/21 12:12 AM, Jann Horn wrote:
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland, processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.
> 
> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on
> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are
> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

I'm gonna send an RFC that adds a proper shrinker and thus connects this
shrinking to page reclaim, as a reply to this e-mail.


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-18 Thread Michal Hocko
On Mon 18-01-21 15:46:43, Christoph Lameter wrote:
> On Mon, 18 Jan 2021, Michal Hocko wrote:
> 
> > > Hm, this would be similar to recommending a periodic "echo > drop_caches"
> > > operation. We actually discourage that (and yeah, some tools do that, and
> > > we now report those in dmesg). I believe the kernel should respond to
> > > memory pressure and not OOM prematurely by itself, including SLUB.
> >
> > Absolutely agreed! Partial caches are a very deep internal
> > implementation detail of the allocator, and admins have no business
> > fiddling with that. This would only lead to more harm than good.
> > The comparison to drop_caches is really apt!
> 
> Really? The maximum allocation here has an upper boundary that depends on
> the number of possible partial per cpu slabs.

And number of cpus and caches...

> There is a worst case
> scenario that is not nice and wastes some memory but it is not an OOM
> situation and the system easily recovers from it.

There is no proactive shrinking of those when we are close to OOM, so we
can still go and kill a task while quite some memory is sitting in
freeable SLUB caches, unless I am missing something.

We have learned about this in a memcg environment on our distribution
kernels, where the problem is amplified by use in memcgs with small
limits. Those are older kernels, and I would expect current upstream to
behave better with Roman's accounting rework. But it would still be
great if the allocator could manage its caches depending on the memory
demand.

> The slab shrinking is not needed but if you are concerned about reclaiming
> more memory right now then I guess you may want to run the slab shrink
> operation.

Yes, you can do that, in the same way you can shrink the page cache.
Moreover, it is really hard to do that intelligently, because you
would need to watch the system very closely in order to shrink when it
is really needed. That requires a deep understanding of the allocator.

> Dropping the page cache is bad? Well sometimes you want more free memory
> due to a certain operation that needs to be started and where you do not
> want the overhead of page cache processing.

It is not bad if used properly. My experience is that people have
developed an instinct to drop caches whenever something is not quite
right, because the Internet has recommended it.

-- 
Michal Hocko
SUSE Labs


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-18 Thread Christoph Lameter
On Mon, 18 Jan 2021, Michal Hocko wrote:

> > Hm, this would be similar to recommending a periodic "echo > drop_caches"
> > operation. We actually discourage that (and yeah, some tools do that, and
> > we now report those in dmesg). I believe the kernel should respond to memory
> > pressure and not OOM prematurely by itself, including SLUB.
>
> Absolutely agreed! Partial caches are a very deep internal
> implementation detail of the allocator, and admins have no business
> fiddling with that. This would only lead to more harm than good.
> The comparison to drop_caches is really apt!

Really? The maximum allocation here has an upper boundary that depends on
the number of possible partial per cpu slabs. There is a worst case
scenario that is not nice and wastes some memory but it is not an OOM
situation and the system easily recovers from it.

The slab shrinking is not needed but if you are concerned about reclaiming
more memory right now then I guess you may want to run the slab shrink
operation.

Dropping the page cache is bad? Well sometimes you want more free memory
due to a certain operation that needs to be started and where you do not
want the overhead of page cache processing.

You can go crazy and expect magical things from either operation. True.



Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-18 Thread Michal Hocko
On Thu 14-01-21 10:27:40, Vlastimil Babka wrote:
> On 1/12/21 5:35 PM, Christoph Lameter wrote:
> > On Tue, 12 Jan 2021, Jann Horn wrote:
> > 
> >> [This is not something I intend to work on myself. But since I
> >> stumbled over this issue, I figured I should at least document/report
> >> it, in case anyone is willing to pick it up.]
> > 
> > Well yeah all true. There is however a slabinfo tool that has an -s option
> > to shrink all slabs.
> > 
> > slabinfo -s
> > 
> > So you could put that somewhere that executes if the system is
> > idle or put it into cron or so.
> 
> Hm, this would be similar to recommending a periodic "echo > drop_caches"
> operation. We actually discourage that (and yeah, some tools do that, and
> we now report those in dmesg). I believe the kernel should respond to memory
> pressure and not OOM prematurely by itself, including SLUB.

Absolutely agreed! Partial caches are a very deep internal
implementation detail of the allocator, and admins have no business
fiddling with that. This would only lead to more harm than good.
The comparison to drop_caches is really apt!

-- 
Michal Hocko
SUSE Labs


SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-14 Thread Vlastimil Babka
On 1/12/21 5:35 PM, Christoph Lameter wrote:
> On Tue, 12 Jan 2021, Jann Horn wrote:
> 
>> [This is not something I intend to work on myself. But since I
>> stumbled over this issue, I figured I should at least document/report
>> it, in case anyone is willing to pick it up.]
> 
> Well yeah all true. There is however a slabinfo tool that has an -s option
> to shrink all slabs.
> 
>   slabinfo -s
> 
> So you could put that somewhere that executes if the system is
> idle or put it into cron or so.

Hm, this would be similar to recommending a periodic "echo > drop_caches"
operation. We actually discourage that (and yeah, some tools do that, and
we now report those in dmesg). I believe the kernel should respond to memory
pressure and not OOM prematurely by itself, including SLUB.


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-14 Thread Christoph Lameter
On Wed, 13 Jan 2021, Jann Horn wrote:

> Some brainstorming:
>
> Maybe you could have an atomic counter in kmem_cache_cpu that tracks
> the number of empty frozen pages that are associated with a specific
> CPU? So the freeing slowpath would do its cmpxchg_double, and if the


The latencies of these functions are so low that any additional counter
will have a significant performance impact. An atomic counter would be way
out there.

> You could additionally have a plain percpu counter, not tied to the

The performance critical counters are already all per cpu. I enhanced the
percpu subsystem specifically to support latency critical operations in
the fast path of the slab allocators.


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-13 Thread Jann Horn
On Wed, Jan 13, 2021 at 8:14 PM Vlastimil Babka  wrote:
> On 1/12/21 12:12 AM, Jann Horn wrote:
> It doesn't help that slabinfo (global or per-memcg) is also
> inaccurate as it cannot count free objects on per-cpu partial slabs and thus
> reports them as active.

Maybe SLUB could be taught to track how many objects are in the percpu
machinery, and then print that number separately so that you can at
least know how much data you're missing without having to collect data
with IPIs...

> > It might be a good idea to figure out whether it is possible to
> > efficiently keep track of a more accurate count of the free objects on
>
> As long as there are some inuse objects, it shouldn't matter much if the
> slab is sitting on a per-cpu partial list or the per-node list, as it can't
> be freed anyway. It becomes a real problem only after the slab becomes
> fully free. If we detected that in __slab_free() also for already-frozen
> slabs, we would need to know which CPU this slab belongs to (currently
> that's not tracked afaik),

Yeah, but at least on 64-bit systems we still have 32 completely
unused bits in the counter field that's updated via cmpxchg_double on
struct page. (On 32-bit systems the bitfields are also wider than they
strictly need to be, I think, at least if the system has 4K page
size.) So at least on 64-bit systems, we could squeeze a CPU number in
there, and then you'd know to which CPU the page belonged at the time
the object was freed.

> and send it an
> IPI to do some light version of unfreeze_partials() that would only remove 
> empty
> slabs. The trick would be not to cause too many IPI's by this, obviously :/

Some brainstorming:

Maybe you could have an atomic counter in kmem_cache_cpu that tracks
the number of empty frozen pages that are associated with a specific
CPU? So the freeing slowpath would do its cmpxchg_double, and if the
new state after a successful cmpxchg_double is "inuse==0 && frozen ==
1" with a valid CPU number, you afterwards do
"atomic_long_inc(&per_cpu_ptr(cache->cpu_slab,
cpu)->empty_partial_pages)". I think it should be possible to
implement that such that the empty_partial_pages count, while not
immediately completely accurate, would be eventually consistent; and
readers on the CPU owning the kmem_cache_cpu should never see a number
that is too large, only one that is too small.

You could additionally have a plain percpu counter, not tied to the
kmem_cache, and increment it by 1<

Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-13 Thread Vlastimil Babka
On 1/12/21 12:12 AM, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]
> 
> Hi!

Hi, thanks for saving me a lot of typing!

...

> This means that in practice, SLUB actually ends up keeping as many
> **pages** on the percpu partial lists as it intends to keep **free
> objects** there.

Yes, I concluded the same thing.

...

> I suspect that this may have also contributed to the memory wastage
> problem with memory cgroups that was fixed in v5.9
> (https://lore.kernel.org/linux-mm/20200623174037.3951353-1-g...@fb.com/);
> meaning that servers with lots of CPU cores running pre-5.9 kernels
> with memcg and systemd (which tends to stick every service into its
> own memcg) might be even worse off.

Very much yes. Investigating an increase in kmemcg usage of a workload between
an older kernel with SLAB and a 5.3-based kernel with SLUB led us to find the same
issue as you did. It doesn't help that slabinfo (global or per-memcg) is also
inaccurate, as it cannot count free objects on per-cpu partial slabs and thus
reports them as active. I was aware that some empty slab pages might linger on
per-cpu lists, but only seeing how many were freed after "echo 1 >
.../shrink" made me realize the extent of this.

> It also seems unsurprising to me that flushing ~30 pages out of the
> percpu partial caches at once with IRQs disabled would cause tail
> latency spikes (as noted by Joonsoo Kim and Christoph Lameter in
> commit 345c905d13a4e "slub: Make cpu partial slab support
> configurable").
> 
> At first I thought that this wasn't a significant issue because SLUB
> has a reclaim path that can trim the percpu partial lists; but as it
> turns out, that reclaim path is not actually wired up to the page
> allocator's reclaim logic. The SLUB reclaim stuff is only triggered by
> (very rare) subsystem-specific calls into SLUB for specific slabs and
> by sysfs entries. So in userland, processes will OOM even if SLUB still
> has megabytes of entirely unused pages lying around.

Yeah, we considered wiring the shrinking to memcg OOM, but it's a poor
solution. I'm considering introducing a proper shrinker that would be
registered and work like other shrinkers for reclaimable caches. Then we
would make it memcg-aware in our backport; upstream after v5.9 obviously
doesn't need that.

> It might be a good idea to figure out whether it is possible to
> efficiently keep track of a more accurate count of the free objects on

As long as there are some inuse objects, it shouldn't matter much if the slab is
sitting on a per-cpu partial list or the per-node list, as it can't be freed anyway.
It becomes a real problem only after the slab becomes fully free. If we detected
that in __slab_free() also for already-frozen slabs, we would need to know which
CPU this slab belongs to (currently that's not tracked afaik), and send it an
IPI to do some light version of unfreeze_partials() that would only remove empty
slabs. The trick would be not to cause too many IPIs by this, obviously :/

Actually, I'm somewhat wrong above. If a CPU's and the per-node partial lists run
out of free objects, it's wasteful to allocate new slabs while almost-empty slabs
sit on another CPU's percpu partial list.

> percpu partial lists; and if not, maybe change the accounting to
> explicitly track the number of partial pages, and use limits that are

That would probably be the simplest solution. Maybe sufficient upstream, where
the wastage only depends on the number of caches and not memcgs. For pre-5.9 I also
considered limiting the number of pages only for the per-memcg clones :/
Currently, writing to the /sys/...//cpu_partial file is propagated to all
the clones and the root cache.

> more appropriate for that? And perhaps the page allocator reclaim path
> should also occasionally rip unused pages out of the percpu partial
> lists?

That would be best done by a shrinker?

BTW, SLAB does this by reaping its per-cpu and shared arrays from timers (which
works, but is not ideal). Those arrays also can't grow as large as this.


Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-12 Thread Christoph Lameter
On Tue, 12 Jan 2021, Jann Horn wrote:

> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Well yeah all true. There is however a slabinfo tool that has an -s option
to shrink all slabs.

slabinfo -s

So you could put that somewhere that executes if the system is
idle or put it into cron or so.

This is a heavy-handed operation, though. You could switch off per-cpu partial
slabs completely to avoid the issue. Nothing has come to mind in the past on
how to solve this without sacrificing significant performance or causing
some system processing at random times while the shrinking runs. No one
wants any of that.

Being able to do it from userspace cleanly shifts the burden to userspace  ;-)
You can even do random heavy system processing from user space if you put
a sleep there for random seconds between the shrinking runs.



SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-11 Thread Jann Horn
[This is not something I intend to work on myself. But since I
stumbled over this issue, I figured I should at least document/report
it, in case anyone is willing to pick it up.]

Hi!

I was poking around in SLUB internals and noticed that the estimate of
how many free objects exist on a percpu partial list (tracked in
page->pobjects of the first page on the list and exposed to userspace
via /sys/kernel/slab/*/slabs_cpu_partial) is highly inaccurate.

From a naive first look, it seems like SLUB tries to roughly keep up
to cache->cpu_partial free objects per slab and CPU around.
cache->cpu_partial depends on the object size; the maximum is 30 (for
objects <256 bytes).

However, the accounting of free objects into page->pobjects only
happens when a page is added to the percpu partial list;
page->pobjects is not updated when objects are freed to a page that is
already on the percpu partial list. Pages can be added to the percpu
partial list in two cases:

1. When an object is freed from a page which previously had zero free
objects (via __slab_free()), meaning the page was not previously on
any list.
2. When the percpu partial list was empty and get_partial_node(),
after grabbing a partial page from the node, moves more partial pages
onto the percpu partial list to make the percpu partial list contain
around cache->cpu_partial/2 free objects.

In case 1, pages will almost always be counted as having one free
object, unless a race with a concurrent __slab_free() on the same page
happens, because the code path specifically handles the case where the
number of free objects just became 1.
In case 2, pages will probably often be counted as having many free
objects; however, this case doesn't appear to happen often in
practice, likely partly because pages outside of percpu partial lists
get freed immediately when they become empty.


This means that in practice, SLUB actually ends up keeping as many
**pages** on the percpu partial lists as it intends to keep **free
objects** there. To see this, you can append the snippet at the end of
this mail to mm/slub.c; that will add a debugfs entry that lets you
accurately dump the percpu partial lists of all running CPUs (at the
cost of an IPI storm when you read it).

Especially after running some forkbombs multiple times on a VM with a
bunch of CPUs, that will show you that some of the percpu lists look
like this (just showing a single slab's percpu lists on three CPU
cores here) - note the "inuse=0" everywhere:

task_delay_info on 10:
  page=f9eb9b00 base=988bd2e6c000 order=0 partial_pages=24 partial_objects=24 objects=51 inuse=0
  page=f9e44469f800 base=988bda7e order=0 partial_pages=23 partial_objects=23 objects=51 inuse=0
  page=f9e444e01380 base=988bf804e000 order=0 partial_pages=22 partial_objects=22 objects=51 inuse=0
  page=f9e444bdda40 base=988bef769000 order=0 partial_pages=21 partial_objects=21 objects=51 inuse=0
  page=f9e444c59700 base=988bf165c000 order=0 partial_pages=20 partial_objects=20 objects=51 inuse=0
  page=f9e44494d280 base=988be534a000 order=0 partial_pages=19 partial_objects=19 objects=51 inuse=0
  page=f9e444fd2e80 base=988bff4ba000 order=0 partial_pages=18 partial_objects=18 objects=51 inuse=0
  page=f9e444c47c80 base=988bf11f2000 order=0 partial_pages=17 partial_objects=17 objects=51 inuse=0
  page=f9e4448ff780 base=988be3fde000 order=0 partial_pages=16 partial_objects=16 objects=51 inuse=0
  page=f9e4443883c0 base=988bce20f000 order=0 partial_pages=15 partial_objects=15 objects=51 inuse=0
  page=f9e444eede40 base=988bfbb79000 order=0 partial_pages=14 partial_objects=14 objects=51 inuse=0
  page=f9e4442febc0 base=988bcbfaf000 order=0 partial_pages=13 partial_objects=13 objects=51 inuse=0
  page=f9e444e44940 base=988bf9125000 order=0 partial_pages=12 partial_objects=12 objects=51 inuse=0
  page=f9e4446f72c0 base=988bdbdcb000 order=0 partial_pages=11 partial_objects=11 objects=51 inuse=0
  page=f9e444dba080 base=988bf6e82000 order=0 partial_pages=10 partial_objects=10 objects=51 inuse=0
  page=f9e444a23c40 base=988be88f1000 order=0 partial_pages=9 partial_objects=9 objects=51 inuse=0
  page=f9e444786cc0 base=988bde1b3000 order=0 partial_pages=8 partial_objects=8 objects=51 inuse=0
  page=f9e444b2cf80 base=988becb3e000 order=0 partial_pages=7 partial_objects=7 objects=51 inuse=0
  page=f9e444f19cc0 base=988bfc673000 order=0 partial_pages=6 partial_objects=6 objects=51 inuse=0
  page=f9e444f08fc0 base=988bfc23f000 order=0 partial_pages=5 partial_objects=5 objects=51 inuse=0
  page=f9e444a0e540 base=988be8395000 order=0 partial_pages=4 partial_objects=4 objects=51 inuse=0
  page=f9e445127a00 base=988c049e8000 order=0 partial_pages=3 partial_objects=3 objects=51 inuse=0
  page=f9e44468ae40 base=988bda2b9000 order=0 partial_pages=2 partial_objects=2 objects=51 

Re: SLUB: percpu partial object count is highly inaccurate, causing some memory wastage and maybe also worse tail latencies?

2021-01-11 Thread Roman Gushchin
On Tue, Jan 12, 2021 at 12:12:47AM +0100, Jann Horn wrote:
> [This is not something I intend to work on myself. But since I
> stumbled over this issue, I figured I should at least document/report
> it, in case anyone is willing to pick it up.]

Hi, Jann!

Thank you very much for this report!

It should be less of a problem since 5.9, but still, on a machine with
many CPUs, a non-trivial amount of memory is wasted.

I'll definitely take a look.

Thanks!