Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting
> On Dec 3, 2020, at 2:13 PM, Nicholas Piggin wrote:
>
> Excerpts from Peter Zijlstra's message of December 3, 2020 6:44 pm:
>>> On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
>>>
>>> power: same as ARM, except that the loop may be rather larger since
>>> the systems are bigger.  But I imagine it's still faster than Nick's
>>> approach -- a cmpxchg to a remote cacheline should still be faster than
>>> an IPI shootdown.
>>
>> While a single atomic might be cheaper than an IPI, the comparison
>> doesn't work out nicely.  You do the xchg() on every unlazy, while the
>> IPI would be once per process exit.
>>
>> So over the life of the process, it might do very many unlazies, adding
>> up to a total cost far in excess of what the single IPI would've been.
>
> Yeah this is the concern, I looked at things that add cost to the
> idle switch code and it gets hard to justify the scalability improvement
> when you slow these fundamental things down even a bit.

v2 fixes this and is generally much nicer.  I'll send it out in a couple
hours.

> I still think working on the assumption that IPIs = scary expensive
> might not be correct.  An IPI itself is, but you only issue them when
> you've left a lazy mm on another CPU which just isn't that often.
>
> Thanks,
> Nick
Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting
Excerpts from Peter Zijlstra's message of December 3, 2020 6:44 pm:
> On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
>
>> power: same as ARM, except that the loop may be rather larger since
>> the systems are bigger.  But I imagine it's still faster than Nick's
>> approach -- a cmpxchg to a remote cacheline should still be faster than
>> an IPI shootdown.
>
> While a single atomic might be cheaper than an IPI, the comparison
> doesn't work out nicely.  You do the xchg() on every unlazy, while the
> IPI would be once per process exit.
>
> So over the life of the process, it might do very many unlazies, adding
> up to a total cost far in excess of what the single IPI would've been.

Yeah this is the concern, I looked at things that add cost to the
idle switch code and it gets hard to justify the scalability improvement
when you slow these fundamental things down even a bit.

I still think working on the assumption that IPIs = scary expensive
might not be correct.  An IPI itself is, but you only issue them when
you've left a lazy mm on another CPU which just isn't that often.

Thanks,
Nick
Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting
On Thu, 2020-12-03 at 12:31 +, Matthew Wilcox wrote:
> And this just makes me think RCU freeing of mm_struct.  I'm sure it's
> more complicated than that (then, or now), but if an anonymous process
> is borrowing a freed mm, and the mm is freed by RCU then it will not go
> away until the task context switches.  When we context switch back to
> the anon task, it'll borrow some other task's MM and won't even notice
> that the MM it was using has gone away.

One major complication here is that most of the active_mm borrowing
is done by the idle task, but RCU does not wait for idle tasks to
context switch.  That means RCU, as it is today, is not a mechanism
that mm_struct freeing could just piggyback off.

-- 
All Rights Reversed.
Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting
On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
> This code compiles, but I haven't even tried to boot it.  The earlier
> part of the series isn't terribly interesting -- it's a handful of
> cleanups that remove all reads of ->active_mm from arch/x86.  I've
> been meaning to do that for a while, and now I did it.  But, with
> that done, I think we can move to a totally different lazy mm
> refcounting model.

I went back and read Documentation/vm/active_mm.rst recently.  I think
it's useful to think about how this would have been handled if we'd had
RCU at the time.  Particularly:

Linus wrote:
> To support all that, the "struct mm_struct" now has two counters: a
> "mm_users" counter that is how many "real address space users" there are,
> and a "mm_count" counter that is the number of "lazy" users (ie anonymous
> users) plus one if there are any real users.

And this just makes me think RCU freeing of mm_struct.  I'm sure it's
more complicated than that (then, or now), but if an anonymous process
is borrowing a freed mm, and the mm is freed by RCU then it will not go
away until the task context switches.  When we context switch back to
the anon task, it'll borrow some other task's MM and won't even notice
that the MM it was using has gone away.
Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting
On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
> power: same as ARM, except that the loop may be rather larger since
> the systems are bigger.  But I imagine it's still faster than Nick's
> approach -- a cmpxchg to a remote cacheline should still be faster than
> an IPI shootdown.

While a single atomic might be cheaper than an IPI, the comparison
doesn't work out nicely.  You do the xchg() on every unlazy, while the
IPI would be once per process exit.

So over the life of the process, it might do very many unlazies, adding
up to a total cost far in excess of what the single IPI would've been.

And while I appreciate all the work to get rid of the active_mm
accounting; the worry I have with pushing this all into arch code is
that it will be so very easy to get this subtly wrong.
Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting
Hi Andy,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/x86/core]
[also build test ERROR on tip/x86/mm soc/for-next linus/master v5.10-rc6 next-20201201]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Andy-Lutomirski/x86-mm-Lightweight-lazy-mm-refcounting/20201203-132706
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 238c91115cd05c71447ea071624a4c9fe661f970
config: m68k-randconfig-s032-20201203 (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.3-179-ga00755aa-dirty
        # https://github.com/0day-ci/linux/commit/0d9b23b22e621d8e588095b4d0f9f39110a57901
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Andy-Lutomirski/x86-mm-Lightweight-lazy-mm-refcounting/20201203-132706
        git checkout 0d9b23b22e621d8e588095b4d0f9f39110a57901
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=m68k

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot

All errors (new ones prefixed by >>):

   m68k-linux-ld: kernel/fork.o: in function `__mmput':
>> fork.c:(.text+0x214): undefined reference to `arch_fixup_lazy_mm_refs'
   m68k-linux-ld: kernel/fork.o: in function `mmput_async_fn':
   fork.c:(.text+0x990): undefined reference to `arch_fixup_lazy_mm_refs'
   m68k-linux-ld: kernel/fork.o: in function `mmput':
   fork.c:(.text+0xa68): undefined reference to `arch_fixup_lazy_mm_refs'
   m68k-linux-ld: kernel/fork.o: in function `dup_mm.isra.0':
   fork.c:(.text+0x1192): undefined reference to `arch_fixup_lazy_mm_refs'
   m68k-linux-ld: kernel/fork.o: in function `mm_access':
   fork.c:(.text+0x13c6): undefined reference to `arch_fixup_lazy_mm_refs'

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[MOCKUP] x86/mm: Lightweight lazy mm refcounting
For context, this is part of a series here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm&id=7c4bcc0a464ca60be1e0aeba805a192be0ee81e5

This code compiles, but I haven't even tried to boot it.  The earlier
part of the series isn't terribly interesting -- it's a handful of
cleanups that remove all reads of ->active_mm from arch/x86.  I've
been meaning to do that for a while, and now I did it.  But, with
that done, I think we can move to a totally different lazy mm
refcounting model.

So this patch is a mockup of what this could look like.  The algorithm
involves no atomics at all in the context switch path except for a
single atomic_long_xchg() of a percpu variable when going from lazy
mode to nonlazy mode.  Even that could be optimized -- I suspect it
could be replaced with non-atomic code if mm_users > 0.

Instead, on mm exit, there's a single loop over all CPUs on which that
mm could be lazily loaded that atomic_long_cmpxchg_relaxed()'s a remote
percpu variable to tell the CPU to kindly mmdrop() the mm when it
reschedules.  All cpus that don't have the mm lazily loaded are ignored
because they can't have lazy references, and no cpus can gain lazy
references after mm_users hits zero.  (I need to verify that all the
relevant barriers are in place.  I suspect that they are on x86 but
that I'm short an smp_mb() on arches for which _relaxed atomics are
genuinely relaxed.)

Here's how I think it fits with various arches:

x86: On bare metal (i.e. paravirt flush unavailable), the loop won't
do much.  The existing TLB shootdown when user tables are freed will
empty mm_cpumask of everything but the calling CPU.  So x86 ends up
pretty close to as good as we can get short of reworking mm_cpumask()
itself.

arm64: with s/for_each_cpu/for_each_online_cpu, I think it will give
good performance.  The mmgrab()/mmdrop() overhead goes away, and, on
smallish systems, the cost of the loop should be low.
power: same as ARM, except that the loop may be rather larger since
the systems are bigger.  But I imagine it's still faster than Nick's
approach -- a cmpxchg to a remote cacheline should still be faster than
an IPI shootdown.  (Nick, don't benchmark this patch -- at least the
mm_users optimization mentioned above should be done, but also the
mmgrab() and mmdrop() aren't actually removed.)

Other arches: I don't know.  Further research is required.

What do you all think?

As mentioned, there are several things blatantly wrong with this patch:

The coding style is not up to kernel standards.  I have prototypes in
the wrong places and other hacks.

mms are likely to be freed with IRQs off.  I think this is safe, but
it's suboptimal.

This whole thing is in arch/x86.  The core algorithm ought to move
outside arch/, but doing so without making a mess might take some
thought.  It doesn't help that different architectures have very
different ideas of what mm_cpumask() does.

Finally, this patch has no benefit by itself.  I didn't remove the
active_mm refcounting, so the big benefit of removing mmgrab() and
mmdrop() calls on transitions to and from lazy mode isn't there.
There is no point at all in benchmarking this patch as is.  That being
said, getting rid of the active_mm refcounting shouldn't be so hard,
since x86 (in this tree) no longer uses active_mm at all.

I should contemplate whether the calling CPU is special in
arch_fixup_lazy_mm_refs().  On a very very quick think, it's not, but
it needs more thought.

Signed-off-by: Andy Lutomirski

 arch/x86/include/asm/tlbflush.h | 20
 arch/x86/mm/tlb.c               | 81 +++--
 kernel/fork.c                   |  5 ++
 3 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..efcd4f125f2c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -124,6 +124,26 @@ struct tlb_state {
 	 */
 	unsigned short user_pcid_flush_mask;
 
+	/*
+	 * Lazy mm tracking.
+	 *
+	 * - If this is NULL, it means that any mm_struct referenced by
+	 *   this CPU is kept alive by a real reference count.
+	 *
+	 * - If this is nonzero but the low bit is clear, it points to
+	 *   an mm_struct that must not be freed out from under this
+	 *   CPU.
+	 *
+	 * - If the low bit is set, it still points to an mm_struct,
+	 *   but some other CPU has mmgrab()'d it on our behalf, and we
+	 *   must mmdrop() it when we're done with it.
+	 *
+	 * See lazy_mm_grab() and related functions for the precise
+	 * access rules.
+	 */
+	atomic_long_t lazy_mm;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e27300fc865b..00f5bace534b 100644
--- a/arch