Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-07-17 Thread Andy Lutomirski



On Mon, Jun 26, 2023, at 10:48 AM, Song Liu wrote:
> On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland  wrote:
>>
> [...]
>> >
>> > So the idea was that jit_text_alloc() will have a cache of large pages
>> > mapped ROX, will allocate memory from those caches and there will be
>> > jit_update() that uses text poking for writing to that memory.
>> >
>> > Upon allocation of a large page to increase the cache, that large page will
>> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
>> > x86)
>>
>> Does that work on x86?
>>
>> That is in no way guaranteed for other architectures; on arm64 you need
>> explicit cache maintenance (with I-cache maintenance at the VA to be executed
>> from) followed by context-synchronization-events (e.g. via ISB instructions, 
>> or
>> IPIs).
>
> I guess we need:
> 1) Invalidate unused part of the huge ROX pages;
> 2) Do not put two jit users (including module text, bpf, etc.) in the
> same cache line;
> 3) Explicit cache maintenance;
> 4) context-synchronization-events.
>
> Would these (or a subset of them) be sufficient to protect us from torn reads?

Maybe?  #4 is sufficiently vague that I can't really interpret it.

I have a half-drafted email asking for official clarification on the rules that 
might help shed light on this.  I find that this type of request works best 
when it's really well written :)
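
For concreteness, here is a rough sketch of what steps 1, 3 and 4 above could 
look like on x86, reusing the existing text-poking helpers.  The 
jit_invalidate_range() name is made up and nothing like this exists in the 
series; step 2 would be an allocator placement policy rather than code here.

#include <asm/text-patching.h>	/* text_poke_set(), text_poke_sync() */
#include <asm/cacheflush.h>	/* flush_icache_range() */

static void jit_invalidate_range(void *ptr, size_t len)
{
	/* 1) fill the unused area with trapping instructions (int3 on x86) */
	text_poke_set(ptr, 0xcc, len);

	/* 3) make the new bytes visible to instruction fetch; a no-op on x86,
	 *    real D-cache/I-cache maintenance on arm64 and others */
	flush_icache_range((unsigned long)ptr, (unsigned long)ptr + len);

	/* 4) force a context-synchronizing event on all CPUs (IPI + sync_core) */
	text_poke_sync();
}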

>
> Thanks,
> Song


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-25 Thread Andy Lutomirski



On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
>> 
>> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
>> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> >> > From: "Mike Rapoport (IBM)" 
>> >> >
>> >> > module_alloc() is used everywhere as a means to allocate memory for code.
>> >> >
>> >> > Besides being semantically wrong, this unnecessarily ties all subsystems
>> >> > that need to allocate code, such as ftrace, kprobes and BPF, to modules
>> >> > and puts the burden of code allocation on the modules code.
>> >> >
>> >> > Several architectures override module_alloc() because of various
>> >> > constraints where the executable memory can be located and this causes
>> >> > additional obstacles for improvements of code allocation.
>> >> >
>> >> > Start splitting code allocation from modules by introducing
>> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >> >
>> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> >> > module_alloc() and execmem_free() and jit_free() are replacements of
>> >> > module_memfree() to allow updating all call sites to use the new APIs.
>> >> >
>> >> > The intended semantics for the new allocation APIs:
>> >> >
>> >> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >> >   close to the kernel image, like loadable kernel modules and generated
>> >> >   code that is restricted by relative addressing.
>> >> >
>> >> > * jit_text_alloc() should be used to allocate memory for generated code
>> >> >   when there are no restrictions for the code placement. For
>> >> >   architectures that require that any code is within certain distance
>> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >> >   execmem_text_alloc().
>> >> >
>> >> 
>> >> Is there anything in this series to help users do the appropriate
>> >> synchronization when they actually populate the allocated memory with
>> >> code?  See here, for example:
>> >
>> > This series only factors out the executable allocations from modules and
>> > puts them in a central place.
>> > Anything else would go on top after this lands.
>> 
>> Hmm.
>> 
>> On the one hand, there's nothing wrong with factoring out common code. On
>> the other hand, this is probably the right time to at least start
>> thinking about synchronization, at least to the extent that it might make
>> us want to change this API.  (I'm not at all saying that this series
>> should require changes -- I'm just saying that this is a good time to
>> think about how this should work.)
>> 
>> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
>> look like the one thing in the Linux ecosystem that actually
>> intelligently and efficiently maps new text into an address space:
>> mmap().
>> 
>> On x86, you can mmap() an existing file full of executable code PROT_EXEC
>> and jump to it with minimal synchronization (just the standard implicit
>> ordering in the kernel that populates the pages before setting up the
>> PTEs and whatever user synchronization is needed to avoid jumping into
>> the mapping before mmap() finishes).  It works across CPUs, and the only
>> possible way userspace can screw it up (for a read-only mapping of
>> read-only text, anyway) is to jump to the mapping too early, in which
>> case userspace gets a page fault.  Incoherence is impossible, and no one
>> needs to "serialize" (in the SDM sense).
>> 
>> I think the same sequence (from userspace's perspective) works on other
>> architectures, too, although I think more cache management is needed on
>> the kernel's end.  As far as I know, no Linux SMP architecture needs an
>> IPI to map executable text into usermode, but I could easily be wrong.
>> (IIRC RISC-V has very developer-unfriendly icache management, but I don't
>> remember the details.)
>> 
>> Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
>> rather fraught, and I bet many things do it wrong when userspace is multithreaded.

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-20 Thread Andy Lutomirski



On Mon, Jun 19, 2023, at 1:18 PM, Nadav Amit wrote:
>> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski  wrote:
>> 
>> But jit_text_alloc() can't do this, because the order of operations doesn't 
>> match.  With jit_text_alloc(), the executable mapping shows up before the 
>> text is populated, so there is no atomic change from not-there to 
>> populated-and-executable.  Which means that there is an opportunity for 
>> CPUs, speculatively or otherwise, to start filling various caches with 
>> intermediate states of the text, which means that various architectures 
>> (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite 
>> coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address, but 
>> does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I 
>> think)
>
> Andy, would you mind explaining why you think a sync is not needed? I 
> mean I have a “feeling” that perhaps TSO can guarantee something based 
> on the order of write and page-table update. Is that the argument?

Sorry, when I say "no sync" I mean no cross-CPU synchronization.  I'm assuming 
the underlying sequence of events is:

allocate physical pages (jit_text_alloc)

write to them (with MOV, memcpy, whatever), via the direct map or via a 
temporary mm

do an appropriate *local* barrier (which, on x86, is probably implied by TSO, 
as the subsequent pagetable change is at least a release; also, any 
previous temporary mm stuff would have done MOV CR3 afterwards, which is a full 
"serializing" barrier)

optionally zap the direct map via IPI, assuming the pages are direct mapped 
(but this could be avoided with a smart enough allocator and temporary_mm above)

install the final RX PTE (jit_text_map), which does a MOV or maybe a LOCK 
CMPXCHG16B.  Note that the virtual address in question was not readable or 
executable before this, and all CPUs have serialized since the last time it was 
executable.

either jump to the new text locally, or:

1. Do a store-release to tell other CPUs that the text is mapped
2. Other CPU does a load-acquire to detect that the text is mapped and jumps to 
the text

This is all approximately the same thing that plain old mmap(..., PROT_EXEC, 
...) does.
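
A minimal sketch of that last hand-off, where jit_text_map() is the 
hypothetical call from the proposal quoted above and struct jit_prog is made 
up purely for illustration:

#include <asm/barrier.h>	/* smp_store_release(), smp_load_acquire() */

struct jit_prog {
	void *entry;		/* executable VA; NULL until published */
};

/* writer side: map the text RX, then announce it */
static void jit_publish(struct jit_prog *prog, void *handle, void *va)
{
	jit_text_map(handle);			/* install the final RX PTE */
	smp_store_release(&prog->entry, va);	/* step 1 above */
}

/* reader side, possibly on another CPU */
static void jit_run(struct jit_prog *prog)
{
	void *entry = smp_load_acquire(&prog->entry);	/* step 2 above */

	if (entry)
		((void (*)(void))entry)();	/* jump into the freshly mapped text */
}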

>
> On this regard, one thing that I clearly do not understand is why 
> *today* it is ok for users of bpf_arch_text_copy() not to call 
> text_poke_sync(). Am I missing something?

I cannot explain this, because I suspect the current code is wrong.  But it's 
only wrong across CPUs, because bpf_arch_text_copy goes through text_poke_copy, 
which calls unuse_temporary_mm(), which is serializing.  And it's plausible 
that most eBPF use cases don't actually cause the loaded program to get used on 
a different CPU without first serializing on the CPU that ends up using it.  
(Context switches and interrupts are serializing.)

FRED could make interrupts non-serializing. I sincerely hope that FRED doesn't 
cause this all to fall apart.

--Andy


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-19 Thread Andy Lutomirski



On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
>> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
>> > From: "Mike Rapoport (IBM)" 
>> >
>> > module_alloc() is used everywhere as a means to allocate memory for code.
>> >
>> > Besides being semantically wrong, this unnecessarily ties all subsystems
>> > that need to allocate code, such as ftrace, kprobes and BPF, to modules
>> > and puts the burden of code allocation on the modules code.
>> >
>> > Several architectures override module_alloc() because of various
>> > constraints where the executable memory can be located and this causes
>> > additional obstacles for improvements of code allocation.
>> >
>> > Start splitting code allocation from modules by introducing
>> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>> >
>> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
>> > module_alloc() and execmem_free() and jit_free() are replacements of
>> > module_memfree() to allow updating all call sites to use the new APIs.
>> >
>> > The intended semantics for the new allocation APIs:
>> >
>> > * execmem_text_alloc() should be used to allocate memory that must reside
>> >   close to the kernel image, like loadable kernel modules and generated
>> >   code that is restricted by relative addressing.
>> >
>> > * jit_text_alloc() should be used to allocate memory for generated code
>> >   when there are no restrictions for the code placement. For
>> >   architectures that require that any code is within certain distance
>> >   from the kernel image, jit_text_alloc() will be essentially aliased to
>> >   execmem_text_alloc().
>> >
>> 
>> Is there anything in this series to help users do the appropriate
>> synchronization when they actually populate the allocated memory with
>> code?  See here, for example:
>
> This series only factors out the executable allocations from modules and
> puts them in a central place.
> Anything else would go on top after this lands.

Hmm.

On the one hand, there's nothing wrong with factoring out common code. On the 
other hand, this is probably the right time to at least start thinking about 
synchronization, at least to the extent that it might make us want to change 
this API.  (I'm not at all saying that this series should require changes -- 
I'm just saying that this is a good time to think about how this should work.)

The current APIs, *and* the proposed jit_text_alloc() API, don't actually look 
like the one thing in the Linux ecosystem that actually intelligently and 
efficiently maps new text into an address space: mmap().

On x86, you can mmap() an existing file full of executable code PROT_EXEC and 
jump to it with minimal synchronization (just the standard implicit ordering in 
the kernel that populates the pages before setting up the PTEs and whatever 
user synchronization is needed to avoid jumping into the mapping before mmap() 
finishes).  It works across CPUs, and the only possible way userspace can screw 
it up (for a read-only mapping of read-only text, anyway) is to jump to the 
mapping too early, in which case userspace gets a page fault.  Incoherence is 
impossible, and no one needs to "serialize" (in the SDM sense).
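
For concreteness, a userspace sketch of that sequence (minimal error handling; 
the interesting part is what the kernel does underneath):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

typedef int (*entry_fn)(void);

static entry_fn map_text(const char *path, size_t len)
{
	int fd = open(path, O_RDONLY);
	void *p;

	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	close(fd);
	/* other threads can call the result as soon as they observe the
	 * pointer; no extra "serializing" step is needed */
	return p == MAP_FAILED ? NULL : (entry_fn)p;
}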

I think the same sequence (from userspace's perspective) works on other 
architectures, too, although I think more cache management is needed on the 
kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to map 
executable text into usermode, but I could easily be wrong.  (IIRC RISC-V has 
very developer-unfriendly icache management, but I don't remember the details.)

Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is rather 
fraught, and I bet many things do it wrong when userspace is multithreaded.  
(But not in production, because it's mostly not used in production.)

But jit_text_alloc() can't do this, because the order of operations doesn't 
match.  With jit_text_alloc(), the executable mapping shows up before the text 
is populated, so there is no atomic change from not-there to 
populated-and-executable.  Which means that there is an opportunity for CPUs, 
speculatively or otherwise, to start filling various caches with intermediate 
states of the text, which means that various architectures (even x86!) may need 
serialization.

For eBPF- and module- like use cases, where JITting/code gen is quite 
coarse-grained, perhaps something vaguely like:

jit_text_alloc() -> returns a handle and an executable virtual address, but 
does *not* map it there
jit_text_write() -> write to that handle
jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)
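
A hypothetical caller would then look something like this; every name and 
signature below is guessed from the description above and none of it exists 
in the series:

static void *jit_load(const void *insns, size_t len)
{
	void *va;
	void *handle = jit_text_alloc(len, &va);   /* VA chosen but not yet mapped */

	jit_text_write(handle, 0, insns, len);     /* populate through the handle */
	jit_text_map(handle);                      /* map RX, sync if the arch needs it */
	return va;                                 /* now safe to execute */
}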

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-17 Thread Andy Lutomirski
On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
>
> module_alloc() is used everywhere as a means to allocate memory for code.
>
> Besides being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF, to modules
> and puts the burden of code allocation on the modules code.
>
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
>
> Start splitting code allocation from modules by introducing
> execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
>
> Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> module_alloc() and execmem_free() and jit_free() are replacements of
> module_memfree() to allow updating all call sites to use the new APIs.
>
> The intended semantics for the new allocation APIs:
>
> * execmem_text_alloc() should be used to allocate memory that must reside
>   close to the kernel image, like loadable kernel modules and generated
>   code that is restricted by relative addressing.
>
> * jit_text_alloc() should be used to allocate memory for generated code
>   when there are no restrictions for the code placement. For
>   architectures that require that any code is within certain distance
>   from the kernel image, jit_text_alloc() will be essentially aliased to
>   execmem_text_alloc().
>

Is there anything in this series to help users do the appropriate 
synchronization when they actually populate the allocated memory with code?  See 
here, for example:

https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u


Re: [PATCH v2 13/18] uaccess: generalize access_ok()

2022-02-18 Thread Andy Lutomirski
On Fri, Feb 18, 2022 at 1:30 AM David Laight  wrote:
>
> From: Andy Lutomirski
> > Sent: 17 February 2022 19:15
> ...
> > This isn't actually optimal.  On x86, TASK_SIZE_MAX is a bizarre
> > constant that has a very specific value to work around a bug^Wdesign
> > error^Wfeature of Intel CPUs.  TASK_SIZE_MAX is the maximum address at
> > which userspace is permitted to allocate memory, but there is a huge
> > gap between user and kernel addresses, and any value in the gap would
> > be adequate for the comparison.  If we wanted to optimize this, simply
> > checking the high bit (which x86 can do without any immediate
> > constants at all) would be sufficient and, for an access known to fit
> > in 32 bits, one could get even fancier and completely ignore the size
> > of the access.  (For accesses not known to fit in 32 bits, I suspect
> > some creativity could still come up with a construction that's
> > substantially faster than the one in your patch.)
> >
> > So there's plenty of room for optimization here.
> >
> > (This is not in any respect a NAK -- it's just an observation that
> > this could be even better.)
>
> For 64bit arch that use the top bit to separate user/kernel
> you can test '((addr | size) >> 62)'.
> The compiler optimises out constant sizes.
>
> This has all been mentioned a lot of times.
> You do get different fault types.
>
> OTOH an explicit check for constant size (less than something big)
> can use the cheaper test of the sign bit.
> Big constant sizes could be compile time errors.

The different fault type issue may well be a real problem.  Right now
the core x86 fault code reserves the right to grouch if we get #GP
instead of #PF.  We could change that.
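
For illustration, a sketch of the check David describes; it assumes a 64-bit 
layout where bits 62-63 are clear for every valid user address and every sane 
size, and it is not the code in Arnd's patch:

static inline bool access_ok_sketch(const void __user *ptr, unsigned long size)
{
	/* a kernel pointer or an absurd size sets one of the top two bits */
	return (((unsigned long)ptr | size) >> 62) == 0;
}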


Re: [PATCH v2 13/18] uaccess: generalize access_ok()

2022-02-17 Thread Andy Lutomirski
On Wed, Feb 16, 2022 at 5:19 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> There are many different ways that access_ok() is defined across
> architectures, but in the end, they all just compare against the
> user_addr_max() value or they accept anything.
>
> Provide one definition that works for most architectures, checking
> against TASK_SIZE_MAX for user processes or skipping the check inside
> of uaccess_kernel() sections.
>
> For architectures without CONFIG_SET_FS(), this should be the fastest
> check, as it comes down to a single comparison of a pointer against a
> compile-time constant, while the architecture specific versions tend to
> do something more complex for historic reasons or get something wrong.

This isn't actually optimal.  On x86, TASK_SIZE_MAX is a bizarre
constant that has a very specific value to work around a bug^Wdesign
error^Wfeature of Intel CPUs.  TASK_SIZE_MAX is the maximum address at
which userspace is permitted to allocate memory, but there is a huge
gap between user and kernel addresses, and any value in the gap would
be adequate for the comparison.  If we wanted to optimize this, simply
checking the high bit (which x86 can do without any immediate
constants at all) would be sufficient and, for an access known to fit
in 32 bits, one could get even fancier and completely ignore the size
of the access.  (For accesses not known to fit in 32 bits, I suspect
some creativity could still come up with a construction that's
substantially faster than the one in your patch.)

So there's plenty of room for optimization here.

(This is not in any respect a NAK -- it's just an observation that
this could be even better.)
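
For illustration, a sketch of the "just check the high bit" idea for x86-64; 
this is only the observation above, not what the patch does, and it ignores 
the access size entirely:

static inline bool user_ptr_plausible(const void __user *ptr)
{
	/* high bit clear => the pointer is below the canonical hole */
	return (long)ptr >= 0;
}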


[PATCH 07/23] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2022-01-08 Thread Andy Lutomirski
The old sync_core_before_usermode() comments suggested that a
non-icache-syncing return-to-usermode instruction is x86-specific and that
all other architectures automatically notice cross-modified code on return
to userspace.

This is misleading.  The incantation needed to modify code from one
CPU and execute it on another CPU is highly architecture dependent.
On x86, according to the SDM, one must modify the code, issue SFENCE
if the modification was WC or nontemporal, and then issue a "serializing
instruction" on the CPU that will execute the code.  membarrier() can do
the latter.

On arm, arm64 and powerpc, one must flush the icache and then flush the
pipeline on the target CPU, although the CPU manuals don't necessarily use
this language.

So let's drop any pretense that we can have a generic way to define or
implement membarrier's SYNC_CORE operation and instead require all
architectures to define the helper and supply their own documentation as to
how to use it.  This means x86, arm64, and powerpc for now.  Let's also
rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.

(It may well be the case that, on real x86 processors, synchronizing the
 icache (which requires no action at all) and "flushing the pipeline" is
 sufficient, but trying to use this language would be confusing at best.
 LFENCE does something awfully like "flushing the pipeline", but the SDM
 does not permit LFENCE as an alternative to a "serializing instruction"
 for this purpose.)
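
For illustration, a userspace sketch of the x86 recipe above; it assumes the
process has already registered with
MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE (otherwise the command
fails with -EPERM):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static void publish_code(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);	/* plain stores; SFENCE is only needed for WC/NT stores */
	syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0, 0);
	/* any thread of this process may now jump into dst */
}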

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mathieu Desnoyers 
Cc: Nicholas Piggin 
Cc: Peter Zijlstra 
Cc: x...@kernel.org
Cc: sta...@vger.kernel.org
Acked-by: Will Deacon  # for arm64
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, 
*_SYNC_CORE")
Signed-off-by: Andy Lutomirski 
---
 .../membarrier-sync-core/arch-support.txt | 69 ++-
 arch/arm/include/asm/membarrier.h | 21 ++
 arch/arm64/include/asm/membarrier.h   | 19 +
 arch/powerpc/include/asm/membarrier.h | 10 +++
 arch/x86/Kconfig  |  1 -
 arch/x86/include/asm/membarrier.h | 25 +++
 arch/x86/include/asm/sync_core.h  | 20 --
 arch/x86/kernel/alternative.c |  2 +-
 arch/x86/kernel/cpu/mce/core.c|  2 +-
 arch/x86/mm/tlb.c |  3 +-
 drivers/misc/sgi-gru/grufault.c   |  2 +-
 drivers/misc/sgi-gru/gruhandles.c |  2 +-
 drivers/misc/sgi-gru/grukservices.c   |  2 +-
 include/linux/sched/mm.h  |  1 -
 include/linux/sync_core.h | 21 --
 init/Kconfig  |  3 -
 kernel/sched/membarrier.c | 14 +++-
 17 files changed, 115 insertions(+), 102 deletions(-)
 create mode 100644 arch/arm/include/asm/membarrier.h
 create mode 100644 arch/arm64/include/asm/membarrier.h
 create mode 100644 arch/x86/include/asm/membarrier.h
 delete mode 100644 include/linux/sync_core.h

diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt 
b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 883d33b265d6..4009b26bf5c3 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,51 +5,26 @@
 #
 # Architecture requirements
 #
-# * arm/arm64/powerpc
 #
-# Rely on implicit context synchronization as a result of exception return
-# when returning from IPI handler, and when returning to user-space.
-#
-# * x86
-#
-# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
-# instruction is core serializing, but not SYSEXIT.
-#
-# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it can return to user-space through either SYSRETL (compat code),
-# SYSRETQ, or IRET.
-#
-# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
-# instead on write_cr3() performed by switch_mm() to provide core serialization
-# after changing the current mm, and deal with the special case of kthread ->
-# uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
-#
----
-| arch |status|
----
-|   alpha: | TODO |
-| arc: | TODO |
-| arm: |  ok  |
-|   arm64: |  ok  |
-

[PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch

2022-01-08 Thread Andy Lutomirski
powerpc did the following on some, but not all, paths through
switch_mm_irqs_off():

	/*
	 * Only need the full barrier when switching between processes.
	 * Barrier when switching from kernel to userspace is not
	 * required here, given that it is implied by mmdrop(). Barrier
	 * when switching from userspace to kernel is not needed after
	 * store to rq->curr.
	 */
	if (likely(!(atomic_read(&next->membarrier_state) &
		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
		return;

This is puzzling: if !prev, then one might expect that we are switching
from kernel to user, not user to kernel, which is inconsistent with the
comment.  But this is all nonsense, because the one and only caller would
never have prev == NULL and would, in fact, OOPS if prev == NULL.

In any event, this code is unnecessary, since the new generic
membarrier_finish_switch_mm() provides the same barrier without arch help.

arch/powerpc/include/asm/membarrier.h remains as an empty header,
because a later patch in this series will add code to it.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Signed-off-by: Andy Lutomirski 
---
 arch/powerpc/include/asm/membarrier.h | 24 
 arch/powerpc/mm/mmu_context.c |  1 -
 2 files changed, 25 deletions(-)

diff --git a/arch/powerpc/include/asm/membarrier.h 
b/arch/powerpc/include/asm/membarrier.h
index de7f79157918..b90766e95bd1 100644
--- a/arch/powerpc/include/asm/membarrier.h
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -1,28 +1,4 @@
 #ifndef _ASM_POWERPC_MEMBARRIER_H
 #define _ASM_POWERPC_MEMBARRIER_H
 
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-struct mm_struct *next,
-struct task_struct *tsk)
-{
-   /*
-* Only need the full barrier when switching between processes.
-* Barrier when switching from kernel to userspace is not
-* required here, given that it is implied by mmdrop(). Barrier
-* when switching from userspace to kernel is not needed after
-* store to rq->curr.
-*/
-   if (IS_ENABLED(CONFIG_SMP) &&
-   likely(!(atomic_read(&next->membarrier_state) &
-(MEMBARRIER_STATE_PRIVATE_EXPEDITED |
- MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
-   return;
-
-   /*
-* The membarrier system call requires a full memory barrier
-* after storing to rq->curr, before going back to user-space.
-*/
-   smp_mb();
-}
-
 #endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 74246536b832..5f2daa6b0497 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -84,7 +84,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
asm volatile ("dssall");
 
if (!new_on_cpu)
-   membarrier_arch_switch_mm(prev, next, tsk);
 
/*
 * The actual HW switching method differs between the various
-- 
2.33.1



Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-19 Thread Andy Lutomirski



On Fri, Jun 18, 2021, at 11:02 PM, Nicholas Piggin wrote:
> Excerpts from Mathieu Desnoyers's message of June 19, 2021 6:09 am:
> > - On Jun 18, 2021, at 3:58 PM, Andy Lutomirski l...@kernel.org wrote:
> > 
> >> On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
> >>> - On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote:
> >>> 
> >>> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
> >>> > 
> >>> >> Please change back this #ifndef / #else / #endif within function for
> >>> >> 
> >>> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
> >>> >>   ...
> >>> >> } else {
> >>> >>   ...
> >>> >> }
> >>> >> 
> >>> >> I don't think mixing up preprocessor and code logic makes it more 
> >>> >> readable.
> >>> > 
> >>> > I agree, but I don't know how to make the result work well.
> >>> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
> >>> > case, so either I need to fake up a definition or use #ifdef.
> >>> > 
> >>> > If I faked up a definition, I would want to assert, at build time, that
> >>> > it isn't called.  I don't think we can do:
> >>> > 
> >>> > static void membarrier_sync_core_before_usermode()
> >>> > {
> >>> >BUILD_BUG_IF_REACHABLE();
> >>> > }
> >>> 
> >>> Let's look at the context here:
> >>> 
> >>> static void ipi_sync_core(void *info)
> >>> {
> >>> []
> >>> membarrier_sync_core_before_usermode()
> >>> }
> >>> 
> >>> ^ this can be within #ifdef / #endif
> >>> 
> >>> static int membarrier_private_expedited(int flags, int cpu_id)
> >>> [...]
> >>>if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> >>> return -EINVAL;
> >>> if (!(atomic_read(&mm->membarrier_state) &
> >>>   MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
> >>> return -EPERM;
> >>> ipi_func = ipi_sync_core;
> >>> 
> >>> All we need to make the line above work is to define an empty 
> >>> ipi_sync_core
> >>> function in the #else case after the ipi_sync_core() function definition.
> >>> 
> >>> Or am I missing your point ?
> >> 
> >> Maybe?
> >> 
> >> My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the 
> >> core.
> >> I would be fine with that if I could have the compiler statically verify 
> >> that
> >> it’s not called, but I’m uncomfortable having it there if the 
> >> implementation is
> >> actively incorrect.
> > 
> > I see. Another approach would be to implement a "setter" function to 
> > populate
> > "ipi_func". That setter function would return -EINVAL in its #ifndef 
> > CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> > implementation.
> 
> I still don't get the problem with my suggestion. Sure the 
> ipi is a "lie", but it doesn't get used. That's how a lot of
> ifdef folding works out. E.g.,
> 
> diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
> index b5add64d9698..54cb32d064af 100644
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -5,6 +5,15 @@
>   * membarrier system call
>   */
>  #include "sched.h"
> +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
> +#include 
> +#else
> +static inline void membarrier_sync_core_before_usermode(void)
> +{
> + compiletime_assert(0, "architecture does not implement 
> membarrier_sync_core_before_usermode");
> +}
> +

With the assert there, I’m fine with this. Let me see if the result builds.

> +#endif
>  
>  /*
>   * For documentation purposes, here are some membarrier ordering
> 


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-18 Thread Andy Lutomirski



On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote:
> - On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote:
> 
> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:
> > 
> >> Please change back this #ifndef / #else / #endif within function for
> >> 
> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
> >>   ...
> >> } else {
> >>   ...
> >> }
> >> 
> >> I don't think mixing up preprocessor and code logic makes it more readable.
> > 
> > I agree, but I don't know how to make the result work well.
> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
> > case, so either I need to fake up a definition or use #ifdef.
> > 
> > If I faked up a definition, I would want to assert, at build time, that
> > it isn't called.  I don't think we can do:
> > 
> > static void membarrier_sync_core_before_usermode()
> > {
> >BUILD_BUG_IF_REACHABLE();
> > }
> 
> Let's look at the context here:
> 
> static void ipi_sync_core(void *info)
> {
> []
> membarrier_sync_core_before_usermode()
> }
> 
> ^ this can be within #ifdef / #endif
> 
> static int membarrier_private_expedited(int flags, int cpu_id)
> [...]
>if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
> return -EINVAL;
> if (!(atomic_read(&mm->membarrier_state) &
>   MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
> return -EPERM;
> ipi_func = ipi_sync_core;
> 
> All we need to make the line above work is to define an empty ipi_sync_core
> function in the #else case after the ipi_sync_core() function definition.
> 
> Or am I missing your point ?

Maybe?

My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the 
core.  I would be fine with that if I could have the compiler statically verify 
that it’s not called, but I’m uncomfortable having it there if the 
implementation is actively incorrect.


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-17 Thread Andy Lutomirski
On 6/17/21 8:16 AM, Mathieu Desnoyers wrote:
> - On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote:
> 
> [...]
> 
>> +# An architecture that wants to support
>> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what 
>> it
>> +# is supposed to do and implement membarrier_sync_core_before_usermode() to
>> +# make it do that.  Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via
>> +# Kconfig.Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a
>> +# fantastic API and may not make sense on all architectures.  Once an
>> +# architecture meets these requirements,
> 
> Can we please remove the editorial comment about the quality of the membarrier
> sync-core's API ?

Done
>> +#
>> +# On x86, a program can safely modify code, issue
>> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, 
>> via
>> +# the modified address or an alias, from any thread in the calling process.
>> +#
>> +# On arm64, a program can modify code, flush the icache as needed, and issue
>> +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context 
>> synchronizing
>> +# event", aka pipeline flush on all CPUs that might run the calling process.
>> +# Then the program can execute the modified code as long as it is executed
>> +# from an address consistent with the icache flush and the CPU's cache type.
>> +#
>> +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE
>> +# similarly to arm64.  It would be nice if the powerpc maintainers could
>> +# add a more clear explanation.
> 
> We should document the requirements on ARMv7 as well.

Done.

> 
> Thanks,
> 
> Mathieu
> 



Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-17 Thread Andy Lutomirski
On 6/17/21 7:47 AM, Mathieu Desnoyers wrote:

> Please change back this #ifndef / #else / #endif within function for
> 
> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) {
>   ...
> } else {
>   ...
> }
> 
> I don't think mixing up preprocessor and code logic makes it more readable.

I agree, but I don't know how to make the result work well.
membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED
case, so either I need to fake up a definition or use #ifdef.

If I faked up a definition, I would want to assert, at build time, that
it isn't called.  I don't think we can do:

static void membarrier_sync_core_before_usermode()
{
BUILD_BUG_IF_REACHABLE();
}



Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-16 Thread Andy Lutomirski
On Wed, Jun 16, 2021, at 3:20 AM, Will Deacon wrote:
> 
> For the arm64 bits (docs and asm/sync_core.h):
> 
> Acked-by: Will Deacon 
> 

Thanks.

Per Nick's suggestion, I renamed the header to membarrier.h.  Unless I hear 
otherwise, I'll keep the ack.

> Will
> 


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-16 Thread Andy Lutomirski
On Wed, Jun 16, 2021, at 11:52 AM, Andy Lutomirski wrote:
> On 6/15/21 9:45 PM, Nicholas Piggin wrote:
> > Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
> >> The old sync_core_before_usermode() comments suggested that a 
> >> non-icache-syncing
> >> return-to-usermode instruction is x86-specific and that all other
> >> architectures automatically notice cross-modified code on return to
> >> userspace.
> 
> >> +/*
> >> + * XXX: can a powerpc person put an appropriate comment here?
> >> + */
> >> +static inline void membarrier_sync_core_before_usermode(void)
> >> +{
> >> +}
> >> +
> >> +#endif /* _ASM_POWERPC_SYNC_CORE_H */
> > 
> > powerpc's can just go in asm/membarrier.h
> 
> $ ls arch/powerpc/include/asm/membarrier.h
> ls: cannot access 'arch/powerpc/include/asm/membarrier.h': No such file
> or directory

Which is because I deleted it.  Duh.  I'll clean this up.

> 
> 
> > 
> > /*
> >  * The RFI family of instructions are context synchronising, and
> >  * that is how we return to userspace, so nothing is required here.
> >  */
> 
> Thanks!
> 


Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-16 Thread Andy Lutomirski
On 6/15/21 9:45 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 16, 2021 1:21 pm:
>> The old sync_core_before_usermode() comments suggested that a 
>> non-icache-syncing
>> return-to-usermode instruction is x86-specific and that all other
>> architectures automatically notice cross-modified code on return to
>> userspace.

>> +/*
>> + * XXX: can a powerpc person put an appropriate comment here?
>> + */
>> +static inline void membarrier_sync_core_before_usermode(void)
>> +{
>> +}
>> +
>> +#endif /* _ASM_POWERPC_SYNC_CORE_H */
> 
> powerpc's can just go in asm/membarrier.h

$ ls arch/powerpc/include/asm/membarrier.h
ls: cannot access 'arch/powerpc/include/asm/membarrier.h': No such file
or directory


> 
> /*
>  * The RFI family of instructions are context synchronising, and
>  * that is how we return to userspace, so nothing is required here.
>  */

Thanks!


[PATCH 6/8] powerpc/membarrier: Remove special barrier on mm switch

2021-06-15 Thread Andy Lutomirski
powerpc did the following on some, but not all, paths through
switch_mm_irqs_off():

	/*
	 * Only need the full barrier when switching between processes.
	 * Barrier when switching from kernel to userspace is not
	 * required here, given that it is implied by mmdrop(). Barrier
	 * when switching from userspace to kernel is not needed after
	 * store to rq->curr.
	 */
	if (likely(!(atomic_read(&next->membarrier_state) &
		     (MEMBARRIER_STATE_PRIVATE_EXPEDITED |
		      MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
		return;

This is puzzling: if !prev, then one might expect that we are switching
from kernel to user, not user to kernel, which is inconsistent with the
comment.  But this is all nonsense, because the one and only caller would
never have prev == NULL and would, in fact, OOPS if prev == NULL.

In any event, this code is unnecessary, since the new generic
membarrier_finish_switch_mm() provides the same barrier without arch help.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin 
Cc: Mathieu Desnoyers 
Cc: Peter Zijlstra 
Signed-off-by: Andy Lutomirski 
---
 arch/powerpc/include/asm/membarrier.h | 27 ---
 arch/powerpc/mm/mmu_context.c |  2 --
 2 files changed, 29 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/arch/powerpc/include/asm/membarrier.h 
b/arch/powerpc/include/asm/membarrier.h
deleted file mode 100644
index 6e20bb5c74ea..
--- a/arch/powerpc/include/asm/membarrier.h
+++ /dev/null
@@ -1,27 +0,0 @@
-#ifndef _ASM_POWERPC_MEMBARRIER_H
-#define _ASM_POWERPC_MEMBARRIER_H
-
-static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
-struct mm_struct *next,
-struct task_struct *tsk)
-{
-   /*
-* Only need the full barrier when switching between processes.
-* Barrier when switching from kernel to userspace is not
-* required here, given that it is implied by mmdrop(). Barrier
-* when switching from userspace to kernel is not needed after
-* store to rq->curr.
-*/
-   if (likely(!(atomic_read(&next->membarrier_state) &
-(MEMBARRIER_STATE_PRIVATE_EXPEDITED |
- MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev))
-   return;
-
-   /*
-* The membarrier system call requires a full memory barrier
-* after storing to rq->curr, before going back to user-space.
-*/
-   smp_mb();
-}
-
-#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index a857af401738..8daa95b3162b 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -85,8 +85,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 
if (new_on_cpu)
radix_kvm_prefetch_workaround(next);
-   else
-   membarrier_arch_switch_mm(prev, next, tsk);
 
/*
 * The actual HW switching method differs between the various
-- 
2.31.1



[PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation

2021-06-15 Thread Andy Lutomirski
The old sync_core_before_usermode() comments suggested that a non-icache-syncing
return-to-usermode instruction is x86-specific and that all other
architectures automatically notice cross-modified code on return to
userspace.

This is misleading.  The incantation needed to modify code from one
CPU and execute it on another CPU is highly architecture dependent.
On x86, according to the SDM, one must modify the code, issue SFENCE
if the modification was WC or nontemporal, and then issue a "serializing
instruction" on the CPU that will execute the code.  membarrier() can do
the latter.

On arm64 and powerpc, one must flush the icache and then flush the pipeline
on the target CPU, although the CPU manuals don't necessarily use this
language.

So let's drop any pretense that we can have a generic way to define or
implement membarrier's SYNC_CORE operation and instead require all
architectures to define the helper and supply their own documentation as to
how to use it.  This means x86, arm64, and powerpc for now.  Let's also
rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.

(It may well be the case that, on real x86 processors, synchronizing the
 icache (which requires no action at all) and "flushing the pipeline" is
 sufficient, but trying to use this language would be confusing at best.
 LFENCE does something awfully like "flushing the pipeline", but the SDM
 does not permit LFENCE as an alternative to a "serializing instruction"
 for this purpose.)

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mathieu Desnoyers 
Cc: Nicholas Piggin 
Cc: Peter Zijlstra 
Cc: x...@kernel.org
Cc: sta...@vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, 
*_SYNC_CORE")
Signed-off-by: Andy Lutomirski 
---
 .../membarrier-sync-core/arch-support.txt | 68 ++-
 arch/arm64/include/asm/sync_core.h| 19 ++
 arch/powerpc/include/asm/sync_core.h  | 14 
 arch/x86/Kconfig  |  1 -
 arch/x86/include/asm/sync_core.h  |  7 +-
 arch/x86/kernel/alternative.c |  2 +-
 arch/x86/kernel/cpu/mce/core.c|  2 +-
 arch/x86/mm/tlb.c |  3 +-
 drivers/misc/sgi-gru/grufault.c   |  2 +-
 drivers/misc/sgi-gru/gruhandles.c |  2 +-
 drivers/misc/sgi-gru/grukservices.c   |  2 +-
 include/linux/sched/mm.h  |  1 -
 include/linux/sync_core.h | 21 --
 init/Kconfig  |  3 -
 kernel/sched/membarrier.c | 15 ++--
 15 files changed, 75 insertions(+), 87 deletions(-)
 create mode 100644 arch/arm64/include/asm/sync_core.h
 create mode 100644 arch/powerpc/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h

diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt 
b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 883d33b265d6..41c9ebcb275f 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,51 +5,25 @@
 #
 # Architecture requirements
 #
-# * arm/arm64/powerpc
 #
-# Rely on implicit context synchronization as a result of exception return
-# when returning from IPI handler, and when returning to user-space.
-#
-# * x86
-#
-# x86-32 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
-# instruction is core serializing, but not SYSEXIT.
-#
-# x86-64 uses IRET as return from interrupt, which takes care of the IPI.
-# However, it can return to user-space through either SYSRETL (compat code),
-# SYSRETQ, or IRET.
-#
-# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
-# instead on write_cr3() performed by switch_mm() to provide core serialization
-# after changing the current mm, and deal with the special case of kthread ->
-# uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
-#
----
-| arch |status|
----
-|   alpha: | TODO |
-| arc: | TODO |
-| arm: |  ok  |
-|   arm64: |  ok  |
-|csky: | TODO |
-|   h8300: | TODO |
-| hexagon: | TODO |
-|ia64: | TODO |
-|m68k: | TODO |
-|  microblaze: | TODO |
-|mips: | TODO |
-

Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

2021-06-15 Thread Andy Lutomirski
On 6/14/21 5:55 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 15, 2021 2:20 am:
>> Replying to several emails at once...
>>

> 
> So the only documentation relating to the current active_mm value or 
> refcounting is that it may not match what the x86 specific code is 
> doing?
> 
> All this complexity you accuse me of adding is entirely in x86 code.
> On other architectures, it's very simple and understandable, and 
> documented. I don't know how else to explain this.

And the docs you referred me to will be *wrong* with your patches
applied.  They are your patches, and they break the semantics.

> 
>
>> With your patch applied:
>>
>>  To support all that, the "struct mm_struct" now has two counters: a
>>  "mm_users" counter that is how many "real address space users" there 
>> are,
>>  and a "mm_count" counter that is the number of "lazy" users (ie 
>> anonymous
>>  users) plus one if there are any real users.
>>
>> isn't even true any more.
>
> Well yeah but the active_mm concept hasn't changed. The refcounting 
> change is hopefully reasonably documented?
>>
>> active_mm is *only* refcounting in the core code.  See below.
> 
> It's just not. It's passed in to switch_mm. Most architectures except 
> for x86 require this.
> 

Sorry, I was obviously blatantly wrong.  Let me say it differently.
active_mm does two things:

1. It keeps an mm alive via a refcounting scheme.

2. It passes a parameter to switch_mm() to remind the arch code what the
most recently switched-to mm was.

#2 is basically useless.  An architecture can handle *that* with a
percpu variable and two lines of code.

If you are getting rid of functionality #1 in the core code via a new
arch opt-out, please get rid of #2 as well.  *Especially* because, when
the arch asks the core code to stop refcounting active_mm, there is
absolutely nothing guaranteeing that the parameter that the core code
will pass to switch_mm() points to memory that hasn't been freed and
reused for another purpose.

> I might not have been clear. Core code doesn't need active_mm if 
> active_mm somehow goes away. I'm saying active_mm can't go away because
> it's needed to support (most) archs that do lazy tlb mm switching.
>
> The part I don't understand is when you say it can just go away. How? 
>>
>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>>  struct mm_struct *active_mm;
>> #endif
> 
> Thanks for returning the snark.

That wasn't intended to be snark.  It was a literal suggestion, and, in
fact, it's *exactly* what I'm asking you to do to fix your patches.

>> I don't understand what contract you're talking about.  The core code
>> maintains an active_mm counter and keeps that mm_struct from
>> disappearing.  That's *it*.  The core code does not care that active_mm
>> is active, and x86 provides evidence of that -- on x86,
>> current->active_mm may well be completely unused.
> 
> I already acknowledged archs can do their own thing under the covers if 
> they want.

No.

I am *not* going to write x86 patches that use your feature in a way
that will cause the core code to pass around a complete garbage pointer
to an mm_struct that is completely unreferenced and may well be deleted.
 Doing something private in arch code is one thing.  Doing something
that causes basic common sense to be violated in core code is another
thing entirely.

>>
>> static inline void do_switch_mm(struct task_struct *prev_task, ...)
>> {
>> #ifdef CONFIG_MMU_TLB_REFCOUNT
>>  switch_mm(...);
>> #else
>>  switch_mm(fewer parameters);
>>  /* or pass NULL or whatever. */
>> #endif
>> }
> 
> And prev_task comes from active_mm, ergo core code requires the concept 
> of active_mm.

I don't see why this concept is hard.  We are literally quibbling about
this single line of core code in kernel/sched/core.c:

switch_mm_irqs_off(prev->active_mm, next->mm, next);

This is not rocket science.  There are any number of ways to fix it.
For example:

#ifdef CONFIG_MMU_TLB_REFCOUNT
switch_mm_irqs_off(prev->active_mm, next->mm, next);
#else
switch_mm_irqs_off(NULL, next->mm, next);
#endif

If you don't like the NULL, then make the signature depend on the config
option.

What you may not do is what your patch actually does:

switch_mm_irqs_off(random invalid pointer, next->mm, next);

Now maybe it works because powerpc's lifecycle rules happen to keep
active_mm alive, but I haven't verified it.  x86's lifecycle rules *do not*.

>>>
>>> That's understandable, but please redirect your objections to the proper 
>>> place. git blame suggests 3d28ebceaffab.
>>
>> Thanks for the snark.
> 
> Is it not true? I don't mean that one patch causing all the x86 
> complexity or even making the situation worse itself. But you seem to be 
> asking my series to do things that really apply to the x86 changes over
> the past few years that got us here.

With just my patch from 4.15 applied, task->active_mm 

Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

2021-06-14 Thread Andy Lutomirski
Replying to several emails at once...


On 6/13/21 10:21 PM, Nicholas Piggin wrote:
> Excerpts from Nicholas Piggin's message of June 14, 2021 2:47 pm:
>> Excerpts from Nicholas Piggin's message of June 14, 2021 2:14 pm:
>>> Excerpts from Andy Lutomirski's message of June 14, 2021 1:52 pm:
 On 6/13/21 5:45 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>> when it is context switched. This can be disabled by architectures that
>>> don't require this refcounting if they clean up lazy tlb mms when the
>>> last refcount is dropped. Currently this is always enabled, which is
>>> what existing code does, so the patch is effectively a no-op.
>>>
>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>
>> I am in favor of this approach, but I would be a lot more comfortable
>> with the resulting code if task->active_mm were at least better
>> documented and possibly even guarded by ifdefs.
>
> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
> I don't think anything has changed in 20 years, I don't know what more
> is needed, but if you can add to documentation that would be nice. Maybe
> moving a bit of that into .c and .h files?
>

 Quoting from that file:

   - however, we obviously need to keep track of which address space we
 "stole" for such an anonymous user. For that, we have "tsk->active_mm",
 which shows what the currently active address space is.

 This isn't even true right now on x86.
>>>
>>> From the perspective of core code, it is. x86 might do something crazy 
>>> with it, but it has to make it appear this way to non-arch code that
>>> uses active_mm.
>>>
>>> Is x86's scheme documented?

arch/x86/include/asm/tlbflush.h documents it a bit:

/*
 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
 * are on.  This means that it may not match current->active_mm,
 * which will contain the previous user mm when we're in lazy TLB
 * mode even if we've already switched back to swapper_pg_dir.
 *
 * During switch_mm_irqs_off(), loaded_mm will be set to
 * LOADED_MM_SWITCHING during the brief interrupts-off window
 * when CR3 and loaded_mm would otherwise be inconsistent.  This
 * is for nmi_uaccess_okay()'s benefit.
 */



>>>
 With your patch applied:

  To support all that, the "struct mm_struct" now has two counters: a
  "mm_users" counter that is how many "real address space users" there are,
  and a "mm_count" counter that is the number of "lazy" users (ie anonymous
  users) plus one if there are any real users.

 isn't even true any more.
>>>
>>> Well yeah but the active_mm concept hasn't changed. The refcounting 
>>> change is hopefully reasonably documented?

active_mm is *only* refcounting in the core code.  See below.


 I looked through all active_mm references in core code.  We have:

 kernel/sched/core.c: it's all refcounting, although it's a bit tangled
 with membarrier.

 kernel/kthread.c: same.  refcounting and membarrier stuff.

 kernel/exit.c: exit_mm() a BUG_ON().

 kernel/fork.c: initialization code and a warning.

 kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went 
 away.

 fs/exec.c: nothing of interest
>>>
>>> I might not have been clear. Core code doesn't need active_mm if 
>>> active_mm somehow goes away. I'm saying active_mm can't go away because
>>> it's needed to support (most) archs that do lazy tlb mm switching.
>>>
>>> The part I don't understand is when you say it can just go away. How? 

#ifdef CONFIG_MMU_TLB_REFCOUNT
struct mm_struct *active_mm;
#endif

>>>
 I didn't go through drivers, but I maintain my point.  active_mm is
 there for refcounting.  So please don't just make it even more confusing
 -- do your performance improvement, but improve the code at the same
 time: get rid of active_mm, at least on architectures that opt out of
 the refcounting.
>>>
>>> powerpc opts out of the refcounting and can not "get rid of active_mm".
>>> Not even in theory.
>>
>> That is to say, it does do a type of reference management that requires 
>> active_mm so you can argue it has not entirely opted out of refcounting.
>> But we're not just doing refcounting for the sake of refcounting! That
>> would make no sense.
>>
>> active_mm is required because that's the mm that we have switched to 
>> (from core code's perspective), and it is integral to know when to 
>> switch to a different mm. See how active_mm is a fundamental concept
>> in core code? It's part of the contract between core code and the
>> arch mm context managem

Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

2021-06-13 Thread Andy Lutomirski
On 6/13/21 5:45 PM, Nicholas Piggin wrote:
> Excerpts from Andy Lutomirski's message of June 9, 2021 2:20 am:
>> On 6/4/21 6:42 PM, Nicholas Piggin wrote:
>>> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
>>> when it is context switched. This can be disabled by architectures that
>>> don't require this refcounting if they clean up lazy tlb mms when the
>>> last refcount is dropped. Currently this is always enabled, which is
>>> what existing code does, so the patch is effectively a no-op.
>>>
>>> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.
>>
>> I am in favor of this approach, but I would be a lot more comfortable
>> with the resulting code if task->active_mm were at least better
>> documented and possibly even guarded by ifdefs.
> 
> active_mm is fairly well documented in Documentation/active_mm.rst IMO.
> I don't think anything has changed in 20 years, I don't know what more
> is needed, but if you can add to documentation that would be nice. Maybe
> moving a bit of that into .c and .h files?
> 

Quoting from that file:

  - however, we obviously need to keep track of which address space we
"stole" for such an anonymous user. For that, we have "tsk->active_mm",
which shows what the currently active address space is.

This isn't even true right now on x86.  With your patch applied:

 To support all that, the "struct mm_struct" now has two counters: a
 "mm_users" counter that is how many "real address space users" there are,
 and a "mm_count" counter that is the number of "lazy" users (ie anonymous
 users) plus one if there are any real users.

isn't even true any more.


>> x86 bare metal currently does not need the core lazy mm refcounting, and
>> x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
>> if lazy mm refcounting were configured out, ->active_mm could become a
>> dangling pointer, and this makes me extremely uncomfortable.
>>
>> So I tend to think that, depending on config, the core code should
>> either keep ->active_mm [1] alive or get rid of it entirely.
> 
> I don't actually know what you mean.
> 
> core code needs the concept of an "active_mm". This is the mm that your 
> kernel threads are using, even in the unmerged CONFIG_LAZY_TLB=n patch,
> active_mm still points to init_mm for kernel threads.

Core code does *not* need this concept.  First, it's wrong on x86 since
at least 4.15.  Any core code that actually assumes that ->active_mm is
"active" for any sensible definition of the word active is wrong.
Fortunately there is no such code.

I looked through all active_mm references in core code.  We have:

kernel/sched/core.c: it's all refcounting, although it's a bit tangled
with membarrier.

kernel/kthread.c: same.  refcounting and membarrier stuff.

kernel/exit.c: exit_mm() and a BUG_ON().

kernel/fork.c: initialization code and a warning.

kernel/cpu.c: cpu offline stuff.  wouldn't be needed if active_mm went away.

fs/exec.c: nothing of interest

I didn't go through drivers, but I maintain my point.  active_mm is
there for refcounting.  So please don't just make it even more confusing
-- do your performance improvement, but improve the code at the same
time: get rid of active_mm, at least on architectures that opt out of
the refcounting.



> 
> We could hide that idea behind an active_mm() function that would always 
> return &init_mm if mm==NULL, but you still have the concept of an active
> mm and a pointer that callers must not access after free (because some
> cases will be CONFIG_LAZY_TLB=y).
> 
>> [1] I don't really think it belongs in task_struct at all.  It's not a
>> property of the task.  It's the *per-cpu* mm that the core code is
>> keeping alive for lazy purposes.  How about consolidating it with the
>> copy in rq?
> 
> I agree it's conceptually a per-cpu property. I don't know why it was 
> done this way, maybe it was just convenient and works well for mm and 
> active_mm to be adjacent. Linus might have a better insight.
> 
>> I guess the short summary of my opinion is that I like making this
>> configurable, but I do not like the state of the code.
> 
> I don't think I'd object to moving active_mm to rq and converting all
> usages to active_mm() while we're there, it would make things a bit
> more configurable. But I don't see it making core code fundamentally
> less complex... if you're referring to the x86 mm switching monstrosity,
> then that's understandable, but I admit I haven't spent enough time
> looking at it to make a useful comment. A patch would be enlightening,
> I have the leftover CONFIG_LAZY_TLB=n patch if you were thinking of 
> building on that I can send it to you.
> 
> Thanks,
> Nick
> 



Re: [PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

2021-06-08 Thread Andy Lutomirski
On 6/4/21 6:42 PM, Nicholas Piggin wrote:
> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
> when it is context switched. This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the
> last refcount is dropped. Currently this is always enabled, which is
> what existing code does, so the patch is effectively a no-op.
> 
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

I am in favor of this approach, but I would be a lot more comfortable
with the resulting code if task->active_mm were at least better
documented and possibly even guarded by ifdefs.

x86 bare metal currently does not need the core lazy mm refcounting, and
x86 bare metal *also* does not need ->active_mm.  Under the x86 scheme,
if lazy mm refcounting were configured out, ->active_mm could become a
dangling pointer, and this makes me extremely uncomfortable.

So I tend to think that, depending on config, the core code should
either keep ->active_mm [1] alive or get rid of it entirely.

[1] I don't really think it belongs in task_struct at all.  It's not a
property of the task.  It's the *per-cpu* mm that the core code is
keeping alive for lazy purposes.  How about consolidating it with the
copy in rq?
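
For concreteness, a rough sketch of that consolidation (the field name is
borrowed from rq->lazy_mm in the mockup patch that appears later in this
archive; illustrative only, not a real proposal):

struct rq {
        /* ... existing scheduler fields ... */
#ifdef CONFIG_MMU_TLB_REFCOUNT
        struct mm_struct *lazy_mm;      /* mm this CPU keeps alive for lazy TLB use */
#endif
        /* ... */
};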

I guess the short summary of my opinion is that I like making this
configurable, but I do not like the state of the code.

--Andy


Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Andy Lutomirski
On 6/4/21 9:54 AM, Andy Lutomirski wrote:
> On 5/31/21 11:22 PM, Nicholas Piggin wrote:
>> There haven't been objections to the series since last posting, this
>> is just a rebase and tidies up a few comments minor patch rearranging.
>>
> 
> I continue to object to having too many modes.  I like my more generic
> improvements better.  Let me try to find some time to email again.
> 

Specifically, this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm

I, or someone, needs to dust off my membarrier series before any of
these kinds of changes get made.  The barrier situation in the scheduler
is too confusing otherwise.


Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Andy Lutomirski
On 5/31/21 11:22 PM, Nicholas Piggin wrote:
> There haven't been objections to the series since last posting, this
> is just a rebase and tidies up a few comments minor patch rearranging.
> 

I continue to object to having too many modes.  I like my more generic
improvements better.  Let me try to find some time to email again.


Re: [RFC 00/20] TLB batching consolidation and enhancements

2021-01-30 Thread Andy Lutomirski
On Sat, Jan 30, 2021 at 4:16 PM Nadav Amit  wrote:
>
> From: Nadav Amit 
>
> There are currently (at least?) 5 different TLB batching schemes in the
> kernel:
>
> 1. Using mmu_gather (e.g., zap_page_range()).
>
> 2. Using {inc|dec}_tlb_flush_pending() to inform other threads on the
>ongoing deferred TLB flush and flushing the entire range eventually
>(e.g., change_protection_range()).
>
> 3. arch_{enter|leave}_lazy_mmu_mode() for sparc and powerpc (and Xen?).
>
> 4. Batching per-table flushes (move_ptes()).
>
> 5. By setting a flag on that a deferred TLB flush operation takes place,
>flushing when (try_to_unmap_one() on x86).

Are you referring to the arch_tlbbatch_add_mm/flush mechanism?


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2021-01-05 Thread Andy Lutomirski



> On Jan 5, 2021, at 5:26 AM, Will Deacon  wrote:
> 
> Hi Andy,
> 
> Sorry for the slow reply, I was socially distanced from my keyboard.
> 
>> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote:
>> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin  wrote:
>>>> +static inline void membarrier_sync_core_before_usermode(void)
>>>> +{
>>>> + /*
>>>> +  * XXX: I know basically nothing about powerpc cache management.
>>>> +  * Is this correct?
>>>> +  */
>>>> + isync();
>>> 
>>> This is not about memory ordering or cache management, it's about
>>> pipeline management. Powerpc's return to user mode serializes the
>>> CPU (aka the hardware thread, _not_ the core; another wrongness of
>>> the name, but AFAIKS the HW thread is what is required for
>>> membarrier). So this is wrong, powerpc needs nothing here.
>> 
>> Fair enough.  I'm happy to defer to you on the powerpc details.  In
>> any case, this just illustrates that we need feedback from a person
>> who knows more about ARM64 than I do.
> 
> I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking:
> 
>  1. SYNC_CORE does _not_ perform any cache management; that is the
> responsibility of userspace, either by executing the relevant
> maintenance instructions (arm64) or a system call (arm32). Crucially,
> the hardware will ensure that this cache maintenance is broadcast
> to all other CPUs.

Is this guaranteed regardless of any aliases?  That is, if I flush from one CPU 
at one VA and then execute the same physical address from another CPU at a 
different VA, does this still work?

> 
>  2. Even with all the cache maintenance in the world, a CPU could have
> speculatively fetched stale instructions into its "pipeline" ahead of
> time, and these are _not_ flushed by the broadcast maintenance 
> instructions
> in (1). SYNC_CORE provides a means for userspace to discard these stale
> instructions.
> 
>  3. The context synchronization event on exception entry/exit is
> sufficient here. The Arm ARM isn't very good at describing what it
> does, because it's in denial about the existence of a pipeline, but
> it does have snippets such as:
> 
>(s/PE/CPU/)
>   | For all types of memory:
>   | The PE might have fetched the instructions from memory at any time
>   | since the last Context synchronization event on that PE.
> 
> Interestingly, the architecture recently added a control bit to remove
> this synchronisation from exception return, so if we set that then we'd
> have a problem with SYNC_CORE and adding an ISB would be necessary (and
> we could probably then make kernel->kernel returns cheaper, but I
> suspect we're relying on this implicit synchronisation in other places
> too).
> 

Is ISB just a context synchronization event or does it do more?

On x86, it’s very hard to tell that MFENCE does any more than LOCK, but it’s 
much slower.  And we have LFENCE, which, as documented, doesn’t appear to have 
any semantics at all.  (Or at least it didn’t before Spectre.)

> Are you seeing a problem in practice, or did this come up while trying to
> decipher the semantics of SYNC_CORE?

It came up while trying to understand the code and work through various bugs in 
it.  The code was written using something approximating x86 terminology, but it 
was definitely wrong on x86 (at least if you believe the SDM, and I haven’t 
convinced any architects to say otherwise).

Thanks!

> 
> Will


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin  wrote:
>
> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am:
> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
> >  wrote:
> >>
> >> - On Dec 28, 2020, at 2:44 PM, Andy Lutomirski l...@kernel.org wrote:
> >>
> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> >> >  wrote:
> >> >>
> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> >> >> > After chatting with rmk about this (but without claiming that any of
> >> >> > this is his opinion), based on the manpage, I think membarrier()
> >> >> > currently doesn't really claim to be synchronizing caches? It just
> >> >> > serializes cores. So arguably if userspace wants to use membarrier()
> >> >> > to synchronize code changes, userspace should first do the code
> >> >> > change, then flush icache as appropriate for the architecture, and
> >> >> > then do the membarrier() to ensure that the old code is unused?
> >>
> >> ^ exactly, yes.
> >>
> >> >> >
> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
> >> >> > syscall. That might cause you to end up with two IPIs instead of one
> >> >> > in total, but we probably don't care _that_ much about extra IPIs on
> >> >> > 32-bit arm?
> >>
> >> This was the original thinking, yes. The cacheflush IPI will flush specific
> >> regions of code, and the membarrier IPI issues context synchronizing
> >> instructions.
>
> APIs should be written in terms of the service they provide to
> userspace, and in highest level terms as possible, rather than directing
> hardware to do some low level operation. Unfortunately we're stuck with
> this for now. We could deprecate it and replace it though.
>
> If userspace wants to modify code and ensure that after the system call
> returns then no other thread will be executing the previous code, then
> there should be an API for that. It could actually combine the two IPIs
> into one for architectures that require both too.

I agree.  The membarrier API for SYNC_CORE is pretty nasty.  I would
much prefer a real API for JITs to use.

>
> >>
> >> Architectures with coherent i/d caches don't need the cacheflush step.
> >
> > There are different levels of coherency -- VIVT architectures may have
> > differing requirements compared to PIPT, etc.
> >
> > In any case, I feel like the approach taken by the documentation is
> > fundamentally confusing.  Architectures don't all speak the same
> > language.  How about something like:
> >
> > The SYNC_CORE operation causes all threads in the caller's address
> > space (including the caller) to execute an architecture-defined
> > barrier operation.  membarrier() will ensure that this barrier is
> > executed at a time such that all data writes done by the calling
> > thread before membarrier() are made visible by the barrier.
> > Additional architecture-dependent cache management operations may be
> > required to use this for JIT code.
>
> As said this isn't what SYNC_CORE does, and it's not what powerpc
> context synchronizing instructions do either, it will very much
> re-order visibility of stores around such an instruction.

Perhaps the docs should be entirely arch-specific.  It may well be
impossible to state what it does in an arch-neutral way.

>
> A thread completes store instructions into a store queue, which is
> as far as a context synchronizing event goes. Visibility comes at
> some indeterminate time later.

As currently implemented, it has the same visibility semantics as
regular membarrier, too.  So if I do:

a = 1;
membarrier(SYNC_CORE);
b = 1;

and another thread does:

while (READ_ONCE(b) != 1)
  ;
barrier();
assert(a == 1);

then the assertion will pass.  Similarly, one can do this, I hope:

memcpy(codeptr, [some new instructions], len);
arch_dependent_cache_flush(codeptr, len);
membarrier(SYNC_CORE);
ready = 1;

and another thread does:

while (READ_ONCE(ready) != 1)
  ;
barrier();
(*codeptr)();

arch_dependent_cache_flush is a nop on x86.  On arm and arm64, it
appears to be a syscall, although maybe arm64 can do it from
userspace.  I still don't know what it is on powerpc.
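
For concreteness, a user-space sketch of the two-thread pattern above, assuming
x86 (so the arch-dependent cache flush is a no-op and is omitted) and a kernel
with membarrier SYNC_CORE support; buffer setup and error handling are left
out, and the names below are illustrative:

#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

extern void *codebuf;                   /* RWX (or RW-aliased) JIT buffer, mapped elsewhere */
static atomic_int ready;

static long sys_membarrier(int cmd, unsigned int flags)
{
        return syscall(SYS_membarrier, cmd, flags);
}

/* Thread 1: write the instructions and publish them. */
void publish(const void *insns, size_t len)
{
        /* One-time registration; must precede the expedited command. */
        sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);

        memcpy(codebuf, insns, len);
        /* On arm/arm64/powerpc an icache maintenance step would go here. */
        sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
        atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Thread 2: wait for the publication, then jump to the new code. */
void run(void)
{
        while (atomic_load_explicit(&ready, memory_order_acquire) != 1)
                ;
        ((void (*)(void))codebuf)();
}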

Even using the term "cache" here is misleading.  x86 chips have all
kinds of barely-documented instruction caches, and they have varying
degrees of coherency.  The architecture actually promises that, if you
do a certain incan

Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin  wrote:
>
> Excerpts from Andy Lutomirski's message of December 28, 2020 4:28 am:
> > The old sync_core_before_usermode() comments said that a non-icache-syncing
> > return-to-usermode instruction is x86-specific and that all other
> > architectures automatically notice cross-modified code on return to
> > userspace.  Based on my general understanding of how CPUs work and based on
> > my attempt to read the ARM manual, this is not true at all.  In fact, x86
> > seems to be a bit of an anomaly in the other direction: x86's IRET is
> > unusually heavyweight for a return-to-usermode instruction.
>
> "sync_core_before_usermode" as I've said says nothing to arch, or to the
> scheduler, or to membarrier.

Agreed.  My patch tries to fix this.  I agree that the name is bad and
could be improved further.  We should define what
membarrier(...SYNC_CORE) actually does and have arch hooks to make it
happen.

> > So let's drop any pretense that we can have a generic implementation
> > behind membarrier's SYNC_CORE flush and require all architectures that opt
> > in to supply their own.  This means x86, arm64, and powerpc for now.  Let's
> > also rename the function from sync_core_before_usermode() to
> > membarrier_sync_core_before_usermode() because the precise flushing details
> > may very well be specific to membarrier, and even the concept of
> > "sync_core" in the kernel is mostly an x86-ism.
>
> The concept of "sync_core" (x86: serializing instruction, powerpc: context
> synchronizing instruction, etc) is not an x86-ism at all. x86 just wanted
> to add a serializing instruction to generic code so it grew this nasty API,
> but the concept applies broadly.

I mean that the mapping from the name "sync_core" to its semantics is
x86 only.  The string "sync_core" appears in the kernel only in
arch/x86, membarrier code, membarrier docs, and a single SGI driver
that is x86-only.  Sure, the idea of serializing things is fairly
generic, but exactly what operations serialize what, when things need
serialization, etc is quite architecture specific.

Heck, on 486 you serialize the instruction stream with JMP.

> > +static inline void membarrier_sync_core_before_usermode(void)
> > +{
> > + /*
> > +  * XXX: I know basically nothing about powerpc cache management.
> > +  * Is this correct?
> > +  */
> > + isync();
>
> This is not about memory ordering or cache management, it's about
> pipeline management. Powerpc's return to user mode serializes the
> CPU (aka the hardware thread, _not_ the core; another wrongness of
> the name, but AFAIKS the HW thread is what is required for
> membarrier). So this is wrong, powerpc needs nothing here.

Fair enough.  I'm happy to defer to you on the powerpc details.  In
any case, this just illustrates that we need feedback from a person
who knows more about ARM64 than I do.


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 1:09 PM Mathieu Desnoyers
 wrote:
>
> - On Dec 27, 2020, at 4:36 PM, Andy Lutomirski l...@kernel.org wrote:
>
> [...]
>
> >> You seem to have noticed odd cases on arm64 where this guarantee does not
> >> match reality. Where exactly can we find this in the code, and which part
> >> of the architecture manual can you point us to which supports your concern 
> >> ?
> >>
> >> Based on the notes I have, use of `eret` on aarch64 guarantees a context
> >> synchronizing
> >> instruction when returning to user-space.
> >
> > Based on my reading of the manual, ERET on ARM doesn't synchronize
> > anything at all.  I can't find any evidence that it synchronizes data
> > or instructions, and I've seen reports that the CPU will happily
> > speculate right past it.
>
> Reading [1] there appears to be 3 kind of context synchronization events:
>
> - Taking an exception,
> - Returning from an exception,
> - ISB.

My reading of [1] is that all three of these are "context
synchronization event[s]", but that only ISB flushes the pipeline,
etc.  The little description of context synchronization seems to
suggest that it only implies that certain register changes become
effective.

>
> This other source [2] adds (search for Context synchronization operation):
>
> - Exit from Debug state
> - Executing a DCPS instruction
> - Executing a DRPS instruction
>
> "ERET" falls into the second kind of events, and AFAIU should be context
> synchronizing. That was confirmed to me by Will Deacon when membarrier
> sync-core was implemented for aarch64. If the architecture reference manuals
> are wrong, is there an errata ?
>
> As for the algorithm to use on ARMv8 to update instructions, see [2]
> B2.3.4  Implication of caches for the application programmer
> "Synchronization and coherency issues between data and instruction accesses"

This specifically discusses ISB.

Let's wait for an actual ARM64 expert to chime in, though.


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers
 wrote:
>
> - On Dec 28, 2020, at 2:44 PM, Andy Lutomirski l...@kernel.org wrote:
>
> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
> >  wrote:
> >>
> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> >> > After chatting with rmk about this (but without claiming that any of
> >> > this is his opinion), based on the manpage, I think membarrier()
> >> > currently doesn't really claim to be synchronizing caches? It just
> >> > serializes cores. So arguably if userspace wants to use membarrier()
> >> > to synchronize code changes, userspace should first do the code
> >> > change, then flush icache as appropriate for the architecture, and
> >> > then do the membarrier() to ensure that the old code is unused?
>
> ^ exactly, yes.
>
> >> >
> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
> >> > syscall. That might cause you to end up with two IPIs instead of one
> >> > in total, but we probably don't care _that_ much about extra IPIs on
> >> > 32-bit arm?
>
> This was the original thinking, yes. The cacheflush IPI will flush specific
> regions of code, and the membarrier IPI issues context synchronizing
> instructions.
>
> Architectures with coherent i/d caches don't need the cacheflush step.

There are different levels of coherency -- VIVT architectures may have
differing requirements compared to PIPT, etc.

In any case, I feel like the approach taken by the documentation is
fundamentally confusing.  Architectures don't all speak the same
language.  How about something like:

The SYNC_CORE operation causes all threads in the caller's address
space (including the caller) to execute an architecture-defined
barrier operation.  membarrier() will ensure that this barrier is
executed at a time such that all data writes done by the calling
thread before membarrier() are made visible by the barrier.
Additional architecture-dependent cache management operations may be
required to use this for JIT code.

x86: SYNC_CORE executes a barrier that will cause subsequent
instruction fetches to observe prior writes.  Currently this will be a
"serializing" instruction, but, if future improved CPU documentation
becomes available and relaxes this requirement, the barrier may
change.  The kernel guarantees that writing new or modified
instructions to normal memory (and issuing SFENCE if the writes were
non-temporal) then doing a membarrier SYNC_CORE operation is
sufficient to cause all threads in the caller's address space to
execute the new or modified instructions.  This is true regardless of
whether or not those instructions are written at the same virtual
address from which they are subsequently executed.  No additional
cache management is required on x86.

arm: Something about the cache management syscalls.

arm64: Ditto

powerpc: I have no idea.


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin
 wrote:
>
> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote:
> > After chatting with rmk about this (but without claiming that any of
> > this is his opinion), based on the manpage, I think membarrier()
> > currently doesn't really claim to be synchronizing caches? It just
> > serializes cores. So arguably if userspace wants to use membarrier()
> > to synchronize code changes, userspace should first do the code
> > change, then flush icache as appropriate for the architecture, and
> > then do the membarrier() to ensure that the old code is unused?
> >
> > For 32-bit arm, rmk pointed out that that would be the cacheflush()
> > syscall. That might cause you to end up with two IPIs instead of one
> > in total, but we probably don't care _that_ much about extra IPIs on
> > 32-bit arm?
> >
> > For arm64, I believe userspace can flush icache across the entire
> > system with some instructions from userspace - "DC CVAU" followed by
> > "DSB ISH", or something like that, I think? (See e.g.
> > compat_arm_syscall(), the arm64 compat code that implements the 32-bit
> > arm cacheflush() syscall.)
>
> Note that the ARM cacheflush syscall calls flush_icache_user_range()
> over the range of addresses that userspace has passed - it's intention
> since day one is to support cases where userspace wants to change
> executable code.
>
> It will issue the appropriate write-backs to the data cache (DCCMVAU),
> the invalidates to the instruction cache (ICIMVAU), invalidate the
> branch target buffer (BPIALLIS or BPIALL as appropriate), and issue
> the appropriate barriers (DSB ISHST, ISB).
>
> Note that neither flush_icache_user_range() nor flush_icache_range()
> result in IPIs; cache operations are broadcast across all CPUs (which
> is one of the minimums we require for SMP systems.)
>
> Now, that all said, I think the question that has to be asked is...
>
> What is the basic purpose of membarrier?
>
> Is the purpose of it to provide memory barriers, or is it to provide
> memory coherence?
>
> If it's the former and not the latter, then cache flushes are out of
> scope, and expecting memory written to be visible to the instruction
> stream is totally out of scope of the membarrier interface, whether
> or not the writes happen on the same or a different CPU to the one
> executing the rewritten code.
>
> The documentation in the kernel does not seem to describe what it's
> supposed to be doing - the only thing I could find is this:
> Documentation/features/sched/membarrier-sync-core/arch-support.txt
> which describes it as "arch supports core serializing membarrier"
> whatever that means.
>
> Seems to be the standard and usual case of utterly poor to non-existent
> documentation within the kernel tree, or even a pointer to where any
> useful documentation can be found.
>
> Reading the membarrier(2) man page, I find nothing in there that talks
> about any kind of cache coherency for self-modifying code - it only
> seems to be about _barriers_ and nothing more, and barriers alone do
> precisely nothing to save you from non-coherent Harvard caches.
>
> So, either Andy has a misunderstanding, or the man page is wrong, or
> my rudimentary understanding of what membarrier is supposed to be
> doing is wrong...

Look at the latest man page:

https://man7.org/linux/man-pages/man2/membarrier.2.html

for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE.  The result may not be
all that enlightening.

--Andy


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 10:30 AM Jann Horn  wrote:
>
> On Mon, Dec 28, 2020 at 6:14 PM Andy Lutomirski  wrote:
> > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
> >  wrote:
> > >
> > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > > >  wrote:
> > > > >
> > > > > - On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org 
> > > > > wrote:
> > > > >
> > > >
> > > > > >
> > > > > > I admit that I'm rather surprised that the code worked at all on 
> > > > > > arm64,
> > > > > > and I'm suspicious that it has never been very well tested.  My 
> > > > > > apologies
> > > > > > for not reviewing this more carefully in the first place.
> > > > >
> > > > > Please refer to 
> > > > > Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > > >
> > > > > It clearly states that only arm, arm64, powerpc and x86 support the 
> > > > > membarrier
> > > > > sync core feature as of now:
> > > >
> > > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > > incantation to make the CPU notice instruction changes initiated by
> > > > other cores on 32-bit ARM?
> > >
> > > You need to call flush_icache_range(), since the changes need to be
> > > flushed from the data cache to the point of unification (of the Harvard
> > > I and D), and the instruction cache needs to be invalidated so it can
> > > then see those updated instructions. This will also take care of the
> > > necessary barriers that the CPU requires for you.
> >
> > With what parameters?   From looking at the header, this is for the
> > case in which the kernel writes some memory and then intends to
> > execute it.  That's not what membarrier() does at all.  membarrier()
> > works like this:
> >
> > User thread 1:
> >
> > write to RWX memory *or* write to an RW alias of an X region.
> > membarrier(...);
> > somehow tell thread 2 that we're ready (with a store release, perhaps,
> > or even just a relaxed store.)
> >
> > User thread 2:
> >
> > wait for the indication from thread 1.
> > barrier();
> > jump to the code.
> >
> > membarrier() is, for better or for worse, not given a range of addresses.
> >
> > On x86, the documentation is a bit weak, but a strict reading
> > indicates that thread 2 must execute a serializing instruction at some
> > point after thread 1 writes the code and before thread 2 runs it.
> > membarrier() does this by sending an IPI and ensuring that a
> > "serializing" instruction (thanks for great terminology, Intel) is
> > executed.  Note that flush_icache_range() is a no-op on x86, and I've
> > asked Intel's architects to please clarify their precise rules.  No
> > response yet.
> >
> > On arm64, flush_icache_range() seems to send an IPI, and that's not
> > what I want.  membarrier() already does an IPI.
>
> After chatting with rmk about this (but without claiming that any of
> this is his opinion), based on the manpage, I think membarrier()
> currently doesn't really claim to be synchronizing caches? It just
> serializes cores. So arguably if userspace wants to use membarrier()
> to synchronize code changes, userspace should first do the code
> change, then flush icache as appropriate for the architecture, and
> then do the membarrier() to ensure that the old code is unused?

I haven't the faintest clue what "serializes cores" means.  It seems
to be a bit of a mishmash of x86 SDM terminology and Linux x86
"sync_core" terminology.  The latter means very little to me, even as
an x86 person.

I'm moderately confident that the *intent* is that a multithreaded
program can write some JIT code to memory, do this membarrier()
operation, and then execute the code, and it will work.  Maybe it's
even intended to work cross-architecture without any additional help
from the program.  But maybe the existing API is simply incorrect for
this.

>
> For 32-bit arm, rmk pointed out that that would be the cacheflush()
> syscall. That might cause you to end up with two IPIs instead of one
> in total, but we probably don't care _that_ much about extra IPIs on
> 32-bit arm?
>
> For arm64, I believe userspace can flush icache across the entire
> system with some instructions from userspace - "DC CVAU" followed by
> "DSB ISH", or something like that, I think? (See e.g.
> compat_arm_syscall(), the arm64 compat code that implements the 32-bit
> arm cacheflush() syscall.)

I have no idea what DC anything does.   Based on my very cursory
reading of the manual, ISB is the right approach, but I don't pretend
I understand all the details.
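
For reference, a rough sketch of the 32-bit ARM sequence described above: write
the new instructions, cacheflush() the rewritten range, then issue the
membarrier.  This is illustrative only; __ARM_NR_cacheflush comes from the ARM
asm/unistd.h, and SYNC_CORE registration is assumed to have been done earlier:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <asm/unistd.h>         /* __ARM_NR_cacheflush (32-bit ARM only) */

/* Publish JIT-ed instructions in [start, end) on 32-bit ARM. */
static void publish_jit_range(void *start, void *end)
{
        /* Write back the D-cache and invalidate the I-cache for the range. */
        syscall(__ARM_NR_cacheflush, start, end, 0);

        /* Make sure no thread keeps executing the stale instructions. */
        syscall(SYS_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}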


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 9:23 AM Russell King - ARM Linux admin
 wrote:
>
> On Mon, Dec 28, 2020 at 09:14:23AM -0800, Andy Lutomirski wrote:
> > On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
> >  wrote:
> > >
> > > On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > > > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> > > >  wrote:
> > > > >
> > > > > - On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org 
> > > > > wrote:
> > > > >
> > > >
> > > > > >
> > > > > > I admit that I'm rather surprised that the code worked at all on 
> > > > > > arm64,
> > > > > > and I'm suspicious that it has never been very well tested.  My 
> > > > > > apologies
> > > > > > for not reviewing this more carefully in the first place.
> > > > >
> > > > > Please refer to 
> > > > > Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > > > >
> > > > > It clearly states that only arm, arm64, powerpc and x86 support the 
> > > > > membarrier
> > > > > sync core feature as of now:
> > > >
> > > > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > > > incantation to make the CPU notice instruction changes initiated by
> > > > other cores on 32-bit ARM?
> > >
> > > You need to call flush_icache_range(), since the changes need to be
> > > flushed from the data cache to the point of unification (of the Harvard
> > > I and D), and the instruction cache needs to be invalidated so it can
> > > then see those updated instructions. This will also take care of the
> > > necessary barriers that the CPU requires for you.
> >
> > With what parameters?   From looking at the header, this is for the
> > case in which the kernel writes some memory and then intends to
> > execute it.  That's not what membarrier() does at all.  membarrier()
> > works like this:
>
> You didn't specify that you weren't looking at kernel memory.
>
> If you're talking about userspace, then the interface you require
> is flush_icache_user_range(), which does the same as
> flush_icache_range() but takes userspace addresses. Note that this
> requires that the memory is currently mapped at those userspace
> addresses.
>
> If that doesn't fit your needs, there isn't an interface to do what
> you require, and it basically means creating something brand new
> on every architecture.
>
> What you are asking for is not "just a matter of a few instructions".
> I have stated the required steps to achieve what you require above;
> that is the minimum when you have non-snooping harvard caches, which
> the majority of 32-bit ARMs have.
>
> > User thread 1:
> >
> > write to RWX memory *or* write to an RW alias of an X region.
> > membarrier(...);
> > somehow tell thread 2 that we're ready (with a store release, perhaps,
> > or even just a relaxed store.)
> >
> > User thread 2:
> >
> > wait for the indication from thread 1.
> > barrier();
> > jump to the code.
> >
> > membarrier() is, for better or for worse, not given a range of addresses.
>
> Then, I'm sorry, it can't work on 32-bit ARM.

Is there a way to flush the *entire* user icache?  If so, and if it
has reasonable performance, then it could probably be used here.
Otherwise I'll just send a revert for this whole mechanism on 32-bit
ARM.

--Andy


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-28 Thread Andy Lutomirski
On Mon, Dec 28, 2020 at 2:25 AM Russell King - ARM Linux admin
 wrote:
>
> On Sun, Dec 27, 2020 at 01:36:13PM -0800, Andy Lutomirski wrote:
> > On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
> >  wrote:
> > >
> > > ----- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org wrote:
> > >
> >
> > > >
> > > > I admit that I'm rather surprised that the code worked at all on arm64,
> > > > and I'm suspicious that it has never been very well tested.  My 
> > > > apologies
> > > > for not reviewing this more carefully in the first place.
> > >
> > > Please refer to 
> > > Documentation/features/sched/membarrier-sync-core/arch-support.txt
> > >
> > > It clearly states that only arm, arm64, powerpc and x86 support the 
> > > membarrier
> > > sync core feature as of now:
> >
> > Sigh, I missed arm (32).  Russell or ARM folks, what's the right
> > incantation to make the CPU notice instruction changes initiated by
> > other cores on 32-bit ARM?
>
> You need to call flush_icache_range(), since the changes need to be
> flushed from the data cache to the point of unification (of the Harvard
> I and D), and the instruction cache needs to be invalidated so it can
> then see those updated instructions. This will also take care of the
> necessary barriers that the CPU requires for you.

With what parameters?   From looking at the header, this is for the
case in which the kernel writes some memory and then intends to
execute it.  That's not what membarrier() does at all.  membarrier()
works like this:

User thread 1:

write to RWX memory *or* write to an RW alias of an X region.
membarrier(...);
somehow tell thread 2 that we're ready (with a store release, perhaps,
or even just a relaxed store.)

User thread 2:

wait for the indication from thread 1.
barrier();
jump to the code.

membarrier() is, for better or for worse, not given a range of addresses.

On x86, the documentation is a bit weak, but a strict reading
indicates that thread 2 must execute a serializing instruction at some
point after thread 1 writes the code and before thread 2 runs it.
membarrier() does this by sending an IPI and ensuring that a
"serializing" instruction (thanks for great terminology, Intel) is
executed.  Note that flush_icache_range() is a no-op on x86, and I've
asked Intel's architects to please clarify their precise rules.  No
response yet.

On arm64, flush_icache_range() seems to send an IPI, and that's not
what I want.  membarrier() already does an IPI.

I suppose one option is to revert support for this membarrier()
operation on arm, but it would be nice to just implement it instead.

--Andy


Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-27 Thread Andy Lutomirski
On Sun, Dec 27, 2020 at 12:18 PM Mathieu Desnoyers
 wrote:
>
> - On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org wrote:
>

> >
> > I admit that I'm rather surprised that the code worked at all on arm64,
> > and I'm suspicious that it has never been very well tested.  My apologies
> > for not reviewing this more carefully in the first place.
>
> Please refer to 
> Documentation/features/sched/membarrier-sync-core/arch-support.txt
>
> It clearly states that only arm, arm64, powerpc and x86 support the membarrier
> sync core feature as of now:

Sigh, I missed arm (32).  Russell or ARM folks, what's the right
incantation to make the CPU notice instruction changes initiated by
other cores on 32-bit ARM?

>
>
> # Architecture requirements
> #
> # * arm/arm64/powerpc
> #
> # Rely on implicit context synchronization as a result of exception return
> # when returning from IPI handler, and when returning to user-space.
> #
> # * x86
> #
> # x86-32 uses IRET as return from interrupt, which takes care of the IPI.
> # However, it uses both IRET and SYSEXIT to go back to user-space. The IRET
> # instruction is core serializing, but not SYSEXIT.
> #
> # x86-64 uses IRET as return from interrupt, which takes care of the IPI.
> # However, it can return to user-space through either SYSRETL (compat code),
> # SYSRETQ, or IRET.
> #
> # Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely
> # instead on write_cr3() performed by switch_mm() to provide core 
> serialization
> # after changing the current mm, and deal with the special case of kthread ->
> # uthread (temporarily keeping current mm into active_mm) by issuing a
> # sync_core_before_usermode() in that specific case.
>

I need to update that document as part of my series.

> This is based on direct feedback from the architecture maintainers.
>
> You seem to have noticed odd cases on arm64 where this guarantee does not
> match reality. Where exactly can we find this in the code, and which part
> of the architecture manual can you point us to which supports your concern ?
>
> Based on the notes I have, use of `eret` on aarch64 guarantees a context 
> synchronizing
> instruction when returning to user-space.

Based on my reading of the manual, ERET on ARM doesn't synchronize
anything at all.  I can't find any evidence that it synchronizes data
or instructions, and I've seen reports that the CPU will happily
speculate right past it.

--Andy


[RFC please help] membarrier: Rewrite sync_core_before_usermode()

2020-12-27 Thread Andy Lutomirski
The old sync_core_before_usermode() comments said that a non-icache-syncing
return-to-usermode instruction is x86-specific and that all other
architectures automatically notice cross-modified code on return to
userspace.  Based on my general understanding of how CPUs work and based on
my attempt to read the ARM manual, this is not true at all.  In fact, x86
seems to be a bit of an anomaly in the other direction: x86's IRET is
unusually heavyweight for a return-to-usermode instruction.

So let's drop any pretense that we can have a generic implementation
behind membarrier's SYNC_CORE flush and require all architectures that opt
in to supply their own.  This means x86, arm64, and powerpc for now.  Let's
also rename the function from sync_core_before_usermode() to
membarrier_sync_core_before_usermode() because the precise flushing details
may very well be specific to membarrier, and even the concept of
"sync_core" in the kernel is mostly an x86-ism.

I admit that I'm rather surprised that the code worked at all on arm64,
and I'm suspicious that it has never been very well tested.  My apologies
for not reviewing this more carefully in the first place.

Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nicholas Piggin 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: linux-arm-ker...@lists.infradead.org
Cc: Mathieu Desnoyers 
Cc: x...@kernel.org
Cc: sta...@vger.kernel.org
Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
Signed-off-by: Andy Lutomirski 
---

Hi arm64 and powerpc people-

This is part of a series here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes

Before I send out the whole series, I'm hoping that some arm64 and powerpc
people can help me verify that I did this patch right.  Once I get
some feedback on this patch, I'll send out the whole pile.  And once
*that's* done, I'll start giving the mm lazy stuff some serious thought.

The x86 part is already fixed in Linus' tree.

Thanks,
Andy

 arch/arm64/include/asm/sync_core.h   | 21 +
 arch/powerpc/include/asm/sync_core.h | 20 
 arch/x86/Kconfig |  1 -
 arch/x86/include/asm/sync_core.h |  7 +++
 include/linux/sched/mm.h |  1 -
 include/linux/sync_core.h| 21 -
 init/Kconfig |  3 ---
 kernel/sched/membarrier.c| 15 +++
 8 files changed, 55 insertions(+), 34 deletions(-)
 create mode 100644 arch/arm64/include/asm/sync_core.h
 create mode 100644 arch/powerpc/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h

diff --git a/arch/arm64/include/asm/sync_core.h b/arch/arm64/include/asm/sync_core.h
new file mode 100644
index ..5be4531caabd
--- /dev/null
+++ b/arch/arm64/include/asm/sync_core.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_ARM64_SYNC_CORE_H
+#define _ASM_ARM64_SYNC_CORE_H
+
+#include 
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+   /*
+* XXX: is this enough or do we need a DMB first to make sure that
+* writes from other CPUs become visible to this CPU?  We have an
+* smp_mb() already, but that's not quite the same thing.
+*/
+   isb();
+}
+
+#endif /* _ASM_ARM64_SYNC_CORE_H */
diff --git a/arch/powerpc/include/asm/sync_core.h b/arch/powerpc/include/asm/sync_core.h
new file mode 100644
index ..71dfbe7794e5
--- /dev/null
+++ b/arch/powerpc/include/asm/sync_core.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SYNC_CORE_H
+#define _ASM_POWERPC_SYNC_CORE_H
+
+#include 
+
+/*
+ * Ensure that the CPU notices any instruction changes before the next time
+ * it returns to usermode.
+ */
+static inline void membarrier_sync_core_before_usermode(void)
+{
+   /*
+* XXX: I know basically nothing about powerpc cache management.
+* Is this correct?
+*/
+   isync();
+}
+
+#endif /* _ASM_POWERPC_SYNC_CORE_H */
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b5137cc5b7b4..895f70fd4a61 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -81,7 +81,6 @@ config X86
select ARCH_HAS_SET_DIRECT_MAP
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
-   select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_SYSCALL_WRAPPER
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAS_DEBUG_WX
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
index ab7382f92aff..c665b453969a 100644
--- a/arch/x86/include/asm/sync_core.h
+++ b/arch/x86/include/asm/sync_core.h
@@ -89,11 +89,10 @@ static inline void sync_cor

Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-12-10 Thread Andy Lutomirski
> On Dec 5, 2020, at 7:59 PM, Nicholas Piggin  wrote:
>

> I'm still going to persue shoot-lazies for the merge window. As you
> see it's about a dozen lines and a if (IS_ENABLED(... in core code.
> Your change is common code, but a significant complexity (which
> affects all archs) so needs a lot more review and testing at this
> point.

I don't think it's ready for this merge window.  I read the early
patches again, and I think they make the membarrier code worse, not
better.  I'm not fundamentally opposed to the shoot-lazies concept,
but it needs more thought and it needs a cleaner foundation.


Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-12-05 Thread Andy Lutomirski
On Sat, Dec 5, 2020 at 3:15 PM Nicholas Piggin  wrote:
>
> Excerpts from Andy Lutomirski's message of December 6, 2020 2:11 am:
> >

> If an mm was lazy tlb for a kernel thread and then it becomes unlazy,
> and if switch_mm is serialising but return to user is not, then you
> need a serialising instruction somewhere before return to user. unlazy
> is the logical place to add that, because the lazy tlb mm (i.e.,
> switching to a kernel thread and back without switching mm) is what
> opens the hole.

The issue here is that unlazying on x86 sometimes serializes and
sometimes doesn't.  It's straightforward to add logic to the x86 code
to serialize specifically in the case where membarrier core sync is
registered and unlazying would otherwise not serialize, but trying to
define sensible semantics for this in a call to core code seems
complicated.  (Specifically, the x86 code only sometimes sends IPIs to
lazy CPUs for TLB flushes.  If an IPI is skipped, then unlazying will
flush the TLB, and that operation is serializing.)

The whole lazy thing is IMO a red herring for membarrier().  The
membarrier() logic requires that switching *logical* mms
(rq->curr->mm) serializes before user mode if the new mm is registered
for core sync.  AFAICT the only architecture on which this isn't
automatic is x86, and somehow the logic turned into "actually changing
rq->curr->mm serializes, but unlazying only sometimes serializes, so
we need to do an extra serialization step on unlazying operations"
instead of "tell x86 to make sure it always serializes when it
switches logical mms".  The latter is easy to specify and easy to
implement.

>
> How do you mean? exit_lazy_tlb is the opposite, core scheduler notifying
> arch code about when an mm becomes not-lazy, and nothing to do with
> membarrier at all even. It's a convenient hook to do your un-lazying.
> I guess you can do it also checking things in switch_mm and keeping state
> in arch code, I don't think that's necessarily the best place to put it.

I'm confused.  I just re-read your patches, and it looks like you have
arch code calling exit_lazy_tlb().  On x86, if we do a TLB shootdown
IPI to a lazy CPU, the IPI handler will unlazy that CPU (by switching
to init_mm for real), and we have no way to notify the core scheduler
about this, so we don't.  The result is that the core scheduler state
and the x86 state gets out of sync.  If the core scheduler
subsequently switches us back to the mm that it thinks we were still
using lazily them, from the x86 code's perspective, we're not
unlazying -- we're just doing a regular switch from init_mm to some
other mm.  This is why x86's switch_mm_irqs_off() totally ignores its
'prev' argument.

I'm honestly a bit surprised that other architectures don't do the
same thing.  I suppose that some architectures don't use shootdown
IPIs at all, in which case there doesn't seem to be any good reason to
aggressively unlazy.

(Oddly, despite the fact that, since Ivy Bridge, x86 has a "just flush
the TLB" instruction, that instruction is considerably slower than the
old "switch mm and flush" operation.  So the operation "switch to
init_mm" is only ever any slower than "flush and stay lazy" if we get
lucky and unlazy to the same mm before we get a second TLB shootdown
*and* if unlazying to the same mm would not have needed to flush.  I
spend quite a bit of time tuning this stuff and being quite surprised
at the bizarre performance properties of Intel's TLB management
instructions.)

>
> So membarrier code is unchanged (it cares that the serialise is done at
> un-lazy time), core code is simpler (no knowledge of this membarrier
> quirk and it already knows about lazy-tlb so the calls actually improve
> the documentation), and x86 code I would argue becomes nicer (or no real
> difference at worst) because you can move some exit lazy tlb handling to
> that specific call rather than decipher it from switch_mm.

As above, I can't move the exit-lazy handling because the scheduler
doesn't know when I'm unlazying.

>
> >
> > I’m currently trying to document how membarrier actually works, and
> > hopefully this will result in untangling membarrier from mmdrop() and
> > such.
>
> That would be nice.

It's still a work in progress.  I haven't actually convinced myself
that the non-IPI case in membarrier() is correct, nor have I convinced
myself that it's incorrect.

Anyway, I think that my patch is a bit incorrect and I either need a
barrier somewhere (which may already exist) or a store-release to
lazy_mm to make sure that all accesses to the lazy mm are done before
lazy_mm is freed.  On x86, even aside from the fact that all stores
are releases, this isn't needed -- stopping using an mm is itself a
full barrier.  Will this be a performance issue on power?

>
> >
> > A silly part of this is that x86 already has a high quality
> > implementation of most of membarrier(): flush_tlb_mm().  If you flush
> > an mm’s TLB, we carefully propagate the flush to all threads, 

Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-12-05 Thread Andy Lutomirski


> On Dec 5, 2020, at 12:00 AM, Nicholas Piggin  wrote:
> 
> 
> I disagree. Until now nobody following it noticed that the mm gets
> un-lazied in other cases, because that was not too clear from the
> code (only indirectly using non-standard terminology in the arch
> support document).

> In other words, membarrier needs a special sync to deal with the case 
> when a kthread takes the mm.

I don’t think this is actually true. Somehow the x86 oddities about CR3 writes 
leaked too much into the membarrier core code and comments. (I doubt this is 
x86 specific.  The actual x86 specific part seems to be that we can return to 
user mode without syncing the instruction stream.)

As far as I can tell, membarrier doesn’t care at all about laziness. Membarrier 
cares about rq->curr->mm.  The fact that a cpu can switch its actual loaded mm 
without scheduling at all (on x86 at least) is entirely beside the point except 
insofar as it has an effect on whether a subsequent switch_mm() call 
serializes.  If we notify membarrier about x86’s asynchronous CR3 writes, then 
membarrier needs to understand what to do with them, which results in an 
unmaintainable mess in membarrier *and* in the x86 code.

I’m currently trying to document how membarrier actually works, and hopefully 
this will result in untangling membarrier from mmdrop() and such.

A silly part of this is that x86 already has a high quality implementation of 
most of membarrier(): flush_tlb_mm().  If you flush an mm’s TLB, we carefully 
propagate the flush to all threads, with attention to memory ordering.  We 
can’t use this directly as an arch-specific implementation of membarrier 
because it has the annoying side effect of flushing the TLB and because 
upcoming hardware might be able to flush without guaranteeing a core sync.  
(Upcoming means Zen 3, but the Zen 3 implementation is sadly not usable by 
Linux.)


Re: [RFC v2 2/2] [MOCKUP] sched/mm: Lightweight lazy mm refcounting

2020-12-04 Thread Andy Lutomirski



> On Dec 3, 2020, at 11:54 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Andy Lutomirski's message of December 4, 2020 3:26 pm:
>> This is a mockup.  It's designed to illustrate the algorithm and how the
>> code might be structured.  There are several things blatantly wrong with
>> it:
>> 
>> The coding style is not up to kernel standards.  I have prototypes in the
>> wrong places and other hacks.
>> 
>> There's a problem with mm_cpumask() not being reliable.
> 
> Interesting, this might be a way to reduce those IPIs with fairly 
> minimal fast path cost. Would be interesting to see how much performance 
> advantage it has over my dumb simple shoot-lazies.

My real motivation isn’t really performance per se. I think there’s 
considerable value in keeping the core algorithms the same across all 
architectures, and I think my approach can manage that with only a single hint 
from the architecture as to which CPUs to scan.

With shoot-lazies, in contrast, enabling it everywhere would either malfunction 
or have very poor performance or even DoS issues on arches like arm64 and s390x 
that don’t track mm_cpumask at all.  I’m sure we could come up with some way to 
mitigate that, but I think that my approach may be better overall for keeping 
the core code uniform and relatively straightforward.

> 
> For powerpc I don't think we'd be inclined to go that way, so don't feel 
> the need to add this complexity for us alone -- we'd be more inclined to 
> move the exit lazy to the final TLB shootdown path, which we're slowly 
> getting more infrastructure in place to do.
> 


> 
> There's a few nits but I don't think I can see a fundamental problem 
> yet.

Thanks!

I can polish the patch, but I want to be sure the memory ordering parts are 
clear.

> 
> Thanks,
> Nick


[RFC v2 2/2] [MOCKUP] sched/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Andy Lutomirski
This is a mockup.  It's designed to illustrate the algorithm and how the
code might be structured.  There are several things blatantly wrong with
it:

The coding style is not up to kernel standards.  I have prototypes in the
wrong places and other hacks.

There's a problem with mm_cpumask() not being reliable.

Signed-off-by: Andy Lutomirski 
---
 kernel/fork.c|   4 ++
 kernel/sched/core.c  | 128 +--
 kernel/sched/sched.h |  11 +++-
 3 files changed, 126 insertions(+), 17 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index da8d360fb032..0887a33cf84f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1066,6 +1066,8 @@ struct mm_struct *mm_alloc(void)
return mm_init(mm, current, current_user_ns());
 }
 
+extern void mm_fixup_lazy_refs(struct mm_struct *mm);
+
 static inline void __mmput(struct mm_struct *mm)
 {
VM_BUG_ON(atomic_read(&mm->mm_users));
@@ -1084,6 +1086,8 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+
+   mm_fixup_lazy_refs(mm);
mmdrop(mm);
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6c4b76147166..69dfdfe0e5b4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3555,6 +3555,75 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
prepare_arch_switch(next);
 }
 
+static void drop_extra_mm_refs(struct rq *rq)
+{
+   unsigned long old_mm;
+
+   if (likely(!atomic_long_read(&rq->mm_to_mmdrop)))
+   return;
+
+   /*
+* Slow path.  This only happens when we recently stopped using
+* an mm that is exiting.
+*/
+   old_mm = atomic_long_xchg_relaxed(&rq->mm_to_mmdrop, 0);
+   if (old_mm)
+   mmdrop((struct mm_struct *)old_mm);
+}
+
+/*
+ * This ensures that all lazy_mm refs to mm are converted to mm_count
+ * refcounts.  Our caller holds an mm_count reference, so we don't need
+ * to worry about mm being freed out from under us.
+ */
+void mm_fixup_lazy_refs(struct mm_struct *mm)
+{
+   int cpu;
+
+   /*
+* mm_users is zero, so no new lazy refs will be taken.
+*/
+   WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0);
+
+   /*
+* XXX: this is wrong on arm64 and possibly on other architectures.
+* Maybe we need a config option for this?  Or a
+* for_each_possible_lazy_cpu(cpu, mm) helper?
+*/
+   for_each_cpu(cpu, mm_cpumask(mm)) {
+   struct rq *rq = cpu_rq(cpu);
+   unsigned long old;
+
+   if (READ_ONCE(rq->lazy_mm) != mm)
+   continue;
+
+   // XXX: we could optimize this by doing a big addition to
+   // mm_count up front instead of incrementing it separately
+   // for each CPU.
+   mmgrab(mm);
+
+   // XXX: could this be relaxed instead?
+   old = atomic_long_xchg(&rq->mm_to_mmdrop, (unsigned long)mm);
+
+   // At this point, mm can be mmdrop()ed at any time, probably
+   // by the target cpu.
+
+   if (!old)
+   continue;  // All done!
+
+   WARN_ON_ONCE(old == (unsigned long)mm);
+
+   // Uh oh!  We just stole an mm reference from the target CPU.
+   // Fortunately, we just observed the target's lazy_mm pointing
+   // to something other than old, and we observed this after
+   // bringing mm_users down to 0.  This means that the remote
+   // cpu is definitely done with old.  So we can drop it on the
+   // remote CPU's behalf.
+
+   mmdrop((struct mm_struct *)old);
+   }
+}
+
 /**
  * finish_task_switch - clean up after a task-switch
  * @prev: the thread we just switched away from.
@@ -3578,7 +3647,6 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
__releases(rq->lock)
 {
struct rq *rq = this_rq();
-   struct mm_struct *mm = rq->prev_mm;
long prev_state;
 
/*
@@ -3597,8 +3665,6 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
  current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);
 
-   rq->prev_mm = NULL;
-
/*
 * A task struct has one reference for the use as "current".
 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
@@ -3629,11 +3695,28 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
 * rq->curr, before returning to userspace, and mmdrop() provides
 * this barrier.
 *
-* XXX: I don't think mmdrop() actually does this.  There's no
-* smp_mb__before/after_atomic() in there.
+* XXX: I d

[RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code

2020-12-03 Thread Andy Lutomirski
The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler doesn't
actually know whether we are lazy.  With the old code, if a CPU is
running a membarrier-registered task, goes idle, gets unlazied via a TLB
shootdown IPI, and switches back to the membarrier-registered task, it
will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in this
hook, so we can just move the code.

XXX: there are some comments in switch_mm_irqs_off() that seem to be
trying to document what barriers are expected, and it's not clear to me
that these barriers are actually present in all paths through the
code.  So I think this change makes the code more comprehensible and
has no effect on the code's correctness, but I'm not at all convinced
that the code is correct.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/mm/tlb.c   | 17 -
 kernel/sched/core.c | 14 +++---
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3338a1feccf9..23df035b80e8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * from one thread in a process to another thread in the same
 * process. No TLB flush required.
 */
+
+   // XXX: why is this okay wrt membarrier?
if (!was_lazy)
return;
 
@@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
smp_mb();
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
-   next_tlb_gen)
+   next_tlb_gen) {
+   /*
+* We're reactivating an mm, and membarrier might
+* need to serialize.  Tell membarrier.
+*/
+
+   // XXX: I can't understand the logic in
+   // membarrier_mm_sync_core_before_usermode().  What's
+   // the mm check for?
+   membarrier_mm_sync_core_before_usermode(next);
return;
+   }
 
/*
 * TLB contents went out of date while we were in lazy
 * mode. Fall through to the TLB switching code below.
+* No need for an explicit membarrier invocation -- the CR3
+* write will serialize.
 */
new_asid = prev_asid;
need_flush = true;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d95dc3f4644..6c4b76147166 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3619,22 +3619,22 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
kcov_finish_switch(current);
 
fire_sched_in_preempt_notifiers(current);
+
/*
 * When switching through a kernel thread, the loop in
 * membarrier_{private,global}_expedited() may have observed that
 * kernel thread and not issued an IPI. It is therefore possible to
 * schedule between user->kernel->user threads without passing though
 * switch_mm(). Membarrier requires a barrier after storing to
-* rq->curr, before returning to userspace, so provide them here:
+* rq->curr, before returning to userspace, and mmdrop() provides
+* this barrier.
 *
-* - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-*   provided by mmdrop(),
-* - a sync_core for SYNC_CORE.
+* XXX: I don't think mmdrop() actually does this.  There's no
+* smp_mb__before/after_atomic() in there.
 */
-   if (mm) {
-   membarrier_mm_sync_core_before_usermode(mm);
+   if (mm)
mmdrop(mm);
-   }
+
if (unlikely(prev_state == TASK_DEAD)) {
if (prev->sched_class->task_dead)
prev->sched_class->task_dead(prev);
-- 
2.28.0



[RFC v2 0/2] lazy mm refcounting

2020-12-03 Thread Andy Lutomirski
This is part of a larger series here, but the beginning bit is irrelevant
to the current discussion:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm&id=203d39d11562575fd8bd6a094d97a3a332d8b265

This is IMO a lot better than v1.  It's now almost entirely in generic
code.  (It looks like it's 100% generic, but that's a lie -- the
generic code currently that all possible lazy mm refs are in
mm_cpumask(), and that's not true on all arches.  So, if we take my
approach, we'll need to have a little arch hook to control this.)
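
Roughly, that hook could be as small as the sketch below; for_each_possible_lazy_cpu() is a name I'm inventing here, not something in the tree:

#include <linux/cpumask.h>
#include <linux/mm_types.h>

/*
 * Hypothetical arch hook, not part of this series.  An architecture that
 * keeps mm_cpumask() in sync with its lazy users (e.g. x86) could define
 * this as for_each_cpu(cpu, mm_cpumask(mm)); the conservative generic
 * fallback just scans every possible CPU.
 */
#ifndef for_each_possible_lazy_cpu
#define for_each_possible_lazy_cpu(cpu, mm)	for_each_possible_cpu(cpu)
#endif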

Here's how I think it fits with various arches:

x86: On bare metal (i.e. paravirt flush unavailable), the loop won't do
much.  The existing TLB shootdown when user tables are freed will
empty mm_cpumask of everything but the calling CPU.  So x86 ends up
pretty close to as good as we can get short of reworking mm_cpumask() itself.

arm64: It needs the fixup above for correctness, but I think performance
should be pretty good.  Compared to current kernels, we lose an mmgrab()
and mmdrop() on each lazy transition, and we add a reasonably fast loop
over all cpus on process exit.  Someone (probably me) needs to make
sure we don't need some extra barriers.

power: Similar to x86.

s390x: Should be essentially the same as arm64.

Other arches: I don't know.  Further research is required.

What do you all think?

Andy Lutomirski (2):
  [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch
code
  [MOCKUP] sched/mm: Lightweight lazy mm refcounting

 arch/x86/mm/tlb.c|  17 +-
 kernel/fork.c|   4 ++
 kernel/sched/core.c  | 134 +--
 kernel/sched/sched.h |  11 +++-
 4 files changed, 145 insertions(+), 21 deletions(-)

-- 
2.28.0



Re: [MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-03 Thread Andy Lutomirski


> On Dec 3, 2020, at 2:13 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Peter Zijlstra's message of December 3, 2020 6:44 pm:
>>> On Wed, Dec 02, 2020 at 09:25:51PM -0800, Andy Lutomirski wrote:
>>> 
>>> power: same as ARM, except that the loop may be rather larger since
>>> the systems are bigger.  But I imagine it's still faster than Nick's
>>> approach -- a cmpxchg to a remote cacheline should still be faster than
>>> an IPI shootdown. 
>> 
>> While a single atomic might be cheaper than an IPI, the comparison
>> doesn't work out nicely. You do the xchg() on every unlazy, while the
>> IPI would be once per process exit.
>> 
>> So over the life of the process, it might do very many unlazies, adding
>> up to a total cost far in excess of what the single IPI would've been.
> 
> Yeah this is the concern, I looked at things that add cost to the
> idle switch code and it gets hard to justify the scalability improvement
> when you slow these fundmaental things down even a bit.

v2 fixes this and is generally much nicer. I’ll send it out in a couple hours.

> 
> I still think working on the assumption that IPIs = scary expensive 
> might not be correct. An IPI itself is, but you only issue them when 
> you've left a lazy mm on another CPU which just isn't that often.
> 
> Thanks,
> Nick


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-03 Thread Andy Lutomirski



> On Dec 3, 2020, at 9:09 AM, Alexander Gordeev  wrote:
> 
> On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
>> other arch folk: there's some background here:

> 
>> 
>> power: Ridiculously complicated, seems to vary by system and kernel config.
>> 
>> So, Nick, your unconditional IPI scheme is apparently a big
>> improvement for power, and it should be an improvement and have low
>> cost for x86.  On arm64 and s390x it will add more IPIs on process
>> exit but reduce contention on context switching depending on how lazy
> 
> s390 does not invalidate TLBs per-CPU explicitly - we have special
> instructions for that. Those in turn initiate signalling to other
> CPUs, completely transparent to OS.

Just to make sure I understand: this means that you broadcast flushes to all 
CPUs, not just a subset?

> 
> Apart from mm_count, I am struggling to realize how the suggested
> scheme could change the the contention on s390 in connection with
> TLB. Could you clarify a bit here, please?

I’m just talking about mm_count. Maintaining mm_count is quite expensive on 
some workloads.

> 
>> TLB works.  I suppose we could try it for all architectures without
>> any further optimizations.  Or we could try one of the perhaps
>> excessively clever improvements I linked above.  arm64, s390x people,
>> what do you think?
> 
> I do not immediately see anything in the series that would harm
> performance on s390.
> 
> We however use mm_cpumask to distinguish between local and global TLB
> flushes. With this series it looks like mm_cpumask is *required* to
> be consistent with lazy users. And that is something quite diffucult
> for us to adhere (at least in the foreseeable future).

You don’t actually need to maintain mm_cpumask — we could scan all CPUs instead.

> 
> But actually keeping track of lazy users in a cpumask is something
> the generic code would rather do AFAICT.

The problem is that arches don’t agree on what the contents of mm_cpumask 
should be.  Tracking a mask of exactly what the arch wants in generic code is a 
nontrivial operation.


[MOCKUP] x86/mm: Lightweight lazy mm refcounting

2020-12-02 Thread Andy Lutomirski
For context, this is part of a series here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm&id=7c4bcc0a464ca60be1e0aeba805a192be0ee81e5

This code compiles, but I haven't even tried to boot it.  The earlier
part of the series isn't terribly interesting -- it's a handful of
cleanups that remove all reads of ->active_mm from arch/x86.  I've
been meaning to do that for a while, and now I did it.  But, with
that done, I think we can move to a totally different lazy mm refcounting
model.

So this patch is a mockup of what this could look like.  The algorithm
involves no atomics at all in the context switch path except for a
single atomic_long_xchg() of a percpu variable when going from lazy
mode to nonlazy mode.  Even that could be optimized -- I suspect it could
be replaced with non-atomic code if mm_users > 0.  Instead, on mm exit,
there's a single loop over all CPUs on which that mm could be lazily
loaded that atomic_long_cmpxchg_relaxed()'s a remote percpu variable to
tell the CPU to kindly mmdrop() the mm when it reschedules.  All cpus
that don't have the mm lazily loaded are ignored because they can't
have lazy references, and no cpus can gain lazy references after
mm_users hits zero.  (I need to verify that all the relevant barriers
are in place.  I suspect that they are on x86 but that I'm short an
smp_mb() on arches for which _relaxed atomics are genuinely relaxed.)

Here's how I think it fits with various arches:

x86: On bare metal (i.e. paravirt flush unavailable), the loop won't do
much.  The existing TLB shootdown when user tables are freed will
empty mm_cpumask of everything but the calling CPU.  So x86 ends up
pretty close to as good as we can get short of reworking mm_cpumask() itself.

arm64: with s/for_each_cpu/for_each_online_cpu, I think it will give
good performance.  The mmgrab()/mmdrop() overhead goes away, and,
on smallish systems, the cost of the loop should be low.

power: same as ARM, except that the loop may be rather larger since
the systems are bigger.  But I imagine it's still faster than Nick's
approach -- a cmpxchg to a remote cacheline should still be faster than
an IPI shootdown.  (Nick, don't benchmark this patch -- at least the
mm_users optimization mentioned above should be done, but also the
mmgrab() and mmdrop() aren't actually removed.)

Other arches: I don't know.  Further research is required.

What do you all think?


As mentioned, there are several things blatantly wrong with this patch:

The coding style is not up to kernel standards.  I have prototypes in the
wrong places and other hacks.

mms are likely to be freed with IRQs off.  I think this is safe, but it's
suboptimal.

This whole thing is in arch/x86.  The core algorithm ought to move outside
arch/, but doing so without making a mess might take some thought.  It
doesn't help that different architectures have very different ideas
of what mm_cpumask() does.

Finally, this patch has no benefit by itself.  I didn't remove the
active_mm refcounting, so the big benefit of removing mmgrab() and
mmdrop() calls on transitions to and from lazy mode isn't there.
There is no point at all in benchmarking this patch as is.  That
being said, getting rid of the active_mm refcounting shouldn't be
so hard, since x86 (in this tree) no longer uses active_mm at all.

I should contemplate whether the calling CPU is special in
arch_fixup_lazy_mm_refs().  On a very very quick think, it's not, but
it needs more thought.

Signed-off-by: Andy Lutomirski 

 arch/x86/include/asm/tlbflush.h | 20 
 arch/x86/mm/tlb.c   | 81 +++--
 kernel/fork.c   |  5 ++
 3 files changed, 101 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 8c87a2e0b660..efcd4f125f2c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -124,6 +124,26 @@ struct tlb_state {
 */
unsigned short user_pcid_flush_mask;
 
+   /*
+* Lazy mm tracking.
+*
+*  - If this is NULL, it means that any mm_struct referenced by
+*this CPU is kept alive by a real reference count.
+*
+*  - If this is nonzero but the low bit is clear, it points to
+*an mm_struct that must not be freed out from under this
+*CPU.
+*
+*  - If the low bit is set, it still points to an mm_struct,
+*but some other CPU has mmgrab()'d it on our behalf, and we
+*must mmdrop() it when we're done with it.
+*
+* See lazy_mm_grab() and related functions for the precise
+* access rules.
+*/
+   atomic_long_t   lazy_mm;
+
+
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
 * disabling inte

Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-12-02 Thread Andy Lutomirski
On Tue, Dec 1, 2020 at 6:50 PM Nicholas Piggin  wrote:
>
> Excerpts from Andy Lutomirski's message of November 29, 2020 3:55 am:
> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> >>
> >> And get rid of the generic sync_core_before_usermode facility. This is
> >> functionally a no-op in the core scheduler code, but it also catches
> >>
> >> This helper is the wrong way around I think. The idea that membarrier
> >> state requires a core sync before returning to user is the easy one
> >> that does not need hiding behind membarrier calls. The gap in core
> >> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> >> tricky detail that is better put in x86 lazy tlb code.
> >>
> >> Consider if an arch did not synchronize core in switch_mm either, then
> >> membarrier_mm_sync_core_before_usermode would be in the wrong place
> >> but arch specific mmu context functions would still be the right place.
> >> There is also a exit_lazy_tlb case that is not covered by this call, which
> >> could be a bugs (kthread use mm the membarrier process's mm then context
> >> switch back to the process without switching mm or lazy mm switch).
> >>
> >> This makes lazy tlb code a bit more modular.
> >
> > I have a couple of membarrier fixes that I want to send out today or
> > tomorrow, and they might eliminate the need for this patch.  Let me
> > think about this a little bit.  I'll cc you.  The existing code is way
> > to subtle and the comments are far too confusing for me to be quickly
> > confident about any of my conclusions :)
> >
>
> Thanks for the head's up. I'll have to have a better look through them
> but I don't know that it eliminates the need for this entirely although
> it might close some gaps and make this not a bug fix. The problem here
> is x86 code wanted something to be called when a lazy mm is unlazied,
> but it missed some spots and also the core scheduler doesn't need to
> know about those x86 details if it has this generic call that annotates
> the lazy handling better.

I'll send v3 tomorrow.  They add more sync_core_before_usermode() callers.

Having looked at your patches a bunch and the membarrier code a bunch,
I don't think I like the approach of pushing this logic out into new
core functions called by arch code.  Right now, even with my
membarrier patches applied, understanding how (for example) the x86
switch_mm_irqs_off() plus the scheduler code provides the barriers
that membarrier needs is quite complicated, and it's not clear to me
that the code is even correct.  At the very least I'm pretty sure that
the x86 comments are misleading.  With your patches, someone trying to
audit the code would have to follow core code calling into arch code
and back out into core code to figure out what's going on.  I think
the result is worse.

I wrote this incomplete patch which takes the opposite approach (sorry
for whitespace damage):

commit 928b5c60e93f475934892d6e0b357ebf0a2bf87d
Author: Andy Lutomirski 
Date:   Wed Dec 2 17:24:02 2020 -0800

[WIP] x86/mm: Handle unlazying membarrier core sync in the arch code

The core scheduler isn't a great place for
membarrier_mm_sync_core_before_usermode() -- the core scheduler
doesn't actually know whether we are lazy.  With the old code, if a
CPU is running a membarrier-registered task, goes idle, gets unlazied
via a TLB shootdown IPI, and switches back to the
membarrier-registered task, it will do an unnecessary core sync.

Conveniently, x86 is the only architecture that does anything in this
hook, so we can just move the code.

XXX: actually delete the old code.

Signed-off-by: Andy Lutomirski 

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3338a1feccf9..e27300fc865b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -496,6 +496,8 @@ void switch_mm_irqs_off(struct mm_struct *prev,
struct mm_struct *next,
  * from one thread in a process to another thread in the same
  * process. No TLB flush required.
  */
+
+// XXX: why is this okay wrt membarrier?
 if (!was_lazy)
 return;

@@ -508,12 +510,24 @@ void switch_mm_irqs_off(struct mm_struct *prev,
struct mm_struct *next,
 smp_mb();
 next_tlb_gen = atomic64_read(&next->context.tlb_gen);
 if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
-next_tlb_gen)
+next_tlb_gen) {
+/*
+ * We're reactivating an mm, and membarrier might
+ * need to serialize.  Tell membarrier.
+ */
+
+// XXX: I can't understand the logic in
+// memba

Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-02 Thread Andy Lutomirski
> On Dec 1, 2020, at 7:47 PM, Nicholas Piggin  wrote:
>
> Excerpts from Andy Lutomirski's message of December 1, 2020 4:31 am:
>> other arch folk: there's some background here:
>>
>> https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
>>
>>> On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
>>>
>>> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
>>>>
>>>> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>>>>>
>>>>> On big systems, the mm refcount can become highly contented when doing
>>>>> a lot of context switching with threaded applications (particularly
>>>>> switching between the idle thread and an application thread).
>>>>>
>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme
>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>> any remaining lazy ones.
>>>>>
>>>>> Shootdown IPIs are some concern, but they have not been observed to be
>>>>> a big problem with this scheme (the powerpc implementation generated
>>>>> 314 additional interrupts on a 144 CPU system during a kernel compile).
>>>>> There are a number of strategies that could be employed to reduce IPIs
>>>>> if they turn out to be a problem for some workload.
>>>>
>>>> I'm still wondering whether we can do even better.
>>>>
>>>
>>> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
>>> the TLB.  On x86, this will shoot down all lazies as long as even a
>>> single pagetable was freed.  (Or at least it will if we don't have a
>>> serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
>>> sets tlb->freed_tables, which will trigger the IPI.)  So, on
>>> architectures like x86, the shootdown approach should be free.  The
>>> only way it ought to have any excess IPIs is if we have CPUs in
>>> mm_cpumask() that don't need IPI to free pagetables, which could
>>> happen on paravirt.
>>
>> Indeed, on x86, we do this:
>>
>> [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
>> [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
>> [   11.561068]  exit_mmap+0xc8/0x1a0
>> [   11.561932]  mmput+0x29/0xd0
>> [   11.562688]  do_exit+0x316/0xa90
>> [   11.563588]  do_group_exit+0x34/0xb0
>> [   11.564476]  __x64_sys_exit_group+0xf/0x10
>> [   11.565512]  do_syscall_64+0x34/0x50
>>
>> and we have info->freed_tables set.
>>
>> What are the architectures that have large systems like?
>>
>> x86: we already zap lazies, so it should cost basically nothing to do
>
> This is not zapping lazies, this is freeing the user page tables.
>
> "lazy mm" is where a switch to a kernel thread takes on the
> previous mm for its kernel mapping rather than switch to init_mm.

The intent of the code is to flush the TLB after freeing user page
tables, but, on bare metal, lazies get zapped as a side effect.

Anyway, I'm going to send out a mockup of an alternative approach shortly.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-02 Thread Andy Lutomirski


> On Dec 2, 2020, at 6:20 AM, Peter Zijlstra  wrote:
> 
> On Sun, Nov 29, 2020 at 02:01:39AM +1000, Nicholas Piggin wrote:
>> + * - A delayed freeing and RCU-like quiescing sequence based on
>> + *   mm switching to avoid IPIs completely.
> 
> That one's interesting too. so basically you want to count switch_mm()
> invocations on each CPU. Then, periodically snapshot the counter on each
> CPU, and when they've all changed, increment a global counter.
> 
> Then, you snapshot the global counter and wait for it to increment
> (twice I think, the first increment might already be in progress).
> 
> The only question here is what should drive this machinery.. the tick
> probably.
> 
> This shouldn't be too hard to do I think.
> 
> Something a little like so perhaps?

I don’t think this will work.  A CPU can go idle with lazy mm and nohz forever. 
 This could lead to unbounded memory use on a lightly loaded system.



Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-12-01 Thread Andy Lutomirski
On Tue, Dec 1, 2020 at 1:28 PM Will Deacon  wrote:
>
> On Mon, Nov 30, 2020 at 10:31:51AM -0800, Andy Lutomirski wrote:
> > other arch folk: there's some background here:
> >
> > https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com
> >
> > On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
> > >
> > > On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> > > >
> > > > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  
> > > > wrote:
> > > > >
> > > > > On big systems, the mm refcount can become highly contented when doing
> > > > > a lot of context switching with threaded applications (particularly
> > > > > switching between the idle thread and an application thread).
> > > > >
> > > > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > > > user->idle->user cases, so so instead implement a non-refcounted 
> > > > > scheme
> > > > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot 
> > > > > down
> > > > > any remaining lazy ones.
> > > > >
> > > > > Shootdown IPIs are some concern, but they have not been observed to be
> > > > > a big problem with this scheme (the powerpc implementation generated
> > > > > 314 additional interrupts on a 144 CPU system during a kernel 
> > > > > compile).
> > > > > There are a number of strategies that could be employed to reduce IPIs
> > > > > if they turn out to be a problem for some workload.
> > > >
> > > > I'm still wondering whether we can do even better.
> > > >
> > >
> > > Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> > > the TLB.  On x86, this will shoot down all lazies as long as even a
> > > single pagetable was freed.  (Or at least it will if we don't have a
> > > serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> > > sets tlb->freed_tables, which will trigger the IPI.)  So, on
> > > architectures like x86, the shootdown approach should be free.  The
> > > only way it ought to have any excess IPIs is if we have CPUs in
> > > mm_cpumask() that don't need IPI to free pagetables, which could
> > > happen on paravirt.
> >
> > Indeed, on x86, we do this:
> >
> > [   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
> > [   11.559905]  tlb_finish_mmu+0x10e/0x1a0
> > [   11.561068]  exit_mmap+0xc8/0x1a0
> > [   11.561932]  mmput+0x29/0xd0
> > [   11.562688]  do_exit+0x316/0xa90
> > [   11.563588]  do_group_exit+0x34/0xb0
> > [   11.564476]  __x64_sys_exit_group+0xf/0x10
> > [   11.565512]  do_syscall_64+0x34/0x50
> >
> > and we have info->freed_tables set.
> >
> > What are the architectures that have large systems like?
> >
> > x86: we already zap lazies, so it should cost basically nothing to do
> > a little loop at the end of __mmput() to make sure that no lazies are
> > left.  If we care about paravirt performance, we could implement one
> > of the optimizations I mentioned above to fix up the refcounts instead
> > of sending an IPI to any remaining lazies.
> >
> > arm64: AFAICT arm64's flush uses magic arm64 hardware support for
> > remote flushes, so any lazy mm references will still exist after
> > exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
> > the x86 paravirt case.  Are there large enough arm64 systems that any
> > of this matters?
>
> Yes, there are large arm64 systems where performance of TLB invalidation
> matters, but they're either niche (supercomputers) or not readily available
> (NUMA boxes).
>
> But anyway, we blow away the TLB for everybody in tlb_finish_mmu() after
> freeing the page-tables. We have an optimisation to avoid flushing if
> we're just unmapping leaf entries when the mm is going away, but we don't
> have a choice once we get to actually reclaiming the page-tables.
>
> One thing I probably should mention, though, is that we don't maintain
> mm_cpumask() because we're not able to benefit from it and the atomic
> update is a waste of time.

Do you do anything special for lazy TLB or do you just use the generic
code?  (i.e. where do your user pagetables point when you go from a
user task to idle or to a kernel thread?)

Do you end up with all cpus set in mm_cpumask or can you have the mm
loaded on a CPU that isn't in mm_cpumask?

--Andy

>
> Will


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-30 Thread Andy Lutomirski
other arch folk: there's some background here:

https://lkml.kernel.org/r/calcetrvxube8lfnn-qs+dzroqaiw+sfug1j047ybyv31sat...@mail.gmail.com

On Sun, Nov 29, 2020 at 12:16 PM Andy Lutomirski  wrote:
>
> On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
> >
> > On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> > >
> > > On big systems, the mm refcount can become highly contented when doing
> > > a lot of context switching with threaded applications (particularly
> > > switching between the idle thread and an application thread).
> > >
> > > Abandoning lazy tlb slows switching down quite a bit in the important
> > > user->idle->user cases, so so instead implement a non-refcounted scheme
> > > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > > any remaining lazy ones.
> > >
> > > Shootdown IPIs are some concern, but they have not been observed to be
> > > a big problem with this scheme (the powerpc implementation generated
> > > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > > There are a number of strategies that could be employed to reduce IPIs
> > > if they turn out to be a problem for some workload.
> >
> > I'm still wondering whether we can do even better.
> >
>
> Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
> the TLB.  On x86, this will shoot down all lazies as long as even a
> single pagetable was freed.  (Or at least it will if we don't have a
> serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
> sets tlb->freed_tables, which will trigger the IPI.)  So, on
> architectures like x86, the shootdown approach should be free.  The
> only way it ought to have any excess IPIs is if we have CPUs in
> mm_cpumask() that don't need IPI to free pagetables, which could
> happen on paravirt.

Indeed, on x86, we do this:

[   11.558844]  flush_tlb_mm_range.cold+0x18/0x1d
[   11.559905]  tlb_finish_mmu+0x10e/0x1a0
[   11.561068]  exit_mmap+0xc8/0x1a0
[   11.561932]  mmput+0x29/0xd0
[   11.562688]  do_exit+0x316/0xa90
[   11.563588]  do_group_exit+0x34/0xb0
[   11.564476]  __x64_sys_exit_group+0xf/0x10
[   11.565512]  do_syscall_64+0x34/0x50

and we have info->freed_tables set.

What are the architectures that have large systems like?

x86: we already zap lazies, so it should cost basically nothing to do
a little loop at the end of __mmput() to make sure that no lazies are
left.  If we care about paravirt performance, we could implement one
of the optimizations I mentioned above to fix up the refcounts instead
of sending an IPI to any remaining lazies.

arm64: AFAICT arm64's flush uses magic arm64 hardware support for
remote flushes, so any lazy mm references will still exist after
exit_mmap().  (arm64 uses lazy TLB, right?)  So this is kind of like
the x86 paravirt case.  Are there large enough arm64 systems that any
of this matters?

s390x: The code has too many acronyms for me to understand it fully,
but I think it's more or less the same situation as arm64.  How big do
s390x systems come?

power: Ridiculously complicated, seems to vary by system and kernel config.

So, Nick, your unconditional IPI scheme is apparently a big
improvement for power, and it should be an improvement and have low
cost for x86.  On arm64 and s390x it will add more IPIs on process
exit but reduce contention on context switching depending on how lazy
TLB works.  I suppose we could try it for all architectures without
any further optimizations.  Or we could try one of the perhaps
excessively clever improvements I linked above.  arm64, s390x people,
what do you think?


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-29 Thread Andy Lutomirski
On Sat, Nov 28, 2020 at 7:54 PM Andy Lutomirski  wrote:
>
> On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
> >
> > On big systems, the mm refcount can become highly contented when doing
> > a lot of context switching with threaded applications (particularly
> > switching between the idle thread and an application thread).
> >
> > Abandoning lazy tlb slows switching down quite a bit in the important
> > user->idle->user cases, so so instead implement a non-refcounted scheme
> > that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> > any remaining lazy ones.
> >
> > Shootdown IPIs are some concern, but they have not been observed to be
> > a big problem with this scheme (the powerpc implementation generated
> > 314 additional interrupts on a 144 CPU system during a kernel compile).
> > There are a number of strategies that could be employed to reduce IPIs
> > if they turn out to be a problem for some workload.
>
> I'm still wondering whether we can do even better.
>

Hold on a sec.. __mmput() unmaps VMAs, frees pagetables, and flushes
the TLB.  On x86, this will shoot down all lazies as long as even a
single pagetable was freed.  (Or at least it will if we don't have a
serious bug, but the code seems okay.  We'll hit pmd_free_tlb, which
sets tlb->freed_tables, which will trigger the IPI.)  So, on
architectures like x86, the shootdown approach should be free.  The
only way it ought to have any excess IPIs is if we have CPUs in
mm_cpumask() that don't need IPI to free pagetables, which could
happen on paravirt.

Can you try to figure out why you saw any increase in IPIs?  It would
be nice if we can make the new code unconditional.


Re: [PATCH 6/8] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-11-28 Thread Andy Lutomirski
On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>
> On big systems, the mm refcount can become highly contented when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
>
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
>
> Shootdown IPIs are some concern, but they have not been observed to be
> a big problem with this scheme (the powerpc implementation generated
> 314 additional interrupts on a 144 CPU system during a kernel compile).
> There are a number of strategies that could be employed to reduce IPIs
> if they turn out to be a problem for some workload.

I'm still wondering whether we can do even better.

The IPIs you're doing aren't really necessary -- we don't
fundamentally need to free the pagetables immediately when all
non-lazy users are done with them (and current kernels don't) -- what
we need to do is to synchronize all the bookkeeping.  So, with
adequate locking (famous last words), a couple of alternative schemes
ought to be possible.

a) Instead of sending an IPI, increment mm_count on behalf of the
remote CPU and do something to make sure that the remote CPU knows we
did this on its behalf.  Then free the mm when mm_count hits zero.

b) Treat mm_cpumask as part of the refcount.  Add one to mm_count when
an mm is created.  Once mm_users hits zero, whoever clears the last
bit in mm_cpumask is responsible for decrementing a single reference
from mm_count, and whoever sets it to zero frees the mm.

Version (b) seems fairly straightforward to implement -- add RCU
protection and an atomic_t special_ref_cleared (initially 0) to struct
mm_struct itself.  After anyone clears a bit to mm_cpumask (which is
already a barrier), they read mm_users.  If it's zero, then they scan
mm_cpumask and see if it's empty.  If it is, they atomically swap
special_ref_cleared to 1.  If it was zero before the swap, they do
mmdrop().  I can imagine some tweaks that could make this a bit
faster, at least in the limit of a huge number of CPUs.
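
A minimal sketch of (b), just to make the ordering concrete.  The
special_ref_cleared field is invented for this illustration, the RCU
protection that keeps mm valid during the scan is omitted, and I'm taking
the "clearing the bit is already a barrier" claim above at face value:

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/sched/mm.h>

/* Called when a CPU stops (lazily) using mm and clears its cpumask bit. */
static void lazy_mm_clear_cpu_sketch(struct mm_struct *mm, int cpu)
{
	cpumask_clear_cpu(cpu, mm_cpumask(mm));	/* assumed to be a barrier */

	if (atomic_read(&mm->mm_users))
		return;			/* real users still exist */

	if (!cpumask_empty(mm_cpumask(mm)))
		return;			/* some other CPU still has a bit set */

	/* Whoever flips special_ref_cleared 0 -> 1 drops the extra mm_count ref. */
	if (atomic_xchg(&mm->special_ref_cleared, 1) == 0)
		mmdrop(mm);
}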

Version (a) seems a bit harder to reason about.  Maybe it could be
done like this.  Add a percpu variable mm_with_extra_count.  This
variable can be NULL, but it can also be an mm that has an extra
reference on behalf of the cpu in question.

__mmput scans mm_cpumask and, for each cpu in the mask, mmgrabs the mm
and cmpxchgs that cpu's mm_with_extra_count from NULL to mm.  If it
succeeds, then we win.  If it fails, further thought is required, and
maybe we have to send an IPI, although maybe some other cleverness is
possible.  Any time a CPU switches mms, it atomically swaps
mm_with_extra_count to NULL and mmdrops whatever the mm was.  (Maybe
it needs to check the mm isn't equal to the new mm, although it would
be quite bizarre for this to happen.)  Other than these mmgrab and
mmdrop calls, the mm switching code doesn't mmgrab or mmdrop at all.


Version (a) seems like it could have excellent performance.
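
Here's roughly what (a) could look like.  mm_with_extra_count and both
function names are invented, and the cmpxchg-failure path is left as the
hand-wavy "further thought required" case it is above:

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/percpu.h>
#include <linux/sched/mm.h>

static DEFINE_PER_CPU(struct mm_struct *, mm_with_extra_count);

/* __mmput() path: donate a reference to each CPU that might be lazy. */
static void donate_lazy_refs_sketch(struct mm_struct *mm)
{
	int cpu;

	for_each_cpu(cpu, mm_cpumask(mm)) {
		struct mm_struct **slot = per_cpu_ptr(&mm_with_extra_count, cpu);

		mmgrab(mm);	/* reference held on the remote CPU's behalf */
		if (cmpxchg(slot, NULL, mm) != NULL) {
			/* Slot already in use: IPI or other cleverness needed. */
			mmdrop(mm);
		}
	}
}

/* mm switch path: pay off whatever reference was donated to this CPU. */
static void settle_donated_ref_sketch(void)
{
	struct mm_struct *mm = xchg(this_cpu_ptr(&mm_with_extra_count), NULL);

	if (mm)
		mmdrop(mm);
}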


*However*, I think we should consider whether we want to do something
even bigger first.  Even with any of these changes, we still need to
maintain mm_cpumask(), and that itself can be a scalability problem.
I wonder if we can solve this problem too.  Perhaps the switch_mm()
paths could only ever set mm_cpumask bits, and anyone who would send
an IPI because a bit is set in mm_cpumask would first check some
percpu variable (cpu_rq(cpu)->something?  an entirely new variable) to
see if the bit in mm_cpumask is spurious.  Or perhaps mm_cpumask could
be split up across multiple cachelines, one per node.
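
For the first of those ideas, the filter could be as small as this (all
names invented):

#include <linux/mm_types.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct mm_struct *, really_loaded_mm);

/*
 * switch_mm() would only ever *set* bits in mm_cpumask() and keep
 * really_loaded_mm up to date; the IPI sender uses this to skip CPUs
 * whose mm_cpumask bit has gone stale.
 */
static bool mm_cpumask_bit_is_live(int cpu, struct mm_struct *mm)
{
	return READ_ONCE(per_cpu(really_loaded_mm, cpu)) == mm;
}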

We should keep the recent lessons from Apple in mind, though: x86 is a
dinosaur.  The future of atomics is going to look a lot more like
ARM's LSE than x86's rather anemic set.  This means that mm_cpumask
operations won't need to be full barriers forever, and we might not
want to take the implied full barriers in set_bit() and clear_bit()
for granted.

--Andy


Re: [PATCH 1/8] lazy tlb: introduce exit_lazy_tlb

2020-11-28 Thread Andy Lutomirski
On Sat, Nov 28, 2020 at 8:01 AM Nicholas Piggin  wrote:
>
> This is called at points where a lazy mm is switched away or made not
> lazy (by its owner switching back).
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/arm/mach-rpc/ecard.c|  1 +
>  arch/powerpc/mm/book3s64/radix_tlb.c |  1 +
>  fs/exec.c|  6 --
>  include/asm-generic/mmu_context.h| 21 +
>  kernel/kthread.c |  1 +
>  kernel/sched/core.c  |  2 ++
>  6 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
> index 827b50f1c73e..43eb1bfba466 100644
> --- a/arch/arm/mach-rpc/ecard.c
> +++ b/arch/arm/mach-rpc/ecard.c
> @@ -253,6 +253,7 @@ static int ecard_init_mm(void)
> current->mm = mm;
> current->active_mm = mm;
> activate_mm(active_mm, mm);
> +   exit_lazy_tlb(active_mm, current);
> mmdrop(active_mm);
> ecard_init_pgtables(mm);
> return 0;
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
> b/arch/powerpc/mm/book3s64/radix_tlb.c
> index b487b489d4b6..ac3fec03926a 100644
> --- a/arch/powerpc/mm/book3s64/radix_tlb.c
> +++ b/arch/powerpc/mm/book3s64/radix_tlb.c
> @@ -661,6 +661,7 @@ static void do_exit_flush_lazy_tlb(void *arg)
> mmgrab(&init_mm);
> current->active_mm = &init_mm;
> switch_mm_irqs_off(mm, &init_mm, current);
> +   exit_lazy_tlb(mm, current);
> mmdrop(mm);
> }
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 547a2390baf5..4b4dea1bb7ba 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1017,6 +1017,8 @@ static int exec_mmap(struct mm_struct *mm)
> if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> local_irq_enable();
> activate_mm(active_mm, mm);
> +   if (!old_mm)
> +   exit_lazy_tlb(active_mm, tsk);
> if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
> local_irq_enable();
> tsk->mm->vmacache_seqnum = 0;
> @@ -1028,9 +1030,9 @@ static int exec_mmap(struct mm_struct *mm)
> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> mm_update_next_owner(old_mm);
> mmput(old_mm);
> -   return 0;
> +   } else {
> +   mmdrop(active_mm);
> }
> -   mmdrop(active_mm);

This looks like an unrelated change.

> return 0;
>  }
>
> diff --git a/include/asm-generic/mmu_context.h 
> b/include/asm-generic/mmu_context.h
> index 91727065bacb..4626d0020e65 100644
> --- a/include/asm-generic/mmu_context.h
> +++ b/include/asm-generic/mmu_context.h
> @@ -24,6 +24,27 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
>  }
>  #endif
>
> +/*
> + * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm.
> + *
> + * mm:  the lazy mm context that was switched
> + * tsk: the task that was switched to (with a non-lazy mm)
> + *
> + * mm may equal tsk->mm.
> + * mm and tsk->mm will not be NULL.
> + *
> + * Note this is not symmetrical to enter_lazy_tlb, this is not
> + * called when tasks switch into the lazy mm, it's called after the
> + * lazy mm becomes non-lazy (either switched to a different mm or the
> + * owner of the mm returns).
> + */
> +#ifndef exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm,

Maybe name this parameter prev_lazy_mm?


Re: [PATCH 5/8] lazy tlb: allow lazy tlb mm switching to be configurable

2020-11-28 Thread Andy Lutomirski
On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>
> NOMMU systems could easily go without this and save a bit of code
> and the refcount atomics, because their mm switch is a no-op. I
> haven't flipped them over because haven't audited all arch code to
> convert over to using the _lazy_tlb refcounting.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/Kconfig | 11 +++
>  include/linux/sched/mm.h | 13 ++--
>  kernel/sched/core.c  | 68 +---
>  kernel/sched/sched.h |  4 ++-
>  4 files changed, 75 insertions(+), 21 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 56b6ccc0e32d..596bf589d74b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -430,6 +430,17 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>   irqs disabled over activate_mm. Architectures that do IPI based TLB
>   shootdowns should enable this.
>
> +# Should make this depend on MMU, because there is little use for lazy mm 
> switching
> +# with NOMMU. Must audit NOMMU architecture code for lazy mm refcounting 
> first.
> +config MMU_LAZY_TLB
> +   def_bool y
> +   help
> + Enable "lazy TLB" mmu context switching for kernel threads.
> +
> +config MMU_LAZY_TLB_REFCOUNT
> +   def_bool y
> +   depends on MMU_LAZY_TLB
> +

This could use some documentation as to what "no" means.

>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
> bool
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 7157c0f6fef8..bd0f27402d4b 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -51,12 +51,21 @@ static inline void mmdrop(struct mm_struct *mm)
>  /* Helpers for lazy TLB mm refcounting */
>  static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
>  {
> -   mmgrab(mm);
> +   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
> +   mmgrab(mm);
>  }
>
>  static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
>  {
> -   mmdrop(mm);
> +   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
> +   mmdrop(mm);
> +   } else {
> +   /*
> +* mmdrop_lazy_tlb must provide a full memory barrier, see the
> +* membarrier comment finish_task_switch.

"membarrier comment in finish_task_switch()", perhaps?

> +*/
> +   smp_mb();
> +   }
>  }
>
>  /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e372b613d514..3b79c6cc3a37 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3579,7 +3579,7 @@ static struct rq *finish_task_switch(struct task_struct 
> *prev)
> __releases(rq->lock)
>  {
> struct rq *rq = this_rq();
> -   struct mm_struct *mm = rq->prev_mm;
> +   struct mm_struct *mm = NULL;
> long prev_state;
>
> /*
> @@ -3598,7 +3598,10 @@ static struct rq *finish_task_switch(struct 
> task_struct *prev)
>   current->comm, current->pid, preempt_count()))
> preempt_count_set(FORK_PREEMPT_COUNT);
>
> -   rq->prev_mm = NULL;
> +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
> +   mm = rq->prev_lazy_mm;
> +   rq->prev_lazy_mm = NULL;
> +#endif
>
> /*
>  * A task struct has one reference for the use as "current".
> @@ -3630,6 +3633,8 @@ static struct rq *finish_task_switch(struct task_struct 
> *prev)
>  * rq->curr, before returning to userspace, for
>  * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by
>  * mmdrop_lazy_tlb().
> +*
> +* This same issue applies to other places that mmdrop_lazy_tlb().
>  */
> if (mm)
> mmdrop_lazy_tlb(mm);
> @@ -3719,22 +3724,10 @@ asmlinkage __visible void schedule_tail(struct 
> task_struct *prev)
> calculate_sigpending();
>  }
>
> -/*
> - * context_switch - switch to the new MM and the new thread's register state.
> - */
> -static __always_inline struct rq *
> -context_switch(struct rq *rq, struct task_struct *prev,
> -  struct task_struct *next, struct rq_flags *rf)
> +static __always_inline void
> +context_switch_mm(struct rq *rq, struct task_struct *prev,
> +  struct task_struct *next)
>  {
> -   prepare_task_switch(rq, prev, next);
> -
> -   /*
> -* For paravirt, this is coupled with an exit in switch_to to
> -* combine the page table reload and the switch backend into
> -* one hypercall.
> -*/
> -   arch_start_context_switch(prev);
> -
> /*
>  * kernel -> kernel   lazy + transfer active
>  *   user -> kernel   lazy + mmgrab_lazy_tlb() active
> @@ -3765,11 +3758,50 @@ context_switch(struct rq *rq, struct task_struct 
> *prev,
> if (!prev->mm) {// from kernel
> exit_lazy_tlb(prev->active_mm, next);
>
> +#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
> /* will mmdrop_lazy_tlb() in finish_task_switch

Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-11-28 Thread Andy Lutomirski
On Sat, Nov 28, 2020 at 8:02 AM Nicholas Piggin  wrote:
>
> And get rid of the generic sync_core_before_usermode facility. This is
> functionally a no-op in the core scheduler code, but it also catches
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
>
> This makes lazy tlb code a bit more modular.

I have a couple of membarrier fixes that I want to send out today or
tomorrow, and they might eliminate the need for this patch.  Let me
think about this a little bit.  I'll cc you.  The existing code is way
too subtle and the comments are far too confusing for me to be quickly
confident about any of my conclusions :)


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-22 Thread Andy Lutomirski



> On Sep 22, 2020, at 2:01 AM, Arnd Bergmann  wrote:
> 
> On Tue, Sep 22, 2020 at 9:59 AM Pavel Begunkov  
> wrote:
>>> On 22/09/2020 10:23, Arnd Bergmann wrote:
>>> On Tue, Sep 22, 2020 at 8:32 AM Pavel Begunkov  
>>> wrote:
>>>> On 22/09/2020 03:58, Andy Lutomirski wrote:
>>>>> On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov  
>>>>> wrote:
>>>>> I may be looking at a different kernel than you, but aren't you
>>>>> preventing creating an io_uring regardless of whether SQPOLL is
>>>>> requested?
>>>> 
>>>> I diffed a not-saved file on a sleepy head, thanks for noticing.
>>>> As you said, there should be an SQPOLL check.
>>>> 
>>>> ...
>>>> if (ctx->compat && (p->flags & IORING_SETUP_SQPOLL))
>>>>goto err;
>>> 
>>> Wouldn't that mean that now 32-bit containers behave differently
>>> between compat and native execution?
>>> 
>>> I think if you want to prevent 32-bit applications from using SQPOLL,
>>> it needs to be done the same way on both to be consistent:
>> 
>> The intention was to disable only compat not native 32-bit.
> 
> I'm not following why that would be considered a valid option,
> as that clearly breaks existing users that update from a 32-bit
> kernel to a 64-bit one.
> 
> Taking away the features from users that are still on 32-bit kernels
> already seems questionable to me, but being inconsistent
> about it seems much worse, in particular when the regression
> is on the upgrade path.
> 
>>> Can we expect all existing and future user space to have a sane
>>> fallback when IORING_SETUP_SQPOLL fails?
>> 
>> SQPOLL has a few differences with non-SQPOLL modes, but it's easy
>> to convert between them. Anyway, SQPOLL is a privileged special
>> case that's here for performance/latency reasons, I don't think
>> there will be any non-accidental users of it.
> 
> Ok, so the behavior of 32-bit tasks would be the same as running
> the same application as unprivileged 64-bit tasks, with applications
> already having to implement that fallback, right?
> 
> 

I don’t have any real preference wrt SQPOLL, and it may be that we have a 
problem even without SQPOLL when IO gets punted without one of the fixes 
discussed.

But banning the mismatched io_uring and io_uring_enter seems like it may be 
worthwhile regardless.

Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-21 Thread Andy Lutomirski
On Mon, Sep 21, 2020 at 5:24 PM Pavel Begunkov  wrote:
>
>
>
> On 22/09/2020 02:51, Andy Lutomirski wrote:
> > On Mon, Sep 21, 2020 at 9:15 AM Pavel Begunkov  
> > wrote:
> >>
> >> On 21/09/2020 19:10, Pavel Begunkov wrote:
> >>> On 20/09/2020 01:22, Andy Lutomirski wrote:
> >>>>
> >>>>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann  wrote:
> >>>>>
> >>>>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski  
> >>>>> wrote:
> >>>>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig  wrote:
> >>>>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> >>>>>>>> Said that, why not provide a variant that would take an explicit
> >>>>>>>> "is it compat" argument and use it there?  And have the normal
> >>>>>>>> one pass in_compat_syscall() to that...
> >>>>>>>
> >>>>>>> That would help to not introduce a regression with this series yes.
> >>>>>>> But it wouldn't fix existing bugs when io_uring is used to access
> >>>>>>> read or write methods that use in_compat_syscall().  One example that
> >>>>>>> I recently ran into is drivers/scsi/sg.c.
> >>>>>
> >>>>> Ah, so reading /dev/input/event* would suffer from the same issue,
> >>>>> and that one would in fact be broken by your patch in the hypothetical
> >>>>> case that someone tried to use io_uring to read /dev/input/event on 
> >>>>> x32...
> >>>>>
> >>>>> For reference, I checked the socket timestamp handling that has a
> >>>>> number of corner cases with time32/time64 formats in compat mode,
> >>>>> but none of those appear to be affected by the problem.
> >>>>>
> >>>>>> Aside from the potentially nasty use of per-task variables, one thing
> >>>>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
> >>>>>> going to have a generic mechanism for this, shouldn't we allow a full
> >>>>>> override of the syscall arch instead of just allowing forcing compat
> >>>>>> so that a compat syscall can do a non-compat operation?
> >>>>>
> >>>>> The only reason it's needed here is that the caller is in a kernel
> >>>>> thread rather than a system call. Are there any possible scenarios
> >>>>> where one would actually need the opposite?
> >>>>>
> >>>>
> >>>> I can certainly imagine needing to force x32 mode from a kernel thread.
> >>>>
> >>>> As for the other direction: what exactly are the desired bitness/arch 
> >>>> semantics of io_uring?  Is the operation bitness chosen by the io_uring 
> >>>> creation or by the io_uring_enter() bitness?
> >>>
> >>> It's rather the second one. Even though AFAIR it wasn't discussed
> >>> specifically, that how it works now (_partially_).
> >>
> >> Double checked -- I'm wrong, that's the former one. Most of it is based
> >> on a flag that was set an creation.
> >>
> >
> > Could we get away with making io_uring_enter() return -EINVAL (or
> > maybe -ENOTTY?) if you try to do it with bitness that doesn't match
> > the io_uring?  And disable SQPOLL in compat mode?
>
> Something like below. If PF_FORCE_COMPAT or any other solution
> doesn't lend by the time, I'll take a look whether other io_uring's
> syscalls need similar checks, etc.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 0458f02d4ca8..aab20785fa9a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -8671,6 +8671,10 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, 
> to_submit,
> if (ctx->flags & IORING_SETUP_R_DISABLED)
> goto out;
>
> +   ret = -EINVAl;
> +   if (ctx->compat != in_compat_syscall())
> +   goto out;
> +

This seems entirely reasonable to me.  Sharing an io_uring ring
between programs with different ABIs seems a bit nutty.

> /*
>  * For SQ polling, the thread will do all submissions and completions.
>  * Just return the requested submit count, and wake the thread if
> @@ -9006,6 +9010,10 @@ static int io_uring_create(unsigned entries, struct 
> io_uring_params *p,
> if (ret)
> goto err;
>
> +   ret = -EINVAL;
> +   if (ctx->compat)
> +   goto err;
> +

I may be looking at a different kernel than you, but aren't you
preventing creating an io_uring regardless of whether SQPOLL is
requested?

> /* Only gets the ring fd, doesn't install it in the file table */
> fd = io_uring_get_fd(ctx, &file);
> if (fd < 0) {
> --
> Pavel Begunkov


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-21 Thread Andy Lutomirski
On Mon, Sep 21, 2020 at 9:15 AM Pavel Begunkov  wrote:
>
> On 21/09/2020 19:10, Pavel Begunkov wrote:
> > On 20/09/2020 01:22, Andy Lutomirski wrote:
> >>
> >>> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann  wrote:
> >>>
> >>> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski  wrote:
> >>>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig  wrote:
> >>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> >>>>>> Said that, why not provide a variant that would take an explicit
> >>>>>> "is it compat" argument and use it there?  And have the normal
> >>>>>> one pass in_compat_syscall() to that...
> >>>>>
> >>>>> That would help to not introduce a regression with this series yes.
> >>>>> But it wouldn't fix existing bugs when io_uring is used to access
> >>>>> read or write methods that use in_compat_syscall().  One example that
> >>>>> I recently ran into is drivers/scsi/sg.c.
> >>>
> >>> Ah, so reading /dev/input/event* would suffer from the same issue,
> >>> and that one would in fact be broken by your patch in the hypothetical
> >>> case that someone tried to use io_uring to read /dev/input/event on x32...
> >>>
> >>> For reference, I checked the socket timestamp handling that has a
> >>> number of corner cases with time32/time64 formats in compat mode,
> >>> but none of those appear to be affected by the problem.
> >>>
> >>>> Aside from the potentially nasty use of per-task variables, one thing
> >>>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
> >>>> going to have a generic mechanism for this, shouldn't we allow a full
> >>>> override of the syscall arch instead of just allowing forcing compat
> >>>> so that a compat syscall can do a non-compat operation?
> >>>
> >>> The only reason it's needed here is that the caller is in a kernel
> >>> thread rather than a system call. Are there any possible scenarios
> >>> where one would actually need the opposite?
> >>>
> >>
> >> I can certainly imagine needing to force x32 mode from a kernel thread.
> >>
> >> As for the other direction: what exactly are the desired bitness/arch 
> >> semantics of io_uring?  Is the operation bitness chosen by the io_uring 
> >> creation or by the io_uring_enter() bitness?
> >
> > It's rather the second one. Even though AFAIR it wasn't discussed
> > specifically, that how it works now (_partially_).
>
> Double checked -- I'm wrong, that's the former one. Most of it is based
> on a flag that was set an creation.
>

Could we get away with making io_uring_enter() return -EINVAL (or
maybe -ENOTTY?) if you try to do it with bitness that doesn't match
the io_uring?  And disable SQPOLL in compat mode?

--Andy


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Andy Lutomirski
On Sun, Sep 20, 2020 at 12:23 PM Matthew Wilcox  wrote:
>
> On Sun, Sep 20, 2020 at 08:10:31PM +0100, Al Viro wrote:
> > IMO it's much saner to mark those and refuse to touch them from io_uring...
>
> Simpler solution is to remove io_uring from the 32-bit syscall list.
> If you're a 32-bit process, you don't get to use io_uring.  Would
> any real users actually care about that?

We could go one step farther and declare that we're done adding *any*
new compat syscalls :)



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Andy Lutomirski
On Sun, Sep 20, 2020 at 11:07 AM Al Viro  wrote:
>
> On Sun, Sep 20, 2020 at 04:15:10PM +0100, Matthew Wilcox wrote:
> > On Fri, Sep 18, 2020 at 02:45:25PM +0200, Christoph Hellwig wrote:
> > > Add a flag to force processing a syscall as a compat syscall.  This is
> > > required so that in_compat_syscall() works for I/O submitted by io_uring
> > > helper threads on behalf of compat syscalls.
> >
> > Al doesn't like this much, but my suggestion is to introduce two new
> > opcodes -- IORING_OP_READV32 and IORING_OP_WRITEV32.  The compat code
> > can translate IORING_OP_READV to IORING_OP_READV32 and then the core
> > code can know what that user pointer is pointing to.
>
> Let's separate two issues:
> 1) compat syscalls want 32bit iovecs.  Nothing to do with the
> drivers, dealt with just fine.
> 2) a few drivers are really fucked in head.  They use different
> *DATA* layouts for reads/writes, depending upon the calling process.
> IOW, if you fork/exec a 32bit binary and your stdin is one of those,
> reads from stdin in parent and child will yield different data layouts.
> On the same struct file.
> That's what Christoph worries about (/dev/sg he'd mentioned is
> one of those).
>
> IMO we should simply have that dozen or so of pathological files
> marked with FMODE_SHITTY_ABI; it's not about how they'd been opened -
> it describes the userland ABI provided by those.  And it's cast in stone.
>

I wonder if this is really quite cast in stone.  We could also have
FMODE_SHITTY_COMPAT and set that when a file like this is *opened* in
compat mode.  Then that particular struct file would be read and
written using the compat data format.  The change would be
user-visible, but the user that would see it would be very strange
indeed.

I don't have a strong opinion as to whether that is better or worse
than denying io_uring access to these things, but at least it moves
the special case out of io_uring.
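
A very rough sketch of that idea, just to make the shape concrete (the flag value,
the helper names, and wiring it up in ->open are all hypothetical; nothing below
exists today):

/* Hypothetical new f_mode bit; the value is made up and may collide. */
#define FMODE_SHITTY_COMPAT     ((__force fmode_t)0x40000000)

static int shitty_abi_open(struct inode *inode, struct file *filp)
{
        /*
         * Latch the opener's ABI; the I/O paths then honor this bit
         * instead of calling in_compat_syscall() at read/write time.
         */
        if (in_compat_syscall())
                filp->f_mode |= FMODE_SHITTY_COMPAT;

        return 0;
}

static bool file_wants_compat_layout(struct file *filp)
{
        return filp->f_mode & FMODE_SHITTY_COMPAT;
}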

--Andy


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-20 Thread Andy Lutomirski
On Sat, Sep 19, 2020 at 7:57 PM Al Viro  wrote:
>
> On Sat, Sep 19, 2020 at 05:14:41PM -0700, Andy Lutomirski wrote:
>
> > > 2) have you counted the syscalls that do and do not need that?
> >
> > No.
>
> Might be illuminating...
>
> > > 3) how many of those realistically *can* be unified with their
> > > compat counterparts?  [hint: ioctl(2) cannot]
> >
> > There would be no requirement to unify anything.  The idea is that
> > we'd get rid of all the global state flags.
>
> _What_ global state flags?  When you have separate SYSCALL_DEFINE3(ioctl...)
> and COMPAT_SYSCALL_DEFINE3(ioctl...), there's no flags at all, global or
> local.  They only come into the play when you try to share the same function
> for both, right on the top level.

...

>
> > For ioctl, we'd have a new file_operation:
> >
> > long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);
> >
> > I'm not saying this is easy, but I think it's possible and the result
> > would be more obviously correct than what we have now.
>
> No, it would not.  Seriously, from time to time a bit of RTFS before grand
> proposals turns out to be useful.

As one example, look at __sys_setsockopt().  It's called for the
native and compat versions, and it contains an in_compat_syscall()
check.  (This particular check looks dubious to me, but that's
another story.)  If this were to be done with equivalent semantics
without a separate COMPAT_DEFINE_SYSCALL and without
in_compat_syscall(), there would need to be some indication as to
whether this is compat or native setsockopt.  There are other
setsockopt implementations in the net stack with more
legitimate-seeming uses of in_compat_syscall() that would need some
other mechanism if in_compat_syscall() were to go away.

setsockopt is (I hope!) out of scope for io_uring, but the situation
isn't fundamentally different from read and write.


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Andy Lutomirski
On Sat, Sep 19, 2020 at 4:24 PM Al Viro  wrote:
>
> On Sat, Sep 19, 2020 at 03:53:40PM -0700, Andy Lutomirski wrote:
>
> > > It would not be a win - most of the syscalls don't give a damn
> > > about 32bit vs. 64bit...
> >
> > Any reasonable implementation would optimize it out for syscalls that don’t 
> > care.  Or it could be explicit:
> >
> > DEFINE_MULTIARCH_SYSCALL(...)
>
> 1) what would that look like?

In effect, it would work like this:

/* Arch-specific, but there's a generic case for sane architectures. */
enum syscall_arch {
  SYSCALL_NATIVE,
  SYSCALL_COMPAT,
  SYSCALL_X32,
};

DEFINE_MULTIARCH_SYSCALLn(args, arch)
{
  args are the args here, and arch is the arch.
}

> 2) have you counted the syscalls that do and do not need that?

No.

> 3) how many of those realistically *can* be unified with their
> compat counterparts?  [hint: ioctl(2) cannot]

There would be no requirement to unify anything.  The idea is that
we'd get rid of all the global state flags.

For ioctl, we'd have a new file_operation:

long ioctl(struct file *, unsigned int, unsigned long, enum syscall_arch);

I'm not saying this is easy, but I think it's possible and the result
would be more obviously correct than what we have now.
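
To make that concrete, a driver-side handler for such a hook might look roughly like
this, using the enum syscall_arch sketched above (the fops member does not exist, and
sg_ioctl()/sg_compat_ioctl() just stand in for whatever a driver's native and 32-bit
handlers happen to be called):

static long sg_arch_ioctl(struct file *filp, unsigned int cmd,
                          unsigned long arg, enum syscall_arch arch)
{
        /* Explicit arch choice instead of in_compat_syscall(). */
        if (arch == SYSCALL_COMPAT)
                return sg_compat_ioctl(filp, cmd, arg); /* 32-bit data layout */

        return sg_ioctl(filp, cmd, arg);                /* native layout */
}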


Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Andy Lutomirski



> On Sep 19, 2020, at 3:41 PM, Al Viro  wrote:
> 
> On Sat, Sep 19, 2020 at 03:23:54PM -0700, Andy Lutomirski wrote:
>> 
>>>> On Sep 19, 2020, at 3:09 PM, Al Viro  wrote:
>>> 
>>> On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
>>>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>>> Said that, why not provide a variant that would take an explicit
>>>>> "is it compat" argument and use it there?  And have the normal
>>>>> one pass in_compat_syscall() to that...
>>>> 
>>>> That would help to not introduce a regression with this series yes.
>>>> But it wouldn't fix existing bugs when io_uring is used to access
>>>> read or write methods that use in_compat_syscall().  One example that
>>>> I recently ran into is drivers/scsi/sg.c.
>>> 
>>> So screw such read/write methods - don't use them with io_uring.
>>> That, BTW, is one of the reasons I'm sceptical about burying the
>>> decisions deep into the callchain - we don't _want_ different
>>> data layouts on read/write depending upon the 32bit vs. 64bit
>>> caller, let alone the pointer-chasing garbage that is /dev/sg.
>> 
>> Well, we could remove in_compat_syscall(), etc and instead have an implicit 
>> parameter in DEFINE_SYSCALL.  Then everything would have to be explicit.  
>> This would probably be a win, although it could be quite a bit of work.
> 
> It would not be a win - most of the syscalls don't give a damn
> about 32bit vs. 64bit...

Any reasonable implementation would optimize it out for syscalls that don’t 
care.  Or it could be explicit:

DEFINE_MULTIARCH_SYSCALL(...)

Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Andy Lutomirski


> On Sep 19, 2020, at 3:09 PM, Al Viro  wrote:
> 
> On Fri, Sep 18, 2020 at 05:16:15PM +0200, Christoph Hellwig wrote:
>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>> Said that, why not provide a variant that would take an explicit
>>> "is it compat" argument and use it there?  And have the normal
>>> one pass in_compat_syscall() to that...
>> 
>> That would help to not introduce a regression with this series yes.
>> But it wouldn't fix existing bugs when io_uring is used to access
>> read or write methods that use in_compat_syscall().  One example that
>> I recently ran into is drivers/scsi/sg.c.
> 
> So screw such read/write methods - don't use them with io_uring.
> That, BTW, is one of the reasons I'm sceptical about burying the
> decisions deep into the callchain - we don't _want_ different
> data layouts on read/write depending upon the 32bit vs. 64bit
> caller, let alone the pointer-chasing garbage that is /dev/sg.

Well, we could remove in_compat_syscall(), etc and instead have an implicit 
parameter in DEFINE_SYSCALL.  Then everything would have to be explicit.  This 
would probably be a win, although it could be quite a bit of work.

Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Andy Lutomirski


> On Sep 19, 2020, at 2:16 PM, Arnd Bergmann  wrote:
> 
> On Sat, Sep 19, 2020 at 6:21 PM Andy Lutomirski  wrote:
>>> On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig  wrote:
>>> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
>>>> Said that, why not provide a variant that would take an explicit
>>>> "is it compat" argument and use it there?  And have the normal
>>>> one pass in_compat_syscall() to that...
>>> 
>>> That would help to not introduce a regression with this series yes.
>>> But it wouldn't fix existing bugs when io_uring is used to access
>>> read or write methods that use in_compat_syscall().  One example that
>>> I recently ran into is drivers/scsi/sg.c.
> 
> Ah, so reading /dev/input/event* would suffer from the same issue,
> and that one would in fact be broken by your patch in the hypothetical
> case that someone tried to use io_uring to read /dev/input/event on x32...
> 
> For reference, I checked the socket timestamp handling that has a
> number of corner cases with time32/time64 formats in compat mode,
> but none of those appear to be affected by the problem.
> 
>> Aside from the potentially nasty use of per-task variables, one thing
>> I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
>> going to have a generic mechanism for this, shouldn't we allow a full
>> override of the syscall arch instead of just allowing forcing compat
>> so that a compat syscall can do a non-compat operation?
> 
> The only reason it's needed here is that the caller is in a kernel
> thread rather than a system call. Are there any possible scenarios
> where one would actually need the opposite?
> 

I can certainly imagine needing to force x32 mode from a kernel thread.

As for the other direction: what exactly are the desired bitness/arch semantics 
of io_uring?  Is the operation bitness chosen by the io_uring creation or by 
the io_uring_enter() bitness?

Re: [PATCH 1/9] kernel: add a PF_FORCE_COMPAT flag

2020-09-19 Thread Andy Lutomirski
On Fri, Sep 18, 2020 at 8:16 AM Christoph Hellwig  wrote:
>
> On Fri, Sep 18, 2020 at 02:58:22PM +0100, Al Viro wrote:
> > Said that, why not provide a variant that would take an explicit
> > "is it compat" argument and use it there?  And have the normal
> > one pass in_compat_syscall() to that...
>
> That would help to not introduce a regression with this series yes.
> But it wouldn't fix existing bugs when io_uring is used to access
> read or write methods that use in_compat_syscall().  One example that
> I recently ran into is drivers/scsi/sg.c.

Aside from the potentially nasty use of per-task variables, one thing
I don't like about PF_FORCE_COMPAT is that it's one-way.  If we're
going to have a generic mechanism for this, shouldn't we allow a full
override of the syscall arch instead of just allowing forcing compat
so that a compat syscall can do a non-compat operation?


Re: ptrace_syscall_32 is failing

2020-09-02 Thread Andy Lutomirski
On Wed, Sep 2, 2020 at 1:29 AM Thomas Gleixner  wrote:
>
> On Tue, Sep 01 2020 at 17:09, Andy Lutomirski wrote:
> > On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner  wrote:
> >> > I think that they almost work for x86, but not quite as
> >> > indicated by this bug.  Even if we imagine we can somehow hack around
> >> > this bug, I imagine we're going to find other problems with this
> >> > model, e.g. the potential upcoming exit problem I noted in my review.
> >>
> >> What's the upcoming problem?
> >
> > If we ever want to get single-stepping fully correct across syscalls,
> > we might need to inject SIGTRAP on syscall return. This would be more
> > awkward if we can't run instrumentable code after the syscall part of
> > the syscall is done.
>
> We run a lot of instrumentable code after sys_foo() returns. Otherwise
> all the TIF work would not be possible at all.
>
> But you might tell me where exactly you want to inject the SIGTRAP in
> the syscall exit code flow.

It would be a bit complicated.  Definitely after any signals from the
syscall are delivered.  Right now, I think that we don't deliver a
SIGTRAP on the instruction boundary after SYSCALL while
single-stepping.  (I think we used to, but only sometimes, and now we
are at least consistent.)  This is because IRET will not trap if it
starts with TF clear and ends up setting it.  (I asked Intel to
document this, and I think they finally did, although I haven't gotten
around to reading the new docs.  Certainly the old docs as of a year
or two ago had no description whatsoever of how TF changes worked.)

Deciding exactly *when* a trap should occur would be nontrivial -- we
can't trap on sigreturn() from a SIGTRAP, for example.

So this isn't fully worked out.

>
> >> I don't think we want that in general. The current variant is perfectly
> >> fine for everything except the 32bit fast syscall nonsense. Also
> >> irqentry_entry/exit is not equivalent to the syscall_enter/exit
> >> counterparts.
> >
> > If there are any architectures in which actual work is needed to
> > figure out whether something is a syscall in the first place, they'll
> > want to do the usual kernel entry work before the syscall entry work.
>
> That's low level entry code which does not require RCU, lockdep, tracing
> or whatever muck we setup before actual work can be done.
>
> arch_asm_entry()
>   ...
>   arch_c_entry(cause) {
> switch(cause) {
>   case EXCEPTION: arch_c_exception(...);
>   case SYSCALL: arch_c_syscall(...);
>   ...
> }

You're assuming that figuring out the cause doesn't need the kernel
entry code to run first.  In the case of the 32-bit vDSO fast
syscalls, we arguably don't know whether an entry is a syscall until
we have done a user memory access.  Logically, we're doing:

if (get_user() < 0) {
  /* Not a syscall.  This is actually a silly operation that sets AX =
-EFAULT and returns.  Do not audit or invoke ptrace. */
} else {
  /* This actually is a syscall. */
}

So we really do want to stick arch code between the
enter_from_user_mode() and the audit check.  We *can't* audit because
we don't know the syscall args.  Now maybe we could invent new
semantics for this in which a fault here is still somehow a syscall,
but I think that would be a real ABI change and would want very
careful thought.  And it would be weird -- syscalls are supposed to
actually call the syscall handler, aren't they?  (Arguably we should
go back in time and make this a SIGSEGV.  We have the infrastructure
to do this cleanly, but when I wrote the code I just copied the ABI
from code that was before my time.  Even so, it would be an exception,
not a syscall.)

>
> You really want to differentiate between exception and syscall
> entry/exit.
>

Why do we want to distinguish between exception and syscall
entry/exit?  For the enter part, AFAICS the exception case boils down
to enter_from_user_mode() and the syscall case is:

enter_from_user_mode(regs);
instrumentation_begin();

local_irq_enable();
ti_work = READ_ONCE(current_thread_info()->flags);
if (ti_work & SYSCALL_ENTER_WORK)
syscall = syscall_trace_enter(regs, syscall, ti_work);
instrumentation_end();

Which would decompose quite nicely as a regular (non-syscall) entry
plus the syscall part later.
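
To sketch what that decomposition might look like (everything here just rearranges
the snippet quoted above; the two function names are invented):

static __always_inline void generic_entry_from_user(struct pt_regs *regs)
{
        /* The plain, non-syscall part: every user-mode entry does this. */
        enter_from_user_mode(regs);
}

static __always_inline long generic_syscall_enter_work(struct pt_regs *regs,
                                                       long syscall)
{
        unsigned long ti_work;

        /*
         * The syscall-specific part, run later, once the arch code has
         * decided that this really is a syscall and fetched its args.
         */
        instrumentation_begin();
        local_irq_enable();
        ti_work = READ_ONCE(current_thread_info()->flags);
        if (ti_work & SYSCALL_ENTER_WORK)
                syscall = syscall_trace_enter(regs, syscall, ti_work);
        instrumentation_end();

        return syscall;
}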

> The splitting of syscall_enter_from_user_mode() is only necessary for
> that 32bit fast syscall thing on x86 and there is no point to open code
> it with two calls for e.g. do_syscall_64().
>
> > Maybe your patch actually makes this possible -- I haven't digested
> > all the details yet.

Re: ptrace_syscall_32 is failing

2020-09-01 Thread Andy Lutomirski
On Tue, Sep 1, 2020 at 4:50 PM Thomas Gleixner  wrote:
>
> On Sun, Aug 30 2020 at 08:52, Andy Lutomirski wrote:
> >> > [RUN]SYSCALL
> >> > [FAIL]Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
> >> > [RUN]SYSCALL
> >> > [OK]Args after SIGUSR1 are correct (ax = -514)
> >> > [OK]Child got SIGUSR1
> >> > [RUN]Step again
> >> > [OK]pause(2) restarted correctly
> >>
> >> Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
> >> syscall entry").
> >> It looks like it is because syscall_enter_from_user_mode() is called
> >> before reading the 6th argument from the user stack.
>
> Bah.I don't know how I managed to miss that part and interestingly
> enough that none of the robots caught that either
>
> > Thomas, can we revert the syscall_enter() and syscall_exit() part of
> > the series?
>
> Hrm.
>
> > I think that they almost work for x86, but not quite as
> > indicated by this bug.  Even if we imagine we can somehow hack around
> > this bug, I imagine we're going to find other problems with this
> > model, e.g. the potential upcoming exit problem I noted in my review.
>
> What's the upcoming problem?

If we ever want to get single-stepping fully correct across syscalls,
we might need to inject SIGTRAP on syscall return. This would be more
awkward if we can't run instrumentable code after the syscall part of
the syscall is done.

>
> > I really think the model should be:
> >
> > void do_syscall_whatever(...)
> > {
> >   irqentry_enter(...);
> >   instrumentation_begin();
> >
> >   /* Do whatever arch ABI oddities are needed on entry. */
> >
> >   Then either:
> >   syscall_begin(arch, nr, regs);
> >   dispatch the syscall;
> >   syscall_end(arch, nr, regs);
> >
> >   Or just:
> >   generic_do_syscall(arch, nr, regs);
> >
> >   /* Do whatever arch ABI oddities are needed on exit from the syscall. */
> >
> >   instrumentation_end();
> >   irqentry_exit(...);
> > }
>
> I don't think we want that in general. The current variant is perfectly
> fine for everything except the 32bit fast syscall nonsense. Also
> irqentry_entry/exit is not equivalent to the syscall_enter/exit
> counterparts.

If there are any architectures in which actual work is needed to
figure out whether something is a syscall in the first place, they'll
want to do the usual kernel entry work before the syscall entry work.
Maybe your patch actually makes this possible -- I haven't digested
all the details yet.

Who advised you to drop the arch parameter?

> ---
>  arch/x86/entry/common.c  |   29 
>  include/linux/entry-common.h |   51 
> +++
>  kernel/entry/common.c|   35 -
>  3 files changed, 91 insertions(+), 24 deletions(-)
>
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -60,16 +60,10 @@
>  #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
>  static __always_inline unsigned int syscall_32_enter(struct pt_regs *regs)
>  {
> -   unsigned int nr = (unsigned int)regs->orig_ax;
> -
> if (IS_ENABLED(CONFIG_IA32_EMULATION))
> current_thread_info()->status |= TS_COMPAT;
> -   /*
> -* Subtlety here: if ptrace pokes something larger than 2^32-1 into
> -* orig_ax, the unsigned int return value truncates it.  This may
> -* or may not be necessary, but it matches the old asm behavior.
> -*/
> -   return (unsigned int)syscall_enter_from_user_mode(regs, nr);
> +
> +   return (unsigned int)regs->orig_ax;
>  }
>
>  /*
> @@ -91,15 +85,29 @@ static __always_inline void do_syscall_3
>  {
> unsigned int nr = syscall_32_enter(regs);
>
> +   /*
> +* Subtlety here: if ptrace pokes something larger than 2^32-1 into
> +* orig_ax, the unsigned int return value truncates it.  This may
> +* or may not be necessary, but it matches the old asm behavior.
> +*/
> +   nr = (unsigned int)syscall_enter_from_user_mode(regs, nr);
> +
> do_syscall_32_irqs_on(regs, nr);
> syscall_exit_to_user_mode(regs);
>  }
>
>  static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
>  {
> -   unsigned int nr = syscall_32_enter(regs);
> +   unsigned int nr = syscall_32_enter(regs);
> int res;
>
> +   /*
> +* This cannot use syscall_enter_from_user_mode() as it has to
> +* fetch EBP before invoking any of the syscall entry work
> +* functions.
> +*/
> +   syscall_enter_from_user_mode_prepare(regs);

I'm getting lost in all these "enter" functions...


Re: ptrace_syscall_32 is failing

2020-08-30 Thread Andy Lutomirski
On Sat, Aug 29, 2020 at 9:40 PM Brian Gerst  wrote:
>
> On Sat, Aug 29, 2020 at 12:52 PM Andy Lutomirski  wrote:
> >
> > Seems to be a recent regression, maybe related to entry/exit work changes.
> >
> > # ./tools/testing/selftests/x86/ptrace_syscall_32
> > [RUN]Check int80 return regs
> > [OK]getpid() preserves regs
> > [OK]kill(getpid(), SIGUSR1) preserves regs
> > [RUN]Check AT_SYSINFO return regs
> > [OK]getpid() preserves regs
> > [OK]kill(getpid(), SIGUSR1) preserves regs
> > [RUN]ptrace-induced syscall restart
> > Child will make one syscall
> > [RUN]SYSEMU
> > [FAIL]Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
> > [RUN]Restart the syscall (ip = 0xf7f3b549)
> > [OK]Restarted nr and args are correct
> > [RUN]Change nr and args and restart the syscall (ip = 0xf7f3b549)
> > [OK]Replacement nr and args are correct
> > [OK]Child exited cleanly
> > [RUN]kernel syscall restart under ptrace
> > Child will take a nap until signaled
> > [RUN]SYSCALL
> > [FAIL]Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
> > [RUN]SYSCALL
> > [OK]Args after SIGUSR1 are correct (ax = -514)
> > [OK]Child got SIGUSR1
> > [RUN]Step again
> > [OK]pause(2) restarted correctly
>
> Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
> syscall entry").
> It looks like it is because syscall_enter_from_user_mode() is called
> before reading the 6th argument from the user stack.

Ugh.  I caught, in review, a potential related issue with exit (not a
problem in current kernels), but I missed the entry version.

Thomas, can we revert the syscall_enter() and syscall_exit() part of
the series?  I think that they almost work for x86, but not quite as
indicated by this bug.  Even if we imagine we can somehow hack around
this bug, I imagine we're going to find other problems with this
model, e.g. the potential upcoming exit problem I noted in my review.

I really think the model should be:

void do_syscall_whatever(...)
{
  irqentry_enter(...);
  instrumentation_begin();

  /* Do whatever arch ABI oddities are needed on entry. */

  Then either:
  syscall_begin(arch, nr, regs);
  dispatch the syscall;
  syscall_end(arch, nr, regs);

  Or just:
  generic_do_syscall(arch, nr, regs);

  /* Do whatever arch ABI oddities are needed on exit from the syscall. */

  instrumentation_end();
  irqentry_exit(...);
}

x86 has an ABI oddity needed on entry: this fast syscall argument
fixup.  We also might end up with ABI oddities on exit if we ever try
to make single-stepping of syscalls work fully correctly.  x86 sort of
gets away without specifying arch because the arch helpers that get
called for audit, etc can deduce the arch, but this is kind of gross.
I suppose we could omit arch as an explicit parameter.

Or I suppose we could try to rejigger the API in time for 5.9.
Fortunately only x86 uses the new APIs so far.  I cc'd a bunch of
other arch maintainers to see if other architectures fit well in the
new syscall_enter() model, but I feel like the fact that x86 is
already broken indicates that we messed it up a bit.

--Andy


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-15 Thread Andy Lutomirski



> On Jul 15, 2020, at 9:15 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Mathieu Desnoyers's message of July 14, 2020 12:13 am:
>> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>> 
>>> Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
 Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
> Also, as it stands, I can easily see in_irq() ceasing to promise to
> serialize.  There are older kernels for which it does not promise to
> serialize.  And I have plans to make it stop serializing in the
> nearish future.
 
 You mean x86's return from interrupt? Sounds fun... you'll konw where to
 update the membarrier sync code, at least :)
>>> 
>>> Oh, I should actually say Mathieu recently clarified a return from
>>> interrupt doesn't fundamentally need to serialize in order to support
>>> membarrier sync core.
>> 
>> Clarification to your statement:
>> 
>> Return from interrupt to kernel code does not need to be context serializing
>> as long as kernel serializes before returning to user-space.
>> 
>> However, return from interrupt to user-space needs to be context serializing.
> 
> Hmm, I'm not sure it's enough even with the sync in the exit_lazy_tlb
> in the right places.
> 
> A kernel thread does a use_mm, then it blocks and the user process with
> the same mm runs on that CPU, and then it calls into the kernel, blocks,
> the kernel thread runs again, another CPU issues a membarrier which does
> not IPI this one because it's running a kthread, and then the kthread
> switches back to the user process (still without having unused the mm),
> and then the user process returns from syscall without having done a 
> core synchronising instruction.
> 
> The cause of the problem is you want to avoid IPI'ing kthreads. Why?
> I'm guessing it really only matters as an optimisation in case of idle
> threads. Idle thread is easy (well, easier) because it won't use_mm, so 
> you could check for rq->curr == rq->idle in your loop (in a suitable 
> sched accessor function).
> 
> But... I'm not really liking this subtlety in the scheduler for all this 
> (the scheduler still needs the barriers when switching out of idle).
> 
> Can it be improved somehow? Let me forget x86 core sync problem for now
> (that _may_ be a bit harder), and step back and look at what we're doing.
> The memory barrier case would actually suffer from the same problem as
> core sync, because in the same situation it has no implicit mmdrop in
> the scheduler switch code either.
> 
> So what are we doing with membarrier? We want any activity caused by the 
> set of CPUs/threads specified that can be observed by this thread before 
> calling membarrier is appropriately fenced from activity that can be 
> observed to happen after the call returns.
> 
> CPU0 CPU1
> 1. user stuff
> a. membarrier()  2. enter kernel
> b. read rq->curr 3. rq->curr switched to kthread
> c. is kthread, skip IPI  4. switch_to kthread
> d. return to user5. rq->curr switched to user thread
> 6. switch_to user thread
> 7. exit kernel
> 8. more user stuff
> 
> As far as I can see, the problem is CPU1 might reorder step 5 and step
> 8, so you have mmdrop of lazy mm be a mb after step 6.
> 
> But why? The membarrier call only cares that there is a full barrier
> between 1 and 8, right? Which it will get from the previous context
> switch to the kthread.
> 
> I must say the memory barrier comments in membarrier could be improved
> a bit (unless I'm missing where the main comment is). It's fine to know
> what barriers pair with one another, but we need to know which exact
> memory accesses it is ordering
> 
>   /*
> * Matches memory barriers around rq->curr modification in
> * scheduler.
> */
> 
> Sure, but it doesn't say what else is being ordered. I think it's just
> the user memory accesses, but would be nice to make that a bit more
> explicit. If we had such comments then we might know this case is safe.
> 
> I think the funny powerpc barrier is a similar case of this. If we
> ever see remote_rq->curr->flags & PF_KTHREAD, then we _know_ that
> CPU has or will have issued a memory barrier between running user
> code.
> 
> So AFAIKS all this membarrier stuff in kernel/sched/core.c could
> just go away. Except x86 because thread switch doesn't imply core
> sync, so CPU1 between 1 and 8 may never issue a core sync instruction
> the same way a context switch must be a full mb.
> 
> Before getting to x86 -- Am I right, or way off track here?

I find it hard to believe that this is x86 only. Why would thread switch imply 
core sync on any architecture?  Is x86 unique in having a stupid expensive core 
sync that is heavier than smp_mb()?

But I’m wondering if all this deferred sync stuff is wrong. In the brave new 
world of io_uring and such, perhaps kerne

Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-14 Thread Andy Lutomirski



> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin  wrote:
> 
> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>> 
 On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
 
 Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>> 
>> On big systems, the mm refcount can become highly contented when doing
>> a lot of context switching with threaded applications (particularly
>> switching between the idle thread and an application thread).
>> 
>> Abandoning lazy tlb slows switching down quite a bit in the important
>> user->idle->user cases, so so instead implement a non-refcounted scheme
>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>> any remaining lazy ones.
>> 
>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>> with as many software threads as CPUs (so each switch will go in and
>> out of idle), upstream can achieve a rate of about 1 million context
>> switches per second. After this patch it goes up to 118 million.
>> 
> 
> I read the patch a couple of times, and I have a suggestion that could
> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
> refcount.  You're saying "hey, this mm has no more references, but it
> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
> those references too."  I'm wondering whether you actually need the
> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
> for real?  Roughly, in __mmdrop(), you would only free the page tables
> if mm_cpumask() is empty.  And, in the code that removes a CPU from
> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
> you just removed the last bit from mm_cpumask and potentially free the
> mm.
> 
> Getting the locking right here could be a bit tricky -- you need to
> avoid two CPUs simultaneously exiting lazy TLB and thinking they
> should free the mm, and you also need to avoid an mm with mm_users
> hitting zero concurrently with the last remote CPU using it lazily
> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
> mm" and mm_count == 0 meaning "now it's dead" and using some careful
> cmpxchg or dec_return to make sure that only one CPU frees it.
> 
> Or maybe you'd need a lock or RCU for this, but the idea would be to
> only ever take the lock after mm_users goes to zero.
 
 I don't think it's nonsense, it could be a good way to avoid IPIs.
 
 I haven't seen much problem here that made me too concerned about IPIs 
 yet, so I think the simple patch may be good enough to start with
 for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
 unlazying with the exit TLB flush without doing anything fancy with
 ref counting, but we'll see.
>>> 
>>> I would be cautious with benchmarking here. I would expect that the
>>> nasty cases may affect power consumption more than performance — the 
>>> specific issue is IPIs hitting idle cores, and the main effects are to 
>>> slow down exit() a bit but also to kick the idle core out of idle. 
>>> Although, if the idle core is in a deep sleep, that IPI could be 
>>> *very* slow.
>> 
>> It will tend to be self-limiting to some degree (deeper idle cores
>> would tend to have less chance of IPI) but we have bigger issues on
>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>> management. Power hasn't really shown up as an issue but powerpc
>> CPUs may have their own requirements and issues there, shall we say.
>> 
>>> So I think it’s worth at least giving this a try.
>> 
>> To be clear it's not a complete solution itself. The problem is of 
>> course that mm cpumask gives you false negatives, so the bits
>> won't always clean up after themselves as CPUs switch away from their
>> lazy tlb mms.
> 
> ^^
> 
> False positives: CPU is in the mm_cpumask, but is not using the mm
> as a lazy tlb. So there can be bits left and never freed.
> 
> If you closed the false positives, you're back to a shared mm cache
> line on lazy mm context switches.

x86 has this exact problem. At least no more than 64*8 CPUs share the cache 
line :)

Can your share your benchmark?

> 
> Thanks,
> Nick


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-13 Thread Andy Lutomirski


> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin  wrote:
> 
> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>>> 
>>> On big systems, the mm refcount can become highly contented when doing
>>> a lot of context switching with threaded applications (particularly
>>> switching between the idle thread and an application thread).
>>> 
>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>> user->idle->user cases, so so instead implement a non-refcounted scheme
>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>> any remaining lazy ones.
>>> 
>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>> with as many software threads as CPUs (so each switch will go in and
>>> out of idle), upstream can achieve a rate of about 1 million context
>>> switches per second. After this patch it goes up to 118 million.
>>> 
>> 
>> I read the patch a couple of times, and I have a suggestion that could
>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>> refcount.  You're saying "hey, this mm has no more references, but it
>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>> those references too."  I'm wondering whether you actually need the
>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>> you just removed the last bit from mm_cpumask and potentially free the
>> mm.
>> 
>> Getting the locking right here could be a bit tricky -- you need to
>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>> should free the mm, and you also need to avoid an mm with mm_users
>> hitting zero concurrently with the last remote CPU using it lazily
>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>> cmpxchg or dec_return to make sure that only one CPU frees it.
>> 
>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>> only ever take the lock after mm_users goes to zero.
> 
> I don't think it's nonsense, it could be a good way to avoid IPIs.
> 
> I haven't seen much problem here that made me too concerned about IPIs 
> yet, so I think the simple patch may be good enough to start with
> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
> unlazying with the exit TLB flush without doing anything fancy with
> ref counting, but we'll see.

I would be cautious with benchmarking here. I would expect that the nasty cases 
may affect power consumption more than performance — the specific issue is IPIs 
hitting idle cores, and the main effects are to slow down exit() a bit but also 
to kick the idle core out of idle. Although, if the idle core is in a deep 
sleep, that IPI could be *very* slow.

So I think it’s worth at least giving this a try.

> 
> Thanks,
> Nick


Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-13 Thread Andy Lutomirski
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>
> On big systems, the mm refcount can become highly contented when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
>
> Abandoning lazy tlb slows switching down quite a bit in the important
> user->idle->user cases, so so instead implement a non-refcounted scheme
> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
> any remaining lazy ones.
>
> On a 16-socket 192-core POWER8 system, a context switching benchmark
> with as many software threads as CPUs (so each switch will go in and
> out of idle), upstream can achieve a rate of about 1 million context
> switches per second. After this patch it goes up to 118 million.
>

I read the patch a couple of times, and I have a suggestion that could
be nonsense.  You are, effectively, using mm_cpumask() as a sort of
refcount.  You're saying "hey, this mm has no more references, but it
still has nonempty mm_cpumask(), so let's send an IPI and shoot down
those references too."  I'm wondering whether you actually need the
IPI.  What if, instead, you actually treated mm_cpumask as a refcount
for real?  Roughly, in __mmdrop(), you would only free the page tables
if mm_cpumask() is empty.  And, in the code that removes a CPU from
mm_cpumask(), you would check if mm_users == 0 and, if so, check if
you just removed the last bit from mm_cpumask and potentially free the
mm.

Getting the locking right here could be a bit tricky -- you need to
avoid two CPUs simultaneously exiting lazy TLB and thinking they
should free the mm, and you also need to avoid an mm with mm_users
hitting zero concurrently with the last remote CPU using it lazily
exiting lazy TLB.  Perhaps this could be resolved by having mm_count
== 1 mean "mm_cpumask() might contain bits and, if so, it owns the
mm" and mm_count == 0 meaning "now it's dead" and using some careful
cmpxchg or dec_return to make sure that only one CPU frees it.

Or maybe you'd need a lock or RCU for this, but the idea would be to
only ever take the lock after mm_users goes to zero.
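
A deliberately oversimplified sketch of the shape, ignoring the races described above
(the helper names and the split of responsibilities are assumptions, not a worked-out
patch):

/* Called both when mm_users hits zero and when a CPU leaves lazy TLB. */
static void mm_maybe_free(struct mm_struct *mm)
{
        if (atomic_read(&mm->mm_users) == 0 && cpumask_empty(mm_cpumask(mm)))
                __mmdrop(mm);   /* actually free page tables etc. */
}

static void mm_clear_lazy_cpu(struct mm_struct *mm, int cpu)
{
        cpumask_clear_cpu(cpu, mm_cpumask(mm));
        /*
         * Racy as written: two CPUs can both think they cleared the
         * last bit, which is exactly the locking problem noted above.
         */
        mm_maybe_free(mm);
}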

--Andy


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-13 Thread Andy Lutomirski
On Mon, Jul 13, 2020 at 7:13 AM Mathieu Desnoyers
 wrote:
>
> - On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote:
>
> > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm:
> >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am:
> >>> Also, as it stands, I can easily see in_irq() ceasing to promise to
> >>> serialize.  There are older kernels for which it does not promise to
> >>> serialize.  And I have plans to make it stop serializing in the
> >>> nearish future.
> >>
> >> You mean x86's return from interrupt? Sounds fun... you'll konw where to
> >> update the membarrier sync code, at least :)
> >
> > Oh, I should actually say Mathieu recently clarified a return from
> > interrupt doesn't fundamentally need to serialize in order to support
> > membarrier sync core.
>
> Clarification to your statement:
>
> Return from interrupt to kernel code does not need to be context serializing
> as long as kernel serializes before returning to user-space.
>
> However, return from interrupt to user-space needs to be context serializing.
>

Indeed, and I figured this out on the first read through because I'm
quite familiar with the x86 entry code.  But Nick somehow missed this,
and Nick is the one who wrote the patch.

Nick, I think this helps prove my point.  The code you're submitting
may well be correct, but it's unmaintainable.  At the very least, this
needs a comment explaining, from the perspective of x86, *exactly*
what exit_lazy_tlb() is promising, why it's promising it, how it
achieves that promise, and what code cares about it.  Or we could do
something with TIF flags and make this all less magical, although that
will probably end up very slightly slower.

--Andy


Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-10 Thread Andy Lutomirski
On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin  wrote:
>
> And get rid of the generic sync_core_before_usermode facility.
>
> This helper is the wrong way around I think. The idea that membarrier
> state requires a core sync before returning to user is the easy one
> that does not need hiding behind membarrier calls. The gap in core
> synchronization due to x86's sysret/sysexit and lazy tlb mode, is the
> tricky detail that is better put in x86 lazy tlb code.
>
> Consider if an arch did not synchronize core in switch_mm either, then
> membarrier_mm_sync_core_before_usermode would be in the wrong place
> but arch specific mmu context functions would still be the right place.
> There is also a exit_lazy_tlb case that is not covered by this call, which
> could be a bugs (kthread use mm the membarrier process's mm then context
> switch back to the process without switching mm or lazy mm switch).
>
> This makes lazy tlb code a bit more modular.

The mm-switching and TLB-management has often had the regrettable
property that it gets wired up in a way that seems to work at the time
but doesn't have clear semantics, and I'm a bit concerned that this
patch is in that category.  If I'm understanding right, you're trying
to enforce the property that exiting lazy TLB mode will promise to
sync the core eventually.  But this has all kinds of odd properties:

 - Why is exit_lazy_tlb() getting called at all in the relevant cases?
 When is it permissible to call it?  I look at your new code and see:

> +/*
> + * Ensure that a core serializing instruction is issued before returning
> + * to user-mode, if a SYNC_CORE was requested. x86 implements return to
> + * user-space through sysexit, sysrel, and sysretq, which are not core
> + * serializing.
> + *
> + * See the membarrier comment in finish_task_switch as to why this is done
> + * in exit_lazy_tlb.
> + */
> +#define exit_lazy_tlb exit_lazy_tlb
> +static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct 
> *tsk)
> +{
> +   /* Switching mm is serializing with write_cr3 */
> +if (tsk->mm != mm)
> +return;

And my brain says WTF?  Surely you meant something like if
(WARN_ON_ONCE(tsk->mm != mm)) { /* egads, what even happened?  how do
we try to recover well enough to get a crash logged at least? */ }

So this needs actual documentation, preferably in comments near the
function, of what the preconditions are and what this mm parameter is.
Once that's done, then we could consider whether it's appropriate to
have this function promise to sync the core under some conditions.

 - This whole structure seems to rely on the idea that switching mm
syncs something.  I periodically ask chip vendors for non-serializing
mm switches.  Specifically, in my dream world, we have totally
separate user and kernel page tables.  Changing out the user tables
doesn't serialize or even create a fence.  Instead it creates the
minimum required pipeline hazard such that user memory access and
switches to usermode will make sure they access memory through the
correct page tables.  I haven't convinced a chip vendor yet, but there
are quite a few hundreds of cycles to be saved here.  With this in
mind, I see the fencing aspects of the TLB handling code as somewhat
of an accident.  I'm fine with documenting them and using them to
optimize other paths, but I think it should be explicit.  For example:

/* Also does a full barrier?  (Or a sync_core()-style barrier.)
However, if you rely on this, you must document it in a comment where
you call this function. */
void switch_mm_irqs_off()
{
}

This is kind of like how we strongly encourage anyone using smp_?mb()
to document what they are fencing against.

Also, as it stands, I can easily see in_irq() ceasing to promise to
serialize.  There are older kernels for which it does not promise to
serialize.  And I have plans to make it stop serializing in the
nearish future.

--Andy


Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Andy Lutomirski



> On Jun 15, 2020, at 1:56 PM, Luck, Tony  wrote:
> 
> 
>> 
>> Are we planning to keep PASID live once a task has used it once or are we 
>> going to swap it lazily?  If the latter, a percpu variable might be better.
> 
> Current plan is "touch it once and the task owns it until exit(2)"
> 
> Maybe someday in the future when we have data on how applications
> actually use accelerators we could look at something more complex
> if usage patterns look like it would be beneficial.
> 
> 

So what’s the RDMSR for?  Surely you
have some state somewhere that says “this task has a PASID.”  Can’t you just 
make sure that stays in sync with the MSR?  Then, on #GP, if the task already 
has a PASID, you know the MSR is set.
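
Roughly the shape being suggested, with a hypothetical per-task flag standing in for
whatever state ends up being used (mm->pasid and the MSR names are the ones this
series introduces, if I'm reading the patch right; has_valid_pasid is made up and
only illustrates the check that replaces the RDMSR):

static bool fixup_pasid_exception(void)
{
        /* No PASID allocated for this mm: this #GP is not ours. */
        if (!current->mm->pasid)
                return false;

        /*
         * Kept in sync with the MSR whenever it is written, so no RDMSR
         * is needed to know whether the MSR is already loaded.
         */
        if (current->has_valid_pasid)           /* hypothetical field */
                return false;

        wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
        current->has_valid_pasid = 1;
        return true;
}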

Re: [PATCH v2 12/12] x86/traps: Fix up invalid PASID

2020-06-15 Thread Andy Lutomirski


> On Jun 15, 2020, at 1:17 PM, Fenghua Yu  wrote:
> 
> Hi, Peter,
> 
>> On Mon, Jun 15, 2020 at 09:09:28PM +0200, Peter Zijlstra wrote:
>>> On Mon, Jun 15, 2020 at 11:55:29AM -0700, Fenghua Yu wrote:
>>> 
>>> Or do you suggest to add a random new flag in struct thread_info instead
>>> of a TIF flag?
>> 
>> Why thread_info? What's wrong with something simple like the below. It
>> takes a bit from the 'strictly current' flags word.
>> 
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b62e6aaf28f0..fca830b97055 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -801,6 +801,9 @@ struct task_struct {
>>/* Stalled due to lack of memory */
>>unsignedin_memstall:1;
>> #endif
>> +#ifdef CONFIG_PCI_PASID
>> +unsignedhas_valid_pasid:1;
>> +#endif
>> 
>>unsigned longatomic_flags; /* Flags requiring atomic access. 
>> */
>> 
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 142b23645d82..10b3891be99e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -955,6 +955,10 @@ static struct task_struct *dup_task_struct(struct 
>> task_struct *orig, int node)
>>tsk->use_memdelay = 0;
>> #endif
>> 
>> +#ifdef CONFIG_PCI_PASID
>> +tsk->has_valid_pasid = 0;
>> +#endif
>> +
>> #ifdef CONFIG_MEMCG
>>tsk->active_memcg = NULL;
>> #endif
> 
> The PASID MSR is x86 specific although PASID is PCIe concept and per-mm.
> Checking if the MSR has valid PASID (bit31=1) is an x86 specifc work.
> The flag should be cleared in cloned()/forked() and is only set and
> read in fixup() in x86 #GP for heuristic. It's not used anywhere outside
> of x86.
> 
> That's why we think the flag should be in x86 struct thread_info instead
> of in generice struct task_struct.
> 

Are we planning to keep PASID live once a task has used it once or are we going 
to swap it lazily?  If the latter, a percpu variable might be better.

> Please advice.
> 
> Thanks.
> 
> -Fenghua


Re: [RFC PATCH v3 08/12] lib: vdso: allow arches to provide vdso data pointer

2020-01-16 Thread Andy Lutomirski
On Thu, Jan 16, 2020 at 2:35 AM Thomas Gleixner  wrote:
>
> static __maybe_unused int
> __cvdso_data_clock_gettime(clockid_t clock, struct __kernel_timespec *ts,
>const struct vdso_data *vd)
> {
> .
> }
>
> static __maybe_unused int
> __cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
> {
> const struct vdso_data *vd = __arch_get_vdso_data();
>
> return __cvdso_data_clock_gettime(clock, ts, vd);
> }
>
> and then use __cvdso_data_clock_gettime on PPC and let the other archs
> unmodified.
>
>

FWIW, I did some experiments on x86 with gcc 9.2.  gcc 9.2 uses
rip-relative accesses if I simplify the config enough and otherwise
materializes the pointer.  Presumably it decides that the code size
reduction is worth it if there are a lot of accesses.

I suspect that tglx's suggestion will be fine or at worst will add
negligible overhead on x86_64.


Re: [RFC PATCH v4 10/11] lib: vdso: Allow arches to override the ns shift operation

2020-01-16 Thread Andy Lutomirski
On Thu, Jan 16, 2020 at 11:57 AM Thomas Gleixner  wrote:
>
> Andy Lutomirski  writes:
> > On Thu, Jan 16, 2020 at 9:58 AM Christophe Leroy
> >
> > Would mul_u64_u64_shr() be a good alternative?  Could we adjust it to
> > assume the shift is less than 32?  That function exists to benefit
> > 32-bit arches.
>
> We'd want mul_u64_u32_shr() for this. The rules for mult and shift are:
>

That's what I meant to type...

>  1 <= mult <= U32_MAX
>
>  1 <= shift <= 32
>
> If we want to enforce a shift < 32 we need to limit that conditionally
> in the calculation/registration function.
>
> Thanks,
>
> tglx
>


Re: [RFC PATCH v4 08/11] lib: vdso: allow fixed clock mode

2020-01-16 Thread Andy Lutomirski
On Thu, Jan 16, 2020 at 12:14 PM Thomas Gleixner  wrote:
>
> Christophe Leroy  writes:
>
> Can you please adjust the prefix for future patches to lib/vdso: and
> start the sentence after the colon with an uppercase letter?
>
> > On arches like POWERPC, the clock is always the timebase, it
>
> Please spell out architectures. Changelogs are not space constraint.
>
> > cannot be changed on the fly and it is always VDSO capable.
>
> Also this sentence does not make sense as it might suggests that
> architectures with a fixed compile time known clocksource have something
> named timebase. Something like this is more clear:
>
> Some architectures have a fixed clocksource which is known at compile
> time and cannot be replaced or disabled at runtime, e.g. timebase on
> PowerPC. For such cases the clock mode check in the VDSO code is
> pointless.
>

I wonder if we should use this on x86 bare-metal if we have
sufficiently invariant TSC.  (Via static_cpu_has(), not compiled in.)
Maybe there is no such x86 machine.

I really really want Intel or AMD to introduce machines where the TSC
pinky-swears to count in actual nanoseconds.

--Andy


Re: [RFC PATCH v4 10/11] lib: vdso: Allow arches to override the ns shift operation

2020-01-16 Thread Andy Lutomirski
On Thu, Jan 16, 2020 at 9:58 AM Christophe Leroy
 wrote:
>
> On powerpc/32, GCC (8.1) generates pretty bad code for the
> ns >>= vd->shift operation taking into account that the
> shift is always < 32 and the upper part of the result is
> likely to be nul. GCC makes reversed assumptions considering
> the shift to be likely >= 32 and the upper part to be like not nul.
>
> unsigned long long shift(unsigned long long x, unsigned char s)
> {
> return x >> s;
> }
>
> results in:
>
> 0018 <shift>:
>   18:   35 25 ff e0 addic.  r9,r5,-32
>   1c:   41 80 00 10 blt 2c <shift+0x14>
>   20:   7c 64 4c 30 srw r4,r3,r9
>   24:   38 60 00 00 li  r3,0
>   28:   4e 80 00 20 blr
>   2c:   54 69 08 3c rlwinm  r9,r3,1,0,30
>   30:   21 45 00 1f subfic  r10,r5,31
>   34:   7c 84 2c 30 srw r4,r4,r5
>   38:   7d 29 50 30 slw r9,r9,r10
>   3c:   7c 63 2c 30 srw r3,r3,r5
>   40:   7d 24 23 78 or  r4,r9,r4
>   44:   4e 80 00 20 blr
>
> Even when forcing the shift with an &= 31, it still considers
> the shift as likely >= 32.
>
> Define a vdso_shift_ns() macro that can be overriden by
> arches.

Would mul_u64_u64_shr() be a good alternative?  Could we adjust it to
assume the shift is less than 32?  That function exists to benefit
32-bit arches.
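
For reference, what that amounts to, roughly (parameter names are illustrative;
mul_u64_u32_shr() is the existing helper from linux/math64.h):

#include <linux/math64.h>

/*
 * Multiply a 64-bit cycle delta by a 32-bit mult and shift the result
 * down, without ever doing a full 64-bit shift; on 32-bit architectures
 * the helper decomposes this into 32x32->64 multiplies internally.
 */
static __always_inline u64 scale_delta(u64 delta, u32 mult, unsigned int shift)
{
        return mul_u64_u32_shr(delta, mult, shift);
}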

--Andy


Re: [RFC PATCH v2 01/10] lib: vdso: ensure all arches have 32bit fallback

2020-01-10 Thread Andy Lutomirski



> On Jan 10, 2020, at 10:56 AM, Thomas Gleixner  wrote:
> 
> Andy Lutomirski  writes:
> 
>>> On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
>>>  wrote:
>>> 
>>> In order to simplify next step which moves fallback call at arch
>>> level, ensure all arches have a 32bit fallback instead of handling
>>> the lack of 32bit fallback in the common code based
>>> on VDSO_HAS_32BIT_FALLBACK
>> 
>> I don't like this.  You've implemented what appear to be nonsensical
>> fallbacks (the 32-bit fallback for a 64-bit vDSO build?  There's no
>> such thing).
>> 
>> How exactly does this simplify patch 2?
> 
> There is a patchset from Vincenzo which fell through the cracks which
> addresses the VDS_HAS_32BIT_FALLBACK issue properly. I'm about to pick
> it up. See:
> 
> https://lore.kernel.org/lkml/20190830135902.20861-1-vincenzo.frasc...@arm.com/
> 

Thanks.  I had been wondering why the conditionals were still there, since I 
remember seeing these patches.

> Thanks,
> 
>tglx


Re: [RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch

2019-12-24 Thread Andy Lutomirski
On Tue, Dec 24, 2019 at 4:15 AM Andy Lutomirski  wrote:
>
>
>
> > On Dec 24, 2019, at 7:53 PM, christophe leroy  
> > wrote:
> >
> > 
> >
> >> On 24/12/2019 at 03:27, Andy Lutomirski wrote:
> >>> On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
> >>>  wrote:
> >>>
> >>> On powerpc, __arch_get_vdso_data() clobbers the link register,
> >>> requiring the caller to set a stack frame in order to save it.
> >>>
> >>> As the parent function already has to set a stack frame and save
> >>> the link register to call the C vdso function, retriving the
> >>> vdso data pointer there is lighter.
> >> I'm confused.  Can't you inline __arch_get_vdso_data()?  Or is the
> >> issue that you can't retrieve the program counter on power without
> >> clobbering the link register?
> >
> > Yes it can be inlined (I did it in V1 
> > https://patchwork.ozlabs.org/patch/1180571/), but you can't do it without 
> > clobbering the link register, because the only way to get the program 
> > counter is to do to as if you were calling another function but you call to 
> > the address which just follows where you are, so that it sets LR which the 
> > simulated return address which corresponds to the address following the 
> > branch.
> >
> > static __always_inline
> > const struct vdso_data *__arch_get_vdso_data(void)
> > {
> >void *ptr;
> >
> >asm volatile(
> >"bcl20, 31, .+4;\n"
> >"mflr%0;\n"
> >"addi%0, %0, __kernel_datapage_offset - (.-4);\n"
> >: "=b"(ptr) : : "lr");
> >
> >return ptr + *(unsigned long *)ptr;
> > }
> >
> >> I would imagine that this patch generates worse code on any
> >> architecture with PC-relative addressing modes (which includes at
> >> least x86_64, and I would guess includes most modern architectures).
> >
> > Why ? Powerpc is also using PC-relative addressing for all calls but 
> > indirect calls.
>
> I mean PC-relative access for data.  The data page is at a constant, known 
> offset from the vDSO text.
>
> I haven’t checked how much x86_64 benefits from this, but at least the 
> non-array fields ought to be accessible with a PC-relative access.
>
> It should be possible to refactor a little bit so that the compiler can still 
> see what’s going on.  Maybe your patch actually does this. I’d want to look 
> at the assembly.  This also might not matter much on x86_64 in particular, 
> since x86_64 can convert a PC-relative address to an absolute address with a 
> single instruction with no clobbers.
>
> Does power have PC-relative data access?  If so, I wonder if the code can be 
> arranged so that even the array accesses don’t require computing an absolute 
> address at any point.

Indeed the x86 code is also suboptimal, but at least the unnecessary
absolute address calculation is cheap on x86_64.  Ideally we'd pass
around offsets into the vdso data instead of passing pointers, and
maybe the compiler will figure it out.  I can try to play with this in
the morning.


Re: [RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch

2019-12-24 Thread Andy Lutomirski



> On Dec 24, 2019, at 7:53 PM, christophe leroy  wrote:
> 
> 
> 
>> On 24/12/2019 at 03:27, Andy Lutomirski wrote:
>>> On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
>>>  wrote:
>>> 
>>> On powerpc, __arch_get_vdso_data() clobbers the link register,
>>> requiring the caller to set a stack frame in order to save it.
>>> 
>>> As the parent function already has to set a stack frame and save
>>> the link register to call the C vdso function, retriving the
>>> vdso data pointer there is lighter.
>> I'm confused.  Can't you inline __arch_get_vdso_data()?  Or is the
>> issue that you can't retrieve the program counter on power without
>> clobbering the link register?
> 
> Yes it can be inlined (I did it in V1 
> https://patchwork.ozlabs.org/patch/1180571/), but you can't do it without 
> clobbering the link register, because the only way to get the program counter 
> is to do to as if you were calling another function but you call to the 
> address which just follows where you are, so that it sets LR which the 
> simulated return address which corresponds to the address following the 
> branch.
> 
> static __always_inline
> const struct vdso_data *__arch_get_vdso_data(void)
> {
>void *ptr;
> 
>asm volatile(
>"bcl20, 31, .+4;\n"
>"mflr%0;\n"
>"addi%0, %0, __kernel_datapage_offset - (.-4);\n"
>: "=b"(ptr) : : "lr");
> 
>return ptr + *(unsigned long *)ptr;
> }
> 
>> I would imagine that this patch generates worse code on any
>> architecture with PC-relative addressing modes (which includes at
>> least x86_64, and I would guess includes most modern architectures).
> 
> Why ? Powerpc is also using PC-relative addressing for all calls but indirect 
> calls.

I mean PC-relative access for data.  The data page is at a constant, known 
offset from the vDSO text.

I haven’t checked how much x86_64 benefits from this, but at least the 
non-array fields ought to be accessible with a PC-relative access.

It should be possible to refactor a little bit so that the compiler can still 
see what’s going on.  Maybe your patch actually does this. I’d want to look at 
the assembly.  This also might not matter much on x86_64 in particular, since 
x86_64 can convert a PC-relative address to an absolute address with a single 
instruction with no clobbers.

Does power have PC-relative data access?  If so, I wonder if the code can be 
arranged so that even the array accesses don’t require computing an absolute 
address at any point.
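
Concretely, the two x86_64 cases being contrasted look roughly like this
(the data symbol name is made up; the real one is arch-specific):

extern const struct vdso_data _vdso_data[];     /* assumed symbol name */

static __always_inline u32 hres_seq(void)
{
        /* Plain field read: one RIP-relative mov, no address temporary,
         * e.g.  mov _vdso_data+OFF(%rip), %eax  */
        return _vdso_data[CS_HRES_COARSE].seq;
}

static __always_inline const struct vdso_data *hres_data(void)
{
        /* When an absolute address really is needed, x86_64 pays one
         * instruction and clobbers nothing else:
         *      lea _vdso_data+OFF(%rip), %rax  */
        return &_vdso_data[CS_HRES_COARSE];
}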


Re: [RFC PATCH v2 02/10] lib: vdso: move call to fallback out of common code.

2019-12-24 Thread Andy Lutomirski


> On Dec 24, 2019, at 7:41 PM, christophe leroy  wrote:
> 
> 
> 
>> Le 24/12/2019 à 03:24, Andy Lutomirski a écrit :
>>> On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
>>>  wrote:
>>> 
>>> On powerpc, VDSO functions and syscalls cannot be implemented in C
>>> because the Linux kernel ABI requires that CR[SO] bit is set in case
>>> of error and cleared when no error.
>>> 
>>> As this cannot be done in C, C VDSO functions and syscall'based
>>> fallback need a trampoline in ASM.
>>> 
>>> By moving the fallback calls out of the common code, arches like
>>> powerpc can implement both the call to C VDSO and the fallback call
>>> in a single trampoline function.
>> Maybe the issue is that I'm not a powerpc person, but I don't
>> understand this.  The common vDSO code is in C.  Presumably this means
>> that you need an asm trampoline no matter what to call the C code.  Is
>> the improvement that, with this change, you can have the asm
>> trampoline do a single branch, so it's logically:
>> ret = [call the C code];
>> if (ret == 0) {
>>  set success bit;
>> } else {
>>  ret = fallback;
>>  if (ret == 0)
>>   set success bit;
>> else
>>   set failure bit;
>> }
> 
> More simple than above, in fact it is:
> 
> ret = [call the C code];
> if (ret == 0) {
> set success bit;
> } else {
> ret = fallback [ which sets the success/failure bit];
> }
> return ret

Cute.

> 
> 
>> return ret;
>> instead of:
>> ret = [call the C code, which includes the fallback];
> 
> C code cannot handle the success/failure bit so we need to do something which 
> does:
> 
> int assembly_to_fallback()
> {
>ret = [syscall the fallback]
>if (success bit set)
>return ret;
>else
>return -ret;
> }

Wait, your calling convention has syscalls return positive values on error?

But I think this is moot. The syscalls in question never return nonzero success 
values, so you should be able to inline the syscall without worrying about this.

> 
> Also means going back and forth between the success bit and negative return.
> 
>> if (ret == 0)
>>   set success bit;
>> else
>>   set failure bit;
>> It's not obvious to me that the former ought to be faster.
>>> 
>>> The two advantages are:
>>> - No need to play back and forth with CR[SO] and negative return value.
>>> - No stack frame is required in VDSO C functions for the fallbacks.
>> How is no stack frame required?  Do you mean that the presence of the
>> fallback causes worse code generation?  Can you improve the fallback
>> instead?
> 
> When function F1 calls function F2 (with BL insn), the link register (LR) is 
> set with the return address in F1, so that at the end of F2, F2 branches to 
> LR (with BLR insn), that's how you return from functions.
> 
> When F2 calls function F3, the same happens, LR is set to the return of F3 
> into F2. It means that F2 has to save LR in order to be able to return to F1, 
> otherwise the return address from F2 into F1 is lost.
> 
> But ... thinking about it once more, indeed fallback means doing a syscall, 
> and in fact I realise that syscalls won't clobber LR, so it should be 
> possible to do something. Let me try it.
> 

With that plus assume that nonzero return means failure, I think you should 
have all your bases covered.
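
For concreteness, a rough sketch of an inlined powerpc fallback along the
lines just discussed; this is not the in-tree code, the clobber list is a
guess, and it assumes (per the discussion above) that the kernel preserves
LR across 'sc':

static __always_inline
long clock_gettime_fallback(clockid_t clkid, struct __kernel_timespec *ts)
{
        register long r0 asm("r0") = __NR_clock_gettime;
        register long r3 asm("r3") = clkid;
        register long r4 asm("r4") = (long)ts;

        asm volatile("sc\n"
                     "bns+  1f\n"          /* CR0[SO] clear: success */
                     "neg   %1, %1\n"      /* error: negate errno    */
                     "1:\n"
                     : "+r" (r0), "+r" (r3), "+r" (r4)
                     :
                     : "memory", "cr0", "ctr", "xer",
                       "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");

        return r3;      /* zero/negative; the trampoline maps it to CR[SO] */
}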

Re: [RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()

2019-12-24 Thread Andy Lutomirski


> On Dec 24, 2019, at 7:12 PM, christophe leroy  wrote:
> 
> 
> 
>> Le 24/12/2019 à 02:58, Andy Lutomirski a écrit :
>>> On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
>>>  wrote:
>>> 
>>> READ_ONCE() forces the read of the 64 bit value of
>>> vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec although
>>> only the lower part is needed.
>> Seems reasonable and very unlikely to be harmful.  That being said,
>> this function really ought to be considered deprecated -- 32-bit
>> time_t is insufficient.
>> Do you get even better code if you move the read into the if statement?
> 
> Euh ...
> 
> How can you return t when the time pointer is NULL if you read t only when
> the time pointer is not NULL?
> 
> 

Duh, never mind.

But this means your patch may be buggy: you need to make sure the compiler 
returns the *same* value it stores. Maybe you’re saved by the potential 
aliasing between the data page and the passed parameter and the value you read, 
but that’sa bad thing to rely on.

Try barrier() after the read.
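
Roughly what that suggestion looks like (a sketch only; the signature
follows the series' "pass vd from the arch" shape and the names are
approximate):

static __maybe_unused time_t
__c_kernel_time(time_t *time, const struct vdso_data *vd)
{
        time_t t = vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec;

        /* Keep this one read: without the barrier the compiler could
         * reload after the store through 'time' (which may alias the
         * data) and return a different value than it stored. */
        barrier();

        if (time)
                *time = t;

        return t;
}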

Re: [RFC PATCH v2 05/10] lib: vdso: inline do_hres()

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> do_hres() is called from several places, so GCC doesn't inline
> it at first.
>
> do_hres() takes a struct __kernel_timespec * parameter for
> passing the result. In the 32-bit case, this parameter corresponds
> to a local var in the caller. In order to provide a pointer
> to this structure, the caller has to put it on its stack and
> do_hres() has to write the result to the stack. This is suboptimal,
> especially on RISC processors like powerpc.
>
> By making GCC inline the function, the struct __kernel_timespec
> remains a local var using registers, avoiding the need to write and
> read stack.
>
> The improvement is significant on powerpc.

I'm okay with it, mainly because I don't expect many workloads to have
more than one copy of the code hot at the same time.


Re: [RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> On powerpc, __arch_get_vdso_data() clobbers the link register,
> requiring the caller to set a stack frame in order to save it.
>
> As the parent function already has to set a stack frame and save
> the link register to call the C vdso function, retrieving the
> vdso data pointer there is lighter.

I'm confused.  Can't you inline __arch_get_vdso_data()?  Or is the
issue that you can't retrieve the program counter on power without
clobbering the link register?

I would imagine that this patch generates worse code on any
architecture with PC-relative addressing modes (which includes at
least x86_64, and I would guess includes most modern architectures).

--Andy


Re: [RFC PATCH v2 02/10] lib: vdso: move call to fallback out of common code.

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> On powerpc, VDSO functions and syscalls cannot be implemented in C
> because the Linux kernel ABI requires that CR[SO] bit is set in case
> of error and cleared when no error.
>
> As this cannot be done in C, C VDSO functions and syscall'based
> fallback need a trampoline in ASM.
>
> By moving the fallback calls out of the common code, arches like
> powerpc can implement both the call to C VDSO and the fallback call
> in a single trampoline function.

Maybe the issue is that I'm not a powerpc person, but I don't
understand this.  The common vDSO code is in C.  Presumably this means
that you need an asm trampoline no matter what to call the C code.  Is
the improvement that, with this change, you can have the asm
trampoline do a single branch, so it's logically:

ret = [call the C code];
if (ret == 0) {
 set success bit;
} else {
 ret = fallback;
 if (ret == 0)
  set success bit;
else
  set failure bit;
}

return ret;

instead of:

ret = [call the C code, which includes the fallback];
if (ret == 0)
  set success bit;
else
  set failure bit;

It's not obvious to me that the former ought to be faster.

>
> The two advantages are:
> - No need to play back and forth with CR[SO] and negative return value.
> - No stack frame is required in VDSO C functions for the fallbacks.

How is no stack frame required?  Do you mean that the presence of the
fallback causes worse code generation?  Can you improve the fallback
instead?


Re: [RFC PATCH v2 01/10] lib: vdso: ensure all arches have 32bit fallback

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> In order to simplify next step which moves fallback call at arch
> level, ensure all arches have a 32bit fallback instead of handling
> the lack of 32bit fallback in the common code based
> on VDSO_HAS_32BIT_FALLBACK

I don't like this.  You've implemented what appear to be nonsensical
fallbacks (the 32-bit fallback for a 64-bit vDSO build?  There's no
such thing).

How exactly does this simplify patch 2?

--Andy


Re: [RFC PATCH v2 08/10] lib: vdso: Avoid duplication in __cvdso_clock_getres()

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> VDSO_HRES and VDSO_RAW clocks are handled the same way.
>
> Don't duplicate code.
>
> Signed-off-by: Christophe Leroy 

Reviewed-by: Andy Lutomirski 


Re: [RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> READ_ONCE() forces the read of the 64 bit value of
> vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec although
> only the lower part is needed.

Seems reasonable and very unlikely to be harmful.  That being said,
this function really ought to be considered deprecated -- 32-bit
time_t is insufficient.

Do you get even better code if you move the read into the if statement?

Reviewed-by: Andy Lutomirski 

--Andy


Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-10-26 Thread Andy Lutomirski
On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy
 wrote:
>
>
> >>> The performance is rather disappointing. That's most likely because all
> >>> calculations in the C implementation are based on 64-bit math and
> >>> converted to 32 bits at the very end. I guess the C implementation should
> >>> use 32-bit math like the assembly VDSO does as of today.
> >>
> >>> gettimeofday:vdso: 750 nsec/call
> >>>
> >>> gettimeofday:vdso: 1533 nsec/call
> >
> > Small improvement (3%) with the proposed change:
> >
> > gettimeofday:vdso: 1485 nsec/call
>
> By inlining do_hres() I get the following:
>
> gettimeofday:vdso: 1072 nsec/call
>

A perf report might be informative.


Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Andy Lutomirski


> On Sep 7, 2019, at 10:45 AM, Linus Torvalds  
> wrote:
> 
>> On Sat, Sep 7, 2019 at 10:42 AM Andy Lutomirski  wrote:
>> 
>> Linus, you rejected resolveat() because you wanted a *nice* API
> 
> No. I rejected resolveat() because it was a completely broken garbage
> API that couldn't do even basic stuff right (like O_CREAT).
> 
> We have a ton of flag space in the new openat2() model, we might as
> well leave the old flags alone that people are (a) used to and (b) we
> have code to support _anyway_.
> 
> Making up a new flag namespace is only going to cause us - and users -
> more work, and more confusion. For no actual advantage. It's not going
> to be "cleaner". It's just going to be worse.
> 
> 

If we keep all the flag bits in the same mask with the same values, then we’re 
stuck with O_RDONLY=0 and everything that implies.  We’ll have UPGRADE_READ 
that works differently from the missing plain-old-READ bit, and we can’t 
express execute-only-no-read-or-write. This sucks.

Can we at least split the permission bits into their own mask and make bits 0 
and 1 illegal in the main set of flags in openat2?

There’s another thread going on right now about adding a bit along the lines of 
“MAYEXEC”, and one of the conclusions was that it should wait for openat2 so 
that it can have the same semantics. If we’re stuck with O_RDONLY and friends, then 
MAYEXEC is doomed to being at least a bit nonsensical.

As an analogy, AMD64 introduced bigger PTEs but kept the same nonsense encoding 
of read and write permission. And then we got NX, and now we’re getting little 
holes in the encoding stolen by CET to mean new silly things.  I don’t know if 
you’ve been following the various rounds of patches, but it is truly horrible. 
The mapping from meaning to the actual bits is *shit*, and AMD64 should have 
made a clean break instead.

open()’s permission bits are basically the same situation. And the kernel 
*already* has a non-type-safe translation layer. Please, please let openat2() 
at least get rid of the turd in open()’s bits 0 and 1.
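
Purely as a hypothetical sketch of the split being asked for here, with
invented names (this is not the openat2() ABI that was merged):

struct open_how_sketch {                /* hypothetical layout */
        __u64 flags;    /* O_CLOEXEC, O_CREAT, ...; bits 0-1 rejected */
        __u64 access;   /* OA2_READ | OA2_WRITE | OA2_EXEC; 0 is legal */
        __u64 mode;
};

#define OA2_READ        0x1             /* invented, illustration only */
#define OA2_WRITE       0x2
#define OA2_EXEC        0x4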



Re: [PATCH v12 11/12] open: openat2(2) syscall

2019-09-07 Thread Andy Lutomirski



> On Sep 7, 2019, at 9:58 AM, Linus Torvalds  
> wrote:
> 
>> On Sat, Sep 7, 2019 at 5:40 AM Jeff Layton  wrote:
>> 
>> After thinking about this a bit, I wonder if we might be better served
>> with a new set of OA2_* flags instead of repurposing the O_* flags?
> 
> I'd hate to have yet _another_ set of translation functions, and
> another chance of people just getting it wrong either in user space or
> the kernel.
> 
> So no. Let's not make another set of flags that has no sane way to
> have type-safety to avoid more confusion.
> 
> The new flags that _only_ work with openat2() might be named with a
> prefix/suffix to mark that, but I'm not sure it's a huge deal.
> 
>

I agree with the philosophy, but I think it doesn’t apply in this case.  Here 
are the flags:

O_RDONLY, O_WRONLY, O_RDWR: not even a proper bitmask. The kernel already has 
the FMODE_ bits to make this make sense. How about we make the openat2 
permission bits consistent with the internal representation and let the O_ 
permission bits remain as an awful translation.  The kernel already translates 
like this, and it already sucks.

O_CREAT, O_TMPFILE, O_NOCTTY, O_TRUNC: not modes on the fd at all.  These 
affect the meaning of open().  Heck, for openat2, NOCTTY should be the default.

O_EXCL: hopelessly overloaded.

O_APPEND, O_DIRECT, O_SYNC, O_DSYNC, O_LARGEFILE, O_NOATIME, O_PATH, 
O_NONBLOCK: genuine mode bits

O_CLOEXEC: special because it affects the fd, not the struct file.

Linus, you rejected resolveat() because you wanted a *nice* API that people 
would use and that might even be adopted by other OSes. Let’s please not make 
openat2() be a giant pile of crap in the name of consistency with open().  
open(), frankly, is horrible.



Re: [PATCH v4 1/3] kasan: support backing vmalloc space with real shadow memory

2019-08-19 Thread Andy Lutomirski
> On Aug 18, 2019, at 8:58 PM, Daniel Axtens  wrote:
>

>>> Each page of shadow memory represents 8 pages of real memory. Could we use
>>> page_ref to count how many pieces of a shadow page are used so that we can
>>> free it when the ref count decreases to 0.
>
> I'm not sure how much of a difference it will make, but I'll have a look.
>

There are a grand total of eight possible pages that could require a
given shadow page. I would suggest that, instead of reference
counting, you just check all eight pages.

Or, better yet, look at the actual vm_area_struct and see where prev
and next point. That should tell you exactly which range can be freed.
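
A sketch of the "check all eight" idea: KASAN_SHADOW_SCALE_SHIFT is 3, so
one shadow page covers eight data pages. The in_use helper is assumed, not
an existing API:

#define PAGES_PER_SHADOW_PAGE   (1UL << KASAN_SHADOW_SCALE_SHIFT)      /* 8 */

/* Assumed helper: true if some live vmalloc mapping still covers addr. */
bool vmalloc_page_in_use(unsigned long addr);

static bool shadow_page_still_needed(unsigned long shadow_page)
{
        /* Inverse of the usual mem-to-shadow mapping. */
        unsigned long mem = (shadow_page - KASAN_SHADOW_OFFSET)
                                << KASAN_SHADOW_SCALE_SHIFT;
        unsigned long i;

        for (i = 0; i < PAGES_PER_SHADOW_PAGE; i++)
                if (vmalloc_page_in_use(mem + i * PAGE_SIZE))
                        return true;

        return false;
}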


Re: [PATCH v4 1/3] kasan: support backing vmalloc space with real shadow memory

2019-08-16 Thread Andy Lutomirski
On Fri, Aug 16, 2019 at 10:08 AM Mark Rutland  wrote:
>
> Hi Christophe,
>
> On Fri, Aug 16, 2019 at 09:47:00AM +0200, Christophe Leroy wrote:
> > Le 15/08/2019 à 02:16, Daniel Axtens a écrit :
> > > Hook into vmalloc and vmap, and dynamically allocate real shadow
> > > memory to back the mappings.
> > >
> > > Most mappings in vmalloc space are small, requiring less than a full
> > > page of shadow space. Allocating a full shadow page per mapping would
> > > therefore be wasteful. Furthermore, to ensure that different mappings
> > > use different shadow pages, mappings would have to be aligned to
> > > KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE.
> > >
> > > Instead, share backing space across multiple mappings. Allocate
> > > a backing page the first time a mapping in vmalloc space uses a
> > > particular page of the shadow region. Keep this page around
> > > regardless of whether the mapping is later freed - in the mean time
> > > the page could have become shared by another vmalloc mapping.
> > >
> > > This can in theory lead to unbounded memory growth, but the vmalloc
> > > allocator is pretty good at reusing addresses, so the practical memory
> > > usage grows at first but then stays fairly stable.
> >
> > I guess people having gigabytes of memory don't mind, but I'm concerned
> > about tiny targets with very little amount of memory. I have boards with as
> > little as 32Mbytes of RAM. The shadow region for the linear space already
> > takes one eighth of the RAM. I'd rather avoid keeping unused shadow pages
> > busy.
>
> I think this depends on how much shadow would be in constant use vs what
> would get left unused. If the amount in constant use is sufficiently
> large (or the residue is sufficiently small), then it may not be
> worthwhile to support KASAN_VMALLOC on such small systems.
>
> > Each page of shadow memory represents 8 pages of real memory. Could we use
> > page_ref to count how many pieces of a shadow page are used so that we can
> > free it when the ref count decreases to 0.
> >
> > > This requires architecture support to actually use: arches must stop
> > > mapping the read-only zero page over portion of the shadow region that
> > > covers the vmalloc space and instead leave it unmapped.
> >
> > Why 'must' ? Couldn't we switch back and forth from the zero page to real
> > page on demand ?
> >
> > If the zero page is not mapped for unused vmalloc space, bad memory accesses
> > will Oops on the shadow memory access instead of Oopsing on the real bad
> > access, making it more difficult to locate and identify the issue.
>
> I agree this isn't nice, though FWIW this can already happen today for
> bad addresses that fall outside of the usual kernel address space. We
> could make the !KASAN_INLINE checks resilient to this by using
> probe_kernel_read() to check the shadow, and treating unmapped shadow as
> poison.

Could we instead modify the page fault handlers to detect this case
and print a useful message?
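
Something along these lines in the arch fault path could do that; a sketch
only, with arch-specific shadow range macros and no claim about where
exactly it would hook in:

static void kasan_shadow_fault_hint(unsigned long addr)
{
        if (!IS_ENABLED(CONFIG_KASAN_VMALLOC))
                return;

        if (addr >= KASAN_SHADOW_START && addr < KASAN_SHADOW_END) {
                /* Inverse of the mem-to-shadow mapping, for the message only. */
                unsigned long orig = (addr - KASAN_SHADOW_OFFSET)
                                        << KASAN_SHADOW_SCALE_SHIFT;

                pr_alert("KASAN: unmapped shadow access, likely a wild access near %px\n",
                         (void *)orig);
        }
}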


Re: [PATCH v2 3/6] x86: clean up _TIF_SYSCALL_EMU handling using ptrace_syscall_enter hook

2019-04-30 Thread Andy Lutomirski
On Mon, Mar 18, 2019 at 3:49 AM Sudeep Holla  wrote:
>
> Now that we have a new hook ptrace_syscall_enter that can be called from
> syscall entry code and it handles PTRACE_SYSEMU in generic code, we
> can do some cleanup using the same in syscall_trace_enter.
>
> Further the extra logic to find single stepping PTRACE_SYSEMU_SINGLESTEP
> in syscall_slow_exit_work seems unnecessary. Let's remove the same.
>

Unless the patch set contains a selftest that exercises all the
interesting cases here, NAK.  To be clear, there needs to be a test
that passes on an unmodified kernel and still passes on a patched
kernel.  And that test case needs to *fail* if, for example, you force
"emulated" to either true or false rather than reading out the actual
value.

--Andy


Re: [PATCH v2 0/6] ptrace: consolidate PTRACE_SYSEMU handling and add support for arm64

2019-03-18 Thread Andy Lutomirski
On Mon, Mar 18, 2019 at 3:49 AM Sudeep Holla  wrote:
>
> Hi,
>
> This patchset evolved from the discussion in the thread[0][1]. When we
> wanted to add PTRACE_SYSEMU support to ARM64, we thought instead of
> duplicating what other architectures like x86 and powerpc have done,
> let consolidate the existing support and move it to the core as there's
> nothing arch specific in it.
>

In the discussion from the first version, there was talk of some
testing.  Can you put the test cases in question into v3?

--Andy

