Maxime,

I read your e-mail carefully and conclude that the best way forward here is
to put this one to core@ for a technical decision.

Cheers,
Andrew

On Wed, Jun 03, 2020 at 08:25:32AM +0200, Maxime Villard wrote:
> On 03/06/2020 at 01:49, Andrew Doran wrote:
> > On Tue, Jun 02, 2020 at 08:41:53AM +0200, Maxime Villard wrote:
> > 
> > > On 02/06/2020 at 00:58, Andrew Doran wrote:
> > > > Module Name:    src
> > > > Committed By:   ad
> > > > Date:           Mon Jun  1 22:58:06 UTC 2020
> > > > 
> > > > Modified Files:
> > > >         src/sys/arch/amd64/amd64: cpufunc.S
> > > >         src/sys/arch/amd64/include: frameasm.h
> > > > 
> > > > Log Message:
> > > > Reported-by: syzbot+6dd5a230d19f0cbc7...@syzkaller.appspotmail.com
> > > > 
> > > > Instrument STOS/MOVS for KMSAN to unbreak it.
> > > > 
> > > > 
> > > > To generate a diff of this commit:
> > > > cvs rdiff -u -r1.58 -r1.59 src/sys/arch/amd64/amd64/cpufunc.S
> > > > cvs rdiff -u -r1.49 -r1.50 src/sys/arch/amd64/include/frameasm.h
> > > > 
> > > > Please note that diffs are not public domain; they are subject to the
> > > > copyright notices on the relevant files.
> > > 
> > > Can you just stop ignoring the remarks that are made?
> > 
> > That's up to you, Maxime.  If you habitually make it difficult for people to
> > come to a reasonable compromise with you, then you're asking not to be taken
> > seriously and will find yourself being ignored.
> 
> You are confused. I asked for KMSAN to be unbroken, and proposed an
> alternative, which is atomic_load_relaxed. You were free to explain why my
> point was a bad idea or why it didn't matter, but you refused, and just
> added one hack on top of another. I'm afraid that's up to you.
> 
> But whatever, it doesn't matter. Thanks for reverting the pmap.c changes; the
> code is more correct now than before.
> 
> > > I said explicitly
> > > that adding manual instrumentation here is _wrong_.
> > 
> > I don't share your assessment in the general sense.  It should be possible
> > to instrument assembly code because KMSAN is useful AND we can't get by
> > without assembly - there are some things that C just can't do (or do well
> > enough).
> > 
> > On the assembly thing recall that recently you expressed a desire to remove
> > all of the amd64 assembly string functions from libc because of sanitizers -
> > I invested my time to do up a little demo to try and show you why that's not
> > a good idea:
> > 
> >     http://mail-index.netbsd.org/port-amd64/2020/04/19/msg003226.html
> 
> I saw, yes. I answered explaining that a conversation with Ryo Shimizu had
> changed my mind a bit, and seeing your results (which as far as I can tell
> do not show a performance improvement significant enough to stand out from
> the noise), I asked you whether it was really that relevant. You didn't
> follow up, though.
> 
> > The system is a balancing act.  There are lots of factors to be taken into
> > account: maintainability, tooling like KMSAN, users' varying needs, the
> > capabilities of different machines, performance, feature set and so on, and
> > dare I say it even the enjoyment of the people working on the project is
> > important too.  Myopic focus on one factor only to the detriment of others
> > is no good.
> 
> I am well aware of that.
> 
> > > The KMSAN ASM fixups
> > > are limited to args/returns, and that is part of a sensible policy that
> > > _should not_ be changed without a good reason.
> > > 
> > > x86_movs/x86_stos have strictly _no reason_ to exist. Of the 18 
> > > conversions
> > > you made to them in pmap.c, not one is justified. memcpy/memset were all
> > > correct.
> > 
> > I introduced these functions as a compromise because you were unhappy with
> > use of memcpy() to copy PDEs.  See:
> > 
> >     http://mail-index.netbsd.org/port-amd64/2020/05/23/msg003280.html
> > 
> > You wrote:
> > 
> >     In the [XXX] window, the PTEs could be used by userland.  If you
> >     copied them using memcpy(), some parts of the bytes could contain
> >     stale values.
> 
> Sure, I was explicitly referring to SVS, which has an unusual way of
> accessing PTEs (unlike pmap), which is why it needs special atomic
> care that other places do not.
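> 
> To illustrate the hazard (a hypothetical sketch, not kernel code; the
> helper names and shapes are made up): a byte-wise copy of a live PTE can
> mix bytes from two values, while a single aligned 64-bit load cannot tear
> on amd64:
> 
>     #include <stdint.h>
> 
>     /* May observe a mix of old and new bytes if *src changes mid-loop. */
>     static uint64_t
>     byte_wise_read(const volatile uint8_t *src)
>     {
>             uint64_t v = 0;
> 
>             for (int i = 0; i < 8; i++)
>                     v |= (uint64_t)src[i] << (8 * i);
>             return v;
>     }
> 
>     /* One aligned 64-bit load is atomic on amd64 and cannot tear. */
>     static uint64_t
>     quad_read(const volatile uint64_t *src)
>     {
>             return *src;
>     }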
> 
> > Live PDEs are also copied in pmap.c so I made a change there too.  After
> > that I decided "why not" and used the new functions everywhere PDEs/PTEs or
> > pages are block zeroed / copied.  But I'm also happy with memcpy()/memset().
> > Either way will work.  In fairness I do work too fast sometimes.
> > 
> > > The only reason you made these big unneeded changes is for SVS not to take
> > > the bus lock,
> > 
> > There is no bus lock on x86 (not since the 1990s anyway).
> > 
> > > but as was said already, atomic_load_relaxed will do what
> > > you want without the need for these functions.
> > > 
> > > Please revert _both changes now_, this one and the previous one which
> > > introduced both functions, and let's use atomic_load_relaxed.
> > 
> > You're focusing on only one factor.  I'll explain in detail.  Here is the
> > original code I replaced:
> > 
> >     static inline pt_entry_t
> >     svs_pte_atomic_read(struct pmap *pmap, size_t idx)
> >     {
> >             /*
> >              * XXX: We don't have a basic atomic_fetch_64 function?
> >              */
> >             return atomic_cas_64(&pmap->pm_pdir[idx], 666, 666);
> >     }
> > ...
> >     /* User slots. */
> >     for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
> >             pte = svs_pte_atomic_read(pmap, i);
> >             ci->ci_svs_updir[i] = pte;
> >     }
> > 
> > There's no need for an atomic op there because fetches on x86 are by
> > definition atomic, and it does nothing to alter the memory ordering in this
> > case.  There are side effects to the atomic op: it's serializing and always
> > generates an unbuffered writeback, even in the failure case.  So the source
> > is being copied into itself, as well as into the destination, and the CPU's
> > store buffer is rendered ineffective.
> > 
> > A cut-n-paste replacement to use the relaxed functions predictably ties the
> > compiler in knots and the generated code is bad.
> > 
> >     /* User slots. */
> >     for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
> >             pte = atomic_load_relaxed(&pmap->pm_pdir[i]);
> >             atomic_store_relaxed(&ci->ci_svs_updir[i], pte);
> 
> Note that atomic_store_relaxed is not needed here. These are the per-CPU page
> tables, and they cannot be accessed by anyone else at the same time. All
> we care about is fetching the pmap page tables in quads.
> 
> >     }
> > 
> > 0000000000400c9f <copy_relaxed>:
> >    400c9f:       48 8b 06                mov    (%rsi),%rax
> >    400ca2:       48 8b 17                mov    (%rdi),%rdx
> >    400ca5:       48 8d b0 00 00 40 06    lea    0x6400000(%rax),%rsi
> >    400cac:       48 8b 08                mov    (%rax),%rcx
> >    400caf:       48 89 4c 24 f8          mov    %rcx,-0x8(%rsp)
> >    400cb4:       48 8b 4c 24 f8          mov    -0x8(%rsp),%rcx
> >    400cb9:       48 89 0a                mov    %rcx,(%rdx)
> >    400cbc:       48 83 c0 08             add    $0x8,%rax
> >    400cc0:       48 83 c2 08             add    $0x8,%rdx
> >    400cc4:       48 39 f0                cmp    %rsi,%rax
> >    400cc7:       75 e3                   jne    400cac <copy_relaxed+0xd>
> >    400cc9:       c3                      retq
> > 
> > To get the relaxed functions working well I reckon you'd need to copy the
> > pointers and unroll the loop, and even then it might still be bad, and it
> > looks daft to boot:
> > 
> >     src = pmap->pm_pdir;
> >     dst = ci->ci_svs_updir;
> >     for (i = 0; i + 8 <= PDIR_SLOT_USERLIM; i += 8) {
> >             a = atomic_load_relaxed(&src[i+0]);
> >             b = atomic_load_relaxed(&src[i+1]);
> >             c = atomic_load_relaxed(&src[i+2]);
> >             d = atomic_load_relaxed(&src[i+3]);
> >             e = atomic_load_relaxed(&src[i+4]);
> >             f = atomic_load_relaxed(&src[i+5]);
> >             g = atomic_load_relaxed(&src[i+6]);
> >             h = atomic_load_relaxed(&src[i+7]);
> >             atomic_store_relaxed(&dst[i+0], a);
> >             atomic_store_relaxed(&dst[i+1], b);
> >             atomic_store_relaxed(&dst[i+2], c);
> >             atomic_store_relaxed(&dst[i+3], d);
> >             atomic_store_relaxed(&dst[i+4], e);
> >             atomic_store_relaxed(&dst[i+5], f);
> >             atomic_store_relaxed(&dst[i+6], g);
> >             atomic_store_relaxed(&dst[i+7], h);
> >     }
> >     for (; i < PDIR_SLOT_USERLIM; i++) {
> >             pte = atomic_load_relaxed(&src[i]);
> >             atomic_store_relaxed(&dst[i], pte);
> >     }
> 
> Considering that atomic_store_relaxed is not needed, the code without it is
> probably better than what this disassembly suggests.
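> 
> A minimal sketch of what I mean (assuming, as argued above, that plain
> stores to the per-CPU page tables are fine) lets the compiler generate
> straightforward code:
> 
>     /* User slots: atomic 64-bit loads, plain stores. */
>     for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
>             ci->ci_svs_updir[i] = atomic_load_relaxed(&pmap->pm_pdir[i]);
>     }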
> 
> > A way to satisfy all requirements is to use the instruction built into the
> > processor explicitly for block copies (REP MOVSQ) and tell sanitizers about
> > it.  memcpy() (or if you're worried about byte stores an explicit bit of
> > assembly) gives you access to that.  Here's a test I wrote to show how these
> > variants look in practice (minus the silly unrolled variant):
> > 
> >     http://www.netbsd.org/~ad/2020/copy.c
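> > 
> > For illustration, here is a rough sketch (not the committed code; the
> > function name and shape are hypothetical) of a REP MOVSQ block copy in
> > inline assembly:
> > 
> >     /* Copy nquads 64-bit words with REP MOVSQ (amd64 only). */
> >     static inline void
> >     copy_quads(uint64_t *dst, const uint64_t *src, size_t nquads)
> >     {
> >             asm volatile("rep movsq"
> >                 : "+D"(dst), "+S"(src), "+c"(nquads)
> >                 :: "memory");
> >     }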
> > 
> > On a Xeon:
> > 
> >     $ ./a.out
> >     alloc
> >     fault
> >     run
> >     atomic  707.262056MB per second
> >     relaxed 3054.676292MB per second
> >     memcpy  3494.007480MB per second
> > 
> > On a low-power AMD CPU:
> > 
> >     $ ./a.out
> >     alloc
> >     fault
> >     run
> >     atomic  253.765483MB per second
> >     relaxed 788.343882MB per second
> >     memcpy  935.265225MB per second
> 
> Thanks for the measurements; I have a few remarks here:
> 
>  - You're doing copy tests with 104857600 bytes. But the kernel code in
>    question only copies 2048 bytes, so how can the two be compared
>    precisely? It's been a while since I last checked the Intel performance
>    manual, but last I checked there were odd behaviors when it comes to
>    loops: beyond a certain iteration count, the CPU drops some internal
>    optimizations and falls back to a slow path. So if you copy much more
>    in your benchmark than in the real situation, you may well be triggering
>    slow paths in the CPU that are not triggered in the real situation (see
>    the sketch after this list).
> 
>  - As I said, atomic_store_relaxed() is not needed, so the copy_relaxed
>    figure in this benchmark is likely lower than what the real situation
>    would achieve.
> 
>  - You didn't include movsq? Since you decided to go with movsq, it
>    would have been relevant to compare it.
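> 
> For example, a measurement closer to the real case (a rough sketch, not a
> rigorous benchmark; the iteration count is arbitrary) would repeat the
> actual 2048-byte copy:
> 
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <time.h>
> 
>     #define NBYTES 2048             /* what the kernel actually copies */
>     #define ITERS  (4L * 1024 * 1024)
> 
>     int
>     main(void)
>     {
>             static uint64_t src[NBYTES / 8], dst[NBYTES / 8];
>             struct timespec t0, t1;
>             double secs;
> 
>             clock_gettime(CLOCK_MONOTONIC, &t0);
>             for (long i = 0; i < ITERS; i++) {
>                     memcpy(dst, src, NBYTES);
>                     /* Keep the copy from being optimized away. */
>                     asm volatile("" ::: "memory");
>             }
>             clock_gettime(CLOCK_MONOTONIC, &t1);
>             secs = (t1.tv_sec - t0.tv_sec) +
>                 (t1.tv_nsec - t0.tv_nsec) / 1e9;
>             printf("memcpy  %fMB per second\n",
>                 (double)NBYTES * ITERS / secs / 1e6);
>             return 0;
>     }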
> 
> Thanks,
> Maxime
