On Tue, Jun 02, 2020 at 08:41:53AM +0200, Maxime Villard wrote:

> On 02/06/2020 at 00:58, Andrew Doran wrote:
> > Module Name:        src
> > Committed By:       ad
> > Date:               Mon Jun  1 22:58:06 UTC 2020
> > 
> > Modified Files:
> >     src/sys/arch/amd64/amd64: cpufunc.S
> >     src/sys/arch/amd64/include: frameasm.h
> > 
> > Log Message:
> > Reported-by: syzbot+6dd5a230d19f0cbc7...@syzkaller.appspotmail.com
> > 
> > Instrument STOS/MOVS for KMSAN to unbreak it.
> > 
> > 
> > To generate a diff of this commit:
> > cvs rdiff -u -r1.58 -r1.59 src/sys/arch/amd64/amd64/cpufunc.S
> > cvs rdiff -u -r1.49 -r1.50 src/sys/arch/amd64/include/frameasm.h
> > 
> > Please note that diffs are not public domain; they are subject to the
> > copyright notices on the relevant files.
> Can you just stop ignoring the remarks that are made?

That's up to you Maxime.  If you habitually make it difficult for people to
come to a reasonable compromise with you, then you're asking to not be taken
seriously and will find yourself being ignored.

> I said explicitly
> that adding manual instrumentation here is _wrong_.

I don't share your assessment in the general sense.  It should be possible
to instrument assembly code, because KMSAN is useful AND we can't get by
without assembly - there are some things that C just can't do (or do well).

On the assembly thing recall that recently you expressed a desire to remove
all of the amd64 assembly string functions from libc because of sanitizers -
I invested my time to do up a little demo to try and show you why that's not
a good idea:


The system is a balancing act.  There are lots of factors to be taken into
account: maintainability, tooling like KMSAN, users' varying needs, the
capabilities of different machines, performance, feature set and so on, and
dare I say it even the enjoyment of the people working on the project is
important too.  Myopic focus on one factor only to the detriment of others
is no good.

> The kMSan ASM fixups
> are limited to args/returns, and that is part of a sensical policy that
> _should not_ be changed without a good reason.
> x86_movs/x86_stos have strictly _no reason_ to exist. Of the 18 conversions
> you made to them in pmap.c, not one is justified. memcpy/memset were all
> correct.

I introduced these functions as a compromise because you were unhappy with
use of memcpy() to copy PDEs.  See:


You wrote:

        In the [XXX] window, the PTEs could be used by userland.  If you
        copied them using memcpy(), some parts of the bytes could contain
        stale values.

Live PDEs are also copied in pmap.c so I made a change there too.  After
that I decided "why not" and used the new functions everywhere PDEs/PTEs or
pages are block zeroed / copied.  But I'm also happy with memcpy()/memset(). 
Either way will work.  In fairness I do work too fast sometimes.

> The only reason you made these big unneeded changes is for SVS not to take
> the bus lock,

There is no bus lock on x86 (not since the 1990s anyway).

> but as was said already, atomic_load_relaxed will do what
> you want without the need for these functions.
> Please revert _both changes now_, this one and the previous one which
> introduced both functions, and let's use atomic_load_relaxed.

You're focusing on only one factor.  I'll explain in detail.  Here is the
original code I replaced:

    static inline pt_entry_t
    svs_pte_atomic_read(struct pmap *pmap, size_t idx)
    {
            /*
             * XXX: We don't have a basic atomic_fetch_64 function?
             */
            return atomic_cas_64(&pmap->pm_pdir[idx], 666, 666);
    }
    ...
            /* User slots. */
            for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
                    pte = svs_pte_atomic_read(pmap, i);
                    ci->ci_svs_updir[i] = pte;
            }

There's no need for an atomic op there because fetches on x86 are by
definition atomic, and it does nothing to alter the memory ordering in this
case.  There are side effects to the atomic op: it's serializing and always
generates an unbuffered writeback, even in the failure case.  So the source
is being copied into itself, as well as into the destination, and the CPU's
store buffer is rendered ineffective.

A cut-n-paste replacement to use the relaxed functions predictably ties the
compiler in knots and the generated code is bad.

        /* User slots. */
        for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
                pte = atomic_load_relaxed(&pmap->pm_pdir[i]);
                atomic_store_relaxed(&ci->ci_svs_updir[i], pte);
        }

0000000000400c9f <copy_relaxed>:
  400c9f:       48 8b 06                mov    (%rsi),%rax
  400ca2:       48 8b 17                mov    (%rdi),%rdx
  400ca5:       48 8d b0 00 00 40 06    lea    0x6400000(%rax),%rsi
  400cac:       48 8b 08                mov    (%rax),%rcx
  400caf:       48 89 4c 24 f8          mov    %rcx,-0x8(%rsp)
  400cb4:       48 8b 4c 24 f8          mov    -0x8(%rsp),%rcx
  400cb9:       48 89 0a                mov    %rcx,(%rdx)
  400cbc:       48 83 c0 08             add    $0x8,%rax
  400cc0:       48 83 c2 08             add    $0x8,%rdx
  400cc4:       48 39 f0                cmp    %rsi,%rax
  400cc7:       75 e3                   jne    400cac <copy_relaxed+0xd>
  400cc9:       c3                      retq   

To get the relaxed functions working well I reckon you'd need to copy the
pointers and unroll the loop and even then it might still be bad, and
looks daft to boot:

        src = pmap->pm_pdir;
        dst = ci->ci_svs_updir;
        for (i = 0; i + 8 <= PDIR_SLOT_USERLIM; i += 8) {
                a = atomic_load_relaxed(&src[i+0]);
                b = atomic_load_relaxed(&src[i+1]);
                c = atomic_load_relaxed(&src[i+2]);
                d = atomic_load_relaxed(&src[i+3]);
                e = atomic_load_relaxed(&src[i+4]);
                f = atomic_load_relaxed(&src[i+5]);
                g = atomic_load_relaxed(&src[i+6]);
                h = atomic_load_relaxed(&src[i+7]);
                atomic_store_relaxed(&dst[i+0], a);
                atomic_store_relaxed(&dst[i+1], b);
                atomic_store_relaxed(&dst[i+2], c);
                atomic_store_relaxed(&dst[i+3], d);
                atomic_store_relaxed(&dst[i+4], e);
                atomic_store_relaxed(&dst[i+5], f);
                atomic_store_relaxed(&dst[i+6], g);
                atomic_store_relaxed(&dst[i+7], h);
        }
        for (; i < PDIR_SLOT_USERLIM; i++) {
                pte = atomic_load_relaxed(&src[i]);
                atomic_store_relaxed(&dst[i], pte);
        }

A way to satisfy all requirements is to use the instruction built into the
processor explicitly for block copies (REP MOVSQ) and tell the sanitizers
about it.  memcpy() (or, if you're worried about byte stores, an explicit
bit of assembly) gives you access to that.  Here's a test I wrote to show
how these variants look in practice (minus the silly unrolled variant):


On a Xeon:

        $ ./a.out
        atomic  707.262056MB per second
        relaxed 3054.676292MB per second
        memcpy  3494.007480MB per second

On a low-power AMD CPU:

        $ ./a.out
        atomic  253.765483MB per second
        relaxed 788.343882MB per second
        memcpy  935.265225MB per second

In this flamegraph you can see why this particular change matters, though
the general idea applies elsewhere.  With the original code, almost all of
the time under the various pmap_load() calls is spent in svs.c (there are
numerous ways in; uiomove() is the most prominent).


