Maxime, I read your e-mail carefully and conclude that the best way forward here is to put this one to core@ for a technical decision.
Cheers,

Andrew

On Wed, Jun 03, 2020 at 08:25:32AM +0200, Maxime Villard wrote:
> On 03/06/2020 at 01:49, Andrew Doran wrote:
> > On Tue, Jun 02, 2020 at 08:41:53AM +0200, Maxime Villard wrote:
> >
> > > On 02/06/2020 at 00:58, Andrew Doran wrote:
> > > > Module Name:	src
> > > > Committed By:	ad
> > > > Date:		Mon Jun  1 22:58:06 UTC 2020
> > > >
> > > > Modified Files:
> > > > 	src/sys/arch/amd64/amd64: cpufunc.S
> > > > 	src/sys/arch/amd64/include: frameasm.h
> > > >
> > > > Log Message:
> > > > Reported-by: syzbot+6dd5a230d19f0cbc7...@syzkaller.appspotmail.com
> > > >
> > > > Instrument STOS/MOVS for KMSAN to unbreak it.
> > > >
> > > > To generate a diff of this commit:
> > > > cvs rdiff -u -r1.58 -r1.59 src/sys/arch/amd64/amd64/cpufunc.S
> > > > cvs rdiff -u -r1.49 -r1.50 src/sys/arch/amd64/include/frameasm.h
> > > >
> > > > Please note that diffs are not public domain; they are subject to the
> > > > copyright notices on the relevant files.
> > >
> > > Can you just stop ignoring the remarks that are made?
> >
> > That's up to you, Maxime.  If you habitually make it difficult for people
> > to come to a reasonable compromise with you, then you're asking not to be
> > taken seriously and will find yourself being ignored.
>
> You are confused.  I asked for KMSAN to be unbroken, and proposed an
> alternative: atomic_load_relaxed.  You were free to explain why my point
> was a bad idea or why it didn't matter, but you refused, and just added a
> hack on top of another.  I'm afraid that's up to you.
>
> But whatever, it doesn't matter.  Thanks for reverting the pmap.c changes;
> it is more correct now than before.
>
> > > I said explicitly that adding manual instrumentation here is _wrong_.
> >
> > I don't share your assessment in the general sense.  It should be
> > possible to instrument assembly code, because KMSAN is useful AND we
> > can't get by without assembly - there are some things that C just can't
> > do (or do well enough).
> >
> > On the assembly thing, recall that recently you expressed a desire to
> > remove all of the amd64 assembly string functions from libc because of
> > sanitizers - I invested my time to do up a little demo to try and show
> > you why that's not a good idea:
> >
> > http://mail-index.netbsd.org/port-amd64/2020/04/19/msg003226.html
>
> I saw, yes.  I answered explaining that a conversation with Ryo Shimizu
> had changed my mind a bit, and seeing your results (which as far as I can
> tell do not indicate a performance improvement significant enough not to
> be considered noise), I asked you whether it was that relevant.  You
> didn't follow up, though.
>
> > The system is a balancing act.  There are lots of factors to be taken
> > into account: maintainability, tooling like KMSAN, users' varying needs,
> > the capabilities of different machines, performance, feature set and so
> > on - and dare I say it, even the enjoyment of the people working on the
> > project is important too.  Myopic focus on one factor only, to the
> > detriment of others, is no good.
>
> I am well aware of that.
>
> > > The KMSAN ASM fixups are limited to args/returns, and that is part of
> > > a sensible policy that _should not_ be changed without a good reason.
> > >
> > > x86_movs/x86_stos have strictly _no reason_ to exist.  Of the 18
> > > conversions you made to them in pmap.c, not one is justified.
> > > memcpy/memset were all correct.
> >
> > I introduced these functions as a compromise because you were unhappy
> > with the use of memcpy() to copy PDEs.  See:
> >
> > http://mail-index.netbsd.org/port-amd64/2020/05/23/msg003280.html
> >
> > You wrote:
> >
> > 	In the [XXX] window, the PTEs could be used by userland.  If you
> > 	copied them using memcpy(), some parts of the bytes could contain
> > 	stale values.
>
> Sure, I was explicitly referring to SVS, which has an unusual way of
> accessing PTEs (contrary to pmap), which is why it needs special atomic
> care that other places do not.
>
> > Live PDEs are also copied in pmap.c, so I made a change there too.
> > After that I decided "why not" and used the new functions everywhere
> > PDEs/PTEs or pages are block zeroed / copied.  But I'm also happy with
> > memcpy()/memset().  Either way will work.  In fairness, I do work too
> > fast sometimes.
>
> > > The only reason you made these big unneeded changes is for SVS not to
> > > take the bus lock,
> >
> > There is no bus lock on x86 (not since the 1990s, anyway).
> >
> > > but as was said already, atomic_load_relaxed will do what you want
> > > without the need for these functions.
> > >
> > > Please revert _both changes now_, this one and the previous one which
> > > introduced both functions, and let's use atomic_load_relaxed.
> >
> > You're focusing on only one factor.  I'll explain in detail.  Here is
> > the original code I replaced:
> >
> > 	685 static inline pt_entry_t
> > 	686 svs_pte_atomic_read(struct pmap *pmap, size_t idx)
> > 	687 {
> > 	688 	/*
> > 	689 	 * XXX: We don't have a basic atomic_fetch_64 function?
> > 	690 	 */
> > 	691 	return atomic_cas_64(&pmap->pm_pdir[idx], 666, 666);
> > 	692 }
> > 	...
> > 	717 	/* User slots. */
> > 	718 	for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
> > 	719 		pte = svs_pte_atomic_read(pmap, i);
> > 	720 		ci->ci_svs_updir[i] = pte;
> > 	721 	}
> >
> > There's no need for an atomic op there, because aligned fetches on x86
> > are by definition atomic, and the atomic op does nothing to alter the
> > memory ordering in this case.  It does have side effects, though: it is
> > serializing and always generates an unbuffered writeback, even in the
> > failure case.  So the source is being copied into itself as well as
> > into the destination, and the CPU's store buffer is rendered
> > ineffective.
> >
> > A cut-n-paste replacement using the relaxed functions predictably ties
> > the compiler in knots, and the generated code is bad:
> >
> > 	/* User slots. */
> > 	for (i = 0; i < PDIR_SLOT_USERLIM; i++) {
> > 		pte = atomic_load_relaxed(&pmap->pm_pdir[i]);
> > 		atomic_store_relaxed(&ci->ci_svs_updir[i], pte);
>
> Note that atomic_store_relaxed is not needed here.  These are the per-CPU
> page tables, and they cannot be accessed by someone else at the same time.
> All we care about is fetching the pmap page tables in quads.
>
> > 	}
> >
> > 0000000000400c9f <copy_relaxed>:
> >   400c9f:	48 8b 06             	mov    (%rsi),%rax
> >   400ca2:	48 8b 17             	mov    (%rdi),%rdx
> >   400ca5:	48 8d b0 00 00 40 06 	lea    0x6400000(%rax),%rsi
> >   400cac:	48 8b 08             	mov    (%rax),%rcx
> >   400caf:	48 89 4c 24 f8       	mov    %rcx,-0x8(%rsp)
> >   400cb4:	48 8b 4c 24 f8       	mov    -0x8(%rsp),%rcx
> >   400cb9:	48 89 0a             	mov    %rcx,(%rdx)
> >   400cbc:	48 83 c0 08          	add    $0x8,%rax
> >   400cc0:	48 83 c2 08          	add    $0x8,%rdx
> >   400cc4:	48 39 f0             	cmp    %rsi,%rax
> >   400cc7:	75 e3                	jne    400cac <copy_relaxed+0xd>
> >   400cc9:	c3                   	retq
> >
> > To get the relaxed functions working well, I reckon you'd need to copy
> > the pointers and unroll the loop - and even then it might still be bad,
> > and looks daft to boot:
> >
> > 	src = pmap->pm_pdir;
> > 	dst = ci->ci_svs_updir;
> > 	for (i = 0; i + 8 <= PDIR_SLOT_USERLIM; i += 8) {
> > 		a = atomic_load_relaxed(&src[i+0]);
> > 		b = atomic_load_relaxed(&src[i+1]);
> > 		c = atomic_load_relaxed(&src[i+2]);
> > 		d = atomic_load_relaxed(&src[i+3]);
> > 		e = atomic_load_relaxed(&src[i+4]);
> > 		f = atomic_load_relaxed(&src[i+5]);
> > 		g = atomic_load_relaxed(&src[i+6]);
> > 		h = atomic_load_relaxed(&src[i+7]);
> > 		atomic_store_relaxed(&dst[i+0], a);
> > 		atomic_store_relaxed(&dst[i+1], b);
> > 		atomic_store_relaxed(&dst[i+2], c);
> > 		atomic_store_relaxed(&dst[i+3], d);
> > 		atomic_store_relaxed(&dst[i+4], e);
> > 		atomic_store_relaxed(&dst[i+5], f);
> > 		atomic_store_relaxed(&dst[i+6], g);
> > 		atomic_store_relaxed(&dst[i+7], h);
> > 	}
> > 	for (; i < PDIR_SLOT_USERLIM; i++) {
> > 		pte = atomic_load_relaxed(&src[i]);
> > 		atomic_store_relaxed(&dst[i], pte);
> > 	}
>
> Considering that atomic_store_relaxed is not needed, the real situation
> without it is probably better than what this disassembly suggests.
>
> > A way to satisfy all requirements is to use the instruction built into
> > the processor explicitly for block copies (REP MOVSQ) and tell
> > sanitizers about it.  memcpy() - or, if you're worried about byte
> > stores, an explicit bit of assembly - gives you access to that.  Here's
> > a test I wrote to show how these variants look in practice (minus the
> > silly unrolled variant):
> >
> > http://www.netbsd.org/~ad/2020/copy.c
> >
> > On a Xeon:
> >
> > 	$ ./a.out
> > 	alloc
> > 	fault
> > 	run
> > 	atomic  707.262056MB per second
> > 	relaxed 3054.676292MB per second
> > 	memcpy  3494.007480MB per second
> >
> > On a low-power AMD CPU:
> >
> > 	$ ./a.out
> > 	alloc
> > 	fault
> > 	run
> > 	atomic  253.765483MB per second
> > 	relaxed 788.343882MB per second
> > 	memcpy  935.265225MB per second
>
> Thanks for the measurements.  I have a few remarks:
>
> - You're doing copy tests with 104857600 bytes, but the kernel code in
>   question only copies 2048 bytes, so how can the comparison be precise?
>   It's been a while since I last checked the Intel performance manual,
>   but last I checked there were odd behaviors when it comes to loops: up
>   to a certain iteration count, the CPU abandons internal optimizations
>   and uses a slow path.  So if you copy much more in your benchmark than
>   in the real situation, you may well be triggering slow paths in the
>   CPU that are not triggered in the real situation.
>
> - As I said, atomic_store_relaxed() is not needed, so the copy_relaxed
>   figure in this benchmark is likely lower than it would be in the real
>   situation.
>
> - You didn't include movsq?  Since you decided to go with movsq, it
>   would have been relevant to compare it.
>
> Thanks,
> Maxime