Re: XEN modules?
I know there's some magic somewhere in the Makefiles that automatically builds both "normal" and XEN variants of the modules for amd64 (and, for i386, there's a PAE variant, too)! But I can't seem to find the magic!

Anyway, it seems to me that, when building the XEN (or PAE) variants of each module, they should be built with -DXEN (or -DPAE) options. There is at least one source file, compiled into the compat module, which has some code conditional on XEN. (For those who care, it is the compat_60 x86 code for updating AMD cpu_ucode.) Based on log files, we're currently not defining XEN or PAE for the module builds. As a result the XEN variant of the compat module does not contain the required code, preventing use of the 6.0 cpuctl command.

So,

1. Would it be reasonable to define the XEN and PAE macros when building the module variants? Still need an answer to this question...

2. What would it take to make it happen? Add CPPFLAGS+= -DXEN (or -DPAE) to the appropriate *.mk files? (See #3 below.) These files already have XEN=1 and PAE=1.

3. Where's the magic? :) OK, found the magic. It's in

   src/sys/modules/arch/x86/amd64-xen/bsd.amd64-xen.mk
   src/sys/modules/arch/x86/i386-xen/bsd.i386-xen.mk
   src/sys/modules/arch/x86/i386pae-xen/bsd.i386pae-xen.mk

If the answer to Question #1 is "Yes, it's reasonable" then I can take care of making the changes.

Follow-on question: the changes identified above have the desired effect, and code conditional on XEN is now being built. Unfortunately, that has exposed the next problem! The XEN-specific code has a #include "xen/intrptrs.h" which, for kernel builds, would be found thanks to a symlink in the kernel's object directory:

   xen -> /build/netbsd-compat/src/sys/arch/xen/include

This symlink doesn't exist in the module's object directory, so the #include fails. It would appear that I need to add some more magic to create the xen (and/or pae) symlink for the modules.
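For the record, the .mk change being discussed is a one-liner per variant file, roughly like this (a sketch only; I haven't verified that this is the right spot in the include order):

```make
# sketch for src/sys/modules/arch/x86/amd64-xen/bsd.amd64-xen.mk,
# which already contains XEN=1:
CPPFLAGS+=	-DXEN

# analogously, the i386pae-xen variant (bsd.i386pae-xen.mk) would add
# both -DXEN and -DPAE, matching its existing XEN=1 and PAE=1 settings
```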
It looks like this is handled in share/mk/bsd.klinks.mk, but it's not exactly clear what I should add. And I can't seem to find where the kernel Makefiles make this happen, so there are no examples from which to borrow. I can't tell if I should (a) set XEN_BUILD to something (to what?), (b) override an existing value of KLINK_MACHINE, or (c) just update KLINK_FILES? (And what is xen-ma used for, anyway?)

Again, any clue/help would be greatly appreciated. (As you all can probably tell from this thread, I'm far from an expert when it comes to our build infrastructure...)

| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
Debug Register (x86) trap - race
I've been investigating the race in the ptrace(2) ATF tests: dbregs_* for x86 (i386 and amd64). Examples:

NetBSD/amd64 http://releng.netbsd.org/b5reports/amd64/2018/2018.03.06.11.21.31/test.html#lib_libc_sys_t_ptrace_wait6_dbregs_dr3_trap_variable_readwrite_write_2bytes

NetBSD/i386 http://releng.netbsd.org/b5reports/i386/2017/2017.12.06.17.54.58/test.html#lib_libc_sys_t_ptrace_wait3_dbregs_dr2_trap_variable_readwrite_read_2bytes

I still don't know what the root cause of the race is, and I'm open to suggestions. Observations:

1. In my tests, this bug is reproducible only in the scenario with a debug register trap.

2. It's reproducible on slower x86 hardware or under software emulation (qemu). I have not been able to reproduce a single failure on a post-Core2Duo CPU.

3. Slightly changing the test scenario (e.g. switching to PT_STEP, or changing SIGSTOP to another signal) makes the failure disappear entirely.

4. Adding almost any debug code anywhere makes the failure either disappear or reproduce much less frequently (sometimes once a day).

5. I've determined that the bug is caused by the fact that _lwp_kill(2) (called via raise(2)) [under a debugger] can cause two wait(2) calls to return for the same signal. A few days ago, I prepared this document: http://netbsd.org/~kamil/kernel/sigstop.txt

6. According to my tests, in lwp_userret():

1549 	/*
1550 	 * It is safe to do this read unlocked on a MP system..
1551 	 */
1552 	while ((l->l_flag & LW_USERRET) != 0) {
1553 		/*
1554 		 * Process pending signals first, unless the process
1555 		 * is dumping core or exiting, where we will instead
1556 		 * enter the LW_WSUSPEND case below.
1557 		 */
1558 		if ((l->l_flag & (LW_PENDSIG | LW_WCORE | LW_WEXIT)) ==
1559 		    LW_PENDSIG) {
1560 			mutex_enter(p->p_lock);
1561 			while ((sig = issignal(l)) != 0)
1562 				postsig(sig);
1563 			mutex_exit(p->p_lock);
1564 		}
1565 
-- src/sys/kern/kern_lwp.c

we always enter the while() loop and call issignal(). Sometimes we extract a signal and call postsig().
Usually, for this call, we observe no misbehavior.

7. I've checked that the PT_CONTINUE branch where p_stat != SSTOP is never taken, at least for this race. We always call:

 828 	/* Finally, deliver the requested signal (or none). */
 829 	if (t->p_stat == SSTOP) {
 830 		/*
 831 		 * Unstop the process.  If it needs to take a
 832 		 * signal, make all efforts to ensure that at
 833 		 * least one LWP runs to see it.
 834 		 */
 835 		t->p_xsig = signo;
 836 		if (resume_all)
 837 			proc_unstop(t);
 838 		else
 839 			lwp_unstop(lt);
 840 		return 0;
 841 	}
-- src/sys/kern/sys_ptrace_common.c

8. We checked that we call sigpost() twice for a single _lwp_kill(2) call:

A. handle_syscall() -> syscall() -> sv_invoke() -> sys__lwp_kill() -> kpsignal2() -> sigpost()

B. lwp_userret() -> issignal() -> sigswitch() -> proc_stop() -> sigpost()

9. Adding two consecutive wait(2) calls, one right after the other (the second with WNOHANG, checked for an error condition), makes the bug disappear completely.

I'm looking for hints to speed up the research.
Re: XEN modules?
On Mon, 26 Mar 2018, Paul Goyette wrote:
> I know there's some magic somewhere in the Makefiles that automatically builds
> both "normal" and XEN variants of the modules for amd64 (and, for i386,
> there's a PAE variant, too)! [...]
> If the answer to Question #1 is "Yes, it's reasonable" then I can take care
> of making the changes.

Opinions, anyone?
XEN modules?
I know there's some magic somewhere in the Makefiles that automatically builds both "normal" and XEN variants of the modules for amd64 (and, for i386, there's a PAE variant, too)! But I can't seem to find the magic!

Anyway, it seems to me that, when building the XEN (or PAE) variants of each module, they should be built with -DXEN (or -DPAE) options. There is at least one source file, compiled into the compat module, which has some code conditional on XEN. (For those who care, it is the compat_60 x86 code for updating AMD cpu_ucode.) Based on log files, we're currently not defining XEN or PAE for the module builds. As a result the XEN variant of the compat module does not contain the required code, preventing use of the 6.0 cpuctl command.

So,

1. Would it be reasonable to define the XEN and PAE macros when building the module variants?

2. What would it take to make it happen?

3. Where's the magic? :)
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On 25/03/2018 at 17:03, Michael van Elst wrote:
> On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
>> The main cpu (on which your program executes) sends IPIs to remote cpus,
>> these cpus were idle and entered the halted state, and they take some time
>> to wake up and process the IPI. And that's time the main cpu spends waiting.
>>
>> As you said yourself it's not trivial to fix this, because the wakeup path
>> can be whatever interrupt entry point, and if we were to add some code there
>> it would slow down interrupt entry in general.
>
> The slowdown would be minimal; the cost comes from refilling the TLB, and
> that would happen independently of when the TLB was flushed.

I wouldn't be so sure; a new branch in the interrupt entry points (with interrupts disabled) does cost something, and it's about the hottest path in the kernel.

>> We could perhaps have something to instruct the main cpu not to wait for
>> cpus that are idle, only wait for those that aren't.
>
> Is there a guarantee that the IPI handler on the other CPUs has already
> started?

I don't think so, but I would have to re-check the LAPIC spec.

> Could it be deferred by some other high-level interrupt or by a section with
> disabled interrupts?

I don't think so (and I'm also not totally sure what you mean). In all cases you need to handle the wakeup in the interrupt entry points.

An efficient way would be, before going into hlt, to flush the TLB and to set an "idle" flag somewhere in curcpu. Then, we change each interrupt entry point to clear this flag automatically, regardless of its previous value. Finally, when a cpu performs a TLB shootdown, it does not send IPIs to the cpus flagged as idle.

Maxime
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 07:24:25PM +0200, Maxime Villard wrote: > Le 25/03/2018 à 17:27, Joerg Sonnenberger a écrit : > > The other question is whether we can't just use the direct map for this > > on amd64 and similar platforms? > > no, because nothing guarantees the physical pages are contiguous. At least for ubc_zerorange, they certainly don't have to be. For ubc_uiomove, it's a bit more work to split the IO vector, but certainly feasible as well. Alternatively, propagate it down to uiomove to allow using *two* IO vectors. Joerg
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On 25/03/2018 at 17:27, Joerg Sonnenberger wrote:
> On Sun, Mar 25, 2018 at 03:19:28PM -, Michael van Elst wrote:
>> jo...@bec.de (Joerg Sonnenberger) writes:
>>> What about having a passive unmap as fourth option? I.e. when unmapping
>>> in the transfer map, just add them to a FIFO. Process the FIFO on each
>>> CPU when there is time or the transfer map runs out of space. Context
>>> switching for example would be a natural place to do any such
>>> invalidation.
>>
>> The mapping is so temporary that it is almost only used on a single CPU.
>> Basically it's:
>>
>> 	win = ubc_alloc();
>> 	if (!uiomove(win, ...))
>> 		memset(win, 0, ...);
>> 	ubc_release(win, ...);
>>
>> For this to be even visible on multiple CPUs, the thread would need
>> to migrate to another CPU. Locking down the LWP on a single CPU
>> might be the cheapest solution.
>
> Yeah, that's what ephemeral mappings were supposed to be for.

emaps are read-only.

> The other question is whether we can't just use the direct map for this
> on amd64 and similar platforms?

no, because nothing guarantees the physical pages are contiguous.
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 03:19:28PM -, Michael van Elst wrote:
> jo...@bec.de (Joerg Sonnenberger) writes:
>
> >What about having a passive unmap as fourth option? I.e. when unmapping
> >in the transfer map, just add them to a FIFO. Process the FIFO on each
> >CPU when there is time or the transfer map runs out of space. Context
> >switching for example would be a natural place to do any such
> >invalidation.
>
> The mapping is so temporary that it is almost only used on a single CPU.
> Basically it's:
>
> 	win = ubc_alloc();
> 	if (!uiomove(win, ...))
> 		memset(win, 0, ...);
> 	ubc_release(win, ...);
>
> For this to be even visible on multiple CPUs, the thread would need
> to migrate to another CPU. Locking down the LWP on a single CPU
> might be the cheapest solution.

Yeah, that's what ephemeral mappings were supposed to be for. The other question is whether we can't just use the direct map for this on amd64 and similar platforms?

Joerg
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
jo...@bec.de (Joerg Sonnenberger) writes: >What about having a passive unmap as fourth option? I.e. when unmapping >in the transfer map, just add them to a FIFO. Process the FIFO on each >CPU when there is time or the transfer map runs out of space. Context >switching for example would be a natural place to do any such >invalidation. The mapping is so temporary that it is almost only used on a single CPU. Basically it's: win = ubc_alloc(); if (!uiomove(win, ...)) memset(win, 0, ...); ubc_release(win, ...); For this to be even visible on multiple CPUs, the thread would need to migrate to another CPU. Locking down the LWP on a single CPU might be the cheapest solution. -- -- Michael van Elst Internet: mlel...@serpens.de "A potential Snark may lurk in every tree."
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
> The main cpu (on which your program executes) sends IPIs to remote cpus, these
> cpus were idle and entered the halted state, and they take some time to wake
> up and process the IPI. And that's time the main cpu spends waiting.
>
> As you said yourself it's not trivial to fix this, because the wakeup path
> can be whatever interrupt entry point, and if we were to add some code there
> it would slow down interrupt entry in general.

The slowdown would be minimal; the cost comes from refilling the TLB, and that would happen independently of when the TLB was flushed.

> We could perhaps have something to instruct the main cpu not to wait for cpus
> that are idle, only wait for those that aren't.

Is there a guarantee that the IPI handler on the other CPUs has already started? Could it be deferred by some other high-level interrupt or by a section with disabled interrupts?

Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sat, Mar 24, 2018 at 08:42:34PM +0100, Jaromír Doleček wrote:
> The problem there is that FFS triggers a pathological case. I/O transfer maps
> and then unmaps each block into the kernel pmap, so that the data can be
> copied into user memory. This triggers TLB shootdown IPIs for each FS
> block, sent to all CPUs which happen to be idle, or otherwise running on
> the kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a
> lot of synchronization overhead.

What about having a passive unmap as a fourth option? I.e. when unmapping in the transfer map, just add the unmapped entries to a FIFO. Process the FIFO on each CPU when there is time or when the transfer map runs out of space. Context switching, for example, would be a natural place to do any such invalidation.

Joerg
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On 25/03/2018 at 13:48, Michael van Elst wrote:
> jaromir.dole...@gmail.com (Jaromír Doleček) writes:
>> The problem there is that FFS triggers a pathological case. I/O transfer
>> maps and then unmaps each block into the kernel pmap, so that the data can
>> be copied into user memory.
>
> For some reason it's not per block. There is a mechanism that moves an
> 8kbyte window, independent of the block size. You can easily change the
> window size (I'm currently experimenting with 128kbyte) by building a kernel
> with e.g. options UBC_WINSHIFT=17. Independent of how the scaling problem
> will be handled, I suggest increasing that window to at least MAXPHYS.

Yes.

>> 3. change UVM code to not do this mapping via kernel map, so pmap_update()
>> and the IPIs are not triggered in the first place
>
> What I currently don't understand is that we only see one TLB shootdown
> per window, and that happens when it is _unmapped_. Mapping the pages
> (with the exception of overwriting) is done by the regular fault handler,
> and apparently this does not flush the TLB.

IIRC that's normal. Each time we map a page, we know it wasn't mapped before, so there is no need to flush the TLB, since we know the page isn't cached. However, when unmapping, we do need to flush the TLB.

It's an optimization: if we map a page that we end up not touching, it won't have the "used" flag set, and we can unmap it without flushing the TLB. So it saves us one TLB shootdown.

By the way, I saw your answer to the PR (by looking at mail-index.netbsd.org, because you didn't CC me on the mail). I think you are right; I was thinking more of a problem in pmap, but the issue may just be that there is latency in the cpu wakeup, as you said.

The main cpu (on which your program executes) sends IPIs to remote cpus; these cpus were idle and entered the halted state, and they take some time to wake up and process the IPI. And that's time the main cpu spends waiting.
As you said yourself it's not trivial to fix this, because the wakeup path can be whatever interrupt entry point, and if we were to add some code there it would slow down interrupt entry in general. We could perhaps have something to instruct the main cpu not to wait for cpus that are idle, only wait for those that aren't. Maxime
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 11:48:02AM -, Michael van Elst wrote:
> [...]
> >3. change UVM code to not do this mapping via kernel map, so pmap_update()
> >and the IPIs are not triggered in first place
>
> What I currently don't understand is that we only see one TLB shootdown
> per window and that happens when it is _unmapped_. Mapping the pages
> (with the exception of overwriting) is done by the regular fault handler
> and apparently this does not flush the TLB.

I think it's because if we get a fault, then this VA is not mapped in the page table, and so it can't be cached in any TLB. So there is nothing to flush in this case.

-- 
Manuel Bouyer
NetBSD: 26 years of experience will always make the difference
--
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
jaromir.dole...@gmail.com (Jaromír Doleček) writes:

>The problem there is that FFS triggers a pathological case. I/O transfer maps
>and then unmaps each block into the kernel pmap, so that the data can be
>copied into user memory.

For some reason it's not per block. There is a mechanism that moves an 8kbyte window, independent of the block size. You can easily change the window size (I'm currently experimenting with 128kbyte) by building a kernel with e.g. options UBC_WINSHIFT=17. Independent of how the scaling problem will be handled, I suggest increasing that window to at least MAXPHYS.

>3. change UVM code to not do this mapping via kernel map, so pmap_update()
>and the IPIs are not triggered in the first place

What I currently don't understand is that we only see one TLB shootdown per window, and that happens when it is _unmapped_. Mapping the pages (with the exception of overwriting) is done by the regular fault handler, and apparently this does not flush the TLB. This makes me think that it might not be required at all and is just an unwanted side-effect of pmap_update.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
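For reference, the window-size experiment mentioned above needs only a kernel config change (the default 8 KiB window corresponds to a shift of 13, so 17 gives 128 KiB):

```
# kernel configuration fragment
options 	UBC_WINSHIFT=17		# 128 KiB UBC window (default: 8 KiB)
```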
Fixing excessive shootdowns during FFS read (kern/53124) on x86
Hello, I'd like to gather some feedback on how to best tackle kern/53124.

The problem there is that FFS triggers a pathological case. I/O transfer maps and then unmaps each block into the kernel pmap, so that the data can be copied into user memory. This triggers TLB shootdown IPIs for each FS block, sent to all CPUs which happen to be idle, or otherwise running on the kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a lot of synchronization overhead.

I see three possible ways to fix this particular case:

1. make it possible to lazily invalidate the TLB also for the kernel pmap, i.e. make pmap_deactivate()/pmap_reactivate() work with the kernel pmap - this avoids sending the TLB shootdown IPIs to idle CPUs

2. make the idle lwp run in its own pmap and address space - a variant of #1 that avoids changing the invariant that the kernel map is always loaded

3. change the UVM code to not do this mapping via the kernel map, so that pmap_update() and the IPIs are not triggered in the first place

I reckon #2 would add a lot of overhead to the idle routine; it would require at least an address space switch. This switch might be fairly cheap on architectures with address space IDs (avoiding TLB flushes), but still much more than what the idle entry/exit does now.

Variants of #3 have been discussed on and off over the years as I recall, but there is no resolution yet, and I'm not aware of anyone actively working on this. I understand this would be Hard, with nobody currently having the time and courage.

This leaves #1 as a short-term practical solution. Does anyone foresee any particular problems with this? Does it have a chance to fly? Any other ideas?

Jaromir