Re: XEN modules?

2018-03-25 Thread Paul Goyette
>> I know there's some magic somewhere in the Makefiles that automatically
>> builds both "normal" and XEN variants of the modules for amd64 (and, for
>> i386, there's a PAE variant, too)!  But I can't seem to find the magic!
>>
>> Anyway, it seems to me that, when building the XEN (or PAE) variants of
>> each module, they should be built with -DXEN (or -DPAE) options.  There is
>> at least one source file, compiled into the compat module, which has some
>> code conditional on XEN.  (For those who care, it is the compat_60 x86 code
>> for updating AMD cpu_ucode.)
>>
>> Based on log files, we're currently not defining XEN or PAE for the module
>> builds.  As a result the XEN variant of the compat module does not contain
>> the required code, preventing use of the 6.0 cpuctl command.
>>
>> So,
>>
>> 1. Would it be reasonable to define the XEN and PAE macros when building
>>    the module variants?

> Still need an answer to this question...

>> 2. What would it take to make it happen?

> Add
>
>     CPPFLAGS+=  -DXEN  (or -DPAE)
>
> to the appropriate *.mk files?  (See #3 below.)  These files already have
>
>     XEN=1   and   PAE=1

>> 3. Where's the magic?  :)

> OK, found the magic.  It's in
>
>     src/sys/modules/arch/x86/amd64-xen/bsd.amd64-xen.mk
>     src/sys/modules/arch/x86/i386-xen/bsd.i386-xen.mk
> and src/sys/modules/arch/x86/i386pae-xen/bsd.i386pae-xen.mk
>
> If the answer to Question #1 is "Yes, it's reasonable" then I can take care
> of making the changes.



Follow-on question:  the changes identified above have the desired
effect, and the code conditional on XEN is now being built.  Unfortunately,
that has exposed the next problem!


The XEN-specific code has a

#include "xen/intrptrs.h"

which for kernel builds would be found via a symlink in the kernel's
object directory:


xen -> /build/netbsd-compat/src/sys/arch/xen/include

This symlink doesn't exist in the module's object directory, so the 
#include fails.


It would appear that I might need to add some more magic to create the
xen (and/or pae) symlink for the modules.  It looks like this is handled
in share/mk/bsd.klinks.mk but it's not exactly clear what I should add.
And I can't seem to find where the kernel Makefiles make this happen, so
I have no examples from which to borrow.  I can't tell if I should (a) set
XEN_BUILD to something (to what?), (b) override an existing value of
KLINK_MACHINE, or (c) just update KLINK_FILES?  (And what is xen-ma used
for, anyway?)


Again, any clue/help would be greatly appreciated.

(As you all can probably tell from this thread, I'm far from an expert 
when it comes to our build infrastructure...)




+------------------+--------------------------+----------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+


Debug Register (x86) trap - race

2018-03-25 Thread Kamil Rytarowski
I've been investigating the race in ptrace(2) ATF tests: dbregs_* for
x86 (i386 and amd64).

Examples:

NetBSD/amd64
http://releng.netbsd.org/b5reports/amd64/2018/2018.03.06.11.21.31/test.html#lib_libc_sys_t_ptrace_wait6_dbregs_dr3_trap_variable_readwrite_write_2bytes

NetBSD/i386
http://releng.netbsd.org/b5reports/i386/2017/2017.12.06.17.54.58/test.html#lib_libc_sys_t_ptrace_wait3_dbregs_dr2_trap_variable_readwrite_read_2bytes



I still don't know what the root cause of the race is, and I'm open to
suggestions.



Observations:
1. According to my tests, this bug is reproducible only in the scenario
with a debug register trap.

2. It's reproducible on slower x86 hardware or under software emulation
(qemu). I have not been able to reproduce a single failure on a
post-Core2Duo CPU.

3. Slightly changing the test scenario (e.g. switching to PT_STEP, or
replacing SIGSTOP with another signal) makes the failure disappear entirely.

4. Adding almost any debug code anywhere makes the failure either disappear
or reproduce much less frequently (sometimes once a day).

5. I've determined that the bug is caused by the fact that _lwp_kill(2)
(called via raise(2)) [under a debugger] can cause two wait(2) calls to
return for the same signal.

A few days ago, I prepared this document:

http://netbsd.org/~kamil/kernel/sigstop.txt

6. According to my tests in lwp_userret():

   1549         /*
   1550          * It is safe to do this read unlocked on a MP system..
   1551          */
   1552         while ((l->l_flag & LW_USERRET) != 0) {
   1553                 /*
   1554                  * Process pending signals first, unless the process
   1555                  * is dumping core or exiting, where we will instead
   1556                  * enter the LW_WSUSPEND case below.
   1557                  */
   1558                 if ((l->l_flag & (LW_PENDSIG | LW_WCORE | LW_WEXIT)) ==
   1559                     LW_PENDSIG) {
   1560                         mutex_enter(p->p_lock);
   1561                         while ((sig = issignal(l)) != 0)
   1562                                 postsig(sig);
   1563                         mutex_exit(p->p_lock);
   1564                 }
   1565

 -- src/sys/kern/kern_lwp.c

We always enter the while() loop and call issignal(). Sometimes we
extract a signal and call postsig(). For most such calls we observe no
misbehavior.

7. I've checked that the PT_CONTINUE branch where p_stat != SSTOP is never
taken, at least for this race. We always execute:

828         /* Finally, deliver the requested signal (or none). */
829         if (t->p_stat == SSTOP) {
830                 /*
831                  * Unstop the process.  If it needs to take a
832                  * signal, make all efforts to ensure that at
833                  * an LWP runs to see it.
834                  */
835                 t->p_xsig = signo;
836                 if (resume_all)
837                         proc_unstop(t);
838                 else
839                         lwp_unstop(lt);
840                 return 0;
841         }

 -- src/sys/kern/sys_ptrace_common.c

8. We checked that we call sigpost() twice for a single _lwp_kill(2) call:
 A. handle_syscall() -> syscall() -> sv_invoke() -> sys__lwp_kill() ->
kpsignal2() -> sigpost()
 B. lwp_userret() -> issignal() -> sigswitch() -> proc_stop() -> sigpost()

9. Adding two consecutive wait(2) calls, one right after the other (with
the second called with WNOHANG and checked for an error condition), makes
the bug disappear completely.  (A minimal sketch of this workaround
follows below.)
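
For illustration, a minimal sketch of the double-wait workaround from
point 9 (not the actual ATF test code; error handling omitted):

    #include <sys/types.h>
    #include <sys/wait.h>

    static void
    await_stop(pid_t child)
    {
            int status;

            /* The first wait(2) picks up the stop event for the signal. */
            waitpid(child, &status, 0);

            /*
             * The second, non-blocking wait(2) drains the duplicate event
             * that _lwp_kill(2) sometimes produces; with WNOHANG it
             * returns 0 immediately when there is nothing to collect.
             */
            waitpid(child, &status, WNOHANG);
    }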



I'm looking for hints to speed up the investigation.





Re: XEN modules?

2018-03-25 Thread Paul Goyette

On Mon, 26 Mar 2018, Paul Goyette wrote:

> I know there's some magic somewhere in the Makefiles that automatically
> builds both "normal" and XEN variants of the modules for amd64 (and, for
> i386, there's a PAE variant, too)!  But I can't seem to find the magic!
>
> Anyway, it seems to me that, when building the XEN (or PAE) variants of
> each module, they should be built with -DXEN (or -DPAE) options.  There is
> at least one source file, compiled into the compat module, which has some
> code conditional on XEN.  (For those who care, it is the compat_60 x86 code
> for updating AMD cpu_ucode.)
>
> Based on log files, we're currently not defining XEN or PAE for the module
> builds.  As a result the XEN variant of the compat module does not contain
> the required code, preventing use of the 6.0 cpuctl command.
>
> So,
>
> 1. Would it be reasonable to define the XEN and PAE macros when building
>    the module variants?

Still need an answer to this question...

> 2. What would it take to make it happen?

Add

    CPPFLAGS+=  -DXEN  (or -DPAE)

to the appropriate *.mk files?  (See #3 below.)  These files already have

    XEN=1   and   PAE=1

> 3. Where's the magic?  :)

OK, found the magic.  It's in

    src/sys/modules/arch/x86/amd64-xen/bsd.amd64-xen.mk
    src/sys/modules/arch/x86/i386-xen/bsd.i386-xen.mk
and src/sys/modules/arch/x86/i386pae-xen/bsd.i386pae-xen.mk

If the answer to Question #1 is "Yes, it's reasonable" then I can take
care of making the changes.


Opinions, anyone?



+------------------+--------------------------+----------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+


XEN modules?

2018-03-25 Thread Paul Goyette
I know there's some magic somewhere in the Makefiles that automatically 
builds both "normal" and XEN variants of the modules for amd64 (and, for 
i386, there's a PAE variant, too)!  But I can't seem to find the magic!


Anyway, it seems to me that, when building the XEN (or PAE) variants of 
each module, they should be built with -DXEN (or -DPAE) options.  There 
is at least one source file, compiled into the compat module, which has 
some code conditional on XEN.  (For those who care, it is the compat_60 
x86 code for updating AMD cpu_ucode.)
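
To illustrate, a schematic of the kind of conditional involved (the
function names here are hypothetical stand-ins, not the actual compat_60
source):

    /* Schematic only: hypothetical names, not the real compat_60 code. */
    static int xen_ucode_apply(void) { return 0; }  /* stand-in */
    static int x86_ucode_apply(void) { return 0; }  /* stand-in */

    int
    compat60_ucode_update(void)
    {
    #ifdef XEN
            /* Compiled in only when -DXEN is on the command line. */
            return xen_ucode_apply();
    #else
            return x86_ucode_apply();
    #endif
    }

Unless -DXEN is passed when the module variant is built, the preprocessor
silently drops the Xen branch.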


Based on log files, we're currently not defining XEN or PAE for the
module builds.  As a result the XEN variant of the compat module does
not contain the required code, preventing use of the 6.0 cpuctl command.


So,

1. Would it be reasonable to define the XEN and PAE macros when building
   the module variants?

2. What would it take to make it happen?

3. Where's the magic?  :)



+------------------+--------------------------+----------------------------+
| Paul Goyette     | PGP Key fingerprint:     | E-mail addresses:          |
| (Retired)        | FA29 0E3B 35AF E8AE 6651 | paul at whooppee dot com   |
| Kernel Developer | 0786 F758 55DE 53BA 7731 | pgoyette at netbsd dot org |
+------------------+--------------------------+----------------------------+


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Maxime Villard

On 25/03/2018 at 17:03, Michael van Elst wrote:
> On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
>> The main cpu (on which your program executes) sends IPIs to remote cpus,
>> these cpus were idle and entered the halted state, and they take some time
>> to wake up and process the IPI. And that's time the main cpu spends waiting.
>>
>> As you said yourself it's not trivial to fix this, because the wakeup path
>> can be whatever interrupt entry point, and if we were to add some code there
>> it would slow down interrupt entry in general.
>
> The slowdown would be minimal; the cost comes from refilling the TLB, and
> that would happen independently of when the TLB was flushed.

I wouldn't be so sure; a new branch in the interrupt entry points (with
interrupts disabled) does cost something, and it's about the hottest path in
the kernel.

>> We could perhaps have something to instruct the main cpu not to wait for
>> cpus that are idle, only wait for those that aren't.
>
> Is there a guarantee that the IPI handler on the other CPUs has already
> started?

I don't think so, but I would have to re-check the LAPIC spec.

> Could it be deferred by some other high-level interrupt or
> by a section with disabled interrupts?

I don't think so (and I'm also not totally sure what you mean).

In all cases you need to handle the wakeup in the interrupt entry points.

An efficient way would be, before going into hlt, to flush the tlb and to
set an "idle" flag somewhere in curcpu. Then, we change each interrupt entry
point to set this flag to zero automatically, regardless of its previous
value. Finally, when a cpu performs a tlb shootdown, it does not send IPIs
to the cpus flagged as idle.
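
A rough sketch of that scheme (the ci_tlb_idle field and x86_ipi_tlb() are
hypothetical names, nothing like this exists yet; races around hlt would
need careful handling):

    /* Idle entry: flush, then advertise that our TLB holds nothing. */
    void
    cpu_idle_enter(struct cpu_info *ci)
    {
            tlbflush();
            ci->ci_tlb_idle = true;         /* hypothetical field */
            x86_hlt();
    }

    /*
     * Every interrupt entry point would unconditionally execute the
     * equivalent of:
     *
     *      curcpu()->ci_tlb_idle = false;
     */

    /* Shootdown: skip cpus that advertised a clean TLB. */
    void
    tlb_shootdown_broadcast(void)
    {
            CPU_INFO_ITERATOR cii;
            struct cpu_info *ci;

            for (CPU_INFO_FOREACH(cii, ci)) {
                    if (ci == curcpu() || ci->ci_tlb_idle)
                            continue;
                    x86_ipi_tlb(ci);        /* hypothetical sender */
            }
    }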

Maxime


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sun, Mar 25, 2018 at 07:24:25PM +0200, Maxime Villard wrote:
> Le 25/03/2018 à 17:27, Joerg Sonnenberger a écrit :
> > The other question is whether we can't just use the direct map for this
> > on amd64 and similar platforms?
> 
> no, because nothing guarantees the physical pages are contiguous.

At least for ubc_zerorange, they certainly don't have to be. For
ubc_uiomove, it's a bit more work to split the IO vector, but certainly
feasible as well. Alternatively, propagate it down to uiomove to allow
using *two* IO vectors.

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Maxime Villard

On 25/03/2018 at 17:27, Joerg Sonnenberger wrote:
> On Sun, Mar 25, 2018 at 03:19:28PM -0000, Michael van Elst wrote:
>> jo...@bec.de (Joerg Sonnenberger) writes:
>>> What about having a passive unmap as fourth option? I.e. when unmapping
>>> in the transfer map, just add them to a FIFO. Process the FIFO on each
>>> CPU when there is time or the transfer map runs out of space. Context
>>> switching for example would be a natural place to do any such
>>> invalidation.
>>
>> The mapping is so temporary that it is almost only used on a single CPU.
>> Basically it's:
>>
>>      win = ubc_alloc();
>>      if (!uiomove(win, ...))
>>              memset(win, 0, ...);
>>      ubc_release(win, ...);
>>
>> For this to be even visible on multiple CPUs, the thread would need
>> to migrate to another CPU. Locking down the LWP on a single CPU
>> might be the cheapest solution.
>
> Yeah, that's what ephemeral mappings were supposed to be for.

emaps are read-only.

> The other question is whether we can't just use the direct map for this
> on amd64 and similar platforms?

no, because nothing guarantees the physical pages are contiguous.


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sun, Mar 25, 2018 at 03:19:28PM -0000, Michael van Elst wrote:
> jo...@bec.de (Joerg Sonnenberger) writes:
> 
> >What about having a passive unmap as fourth option? I.e. when unmapping
> >in the transfer map, just add them to a FIFO. Process the FIFO on each
> >CPU when there is time or the transfer map runs out of space. Context
> >switching for example would be a natural place to do any such
> >invalidation.
> 
> The mapping is so temporary that it is almost only used on a single CPU.
> Basically it's:
> 
> win = ubc_alloc();
> if (!uiomove(win, ...))
>   memset(win, 0, ...);
> ubc_release(win, ...);
> 
> For this to be even visible on multiple CPUs, the thread would need
> to migrate to another CPU. Locking down the LWP on a single CPU
> might be the cheapest solution.

Yeah, that's what ephemeral mappings were supposed to be for. The other
question is whether we can't just use the direct map for this on amd64
and similar platforms?

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Michael van Elst
jo...@bec.de (Joerg Sonnenberger) writes:

>What about having a passive unmap as fourth option? I.e. when unmapping
>in the transfer map, just add them to a FIFO. Process the FIFO on each
>CPU when there is time or the transfer map runs out of space. Context
>switching for example would be a natural place to do any such
>invalidation.

The mapping is so temporary that it is almost only used on a single CPU.
Basically it's:

win = ubc_alloc();
if (!uiomove(win, ...))
        memset(win, 0, ...);
ubc_release(win, ...);

For this to be even visible on multiple CPUs, the thread would need
to migrate to another CPU. Locking down the LWP on a single CPU
might be the cheapest solution.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Michael van Elst
On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
> 
> The main cpu (on which your program executes) sends IPIs to remote cpus, these
> cpus were idle and entered the halted state, and they take some time to wake
> up and process the IPI. And that's time the main cpu spends waiting.
> 
> As you said yourself it's not trivial to fix this, because the wakeup path
> can be whatever interrupt entry point, and if we were to add some code there
> it would slow down interrupt entry in general.

The slowdown would be minimal; the cost comes from refilling the TLB, and
that would happen independently of when the TLB was flushed.


> We could perhaps have something to instruct the main cpu not to wait for cpus
> that are idle, only wait for those that aren't.

Is there a guarantee that the IPI handler on the other CPUs has already
started? Could it be deferred by some other high-level interrupt or
by a section with disabled interrupts?



Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sat, Mar 24, 2018 at 08:42:34PM +0100, Jaromír Doleček wrote:
> The problem there is that FFS triggers a pathologic case. I/O transfer maps
> and then unmaps each block into kernel pmap, so that the data could be
> copied into user memory. This triggers TLB shootdown IPIs for each FS
> block, sent to all CPUs which happen to be idle, or otherwise running on
> kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a
> lot of synchronization overhead.

What about having a passive unmap as fourth option? I.e. when unmapping
in the transfer map, just add them to a FIFO. Process the FIFO on each
CPU when there is time or the transfer map runs out of space. Context
switching for example would be a natural place to do any such
invalidation.
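
Something along these lines, perhaps (a sketch only; the names are
hypothetical, and overflow handling and locking are glossed over;
pmap_update_pg() here stands in for a per-page invlpg):

    #include <sys/types.h>

    #define UNMAP_FIFO_SIZE 128     /* arbitrary for the sketch */

    struct unmap_fifo {
            vaddr_t uf_va[UNMAP_FIFO_SIZE];
            u_int   uf_head, uf_tail;       /* head == tail: empty */
    };

    /* Unmap path: queue the VA instead of sending IPIs right away. */
    static void
    passive_unmap(struct unmap_fifo *uf, vaddr_t va)
    {
            uf->uf_va[uf->uf_head] = va;
            uf->uf_head = (uf->uf_head + 1) % UNMAP_FIFO_SIZE;
    }

    /* Drained per-CPU, e.g. at context switch or when the map fills. */
    static void
    unmap_fifo_drain(struct unmap_fifo *uf)
    {
            while (uf->uf_tail != uf->uf_head) {
                    pmap_update_pg(uf->uf_va[uf->uf_tail]);
                    uf->uf_tail = (uf->uf_tail + 1) % UNMAP_FIFO_SIZE;
            }
    }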

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Maxime Villard

On 25/03/2018 at 13:48, Michael van Elst wrote:
> jaromir.dole...@gmail.com (Jaromír Doleček) writes:
>> The problem there is that FFS triggers a pathologic case. I/O transfer
>> maps and then unmaps each block into kernel pmap, so that the data could
>> be copied into user memory.
>
> For some reason it's not per block. There is a mechanism that moves
> an 8kbyte window, independent of the block size. You can easily change
> the window size (I'm currently experimenting with 128kbyte) by building
> a kernel with e.g. options UBC_WINSHIFT=17.
>
> Independent of how the scaling problem will be handled, I suggest
> increasing that window to at least MAXPHYS.

Yes.

>> 3. change the UVM code to not do this mapping via the kernel map, so that
>> pmap_update() and the IPIs are not triggered in the first place
>
> What I currently don't understand is that we only see one TLB shootdown
> per window and that happens when it is _unmapped_. Mapping the pages
> (with the exception of overwriting) is done by the regular fault handler
> and apparently this does not flush the TLB.


IIRC that's normal. Each time we map a page, we know it wasn't mapped before,
so there is no need to flush the TLB, since we know the page isn't cached.
However, when unmapping, we do need to flush the TLB. It's an optimization:
if we map a page that we end up not touching, it won't have the "used" flag
set, and we can unmap it without flushing the TLB. So it saves us one TLB
shootdown.
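
Schematically, the unmap path can then look like this (a simplified sketch,
not the actual pmap code; it assumes the x86 pmap definitions of pt_entry_t
and the PG_U accessed bit, and tlb_shootdown() is a hypothetical helper):

    static void
    unmap_one(pt_entry_t *ptep, vaddr_t va)
    {
            pt_entry_t opte = *ptep;        /* atomics omitted */

            *ptep = 0;
            if ((opte & PG_U) == 0)
                    return;         /* never accessed: in no TLB */
            tlb_shootdown(va);      /* hypothetical helper */
    }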

By the way I saw your answer to the PR (by looking at mail-index.netbsd.org,
because you didn't CC me in the mail).

I think you are right - I was more thinking about a problem in pmap, but the
issue may just be that there is a latency in the cpu wakeup, as you said.

The main cpu (on which your program executes) sends IPIs to remote cpus, these
cpus were idle and entered the halted state, and they take some time to wake
up and process the IPI. And that's time the main cpu spends waiting.

As you said yourself it's not trivial to fix this, because the wakeup path
can be whatever interrupt entry point, and if we were to add some code there
it would slow down interrupt entry in general.

We could perhaps have something to instruct the main cpu not to wait for cpus
that are idle, only wait for those that aren't.

Maxime


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Manuel Bouyer
On Sun, Mar 25, 2018 at 11:48:02AM -0000, Michael van Elst wrote:
> [...]
> >3. change UVM code to not do this mapping via kernel map, so pmap_update()
> >and the IPIs are not triggered in first place
> 
> What I currently don't understand is that we only see one TLB shootdown
> per window and that happens when it is _unmapped_. Mapping the pages
> (with the exception of overwriting) is done by the regular fault handler
> and apparently this does not flush the TLB. This makes me think that it might
> not be required at all and is just an unwanted side-effect of pmap_update.

I think it's because if we get a fault, then this VA is not mapped
in the page table, and so it can't be cached in any TLB. So there is
nothing to flush in this case.

-- 
Manuel Bouyer 
     NetBSD: 26 years of experience will always make the difference
--


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Michael van Elst
jaromir.dole...@gmail.com (Jaromír Doleček) writes:

>The problem there is that FFS triggers a pathologic case. I/O transfer maps
>and then unmaps each block into kernel pmap, so that the data could be
>copied into user memory.

For some reason it's not per block. There is a mechanism that moves
an 8kbyte window, independent of the block size. You can easily change
the window size (I'm currently experimenting with 128kbyte) by building
a kernel with e.g. options UBC_WINSHIFT=17.

Independent of how the scaling problem will be handled, I suggest
increasing that window to at least MAXPHYS.


>3. change the UVM code to not do this mapping via the kernel map, so that
>pmap_update() and the IPIs are not triggered in the first place

What I currently don't understand is that we only see one TLB shootdown
per window and that happens when it is _unmapped_. Mapping the pages
(with the exception of overwriting) is done by the regular fault handler
and apparently this does not flush the TLB. This makes me think that it might
not be required at all and is just an unwanted side-effect of pmap_update.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Jaromír Doleček
Hello,

I'd like to gather some feedback on how to best tackle kern/53124.

The problem there is that FFS triggers a pathologic case. I/O transfer maps
and then unmaps each block into kernel pmap, so that the data could be
copied into user memory. This triggers TLB shootdown IPIs for each FS
block, sent to all CPUs which happen to be idle, or otherwise running on
kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a
lot of synchronization overhead.

I see three possible ways to fix this particular case:
1. make it possible to lazily invalidate the TLB for the kernel pmap as
well, i.e. make pmap_deactivate()/pmap_reactivate() work with the kernel
pmap - this avoids sending the TLB shootdown IPIs to idle CPUs
2. make the idle lwp run in its own pmap and address space - a variant of
#1 that avoids changing the invariant that the kernel map is always loaded
3. change the UVM code to not do this mapping via the kernel map, so that
pmap_update() and the IPIs are not triggered in the first place

I reckon #2 would add a lot of overhead to the idle routine; it would
require at least an address space switch. This switch might be fairly
cheap on architectures with address space IDs (avoiding TLB flushes),
but still much more than what the idle entry/exit does now.

Variants of #3 have been discussed on and off over the years, as I recall,
but there is no resolution yet, and I'm not aware of anyone actively
working on this. I understand this would be Hard, with nobody currently
having the time and courage.

This leaves #1 as a short-term practical solution. Does anyone foresee any
particular problems with this? Does it have a chance to fly? Any other
ideas?

Jaromir