Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-06-12 Thread Eric Hawicz

On 6/10/2018 3:02 PM, Jaromír Doleček wrote:

> 2018-05-12 21:24 GMT+02:00 Chuck Silvers :
>
>> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
>> the user address space is that it can lead to deadlock if the page
>
> Meanwhile I've tested this scenario - I wrote a test program that
> mmap(2)s a file, then calls write(2) using the mmap'd memory as the
> buffer, at the same file offset.
>
> It actually caused the process to get a SIGSEGV with the new code,
> whereas the old code made write(2) just return EINVAL. I think this
> happens because fault resolution returns EINVAL when the page has
> PG_BUSY. Funny thing is that on Mac OS X that code actually does end
> up with a deadlock, and an unkillable process :D
>
> Maybe the SIGSEGV is actually acceptable behaviour? I don't think
> there would be a valid reason to do an mmap(2)/write(2) combination
> like that.


FWIW, a similar test on a Linux box (Debian 8, 3.16 kernel) works w/o
errors but behaves rather oddly if the offsets don't exactly line up
(e.g. with offset 1, the first 8 chars written repeat the first read,
then every 32 chars one char is repeated).
As long as we fail somehow, whether EINVAL or SIGSEGV, at least it'll be
better than *that*. :)


Eric


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-06-10 Thread Jaromír Doleček
2018-05-12 21:24 GMT+02:00 Chuck Silvers :
> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> the user address space is that it can lead to deadlock if the page

Meanwhile I've tested this scenario - I wrote a test program that
mmap(2)s a file, then calls write(2) using the mmap'd memory as the
buffer, at the same file offset.

It actually caused the process to get a SIGSEGV with the new code,
whereas the old code made write(2) just return EINVAL. I think this
happens because fault resolution returns EINVAL when the page has
PG_BUSY. Funny thing is that on Mac OS X that code actually does end
up with a deadlock, and an unkillable process :D

Maybe the SIGSEGV is actually acceptable behaviour? I don't think
there would be a valid reason to do an mmap(2)/write(2) combination
like that.

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-13 Thread Chuck Silvers
On Sun, May 13, 2018 at 02:59:50PM +0200, Jaromír Doleček wrote:
> 2018-05-12 21:24 GMT+02:00 Chuck Silvers :
> > the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> > the user address space is that it can lead to deadlock if the page
> > we are accessing via the kernel mapping is the same page that we are
> > accessing via the user mapping.
> 
> The pages marked PG_BUSY for the UBC transfer are part of page cache.
> I sincerely hope it's impossible to have any user mapping for it.

the whole point of UBC is that the pages caching the data used by read/write
are the same pages used by mmap.  that's how we avoid the double-caching
and cache coherency problems that the pre-UBC scheme had.
so yes, these pages can have user mappings at the same time as
they are mapped in the kernel.

-Chuck


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-13 Thread Jaromír Doleček
2018-05-12 21:24 GMT+02:00 Chuck Silvers :
> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> the user address space is that it can lead to deadlock if the page
> we are accessing via the kernel mapping is the same page that we are
> accessing via the user mapping.

The pages marked PG_BUSY for the UBC transfer are part of page cache.
I sincerely hope it's impossible to have any user mapping for it.

Is there any other scenario for the deadlock in this context?

That said, the PG_BUSY is at least a scalability problem, since it
restricts the copy operation within the UBC window to only one at a
time. The old code holds PG_BUSY only during the fault - with the
exception of the UBC_FAULTBUSY branch, which still keeps the lock
during the whole operation.

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-12 Thread Chuck Silvers
hi,

sorry for being so late to the party on this thread,
I should have replied long ago.


On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:
> I've modified the uvm_bio.c code to use the direct map for amd64. The
> change seems stable; I've also run 'build.sh tools' to test out the
> system, the build succeeded and I didn't observe any problems with
> stability. The test was on a 6-core Intel CPU with HT on (i.e. "12"
> cores visible to the OS), NVMe disk, a ca. 2 GB file read done via:
> 
> dd if=foo of=/dev/null bs=64k msgfmt=human
> 
> With the human-readable formatting, the speeds are now reported in
> proper gibibytes.
> 
> 1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s
> 2. cached read: old: 2.2 GiB/s, new: 5.6 GiB/s
> 
> Seems the 1.9 GiB/s is the device limit.
> 
> The patch for this is at:
> http://www.netbsd.org/~jdolecek/uvm_bio_direct.diff

the problem with keeping the pages locked (ie. PG_BUSY) while accessing
the user address space is that it can lead to deadlock if the page
we are accessing via the kernel mapping is the same page that we are
accessing via the user mapping.  if the user space access faults
(eg. because the page has not been accessed via that mapping yet)
then the fault handler will try to lock the same page again.
even if it's not the same page, it is generally not safe for a thread
to wait to lock a page while that thread has locked another page already.

however if you don't lock the object page while accessing user space,
then the object page can change identity out from under you,
corrupting whatever the page is being used for in its new identity.

my plan for this in the longer term was to add a mechanism in UVM
for preventing a page from changing identity, which would be somewhat
similar to the existing loan mechanism except that the page could
still be written to without making a copy of it.  linux and freebsd
(and solaris) each have something like this, though they are all
a bit different in the details.  using this new mechanism, we could
keep the page being accessed via the direct map from changing identity
while still allowing normal page faults to happen on the user mapping.
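
roughly, such an interface might look like the sketch below; the names
(uvm_page_hold/uvm_page_unhold, PG_HELD) are invented for illustration,
this is not existing UVM API:

/*
 * hypothetical sketch: pin the page's identity (it cannot be freed or
 * given a new owner/offset) without write-protecting it the way
 * uvm_loan() does.
 */
void
uvm_page_hold(struct vm_page *pg)
{
	KASSERT(mutex_owned(pg->uobject->vmobjlock));
	pg->flags |= PG_HELD;	/* invented flag: identity changes must wait */
}

void
uvm_page_unhold(struct vm_page *pg)
{
	KASSERT(mutex_owned(pg->uobject->vmobjlock));
	pg->flags &= ~PG_HELD;
	wakeup(pg);		/* wake anyone waiting to recycle the page */
}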

there are probably less intrusive ways to attack the problem at hand,
but the above approach seemed like the best way to go in the long run.

I was hoping the emap thing would work for the shorter term, but if
that's not practical to make work then we can look for something else.


> During testing I noticed that the read-ahead code slows down the
> cached case quite a lot:
> no read-ahead:
> 1. non-cached read: old: 851 MiB/s, new: 1.1 GiB/s
> 2. cached read: old: 2.3 GiB/s, new: 6.8 GiB/s
> 
> I've implemented a tweak to the read-ahead code to skip the full
> read-ahead if the last page of the range is already in cache; this
> improved things a lot:
> smarter read-ahead:
> 1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s (no change compared
> to the old read-ahead)
> 2. cached read: old: 2.2 GiB/s, new: 6.6 GiB/s
> 
> The patch for this is at:
> http://www.netbsd.org/~jdolecek/uvm_readahead_skip.diff

the concept is good, but would using PGO_NOWAIT work about as well as
adding a new PGO_CACHEONLY flag?  if any pages are not in memory then
the pgo_get call would fail since that would require waiting for them
to be read.  the suggestion from others to just use uvm_pagelookup()
would be fine too.

it would also be nice to put this new logic for checking if an object page
is in memory into a separate function in case it's useful in other contexts
(and to isolate the implementation a bit).
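
as a sketch of that separate function (the name is made up; locking
follows the usual rule that uvm_pagelookup() needs the object lock):

/*
 * check whether all pages of uobj covering [off, off+len) are
 * already resident, without waiting or allocating.
 */
static bool
uvm_obj_pages_resident(struct uvm_object *uobj, voff_t off, voff_t len)
{
	voff_t endoff = round_page(off + len);
	bool resident = true;

	mutex_enter(uobj->vmobjlock);
	for (off = trunc_page(off); off < endoff; off += PAGE_SIZE) {
		if (uvm_pagelookup(uobj, off) == NULL) {
			resident = false;
			break;
		}
	}
	mutex_exit(uobj->vmobjlock);
	return resident;
}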


> For the direct map, I've implemented a new helper pmap function, which
> in turn calls a callback with the appropriate virtual address. This
> way the arch pmap is more free to choose an alternative
> implementation for this, e.g. one without actually having a direct
> map, like sparc64.
> 
> The code right now has some instrumentation for testing, but the
> general idea should be visible.
> 
> Opinions? Particularly, I'm not quite sure whether it is safe to avoid
> triggering a write fault in ubc_uiomove(), and whether the new
> heuristic for read-ahead is good enough for the general case.

is there a specific case about triggering a write fault in ubc_uiomove()
that you are concerned about?  the general rule is that a page being modified
must be marked dirty either as part of the modification (if it is being written
via a regular ref/mod-tracking pmap entry) or after the modifications are
complete (if it is being written via a non ref/mod-tracking mechanism such
as the direct map).
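
in code, the non-tracking case boils down to something like this sketch
(the PG_CLEAN idiom and locking are my reading of the tree, not taken
from the patch):

	/*
	 * after modifying pages through the direct map, no pmap
	 * ref/mod tracking happened, so record the modification
	 * by hand under the object lock.
	 */
	mutex_enter(uobj->vmobjlock);
	for (i = 0; i < npages; i++)
		pgs[i]->flags &= ~PG_CLEAN;	/* mark page dirty */
	mutex_exit(uobj->vmobjlock);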

the read-ahead logic in post-UBC netbsd has never been very sophisticated,
I'd be impressed if you managed to make it worse.  :-)

-Chuck


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-12 Thread Jaromír Doleček
2018-05-12 7:07 GMT+02:00 Michael van Elst :
> On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:
>> I've implemented a tweak to the read-ahead code to skip the full
>> read-ahead if the last page of the range is already in cache; this
>> improved things a lot:
>
> That looks a bit like a hack. Can't you just call uvm_pagelookup()
> to determine if the last page is cached? Or if that's too UVM internal,
> call uvm_findpages() with UFP_NOALLOC.

I thought it would be less of an abstraction violation when done in
getpages. But actually using uvm_pagelookup() seems to be fine from
that standpoint. Furthermore, it avoids some page-flag manipulation,
and makes the change much more localized, to only uvm_readahead.c. So
I'll go with that.

Thanks.

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-11 Thread Michael van Elst
On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:

> I've implemented a tweak to the read-ahead code to skip the full
> read-ahead if the last page of the range is already in cache; this
> improved things a lot:

That looks a bit like a hack. Can't you just call uvm_pagelookup()
to determine if the last page is cached? Or if that's too UVM internal,
call uvm_findpages() with UFP_NOALLOC.


Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-11 Thread Jason Thorpe


> On May 11, 2018, at 8:28 AM, Jaromír Doleček wrote:
> 
> I've modified the uvm_bio.c code to use the direct map for amd64. The
> change seems stable; I've also run 'build.sh tools' to test out the
> system, the build succeeded and I didn't observe any problems with
> stability. The test was on a 6-core Intel CPU with HT on (i.e. "12"
> cores visible to the OS), NVMe disk, a ca. 2 GB file read done via:

Very cool.  Alpha and MIPS could take advantage of this as well.

-- thorpej



Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-05-11 Thread Jaromír Doleček
I've modified the uvm_bio.c code to use the direct map for amd64. The
change seems stable; I've also run 'build.sh tools' to test out the
system, the build succeeded and I didn't observe any problems with
stability. The test was on a 6-core Intel CPU with HT on (i.e. "12"
cores visible to the OS), NVMe disk, a ca. 2 GB file read done via:

dd if=foo of=/dev/null bs=64k msgfmt=human

With the human-readable formatting, the speeds are now reported in
proper gibibytes.

1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s
2. cached read: old: 2.2 GiB/s, new: 5.6 GiB/s

Seems the 1.9 GiB/s is the device limit.

The patch for this is at:
http://www.netbsd.org/~jdolecek/uvm_bio_direct.diff

During testing I noticed that the read-ahead code slows down the
cached case quite a lot:
no read-ahead:
1. non-cached read: old: 851 MiB/s, new: 1.1 GiB/s
2. cached read: old: 2.3 GiB/s, new: 6.8 GiB/s

I've implemented a tweak to the read-ahead code to skip the full
read-ahead if the last page of the range is already in cache; this
improved things a lot:
smarter read-ahead:
1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s (no change compared
to the old read-ahead)
2. cached read: old: 2.2 GiB/s, new: 6.6 GiB/s

The patch for this is at:
http://www.netbsd.org/~jdolecek/uvm_readahead_skip.diff

For the direct map, I've implemented a new helper pmap function, which
in turn calls a callback with the appropriate virtual address. This
way the arch pmap is more free to choose an alternative
implementation for this, e.g. one without actually having a direct
map, like sparc64.

The code right now has some instrumentation for testing, but the
general idea should be visible.
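
The core of the helper is roughly the sketch below (illustrative only;
the names are my guesses at the shape of it, the real thing is in the
patch):

/*
 * Run a callback on the directly-mapped virtual address of a
 * physical page, so uvm_bio.c never has to map it into kernel VA.
 */
int
pmap_direct_process(paddr_t pa, voff_t pgoff, size_t len,
    int (*process)(void *, size_t, void *), void *arg)
{
	/* amd64: PMAP_DIRECT_MAP translates phys -> direct-map virt */
	vaddr_t va = PMAP_DIRECT_MAP(pa);

	return (*process)((void *)(va + pgoff), len, arg);
}

An arch without a real direct map (say sparc64) could implement the
same hook with a temporary per-CPU mapping instead.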

Opinions? Particularly, I'm not quite sure whether it is safe to avoid
triggering a write fault in ubc_uiomove(), and whether the new
heuristic for read-ahead is good enough for the general case.

Jaromir

2018-04-19 22:39 GMT+02:00 Jaromír Doleček :
> I've finally got my test rig set up, so I was able to check the
> performance difference when using emap.
>
> Good news: there is a significant speedup on an NVMe device, without
> observing any bad side effects so far.
>
> I've set up a test file of 2G, smaller than RAM so it all fits in
> cache; the test was:
> dd if=foo of=/dev/null bs=64k
>
> First read (not cached): old: 1.7 GB/s, new: 2.1 GB/s
> Cached read: old: 2.2 GB/s, new: 3.1 GB/s
>
> Reads from the raw device were the same in both cases, around 1.7 GB/s,
> which is a bit bizarre.
>
> If we want to modify the uvm_bio.c code to optionally use the direct map,
> there is the problem with the fault technique it uses for read I/O.
> The code doesn't actually enter the pages into the KVA window, it lets
> uiomove() trigger the faults. Is there some advantage to this, why
> is this better than just mapping those 1-2 pages before calling
> uiomove()?
>
> Jaromir
>
>
> 2018-04-02 21:28 GMT+02:00 Jaromír Doleček :
>> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
>>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>>>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>>>> question is whether we can't just use the direct map for this on amd64
>>>> and similar platforms?
>>>
>>> Right, we could/should use emap. I hadn't realized emap is actually already
>>> implemented. It's currently used by the pipe code for the loan/"direct" write.
>>>
>>> I don't know anything about emap though. Are there any known issues,
>>> do you reckon it's ready to be used for general I/O handling?
>>
>> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
>> mapping:
>>
>> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff
>>
>> Seems to boot, no idea what else it will break.
>>
>> Looking at the state of usage though, emap is only used for a disabled code
>> path in sys_pipe and nowhere else. That code had several on-and-off passes
>> for being enabled in 2009, and no further use since then. Doesn't give too
>> much confidence.
>>
>> The only port actually having optimization for emap is x86. Since amd64
>> is also the only one supporting the direct map, we are really at liberty to
>> pick either one. I'd lean towards the direct map, since that doesn't require
>> adding/removing any mapping in pmap_kernel() at all. From looking at the
>> code, I gather the direct map is quite easy to implement for other archs
>> like sparc64. I'd say significantly easier than adding the necessary emap
>> hooks into MD pmaps.
>>
>> Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-19 Thread Jaromír Doleček
I've finally got my test rig set up, so I was able to check the
performance difference when using emap.

Good news: there is a significant speedup on an NVMe device, without
observing any bad side effects so far.

I've set up a test file of 2G, smaller than RAM so it all fits in
cache; the test was:
dd if=foo of=/dev/null bs=64k

First read (not cached): old: 1.7 GB/s, new: 2.1 GB/s
Cached read: old: 2.2 GB/s, new: 3.1 GB/s

Reads from the raw device were the same in both cases, around 1.7 GB/s,
which is a bit bizarre.

If we want to modify the uvm_bio.c code to optionally use the direct map,
there is the problem with the fault technique it uses for read I/O.
The code doesn't actually enter the pages into the KVA window, it lets
uiomove() trigger the faults. Is there some advantage to this, why
is this better than just mapping those 1-2 pages before calling
uiomove()?

Jaromir

2018-04-02 21:28 GMT+02:00 Jaromír Doleček :
> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>>> question is whether we can't just use the direct map for this on amd64
>>> and similar platforms?
>>
>> Right, we could/should use emap. I hadn't realized emap is actually already
>> implemented. It's currently used by the pipe code for the loan/"direct" write.
>>
>> I don't know anything about emap though. Are there any known issues,
>> do you reckon it's ready to be used for general I/O handling?
>
> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
> mapping:
>
> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff
>
> Seems to boot, no idea what else it will break.
>
> Looking at the state of usage though, emap is only used for a disabled code
> path in sys_pipe and nowhere else. That code had several on-and-off passes
> for being enabled in 2009, and no further use since then. Doesn't give too
> much confidence.
>
> The only port actually having optimization for emap is x86. Since amd64
> is also the only one supporting the direct map, we are really at liberty to
> pick either one. I'd lean towards the direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking at the
> code, I gather the direct map is quite easy to implement for other archs
> like sparc64. I'd say significantly easier than adding the necessary emap
> hooks into MD pmaps.
>
> Jaromir


re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-03 Thread matthew green
> 4GB of KVA; and in addition pmap_kernel will consume some KVA too. i386 for
> example has only 4GB of ram and 4GB of KVA, so we'll just never add a direct
> map there.

note that it changes your direct map point, but, i386 GENERIC_PAE
works fine for at least 16GB ram.  it should work up to 64GB.

actually, it only strengthens your direct map point, since it's
even harder to fit 64GB phys into 4GB virt :-)

> As opposed to that, emaps can be implemented everywhere with no constraint on
> the arch. I think they are better than the direct map for uvm_bio.

IIRC, the main reason we stopped using emap is that there were
performance issues.  rmind?


.mrg.


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-03 Thread Maxime Villard

On 02/04/2018 at 21:28, Jaromír Doleček wrote:

> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
>
>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>>
>>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>>> question is whether we can't just use the direct map for this on amd64
>>> and similar platforms?
>>
>> Right, we could/should use emap. I hadn't realized emap is actually already
>> implemented. It's currently used by the pipe code for the loan/"direct" write.
>>
>> I don't know anything about emap though. Are there any known issues,
>> do you reckon it's ready to be used for general I/O handling?
>
> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
> mapping:
>
> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff


-   pmap_emap_enter(va, pa, VM_PROT_READ);
+   pmap_emap_enter(va, pa, VM_PROT_READ | VM_PROT_WRITE);

Mmh no, sys_pipe wanted it read-only, we shouldn't make it writable by default.
Adding a prot argument to uvm_emap_enter would be better.
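
Something along these lines, as a sketch (the existing signature and
body are assumed from memory of the tree, so treat the details loosely):

/*
 * Let the caller choose the protection instead of hardcoding it,
 * and just forward it to pmap_emap_enter().
 */
void
uvm_emap_enter(vaddr_t va, struct vm_page **pgs, u_int npages,
    vm_prot_t prot)
{
	u_int i;

	for (i = 0; i < npages; i++)
		pmap_emap_enter(va + ptoa(i), VM_PAGE_TO_PHYS(pgs[i]), prot);
	pmap_update(pmap_kernel());
}

The sys_pipe caller would then pass VM_PROT_READ and keep its current
behaviour, while uvm_bio.c could ask for VM_PROT_READ | VM_PROT_WRITE.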


> Looking at the state of usage though, emap is only used for a disabled code
> path in sys_pipe and nowhere else. That code had several on-and-off passes
> for being enabled in 2009, and no further use since then. Doesn't give too
> much confidence.
>
> The only port actually having optimization for emap is x86. Since amd64
> is also the only one supporting the direct map, we are really at liberty to
> pick either one. I'd lean towards the direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking at the
> code, I gather the direct map is quite easy to implement for other archs
> like sparc64. I'd say significantly easier than adding the necessary emap
> hooks into MD pmaps.


There is a good number of architectures where implementing a direct map is
not possible, because of KVA consumption. A direct map consumes KVA equal
to the whole physical space, and in addition pmap_kernel will consume some
KVA too. i386 for example has only 4GB of ram and 4GB of KVA, so we'll
just never add a direct map there.

Direct maps are good when the architecture has much, much more KVA than it
has physical space.

I saw some low-KVA architectures have a "partial direct map", where only a
(tiny) area of the physical space is direct-mapped. There, we would have to
adapt uvm_bio to use pmap_kernel instead, which seems ugly.

As opposed to that, emaps can be implemented everywhere with no constraint on
the arch. I think they are better than the direct map for uvm_bio.


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-02 Thread Martin Husemann
On Mon, Apr 02, 2018 at 09:28:43PM +0200, Jaromír Doleček wrote:
> The only port actually having optimization for emap is x86. Since amd64
> is also the only one supporting direct map, we are really at liberty to pick
> either one.

I am not sure what you mean here - AFAIK alpha and mips64 also have a direct
map for all available ram set up (and some arm too, plus I guess the new
aarch64 code as well).

I have not seen arguments why to do it on sparc64 yet, as we can easily
access all memory w/o any mapping, and a direct map has security implications.

> I'd lean towards the direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking at the
> code, I gather the direct map is quite easy to implement for other archs
> like sparc64. I'd say significantly easier than adding the necessary emap
> hooks into MD pmaps.

What would those hooks be? Are they documented?

Martin


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map

2018-04-02 Thread Jaromír Doleček
2018-03-31 13:42 GMT+02:00 Jaromír Doleček :
> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>> question is whether we can't just use the direct map for this on amd64
>> and similar platforms?
>
> Right, we could/should use emap. I hadn't realized emap is actually already
> implemented. It's currently used by the pipe code for the loan/"direct" write.
>
> I don't know anything about emap though. Are there any known issues,
> do you reckon it's ready to be used for general I/O handling?

Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
mapping:

http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff

Seems to boot, no idea what else it will break.

Looking at the state of usage though, emap is only used for a disabled code
path in sys_pipe and nowhere else. That code had several on-and-off passes
for being enabled in 2009, and no further use since then. Doesn't give too
much confidence.

The only port actually having optimization for emap is x86. Since amd64
is also the only one supporting the direct map, we are really at liberty to
pick either one. I'd lean towards the direct map, since that doesn't require
adding/removing any mapping in pmap_kernel() at all. From looking at the
code, I gather the direct map is quite easy to implement for other archs
like sparc64. I'd say significantly easier than adding the necessary emap
hooks into MD pmaps.

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-31 Thread Jaromír Doleček
2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger :
> Yeah, that's what ephemeral mappings were supposed to be for. The other
> question is whether we can't just use the direct map for this on amd64
> and similar platforms?

Right, we could/should use emap. I hadn't realized emap is actually already
implemented. It's currently used by the pipe code for the loan/"direct" write.

I don't know anything about emap though. Are there any known issues,
do you reckon it's ready to be used for general I/O handling?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-27 Thread Jaromír Doleček
2018-03-27 0:51 GMT+02:00 Michael van Elst :
> UBC_WINSHIFT=17 (i.e. 2*MAXPHYS) works fine, but sure, for KVA limited
> platforms you have to be careful. On the other hand, it's a temporary
> mapping that only exists while doing I/O. How many concurrent I/O
> operations would you expect on a 32bit system?

I think fragmentation might be a bigger issue. It could happen that there
just won't be a contiguous 128k KVA block any more, and then it would hang
forever, like kern/47030. But maybe this is not a concern, e.g. i386 KVA
uses the top 1G.

> I'd also like to get another tuning to 8.0. Bumping RA_WINWIZE_MAX
> from 8*MAXPHYS to 16*MAXPHYS helps iSCSI and maybe some nvme devices
> too.

Yes please :) For some reason I thought we had this already. I remember
this being mentioned in the discussion around the time nvme(4) got
integrated into the tree.

Regarding your issue, I'd also like to turn attention back to acpi idle.
The way I read dev/acpi/acpi_cpu_cstate.c, we try to go straight to C3
without any backoff. It may be good to make this more granular, maybe
only going into the deeper states once enough idle time passes, for
example longer than their wakeup latency.

For example, wakeup from C1 might have a 1 usec latency, from C2 150 usec,
from C3 500 usec.

Any opinion on that, any ACPI experts around?

Jaromir


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-26 Thread Jaromír Doleček
2018-03-25 13:48 GMT+02:00 Michael van Elst :
> For some reason it's not per block. There is a mechanism that moves
> an 8kbyte window, independent of the block size. You can easily change
> the window size (I'm currently experimenting with 128kbyte) by building
> a kernel with e.g. options UBC_WINSHIFT=17.
>
> Independent of how the scaling problem will be handled, I suggest to
> increase that window to at least MAXPHYS.

I think the file mapping window doesn't really need to be tied to MAXPHYS.
But values over MAXPHYS will probably currently [*] not give further
significant speedup for filesystem I/O, unless data is being read from cache.

One thing to note is that this size is used for all file mapping windows, so
having too big a value could cause KVA shortages on 32-bit platforms. I think
the original UBC_WINSHIFT tried to balance the most common file size against
overall KVA use in the kernel pmap, similar to how e.g. MAXSLP works. Maybe
bump it to a higher value on 64-bit archs only?

If the bump works out, I think we should pull it up to 8.0 too, as this is
probably the least likely of the proposed options to break anything.

Did the experiment work?

Jaromir

[*] I hope to get tls-maxphys integrated for 9.0; by then maybe we need
to revisit this. It would also be nice to revisit the need to map the
memory into the kernel pmap in order to do the I/O.


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Maxime Villard

On 25/03/2018 at 17:03, Michael van Elst wrote:

> On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
>
>> The main cpu (on which your program executes) sends IPIs to remote cpus,
>> these cpus were idle and entered the halted state, and they take some time
>> to wake up and process the IPI. And that's time the main cpu spends waiting.
>>
>> As you said yourself it's not trivial to fix this, because the wakeup path
>> can be whatever interrupt entry point, and if we were to add some code there
>> it would slow down interrupt entry in general.
>
> The slowdown would be minimal; the cost comes from refilling the TLB and
> that would happen independently of when the TLB was flushed.


I wouldn't be so sure; a new branch in the interrupt entry points (with
interrupts disabled) does cost something, and it's about the hottest path in
the kernel.

>> We could perhaps have something to instruct the main cpu not to wait for
>> cpus that are idle, only wait for those that aren't.
>
> Is there a guarantee that the IPI handler on the other CPUs has already
> started?

I don't think so, but I would have to re-check the LAPIC spec.

> Could it be deferred by some other high-level interrupt or
> by a section with disabled interrupts?

I don't think so (and I'm also not totally sure what you mean).

In all cases you need to handle the wakeup in the interrupt entry points.

An efficient way would be, before going into hlt, to flush the TLB and to
set an "idle" flag somewhere in curcpu. Then, we change each interrupt entry
point to set this flag to zero automatically regardless of its previous
value. Finally, when a cpu performs a TLB shootdown, it does not send IPIs
to the cpus flagged as idle.
Maxime


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sun, Mar 25, 2018 at 07:24:25PM +0200, Maxime Villard wrote:
> On 25/03/2018 at 17:27, Joerg Sonnenberger wrote:
> > The other question is whether we can't just use the direct map for this
> > on amd64 and similar platforms?
> 
> no, because nothing guarantees the physical pages are contiguous.

At least for ubc_zerorange, they certainly don't have to be. For
ubc_uiomove, it's a bit more work to split the IO vector, but certainly
feasible as well. Alternatively, propagate it down to uiomove to allow
using *two* IO vectors.

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Maxime Villard

On 25/03/2018 at 17:27, Joerg Sonnenberger wrote:

> On Sun, Mar 25, 2018 at 03:19:28PM -0000, Michael van Elst wrote:
>
>> jo...@bec.de (Joerg Sonnenberger) writes:
>>
>>> What about having a passive unmap as fourth option? I.e. when unmapping
>>> in the transfer map, just add them to a FIFO. Process the FIFO on each
>>> CPU when there is time or the transfer map runs out of space. Context
>>> switching for example would be a natural place to do any such
>>> invalidation.
>>
>> The mapping is so temporary that it is almost only used on a single CPU.
>> Basically it's:
>>
>> win = ubc_alloc();
>> if (!uiomove(win, ...))
>> 	memset(win, 0, ...);
>> ubc_release(win, ...);
>>
>> For this to be even visible on multiple CPUs, the thread would need
>> to migrate to another CPU. Locking down the LWP on a single CPU
>> might be the cheapest solution.


> Yeah, that's what ephemeral mappings were supposed to be for.

emaps are read-only.

> The other question is whether we can't just use the direct map for this
> on amd64 and similar platforms?

no, because nothing guarantees the physical pages are contiguous.


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sun, Mar 25, 2018 at 03:19:28PM -0000, Michael van Elst wrote:
> jo...@bec.de (Joerg Sonnenberger) writes:
> 
> >What about having a passive unmap as fourth option? I.e. when unmapping
> >in the transfer map, just add them to a FIFO. Process the FIFO on each
> >CPU when there is time or the transfer map runs out of space. Context
> >switching for example would be a natural place to do any such
> >invalidation.
> 
> The mapping is so temporary that it is almost only used on a single CPU.
> Basically it's:
> 
> win = ubc_alloc();
> if (!uiomove(win, ...))
>   memset(win, 0, ...);
> ubc_release(win, ...);
> 
> For this to be even visible on multiple CPUs, the thread would need
> to migrate to another CPU. Locking down the LWP on a single CPU
> might be the cheapest solution.

Yeah, that's what ephemeral mappings were supposed to be for. The other
question is whether we can't just use the direct map for this on amd64
and similar platforms?

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Michael van Elst
jo...@bec.de (Joerg Sonnenberger) writes:

>What about having a passive unmap as fourth option? I.e. when unmapping
>in the transfer map, just add them to a FIFO. Process the FIFO on each
>CPU when there is time or the transfer map runs out of space. Context
>switching for example would be a natural place to do any such
>invalidation.

The mapping is so temporary that it is almost only used on a single CPU.
Basically it's:

win = ubc_alloc();
if (!uiomove(win, ...))
	memset(win, 0, ...);
ubc_release(win, ...);

For this to be even visible on multiple CPUs, the thread would need
to migrate to another CPU. Locking down the LWP on a single CPU
might be the cheapest solution.
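
A sketch of that idea using curlwp_bind(9), as used with psref(9);
whether the bind is actually cheaper than the shootdowns is the open
question, and ubc_alloc's real arguments are elided here just like
above:

int bound;

bound = curlwp_bind();		/* pin the LWP to the current CPU */
win = ubc_alloc(...);
if (!uiomove(win, ...))
	memset(win, 0, ...);
ubc_release(win, ...);
curlwp_bindx(bound);		/* allow migration again */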

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Michael van Elst
On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
> 
> The main cpu (on which your program executes) sends IPIs to remote cpus, these
> cpus were idle and entered the halted state, and they take some time to wake
> up and process the IPI. And that's time the main cpu spends waiting.
> 
> As you said yourself it's not trivial to fix this, because the wakeup path
> can be whatever interrupt entry point, and if we were to add some code there
> it would slow down interrupt entry in general.

The slowdown would be minimal; the cost comes from refilling the TLB and
that would happen independently of when the TLB was flushed.


> We could perhaps have something to instruct the main cpu not to wait for cpus
> that are idle, only wait for those that aren't.

Is there a guarantee that the IPI handler on the other CPUs has already
started? Could it be deferred by some other high-level interrupt or
by a section with disabled interrupts?



Greetings,
-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Joerg Sonnenberger
On Sat, Mar 24, 2018 at 08:42:34PM +0100, Jaromír Doleček wrote:
> The problem there is that FFS triggers a pathological case. I/O transfer
> maps and then unmaps each block into the kernel pmap, so that the data can
> be copied into user memory. This triggers TLB shootdown IPIs for each FS
> block, sent to all CPUs which happen to be idle, or otherwise running on
> the kernel pmap. On systems with many idle CPUs these TLB shootdowns cause
> a lot of synchronization overhead.

What about having a passive unmap as fourth option? I.e. when unmapping
in the transfer map, just add them to a FIFO. Process the FIFO on each
CPU when there is time or the transfer map runs out of space. Context
switching for example would be a natural place to do any such
invalidation.
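
A sketch of the idea (all names invented; pmap_update_pg() is the local
invlpg wrapper on x86):

struct unmap_fifo {
	vaddr_t		uf_va[UNMAP_FIFO_SIZE];
	unsigned	uf_count;
};

/* Drain the queue with local-only invalidations, no IPIs. */
static void
unmap_fifo_flush(struct unmap_fifo *uf)
{
	for (unsigned i = 0; i < uf->uf_count; i++)
		pmap_update_pg(uf->uf_va[i]);
	uf->uf_count = 0;
}

/* Called instead of an immediate shootdown when a window is unmapped. */
static void
passive_unmap(struct unmap_fifo *uf, vaddr_t va)
{
	if (uf->uf_count == UNMAP_FIFO_SIZE)
		unmap_fifo_flush(uf);	/* ran out of space: flush now */
	uf->uf_va[uf->uf_count++] = va;
}

Context switch (and transfer-map exhaustion) would call
unmap_fifo_flush() on each CPU's queue.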

Joerg


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Maxime Villard

On 25/03/2018 at 13:48, Michael van Elst wrote:

> jaromir.dole...@gmail.com (Jaromír Doleček) writes:
>
>> The problem there is that FFS triggers a pathological case. I/O transfer
>> maps and then unmaps each block into the kernel pmap, so that the data
>> can be copied into user memory.
>
> For some reason it's not per block. There is a mechanism that moves
> an 8kbyte window, independent of the block size. You can easily change
> the window size (I'm currently experimenting with 128kbyte) by building
> a kernel with e.g. options UBC_WINSHIFT=17.
>
> Independent of how the scaling problem will be handled, I suggest
> increasing that window to at least MAXPHYS.


Yes.

>> 3. change UVM code to not do this mapping via the kernel map, so
>> pmap_update() and the IPIs are not triggered in the first place
>
> What I currently don't understand is that we only see one TLB shootdown
> per window and that happens when it is _unmapped_. Mapping the pages
> (with the exception of overwriting) is done by the regular fault handler
> and apparently this does not flush the TLB.


IIRC that's normal. Each time we map a page, we know it wasn't mapped
before, so there is no need to flush the TLB, since we know the page isn't
cached. However, when unmapping, we do need to flush the TLB. It's an
optimization; if we map a page that we end up not touching, it won't have
the "used" flag, and we can unmap it without flushing the TLB. So it saves
us one TLB shootdown.
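
In pseudocode, the rule looks roughly like this (illustrative only, not
the actual pmap code; the PTE accessed-bit name varies between trees):

/* enter: the VA was unmapped, so no TLB entry can exist - no flush */
pmap_enter(pmap_kernel(), va, pa, prot, flags);

/* remove: flush only if the hardware marked the mapping referenced */
opte = *ptep;
*ptep = 0;
if ((opte & PTE_ACCESSED) != 0)	/* illustrative name */
	/* shoot down the TLB entry on CPUs that may have cached it */
	pmap_tlb_shootdown(pmap_kernel(), va, opte, why);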

By the way I saw your answer to the PR (by looking at mail-index.netbsd.org,
because you didn't CC me in the mail).

I think you are right - I was more thinking about a problem in pmap, but the
issue may just be that there is a latency in the cpu wakeup, as you said.

The main cpu (on which your program executes) sends IPIs to remote cpus, these
cpus were idle and entered the halted state, and they take some time to wake
up and process the IPI. And that's time the main cpu spends waiting.

As you said yourself it's not trivial to fix this, because the wakeup path
can be whatever interrupt entry point, and if we were to add some code there
it would slow down interrupt entry in general.

We could perhaps have something to instruct the main cpu not to wait for cpus
that are idle, only wait for those that aren't.

Maxime


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Manuel Bouyer
On Sun, Mar 25, 2018 at 11:48:02AM -, Michael van Elst wrote:
> [...]
>> 3. change UVM code to not do this mapping via the kernel map, so
>> pmap_update() and the IPIs are not triggered in the first place
> 
> What I currently don't understand is that we only see one TLB shootdown
> per window and that happens when it is _unmapped_. Mapping the pages
> (with the exception of overwriting) is done by the regular fault handler
> and apparently this does not flush the TLB. This makes me think that it might
> not be required at all and is just an unwanted side-effect of pmap_update.

I think it's because if we get a fault, then this VA is not mapped
in the page table, and so it can't be cached in any TLB. So there is
nothing to flush in this case.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Michael van Elst
jaromir.dole...@gmail.com (Jaromír Doleček) writes:

> The problem there is that FFS triggers a pathological case. I/O transfer
> maps and then unmaps each block into the kernel pmap, so that the data
> can be copied into user memory.

For some reason it's not per block. There is a mechanism that moves
an 8kbyte window, independent of the block size. You can easily change
the window size (I'm currently experimenting with 128kbyte) by building
a kernel with e.g. options UBC_WINSHIFT=17.

Independent of how the scaling problem will be handled, I suggest
increasing that window to at least MAXPHYS.


> 3. change UVM code to not do this mapping via the kernel map, so
> pmap_update() and the IPIs are not triggered in the first place

What I currently don't understand is that we only see one TLB shootdown
per window and that happens when it is _unmapped_. Mapping the pages
(with the exception of overwriting) is done by the regular fault handler
and apparently this does not flush the TLB. This makes me think that it might
not be required at all and is just an unwanted side-effect of pmap_update.

-- 
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."


Fixing excessive shootdowns during FFS read (kern/53124) on x86

2018-03-25 Thread Jaromír Doleček
Hello,

I'd like to gather some feedback on how to best tackle kern/53124.

The problem there is that FFS triggers a pathological case. I/O transfer
maps and then unmaps each block into the kernel pmap, so that the data can
be copied into user memory. This triggers TLB shootdown IPIs for each FS
block, sent to all CPUs which happen to be idle, or otherwise running on
the kernel pmap. On systems with many idle CPUs these TLB shootdowns cause
a lot of synchronization overhead.

I see three possible ways how to fix this particular case:
1. make it possible to lazily invalidate the TLB also for the kernel pmap,
i.e. make pmap_deactivate()/pmap_reactivate() work with the kernel pmap -
this avoids sending the TLB shootdown IPIs to idle CPUs
2. make the idle lwp run in its own pmap and address space - a variant of
#1, avoids changing the invariant about the kernel map being always loaded
3. change UVM code to not do this mapping via the kernel map, so
pmap_update() and the IPIs are not triggered in the first place

I reckon #2 would add a lot of overhead to the idle routine; it would
require at least an address space switch. This address space switch might
be fairly cheap on architectures with address space IDs (avoiding TLB
flushes), but still much more than what the idle entry/exit does now.

Variants of the problems with #3 have been discussed on and off over the
years as I recall, but there is no resolution yet, and I'm not aware of
anyone actively working on this. I understand this would be Hard, with
nobody currently having the time and courage.

This leaves #1 as a short-term practical solution. Does anyone foresee any
particular problems with this; does it have a chance to fly? Any other ideas?

Jaromir