Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
On 6/10/2018 3:02 PM, Jaromír Doleček wrote:
> 2018-05-12 21:24 GMT+02:00 Chuck Silvers:
>> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
>> the user address space is that it can lead to deadlock if the page
>
> Meanwhile I've tested this scenario - wrote a test program to do mmap(2)
> for a file, then call write(2) using the memory from mmap as the buffer,
> for the same file offset. It actually caused the process to get SIGSEGV
> with the new code, where the old code made write(2) just return EINVAL.
> I think this happens because fault resolution returns EINVAL when the
> page has PG_BUSY.
>
> Funny thing is that on Mac OS X that code actually does end up in a
> deadlock, with an unkillable process :D
>
> Maybe the SIGSEGV is actually acceptable behaviour? I don't think there
> would be a valid reason to do an mmap(2)/write(2) combination like that.

FWIW, a similar test on a Linux box (Debian 8, 3.16 kernel) works w/o errors but behaves rather oddly if the offsets don't exactly line up. (e.g. w/ offset 1, the first 8 chars written repeat the first read, then every 32 chars one char is repeated)

As long as we fail somehow, whether EINVAL or SIGSEGV, at least it'll be better than *that*. :)

Eric
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-05-12 21:24 GMT+02:00 Chuck Silvers:
> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> the user address space is that it can lead to deadlock if the page

Meanwhile I've tested this scenario - wrote a test program to do mmap(2) for a file, then call write(2) using the memory from mmap as the buffer, for the same file offset. It actually caused the process to get SIGSEGV with the new code, where the old code made write(2) just return EINVAL. I think this happens because fault resolution returns EINVAL when the page has PG_BUSY.

Funny thing is that on Mac OS X that code actually does end up in a deadlock, with an unkillable process :D

Maybe the SIGSEGV is actually acceptable behaviour? I don't think there would be a valid reason to do an mmap(2)/write(2) combination like that.

Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
On Sun, May 13, 2018 at 02:59:50PM +0200, Jaromír Doleček wrote:
> 2018-05-12 21:24 GMT+02:00 Chuck Silvers:
> > the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> > the user address space is that it can lead to deadlock if the page
> > we are accessing via the kernel mapping is the same page that we are
> > accessing via the user mapping.
>
> The pages marked PG_BUSY for the UBC transfer are part of the page cache.
> I sincerely hope it's impossible to have any user mapping for it.

the whole point of UBC is that the pages caching the data used by read/write are the same pages used by mmap. that's how we avoid the double-caching and cache coherency problems that the pre-UBC scheme had. so yes, these pages can have user mappings at the same time as they are mapped in the kernel.

-Chuck
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-05-12 21:24 GMT+02:00 Chuck Silvers:
> the problem with keeping the pages locked (ie. PG_BUSY) while accessing
> the user address space is that it can lead to deadlock if the page
> we are accessing via the kernel mapping is the same page that we are
> accessing via the user mapping.

The pages marked PG_BUSY for the UBC transfer are part of the page cache. I sincerely hope it's impossible to have any user mapping for it. Is there any other scenario for the deadlock in this context?

That said, the PG_BUSY is at least a scalability problem, since it restricts the copy operation within the UBC window to only one at a time. The old code holds PG_BUSY only during the fault - with the exception of the UBC_FAULTBUSY branch, which still keeps the lock during the whole operation.

Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
hi,

sorry for being so late to the party on this thread, I should have replied long ago.

On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:
> I've modified the uvm_bio.c code to use the direct map for amd64. The change
> seems to be stable, I've also run 'build.sh tools' to test out the
> system, the build succeeded and I didn't observe any problems with
> stability. Test was on a 6-core Intel CPU with HT on (i.e. "12" cores
> visible to the OS), NVMe disk, ca. 2GB file read done via:
>
> dd if=foo of=/dev/null bs=64k msgfmt=human
>
> Because of the proper formatting, the speed is now in proper gibibytes.
>
> 1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s
> 2. cached read: old: 2.2 GiB/s, new: 5.6 GiB/s
>
> Seems the 1.9 GiB/s is the device limit.
>
> The patch for this is at:
> http://www.netbsd.org/~jdolecek/uvm_bio_direct.diff

the problem with keeping the pages locked (ie. PG_BUSY) while accessing the user address space is that it can lead to deadlock if the page we are accessing via the kernel mapping is the same page that we are accessing via the user mapping. if the user space access faults (eg. because the page has not been accessed via that mapping yet) then the fault handler will try to lock the same page again. even if it's not the same page, it is generally not safe for a thread to wait to lock a page while that thread has locked another page already. however if you don't lock the object page while accessing user space, then the object page can change identity out from under you, corrupting whatever the page is being used for in its new identity.

my plan for this in the longer term was to add a mechanism in UVM for preventing a page from changing identity, which would be somewhat similar to the existing loan mechanism except that the page could still be written to without making a copy of it. linux and freebsd (and solaris) each have something like this, though they are all a bit different in the details.
using this new mechanism, we could keep the page being accessed via the direct map from changing identity while still allowing normal page faults to happen on the user mapping. there are probably less intrusive ways to attack the problem at hand, but the above approach seemed like the best way to go in the long run. I was hoping the emap thing would work for the shorter term, but if that's not practical to make work then we can look for something else.

> During testing I noticed that the read-ahead code slows down the
> cached case quite a lot:
>
> no read-ahead:
> 1. non-cached read: old: 851 MiB/s, new: 1.1 GiB/s
> 2. cached read: old: 2.3 GiB/s, new: 6.8 GiB/s
>
> I've implemented a tweak to the read-ahead code to skip the full
> read-ahead if the last page of the range is already in cache, this
> improved things a lot:
>
> smarter read-ahead:
> 1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s (no change compared
> to old read-ahead)
> 2. cached read: old: 2.2 GiB/s, new: 6.6 GiB/s
>
> Patch for this is at:
> http://www.netbsd.org/~jdolecek/uvm_readahead_skip.diff

the concept is good, but would using PGO_NOWAIT work about as well as adding a new PGO_CACHEONLY flag? if any pages are not in memory then the pgo_get call would fail since that would require waiting for them to be read. the suggestion from others to just use uvm_pagelookup() would be fine too. it would also be nice to put this new logic for checking if an object page is in memory into a separate function in case it's useful in other contexts (and to isolate the implementation a bit).

> For the direct map, I've implemented a new helper pmap function, which
> in turn calls a callback with the appropriate virtual address. This
> way the arch pmap is more free to choose an alternative
> implementation for this, e.g. without actually having a direct map,
> like sparc64.
>
> The code right now has some instrumentation for testing, but the
> general idea should be visible.
>
> Opinions?
> Particularly, I'm not quite sure if it is safe to avoid
> triggering a write fault in ubc_uiomove(), and whether the new
> heuristics for read-ahead are good enough for the general case.

is there a specific case about triggering a write fault in ubc_uiomove() that you are concerned about? the general rule is that a page being modified must be marked dirty either as part of the modification (if it is being written via a regular ref/mod-tracking pmap entry) or after the modifications are complete (if it is being written via a non ref/mod-tracking mechanism such as the direct map).

the read-ahead logic in post-UBC netbsd has never been very sophisticated, I'd be impressed if you managed to make it worse. :-)

-Chuck
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-05-12 7:07 GMT+02:00 Michael van Elst:
> On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:
>> I've implemented a tweak to read-ahead code to skip the full
>> read-ahead if last page of the range is already in cache, this
>> improved things a lot:
>
> That looks a bit like a hack. Can't you just call uvm_pagelookup()
> to determine if the last page is cached? Or if that's too UVM internal,
> call uvm_findpages() with UFP_NOALLOC.

I thought it would be less of an abstraction violation when done in getpages. But actually using uvm_pagelookup() seems to be fine from that standpoint. Furthermore, it avoids some page flag manipulation, and makes the change much more localized, to only uvm_readahead.c. So I'll go with that. Thanks.

Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
On Fri, May 11, 2018 at 05:28:18PM +0200, Jaromír Doleček wrote:
> I've implemented a tweak to read-ahead code to skip the full
> read-ahead if last page of the range is already in cache, this
> improved things a lot:

That looks a bit like a hack. Can't you just call uvm_pagelookup() to determine if the last page is cached? Or if that's too UVM internal, call uvm_findpages() with UFP_NOALLOC.

Greetings,
--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
> On May 11, 2018, at 8:28 AM, Jaromír Doleček wrote:
>
> I've modified the uvm_bio.c code to use the direct map for amd64. The change
> seems to be stable, I've also run 'build.sh tools' to test out the
> system, the build succeeded and I didn't observe any problems with
> stability. Test was on a 6-core Intel CPU with HT on (i.e. "12" cores
> visible to the OS), NVMe disk, ca. 2GB file read done via:

Very cool. Alpha and MIPS could take advantage of this as well.

-- thorpej
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
I've modified the uvm_bio.c code to use the direct map for amd64. The change seems to be stable, I've also run 'build.sh tools' to test out the system, the build succeeded and I didn't observe any problems with stability. Test was on a 6-core Intel CPU with HT on (i.e. "12" cores visible to the OS), NVMe disk, ca. 2GB file read done via:

dd if=foo of=/dev/null bs=64k msgfmt=human

Because of the proper formatting, the speed is now in proper gibibytes.

1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s
2. cached read: old: 2.2 GiB/s, new: 5.6 GiB/s

Seems the 1.9 GiB/s is the device limit.

The patch for this is at:
http://www.netbsd.org/~jdolecek/uvm_bio_direct.diff

During testing I noticed that the read-ahead code slows down the cached case quite a lot:

no read-ahead:
1. non-cached read: old: 851 MiB/s, new: 1.1 GiB/s
2. cached read: old: 2.3 GiB/s, new: 6.8 GiB/s

I've implemented a tweak to the read-ahead code to skip the full read-ahead if the last page of the range is already in cache, this improved things a lot:

smarter read-ahead:
1. non-cached read: old: 1.7 GiB/s, new: 1.9 GiB/s (no change compared to old read-ahead)
2. cached read: old: 2.2 GiB/s, new: 6.6 GiB/s

Patch for this is at:
http://www.netbsd.org/~jdolecek/uvm_readahead_skip.diff

For the direct map, I've implemented a new helper pmap function, which in turn calls a callback with the appropriate virtual address. This way the arch pmap is more free to choose an alternative implementation for this, e.g. without actually having a direct map, like sparc64.

The code right now has some instrumentation for testing, but the general idea should be visible.

Opinions? Particularly, I'm not quite sure if it is safe to avoid triggering a write fault in ubc_uiomove(), and whether the new heuristics for read-ahead are good enough for the general case.

Jaromir

2018-04-19 22:39 GMT+02:00 Jaromír Doleček:
> I've finally got my test rig set up, so was able to check the
> performance difference when using emap.
>
> Good news, there is significant speedup on the NVMe device, without
> observing any bad side effects so far.
>
> I've set up a test file of 2G, smaller than RAM to fit all in cache, test was:
> dd if=foo of=/dev/null bs=64k
>
> First read (not-cached): old: 1.7 GB/s, new: 2.1 GB/s
> Cached read: old: 2.2 GB/s, new: 3.1 GB/s
>
> Reads from the raw device were the same in both cases, around 1.7 GB/s,
> which is a bit bizarre.
>
> If we want to modify the uvm_bio.c code to optionally use the direct map,
> there is the problem with the fault technique it uses for read I/O.
> The code doesn't actually enter the pages into the KVA window, it lets
> uiomove() trigger the faults. Is there some advantage to this, why
> is this better than just mapping those 1-2 pages before calling
> uiomove()?
>
> Jaromir
>
> 2018-04-02 21:28 GMT+02:00 Jaromír Doleček:
>> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček:
>>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger:
>>>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>>>> question is whether we can't just use the direct map for this on amd64
>>>> and similar platforms?
>>>
>>> Right, we could/should use emap. I hadn't realized emap is actually already
>>> implemented. It's currently used for pipe for the loan/"direct" write.
>>>
>>> I don't know anything about emap though. Are there any known issues,
>>> do you reckon it's ready to be used for general I/O handling?
>>
>> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
>> mapping:
>>
>> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff
>>
>> Seems to boot, no idea what else it will break.
>>
>> Looking at the state of usage though, the emap is only used for a disabled
>> code path in sys_pipe and nowhere else. That code had several on-and-off
>> passes for being enabled in 2009, and no further use since then. Doesn't
>> give too much confidence.
>>
>> The only port actually having an optimization for emap is x86. Since amd64
>> is also the only one supporting the direct map, we are really at liberty to
>> pick either one. I'd lean towards the direct map, since that doesn't require
>> adding/removing any mapping in pmap_kernel() at all. From looking at the
>> code, I gather the direct map is quite easy to implement for other archs
>> like sparc64. I'd say significantly easier than adding the necessary emap
>> hooks into MD pmaps.
>>
>> Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
I've finally got my test rig set up, so was able to check the performance difference when using emap.

Good news, there is significant speedup on the NVMe device, without observing any bad side effects so far.

I've set up a test file of 2G, smaller than RAM to fit all in cache, test was:
dd if=foo of=/dev/null bs=64k

First read (not-cached): old: 1.7 GB/s, new: 2.1 GB/s
Cached read: old: 2.2 GB/s, new: 3.1 GB/s

Reads from the raw device were the same in both cases, around 1.7 GB/s, which is a bit bizarre.

If we want to modify the uvm_bio.c code to optionally use the direct map, there is the problem with the fault technique it uses for read I/O. The code doesn't actually enter the pages into the KVA window, it lets uiomove() trigger the faults. Is there some advantage to this, why is this better than just mapping those 1-2 pages before calling uiomove()?

Jaromir

2018-04-02 21:28 GMT+02:00 Jaromír Doleček:
> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček:
>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger:
>>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>>> question is whether we can't just use the direct map for this on amd64
>>> and similar platforms?
>>
>> Right, we could/should use emap. I hadn't realized emap is actually already
>> implemented. It's currently used for pipe for the loan/"direct" write.
>>
>> I don't know anything about emap though. Are there any known issues,
>> do you reckon it's ready to be used for general I/O handling?
>
> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
> mapping:
>
> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff
>
> Seems to boot, no idea what else it will break.
>
> Looking at the state of usage though, the emap is only used for a disabled
> code path in sys_pipe and nowhere else. That code had several on-and-off
> passes for being enabled in 2009, and no further use since then. Doesn't
> give too much confidence.
>
> The only port actually having an optimization for emap is x86. Since amd64
> is also the only one supporting the direct map, we are really at liberty to
> pick either one. I'd lean towards the direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking at the
> code, I gather the direct map is quite easy to implement for other archs
> like sparc64. I'd say significantly easier than adding the necessary emap
> hooks into MD pmaps.
>
> Jaromir
re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
> 4GB of KVA; and in addition pmap_kernel will consume some KVA too. i386 for
> example has only 4GB of ram and 4GB of KVA, so we'll just never add a direct
> map there.

note that it changes your direct map point, but i386 GENERIC_PAE works fine for at least 16GB ram. it should work up to 64GB. actually, it only strengthens your direct map point, since it's even harder to fit 64GB phys into 4GB virt :-)

> As opposed to that, emaps can be implemented everywhere with no constraint on
> the arch. I think they are better than the direct map for uvm_bio.

IIRC, the main reason we stopped using emap is that there were performance issues. rmind?

.mrg.
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
On 02/04/2018 21:28, Jaromír Doleček wrote:
> 2018-03-31 13:42 GMT+02:00 Jaromír Doleček:
>> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger:
>>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>>> question is whether we can't just use the direct map for this on amd64
>>> and similar platforms?
>>
>> Right, we could/should use emap. I hadn't realized emap is actually already
>> implemented. It's currently used for pipe for the loan/"direct" write.
>>
>> I don't know anything about emap though. Are there any known issues,
>> do you reckon it's ready to be used for general I/O handling?
>
> Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral
> mapping:
>
> http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff

- pmap_emap_enter(va, pa, VM_PROT_READ);
+ pmap_emap_enter(va, pa, VM_PROT_READ | VM_PROT_WRITE);

Mmh no, sys_pipe wanted it read-only, we shouldn't make it writable by default. Adding a prot argument to uvm_emap_enter() would be better.

> Looking at the state of usage though, the emap is only used for a disabled
> code path in sys_pipe and nowhere else. That code had several on-and-off
> passes for being enabled in 2009, and no further use since then. Doesn't
> give too much confidence.
>
> The only port actually having an optimization for emap is x86. Since amd64
> is also the only one supporting the direct map, we are really at liberty to
> pick either one. I'd lean towards the direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking at the
> code, I gather the direct map is quite easy to implement for other archs
> like sparc64. I'd say significantly easier than adding the necessary emap
> hooks into MD pmaps.

There is a good number of architectures where implementing a direct map is not possible, because of KVA consumption. With a direct map we consume more than once the physical space: if you have 4GB of ram, the direct map will consume 4GB of KVA, and in addition pmap_kernel will consume some KVA too. i386 for example has only 4GB of ram and 4GB of KVA, so we'll just never add a direct map there. Direct maps are good when the architecture has much, much more KVA than it has physical space.

I saw some low-KVA architectures have a "partial direct map", where only a (tiny) area of the physical space is direct-mapped. There, we would have to adapt uvm_bio to use pmap_kernel instead, which seems ugly.

As opposed to that, emaps can be implemented everywhere with no constraint on the arch. I think they are better than the direct map for uvm_bio.
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
On Mon, Apr 02, 2018 at 09:28:43PM +0200, Jaromír Doleček wrote:
> The only port actually having optimization for emap is x86. Since amd64
> is also the only one supporting direct map, we are really at liberty to pick
> either one.

I am not sure what you mean here - AFAIK alpha and mips64 also have a direct map for all available ram set up (and some arm too, plus I guess the new aarch64 code as well). I have not seen arguments why to do it on sparc64 yet, as we can easily access all memory w/o any mapping, and a direct map has security implications.

> I'd lean towards direct map, since that doesn't require
> adding/removing any mapping in pmap_kernel() at all. From looking on the code,
> I gather direct map is quite easy to implement for other archs like
> sparc64. I'd say
> significantly easier than adding the necessary emap hooks into MD pmaps.

What would those hooks be? Are they documented?

Martin
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86 - emap, direct map
2018-03-31 13:42 GMT+02:00 Jaromír Doleček:
> 2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger:
>> Yeah, that's what ephemeral mappings were supposed to be for. The other
>> question is whether we can't just use the direct map for this on amd64
>> and similar platforms?
>
> Right, we could/should use emap. I hadn't realized emap is actually already
> implemented. It's currently used for pipe for the loan/"direct" write.
>
> I don't know anything about emap though. Are there any known issues,
> do you reckon it's ready to be used for general I/O handling?

Okay, so I've hacked together a patch to switch uvm_bio.c to ephemeral mapping:

http://www.netbsd.org/~jdolecek/uvm_bio_emap.diff

Seems to boot, no idea what else it will break.

Looking at the state of usage though, the emap is only used for a disabled code path in sys_pipe and nowhere else. That code had several on-and-off passes for being enabled in 2009, and no further use since then. Doesn't give too much confidence.

The only port actually having an optimization for emap is x86. Since amd64 is also the only one supporting the direct map, we are really at liberty to pick either one. I'd lean towards the direct map, since that doesn't require adding/removing any mapping in pmap_kernel() at all. From looking at the code, I gather the direct map is quite easy to implement for other archs like sparc64. I'd say significantly easier than adding the necessary emap hooks into MD pmaps.

Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
2018-03-25 17:27 GMT+02:00 Joerg Sonnenberger:
> Yeah, that's what ephemeral mappings were supposed to be for. The other
> question is whether we can't just use the direct map for this on amd64
> and similar platforms?

Right, we could/should use emap. I hadn't realized emap is actually already implemented. It's currently used for pipe for the loan/"direct" write.

I don't know anything about emap though. Are there any known issues, do you reckon it's ready to be used for general I/O handling?

Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
2018-03-27 0:51 GMT+02:00 Michael van Elst:
> UBC_WINSHIFT=17 (i.e. 2*MAXPHYS) works fine, but sure, for KVA limited
> platforms you have to be careful. On the other hand, it's a temporary
> mapping that only exists while doing I/O. How many concurrent I/O
> operations would you expect on a 32bit system?

I think fragmentation might be a bigger issue. It could happen that there just won't be a contiguous 128k KVA block any more, and then it would hang forever, like kern/47030. But maybe this is not a concern, e.g. i386 KVA uses the top 1G.

> I'd also like to get another tuning into 8.0. Bumping RA_WINSIZE_MAX
> from 8*MAXPHYS to 16*MAXPHYS helps iSCSI and maybe some nvme devices
> too.

Yes please :) For some reason I thought we had this already. I remember this being mentioned in the discussion around the time nvme(4) got integrated into the tree.

Regarding your issue, I'd also like to turn attention back to the acpi idle too. The way I read dev/acpi/acpi_cpu_cstate.c, we try to go straight to C3 without any backoff. It may be good to make this more granular, maybe only going into the deeper states once enough idle time passes, for example a time longer than their wakeup latency. For example, wakeup from C1 might have 1 usec latency, from C2 150 usec, from C3 500. Any opinion on that, any ACPI experts around?

Jaromir
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
2018-03-25 13:48 GMT+02:00 Michael van Elst:
> For some reason it's not per block. There is a mechanism that moves
> an 8kbyte window, independent of the block size. You can easily change
> the window size (I'm currently experimenting with 128kbyte) by building
> a kernel with e.g. options UBC_WINSHIFT=17.
>
> Independent of how the scaling problem will be handled, I suggest to
> increase that window to at least MAXPHYS.

I think the file mapping window doesn't really need to be tied to MAXPHYS. But values over MAXPHYS will probably currently [*] not give further significant speedup for filesystem I/O, unless data is being read from cache.

One thing to note is that this size is used for all file mapping windows, so having too big a value could cause KVA shortages on 32bit platforms. I think the original UBC_WINSHIFT tried to balance the most usual file size against overall KVA use in the kernel pmap, similar to how e.g. MAXSLP works. Maybe bump to a higher value on 64-bit archs only?

If the bump would work, I think we should pull it up to 8.0 too, as this is probably the least likely to break anything out of the proposed options. Did the experiment work?

Jaromir

[*] I hope to get tls-maxphys integrated for 9.0, by then maybe we need to revisit this. It would be also nice to revisit the need to map the memory into the kernel pmap in order to do the I/O.
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On 25/03/2018 17:03, Michael van Elst wrote:
> On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote:
>> The main cpu (on which your program executes) sends IPIs to remote cpus,
>> these cpus were idle and entered the halted state, and they take some
>> time to wake up and process the IPI. And that's time the main cpu spends
>> waiting.
>>
>> As you said yourself it's not trivial to fix this, because the wakeup
>> path can be whatever interrupt entry point, and if we were to add some
>> code there it would slow down interrupt entry in general.
>
> The slow down would be minimal, the cost comes from refilling the TLB
> and that would happen independent of when the TLB was flushed.

I wouldn't be so sure; a new branch in the interrupt entry points (with interrupts disabled) does cost something, and it's about the hottest path in the kernel.

>> We could perhaps have something to instruct the main cpu not to wait
>> for cpus that are idle, only wait for those that aren't.
>
> Is there a guarantee that the IPI handler on the other CPUs has already
> started?

I don't think so, but I would have to re-check the LAPIC spec.

> Could it be deferred by some other high-level interrupt or by a section
> with disabled interrupts?

I don't think so (and I'm also not totally sure what you mean). In all cases you need to handle the wakeup in the interrupt entry points.

An efficient way would be, before going into hlt, to flush the tlb and to set an "idle" flag somewhere in curcpu. Then, we change each interrupt entry point to set this flag to zero automatically regardless of its previous value. Finally, when a cpu performs a tlb shootdown, it does not send IPIs to the cpus flagged as idle.

Maxime
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 07:24:25PM +0200, Maxime Villard wrote: > Le 25/03/2018 à 17:27, Joerg Sonnenberger a écrit : > > The other question is whether we can't just use the direct map for this > > on amd64 and similar platforms? > > no, because nothing guarantees the physical pages are contiguous. At least for ubc_zerorange, they certainly don't have to be. For ubc_uiomove, it's a bit more work to split the IO vector, but certainly feasible as well. Alternatively, propagate it down to uiomove to allow using *two* IO vectors. Joerg
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On 25/03/2018 17:27, Joerg Sonnenberger wrote:
> On Sun, Mar 25, 2018 at 03:19:28PM -0000, Michael van Elst wrote:
>> jo...@bec.de (Joerg Sonnenberger) writes:
>>> What about having a passive unmap as fourth option? I.e. when unmapping
>>> in the transfer map, just add them to a FIFO. Process the FIFO on each
>>> CPU when there is time or the transfer map runs out of space. Context
>>> switching for example would be a natural place to do any such
>>> invalidation.
>>
>> The mapping is so temporary that it is almost only used on a single CPU.
>> Basically it's:
>>
>> win = ubc_alloc();
>> if (!uiomove(win, ...))
>>         memset(win, 0, ...);
>> ubc_release(win, ...);
>>
>> For this to be even visible on multiple CPUs, the thread would need
>> to migrate to another CPU. Locking down the LWP on a single CPU
>> might be the cheapest solution.
>
> Yeah, that's what ephemeral mappings were supposed to be for.

emaps are read-only.

> The other question is whether we can't just use the direct map for this
> on amd64 and similar platforms?

no, because nothing guarantees the physical pages are contiguous.
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 03:19:28PM -0000, Michael van Elst wrote:
> jo...@bec.de (Joerg Sonnenberger) writes:
>
> >What about having a passive unmap as fourth option? I.e. when unmapping
> >in the transfer map, just add them to a FIFO. Process the FIFO on each
> >CPU when there is time or the transfer map runs out of space. Context
> >switching for example would be a natural place to do any such
> >invalidation.
>
> The mapping is so temporary that it is almost only used on a single CPU.
> Basically it's:
>
> win = ubc_alloc();
> if (!uiomove(win, ...))
>         memset(win, 0, ...);
> ubc_release(win, ...);
>
> For this to be even visible on multiple CPUs, the thread would need
> to migrate to another CPU. Locking down the LWP on a single CPU
> might be the cheapest solution.

Yeah, that's what ephemeral mappings were supposed to be for. The other question is whether we can't just use the direct map for this on amd64 and similar platforms?

Joerg
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
jo...@bec.de (Joerg Sonnenberger) writes:

>What about having a passive unmap as fourth option? I.e. when unmapping
>in the transfer map, just add them to a FIFO. Process the FIFO on each
>CPU when there is time or the transfer map runs out of space. Context
>switching for example would be a natural place to do any such
>invalidation.

The mapping is so temporary that it is almost only used on a single CPU. Basically it's:

win = ubc_alloc();
if (!uiomove(win, ...))
        memset(win, 0, ...);
ubc_release(win, ...);

For this to be even visible on multiple CPUs, the thread would need to migrate to another CPU. Locking down the LWP on a single CPU might be the cheapest solution.

--
Michael van Elst
Internet: mlel...@serpens.de
"A potential Snark may lurk in every tree."
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 03:39:21PM +0200, Maxime Villard wrote: > > The main cpu (on which your program executes) sends IPIs to remote cpus, these > cpus were idle and entered the halted state, and they take some time to wake > up and process the IPI. And that's time the main cpu spends waiting. > > As you said yourself it's not trivial to fix this, because the wakeup path > can be whatever interrupt entry point, and if we were to add some code there > it would slow down interrupt entry in general. The slowdown would be minimal; the cost comes from refilling the TLB, and that would happen independently of when the TLB was flushed. > We could perhaps have something to instruct the main cpu not to wait for cpus > that are idle, only wait for those that aren't. Is there a guarantee that the IPI handler on the other CPUs has already started? Could it be deferred by some other high-level interrupt or by a section with disabled interrupts? Greetings, -- Michael van Elst Internet: mlel...@serpens.de "A potential Snark may lurk in every tree."
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sat, Mar 24, 2018 at 08:42:34PM +0100, Jaromír Doleček wrote: > The problem there is that FFS triggers a pathological case. I/O transfer maps > and then unmaps each block into kernel pmap, so that the data could be > copied into user memory. This triggers TLB shootdown IPIs for each FS > block, sent to all CPUs which happen to be idle, or otherwise running on > kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a > lot of synchronization overhead. What about having a passive unmap as fourth option? I.e. when unmapping in the transfer map, just add them to a FIFO. Process the FIFO on each CPU when there is time or the transfer map runs out of space. Context switching for example would be a natural place to do any such invalidation. Joerg
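The passive-unmap idea could be sketched as a per-CPU FIFO of virtual addresses whose TLB entries are invalidated lazily. The following is a minimal userspace simulation, not existing NetBSD code: struct pu_fifo, pu_defer() and pu_drain() are hypothetical names, and the actual invalidation (e.g. invlpg) is only indicated by a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PU_FIFO_SIZE 64

/* Per-CPU FIFO of transfer-map VAs awaiting invalidation.
 * head == tail means empty; one slot is sacrificed to tell
 * "full" apart from "empty". */
struct pu_fifo {
    uintptr_t va[PU_FIFO_SIZE];
    size_t    head, tail;
};

static struct pu_fifo cpu0_fifo;

/* Queue a VA for deferred invalidation instead of sending an
 * immediate shootdown IPI.  Returns 0 on success, -1 if the FIFO
 * is full and the caller must drain (i.e. really flush) first. */
static int
pu_defer(struct pu_fifo *f, uintptr_t va)
{
    size_t next = (f->head + 1) % PU_FIFO_SIZE;

    if (next == f->tail)
        return -1;
    f->va[f->head] = va;
    f->head = next;
    return 0;
}

/* Drain the FIFO, e.g. from the context-switch path; returns the
 * number of entries processed. */
static size_t
pu_drain(struct pu_fifo *f)
{
    size_t n = 0;

    while (f->tail != f->head) {
        /* a real kernel would invalidate f->va[f->tail] here */
        f->tail = (f->tail + 1) % PU_FIFO_SIZE;
        n++;
    }
    return n;
}
```

The point of the design is that the expensive cross-CPU IPI disappears: each CPU invalidates its own stale entries at a convenient, local point in time.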
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On 25/03/2018 at 13:48, Michael van Elst wrote: jaromir.dole...@gmail.com (Jaromír Doleček) writes: The problem there is that FFS triggers a pathological case. I/O transfer maps and then unmaps each block into kernel pmap, so that the data could be copied into user memory. For some reason it's not per block. There is a mechanism that moves an 8kbyte window, independent of the block size. You can easily change the window size (I'm currently experimenting with 128kbyte) by building a kernel with e.g. options UBC_WINSHIFT=17. Independent of how the scaling problem will be handled, I suggest to increase that window to at least MAXPHYS. Yes. 3. change UVM code to not do this mapping via kernel map, so pmap_update() and the IPIs are not triggered in first place What I currently don't understand is that we only see one TLB shootdown per window and that happens when it is _unmapped_. Mapping the pages (with the exception of overwriting) is done by the regular fault handler and apparently this does not flush the TLB. IIRC that's normal. Each time we map a page, we know it wasn't mapped before, so there is no need to flush the TLB, since we know the page isn't cached. However, when unmapping, we do need to flush the TLB. It's an optimization: if we map a page that we end up not touching, it won't have the "used" flag set, and we can unmap it without flushing the TLB. So it saves us one TLB shootdown. By the way, I saw your answer to the PR (by looking at mail-index.netbsd.org, because you didn't CC me in the mail). I think you are right - I was thinking more about a problem in pmap, but the issue may just be that there is latency in the cpu wakeup, as you said. The main cpu (on which your program executes) sends IPIs to remote cpus; these cpus were idle and entered the halted state, and they take some time to wake up and process the IPI. And that's time the main cpu spends waiting.
As you said yourself it's not trivial to fix this, because the wakeup path can be whatever interrupt entry point, and if we were to add some code there it would slow down interrupt entry in general. We could perhaps have something to instruct the main cpu not to wait for cpus that are idle, only wait for those that aren't. Maxime
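The "used"-flag optimization described above can be condensed into a single predicate. On x86 the hardware sets the accessed bit (bit 5) in a PTE the first time the mapping is actually used, so a mapped-but-untouched page cannot be cached in any TLB and can be unmapped without a shootdown. The helper name below is invented for illustration; the bit values mirror the x86 PTE layout.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 0x01u  /* present bit */
#define PTE_A 0x20u  /* accessed bit: set by hardware on first use */

/* A TLB shootdown is only needed if the mapping was present and was
 * actually touched (i.e. some TLB may hold a stale entry for it). */
static int
pte_needs_shootdown(uint32_t pte)
{
    return (pte & PTE_P) != 0 && (pte & PTE_A) != 0;
}
```

This is exactly the case the thread observes: the fault handler installs a fresh mapping (no flush needed), and only the unmap of a touched page pays for a shootdown.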
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
On Sun, Mar 25, 2018 at 11:48:02AM -, Michael van Elst wrote: > [...] > >3. change UVM code to not do this mapping via kernel map, so pmap_update() > >and the IPIs are not triggered in first place > > What I currently don't understand is that we only see one TLB shootdown > per window and that happens when it is _unmapped_. Mapping the pages > (with the exception of overwriting) is done by the regular fault handler > and apparently this does not flush the TLB. This makes me think that it might > not be required at all and is just an unwanted side-effect of pmap_update. I think it's because if we get a fault, then this VA is not mapped in the page table, and so it can't be cached in any TLB. So there is nothing to flush in this case. -- Manuel Bouyer NetBSD: 26 years of experience will always make the difference --
Re: Fixing excessive shootdowns during FFS read (kern/53124) on x86
jaromir.dole...@gmail.com (Jaromír Doleček) writes: >The problem there is that FFS triggers a pathological case. I/O transfer maps >and then unmaps each block into kernel pmap, so that the data could be >copied into user memory. For some reason it's not per block. There is a mechanism that moves an 8kbyte window, independent of the block size. You can easily change the window size (I'm currently experimenting with 128kbyte) by building a kernel with e.g. options UBC_WINSHIFT=17. Independent of how the scaling problem will be handled, I suggest to increase that window to at least MAXPHYS. >3. change UVM code to not do this mapping via kernel map, so pmap_update() >and the IPIs are not triggered in first place What I currently don't understand is that we only see one TLB shootdown per window and that happens when it is _unmapped_. Mapping the pages (with the exception of overwriting) is done by the regular fault handler and apparently this does not flush the TLB. This makes me think that it might not be required at all and is just an unwanted side-effect of pmap_update. -- -- Michael van Elst Internet: mlel...@serpens.de "A potential Snark may lurk in every tree."
Fixing excessive shootdowns during FFS read (kern/53124) on x86
Hello, I'd like to gather some feedback on how to best tackle kern/53124. The problem there is that FFS triggers a pathological case. I/O transfer maps and then unmaps each block into kernel pmap, so that the data can be copied into user memory. This triggers TLB shootdown IPIs for each FS block, sent to all CPUs which happen to be idle, or otherwise running on kernel pmap. On systems with many idle CPUs these TLB shootdowns cause a lot of synchronization overhead. I see three possible ways to fix this particular case: 1. make it possible to lazily invalidate the TLB also for the kernel pmap, i.e. make pmap_deactivate()/pmap_reactivate() work with the kernel pmap - this avoids sending TLB shootdown IPIs to idle CPUs 2. make the idle lwp run in its own pmap and address space - a variant of #1, avoids changing the invariant about the kernel map being always loaded 3. change UVM code to not do this mapping via kernel map, so pmap_update() and the IPIs are not triggered in the first place I reckon #2 would add a lot of overhead to the idle routine; it would require at least an address space switch. This address space switch might be fairly cheap on architectures with address space IDs (avoiding TLB flushes), but still much more than what the idle entry/exit does now. Variants of the problems with #3 were discussed on and off over the years as I recall, but there is no resolution yet, and I'm not aware of anyone actively working on this. I understand this would be Hard, with nobody currently having the time and courage. This leaves #1 as a short-term practical solution. Does anyone foresee any particular problems with this? Does it have a chance to fly? Any other ideas? Jaromir
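Option #1 can be illustrated with a small userspace simulation. The idea: each CPU tracks a TLB generation; a CPU entering the idle loop "deactivates" the kernel pmap and is skipped by the shootdown sender, and on reactivation it performs one full local flush if it missed any shootdowns while idle. All of the names below (kpm_shootdown, kpm_deactivate, kpm_reactivate, the generation counters) are hypothetical, not the actual NetBSD pmap API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU 4

static uint64_t kpm_tlb_gen = 1;   /* global kernel-pmap shootdown generation */

struct cpu {
    bool     kpm_active;           /* kernel pmap "loaded" on this CPU */
    uint64_t tlb_gen;              /* generation this CPU last flushed at */
};

static struct cpu cpus[NCPU];

/* Sender side: bump the generation; only CPUs with the kernel pmap
 * active receive an IPI (simulated here as an immediate flush).
 * Returns the number of IPIs "sent". */
static int
kpm_shootdown(void)
{
    int ipis = 0;

    kpm_tlb_gen++;
    for (int i = 0; i < NCPU; i++) {
        if (cpus[i].kpm_active) {
            cpus[i].tlb_gen = kpm_tlb_gen;   /* IPI target: flush now */
            ipis++;
        }
    }
    return ipis;
}

/* Idle entry: deactivate the kernel pmap so we stop receiving IPIs. */
static void
kpm_deactivate(struct cpu *ci)
{
    ci->kpm_active = false;
}

/* Idle exit: reactivate; if shootdowns happened while we were idle,
 * do one full local TLB flush to catch up.  Returns whether a flush
 * was needed. */
static bool
kpm_reactivate(struct cpu *ci)
{
    bool flushed = (ci->tlb_gen != kpm_tlb_gen);

    if (flushed)
        ci->tlb_gen = kpm_tlb_gen;           /* full local flush */
    ci->kpm_active = true;
    return flushed;
}
```

The trade-off is visible in the sketch: idle CPUs never delay the sender, at the cost of a full (rather than per-page) flush when they wake up with a stale generation.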