Re: multicast.4: drop unneeded (void *) casts
On Fri, Oct 19, 2018 at 6:20 AM Scott Cheloha wrote: ... > Sure, looks cleaner. ok? ok guenther@
Re: bypass support for iommu on sparc64
On Sat, Oct 20, 2018 at 02:44:29AM +, Joseph Mayer wrote: > > Last iteration from me on this one. > > Why is this not a problem on some other architectures? It is a problem, it's just that other archs don't have an iommu like sparc64. > I'd have thought DMA and hardware being assigned transitory addresses > (from memory allocator or other OS subsystem or driver) mostly is a > lower level phenomenon and memcpy normally applies on higher levels, > isn't it so - for networking for instance, mbuf's take over soon above > the driver level. Does OpenBSD have a pool of to-be-mbufs and it asks > network drivers to write received ethernet frames directly to them, and > similarly transmit ethernet frames directly from mbuf:s? Hrm. There's three views of memory you need to keep in mind here. Memory has a physical address which gets mapped to virtual addresses that the kernel and programs see. Finally, there's the DMA address, which is the address devices use to access physical memory. On most archs the physical and dma addresses are the same thing. On archs with an IOMMU or similar, the dma address can be virtual, just like the kernel addresses are virtual. When you allocate an mbuf, you're getting a chunk of physical memory that is mapped into the kernel virtual address space. For a device to do something with it, the kernel has the bus_dma api that figures out the dma address of the physical memory behind the kernel virtual address. On sparc64, that figuring out involves finding the physical address of the memory, then allocating and filling TTEs. On amd64, it just has to get the physical address of the kva and the device can use it directly. > What potentially or clearly sensitive memory would passthru expose, > driver-owned structures only or all memory? Passthru means a device can access all the physical memory in a computer. So everything.
Re: bypass support for iommu on sparc64
On Saturday, October 20, 2018 10:14 AM, David Gwynne wrote: > > On 20 Oct 2018, at 11:56 am, Joseph Mayer joseph.ma...@protonmail.com wrote: > > ‐‐‐ Original Message ‐‐‐ > > On Friday, October 19, 2018 5:15 PM, Mark Kettenis mark.kette...@xs4all.nl > > wrote: > > > > > > Date: Fri, 19 Oct 2018 10:22:30 +1000 > > > > From: David Gwynne da...@gwynne.id.au > > > > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: > > > > > > > > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: > > > > > > > > > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes), > > > > > > setting up and tearing down the translation table entries (TTEs) > > > > > > is very expensive. so expensive that the cost of doing it for disk > > > > > > io has a noticeable impact on compile times. > > > > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide > > > > > > to bypass the iommu for devices that set that flag, therefore > > > > > > avoiding the cost of handling the TTEs. > > Question for the unintroduced, what's the scope here, TTE is Sparc's > > page table and reconfiguring them at (process) context switch is > > expensive and this suggestion removes the need for TTE:s for hardware > > device access, but those don't change at context switch? > > We're talking about an IOMMU here, not a traditional MMU providing virtual > addresses for programs. An IOMMU sits between physical memory and the devices > in a machine. It allows DMA addresses to be mapped to different parts of > physical memory. Mapping physical memory to a DMA virtual address (or dva) is > how a device that only understands 32bit addresses can work in a 64bit > machine. Memory at high addresses gets mapped to a low dva. > > This is done at runtime on OpenBSD when DMA mappings are loaded or unloaded > by populating Translation Table Entries (TTEs). A TTE is effectively a table > or array mapping DVA pages to physical addresses. 
Generally device drivers > load and unload dma memory for every I/O or packet or so on. > > IOMMUs in sparc64s have some more features than this. Because they really are > between memory and the devices they can act as a gatekeeper for all memory > accesses. They also have a toggle that can allow a device to have direct or > passthru access to physical memory. If passthru is enabled, there's a special > address range that effectively maps all physical memory into a DVA range. > Devices can be pointed at it without having to manage TTEs. When passthru is > disabled, all accesses must go through TTEs. > > Currently OpenBSD disables passthru. The benefit is devices can't blindly > access sensitive memory unless it is explicitly shared. Note that this is how > it is on most architectures anyway. However, the consequence of managing the > TTEs is that it is expensive, and extremely so in some cases. > > dlg Last iteration from me on this one. Why is this not a problem on some other architectures? I'd have thought DMA and hardware being assigned transitory addresses (from memory allocator or other OS subsystem or driver) mostly is a lower level phenomenon and memcpy normally applies on higher levels, isn't it so - for networking for instance, mbuf's take over soon above the driver level. Does OpenBSD have a pool of to-be-mbufs and it asks network drivers to write received ethernet frames directly to them, and similarly transmit ethernet frames directly from mbuf:s? What potentially or clearly sensitive memory would passthru expose, driver-owned structures only or all memory?
Re: vmd: servicing virtio devices from separate processes
> On 20 Oct 2018, at 4:01 am, Sergio Lopez wrote: > > On Wed, Oct 17, 2018 at 02:04:46PM -0700, Mike Larkin wrote: >> On Fri, Oct 05, 2018 at 01:50:10PM +0200, Sergio Lopez wrote: >>> Right now, vmd already features an excellent privsep model to ensure >>> the process servicing the VM requests to the outside world is running >>> with the lowest possible privileges. >>> >>> I was wondering if we could take a step further, servicing each virtio >>> device from a different process. This design would simplify the >>> implementation and maintenance of those devices, improve the privsep >>> model and increase the resilience of VMs (the crash of a process >>> servicing a device won't bring down the whole VM, and a mechanism to >>> recover from this scenario could be explored). >> >> Our model is generally to not try to recover from crashes like that. Indeed, >> you *want* to crash so the underlying bug can be found and fixed. > > With separate processes you'll still have a crash with core dump from > the process servicing the device, with the additional advantage of being > able to, optionally, try to recover device or debug it independently. > >>> - An in-kernel IRQ chip. This one is the most controversial, as it >>> means emulating a device from a privileged domain, but I'm pretty sure >>> a lapic implementation with enough functionality to serve *BSD/Linux >>> Guests can be small and simple enough to be easily auditable. >> >> This needs to be done, for a number of reasons (device emulation being just >> one). pd@ and I are working on how to implement this using features in >> recent CPUs, since much of the LAPIC emulation can now be handled by the >> CPU itself. We're thinking skylake and later will be the line in the sand >> for this. Otherwise the software emulation is more complex and more prone >> to bugs. I've resisted the urge to put this stuff in the kernel for exactly >> that reason, but with later model CPUs we may be in better shape. 
We may >> also decide to focus solely on x2APIC. If you're interested in helping in >> this area, I'll keep you in the loop. > > Sure, I'd like to help. I'm quite familiar with KVM's in-kernel irqchip > implementation. > >>> Do you think it's worth exploring this model? What are feelings >>> regarding the in-kernel IRQ chip? >>> >>> Sergio (slp). >>> >> >> All things considered, I'm less sold on the idea of splitting out devices >> into their own processes. I don't see any compelling reason. But we do >> need an IOAPIC and LAPIC implementation at some point, as you point out. > > Well, vmd is still small enough (thankfully) to be able to debug it easily. > But, eventually, it'll grow in both size and complexity (especially with > SMP support, I/O optimizations, additional storage backends...) and > having a high degree of modularity really helps here. > > In fact, the QEMU/KVM community is starting to consider going this route > (but QEMU is *huge*). [1] > > Anyways, this is not a "now-or-never" thing. As you suggest, we can work > now on kickfd and IOAPIC/LAPIC, which are useful by themselves. And when > those are in place, I'll be able to write a PoC so we can evaluate its > benefits and drawbacks. Would sending and receiving a VM still work if I/O is run in different processes? dlg
Re: bypass support for iommu on sparc64
> On 20 Oct 2018, at 11:56 am, Joseph Mayer wrote: > > ‐‐‐ Original Message ‐‐‐ > On Friday, October 19, 2018 5:15 PM, Mark Kettenis > wrote: > >>> Date: Fri, 19 Oct 2018 10:22:30 +1000 >>> From: David Gwynne da...@gwynne.id.au >>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: >>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: > on modern sparc64s (think fire or sparc enterprise Mx000 boxes), > setting up and tearing down the translation table entries (TTEs) > is very expensive. so expensive that the cost of doing it for disk > io has a noticeable impact on compile times. > now that there's a BUS_DMA_64BIT flag, we can use that to decide > to bypass the iommu for devices that set that flag, therefore > avoiding the cost of handling the TTEs. > > Question for the unintroduced, what's the scope here, TTE is Sparc's > page table and reconfiguring them at (process) context switch is > expensive and this suggestion removes the need for TTE:s for hardware > device access, but those don't change at context switch? We're talking about an IOMMU here, not a traditional MMU providing virtual addresses for programs. An IOMMU sits between physical memory and the devices in a machine. It allows DMA addresses to be mapped to different parts of physical memory. Mapping physical memory to a DMA virtual address (or dva) is how a device that only understands 32bit addresses can work in a 64bit machine. Memory at high addresses gets mapped to a low dva. This is done at runtime on OpenBSD when DMA mappings are loaded or unloaded by populating Translation Table Entries (TTEs). A TTE is effectively a table or array mapping DVA pages to physical addresses. Generally device drivers load and unload dma memory for every I/O or packet or so on. IOMMUs in sparc64s have some more features than this. Because they really are between memory and the devices they can act as a gatekeeper for all memory accesses. 
They also have a toggle that can allow a device to have direct or passthru access to physical memory. If passthru is enabled, there's a special address range that effectively maps all physical memory into a DVA range. Devices can be pointed at it without having to manage TTEs. When passthru is disabled, all accesses must go through TTEs. Currently OpenBSD disables passthru. The benefit is devices can't blindly access sensitive memory unless it is explicitly shared. Note that this is how it is on most architectures anyway. However, the consequence of managing the TTEs is that it is expensive, and extremely so in some cases. dlg
Re: bypass support for iommu on sparc64
‐‐‐ Original Message ‐‐‐ On Friday, October 19, 2018 5:15 PM, Mark Kettenis wrote: > > Date: Fri, 19 Oct 2018 10:22:30 +1000 > > From: David Gwynne da...@gwynne.id.au > > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: > > > > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: > > > > > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes), > > > > setting up and tearing down the translation table entries (TTEs) > > > > is very expensive. so expensive that the cost of doing it for disk > > > > io has a noticable impact on compile times. > > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide > > > > to bypass the iommu for devices that set that flag, therefore > > > > avoiding the cost of handling the TTEs. Question for the unintroduced, what's the scope here, TTE is Sparc's page table and reconfiguring them at (process) context switch is expensive and this suggestion removes the need for TTE:s for hardware device access, but those don't change at context switch?
Re: lib/libfuse: Handle signals that get sent to any thread
On Fri, Oct 19 2018, at 4:02 PM, Klemens Nanni wrote: > On Fri, Oct 19, 2018 at 02:03:59PM -0700, Rian Hunter wrote: >> Gentle bump on this. Would be nice if this were fixed for the next release. >> >> On 2018-10-07 10:55, Rian Hunter wrote: >> > (re-posting because I realized the last patch didn't apply cleanly) > Your patch does not apply due to whitespace issues, your MUA is probably > the culprit. I suggest you mail diffs to yourself first and and try to > apply them as reviewers would have to do. > > Even after fixing it with `sed "s/^ / /"', your diff does not compile: > > /usr/src/lib/libfuse/fuse.c:496:20: error: use of undeclared identifier > 'SIG_SIGCHLD' > signal(SIGPIPE, SIG_SIGCHLD); > >> +if (sigaction(SIGCHLD, NULL, &old_sa) == 0) >> +if (old_sa.sa_handler == SIG_IGN) >> +signal(SIGPIPE, SIG_SIGCHLD); > It's SIGCHLD or SIG_DFL, see signal(3). Thanks for the feedback, in my haste to re-post my patch I posted an old version. Here's the current version, there should be no tab-mangling: Index: lib/libfuse/fuse.c === RCS file: /cvs/src/lib/libfuse/fuse.c,v retrieving revision 1.49 diff -u -r1.49 fuse.c --- lib/libfuse/fuse.c 5 Jul 2018 10:57:31 - 1.49 +++ lib/libfuse/fuse.c 20 Oct 2018 01:03:23 - @@ -32,8 +32,6 @@ #include "fuse_private.h" #include "debug.h" -static volatile sig_atomic_t sigraised = 0; -static volatile sig_atomic_t signum = 0; static struct fuse_context *ictx = NULL; enum { @@ -116,21 +114,10 @@ }; static void -ifuse_sighdlr(int num) -{ - if (!sigraised || (num == SIGCHLD)) { - sigraised = 1; - signum = num; - } -} - -static void ifuse_try_unmount(struct fuse *f) { pid_t child; - signal(SIGCHLD, ifuse_sighdlr); - /* unmount in another thread so fuse_loop() doesn't deadlock */ child = fork(); @@ -152,7 +139,6 @@ { int status; - signal(SIGCHLD, SIG_DFL); if (waitpid(WAIT_ANY, &status, WNOHANG) == -1) fprintf(stderr, "fuse: %s\n", strerror(errno)); @@ -160,7 +146,6 @@ fprintf(stderr, "fuse: %s: %s\n", f->fc->dir, 
strerror(WEXITSTATUS(status))); - sigraised = 0; return; } @@ -170,6 +155,7 @@ struct fusebuf fbuf; struct fuse_context ctx; struct fb_ioctl_xch ioexch; + struct kevent event[5]; struct kevent ev; ssize_t n; int ret; @@ -181,28 +167,39 @@ if (fuse->fc->kq == -1) return (-1); - EV_SET(&fuse->fc->event, fuse->fc->fd, EVFILT_READ, EV_ADD | + EV_SET(&event[0], fuse->fc->fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, 0); + /* signal events */ + EV_SET(&event[1], SIGCHLD, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + EV_SET(&event[2], SIGHUP, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + EV_SET(&event[3], SIGINT, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + EV_SET(&event[4], SIGTERM, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + while (!fuse->fc->dead) { - ret = kevent(fuse->fc->kq, &fuse->fc->event, 1, &ev, 1, NULL); + ret = kevent(fuse->fc->kq, &event[0], 5, &ev, 1, NULL); if (ret == -1) { - if (errno == EINTR) { - switch (signum) { - case SIGCHLD: - ifuse_child_exit(fuse); - break; - case SIGHUP: - case SIGINT: - case SIGTERM: - ifuse_try_unmount(fuse); - break; - default: - fprintf(stderr, "%s: %s\n", __func__, - strsignal(signum)); - } - } else + if (errno != EINTR) DPERROR(__func__); + } else if (ret > 0 && ev.filter == EVFILT_SIGNAL) { + int signum = ev.ident; + switch (signum) { + case SIGCHLD: + ifuse_child_exit(fuse); + break; + case SIGHUP: + case SIGINT: + case SIGTERM: + ifuse_try_unmount(fuse); + break; + default: + fprintf(stderr, "%s: %s\n", __func__, + strsignal(signum)); + }
Re: bypass support for iommu on sparc64
> On 19 Oct 2018, at 9:59 pm, Andrew Grillet wrote: > > Is the setup and teardown per transfer or when file is opened and closed? > Or is it set up once per context switch of task? > > I am partly interested cos I would like to improve mt one day (as user of > tape > and Sparc64 Txxx) if I get the time. > > Andrew The overhead is per transfer. You might not get better performance out of a tx000 because of the PCIe bridges involved, but you may also be lucky and not have that bridge in the way. > > > > On Fri, 19 Oct 2018 at 10:22, Mark Kettenis wrote: > >>> Date: Fri, 19 Oct 2018 10:22:30 +1000 >>> From: David Gwynne >>> >>> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: > on modern sparc64s (think fire or sparc enterprise Mx000 boxes), > setting up and tearing down the translation table entries (TTEs) > is very expensive. so expensive that the cost of doing it for disk > io has a noticable impact on compile times. > > now that there's a BUS_DMA_64BIT flag, we can use that to decide > to bypass the iommu for devices that set that flag, therefore > avoiding the cost of handling the TTEs. > > the following diff adds support for bypass mappings to the iommu > code on sparc64. it's based on a diff from kettenis@ back in 2009. > the main changes are around coping with the differences between > schizo/psycho and fire/oberon. > > the differences between the chips are now represented by a iommu_hw > struct. these differences include how to enable the iommu (now via > a function pointer), and masks for bypass addresses. > > ive tested this on oberon (on an m4000) and schizo (on a v880). > however, the bypass code isnt working on fire (v245s). to cope with > that for now, the iommu_hw struct lets drivers mask flag bits that > are handled when creating a dmamap. this means fire boards will > ignore BUS_DMA_64BIT until i can figure out whats wrong with them. i figured it out. 
it turns out Fire was working fine. however, enabling 64bit dva on the onboard devices didnt work because the serverworks/broadcom pcie to pcix bridge can only handle dma addresses in the low 40 bits. because the fire bypass window is higher than this, the bridge would choke and things stopped working. the updated diff attempts to handle this. basically when probing the bridge, the platform creates a custom dma tag for it. this tag intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before handing it up to the parent bridge, which is pyro in my situation. it looks like early sun4v boxes could make use of this too. > i have not tested this on psycho yet. if anyone has such a machine > and is willing to work with me to figure it out, please talk to me. i still dont have psycho reports. >>> >>> Would anyone object if I committed this? I've been running it for the >>> last release or two without issues, but with significant improvements in >>> performance on the machines involved. >> >> At the price of giving all PCI devices unrestricted access to memory. >> >> So I'm not eager to do this, especially since on sun4v hardware bypassing >> the iommu isn't possible as soon as multiple domains are enabled. And >> we lose a useful diagnostic when developing drivers. Are you sure the >> iommu overhead can't be reduced some other way? At some point we >> probably want to add iommu support on amd64 and arm64, but if that >> comes with a similar overhead as on sparc64 that's going to be a bit >> of an issue. 
>> Index: dev/iommu.c === RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v retrieving revision 1.74 diff -u -p -r1.74 iommu.c --- dev/iommu.c 30 Apr 2017 16:45:45 - 1.74 +++ dev/iommu.c 10 May 2017 12:00:09 - @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t, bus_addr_t, bus_size_t, int); +void iommu_hw_enable(struct iommu_state *); + +const struct iommu_hw iommu_hw_default = { + .ihw_enable = iommu_hw_enable, + + .ihw_dvma_pa= IOTTE_PAMASK, + + .ihw_bypass = 0x3fffUL << 50, + .ihw_bypass_nc = 0, + .ihw_bypass_ro = 0, +}; + +void +iommu_hw_enable(struct iommu_state *is) +{ + IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb); + IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16)); +} + /* * Initiate an STC entry flush. */ @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb * - create a private DVMA map. */ void -iommu_init(char *name, struct iommu_sta
Re: bypass support for iommu on sparc64
> On 19 Oct 2018, at 7:15 pm, Mark Kettenis wrote: > >> Date: Fri, 19 Oct 2018 10:22:30 +1000 >> From: David Gwynne >> >> On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: >>> On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: on modern sparc64s (think fire or sparc enterprise Mx000 boxes), setting up and tearing down the translation table entries (TTEs) is very expensive. so expensive that the cost of doing it for disk io has a noticable impact on compile times. now that there's a BUS_DMA_64BIT flag, we can use that to decide to bypass the iommu for devices that set that flag, therefore avoiding the cost of handling the TTEs. the following diff adds support for bypass mappings to the iommu code on sparc64. it's based on a diff from kettenis@ back in 2009. the main changes are around coping with the differences between schizo/psycho and fire/oberon. the differences between the chips are now represented by a iommu_hw struct. these differences include how to enable the iommu (now via a function pointer), and masks for bypass addresses. ive tested this on oberon (on an m4000) and schizo (on a v880). however, the bypass code isnt working on fire (v245s). to cope with that for now, the iommu_hw struct lets drivers mask flag bits that are handled when creating a dmamap. this means fire boards will ignore BUS_DMA_64BIT until i can figure out whats wrong with them. >>> >>> i figured it out. it turns out Fire was working fine. however, >>> enabling 64bit dva on the onboard devices didnt work because the >>> serverworks/broadcom pcie to pcix bridge can only handle dma addresses >>> in the low 40 bits. because the fire bypass window is higher than >>> this, the bridge would choke and things stopped working. >>> >>> the updated diff attempts to handle this. basically when probing >>> the bridge, the platform creates a custom dma tag for it. 
this tag >>> intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before >>> handing it up to the parent bridge, which is pyro in my situation. >>> it looks like early sun4v boxes could make use of this too. >>> i have not tested this on psycho yet. if anyone has such a machine and is willing to work with me to figure it out, please talk to me. >>> >>> i still dont have psycho reports. >> >> Would anyone object if I committed this? I've been running it for the >> last release or two without issues, but with significant improvements in >> performance on the machines involved. > > At the price of giving all PCI devices unrestricted access to memory. > > So I'm not eager to do this, especially since on sun4v hardware bypassing > the iommu isn't possible as soon as multiple domains are enabled. And > we lose a useful diagnostic when developing drivers. Are you sure the > iommu overhead can't be reduced some other way? At some point we > probably want to add iommu support on amd64 and arm64, but if that > comes with a similar overhead as on sparc64 that's going to be a bit > of an issue. First, note that it doesn't turn the iommu off. By default drivers still go through it unless they opt out with BUS_DMA_64BIT. This is because the iommu is still in between the device and ram, and it provides the passthru window up at 0xfffc. As an aside, and as hinted at in my previous mails, it means that devices with ppb6 at pci6 dev 0 function 0 "ServerWorks PCIE-PCIX" rev 0xb5 in them cannot really use BUS_DMA_64BIT cos those bridges are buggy and don't handle DVAs above 48 or 56 bits or something. That bridge is used in v215s, v245s, v445s, t1000s, and so on. I have a theory that because of that bridge, there was a meme going around Sun at the time that it was cheaper to memcpy in and out of preallocated DMA memory than it was to do DMA for every packet or disk I/O or whatever. 
Which leads me to the conclusion that an alternative to using the passthru window would be to have bus_dma preallocate the dmaable memory and bounce in and out of it. The performance hit I'm trying to avoid is with setting up and tearing down the translation table entries. If they already exist, you avoid that hit. Bouncing is complicated though, both in the bus_dma layer, and especially by pushing it into drivers. The amount of overhead varies between machines. It seems less of a difference with nvme(4) in a slot that is not behind the dodgy bridge on a v245. It was about 20 or 30 percent of a difference with gem(4) and tcpbench in a v880 (schizo). It is particularly bad on the M4000 I have, which is why I looked into this. There are orders of magnitude of difference between tcpbench results with a tweaked ix(4) and this diff on or off. We've declined to enable mitigations before because of performance hits smaller than this. dlg > >>> Index: dev/iommu.c >>> ===
Re: lib/libfuse: Handle signals that get sent to any thread
On Fri, Oct 19, 2018 at 02:03:59PM -0700, Rian Hunter wrote: > Gentle bump on this. Would be nice if this were fixed for the next release. > > On 2018-10-07 10:55, Rian Hunter wrote: > > (re-posting because I realized the last patch didn't apply cleanly) Your patch does not apply due to whitespace issues, your MUA is probably the culprit. I suggest you mail diffs to yourself first and try to apply them as reviewers would have to do. Even after fixing it with `sed "s/^ / /"', your diff does not compile: /usr/src/lib/libfuse/fuse.c:496:20: error: use of undeclared identifier 'SIG_SIGCHLD' signal(SIGPIPE, SIG_SIGCHLD); > + if (sigaction(SIGCHLD, NULL, &old_sa) == 0) > + if (old_sa.sa_handler == SIG_IGN) > + signal(SIGPIPE, SIG_SIGCHLD); It's SIGCHLD or SIG_DFL, see signal(3).
Re: lib/libfuse: Handle signals that get sent to any thread
Gentle bump on this. Would be nice if this were fixed for the next release. On 2018-10-07 10:55, Rian Hunter wrote: (re-posting because I realized the last patch didn't apply cleanly) lib/libfuse/fuse.c was using a EINTR return from kevent() to detect when a signal had occurred, but in a multi-threaded process the signal may not have been delivered to the thread blocking on kevent(). This changes makes it so the signals are caught using EVFILT_SIGNAL filters so they can be detected even when they happen on another thread. Index: lib/libfuse/fuse.c === RCS file: /cvs/src/lib/libfuse/fuse.c,v retrieving revision 1.49 diff -u -p -u -p -r1.49 fuse.c --- lib/libfuse/fuse.c 5 Jul 2018 10:57:31 - 1.49 +++ lib/libfuse/fuse.c 5 Oct 2018 22:13:41 - @@ -32,8 +32,6 @@ #include "fuse_private.h" #include "debug.h" -static volatile sig_atomic_t sigraised = 0; -static volatile sig_atomic_t signum = 0; static struct fuse_context *ictx = NULL; enum { @@ -116,21 +114,10 @@ static struct fuse_opt fuse_mount_opts[] }; static void -ifuse_sighdlr(int num) -{ - if (!sigraised || (num == SIGCHLD)) { - sigraised = 1; - signum = num; - } -} - -static void ifuse_try_unmount(struct fuse *f) { pid_t child; - signal(SIGCHLD, ifuse_sighdlr); - /* unmount in another thread so fuse_loop() doesn't deadlock */ child = fork(); @@ -152,7 +139,6 @@ ifuse_child_exit(const struct fuse *f) { int status; - signal(SIGCHLD, SIG_DFL); if (waitpid(WAIT_ANY, &status, WNOHANG) == -1) fprintf(stderr, "fuse: %s\n", strerror(errno)); @@ -160,7 +146,6 @@ ifuse_child_exit(const struct fuse *f) fprintf(stderr, "fuse: %s: %s\n", f->fc->dir, strerror(WEXITSTATUS(status))); - sigraised = 0; return; } @@ -170,6 +155,7 @@ fuse_loop(struct fuse *fuse) struct fusebuf fbuf; struct fuse_context ctx; struct fb_ioctl_xch ioexch; + struct kevent event[5]; struct kevent ev; ssize_t n; int ret; @@ -181,28 +167,39 @@ fuse_loop(struct fuse *fuse) if (fuse->fc->kq == -1) return (-1); - EV_SET(&fuse->fc->event, fuse->fc->fd, 
EVFILT_READ, EV_ADD | + EV_SET(&event[0], fuse->fc->fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, 0); + /* signal events */ + EV_SET(&event[1], SIGCHLD, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + EV_SET(&event[2], SIGHUP, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + EV_SET(&event[3], SIGINT, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + EV_SET(&event[4], SIGTERM, EVFILT_SIGNAL, EV_ADD | + EV_ENABLE, 0, 0, 0); + while (!fuse->fc->dead) { - ret = kevent(fuse->fc->kq, &fuse->fc->event, 1, &ev, 1, NULL); + ret = kevent(fuse->fc->kq, &event[0], 5, &ev, 1, NULL); if (ret == -1) { - if (errno == EINTR) { - switch (signum) { - case SIGCHLD: - ifuse_child_exit(fuse); - break; - case SIGHUP: - case SIGINT: - case SIGTERM: - ifuse_try_unmount(fuse); - break; - default: - fprintf(stderr, "%s: %s\n", __func__, - strsignal(signum)); - } - } else + if (errno != EINTR) DPERROR(__func__); + } else if (ret > 0 && ev.filter == EVFILT_SIGNAL) { + int signum = ev.ident; + switch (signum) { + case SIGCHLD: + ifuse_child_exit(fuse); + break; + case SIGHUP: + case SIGINT: + case SIGTERM: + ifuse_try_unmount(fuse); + break; + default: + fprintf(stderr, "%s: %s\n", __func__, + strsignal(signum)); + } } else if (ret > 0) { n = read(fuse->fc->fd, &fbuf, sizeof(fbuf)); if (n != sizeof(fbuf)) { @@ -479,20 +476,24 @@ fuse_remove_signal_handlers(unused struc struct sigaction old_sa; if (sigaction(SIGHUP, NULL, &old_sa) == 0) - if (old_sa.sa_handler == ifuse_sighdlr) + if (old_sa.sa_handl
Re: [RFC PATCH] [vmm] Lightweight mechanism to kick virtio VQs
On Wed, Oct 17, 2018 at 02:22:23PM -0700, Mike Larkin wrote: > On Fri, Oct 05, 2018 at 12:07:13PM +0200, Sergio Lopez wrote: > > Hi, > > > > This patch implements a mechanism to allow users to register an I/O port > > with a special file descriptor (kickfd) which can be monitored for > > events using kevent. The kernel will note an event each time the Guest > > writes to an I/O port registered with a kickfd. > > > > > This is mainly intended for kicking virtio virtqueues, and allows the > > VCPU to return immediately to the Guest, instead of having to exit all > > the way to vmd to process the request. With kickfd, the actual work > > (i.e. an I/O operation) requested by the Guest can be done in a > > parallel thread that eventually asserts the corresponding IRQ. > > The reason our virtio is so slow today is, as you point out, we process > the I/O sequentially with the queue notify back in vmd. To this end, there > are three "performance overhead components" involved: > > 1. the amount of time it takes to return from the vmm run ioctl back to > vmd > 2. the time it takes to process the events in the queue > 3. the time it takes to re-enter the vm back to the point after the OUT > instruction that did the queue notify. > > I've benchmarked this before; the biggest performance penalty here is in > #2. The exit and re-entry overheads are minuscule compared to the overhead > of the I/O itself. I think that, in addition to the absolute number of cycles spent on each approach, we should also consider where they are being spent. The kickfd approach is probably more expensive (ignoring TLB flushes, which may end up making it cheaper) than having the vCPU do the actual I/O, but it allows the latter to return much earlier to the Guest, resulting in a significantly lower submission latency and more CPU time available for the VM. > Optimizing #2 is therefore the most bang for the buck. 
We (dlg@ and I) > had a taskq-based diff to do this previously (I'm not sure what state it > was left in). In that diff, the I/O was processed in a set of worker threads, > effectively eliminating the overhead for #2. I had to partially rewrite the vioblk_notifyq function, as it didn't play nice with concurrent access (Guest + iothread), but I haven't implemented any optimization for #2. There's the option of merging requests and using worker threads to avoid having the iothread actively waiting for the I/O, but this would significantly increase the complexity of vmd's virtio code, so it must be done with care (it's the problem behind *many* qemu issues). > Your diff (if I read it correctly), optimizes all 3 cases, and that's good. > I'll admit it's not a way I thought of solving the performance issue, and > I'm not sure if this is the right way or if the taskq way is better. The > taskq way is a bit cleaner and easier to understand (there is no separate > backchannel from vmm->vmd, but it doesn't look too crazy from what I > read below). > > Do you have some performance measurements with this diff applied vs without? Yes, I ran some random read and random write tests with fio. The Guest is Linux, mainly because its virtio implementation is more mature. Next week I'll try with OpenBSD. In a general sense, the kickfd approach is similar or slightly better than the current one when generating I/O from just one process, and significantly better (up to 2x the throughput) with 2 and 4 processes. The test machine is an AMD Phenom II X4 (yeah, not the best thing for this kind of test, but the only non-virtual environment I have here) with a Samsung SSD Evo 850. OpenBSD's kernel is running with HZ=1000 and Linux's with HZ=100 (otherwise I couldn't get reliable numbers due to missing ticks). The target device is backed by a raw image. I'm pasting the raw output for fio below. Sergio (slp). 
== | Linux + kickfd | == - randread [root@localhost ~]# fio --clocksource=clock_gettime --name=randread --rw=randread --bs=4k --filename=/dev/vdb --size=1g --numjobs=1 --ioengine=sync --iodepth=1 --direct=1 --group_reporting randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1 fio-3.3 Starting 1 process Jobs: 1 (f=1): [r(1)][95.7%][r=49.3MiB/s,w=0KiB/s][r=12.6k,w=0 IOPS][eta 00m:01s] randread: (groupid=0, jobs=1): err= 0: pid=318: Fri Oct 19 18:36:02 2018 read: IOPS=11.5k, BW=44.8MiB/s (46.0MB/s)(1024MiB/22850msec) clat (nsec): min=0, max=2k, avg=85372.92, stdev=924160.90 lat (nsec): min=0, max=2k, avg=85525.51, stdev=924971.98 clat percentiles (nsec): | 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], | 20.00th=[ 0], 30.00th=[ 0], 40.00th=[ 0], | 50.00th=[ 0], 60.00th=[ 0], 70.00th=[ 0], | 80.00th=[ 0], 90.00th=[ 0], 95.00th=[ 0], | 99.00th=[ 0], 99.50th=[10027008], 99.90th=[10027008], | 99.95th=[10027008], 99.99th=[10027008]
Re: bgpd throttling for peers
On Fri, Oct 19, 2018 at 05:51:20PM +0200, Claudio Jeker wrote: > On Wed, Oct 17, 2018 at 02:37:48PM +0200, Claudio Jeker wrote: > > I noticed that the throttling for peers which was added some time ago is > > incomplete. The following diff solved these issues. > > > > In rde_update_queue_runner() only process peers which are currently not > > throttled. Additionally only run the runners if not too many imsg are > > pending in the RDE. Both these changes should help imporve responsiveness. > > In the SE only set the throttled flag if the imsg sending was successful. > > > > Does work fine on my systems with little or no effect on convergance time. > > Please test on your systems and look for preformance changes. > > Updated version after recent commit. > OK denis@ > -- > :wq Claudio > > Index: rde.c > === > RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v > retrieving revision 1.437 > diff -u -p -r1.437 rde.c > --- rde.c 18 Oct 2018 12:19:09 - 1.437 > +++ rde.c 19 Oct 2018 15:49:43 - > @@ -334,15 +334,16 @@ rde_main(int debug, int verbose) > mctx = LIST_NEXT(mctx, entry); > } > > - rde_update_queue_runner(); > - for (aid = AID_INET6; aid < AID_MAX; aid++) > - rde_update6_queue_runner(aid); > + if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) { > + rde_update_queue_runner(); > + for (aid = AID_INET6; aid < AID_MAX; aid++) > + rde_update6_queue_runner(aid); > + } > if (rde_dump_pending() && > ibuf_se_ctl && ibuf_se_ctl->w.queued <= 10) > rde_dump_runner(); > - if (softreconfig) { > + if (softreconfig) > rde_reload_runner(); > - } > } > > /* do not clean up on shutdown on production, it takes ages. 
*/ > @@ -2664,6 +2665,8 @@ rde_update_queue_runner(void) > continue; > if (peer->state != PEER_UP) > continue; > + if (peer->throttled) > + continue; > eor = 0; > /* first withdraws */ > wpos = 2; /* reserve space for the length field */ > @@ -2730,6 +2733,8 @@ rde_update6_queue_runner(u_int8_t aid) > continue; > if (peer->state != PEER_UP) > continue; > + if (peer->throttled) > + continue; > len = sizeof(queue_buf) - MSGSIZE_HEADER; > b = up_dump_mp_unreach(queue_buf, &len, peer, aid); > > @@ -2753,6 +2758,8 @@ rde_update6_queue_runner(u_int8_t aid) > if (peer->conf.id == 0) > continue; > if (peer->state != PEER_UP) > + continue; > + if (peer->throttled) > continue; > len = sizeof(queue_buf) - MSGSIZE_HEADER; > r = up_dump_mp_reach(queue_buf, &len, peer, aid); > Index: session.c > === > RCS file: /cvs/src/usr.sbin/bgpd/session.c,v > retrieving revision 1.369 > diff -u -p -r1.369 session.c > --- session.c 29 Sep 2018 07:58:06 - 1.369 > +++ session.c 17 Oct 2018 12:18:51 - > @@ -1382,7 +1382,8 @@ session_sendmsg(struct bgp_msg *msg, str > if (!p->throttled && p->wbuf.queued > SESS_MSG_HIGH_MARK) { > if (imsg_rde(IMSG_XOFF, p->conf.id, NULL, 0) == -1) > log_peer_warn(&p->conf, "imsg_compose XOFF"); > - p->throttled = 1; > + else > + p->throttled = 1; > } > > free(msg); > @@ -1773,7 +1774,8 @@ session_dispatch_msg(struct pollfd *pfd, > if (p->throttled && p->wbuf.queued < SESS_MSG_LOW_MARK) { > if (imsg_rde(IMSG_XON, p->conf.id, NULL, 0) == -1) > log_peer_warn(&p->conf, "imsg_compose XON"); > - p->throttled = 0; > + else > + p->throttled = 0; > } > if (!(pfd->revents & POLLIN)) > return (1); >
Re: vmd: servicing virtio devices from separate processes
On Wed, Oct 17, 2018 at 02:04:46PM -0700, Mike Larkin wrote: > On Fri, Oct 05, 2018 at 01:50:10PM +0200, Sergio Lopez wrote: > > Right now, vmd already features an excellent privsep model to ensure > > the process servicing the VM requests to the outside world is running > > with the lowest possible privileges. > > > > I was wondering if we could take a step further, servicing each virtio > > device from a different process. This design would simplify the > > implementation and maintenance of those devices, improve the privsep > > model and increase the resilience of VMs (the crash of a process > > servicing a device won't bring down the whole VM, and a mechanism to > > recover from this scenario could be explored). > > Our model is generally to not try to recover from crashes like that. Indeed, > you *want* to crash so the underlying bug can be found and fixed. With separate processes you'll still have a crash with a core dump from the process servicing the device, with the additional advantage of being able to, optionally, try to recover the device or debug it independently. > > - An in-kernel IRQ chip. This one is the most controversial, as it > > means emulating a device from a privileged domain, but I'm pretty sure > > a lapic implementation with enough functionality to serve *BSD/Linux > > Guests can be small and simple enough to be easily auditable. > > This needs to be done, for a number of reasons (device emulation being just > one). pd@ and I are working on how to implement this using features in > recent CPUs, since much of the LAPIC emulation can now be handled by the > CPU itself. We're thinking skylake and later will be the line in the sand > for this. Otherwise the software emulation is more complex and more prone > to bugs. I've resisted the urge to put this stuff in the kernel for exactly > that reason, but with later model CPUs we may be in better shape. We may > also decide to focus solely on x2APIC. 
If you're interested in helping in > this area, I'll keep you in the loop. Sure, I'd like to help. I'm quite familiar with KVM's in-kernel irqchip implementation. > > Do you think it's worth exploring this model? What are your feelings > > regarding the in-kernel IRQ chip? > > > > Sergio (slp). > > > > All things considered, I'm less sold on the idea of splitting out devices > into their own processes. I don't see any compelling reason. But we do > need an IOAPIC and LAPIC implementation at some point, as you point out. Well, vmd is still small enough (thankfully) to be able to debug it easily. But, eventually, it'll grow in both size and complexity (especially with SMP support, I/O optimizations, additional storage backends...) and having a high degree of modularity really helps here. In fact, the QEMU/KVM community is starting to consider going this route (but QEMU is *huge*). [1] Anyways, this is not a "now-or-never" thing. As you suggest, we can work now on kickfd and IOAPIC/LAPIC, which are useful by themselves. And when those are in place, I'll be able to write a PoC so we can evaluate its benefits and drawbacks. Sergio (slp). [1] https://www.linux-kvm.org/images/f/fc/KVM_FORUM_multi-process.pdf
igmp_slowtimo: drop NETLOCK
Hi, If we introduce a mutex for the igmp module we can drop the NETLOCK from igmp_slowtimo(). The router_info list, rti_head, and its entries are only ever accessed within the igmp module. The mutex also needs to protect igmp_timers_are_running. This is largely a monkey-see-monkey-do imitation of what visa@ did for frag6.c. Obviously any errors are mine. There are spots in igmp_input_if() where we could push the mutex into smaller sections of the switch statement. I'm under the impression that IGMP is a cooler path, though, so that additional complexity is probably pointless, but I figure it's worth mentioning. Maybe on a later pass. I must admit I only have a small test setup for multipath routing. We also don't have any regress tests for IGMP (no idea how tricky those would be to write). The introduction of the mutex was pretty straightforward, though, given the size of igmp.c and the fact that the mutex is module-local. Thoughts? ok? -Scott P.S. It doesn't look like we need KERNEL_LOCK in igmp_input_if() until rip_input(). Is that fair game, or is there a larger plan to coordinate moving KERNEL_LOCK into various *_input_if() routines later? Index: sys/netinet/igmp.c === RCS file: /cvs/src/sys/netinet/igmp.c,v retrieving revision 1.74 diff -u -p -r1.74 igmp.c --- sys/netinet/igmp.c 18 Oct 2018 15:46:28 - 1.74 +++ sys/netinet/igmp.c 19 Oct 2018 17:07:31 - @@ -1,4 +1,4 @@ -/* $OpenBSD: igmp.c,v 1.74 2018/10/18 15:46:28 cheloha Exp $ */ +/* $OpenBSD: igmp.c,v 1.73 2018/10/18 15:23:04 cheloha Exp $ */ /* $NetBSD: igmp.c,v 1.15 1996/02/13 23:41:25 christos Exp $ */ /* @@ -77,6 +77,8 @@ #include #include +#include +#include #include #include #include @@ -98,7 +100,13 @@ int *igmpctl_vars[IGMPCTL_MAXID] = IGMPCTL_VARS; -intigmp_timers_are_running; +/* + * Protects igmp_timers_are_running and rti_head. Note that rti_head needs + * to be protected when one of its entries is accessed via in_multi->rti. 
+ */ +struct mutex igmp_mutex = MUTEX_INITIALIZER(IPL_SOFTNET); + +int igmp_timers_are_running; static LIST_HEAD(, router_info) rti_head; static struct mbuf *router_alert; struct cpumem *igmpcounters; @@ -150,6 +158,8 @@ rti_fill(struct in_multi *inm) { struct router_info *rti; + MUTEX_ASSERT_LOCKED(&igmp_mutex); + LIST_FOREACH(rti, &rti_head, rti_list) { if (rti->rti_ifidx == inm->inm_ifidx) { inm->inm_rti = rti; @@ -176,6 +186,8 @@ rti_find(struct ifnet *ifp) struct router_info *rti; KERNEL_ASSERT_LOCKED(); + MUTEX_ASSERT_LOCKED(&igmp_mutex); + LIST_FOREACH(rti, &rti_head, rti_list) { if (rti->rti_ifidx == ifp->if_index) return (rti); @@ -195,13 +207,18 @@ rti_delete(struct ifnet *ifp) { struct router_info *rti, *trti; + mtx_enter(&igmp_mutex); + LIST_FOREACH_SAFE(rti, &rti_head, rti_list, trti) { if (rti->rti_ifidx == ifp->if_index) { LIST_REMOVE(rti, rti_list); + mtx_leave(&igmp_mutex); free(rti, M_MRTABLE, sizeof(*rti)); - break; + return; } } + + mtx_leave(&igmp_mutex); } int @@ -271,6 +288,8 @@ igmp_input_if(struct ifnet *ifp, struct m->m_len += iphlen; ip = mtod(m, struct ip *); + mtx_enter(&igmp_mutex); + switch (igmp->igmp_type) { case IGMP_HOST_MEMBERSHIP_QUERY: @@ -282,6 +301,7 @@ igmp_input_if(struct ifnet *ifp, struct if (igmp->igmp_code == 0) { rti = rti_find(ifp); if (rti == NULL) { + mtx_leave(&igmp_mutex); m_freem(m); return IPPROTO_DONE; } @@ -289,6 +309,7 @@ igmp_input_if(struct ifnet *ifp, struct rti->rti_age = 0; if (ip->ip_dst.s_addr != INADDR_ALLHOSTS_GROUP) { + mtx_leave(&igmp_mutex); igmpstat_inc(igps_rcv_badqueries); m_freem(m); return IPPROTO_DONE; @@ -314,6 +335,7 @@ igmp_input_if(struct ifnet *ifp, struct } } else { if (!IN_MULTICAST(ip->ip_dst.s_addr)) { + mtx_leave(&igmp_mutex); igmpstat_inc(igps_rcv_badqueries); m_freem(m); return IPPROTO_DONE; @@ -371,6 +393,7 @@ igmp_input_if(struct ifnet *ifp, struct if (!IN_MULTICAST(igmp->igmp_group.s_addr) || igmp->igmp_group.s_addr != ip->i
Re: bgpd throttling for peers
On Wed, Oct 17, 2018 at 02:37:48PM +0200, Claudio Jeker wrote: > I noticed that the throttling for peers which was added some time ago is > incomplete. The following diff solves these issues. > > In rde_update_queue_runner() only process peers which are currently not > throttled. Additionally only run the runners if not too many imsg are > pending in the RDE. Both these changes should help improve responsiveness. > In the SE only set the throttled flag if the imsg sending was successful. > > Does work fine on my systems with little or no effect on convergence time. > Please test on your systems and look for performance changes. Updated version after recent commit. -- :wq Claudio Index: rde.c === RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v retrieving revision 1.437 diff -u -p -r1.437 rde.c --- rde.c 18 Oct 2018 12:19:09 - 1.437 +++ rde.c 19 Oct 2018 15:49:43 - @@ -334,15 +334,16 @@ rde_main(int debug, int verbose) mctx = LIST_NEXT(mctx, entry); } - rde_update_queue_runner(); - for (aid = AID_INET6; aid < AID_MAX; aid++) - rde_update6_queue_runner(aid); + if (ibuf_se && ibuf_se->w.queued < SESS_MSG_HIGH_MARK) { + rde_update_queue_runner(); + for (aid = AID_INET6; aid < AID_MAX; aid++) + rde_update6_queue_runner(aid); + } if (rde_dump_pending() && ibuf_se_ctl && ibuf_se_ctl->w.queued <= 10) rde_dump_runner(); - if (softreconfig) { + if (softreconfig) rde_reload_runner(); - } } /* do not clean up on shutdown on production, it takes ages. 
*/ @@ -2664,6 +2665,8 @@ rde_update_queue_runner(void) continue; if (peer->state != PEER_UP) continue; + if (peer->throttled) + continue; eor = 0; /* first withdraws */ wpos = 2; /* reserve space for the length field */ @@ -2730,6 +2733,8 @@ rde_update6_queue_runner(u_int8_t aid) continue; if (peer->state != PEER_UP) continue; + if (peer->throttled) + continue; len = sizeof(queue_buf) - MSGSIZE_HEADER; b = up_dump_mp_unreach(queue_buf, &len, peer, aid); @@ -2753,6 +2758,8 @@ rde_update6_queue_runner(u_int8_t aid) if (peer->conf.id == 0) continue; if (peer->state != PEER_UP) + continue; + if (peer->throttled) continue; len = sizeof(queue_buf) - MSGSIZE_HEADER; r = up_dump_mp_reach(queue_buf, &len, peer, aid); Index: session.c === RCS file: /cvs/src/usr.sbin/bgpd/session.c,v retrieving revision 1.369 diff -u -p -r1.369 session.c --- session.c 29 Sep 2018 07:58:06 - 1.369 +++ session.c 17 Oct 2018 12:18:51 - @@ -1382,7 +1382,8 @@ session_sendmsg(struct bgp_msg *msg, str if (!p->throttled && p->wbuf.queued > SESS_MSG_HIGH_MARK) { if (imsg_rde(IMSG_XOFF, p->conf.id, NULL, 0) == -1) log_peer_warn(&p->conf, "imsg_compose XOFF"); - p->throttled = 1; + else + p->throttled = 1; } free(msg); @@ -1773,7 +1774,8 @@ session_dispatch_msg(struct pollfd *pfd, if (p->throttled && p->wbuf.queued < SESS_MSG_LOW_MARK) { if (imsg_rde(IMSG_XON, p->conf.id, NULL, 0) == -1) log_peer_warn(&p->conf, "imsg_compose XON"); - p->throttled = 0; + else + p->throttled = 0; } if (!(pfd->revents & POLLIN)) return (1);
Re: multicast.4: drop unneeded (void *) casts
On Thu, Oct 18, 2018 at 05:38:42PM -0900, Philip Guenther wrote: > On Thu, Oct 18, 2018 at 4:00 PM wrote: > > > getsockopt(2) and setsockopt(2) take a void pointer for the third > > parameter so there is no need for any of these casts. > > > > Yep. > > > > I can do other style(9)-type cleanup in a subsequent diff or I can > > cook this diff and do it all in one. > > > > > Thoughts, preferences? > > > > I would prefer line-wrapping adjustment that's enabled by the deletion of > the casts be done in the same commit. Those to change, IMHO, called out > below. > > > [...] Sure, looks cleaner. ok? Index: share/man/man4/multicast.4 === RCS file: /cvs/src/share/man/man4/multicast.4,v retrieving revision 1.12 diff -u -p -r1.12 multicast.4 --- share/man/man4/multicast.4 7 Mar 2018 09:54:23 - 1.12 +++ share/man/man4/multicast.4 19 Oct 2018 15:15:21 - @@ -138,17 +138,17 @@ or disable multicast forwarding in the k .Bd -literal -offset 5n /* IPv4 */ int v = 1;/* 1 to enable, or 0 to disable */ -setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, (void *)&v, sizeof(v)); +setsockopt(mrouter_s4, IPPROTO_IP, MRT_INIT, &v, sizeof(v)); .Ed .Bd -literal -offset 5n /* IPv6 */ int v = 1;/* 1 to enable, or 0 to disable */ -setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, (void *)&v, sizeof(v)); +setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_INIT, &v, sizeof(v)); \&... 
/* If necessary, filter all ICMPv6 messages */ struct icmp6_filter filter; ICMP6_FILTER_SETBLOCKALL(&filter); -setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, (void *)&filter, +setsockopt(mrouter_s6, IPPROTO_ICMPV6, ICMP6_FILTER, &filter, sizeof(filter)); .Ed .Pp @@ -168,8 +168,7 @@ memcpy(&vc.vifc_lcl_addr, &vif_local_add if (vc.vifc_flags & VIFF_TUNNEL) memcpy(&vc.vifc_rmt_addr, &vif_remote_address, sizeof(vc.vifc_rmt_addr)); -setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, (void *)&vc, - sizeof(vc)); +setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_VIF, &vc, sizeof(vc)); .Ed .Pp The @@ -205,8 +204,7 @@ memset(&mc, 0, sizeof(mc)); mc.mif6c_mifi = mif_index; mc.mif6c_flags = mif_flags; mc.mif6c_pifi = pif_index; -setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, (void *)&mc, - sizeof(mc)); +setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_ADD_MIF, &mc, sizeof(mc)); .Ed .Pp The @@ -226,13 +224,13 @@ A multicast interface is deleted by: .Bd -literal -offset indent /* IPv4 */ vifi_t vifi = vif_index; -setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, (void *)&vifi, +setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_VIF, &vifi, sizeof(vifi)); .Ed .Bd -literal -offset indent /* IPv6 */ mifi_t mifi = mif_index; -setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, (void *)&mifi, +setsockopt(mrouter_s6, IPPROTO_IPV6, MRT6_DEL_MIF, &mifi, sizeof(mifi)); .Ed .Pp @@ -301,8 +299,7 @@ memcpy(&mc.mfcc_mcastgrp, &group_addr, s mc.mfcc_parent = iif_index; for (i = 0; i < maxvifs; i++) mc.mfcc_ttls[i] = oifs_ttl[i]; -setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC, - (void *)&mc, sizeof(mc)); +setsockopt(mrouter_s4, IPPROTO_IP, MRT_ADD_MFC, &mc, sizeof(mc)); .Ed .Bd -literal -offset indent /* IPv6 */ @@ -314,8 +311,7 @@ mc.mf6cc_parent = iif_index; for (i = 0; i < maxvifs; i++) if (oifs_ttl[i] > 0) IF_SET(i, &mc.mf6cc_ifset); -setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_ADD_MFC, - (void *)&mc, sizeof(mc)); +setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_ADD_MFC, &mc, sizeof(mc)); .Ed .Pp The @@ -344,8 
+340,7 @@ struct mfcctl mc; memset(&mc, 0, sizeof(mc)); memcpy(&mc.mfcc_origin, &source_addr, sizeof(mc.mfcc_origin)); memcpy(&mc.mfcc_mcastgrp, &group_addr, sizeof(mc.mfcc_mcastgrp)); -setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC, - (void *)&mc, sizeof(mc)); +setsockopt(mrouter_s4, IPPROTO_IP, MRT_DEL_MFC, &mc, sizeof(mc)); .Ed .Bd -literal -offset indent /* IPv6 */ @@ -353,8 +348,7 @@ struct mf6cctl mc; memset(&mc, 0, sizeof(mc)); memcpy(&mc.mf6cc_origin, &source_addr, sizeof(mc.mf6cc_origin)); memcpy(&mc.mf6cc_mcastgrp, &group_addr, sizeof(mf6cc_mcastgrp)); -setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_DEL_MFC, - (void *)&mc, sizeof(mc)); +setsockopt(mrouter_s4, IPPROTO_IPV6, MRT6_DEL_MFC, &mc, sizeof(mc)); .Ed .Pp The following method can be used to get various statistics per @@ -453,7 +447,7 @@ and An example: .Bd -literal -offset 3n uint32_t v; -getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, (void *)&v, sizeof(v)); +getsockopt(sock, IPPROTO_IP, MRT_API_SUPPORT, &v, sizeof(v)); .Ed .Pp This would set @@ -478,10 +472,8 @@ would fail. To modify the API, and to set some specific feature in the kernel, then: .Bd -literal -offset 3n uint32_t v = MRT_MFC_FLAGS_DISABLE_WRONGVIF; -if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, (void *)&v, sizeof(v)) -!= 0) { +if (setsockopt(sock, IPPROTO_IP, MRT_API_CONFIG, &v, sizeof(v)
Re: bypass support for iommu on sparc64
Is the setup and teardown per transfer or when the file is opened and closed? Or is it set up once per context switch of task? I am partly interested because I would like to improve mt one day (as a user of tape and Sparc64 Txxx) if I get the time. Andrew On Fri, 19 Oct 2018 at 10:22, Mark Kettenis wrote: > > Date: Fri, 19 Oct 2018 10:22:30 +1000 > > From: David Gwynne > > > > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: > > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: > > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes), > > > > setting up and tearing down the translation table entries (TTEs) > > > > is very expensive. so expensive that the cost of doing it for disk > > > > io has a noticeable impact on compile times. > > > > > > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide > > > > to bypass the iommu for devices that set that flag, therefore > > > > avoiding the cost of handling the TTEs. > > > > > > > > the following diff adds support for bypass mappings to the iommu > > > > code on sparc64. it's based on a diff from kettenis@ back in 2009. > > > > the main changes are around coping with the differences between > > > > schizo/psycho and fire/oberon. > > > > > > > > the differences between the chips are now represented by a iommu_hw > > > > struct. these differences include how to enable the iommu (now via > > > > a function pointer), and masks for bypass addresses. > > > > > > > > ive tested this on oberon (on an m4000) and schizo (on a v880). > > > > however, the bypass code isnt working on fire (v245s). to cope with > > > > that for now, the iommu_hw struct lets drivers mask flag bits that > > > > are handled when creating a dmamap. this means fire boards will > > > > ignore BUS_DMA_64BIT until i can figure out whats wrong with them. > > > > > > i figured it out. it turns out Fire was working fine. 
however, > > > enabling 64bit dva on the onboard devices didnt work because the > > > serverworks/broadcom pcie to pcix bridge can only handle dma addresses > > > in the low 40 bits. because the fire bypass window is higher than > > > this, the bridge would choke and things stopped working. > > > > > > the updated diff attempts to handle this. basically when probing > > > the bridge, the platform creates a custom dma tag for it. this tag > > > intercets bus_dmamap_create and clears the BUS_DMA_64BIT flag before > > > handing it up to the parent bridge, which is pyro in my situation. > > > it looks like early sun4v boxes could make use of this too. > > > > > > > i have not tested this on psycho yet. if anyone has such a machine > > > > and is willing to work with me to figure it out, please talk to me. > > > > > > i still dont have psycho reports. > > > > Would anyone object if I committed this? I've been running it for the > > last release or two without issues, but with significant improvements in > > performance on the machines involved. > > At the price of giving all PCI devices unrestricted access to memory. > > So I'm not eager to this, especially since on sun4v hardware bypassing > the iommu isn't possible as soon as multiple domains are enabled. And > we lose a useful diagnostic when developing drivers. Are you sure the > iommu overhead can't be reduced some other way? At some point we > probably want to add iommu support on amd64 and arm64, but if that > comes with a similar overhead as on sparc64 that's going to be a bit > of an issue. 
> > > > Index: dev/iommu.c > > > === > > > RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v > > > retrieving revision 1.74 > > > diff -u -p -r1.74 iommu.c > > > --- dev/iommu.c 30 Apr 2017 16:45:45 - 1.74 > > > +++ dev/iommu.c 10 May 2017 12:00:09 - > > > @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm > > > void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t, > > > bus_addr_t, bus_size_t, int); > > > > > > +void iommu_hw_enable(struct iommu_state *); > > > + > > > +const struct iommu_hw iommu_hw_default = { > > > + .ihw_enable = iommu_hw_enable, > > > + > > > + .ihw_dvma_pa= IOTTE_PAMASK, > > > + > > > + .ihw_bypass = 0x3fffUL << 50, > > > + .ihw_bypass_nc = 0, > > > + .ihw_bypass_ro = 0, > > > +}; > > > + > > > +void > > > +iommu_hw_enable(struct iommu_state *is) > > > +{ > > > + IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb); > > > + IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16)); > > > +} > > > + > > > /* > > > * Initiate an STC entry flush. > > > */ > > > @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb > > > * - create a private DVMA map. > > > */ > > > void > > > -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t > iovabase) > > > +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state > *is, > > > +int tsbsize, u_int32_t iovabase) > > > { > > > psize_t si
Re: bypass support for iommu on sparc64
> Date: Fri, 19 Oct 2018 10:22:30 +1000 > From: David Gwynne > > On Wed, May 10, 2017 at 10:09:59PM +1000, David Gwynne wrote: > > On Mon, May 08, 2017 at 11:03:58AM +1000, David Gwynne wrote: > > > on modern sparc64s (think fire or sparc enterprise Mx000 boxes), > > > setting up and tearing down the translation table entries (TTEs) > > > is very expensive. so expensive that the cost of doing it for disk > > > io has a noticable impact on compile times. > > > > > > now that there's a BUS_DMA_64BIT flag, we can use that to decide > > > to bypass the iommu for devices that set that flag, therefore > > > avoiding the cost of handling the TTEs. > > > > > > the following diff adds support for bypass mappings to the iommu > > > code on sparc64. it's based on a diff from kettenis@ back in 2009. > > > the main changes are around coping with the differences between > > > schizo/psycho and fire/oberon. > > > > > > the differences between the chips are now represented by a iommu_hw > > > struct. these differences include how to enable the iommu (now via > > > a function pointer), and masks for bypass addresses. > > > > > > ive tested this on oberon (on an m4000) and schizo (on a v880). > > > however, the bypass code isnt working on fire (v245s). to cope with > > > that for now, the iommu_hw struct lets drivers mask flag bits that > > > are handled when creating a dmamap. this means fire boards will > > > ignore BUS_DMA_64BIT until i can figure out whats wrong with them. > > > > i figured it out. it turns out Fire was working fine. however, > > enabling 64bit dva on the onboard devices didnt work because the > > serverworks/broadcom pcie to pcix bridge can only handle dma addresses > > in the low 40 bits. because the fire bypass window is higher than > > this, the bridge would choke and things stopped working. > > > > the updated diff attempts to handle this. basically when probing > > the bridge, the platform creates a custom dma tag for it. 
this tag > > intercepts bus_dmamap_create and clears the BUS_DMA_64BIT flag before > > handing it up to the parent bridge, which is pyro in my situation. > > it looks like early sun4v boxes could make use of this too. > > > > > i have not tested this on psycho yet. if anyone has such a machine > > > and is willing to work with me to figure it out, please talk to me. > > > > i still dont have psycho reports. > > Would anyone object if I committed this? I've been running it for the > last release or two without issues, but with significant improvements in > performance on the machines involved. At the price of giving all PCI devices unrestricted access to memory. So I'm not eager to do this, especially since on sun4v hardware bypassing the iommu isn't possible as soon as multiple domains are enabled. And we lose a useful diagnostic when developing drivers. Are you sure the iommu overhead can't be reduced some other way? At some point we probably want to add iommu support on amd64 and arm64, but if that comes with a similar overhead as on sparc64 that's going to be a bit of an issue. 
> > Index: dev/iommu.c > > === > > RCS file: /cvs/src/sys/arch/sparc64/dev/iommu.c,v > > retrieving revision 1.74 > > diff -u -p -r1.74 iommu.c > > --- dev/iommu.c 30 Apr 2017 16:45:45 - 1.74 > > +++ dev/iommu.c 10 May 2017 12:00:09 - > > @@ -100,6 +100,25 @@ void iommu_iomap_clear_pages(struct iomm > > void _iommu_dvmamap_sync(bus_dma_tag_t, bus_dma_tag_t, bus_dmamap_t, > > bus_addr_t, bus_size_t, int); > > > > +void iommu_hw_enable(struct iommu_state *); > > + > > +const struct iommu_hw iommu_hw_default = { > > + .ihw_enable = iommu_hw_enable, > > + > > + .ihw_dvma_pa= IOTTE_PAMASK, > > + > > + .ihw_bypass = 0x3fffUL << 50, > > + .ihw_bypass_nc = 0, > > + .ihw_bypass_ro = 0, > > +}; > > + > > +void > > +iommu_hw_enable(struct iommu_state *is) > > +{ > > + IOMMUREG_WRITE(is, iommu_tsb, is->is_ptsb); > > + IOMMUREG_WRITE(is, iommu_cr, IOMMUCR_EN | (is->is_tsbsize << 16)); > > +} > > + > > /* > > * Initiate an STC entry flush. > > */ > > @@ -125,7 +144,8 @@ iommu_strbuf_flush(struct strbuf_ctl *sb > > * - create a private DVMA map. > > */ > > void > > -iommu_init(char *name, struct iommu_state *is, int tsbsize, u_int32_t > > iovabase) > > +iommu_init(char *name, const struct iommu_hw *ihw, struct iommu_state *is, > > +int tsbsize, u_int32_t iovabase) > > { > > psize_t size; > > vaddr_t va; > > @@ -149,13 +169,9 @@ iommu_init(char *name, struct iommu_stat > > * be hard-wired, so we read the start and size from the PROM and > > * just use those values. > > */ > > - if (strncmp(name, "pyro", 4) == 0) { > > - is->is_cr = IOMMUREG_READ(is, iommu_cr); > > - is->is_cr &= ~IOMMUCR_FIRE_BE; > > - is->is_cr |= (IOMMUCR_FIRE_SE | IOMMUCR_FIRE_CM_EN | > > - IOMMUCR_FIRE_TE); > > - } else > > -
Re: multicast.4: drop unneeded (void *) casts
OK kn with guenther's suggestions in.