Re: Forcing a USB device to "ugen"
> On Mar 26, 2024, at 2:41 PM, Thor Lancelot Simon wrote: > > I don't think this can be safely allowed at security level > 0, unless, > perhaps, it's restricted from working on devices that would match disk > drivers. The driver being asked to detach could certainly refuse to do so based on why it’s being asked. -- thorpej
Re: Forcing a USB device to "ugen"
> On Mar 26, 2024, at 10:18 AM, Jason Thorpe wrote: > > Like I said in a previous message on this thread, I think the better model is > for ugen to always exist for every USB device / interface, and for the kernel > driver to attach there. Then it’s easy to just say “detach whatever kernel > driver is here [assuming it’s not busy providing some other service] so that > user-space can take over”, and “re-attach any kernel driver that might match > this device now that I’m done”. I guess I should explicitly state the other part of this, which is “and anything other than totally benign control pipe transfers without having CLAIM’d the device/interface must return EBUSY”. It’s the act of CLAIM’ing the device/interface that detaches the kernel driver and the act of RELEASE’ing it that causes the kernel to attempt to re-attach one. -- thorpej
Re: Forcing a USB device to "ugen"
> On Mar 26, 2024, at 9:31 AM, Brian Buhrow wrote: > > Isn't it possible to do most of what Jason proposes by using the drvctl > interface to > detach a driver from a specific USB device? Then, some glue could be added > to the ugen driver > to allow it to be attached to arbitrary devices using the same drvctl > interface? That seems a > lot easier to me than building a registry of devices and device IDs, which > will be woefully > out of date before it gets published. It also has the advantage of allowing > the user to do > creative stuff that the developers didn't think of. Am I missing something > obvious? It’s absolutely not easier, because there’s not really an easy way, once the kernel driver is detached, to say “attach ugen where that other thing was, AND ONLY THERE”. Like I said in a previous message on this thread, I think the better model is for ugen to always exist for every USB device / interface, and for the kernel driver to attach there. Then it’s easy to just say “detach whatever kernel driver is here [assuming it’s not busy providing some other service] so that user-space can take over”, and “re-attach any kernel driver that might match this device now that I’m done”. -- thorpej
Re: Forcing a USB device to "ugen"
> On Mar 26, 2024, at 5:37 AM, Robert Swindells wrote: > > I thought that all FTDI devices provided JTAG etc. functionality, just > that the pins are not connected to anything in some devices. I guess it depends on how your individual FTDI board is wired up. Also, for chips without the MPSSE, you need something else. OpenOCD supports JTAG over the FT232R, for example, by bit-banging with the various RS232 signals (RXD, TXD, RTS, CTS, DTR, DCD). Anyway, as I noted in my original email, my particular board has one of the 2232 interfaces intended for use as a UART (and connected to a header that labels it for that purpose), and the second interface brought out two 4 different headers labeled for JTAG, SWD, SPI, and I2C. > What would be wrong with attaching an ugen to interface 1 instead of > an ucom in the ftdi driver itself? ugen can’t currently attach to things other than uhub. I think attaching ugen as the leaf is the wrong model, though; ugen should be what the kernel drivers themselves attach to, IMO. -- thorpej
Re: Forcing a USB device to "ugen"
> On Mar 26, 2024, at 2:49 AM, Martin Husemann wrote: > It is also not *that* intrusive as it may sound at first look: > basically we need a central registry that collects all the > identification data (vid,pid,strings and what have you) plus the parent > and the device pointer, and a flag if this is a device claimed by some > driver or one of the currently visible ugen* things. I was thinking about this while sipping my first cup of coffee this morning.. I think the right model is for a “ugen” or what-have-you attach to uhub, and then for the kernel drivers to attach to ugen (either claim the whole device or attach to individual interfaces). -- thorpej
Forcing a USB device to "ugen"
I have a device based on the FTDI FT2232C: [ 3285.311079] uftdi1 at uhub1 port 2 configuration 1 interface 0 [ 3285.311079] uftdi1: SecuringHardware.com (0x0403) Tigard V1.1 (0x6010), rev 2.00/7.00, addr 3 [ 3285.311079] ucom1 at uftdi1 portno 1 [ 3285.311079] uftdi2 at uhub1 port 2 configuration 1 interface 1 [ 3285.311079] uftdi2: SecuringHardware.com (0x0403) Tigard V1.1 (0x6010), rev 2.00/7.00, addr 3 [ 3285.311079] ucom2 at uftdi2 portno 2 It's a combo device that, in addition to a standard TTL-level UART, has a bunch of break-out headers for doing things like SPI, SWD, and JTAG (in my case, I need to use JTAG for programming some Atmel CPLDs). I should be able to do this with OpenOCD (pkgsrc/devel/openocd), but libfdti1 fails to find the device because libusb1 only deals in "ugen". "ugenif" might have been a possible solution here, except for the fact that 0x0403,0x6010 is the standard VID,PID for the FTDI FT2232C, and I don't want "interface 1" of ALL FT2232C devices to get the "ugen" treatment. The desire to use "ugen" on "interface 1" is not a property of 0x0403,0x6010, it's really a property of "SecuringHardware.com","Tigard V1.1". Unfortunately, there's isn't a way to express that in the kernel config syntax. I think my only short-term option here is to, in uftdi_match(), specifically reject based on this criteria: - VID == 0x0403 - PID == 0x6010 - interface number == 1 - vendor string == "SecuringHardware.com" - product string == "Tigard V1.1" (It's never useful, on this particular board, to use the second port as a UART.) -- thorpej
Re: netlink protocol
> On Feb 16, 2024, at 3:16 PM, Rui-Xiang Guo wrote: > > Hi, > There are some go modules use this syscall and marked for Linux only. > This protocol also appeared in FreeBSD 13.2. > Is there any plan to port it? Or any other method to implement it? Netlink by itself isn't terribly useful (and it's pretty gross, honestly). Kernel subsystems also have to interface with it. What do these Go modules implement? Do you need to run an application that uses one of them? -- thorpej
Re: Fixing, reestoring DEC FDDI (DEFPA/DEFEA/DEFTA/DEFQA) in 8.2, 9.2; restore to -current?
> On Feb 11, 2024, at 1:29 PM, Jonathan Stone wrote: > Turns out that fddi_ifattach() is broken in 8.2 and 9.2. It never > initialises sc_ec.ec_lock, which causes a panic the fisrt time the kernel > tries to add a multicast address to the interface. If you're using IPv6 (I > don't; I comment out "options INET6"), the panic occurs soon after boot when > ipv6 discovery starts. ^^^ Important context. > I would like to restore "pdq" (fpa, fea, fta, whatever a qbus attachment is; > or write one if none) to -current. > However, 10_RC4. doesn't even have if_fddisubr.c. Right, it was removed from -current before netbsd-10 branched after some discussion. Same with Token Ring, for the same reason ... a bunch of apparently unused code that had no work done to make MP-safe improvements like the Ethernet code received, and the work hadn't been done because, well, no one was apparently using it. Looks like I was right, because (a) no one screamed when it disappeared, and (b) when someone tried an older version that was still around, it blew up in their face. :-) Anyway, having that unmaintained code lying about introduces a practical barrier to making further improvements to the networking code, especially when those improvements introduce changes to the contracts between the layers. In the case of the FDDI code, there's the additional complication that the PDQ driver is ... well, it's something! Namely, a maze of twisty #ifdefs, all alike, where you stand a very good chance of being eaten by a grue. > I don;t want to re-create the hack of having two different initialisers for > the IEE 802 (sic) [*] portions of "struct ethercom'. > A cleaner solution is to declare a new struct with all the members of 'struct > ethercom', except the 'struct ifnet ec_if; > 'struct ethercom' then becomes a struct with two members: a struct ifnet, and > the new struct (struct iee802_common?). > That allows clean separation of code which manipulates the additions in > today's "struct ethercom', from code which also manipulates struct ifnet. > > Thoughts? Anyone actively against PR'ing and (hopefully) minimal patches > NetBSD-8 and NetBSD-9? > Or against restoring FDDI to -current. (and perhaps backporting to NetBSD-10)? > If I have to, I can probably ship a pair of DEFTAs to an interested > contributor, if support from me is too tenuous. > > [*] FDDI is not IEEE 802. But it's derived from Token Ring,, 802.5, which is. > And I suppose the refactoring I'm proposing here could add supporing Token > Ring, if anyone actively wanted to... See also about Token Ring above :-) I am not at all opposed to resurrecting this stuff, doing a re-factor to make it easier to maintain going forward, etc. If someone wants to volunteer to do that work (and then actually DO it), who am I to say no? After all, I love obsolete technology as much as (and quite possibly more than) the next guy! *Stares in 6800.* But I would prefer we not return to the previous state where the code went completely unmaintained and unused. -- thorpej
Re: MNT Reform2 USB LCP flash
> On Jan 26, 2024, at 2:41 AM, Robert Elz wrote: > >Date:Fri, 26 Jan 2024 09:26:38 - (UTC) >From:mlel...@serpens.de (Michael van Elst) >Message-ID: > > | Fortunately the drive geometry isn't really used anywhere. All > | accesses just use the logical block addresses. > > I have been meaning to suggest for ages that we remove all the > geometry nonsense from everywhere in the kernel, except those > drivers that actually need it - those should be responsible for > converting block numbers to CHS in a way that works for thej, > if they really need it (ancient ide drives before LBA addressing, > vax massbus drives, sun xd drives ... anything like that which > almost no-one has seen in decades). > > It is just bizarre to see ssd and even nvme 'drives' claiming > to have cylinders and heads! 10% agree. Alas, FFS's whole schtick is caring about drive geometry, so we kind of need to fake up something for newfs on such drives (and we should do it in a generic way so the code isn’t replicated in a million different places), and have some way of getting the real info for drives where it actually does exist. -- thorpej
Re: _KERNEL_OPT and 0x6e074def
> On Dec 19, 2023, at 10:33 AM, Jason Thorpe wrote: > > >> On Dec 19, 2023, at 10:01 AM, Edgar Fuß wrote: >> >> 0x6e074def > > the-ripe-vessel:thorpej 99$ grep UNDEFINED * > mkheaders.c:#define UNDEFINED ('n' << 24 | 0 << 20 | 't' << 12 | 0xdefU) > mkheaders.c:/* Value for defined options with value UNDEFINED */ > mkheaders.c: return (unsigned int)(h != UNDEFINED ? h : DEFINED); > mkheaders.c: fprint_global(fp, dl->dl_name, UNDEFINED); Oops, forgot the important part: the-ripe-vessel:thorpej 98$ pwd /nbsd/src/usr.bin/config -- thorpej
Re: _KERNEL_OPT and 0x6e074def
> On Dec 19, 2023, at 10:01 AM, Edgar Fuß wrote: > > 0x6e074def the-ripe-vessel:thorpej 99$ grep UNDEFINED * mkheaders.c:#define UNDEFINED ('n' << 24 | 0 << 20 | 't' << 12 | 0xdefU) mkheaders.c:/* Value for defined options with value UNDEFINED */ mkheaders.c: return (unsigned int)(h != UNDEFINED ? h : DEFINED); mkheaders.c: fprint_global(fp, dl->dl_name, UNDEFINED); -- thorpej
Re: __futex(2): use outside Linux compat
> On Dec 11, 2023, at 5:00 AM, Robert Swindells wrote: > >> Is it OK to use for NetBSD "native" code since it is not "advertised" >> by a man page? > > No. Heavy asterisk here. Thierry, let’s chat off-list. -- thorpej
Re: kern/57691: NFS client regression with macOS 14 server
> On Dec 8, 2023, at 10:55 AM, Jason Thorpe wrote: > > Probably what’s going on is that the server is verifying the directory cookie > more strictly than before. Those two lines that pack the cookieverf should > be inserting 0s if uno_offset is 0. Just confirmed by code inspection that FreeBSD always sends a 0 cookie verifier for uio_offset 0. if (cookie.qval == 0) { *tl++ = 0; *tl++ = 0; } else { (From their nfsrpc_readdirplus().) -- thorpej
Re: kern/57691: NFS client regression with macOS 14 server
> On Nov 10, 2023, at 11:10 AM, schm...@netbsd.org wrote: > > tcpdump of "ls" after the problem has manifested: > https://netbsd.schmonz.com/tmp/nfs/ls-after-the-problem-tcpdump.txt The tcpdump shows: 16:16:35.683996 IP 10.0.2.2.shilp > 10.0.2.15.exp2: Flags [P.], seq 1573:1693, ack 1508, win 65535, length 120: NFS reply xid 834502658 reply ok 116 readdir ERROR: READDIR/READDIRPLUS cookie is stale Looking at the code that issues READDIRPLUS in the kernel in nfs_readdirplusrpc(): if (nmp->nm_iflag & NFSMNT_SWAPCOOKIE) { txdr_swapcookie3(uiop->uio_offset, tl); } else { txdr_cookie3(uiop->uio_offset, tl); } tl += 2; *tl++ = dnp->n_cookieverf.nfsuquad[0]; *tl++ = dnp->n_cookieverf.nfsuquad[1]; *tl++ = txdr_unsigned(nmp->nm_readdirsize); I think the cookie verifier is wrong. See: https://www.rfc-editor.org/rfc/rfc1813#page-81 cookieverf This should be set to 0 on the first request to read a directory. On subsequent requests, it should be a cookieverf as returned by the server. The cookieverf must match that returned by the READDIRPLUS call in which the cookie was acquired. Probably what’s going on is that the server is verifying the directory cookie more strictly than before. Those two lines that pack the cookieverf should be inserting 0s if uno_offset is 0. -- thorpej
Re: cv_fdrestart()
> On Oct 14, 2023, at 6:38 AM, Taylor R Campbell > wrote: > >> Date: Fri, 13 Oct 2023 13:41:50 -0700 >> From: Jason Thorpe >> >>> On Oct 13, 2023, at 11:48 AM, Andrew Doran wrote: >>> >>> Add cv_fdrestart() (better name suggestions welcome): >> >> Oof. > > Why do we need this at all? Indeed. > Condition variable semantics is standard, well-studied, and > well-understood. This is a foundational API that essentially every > kernel subsystem relies integrally on, and it strikes me as a mistake > to tie it in with file descriptors or the ERESTART mechanism. > > I would ask that a controversial change like this be reverted until we > have had substantive discussion giving a compelling reason to change > such a foundational API to add custom, nonstandard semantics wiring it > up to file descriptors and our ERESTART mechanism. > > We can just do > >foo->flags |= FOO_RESTART; >cv_broadcast(cv); > > and then the corresponding wait logic can do > >if (foo->flags & FOO_RESTART) >return ERESTART; >cv_wait(cv); >goto retry; See also eventfd_wait(). My main objection was “putting fd-specific restart logic in condvars”. Of course it would be better to not pollute condvars at all. -- thorpej
cv_fdrestart()
> On Oct 13, 2023, at 11:48 AM, Andrew Doran wrote: > > Module Name: src > Committed By: ad > Date: Fri Oct 13 18:48:56 UTC 2023 > > Modified Files: > src/sys/kern: kern_condvar.c kern_sleepq.c > src/sys/rump/librump/rumpkern: locks.c locks_up.c > src/sys/sys: condvar.h lwp.h > > Log Message: > Add cv_fdrestart() (better name suggestions welcome): Oof. I’d suggest doing something like: void cv_broadcast_cb(kcondvar_t *cv, void (*callback)(lwp_t *)) { . . . } . . . to make this a generic mechanism, rather that something so specialized for the few call sites that need this behavior. > > Like cv_broadcast(), but make any LWPs that share the same file descriptor > table as the caller return ERESTART when resuming. Used to dislodge LWPs > waiting for I/O that prevent a file descriptor from being closed, without > upsetting access to the file (not descriptor) made from another direction. > > > To generate a diff of this commit: > cvs rdiff -u -r1.59 -r1.60 src/sys/kern/kern_condvar.c > cvs rdiff -u -r1.83 -r1.84 src/sys/kern/kern_sleepq.c > cvs rdiff -u -r1.86 -r1.87 src/sys/rump/librump/rumpkern/locks.c > cvs rdiff -u -r1.12 -r1.13 src/sys/rump/librump/rumpkern/locks_up.c > cvs rdiff -u -r1.17 -r1.18 src/sys/sys/condvar.h > cvs rdiff -u -r1.227 -r1.228 src/sys/sys/lwp.h > > Please note that diffs are not public domain; they are subject to the > copyright notices on the relevant files. > -- thorpej
Re: Maxphys on -current?
> On Aug 3, 2023, at 2:19 PM, Brian Buhrow wrote: > > hello. I know that this has ben a very long term project, but I'm wondering > about the > status of this effort? I note that FreeBSD-13 has a Maxphys value of 1048576 > bytes. > Have we found other ways to get more throughput from ATA disks that obviate > the need for this > setting which I'm not aware of? > If not, is anyone working on this project? The wiki page says the project is > stalled. If someone does pick this up, I think it would be a good idea to start from scratch, because MAXPHYS, as it stands, is used for multiple things. Thankfully, I think it would be relatively straightforward to do the work that I am suggesting incrementally. Here goes... MAXPHYS is really supposed to be “maximum transfer via physio”, which is the code path you use when you open /dev/rsd0e and read/write to it. The user-space pages are wired and mapped into the kernel for the purpose of doing I/O. MAXPHYS is a per-architecture constant because some systems have different constraints as to how much KVA space can be used for that at any given time. Unfortunately, some of the adjacent physio machinery (e.g. minphys()) is also used for other purposes, specifically to clamp I/O sizes to constraints defined by the physical device and/or the controllers / busses they’re connected to. Logically, these are two totally separate things, and IMO they should be cleanly separated. What we *should* have is the notion of “I/O parameters” that are defined by the device… max I/O size, max queue depth, preferred I/O size, preferred I/O alignment, physical block size, logical block size, etc. The base values for these parameters should come from the leaf device (e.g.. the disk), and then be clamped as needed by it’s connective tissue (the controller, the system bus the controller is connected to, and ultimately the platform-specific e.g. DMA constraints). The the interface layers (the page cache / UBC, the traditional block I/O buffer cache, and the physio interface for user-space) can further impose their own constraints, as necessary per their API contract. There is zero reason that MAXPHYS should impact the maximum I/O that a file system can do via the UBC, for example. (In a perfect world, we wouldn’t even have to consume virtual address space to bring data and and out of the page cache / UBC, because we already know the physical addresses of the pages that are being pulled in / cleaned.) > Any thoughts or news would be greatly appreciated. Anyway, there are mine :-) -- thorpej
Re: kcmp(2)
> On Jul 20, 2023, at 11:01 AM, Mouse wrote: > Don't cloner instances differ in minor number? [...] >>> Not that I?m aware of. [...] >> Well, as noted in this thread, traditionally you can tell when two >> files are the same by examining stat results. > > Maybe that should be a hint. See below. > >> And the cloner mechanism replaced an older scheme where you had to >> pick the number of instances you wanted, and unless I'm >> misremembering badly in that world each had to have its own minor >> number. > > That's my understanding as well. > >> That said, it almost certainly isn't important... > > Well, if it means that with a minor tweak NetBSD could have kcmp(3) > instead of kcmp(2), it could be. If you read the Linux kcmp(2) manual page, you’ll see that it’s a lot more than just file descriptors. -- thorpej
Re: kcmp(2)
> On Jul 18, 2023, at 8:51 PM, David Holland wrote: > > On Tue, Jul 18, 2023 at 03:25:02PM -0700, Jason Thorpe wrote: >> That *might* work in this particular case, but it would not work >> for e.g. a cloning device where you get additional descriptors via >> dup() or whatever. > > Don't cloner instances differ in minor number? If not, shouldn't they? Not that I’m aware of. They result in a new file object with a new private data pointer, but they don’t change the minor number and I don’t see why forcing them to do so would be such a good idea. What if you had a single driver (that consumes a major # slot) that wanted to provide two cloning interfaces? If each clone got its own minor #, then you’d be artificially limiting how many could be created. -- thorpej
Re: kcmp(2)
> On Jul 18, 2023, at 2:57 PM, Mouse wrote: > >>> What does Mesa use kcmp(2) for? >> Working out whether two device handles are to the same DRM device. > >>> Is there another way to accomplish what Mesa is trying to use it for? >> I don't know of one. > > Is fstat() plus checking st_rdev (and possibly other fields) > insufficient? That *might* work in this particular case, but it would not work for e.g. a cloning device where you get additional descriptors via dup() or whatever. -- thorpej
Re: [PATCH] style(5): No struct typedefs
> On Jul 14, 2023, at 2:56 AM, Martin Husemann wrote: > > The typedef itself buys you nothing but a few charactes less to type, No, that’s not correct. For a type that’s meant to be “opaque” (i.e. “users of the interface should not have to know or care about any of the implementation details, historical leakage be damned”), the typedef gives you true opacity… if you’re using “struct kmutex *” rather than “kmutex_t *” then you’re already eating the forbidden fruit. -- thorpej
Re: [PATCH] style(5): No struct typedefs
> On Jul 12, 2023, at 5:25 AM, Mouse wrote: > > [paragraph-length-line damage repaired manually] > >> What about something like and >> ? The type definitions go into the former >> header file, [...] > > Well, I don't like the "typedefs" name, Yah, I don't like it either, but.. :-) > because as I see it the divide > you're sketching here (which I support, in general) is the divide > between interface and implementation, and in some cases the interface > is more than just the typedefs. Sort of. // contains the "vnode_t" opaque type definition // contains the guts of "struct vnode" and the other implementation details // Contains some of the file system interfaces, some of which use vnode_t // Contains the vnode interfaces, which definitely use vnode_t The latter if the two would each include . Feel free to pick a better name, but that's the idea. -- thorpej
Re: [PATCH] style(5): No struct typedefs
> On Jul 11, 2023, at 9:51 PM, Robert Elz wrote: > > That is to rip the definition of __NetBSD_Version__ (with however > nany underscores it really has) out of and into a new > header file all of its own (perhaps - no bikeshedding > about names here please) with the rule that *no* other header file is > oermitted to include it. C code that needs __NetBSD_Version__ needs > to explcitly include that file itself ... there is surprisingly little > of it. This will make the side effects of doing a kernel version bump > be so small (really, so it should be) that doing one is an automatic > decision any time it might be needed, and we end the "even though this > is a kernel internal ABI change, I don't think any modules will be affected" > (and no no bump happens - which must be what happens sometimes, either that > of some of our developers don't understand what a kernal internal ABI > change means) and should also avoid the need for "ride the kernel > version bump" (perhaps several hours later) or "let me know when you're > going to bump the kernel version, I have changes to commit which would > need that, but don't need to be commmitted right away" - which often > turns into a "ride the bump" when the developer who did a change, and > a bump, did not remeber, or perhaps even know, the other developer was > waiting and would have likes to coordinate. I’ve been thinking about this a little more… What about something like and ? The type definitions go into the former header file, that can be included by whatever other header needs that type definition. The full-blown structure and internal details of it go into the latter, to be only included by things that access the guts, and now it’s VERY clear that “I am using implementation details” aspect of touching those guts. -- thorpej
Re: [PATCH] style(5): No struct typedefs
> On Jul 11, 2023, at 2:56 PM, Taylor R Campbell > wrote: > > I agree the keyword is ugly, and it's unfortunate that in order to > omit it we would have to use C++, but the ugliness gives us practical > benefits of better type-checking, reduced header file maintenance > burden, and reduced motivation for unnecessary header file > dependencies. No -- you just don't have to use "void *". Can you point to a practical problematic example? -- thorpej
Re: [PATCH] style(5): No struct typedefs
> On Jul 11, 2023, at 6:43 AM, Johnny Billquist wrote: > > typedefs in C don't really create new types. They are all just derivatives. > Sometimes I even wonder why typedef exists in C. Feels like I could > accomplish the same with a #define Because #define generates pollution in unrelated things. -- thorpej
Re: [PATCH] style(5): No struct typedefs
> On Jul 11, 2023, at 3:17 AM, Taylor R Campbell wrote: > > If we used `struct bus_dma_tag *' instead, the forward declaration > could be `struct bus_dma_tag;' instead of having to pull in all of > sys/bus.h, _and_ the C compiler would actually check types. In the original design, it’s not always a struct. That was the whole point of using a more abstract type. If you want to hide the struct'ness in a machdep header file, fine, but I completely disagree with the notion of requiring the use of the “struct” keyword all over the place. -- thorpej
Re: Anonymous vnodes?
> On Jun 26, 2023, at 3:13 PM, Theodore Preduta wrote: > > Is it possible to create a vnode for a regular file in a file system > without linking the vnode to any directory, so that it disappears when > all open file descriptors to it are closed? (As far as I can tell, this > isn't possible with any of the vn_* or VOP_* functions?) There isn't a general way to do this. > For context, I'm currently working on implementing memfd_create(2), and > thought this might be a shortcut. Otherwise, I'll have to implement it > in terms of uvm operations (which is fine, just more work). Seems like these objects should be implemented above the file system ... just create a new descriptor type and interface directly with UVM. -- thorpej
Re: malloc(9) vs kmem(9) interfaces
> On Jun 1, 2023, at 5:27 AM, Greg Troxel wrote: > > However, you are talking about maintaining a count of objects by user. > That is vastly cheaper, and likely 90%+ as useful. > > SO there is "object attribution" and "total usage attribution". The old malloc() attribution mechanism just counted bytes-allocated-per-tag. -- thorpej
Re: malloc(9) vs kmem(9) interfaces
> On May 31, 2023, at 1:54 PM, Andrew Doran wrote: >> - What would the cost of restoring attribution be, other than the >> obvious O(ntag*nsizebuckets) memory cost to record it and the effort >> to annotate allocations? > > Related to this, in my experiments it turns out that using dedicated pools > for objects for no other reason than to to get attribution or be space > efficient seems to exert a lot of presssure on the CPU cache to the point > that reverting to the general purpose allocator where possible yields a > small but measurable and repeatable reduction in system time during builds. > I plan to do this for some objects. My big beef with specific pools for random objects is that it reduces utilization of memory. Let’s say you have a pool for “foo” structures, and they’re sized correctly to fit into a 128 byte kmem pool. But they have their own. And there are 7 of them allocated. That’s 3200 bytes that are being wasted when those objects could be allocated from a partially-fragmented kmem128 pool. Back in the day, a lot of subsystems were converted to use pools directly because that was how access to a direct-mapped super-page was achieved. Now that kmem backs malloc(), there’s less pressure there. But for fixed-size allocations, malloc() is an inefficient API, because of the need to store the size and then round up to the minimum allocation alignment for the architecture (16 bytes on some systems). Many of the direct pool consumers should switch to kmem, IMO… and pool_cache should probably have a kmem_cache counterpart (pool_cache ctor/dtor support, but with a generic kmem bucket, rather than a specific pool). This would likely lead to a general reduction in memory fragmentation in the system. If we really want to have an “attribution tag” system, we can do that *outside* of the allocator… It wouldn’t be all that different from evcnt. -- thorpej
Re: sdmmc question.
> On May 24, 2023, at 2:39 AM, Taylor R Campbell > wrote: > > Possible complication -- not sure if the spibus interface provides the > hardware access needed by sd@spi, judging by this FreeBSD wiki page: > https://wiki.freebsd.org/riscv/HiFiveUnmatched (But this is outside > my area of expertise, so don't let my assessment get in your way!) Probably what that FreeBSD wiki is referring to is that you need to be able to strobe the CS line for the card explicitly (and repeatedly) outside of the context of a normal SPI I/O operation … this is required as part of the process of putting the card into SPI mode, which needs to be done any time a card is inserted. -- thorpej
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> On May 9, 2023, at 3:09 PM, Taylor R Campbell > wrote: > > - uiopeek leaves uio itself untouched (it may modify the source/target > buffers but it's idempotent). Hm… I’m having second thoughts about uiopeek(), as well. It implies a direction (“peek” feels like “read”, and “write” would feel more like a “poke”). I think uiocopy() is a better name, and I think it is sufficiently different from uiomove() (“move” implies a sort of destructive-ness that “copy” does not). -- thorpej
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> On May 9, 2023, at 2:21 PM, Taylor R Campbell > wrote: > > tl;dr > > I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which I’m not a fan of uioskip() as a name … I think uioadvance() would be better. Skip implies, to my brain, that the data is being thrown away (even if you’re already consumed it). -- thorpej
Re: Nixing __HAVE_ATOMIC_AS_MEMBAR
> On Feb 22, 2023, at 4:48 PM, Taylor R Campbell > wrote: > > (c) Just make the membars unconditional -- take off the #ifndefs, but >don't add any new functions or change the semantics of existing >ones. That’s my preference. Having ti at all is confusing. -- thorpej
Re: rw_downgrade/tryupgrade semantics
> On Feb 22, 2023, at 9:40 AM, Taylor R Campbell > wrote: > > I believe this is guaranteed as rwlock(9) is currently implemented. > (Whether it is _intended_ to be guaranteed, I'm not sure.) That guarantee certainly falls in line with the principle of least astonishment. We should probably document it, but it doesn’t seem correct any other way. > The reason these can fail is that rw_tryupgrade and rw_downgrade only > issue store-before-store barriers (membar_producer), putting no > restrictions on when the CPU can load foo->nreaders relative to > rw_tryupgrade or rw_downgrade. That’s a bug, full-stop. > Even though the reader is not technically _just_ a `reader', in that > it issues stores as well as loads, I think these possibilities are > extremely counterintuitive and possibly dangerous, especially for a > relatively high-level API like rwlock(9) that prioritizes ease of use > over maximal parallelism -- merely taking a read lock can lead to > shared memory contention, which is why we also have harder-to-use but > cheaper options like pserialize(9), psref(9), and localcount(9). I think it would be worth documenting that “reader” and “writer” are just colloquialisms … “shared” and “exclusive” is what is really meant, but nearly everyone uses the “reader” and “writer” names. > I'm open to other opinions; perhaps it is intended that loads in a > writer can bleed into an adjacent read section, and that readers > aren't supposed to issue stores anyway, and perhaps there's a good > reason for all this. > > But absent a good reason, I think rw_downgrade should be a release > operation, and rw_tryupgrade should be an acquire operation, just like > rw_exit and rw_enter. Which means they need to use membar_release and > membar_acquire inside, not membar_producer (and we need to issue > pullups). I agree 100% … try-upgrade and downgrade are really just optimizations … their effect should be equivalent to dropping the writer lock and re-acquiring as a reader. -- thorpej
Re: Add five new escape sequences to wscons
> On Jan 18, 2023, at 3:34 AM, David Brownlee wrote: > > Technically the wscons terminal type is wsvt25, an extended ANSI > compatible terminal, already supporting more sequences than vt100. > > Having it also support a useful subset of xterm, providing it doesn't > add an excessive amount of complexity, seems like a useful addition, > particularly if other systems also have a "wscons" with similar > additional handling. 100% agree. -- thorpej
Re: 9.99.100 fallout: bootloader and modules
> On Sep 21, 2022, at 5:00 AM, Johnny Billquist wrote: > > Searching for uses of netbsd_version, there is some more broken logic in a > few places, following similar patterns or assumptions. > > Like in /usr/src/sys/arch/i386/stand/lib/exec.c: > >if (netbsd_version / 100 % 100 == 99) { >/* -current */ >snprintf(buf, bufsize, >"/stand/%s/%d.%d.%d/modules", machine, >netbsd_version / 1, >netbsd_version / 100 % 100, >netbsd_version / 100 % 100); >} else if (netbsd_version != 0) { >/* release */ >snprintf(buf, bufsize, >"/stand/%s/%d.%d/modules", machine, >netbsd_version / 1, >netbsd_version / 100 % 100); >} > > > So just changing the modulo in the place you were suggesting isn't enough. :-P Obviously we need a function in libsa or libkern for this. -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 18, 2022, at 2:03 AM, Taylor R Campbell > wrote: > >> Date: Sun, 17 Jul 2022 12:54:56 -0700 >> From: Jason Thorpe >> >> And another new version. This: >> >> ==> Creates a knote_impl structure that's private to kern_event.c >> that has the new lock. I took the opportunity to move kn_influx to >> the knote_impl as well, since absolutely no one outside of >> kern_event.c should touch it. This change is ABI-compatible. >> >> ==> Improves the comments about the locking rules. >> >> There are no functional changes since the last patch. > > LGTM, thanks for working on this! > > If I understand correctly, the issue is that removing the kqueue > listener (i.e., filter_detach) and detaching the device notifier > (i.e., klist_fini, via seldestroy) can race. A process can close its > kqueue file descriptor, or EV_DELETE a specific event subscription, at > the same time as a device it had subscribed to is detaching and > calling seldestroy. That is, the lifetimes of: > > (a) a knote in a kqueue listener for any of its event subscriptions, and > (b) a knote in a driver's klist for any of its subscribers, > > are overlapping but controlled independently, one by the kqueue user > and the other by the driver, e.g. when the device is physically > yanked, so the destruction by one path might happen while the other is > still in use. filter_detach can't safely call into the driver if the > knote is being freed by klist_fini; klist_fini can't safely return to > the driver if filter_detach is still in progress. It’s actually simpler than that, and doesn’t really involve even much of a race… it’s that any knote associated with a device that has just been yanked may be referring to a device instance’s private data (via kn->kn_hook), and has a kn_fop pointer that points to functions provided by the now-detached device instance. In the case of a driver that is statically linked into the kernel, at least the function pointers are probably stable and the problem is a simple use-after-free… but in the case of a dynamically-loaded driver module, kn_fop might itself now point at garbage, if the driver is unloaded. So, if the driver detach happens at any time before the kqueue listener is detached, there is the potential to go BOOM (it’s not guaranteed, because UAF bugs are sometimes like that). > Serializing the call to .f_detach and the replacement of kn_fop in > klist_fini with a mutex ensures that one happens after the other for > each knote. Using a mutex attached to the knote itself avoids > problems with the identity of the knote changing (close, EV_DELETE) > and avoids problems with the identity of the device notifier changing > (device detach) while either one is trying to run. More or less… the mutex around the filter ops ensures that kn_fop will point to EITHER the driver-provided filter ops OR the stub do-nothing filter ops, and by holding the mutex across the call into the filter ops, ensures that klist_fini() will not complete while a filter op call is in-progress, thus ensuring that the freeing of the memory that backs the klist being torn down will not complete until there are no callers inside the about-to-be-gone filter ops. > The kn_fop member is supposed to be stable all use points, so it > doesn't require additional synchronization like atomic_load/store_*: > > - Before .f_attach returns, the kqueue logic won't touch it, so > .f_attach can safely set it without synchronization. > > - By the time the driver calls klist_fini (seldestroy), the driver > promises never to KNOTE again so it will never be touched by > filter_event. It also promises to not allow any further f_attach calls to succeed. (This is not new; the same promise is already made for select, and has been for many years). > - filter_touch is used only for EVFILT_TIMER and EVFILT_USER, which > don't put the knote on any klist so never use klist_fini on it. Right. > - filter_detach and klist_fini are serialized by the knote foplock. Right. -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 13, 2022, at 7:18 PM, Jason Thorpe wrote: > > Ok, new version. Main differences: And another new version. This: ==> Creates a knote_impl structure that’s private to kern_event.c that has the new lock. I took the opportunity to move kn_influx to the knote_impl as well, since absolutely no one outside of kern_event.c should touch it. This change is ABI-compatible. ==> Improves the comments about the locking rules. There are no functional changes since the last patch. Index: kern/kern_event.c === RCS file: /cvsroot/src/sys/kern/kern_event.c,v retrieving revision 1.143 diff -u -p -r1.143 kern_event.c --- kern/kern_event.c 13 Jul 2022 14:11:46 - 1.143 +++ kern/kern_event.c 17 Jul 2022 19:44:06 - @@ -120,6 +120,61 @@ static voidfilt_userdetach(struct knote static int filt_user(struct knote *, long hint); static int filt_usertouch(struct knote *, struct kevent *, long type); +/* + * Private knote state that should never be exposed outside + * of kern_event.c + * + * Field locking: + * + * q kn_kq->kq_lock + */ +struct knote_impl { + struct knoteki_knote; + unsigned intki_influx; /* q: in-flux counter */ + kmutex_tki_foplock; /* for kn_filterops */ +}; + +#defineKIMPL_TO_KNOTE(kip) (&(kip)->ki_knote) +#defineKNOTE_TO_KIMPL(knp) container_of((knp), struct knote_impl, ki_knote) + +static inline struct knote * +knote_alloc(bool sleepok) +{ + struct knote_impl *ki; + + ki = kmem_zalloc(sizeof(*ki), sleepok ? KM_SLEEP : KM_NOSLEEP); + mutex_init(>ki_foplock, MUTEX_DEFAULT, IPL_NONE); + + return KIMPL_TO_KNOTE(ki); +} + +static inline void +knote_free(struct knote *kn) +{ + struct knote_impl *ki = KNOTE_TO_KIMPL(kn); + + mutex_destroy(>ki_foplock); + kmem_free(ki, sizeof(*ki)); +} + +static inline void +knote_foplock_enter(struct knote *kn) +{ + mutex_enter(_TO_KIMPL(kn)->ki_foplock); +} + +static inline void +knote_foplock_exit(struct knote *kn) +{ + mutex_exit(_TO_KIMPL(kn)->ki_foplock); +} + +static inline bool +knote_foplock_owned(struct knote *kn) +{ + return mutex_owned(_TO_KIMPL(kn)->ki_foplock); +} + static const struct fileops kqueueops = { .fo_name = "kqueue", .fo_read = (void *)enxio, @@ -133,6 +188,31 @@ static const struct fileops kqueueops = .fo_restart = kqueue_restart, }; +static void +filt_nopdetach(struct knote *kn __unused) +{ +} + +static int +filt_nopevent(struct knote *kn __unused, long hint __unused) +{ + return 0; +} + +static const struct filterops nop_fd_filtops = { + .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + +static const struct filterops nop_filtops = { + .f_flags = FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + static const struct filterops kqread_filtops = { .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, .f_attach = NULL, @@ -232,15 +312,41 @@ static size_t user_kfiltersz; /* size * -> object lock (e.g., device driver lock, ) * -> kn_kq->kq_lock * - * Locking rules: + * Locking rules. ==> indicates the lock is acquired by the backing + * object, locks prior are acquired before calling filter ops: + * + * f_attach: fdp->fd_lock -> knote foplock -> + * (maybe) KERNEL_LOCK ==> backing object lock + * + * f_detach: fdp->fd_lock -> knote foplock -> + *(maybe) KERNEL_LOCK ==> backing object lock + * + * f_event via kevent: fdp->fd_lock -> knote foplock -> + *(maybe) KERNEL_LOCK ==> backing object lock + *N.B. NOTE_SUBMIT will never be set in the "hint" argument + *in this case. * - * f_attach: fdp->fd_lock, KERNEL_LOCK - * f_detach: fdp->fd_lock, KERNEL_LOCK - * f_event(!NOTE_SUBMIT) via kevent: fdp->fd_lock, _no_ object lock - * f_event via knote: whatever caller guarantees - * Typically, f_event(NOTE_SUBMIT) via knote: object lock - * f_event(!NOTE_SUBMIT) via knote: nothing, - * acquires/releases object lock inside. + * f_event via knote (via backing object: Whatever caller guarantees. + * Typically: + * f_event(NOTE_SUBMIT): caller has already acquired backing + * object lock. + * f_event(!NOTE_SUBMIT): caller has not acquired backing object, + * lock or has possibly acquired KERNEL_LOCK. Backing object + * lock may or may not be acquired as-needed. + * N.B. the knote foplock will **not** be acquired in this case. The + *
Re: bus_space_barrier semantics
> On Jul 16, 2022, at 11:15 AM, Taylor R Campbell wrote: > > Thoughts? 110% on-board with all of this. > P.S. I also considered eliminating the offset/length argument, > because no implementations take advantage of them, so I started (with > maya@'s help wrangling coccinelle) to draft a patch to remove them. > However, the other BSDs have the same API, so changing this would make > it more of a pain to share drivers. Didn’t OpenBSD remove the offset / length arguments? Anyway, I’m not particularly concerned about this part, but it would be a nice wart to rid ourselves of. -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 13, 2022, at 12:02 PM, Jason Thorpe wrote: > >> On Jul 13, 2022, at 11:25 AM, Taylor R Campbell >> wrote: >> >> Sorry, haven't had time yet to do a full review, but I spot at least >> one problem that means this won't fly as is: kqueue_register and >> kqueue_scan both call filter_touch under a spin lock, but with your >> patch filter_touch now takes an rwlock -- which is forbidden under a >> spin lock (and it'll crash under LOCKDEBUG). > > I found another issue as well and will post a follow-up later. Ok, new version. Main differences: ==> each knote gets its own mutex, rather than using a global rwlock. ==> Moved the locking outside of the filter_*() functions. ==> “Dealt” with the locking issue around the “touch” op, by only allowing user_filtops and timer_filtops to use it, because they’re known to be safe to call “touch” without holding the knote foplock (they don’t put notes on any lists). The knote foplock got tossed onto the end of struct knote with the idea being that it doesn’t change the knote ABI for modules (they can’t allocate their own knotes). If this gets aligned with another change that bumps the kernel version, I’ll put the foplock up next to fop. This passes the kqueue tests on a LOCKDEBUG kernel. Index: kern/kern_event.c === RCS file: /cvsroot/src/sys/kern/kern_event.c,v retrieving revision 1.143 diff -u -p -r1.143 kern_event.c --- kern/kern_event.c 13 Jul 2022 14:11:46 - 1.143 +++ kern/kern_event.c 14 Jul 2022 02:01:44 - @@ -133,6 +133,31 @@ static const struct fileops kqueueops = .fo_restart = kqueue_restart, }; +static void +filt_nopdetach(struct knote *kn __unused) +{ +} + +static int +filt_nopevent(struct knote *kn __unused, long hint __unused) +{ + return 0; +} + +static const struct filterops nop_fd_filtops = { + .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + +static const struct filterops nop_filtops = { + .f_flags = FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + static const struct filterops kqread_filtops = { .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, .f_attach = NULL, @@ -235,12 +260,14 @@ static size_t user_kfiltersz; /* size * Locking rules: * * f_attach: fdp->fd_lock, KERNEL_LOCK - * f_detach: fdp->fd_lock, KERNEL_LOCK - * f_event(!NOTE_SUBMIT) via kevent: fdp->fd_lock, _no_ object lock - * f_event via knote: whatever caller guarantees + * f_detach: fdp->fd_lock, knote foplock, KERNEL_LOCK + * f_event(!NOTE_SUBMIT) via kevent: fdp->fd_lock, knote foplock, + * acquires/releases object lock inside + * f_event via knote: whatever caller guarantees, + * knote foplock _not_ acquired * Typically, f_event(NOTE_SUBMIT) via knote: object lock - * f_event(!NOTE_SUBMIT) via knote: nothing, - * acquires/releases object lock inside. + * f_event(!NOTE_SUBMIT) via knote: + * acquires/releases object lock inside * * Locking rules when detaching knotes: * @@ -429,6 +456,7 @@ knote_alloc(bool sleepok) struct knote *kn; kn = kmem_zalloc(sizeof(*kn), sleepok ? KM_SLEEP : KM_NOSLEEP); + mutex_init(>kn_foplock, MUTEX_DEFAULT, IPL_NONE); return kn; } @@ -436,14 +464,28 @@ knote_alloc(bool sleepok) static inline void knote_free(struct knote *kn) { + mutex_destroy(>kn_foplock); kmem_free(kn, sizeof(*kn)); } +/* + * Calls into the filterops need to be resilient against things which + * destroy a klist, e.g. device detach, freeing a vnode, etc., to avoid + * chasing garbage pointers (to data, or even potentially code in a + * module about to be unloaded). To that end, we acquire the + * knote foplock before calling into the filter ops. When a driver + * (or anything else) is tearing down its klist, klist_fini() enumerates + * each knote, acquires its foplock, and replaces the filterops with a + * nop stub, allowing knote detach (when descriptors are closed) to safely + * proceed. + */ + static int filter_attach(struct knote *kn) { int rv; + KASSERT(mutex_owned(>kn_foplock)); KASSERT(kn->kn_fop != NULL); KASSERT(kn->kn_fop->f_attach != NULL); @@ -465,6 +507,8 @@ filter_attach(struct knote *kn) static void filter_detach(struct knote *kn) { + + KASSERT(mutex_owned(>kn_foplock)); KASSERT(kn->kn_fop != NULL); KASSERT(kn->kn_fop->f_detach != NULL); @@ -478,10 +522,12 @
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 13, 2022, at 11:25 AM, Taylor R Campbell > wrote: > > Sorry, haven't had time yet to do a full review, but I spot at least > one problem that means this won't fly as is: kqueue_register and > kqueue_scan both call filter_touch under a spin lock, but with your > patch filter_touch now takes an rwlock -- which is forbidden under a > spin lock (and it'll crash under LOCKDEBUG). I found another issue as well and will post a follow-up later. -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 12, 2022, at 7:54 AM, Jason Thorpe wrote: > > The following patch rectifies this situation by having klist_fini() traverse > a list of knotes and substitute their filterops with no-op stubs. This > requires synchronizing with any calls into the filterops themselves. We have > convenient funnel-points for this in kern_event.c already, so it’s not > particularly difficult… the main issue is that the filterops themselves are > allowed to block. My solution to this was to have a single global rwlock, > knote_filterops_lock, that’s acquired for reading when a call though a > knote's filterops is made, and in klist_fini() is acquired for writing for > the short duration needed to stub out filterops. This lock should **very > rarely** be contended, but it will be *hot* (lots of acquire-for-read), so it > gets its own cache line. > > If someone has ideas about another synchronization mechanism, I’m all ears… > again, the main issue is that the filterops calls themselves are allowed to > block, so I’m not sure that any of our passive serialization mechanisms would > work here. FWIW, another alternative is to put the rwlock into the knote itself, if we think that lock will be Too Hot (which is not an unreasonable concern on a machine with many cores). The trade-off is memory. I could see putting the lock in the knote itself on amd64 and aarch64 (where we would expect to see high core count machines) and a global lock on everything else. Then again, on amd64, struct knote is 136 bytes, so it already spills into the kmem-00192 bucket, and on i386 it’s 84 bytes, so kmem-00128 bucket. I guess each of those could easily accommodate an additional pointer-sized struct member to alleviate any scalability concern about a global rwlock. -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 12, 2022, at 10:27 AM, Jason Thorpe wrote: > > >> On Jul 12, 2022, at 7:54 AM, Jason Thorpe wrote: >> >> If someone has ideas about another synchronization mechanism, I’m all ears… >> again, the main issue is that the filterops calls themselves are allowed to >> block, so I’m not sure that any of our passive serialization mechanisms >> would work here. >> >> > > Well, for some reason, I mistakenly posted an old version of this patch. > I’ll post a refresh later today. Oh, no, never mind, I’m wrong about this :-). NOTHING TO SEE HERE *whistles* -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 12, 2022, at 7:54 AM, Jason Thorpe wrote: > > If someone has ideas about another synchronization mechanism, I’m all ears… > again, the main issue is that the filterops calls themselves are allowed to > block, so I’m not sure that any of our passive serialization mechanisms would > work here. > > Well, for some reason, I mistakenly posted an old version of this patch. I’ll post a refresh later today. -- thorpej
Fix for PR kern/56713
When I changed vnode kqueue events to be posted from common code at the VFS layer, I introduced a regression in sending those events on stacked file systems like nullfs. The root cause of the problem is that when knotes are attached to the vnode, that operation (by way of the VOP bypass mechanism in layerfs) drills all the way down to the bottom of the vnode stack and attaches there (as intended), but with event delivery being handled now in vnode_if.c, it’s the upper vnode that’s being worked with, and no knotes are attached there. These stacked vnodes already share a bit of state with the bottom layer… (locks, etc.), so my solution to this problem is to allow these stacked vnodes to also refer to the bottom vnode’s klist. Each vnode always starts out referring to its own, but as a layerfs vnode is constructed, it is told to refer to the lower vnode's klist (just as it refers to the lower vnode’s interlock and uvm_obj lock). Various assertions are included to ensure that knotes are never attached to a vnode that refers to another vnode’s klist (they should always be attached to the bottom-most vnode). (After applying the patch, vnode_if.c must be rebuilt.) Index: sys/fs/union/union_subr.c === RCS file: /cvsroot/src/sys/fs/union/union_subr.c,v retrieving revision 1.81 diff -u -p -r1.81 union_subr.c --- sys/fs/union/union_subr.c 19 Mar 2022 13:53:32 - 1.81 +++ sys/fs/union/union_subr.c 12 Jul 2022 00:17:28 - @@ -232,10 +232,11 @@ union_newupper(struct union_node *un, st unlock_ap.a_desc = VDESC(vop_unlock); unlock_ap.a_vp = UNIONTOV(un); genfs_unlock(_ap); - /* Update union vnode interlock & vmobjlock. */ + /* Update union vnode interlock, vmobjlock, & klist. */ vshareilock(UNIONTOV(un), uppervp); rw_obj_hold(uppervp->v_uobj.vmobjlock); uvm_obj_setlock((un)->v_uobj, uppervp->v_uobj.vmobjlock); + vshareklist(UNIONTOV(un), uppervp); mutex_exit(>un_lock); if (ohash != nhash) { LIST_INSERT_HEAD([nhash], un, un_cache); @@ -577,6 +578,7 @@ union_loadvnode(struct mount *mp, struct vshareilock(vp, svp); rw_obj_hold(svp->v_uobj.vmobjlock); uvm_obj_setlock(>v_uobj, svp->v_uobj.vmobjlock); + vshareklist(vp, svp); /* detect the root vnode (and aliases) */ if ((un->un_uppervp == um->um_uppervp) && Index: sys/miscfs/genfs/layer_vfsops.c === RCS file: /cvsroot/src/sys/miscfs/genfs/layer_vfsops.c,v retrieving revision 1.54 diff -u -p -r1.54 layer_vfsops.c --- sys/miscfs/genfs/layer_vfsops.c 23 Feb 2020 15:46:41 - 1.54 +++ sys/miscfs/genfs/layer_vfsops.c 12 Jul 2022 00:17:28 - @@ -205,10 +205,11 @@ layerfs_loadvnode(struct mount *mp, stru xp = kmem_alloc(lmp->layerm_size, KM_SLEEP); - /* Share the interlock and vmobjlock with the lower node. */ + /* Share the interlock, vmobjlock, and klist with the lower node. */ vshareilock(vp, lowervp); rw_obj_hold(lowervp->v_uobj.vmobjlock); uvm_obj_setlock(>v_uobj, lowervp->v_uobj.vmobjlock); + vshareklist(vp, lowervp); vp->v_tag = lmp->layerm_tag; vp->v_type = lowervp->v_type; Index: sys/kern/vfs_vnode.c === RCS file: /cvsroot/src/sys/kern/vfs_vnode.c,v retrieving revision 1.143 diff -u -p -r1.143 vfs_vnode.c --- sys/kern/vfs_vnode.c9 Apr 2022 23:45:45 - 1.143 +++ sys/kern/vfs_vnode.c12 Jul 2022 00:17:28 - @@ -457,7 +457,8 @@ vnalloc_marker(struct mount *mp) vp->v_mount = mp; vp->v_type = VBAD; vp->v_interlock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE); - klist_init(>v_klist); + klist_init(>vi_klist.vk_klist); + vp->v_klist = >vi_klist; vip->vi_state = VS_MARKER; return vp; @@ -475,7 +476,7 @@ vnfree_marker(vnode_t *vp) KASSERT(vip->vi_state == VS_MARKER); mutex_obj_free(vp->v_interlock); uvm_obj_destroy(>v_uobj, true); - klist_fini(>v_klist); + klist_fini(>vi_klist.vk_klist); pool_cache_put(vcache_pool, vip); } @@ -1391,7 +1392,8 @@ vcache_alloc(void) vp->v_interlock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE); uvm_obj_init(>v_uobj, _vnodeops, true, 1); - klist_init(>v_klist); + klist_init(>vi_klist.vk_klist); + vp->v_klist = >vi_klist; cv_init(>v_cv, "vnode"); cache_vnode_init(vp); @@ -1453,7 +1455,9 @@ vcache_free(vnode_impl_t *vip) mutex_obj_free(vp->v_interlock); rw_destroy(>vi_lock); uvm_obj_destroy(>v_uobj, true); - klist_fini(>v_klist); + KASSERT(vp->v_klist == >vi_klist || + SLIST_EMPTY(>vi_klist.vk_klist)); + klist_fini(>vi_klist.vk_klist); cv_destroy(>v_cv);
Problem with outstanding knotes and device detach - and a fix
Background: kqueue events are represented by a “knote” that is kept on a list of knotes (a klist) by its respective driver (usually indirectly in the “selinfo” structure). A similar situation exists for vnodes. Then, when activated, they are queued onto the kqueue’s active note list. The driver gets called to attach the knote in a generic fashion, but then provides its own ops vector for subsequent operations, including “detach”. When a file descriptor for a given knote is closed, this driver-supplied “detach” function is called to remove it from whatever driver-specific klist the knote is on. When a driver detaches, it frees its softc structure, which includes the selinfo / klist. This list might still have notes on it… but that isn’t inherently problematic, because the driver owns the list and no one will ever traverse it again — the generic kqueue code doesn’t care about it. But it is a problem when an outstanding file descriptor with a registered kqueue event (e.g. EVFILT_READ) for that device that may have been open when the device was detached. In this case, when the file descriptor is closed, the knote, which still points to the now-detached device’s filterops, will call the “detach” operation. For a driver that is still loaded into the kernel, we call code that’s still valid, but a use-after-free condition exists where the now-freed softc structure will be accessed to unlink the knote from the klist. The situation is worse for modular drivers that may have been unloaded after the device detaches — jumping through function pointers to code that no longer exists is bad, m’kay? The following patch rectifies this situation by having klist_fini() traverse a list of knotes and substitute their filterops with no-op stubs. This requires synchronizing with any calls into the filterops themselves. We have convenient funnel-points for this in kern_event.c already, so it’s not particularly difficult… the main issue is that the filterops themselves are allowed to block. My solution to this was to have a single global rwlock, knote_filterops_lock, that’s acquired for reading when a call though a knote's filterops is made, and in klist_fini() is acquired for writing for the short duration needed to stub out filterops. This lock should **very rarely** be contended, but it will be *hot* (lots of acquire-for-read), so it gets its own cache line. If someone has ideas about another synchronization mechanism, I’m all ears… again, the main issue is that the filterops calls themselves are allowed to block, so I’m not sure that any of our passive serialization mechanisms would work here. Index: kern/kern_event.c === RCS file: /cvsroot/src/sys/kern/kern_event.c,v retrieving revision 1.141 diff -u -p -r1.141 kern_event.c --- kern/kern_event.c 24 May 2022 20:50:19 - 1.141 +++ kern/kern_event.c 11 Jul 2022 20:53:40 - @@ -133,6 +133,31 @@ static const struct fileops kqueueops = .fo_restart = kqueue_restart, }; +static void +filt_nopdetach(struct knote *kn __unused) +{ +} + +static int +filt_nopevent(struct knote *kn __unused, long hint __unused) +{ + return 0; +} + +static const struct filterops nop_fd_filtops = { + .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + +static const struct filterops nop_filtops = { + .f_flags = FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + static const struct filterops kqread_filtops = { .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, .f_attach = NULL, @@ -232,6 +257,9 @@ static size_t user_kfiltersz; /* size * -> object lock (e.g., device driver lock, ) * -> kn_kq->kq_lock * + * knote_filterops_lock (rw) + * -> whatever will be acquired in filter_*() + * * Locking rules: * * f_attach: fdp->fd_lock, KERNEL_LOCK @@ -279,6 +307,24 @@ static size_t user_kfiltersz; /* size */ static krwlock_t kqueue_filter_lock; /* lock on filter lists */ +/* + * Calls into the filterops need to be resilient against things which + * destroy a klist, e.g. device detach, freeing a vnode, etc., to avoid + * chasing garbage pointers (to data, or even potentially code in a + * module about to be unloaded). To that end, we acquire the + * knote_filterops_lock as a reader whenver calling into the filter + * ops (in the filter_*() functions). When a driver (or anything else) + * is tearing down its klist, klist_fini() acquires knote_filterops_lock + * as a writer, and replaces all of the filterops in each knote with a + * stub, allowing knote detach (when descriptors are closed) to safely + * proceed. + * + * This lock should be very rarely *contended* (klist tear-down is fairly + * rare), but it
Re: membar_enter semantics
> On Feb 11, 2022, at 10:27 AM, Taylor R Campbell wrote: > > So x86, powerpc, and sparc64 all implement what I suggest membar_enter > _should_ be (and what all current users use it for!), and _fail_ to > implement what membar_enter is documented to be in the man page. > > And of all the membar_enter definitions, only the riscv one fails to > guarantee the load-before-load/store ordering I suggest it should be > documented to have. My beef with the membar_enter definitional change is with the word "acquire". I.e. you want to give it what is called today "acquire" semantics. My beef is with now "acquire" is defined, as load-before-load/store. I think it's a non-intuitive word to use for that kind of barrier, especially when used in context of the Solaris description of membar_enter(), as you quoted earlier: The membar_enter() function is a generic memory barrier used during lock entry. It is placed after the memory operation that acquires the lock to guarantee that the lock protects its data. No stores from after the memory barrier will reach visibility and no loads from after the barrier will be resolved before the lock acquisition reaches global visibility. So, let's again take a look at the rules for PSO from the SPARC v9 architecture manual: """ The rules for PSO are: * Loads are blocking and ordered with respect to earlier loads. * Atomic load-stores are ordered with respect to loads. Thus, PSO ensures that: * Each load and atomic load-store instruction behaves as if it were followed by a MEMBAR with a mask value of 05. * Explicit MEMBAR instructions are required to order store and atomic load-store instructions with respect to each other. """ My gripe hinges on what "lock acquisition" means. In SPARC-land, the above Solaris description can be broken down into three cases: v7 -- Locks are acquired with "ldstub", and memory is always strongly-ordered; membar_enter() is a nop. v8 -- Locks are acquired with "ldstub", and CPU Partial Store Order; membar_enter() is "stbar" (equivalent to "membar #StoreStore" in v9-speak). V9 -- Locks are acquired with either "ldstub" or "casx", and CPU can do PSO and RMO. Both "ldstub" and "casx" are the same with respect to how they interact with barriers. But PSO and RMO have two different requirements: v9-PSO -- Because Atomic load-stores ("ldstub" and "casx") are not ordered with respect to stores, you would need "membar #StoreStore" (in PSO mode, Atomic load-stores are already strongly ordered with respect to other loads). v9-RMO -- In **this case**, because an Atomic load-store is a load, you can use "membar #LoadLoad | #LoadStore" (your proposed semantics: load-before-load/store). But because an Atomic load-store is also a store, you can also use "membar #StoreLoad | #StoreStore" (the current definition in our man page: store-before-load/store). Now, because in PSO mode, Atomic load-stores are not strongly ordered with respect to stores, in order for the following code to work: mutex_enter(); *foo = 0; result = *bar; mutex_exit(); ...then you need to issue a "membar #StoreStore" because the ordering of the lock acquisition and the store through *foo is not guaranteed without it. But you can also issue a "membar #StoreLoad | #StoreStore", which also works in RMO mode. In other words, it's the **store into the lock cell** that actually performs the acquisition of the lock. In addition to being true on platforms that have Atomic load-store (like SPARC), it is also true on platforms that have LL/SC semantics (the load in that case doesn't mean jack-squat, and the ordering guarantees that the LL has are specifically with respect to the paired SC). Until that store is globally visible, the lock does not protect its data, on all platforms that NetBSD supports. And this is why I described membar_enter() the way I did. Now, if we want to loosen the definition to be "acquisition of lock", which would allow a load-before-load/store after an Atomic load-store, then that's fine, but I would object to defining "membar_enter()" as "load-before-load/store" ... rather, I would word it in a more fuzzy way such that load-before-load/store is allowed if such a barrier works for that architecture, but would also allow store-before-load/store if some other architecture needs those semantics for whatever reason. And that's why I'd also prefer to have a new name for an additional operation if you want to have one that is explicitly "load-before-load/store". > (sparc64 membar_sync also fails to order store-before-load, and that's > clearly just an implementation bug because membar_sync has always been > unequivocally intended and used as a full load/store-before-load/store > sequential consistency barrier. Easy fix: change membar_sync to do > `membar #StoreLoad' for kernel running in TSO. If userland is really > running in RMO, then all the membars need to be
Re: membar_enter semantics
> On Feb 11, 2022, at 9:03 AM, Jason Thorpe wrote: > > >> On Feb 11, 2022, at 8:45 AM, Martin Husemann wrote: >> >> so Jason is correct. For userland we default to RMO for 64bit binaries >> (but honor their elf flags), and TSO for 32bit. > > Right, v7 didn't have any barrier instructions, and v8 only had STBAR (a > stores-before-stores barrier), IIRC. I did not realize we are using RMO for > 64-bit userland, tho. So, given the existing definition of membar_enter() in the man page, sparc64's userland membar_sync() really ought to be "membar #StoreStore | #StoreLoad", and with Taylor's newly-proposed semantics it would need to be "membar #LoadLoad | #LoadStore". The current "membar #LoadLoad" is *only* maybe-OK if you're in TSO mode, and even then I think it's dicey. -- thorpej
Re: membar_enter semantics
> On Feb 11, 2022, at 8:45 AM, Martin Husemann wrote: > > so Jason is correct. For userland we default to RMO for 64bit binaries > (but honor their elf flags), and TSO for 32bit. Right, v7 didn't have any barrier instructions, and v8 only had STBAR (a stores-before-stores barrier), IIRC. I did not realize we are using RMO for 64-bit userland, tho. -- thorpej
Re: membar_enter semantics
> On Feb 11, 2022, at 5:33 AM, Taylor R Campbell wrote: > > sparc64: alias for membar_sync (not sure membar #LoadLoad is right, though!) Almost forgot to mention ... I wouldn't read too much into how sparc64's barriers are implemented ... I'm pretty sure our sparc64 port configures the system for Total Store Order and not Partial Store Order. -- thorpej
Re: membar_enter semantics
> On Feb 11, 2022, at 7:07 AM, Taylor R Campbell wrote: > >> Date: Fri, 11 Feb 2022 06:52:48 -0800 >> From: Jason Thorpe >> >> I would prefer we adopt the Solaris description about a generic >> barrier that provides "lock-is-visible-before-load/store" without >> explicitly stating "load-before-load/store", and provide a new >> membar_acquire() that means "load-before-load/store". > > What does `lock-is-visible' mean? How would you ever use it > correctly, or audit correct use? What operations do we have that have > `lock-is-visible' semantics, which don't also imply an acquire > operation and thus don't need an explicit barrier anyway? Actually, you know what, I’m just going to defer on all of this. I have a huge beef with the modern naming of various memory barrier operations (they are not intuitive at all, IMO) … so whatever. > It turns out that _even on x86_, our membar_enter fails to implement > the semantics we documented for it Sure, change the documentation rather than fix the bug, then. -- thorpej
Re: membar_enter semantics
> On Feb 11, 2022, at 5:33 AM, Taylor R Campbell wrote: > > So I propose to change the membar_enter documentation to match the > definitions and usage (and change the riscv definition), making it > instead: > > membar_enter() > Any load preceding membar_enter() will happen before all memory > operations following it. > > This will also let us delete the obnoxious text I added to > atomic_loadstore(9) warning about membar_enter semantics. I would prefer we adopt the Solaris description about a generic barrier that provides “lock-is-visible-before-load/store” without explicitly stating “load-before-load/store”, and provide a new membar_acquire() that means “load-before-load/store”. -- thorpej
Re: membar_enter semantics
> On Feb 11, 2022, at 5:33 AM, Taylor R Campbell wrote: > > The following definitions provide load-before-load/store: > > aarch64: alias for membar_sync > alpha: alias for membar_sync > arm: alias for membar_sync Actually, on Alpha[*], an alias to membar_sync is really providing “load/store-before-load/store”. That’s supposed to be what membar_sync() is defined as everywhere. [*] MB is “load/store-before-load/store” and WMB is “store-before-store”. -- thorpej
Re: Regulator
> On Jan 12, 2022, at 9:08 AM, Jason Thorpe wrote: > >> On Jan 12, 2022, at 8:02 AM, Taylor R Campbell >> wrote: >> >>> Date: Wed, 12 Jan 2022 15:42:17 + >>> From: Robert Swindells >>> >>> I'm guessing this is a platform that doesn't use FDT. >>> >>> Maybe we need a more general API for regulators. >> >> What other instances would a more general API cover? Is there some >> ACPI interface for regulators? > > The ACPI interface for regulators is usually "put device into power state X", > but of course that depends on the right ASL being present in the node for the > device. I guess I forgot to mention, it's not necessarily about setting the device into a specific power state (like "soft-suspend" vs "deep-sleep" vs "off"). In addition to "on" and "off", the regulator API can be used to step voltages. This is useful when you want to consume as little power as possible for a given performance requirement... like if you're running an interface at a slower clock rate (for whatever reason), you can sometimes step down the voltage because you don't need as much drive strength to defeat the capacitances of the physical device, for example. I don't know that ACPI really encapsulates that (unless there are DSMs or whatever defined for the device type that know how to do such stepping). In general, the philosophy between ACPI and Device Tree are different in this regard. ACPI aims to encapsulate the knowledge in methods that the firmware provides, and Device Tree aims to describe the connections between the elements and then provides generic interfaces to interact with those elements. -- thorpej
Re: Regulator
> On Jan 12, 2022, at 9:17 AM, Jason Thorpe wrote: > >> Both lima and panfrost need to make calls to the regulator API, as well >> as clock, reset and interrupt ones. > > Lima will probably be fine because the platforms that need it will use FDT. > Except for if you're using that board with ACPI, I guess. Same situation > with resets. Interrupts are another story and a bit more complicated. "Resets" is also a little more complicated, because there are two kinds on the Device Tree bindings universe: - Reset Controllers (which use the "Reset" bindings) - Reset GPIOs (which use the "GPIO" bindings) Our FDT GPIO and generic GPIO interfaces are in need of harmonizing, and I've been thinking about that lately, so hopefully will have some bandwidth to make progress on that front in the coming months. -- thorpej
Re: Regulator
> On Jan 12, 2022, at 9:12 AM, Robert Swindells wrote: > > > Taylor R Campbell wrote: >> Date: Wed, 12 Jan 2022 15:42:17 + >>> From: Robert Swindells >>> >>> I'm guessing this is a platform that doesn't use FDT. >>> >>> Maybe we need a more general API for regulators. >> >> What other instances would a more general API cover? Is there some >> ACPI interface for regulators? > > Don't know, somebody could look at what the Linux API covers. > > Could also look at doing clocks, resets and interrupts. We have one for clocks, but the only way to acquire a clock handle is via FDT at the moment. > Both lima and panfrost need to make calls to the regulator API, as well > as clock, reset and interrupt ones. Lima will probably be fine because the platforms that need it will use FDT. Except for if you're using that board with ACPI, I guess. Same situation with resets. Interrupts are another story and a bit more complicated. -- thorpej
Re: Regulator
> On Jan 12, 2022, at 9:08 AM, Jason Thorpe wrote: > > > >> On Jan 12, 2022, at 8:02 AM, Taylor R Campbell >> wrote: >> >>> Date: Wed, 12 Jan 2022 15:42:17 + >>> From: Robert Swindells >>> >>> I'm guessing this is a platform that doesn't use FDT. >>> >>> Maybe we need a more general API for regulators. >> >> What other instances would a more general API cover? Is there some >> ACPI interface for regulators? > > The ACPI interface for regulators is usually "put device into power state X", > but of course that depends on the right ASL being present in the node for the > device. (Of course I mean AML here, but of course it's all initially written in ASL, so...) > >>> The current situation causes a fair bit of "#ifdef FDT" in the drm >>> compat code. >> >> I count 1: >> >> https://nxr.netbsd.org/xref/src/sys/external/bsd/drm2/include/linux/regulator/consumer.h?r=1.5#39 >> >> Am I missing some? >> >> There's also one in linux/clk.h, for clocks, and two in nouveau_pci.c >> and radeon_pci.c, for kicking out the FBT framebuffer; these don't >> appear to be relevant to regulators. > > Any place where there's an "#ifdef FDT" is a failure to create a proper > abstraction. > > -- thorpej -- thorpej
Re: Regulator
> On Jan 12, 2022, at 8:02 AM, Taylor R Campbell > wrote: > >> Date: Wed, 12 Jan 2022 15:42:17 + >> From: Robert Swindells >> >> I'm guessing this is a platform that doesn't use FDT. >> >> Maybe we need a more general API for regulators. > > What other instances would a more general API cover? Is there some > ACPI interface for regulators? The ACPI interface for regulators is usually "put device into power state X", but of course that depends on the right ASL being present in the node for the device. >> The current situation causes a fair bit of "#ifdef FDT" in the drm >> compat code. > > I count 1: > > https://nxr.netbsd.org/xref/src/sys/external/bsd/drm2/include/linux/regulator/consumer.h?r=1.5#39 > > Am I missing some? > > There's also one in linux/clk.h, for clocks, and two in nouveau_pci.c > and radeon_pci.c, for kicking out the FBT framebuffer; these don't > appear to be relevant to regulators. Any place where there's an "#ifdef FDT" is a failure to create a proper abstraction. -- thorpej
Re: Regulator
> On Jan 12, 2022, at 6:34 AM, Emmanuel Dreyfus wrote: > > Hello > > I am still working on Goodix touchpanel as told in [1] and [2]. I > wrote a driver for the Intel GPIO chip that interfaces with Goodix > chipreset and interrupt pins. I have not committed yet, because it > remains untested. > > There is another readblock. The Linux driver for Goodix touchpanel > uses a beast called regulator, which seems to control the chip > power supply: >ts->avdd28 = devm_regulator_get(dev, "AVDD28"); > (...) >ts->vddio = devm_regulator_get(dev, "VDDIO"); > (...) >/* power up the controller */ >error = regulator_enable(ts->avdd28); > (...) >error = regulator_enable(ts->vddio); > > Is it another chip that neds another driver? Or a feature of the > Goodix chip? What support do we have for thins kind of thing? > grep -r regulator returns manu hits in src/sys/dev/i2c and > src/sys/dev/fdt but we do not have any generic API for that, right? We support regulators using FDT. -- thorpej
Passive serialization support in the pool allocator
An #ifdef that Taylor added in the new DRM code was bugging me, because a similar situation already existed in the NetBSD kernel, whereby the type stability of a object’s backing memory (the iam existentium case being LWPs) was required for a passively-serialized weak reference to work. The LWP pool cache had a hack to handle it, and the hack was propagated to the new DRM code under an “#ifdef __NetBSD__”. I initially fixed this by adding a “pre-destruct” callback that could be optionally set for a pool cache, but after further discussion, it seemed like having direct knowledge of passive serialization synchronization points in the allocator was a better solution, so here is a diff that implements it for you review. https://www.netbsd.org/~thorpej/pool-pser-diff.txt Please comment ASAP, because I want to ride a kernel version bump. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
If strip removes it, then you’re doomed anyway and trampoline assist via a function won’t help you, because you won’t be able to get to the trampoline or past it anyway. -- thorpej Sent from my iPhone. > On Nov 21, 2021, at 7:54 AM, John Marino (NetBSD) wrote: > > > I'm not very familiar with CFA information. I've been worried that strip(1) > removes those symbols. Is that worry meritless? > >> On Sun, Nov 21, 2021 at 9:32 AM Jason Thorpe wrote: >> >> > On Nov 21, 2021, at 6:35 AM, John Marino (NetBSD) wrote: >> > >> > Sorry, I meant to answer and got overwhelmed. >> > Actually, I had a typo in the previous response. I meant to say "GCC" >> > unwinder, not gdb unwinder. >> > I don't see how this helps support the GCC unwinder. The context >> > information to support the unwind is already provided, we just need the >> > boolean (is this a sigtramp frame?) to decide whether or not to use it. I >> > thought Uwe's suggestion to return the context was either an expansion for >> > more general use or a second function. All we have is the pc pointer. >> >> Well, getting back to a previous part of the conversation, there can be >> multiple kinds of contexts, either a “sigcontext” or a “ucontext_t”, and so >> telling you “it’s in a trampoline” isn’t necessarily useful — you need to >> know what kind it is. The idea behind __sigtramp_unwind_np() is that you >> would need less architecture-dependent logic (and that the API would be >> future-proof, in case the way the handlers work changes for some reason). >> >> However, Joerg correctly pointed out that the real correct solution already >> exists, which is to say “make sure DWARF CFA information exists for signal >> trampolines”, which is what I’ve been focusing on over the last couple of >> weeks. Only if that proves to be insufficient for some reason should we add >> a new non-portable API to assist unwind. DWARF unwind information doesn’t >> really work for FreeBSD trampolines, because the handler is supplied by the >> kernel, and so of course there’s not really a good way to look up the CIE / >> FDE for it. >> >> -- thorpej >>
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Nov 21, 2021, at 8:28 AM, Jason Thorpe wrote: > > If strip removes it, then you’re doomed anyway and trampoline assist via a > function won’t help you, because you won’t be able to get to the trampoline > or past it anyway. Here’s a before/after of a “strip -s” of a binary on the DWARF unwind information: 12 .eh_frame_hdr 01a4 0001200077f0 0001200077f0 77f0 2**2 CONTENTS, ALLOC, LOAD, READONLY, DATA 13 .eh_frame 07a4 000120007998 000120007998 7998 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA 12 .eh_frame_hdr 01a4 0001200077f0 0001200077f0 77f0 2**2 CONTENTS, ALLOC, LOAD, READONLY, DATA 13 .eh_frame 07a4 000120007998 000120007998 7998 2**3 CONTENTS, ALLOC, LOAD, READONLY, DATA I.e. strip does not effect the unwind information, because unwind information is not debugging information nor is it part of the symbol information; it is, in fact, required for correct operation of the program in the face of exceptions. And my test program still works: alpha-vm:thorpej 22$ ./test1 ^Cx 2 Backtrace 4 stack frames. 0x1200014e4 <_init+0x304> at ./test1 0x3fffdddbfd4 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12 0x1200015e4 <_init+0x404> at ./test1 0x120001590 <_init+0x3b0> at ./test1 As you can see, because I stripped the symbols out of the program binary, the symbol names are wrong, but the unwind works and the program counter values are the same as an un-stripped copy of the program: alpha-vm:thorpej 23$ ./sigbttest ^Cx 2 Backtrace 4 stack frames. 0x1200014e4 at ./sigbttest 0x3fffdddbfd4 <__sigtramp_siginfo_2> at /usr/lib/libc.so.12 0x1200015e4 at ./sigbttest 0x120001590 at ./sigbttest So, I think your worry about it is unwarranted. >> On Nov 21, 2021, at 7:54 AM, John Marino (NetBSD) wrote: >> >> >> I'm not very familiar with CFA information. I've been worried that strip(1) >> removes those symbols. Is that worry meritless? -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Nov 21, 2021, at 6:35 AM, John Marino (NetBSD) wrote: > > Sorry, I meant to answer and got overwhelmed. > Actually, I had a typo in the previous response. I meant to say "GCC" > unwinder, not gdb unwinder. > I don't see how this helps support the GCC unwinder. The context information > to support the unwind is already provided, we just need the boolean (is this > a sigtramp frame?) to decide whether or not to use it. I thought Uwe's > suggestion to return the context was either an expansion for more general use > or a second function. All we have is the pc pointer. Well, getting back to a previous part of the conversation, there can be multiple kinds of contexts, either a “sigcontext” or a “ucontext_t”, and so telling you “it’s in a trampoline” isn’t necessarily useful — you need to know what kind it is. The idea behind __sigtramp_unwind_np() is that you would need less architecture-dependent logic (and that the API would be future-proof, in case the way the handlers work changes for some reason). However, Joerg correctly pointed out that the real correct solution already exists, which is to say “make sure DWARF CFA information exists for signal trampolines”, which is what I’ve been focusing on over the last couple of weeks. Only if that proves to be insufficient for some reason should we add a new non-portable API to assist unwind. DWARF unwind information doesn’t really work for FreeBSD trampolines, because the handler is supplied by the kernel, and so of course there’s not really a good way to look up the CIE / FDE for it. -- thorpej
Re: timecounters
> On Nov 15, 2021, at 8:32 AM, Michael wrote: > > IIRC on macppc SMP startup we stop the timebase, then zero the counters > on all CPUs while they hatch, then start the timebase again. But that's for the clock interrupt, right? The time counter thing isn't about clock interrupts. -- thorpej
Re: timecounters
> On Nov 14, 2021, at 4:49 PM, Joerg Sonnenberger wrote: > >> x86 TSC: cycle count from CPU register. Very quick to read, but unreliable >> if >> CPU frequency changes because of power saving. Also each CPU has its own >> value (how do we cope with that?) > > It's even more complicated. For older x86 CPUs, different counts could > go at different frequencies, but this was fixed around the Netburst era > or so. That said, they frequently have issues when used in SMP systems. > It's a high resolution timer. This is handled correctly on Alpha now (fixed roughly 1 year ago with kern_cctr.c 1.11). If the cycle counters are running at the same nominal frequency, the offset of the counter on the secondary CPUs is calibrated periodically (~once per second) relative to the primary CPU’s counter. Alpha is the only port using that code at the moment, but there is nothing machine-dependent about it. However, that code cannot deal with counters whose frequency changes, or counters that run at a different nominal frequency than the primary’s. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Nov 14, 2021, at 3:12 PM, John Marino (NetBSD) wrote: > > So for the purpose of the gdb unwinder, I would pass NULL for the sp argument? > The unwinder would only be checking the result to see __sigtramp_unwind_np > returns null or not. This would not work for the gdb unwinder — it only works for unwinding from within the same process, so for exception handling, etc. I.e. if the unwinder in the compiler runtime wasn’t able to process the DWARF call frame info for the signal trampoline, for example. The gdb unwinder may have to operate on a process that’s no longer running (and, at the very least, is not the current process), so it needs to rely only on DWARF call frame info and/or heuristics based on the symbol name (which I added to gdb quite some years ago now). The requirements for using this are: ==> In the frame that represents the handler itself, you are able to get the return address the handler will return to. ==> In that same frame, you can compute what the stack pointer should be when the handler returns (by either inspecting the instructions in the function prologue that setup the handler’s stack frame, or by parsing the DWARF call frame info). It’s that return address and stack pointer that you would pass to __sigtramp_unwind_np(). Then the context returned by __sigtramp_unwind_np() would allow you to then get the stack pointer, PC, etc. for the context that would be resumed after the handler returns (because remember, signals can run on their own stack). So, this is a little more than just “is this PC in the signal trampoline?”. This is “If this PC is in the signal trampoline, then return a pointer to the context that will be restored when the signal returns, given this SP.” This was uwe’s suggestion. I can still expose just __sigtramp_check_np(), but the assumption was that you would use a “YES!” result from that to to off an find the context-to-restore… so we decided to encapsulate some of that work for you as well. Does that make sense? If that assumption about how you would use __sigtramp_check_np() is invalid, please let me know so I can adjust the proposal. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 28, 2021, at 8:49 AM, Valery Ushakov wrote: > > It is ucontext for the siginfo trampoline and sigcontext for the older > one, isn't it? Ok, I’ve settled on the following: void *__sigtramp_unwind_np(void *pc, void *sp, int *versp); Given a program counter and stack pointer, return a pointer to the context that will be restored when the signal handler returns. The signal trampoline version is returned in *versp; versp must not be NULL, and passing NULL will result in undefined behavior. Returns NULL if the provided pc is not within the signal trampoline, or if the location of the context to restore cannot be determined. If the returned version is within the range __SIGTRAMP_SIGINFO_VERSION_MIN … __SIGTRAMP_SIGINFO_VERSION_MAX, then the returned context points to a ucontext_t. If the returned version is within the range __SIGTRAMP_SIGCONTEXT_VERSION_MIN … __SIGTRAMP_SIGCONTEXT_VERSION_MAX, then the returned context points to a struct sigcontext. Note that the layout of the context structures is architecture-dependent and may also be version-dependent. Does this sound good to everyone? -- thorpej
Re: if_ethersubr.c: if_ierrors -> if_iqdrops?
> On Nov 7, 2021, at 11:07 PM, RVP wrote: > > So, I hacked up a small patch to put most of these into the > "if_iqdrops" bucket. The rest (following FreeBSD) remain as errors. LGTM! -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 29, 2021, at 7:54 AM, Jason Thorpe wrote: > > > >> On Oct 29, 2021, at 2:45 AM, Martin Husemann wrote: >> >> On Thu, Oct 28, 2021 at 11:14:39AM -0700, Jason Thorpe wrote: >>> We really just need to kill the sigcontext stuff completely. More to the >>> point, we should version sigaction() before (or concurrently with) adding >>> whatever call we add here so as to ensure that it will only ever be a >>> ucontext. >> >> Something there is broken in -current, see PR 56471 (freshly compiled macppc >> ntpd tries to use compat_16___sigreturn14). > > Yes, and from your description it should have failed to register a > “sigcontext” handler in the first place. Martin, is kern.module.autoload enabled on your machine? (Sorry, my Mac mini is in a state of “has been turned off for a while”, so I need to resurrect it today, so asking for info here to get the investigation going…) The PowerPC signal trampoline also has a case where it can return EINVAL because it thinks the saved SRR1 has been tampered with. I’m wondering if somehow the old sigcontext trampoline is getting invoked for a siginfo-style handler. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 29, 2021, at 2:45 AM, Martin Husemann wrote: > > On Thu, Oct 28, 2021 at 11:14:39AM -0700, Jason Thorpe wrote: >> We really just need to kill the sigcontext stuff completely. More to the >> point, we should version sigaction() before (or concurrently with) adding >> whatever call we add here so as to ensure that it will only ever be a >> ucontext. > > Something there is broken in -current, see PR 56471 (freshly compiled macppc > ntpd tries to use compat_16___sigreturn14). Yes, and from your description it should have failed to register a “sigcontext” handler in the first place. I’m looking into it. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 28, 2021, at 8:49 AM, Valery Ushakov wrote: > > On Wed, Oct 27, 2021 at 20:59:12 -0700, Jason Thorpe wrote: > >>> On Oct 27, 2021, at 4:01 PM, Jason Thorpe wrote: >>> >>> >>>> On Oct 27, 2021, at 3:44 PM, Valery Ushakov wrote: >>>> >>>> On Wed, Oct 27, 2021 at 07:50:55 -0700, Jason Thorpe wrote: >>>> >>>> I was wondering if it might be easier to not put the onus onto the >>>> caller and instead have a function that returns the interrupted >>>> ucontext (or NULL, if the pc is not in a trampoline). >>>> >>>> ucontext_t *__unwind_sigtramp(return_pc, return_sp) >>> >>> That would certainly be a nicer API. >> >> Thought about it a little more. >> >> To make this really work, we'd definitely have to version >> sigaction() so that it fully de-supported sigcontext handlers. >> Otherwise, it's a toss-up whether you have a sigcontext or a >> ucontext on the stack. > > It is ucontext for the siginfo trampoline and sigcontext for the older > one, isn't it? Sure, unless you're a platform that has never supported sigcontext, if you have COMPAT_16 disabled, or, ... We really just need to kill the sigcontext stuff completely. More to the point, we should version sigaction() before (or concurrently with) adding whatever call we add here so as to ensure that it will only ever be a ucontext. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 27, 2021, at 4:01 PM, Jason Thorpe wrote: > > >> On Oct 27, 2021, at 3:44 PM, Valery Ushakov wrote: >> >> On Wed, Oct 27, 2021 at 07:50:55 -0700, Jason Thorpe wrote: >> >> I was wondering if it might be easier to not put the onus onto the >> caller and instead have a function that returns the interrupted >> ucontext (or NULL, if the pc is not in a trampoline). >> >> ucontext_t *__unwind_sigtramp(return_pc, return_sp) > > That would certainly be a nicer API. Thought about it a little more. To make this really work, we’d definitely have to version sigaction() so that it fully de-supported sigcontext handlers. Otherwise, it’s a toss-up whether you have a sigcontext or a ucontext on the stack. I do think a __sigtramp_unwind() would also be a good addition. But I also see some value in the basic check (which you need to have to __sigtramp_unwind() anyway…). -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 27, 2021, at 3:44 PM, Valery Ushakov wrote: > > On Wed, Oct 27, 2021 at 07:50:55 -0700, Jason Thorpe wrote: > > I was wondering if it might be easier to not put the onus onto the > caller and instead have a function that returns the interrupted > ucontext (or NULL, if the pc is not in a trampoline). > > ucontext_t *__unwind_sigtramp(return_pc, return_sp) That would certainly be a nicer API. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 18, 2021, at 9:41 AM, John Marino (NetBSD) wrote: > > yes, it sounds like a __in_signal_trampoline function would work for > the GCC unwind, and I would think it would work for GDB as well. Ok, I have implemented a new function with this signature: /* * __sigtramp_check_np -- * * Non-portable function that checks if the specified program * counter value is within the signal return trampoline. Returns * the trampoline version numnber corresponding to what style of * trampoline it matches, or -1 if the program value is not within * the signal return trampoline. */ int __sigtramp_check_np(void *pc); Usage would be like: /* * If you need access to “struct sigcontext” to perform the unwind, you must * define _LIBC before including . */ #define _LIBC #include . . . int rv = __sigtramp_check_np((void *)context->return_address); if (rv == -1) { /* address was not within a signal return trampoline. */ } if (rv == __SIGTRAMP_SIGCODE_VERSION) { /* * ultra-legacy — this will never be returned for modern binaries. * in fact, the current implementation of __sigtramp_check_np() * will never return it and there is no reason to add support for it. */ } if (rv >= __SIGTRAMP_SIGCONTEXT_VERSION_MIN && rv <= __SIGTRAMP_SIGCONTEXT_VERSION_MAX) { #ifdef __HAVE_STRUCT_SIGCONTEXT /* do sigcontext stuff, distinguishing between versions, as needed */ #else /* * no sigcontext structure is available here, so none of these values * should have been returned. */ #endif } if (rv >= __SIGTRAMP_SIGINFO_VERSION_MIN && rv <= __SIGTRAMP_SIGINFO_VERSION_MAX) { /* do siginfo stuff, distinguishing between versions, as needed */ } /* * Address was within a signal return trampoline, but it’s not a version * that this program knows how to decode so *shrug*. */ For the most part, the MIN and MAX values will be the same (currently only the VAX has multiple versions of either kind of signal trampoline, and it’s for the “sigcontext” type). Again, to be clear, this can query the current process ONLY. Is this suitable for your needs? Looking at your GCC unwinder for FreeBSD (link in your original email), it seems it would work fine (that unwinder uses getpid() to pass to the sysctl). You would be able to test for the availability of __sigtramp_check_np() by testing for __SIGTRAMP_SIGCODE_VERSION being defined: #ifdef __SIGTRAMP_SIGCODE_VERSION /* * Code to check if we’re in a NetBSD signal trampoline. */ ... #endif /* __SIGTRAMP_SIGCODE_VERSION */ It’s not the prettiest API in the world, but you’re by definition diving into implementation details here, and the API does provide all the info needed to do the work. Any other comments? Christos? …and while there was a flurry of activity lately around this area, the minimal changes to support __sigtramp_check_np() are easily isolated and could be back ported to the netbsd-9 branch without much trouble. -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 18, 2021, at 9:41 AM, John Marino (NetBSD) wrote: > > yes, it sounds like a __in_signal_trampoline function would work for > the GCC unwind, and I would think it would work for GDB as well. Maybe I missed the GDB use case ... what is it that GDB needs to be able to do here? This hypothetical function would only be able to query "self", not another process, so if GDB needs to be able to do the check for another process (the debug target, presumably), then it will need to use something else (can't it use the CFI notations to do the unwind?) (Not all of the trampolines are correctly notated, but that's not a particularly tough problem to solve, either...) -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 18, 2021, at 8:41 AM, John Marino (NetBSD) wrote: > > For GCC, we've got the return address (context->ra in the unwind > programs in the original post). The "problem" is that we want to > know if that address falls on a signal trampoline frame. If NO, > return "end of stack", otherwise unwind the frame. Seems like a non-portable libc function that could check among the various candidates that libc knows about would not be terribly difficult. > A sysctl that returns an array of address pairs for all signal > trampolines in the process is what I'm requesting. Would a function like (hypothetically-named) "bool __in_signal_trampoline(uintptr_t addr);" provided by libc be workable for your use case? > If there's another way to determine if an address falls within a > signal trampoline, I'd like to see actual code to see if I can adapt > it. > Of course, the kernel team could just deny the request, but I won't be > able to fix the regression caused when the per-signal trampolines were > introduced. > Thanks, > John -- thorpej
Re: Request for implementation of KERN_PROC_SIGTRAMP sysctl
> On Oct 15, 2021, at 1:19 PM, Valery Ushakov wrote: > > On Fri, Oct 15, 2021 at 23:14:39 +0300, Valery Ushakov wrote: > >> On Fri, Oct 15, 2021 at 14:44:16 -0500, John Marino (NetBSD) wrote: >> >>> Is it possible for NetBSD to implement KERN_PROC_SIGTRAMP sysctl? >> >> It's been ages since I touched this area, but don't we have >> per-sigaction trampolines? I mean, in practice they all use the same >> __sigtramp_siginfo_$version trampoline, that sigaction passes to the >> actual syscall, but in principle the process can have different >> trampolines for different signals, can't it? >> >> struct sys___sigaction_sigtramp_args { >> syscallarg(int) signum; >> syscallarg(const struct sigaction *) nsa; >> syscallarg(struct sigaction *) osa; >> syscallarg(const void *) tramp; // <- >> syscallarg(int) vers; >> }; > > PS: We used to have a trampoline that the kernel copied out into the > process address space (bottom of the stack, iirc) - and that would be > something for KERN_PROC_SIGTRAMP to return indeed. But that was like > before netbsd 2.0, iirc. Right, that was The Very Old Way. There is an API contract between libc and the kernel vis a vis the signal trampoline, but the trampoline itself is in libc, which could get mapped anywhere. So there is not a single static address that could be returned, even per-process, because, as uwe points out, the trampoline can be specified on a per-signal basis (there is one for "siginfo" signal delivery and another for the old-style "sigcontext" signal delivery). -- thorpej
Re: Representing a rotary encoder input device
> On Sep 22, 2021, at 5:23 PM, Chris Hanson wrote: > > On Sep 22, 2021, at 7:10 AM, Jason Thorpe wrote: > >> Well, ultimately, we translate “HID report” -> “ws* input event”. Or are >> you suggesting that we should have a new interface to user-space that just >> sends HID reports? > > That sounds like a good long-term plan, IMO, given the prevalence of HID in > the modern era and the ability to map virtually everything to it. That wouldn't really bother me and seems like a really good idea, it's just work that I don't want to do :-) (A set of helper functions to generate HID reports for devices that need to translate to HID from their native protocol would of course be super helpful.) > I've though occasionally about putting together drivers for some of the > HP-HIL devices (single-dial-and-button rotary encoder, dial box with nine > rotary encoders, and button box with 32 buttons with LEDs) but haven't had > much clue how to wire them into the higher layers of the system to actually > use. Some people have all the luck getting cool HP-HIL devices! -- thorpej
Locking protocol for dev/scsipi
Is the locking protocol for the SCSI subsystem documented anywhere? Specifically, the contract between the generic SCSI layer and controller drivers? Thx. -- thorpej
Re: Representing a rotary encoder input device
> On Sep 22, 2021, at 8:43 AM, Greg Troxel wrote: > > > What do other systems do? It strikes me what wsmouse feels like it is > for things connected with the kbd/mouse/display world. To be > cantankerous, using it seems a little bit like representing a GPIO input > as a 1-button mouse that doesn't move. I haven't looked at what Linux does with rotary encoders yet, specifically. For GPIO buttons, I do know that those are translated into Linux key press events in their general input stream. We do that for buttons, too... that's what the "gpiokeys" driver is for, and these are the Devicetree bindings for it: https://github.com/devicetree-org/devicetree-source/blob/master/Bindings/input/gpio-keys.txt > I would imagine that a rotary encoder is more likely to be a volume or > level control, but perhaps not for the machine, perhaps just reported > over MQTT so Home Assistant on some other machine can deal. In my use case where the encoder is a volume control, it really is for the device it's attached to. > If you are really talking about encoders hooked to gpio, then perhaps > gpio should grow a facility to take N pins and say they are some kind of > encoder and then have a gpio encoder abstraction. Yes, that's what the Devicetree binding facilitates ... writing a driver that gathers up multiple GPIO inputs to present a higher-level device. > But maybe you are trying to use an encoder to add scroll to a 3-button > mouse? No, that's not what I'm doing. -- thorpej
Re: Representing a rotary encoder input device
> On Sep 22, 2021, at 7:52 AM, Mouse wrote: > >> Well, ultimately, we translate â??HID reportâ?? -> â??ws* input eventâ??. O$ > > Do you really want input devices to be inaccessible except through > ws* interfaces? Unless it's possible to get ws* events without needing > the whole wscons infrastructure set up (eg, on a machine with serial > console), that strikes me as a bad idea. (Keyboards and mice also > really should be accessible without all of wscons too, but that's less > likely to be a practical issue.) Well, I'm not particularly interested in having to decode each device type separately in user-space. ws* presents an abstraction from which you can read the events. There's nothing that says you have to have those things connected to a display. -- thorpej
Re: Representing a rotary encoder input device
> On Sep 21, 2021, at 10:47 PM, Michael van Elst wrote: > > thor...@me.com (Jason Thorpe) writes: > >> Trying to think about the best way to represent such a device, I guess = >> within wscons (they almost seem sort of like a 1-axis mouse, but I could = >> be convinced otherwise). > > You can make it a HID, because that's what it is. > > Currently we only expose USB hid devices and sys/dev/hid is not much > more than a framework to support these. This would need to be extended > to a real abstraction (and uhidev might then be just a subclass). Well, ultimately, we translate “HID report” -> “ws* input event”. Or are you suggesting that we should have a new interface to user-space that just sends HID reports? -- thorpej
Representing a rotary encoder input device
Hey folks... As part of a long-running hardware project I have been playing with, I'm experimenting with using a rotary encoder for input. For reference, here is the Devicetree binding for rotary encoders: https://github.com/devicetree-org/devicetree-source/blob/master/Bindings/input/rotary-encoder.txt Rotary encoders come in a bunch of different forms, of course... ranging from simple knobs with detents to devices similar to the original iPod click wheel. Some include built-in push buttons (that signal a separate GPIO from the two GPIOs used to signal the rotary control itself). Trying to think about the best way to represent such a device, I guess within wscons (they almost seem sort of like a 1-axis mouse, but I could be convinced otherwise). Personally I'm seeing my use cases teetering towards using relative events (value selection that from 0-9 that wraps around back to 0, and also volume control that simply saturates at 0 or 100). But I want to make sure I cover the absolute cases, as well. Thoughts? -- thorpej
Re: eventfd(2) and timerfd(2) APIs
> On Sep 18, 2021, at 4:59 PM, John Nemeth wrote: > > Nice. timerfd(2) is Asterisk's preferred timing source. This > should improve our support for Asterisk. As you might guess, I'm > all for the addition of these. Yes, I remember you mentioning that before, and I just double-checked to make sure that Asterisk isn’t using anything that isn’t supported by this implementation (Linux has some additional non-standard clock IDs that we don’t support). -- thorpej
Re: eventfd(2) and timerfd(2) APIs
> On Sep 18, 2021, at 3:54 PM, Jason Thorpe wrote: > >> >> | else plumbs it through either, but I'll go ahead and do so. >> >> Please do. What other things don't permit fcntl() to work? We >> should fix any of those. > > Well, I’ll fix these 2 anyway. …and I just double-checked this as well… F_GETFL and F_SETFL don’t need any specific hooks in the implementation; the flags are all handled in common code. -- thorpej
Re: eventfd(2) and timerfd(2) APIs
> On Sep 18, 2021, at 6:26 PM, Jason Thorpe wrote: > >> On Sep 18, 2021, at 5:51 PM, Robert Elz wrote: >> >> | You'll get EBADF if you try it on anything else >> >> and that's definitely wrong, on a pipe the system is allowed to return >> EINVAL, >> and that's what ought to be returned whenever the fd is not something that >> supports chmod (etc). EBADF should only be used when the process doesn't >> have the fd open. > > Oops, I may be wrong about that specific detail. I’ll double-check later. Indeed, I was mistaken, you do get EINVAL in the “good fd that’s not a vnode” case. -- thorpej
Re: eventfd(2) and timerfd(2) APIs
> On Sep 18, 2021, at 5:51 PM, Robert Elz wrote: > > | You'll get EBADF if you try it on anything else > > and that's definitely wrong, on a pipe the system is allowed to return EINVAL, > and that's what ought to be returned whenever the fd is not something that > supports chmod (etc). EBADF should only be used when the process doesn't > have the fd open. Oops, I may be wrong about that specific detail. I’ll double-check later. -- thorpej
Re: eventfd(2) and timerfd(2) APIs
> On Sep 18, 2021, at 3:35 PM, Robert Elz wrote: > >Date:Sat, 18 Sep 2021 13:21:27 -0700 >From: Jason Thorpe >Message-ID: <5e7b8a22-14c2-4dce-ace2-31552f412...@me.com> > > > | > unless the > | > .Nm > | > object was created with > | > .Dv TFD_NONBLOCK . > | > | I'm using those names, because those are the names used in the Linux API. > > It wasn't the names I was concerned about. > > | If you look at the code (it's on the thorpej-futex branch), > | you will see that they are aliases for O_NONBLOCK and O_CLOEXEC. > > That was kind of obvious anyway from the man page: > > The following flags define the behavior of the resulting object: > .Bl -tag -width "EFD_SEMAPHORE" > .It Dv EFD_CLOEXEC > Sets the > .Dv O_CLOEXEC > flag; see > .Xr open 2 > for more information. > > So: > | I will clarify this in the man page. > > probably isn't really necessary. I was more concerned with the > "unless the object was created with" - implying that if those flags > are changed later, that would be irrelevant, as it is the state at > create time that matters. That would be unfortunate indeed, but: I’ve changed the man pages to state “set for non-blocking I/O”. > | Actually, I didn't plumb fcntl through because just about nothing > > might explain part of that (though you can't avoid the ability to > alter O_CLOEXEC that way, as that's a much higher level operation). > | else plumbs it through either, but I'll go ahead and do so. > > Please do. What other things don't permit fcntl() to work? We > should fix any of those. Well, I’ll fix these 2 anyway. > | The behavior of timerfd with respect to read is documented in my man page: > > Yes, I saw that. > > | Writes to a timerfd return an error. I will clarify this in the man page. > > That would be useful. You might want to also indicate how these > descriptors are destroyed (I assume just close(2) but who knows). Yes, they’re file descriptors, so close(2) gets rid of them. Does this really need to be stated explicitly? > | > Finally, what does fstat() return about these fds? > > The one I should have asked about, but forgot, was (st_mode & _S_FMT) > Ie: what kind of object are these things pretending to be? static int timerfd_fop_stat(file_t * const fp, struct stat * const st) { struct timerfd * const tfd = fp->f_timerfd; memset(st, 0, sizeof(*st)); itimer_lock(); st->st_size = (off_t)timerfd_fire_count(tfd); st->st_atimespec = tfd->tfd_atime; st->st_mtimespec = tfd->tfd_mtime; itimer_unlock(); st->st_blksize = sizeof(uint64_t); st->st_mode = S_IFIFO | S_IRUSR | S_IWUSR; st->st_blocks = 1; st->st_birthtimespec = st->st_ctimespec = tfd->tfd_btime; st->st_uid = kauth_cred_geteuid(fp->f_cred); st->st_gid = kauth_cred_getegid(fp->f_cred); return 0; } eventfd is similar. > Since they're fd's, they can be inherited, open, by other processes > (and since the man page hints at it, probably sent through a AF_UNIX > socket), but particularly in the former case, the receiving process > needs to know (or at least be able to find out) what it is that is on > this fd it has received. > > | Of course, we don't document what these are for other kinds of descriptors, > > for many there's no need, as everything is exactly what stat(2) claims > it will be. For any where that is not true, or is insufficient, we > should be documenting it. There are, of course, not enough _S_FMT bits to describe the possible combinations. > If this was just a linux compat hack, so linux binaries could run, > then most of this wouldn't matter - the application would do whatever > linux allows it to do, and nothing actually built on NetBSD would > ever care. > > But if these are to be full NetBSD interfaces, they need to be > both complete (and sane) and properly documented. That means > which of the f*() interfaces (fstat, fchmod, fchown, ...) work, Actually, fchmod(), fchown(), etc. only work on DTYPE_VNODE descriptors. You’ll get EBADF if you try it on anything else (look for any place that calls fd_getvnode()). -- thorpej
Re: eventfd(2) and timerfd(2) APIs
> On Sep 18, 2021, at 12:17 PM, Robert Elz wrote: > >Date:Sat, 18 Sep 2021 10:26:29 -0700 >From: Jason Thorpe >Message-ID: <986563ad-88c2-41b9-bf69-51b26240b...@me.com> > > > |https://www.netbsd.org/~thorpej/timerfd.2 > > This one contains duplicated text... > > Because they are associated with a file descriptor, they may be passed > to other processes, inherited across a fork, and multiplexed using > .Xr kevent , > .Xr poll , > or > .Xr select they are associated with a file descriptor, they may be passed > to other processes, inherited across a fork, and multiplexed using > .Xr kevent 2 , > .Xr poll 2 , > or > .Xr select 2 . > > That should be fixed before anything is committed. Thanks, fixed. > Apart from that both man pages contain text like > > unless the > .Nm > object was created with > .Dv TFD_NONBLOCK . I’m using those names, because those are the names used in the Linux API. If you look at the code (it’s on the thorpej-futex branch), you will see that they are aliases for O_NONBLOCK and O_CLOEXEC. I will clarify this in the man page. > Since these things are working with file descriptors, I assume that > fcntl(2) can be used to manipulate flags like O_NONBLOCK and O_CLOEXEC > in which case I would guess (and hope) that the state of those flags when the > object was created isn't what is releant, but the state of the flags at > the time of the operation concerned. Actually, I didn’t plumb fcntl through because just about nothing else plumbs it through either, but I’ll go ahead and do so. > The man pages should probably be reworded with that in mind. > > The exact relationships of the {event,timer}fd_*() functions > and read()/write() is also not clear to me - are those just wrappers > around read/write or are they distinct sys calls of their own? In the case of eventfd_read() and eventfd_write(), those are in fact just wrappers around read() and write(), they’re implemented in libc, and they’re provided only because Linux also provides them and I was aiming for API compatibility. In the case of timerfd, Linux does not provide a timerfd_read() wrapper, so I also did not. timerfd_gettime() and timerfd_settime() are not wrappers around anything. They are themselves system calls, just as they are on Linux. > I initially assumed the former, but then I see that timerfd_settimer() > has an extra flags arg, which write() (I presume) has no easy way to > pass in, so now I am not sure. > > If these are distinct operations how to actual read()/write() interact? The behavior of timerfd with respect to read is documented in my man page: Each time a timerfd timer expires, an internal counter is incremented. Reads return the value of this counter as an unsigned 64-bit integer and reset the counter to 0. If the value of the counter is 0, then reads will block, unless the timerfd object was created with TFD_NONBLOCK. Writes to a timerfd return an error. I will clarify this in the man page. > Finally, what does fstat() return about these fds? What is the dev_t ? > What is the inode number, is the link count meaningfil, how about the > uid and permissions?And what affects the time fields? For timerfd: struct timespec tfd_btime; /* time created */ struct timespec tfd_mtime; /* last timerfd_settime() */ struct timespec tfd_atime; /* last read */ For eventfd: struct timespec efd_btime; /* time created */ struct timespec efd_mtime; /* last write */ struct timespec efd_atime; /* last read */ Of course, we don’t document what these are for other kinds of descriptors, so I didn’t spend a lot of time documenting it for these. It certainly might be a nice idea to fully document the stat info for every descriptor type in the system, but I don’t think the lack of that information (for which there is no standardized format, it seems, since no other descriptor types seem to document it) should be considered a blocker for adding these calls. -- thorpej
Re: eventfd(2) and timerfd(2) APIs
(Yes, this is a re-run … but I’m ready to merge it if there are no objections, so…) > On Sep 18, 2021, at 10:26 AM, Jason Thorpe wrote: > > Last year, I wrote implementations of the Linux eventfd(2) and timerfd(2) > interfaces for NetBSD, with the goal of improving our Linux emulation. In > order to be able to test them with ATF tests, I went ahead and made them > native calls as well. > > Here are the man pages describing the interfaces: > > https://www.netbsd.org/~thorpej/eventfd.2 > https://www.netbsd.org/~thorpej/timerfd.2 > > Any objections to adding these? > > -- thorpej > -- thorpej
eventfd(2) and timerfd(2) APIs
Last year, I wrote implementations of the Linux eventfd(2) and timerfd(2) interfaces for NetBSD, with the goal of improving our Linux emulation. In order to be able to test them with ATF tests, I went ahead and made them native calls as well. Here are the man pages describing the interfaces: https://www.netbsd.org/~thorpej/eventfd.2 https://www.netbsd.org/~thorpej/timerfd.2 Any objections to adding these? -- thorpej
Re: General device properties API
> On Sep 12, 2021, at 9:14 AM, Taylor R Campbell > wrote: > >> Date: Sun, 12 Sep 2021 08:57:07 -0700 >> From: Jason Thorpe >> >>> On Sep 12, 2021, at 8:17 AM, Jason Thorpe wrote: >>> >>> Doing this with symbols is a mess. >> >> Here's a way to basically get most of what you want without >> referencing symbols: > > Now the linker doesn't detect namespace collisions. So if two > different subsystems in different modules take the same name you might > silently get runtime memory corruption. I think we’re at a point here where we disagree on whether or not this will be an actual problem. I don’t believe it will be an actual problem and is therefore not worth the extra module management complexity (which isn’t free from a “space” perspective because each module has to have code to handle loading and unloading). We’re in engineering trade-off land here, and I think you’re making the wrong one. In any case, here is a diff that adds the argument structure type checking. I also added an awk program to generate the argument structures and all of the call binding stuff automatically from an interface description file. So, if someone wants to deal with the additional module management complexity, they can change gendevcalls.awk, re-gen the header files, and all of the call sites will be updated automatically. https://www.netbsd.org/~thorpej/device-call-typing-diffs.txt -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 6:16 PM, David Holland wrote: > > On Sat, Sep 11, 2021 at 03:43:24PM -0700, Jason Thorpe wrote: >> The basic idea involves the device call mechanism I introduced a while ago. > > You still haven't explained why that needs to be untyped and based on > uncheckable string constants. I proposed something for type checking. However, there is an engineering trade-off disagreement about the requirement to reference symbols. In any case, I consider that to be an implementation detail that is unrelated to the meat of this discussion. -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 2:49 PM, Tobias Nygren wrote: > > On Sun, 12 Sep 2021 12:30:14 -0700 > Jason Thorpe wrote: > >> Ok, sparc64 should be good do go now. I made a bunch of fake fixup entries >> for the Qemu machine and, after a couple of additional fixes, verified that >> the machinery is working as expected. > > Gets further but hits another problem. > bus mutex uninitialized or points at wrong data? > > iic0 at pcfiic0: I2C bus > cpu0: data fault: pc=105f440 rpc=143e594 addr=1afc000 > kernel trap 6c: +fast data access protection > Stopped in pid 0.0 (system) at netbsd:_lock_cas: casxa 0x80, > %o1, %o2 > db{0}> bt > iic_acquire_bus(1afd088, 8, 2003dac, 1af0400, 1afd098, 10) at > netbsd:iic_acquire_bus+0x94 > spdmem_i2c_match(1075dfa80, 1c66f00, 2004710, 1aecc00, 73, 1c60b98) at > netbsd:spdmem_i2c_match+0xa4 > config_match(1075dfa80, 1c66f00, 2004710, 2004710, 20, 0) at > netbsd:config_match+0x38 > [rest of trace omitted] Ok, I'm a little confused by this one. pcfiic_attach() allocates a single channel structure if the ebus front-end did not already do so. And then for either case, it then enumerates the channels and calls iic_tag_init(>ch_i2c) for the tag on each one, which initializes the bus mutex. Hm, but I spotted another bug ... I didn't update pcfiic_i2c_exec() for the channel split. That could be clobbering something. I just pushed an update to dev/ic/pcf8584.c -- can you give it a spin? -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 10:26 AM, Jason Thorpe wrote: > > > >> On Sep 12, 2021, at 7:37 AM, Tobias Nygren wrote: >> >> iic3 at rkiic3: I2C bus >> typec-portc (fcs,fusb302) at iic3 addr 0x22 not configured >> panic: kernel diagnostic assertion "rv" failed: file >> "/usr/src/sys/dev/i2c/i2c.c", line 630 >> cpu0: Begin traceback... >> trace fp c0f784e0 >> fp c0f78510 vpanic() at c053bd5c netbsd:vpanic+0x14c >> fp c0f78570 kern_assert() at c079f928 netbsd:kern_assert+0x58 >> fp c0f78600 iic_enumerate_devices_callback() at c03448f4 >> netbsd:iic_enumerate_devices_callback+0xd4 >> fp c0f78670 of_i2c_enumerate_devices() at c06483cc >> netbsd:of_i2c_enumerate_devices+0x10c >> fp c0f78730 device_call() at c0523e60 netbsd:device_call+0x90 > > Huh, ok, I’ll take a look. I’ll craft a scenario where my DT has a “not > configured” device to reproduce the problem. > > Thanks for giving it a spin. I’ll follow up shortly. Ok, I reproduced it and found the problem pretty quickly (an errant "return false;" that was a paste-o). [ 1.000] bsciic2 at simplebus1: Broadcom Serial Controller [ 1.000] bsciic2: interrupting on icu irq 53 [ 1.000] iic2 at bsciic2: I2C bus [ 1.000] power-management (netbsd,uugear-wittypi3) at iic2 addr 0x69 not configured [ 1.000] /soc/vec@7e806000 at simplebus1 not configured [ 1.000] dwctwo0 at simplebus1: USB controller [ 1.000] dwctwo0: interrupting on icu irq 9 -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 11:32 AM, Jason Thorpe wrote: > > >> On Sep 12, 2021, at 10:27 AM, Jason Thorpe wrote: >> >> Huh. Ok, this didn’t happen in qemu, but I’ll take a look today. > > Oh, yes it does. This is almost certainly due to a small tweak I made > locally after I tested it the first time. I am a dum-dum. Fixing now. Ok, sparc64 should be good do go now. I made a bunch of fake fixup entries for the Qemu machine and, after a couple of additional fixes, verified that the machinery is working as expected. -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 10:27 AM, Jason Thorpe wrote: > > Huh. Ok, this didn’t happen in qemu, but I’ll take a look today. Oh, yes it does. This is almost certainly due to a small tweak I made locally after I tested it the first time. I am a dum-dum. Fixing now. -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 5:31 AM, Tobias Nygren wrote: > > sparc64 kernel built from thorpej-i2c-spi-conf2 crashed early: > > r...@netra240.rymdfartsverket.se:/usr/obj/sys/arch/sparc64/compile/GENERIC.netra240-debug > total memory = 7168 MB > avail memory = 7023 MB > mainbus0 (root)cpu0: data fault: pc=17b7754 rpc=165c3ec addr=0 > kernel trap 30: data access exception > Stopped in pid 0.0 (system) at netbsd:strcmp+0x14: ldub[%o0 > + %g1], %g3 > db{0}> bt > device_compatible_lookup(1c7df08, 1, 17deb08, 0, 1060c43cc, 17deb08) at > netbsd:device_compatible_lookup+0x4c > sparc64_device_tree_fixup(1060c4380, 0, 2005b08, 1ae7800, 1c82400, 1c82400) > at netbsd:sparc64_device_tree_fixup+0x8c Huh. Ok, this didn’t happen in qemu, but I’ll take a look today. -- thorpej
Re: Overhaul of I2C and SPI device autoconfiguration
> On Sep 12, 2021, at 7:37 AM, Tobias Nygren wrote: > > iic3 at rkiic3: I2C bus > typec-portc (fcs,fusb302) at iic3 addr 0x22 not configured > panic: kernel diagnostic assertion "rv" failed: file > "/usr/src/sys/dev/i2c/i2c.c", line 630 > cpu0: Begin traceback... > trace fp c0f784e0 > fp c0f78510 vpanic() at c053bd5c netbsd:vpanic+0x14c > fp c0f78570 kern_assert() at c079f928 netbsd:kern_assert+0x58 > fp c0f78600 iic_enumerate_devices_callback() at c03448f4 > netbsd:iic_enumerate_devices_callback+0xd4 > fp c0f78670 of_i2c_enumerate_devices() at c06483cc > netbsd:of_i2c_enumerate_devices+0x10c > fp c0f78730 device_call() at c0523e60 netbsd:device_call+0x90 Huh, ok, I’ll take a look. I’ll craft a scenario where my DT has a “not configured” device to reproduce the problem. Thanks for giving it a spin. I’ll follow up shortly. -- thorpej
Re: General device properties API
> On Sep 12, 2021, at 8:17 AM, Jason Thorpe wrote: > > Doing this with symbols is a mess. Here’s a way to basically get most of what you want without referencing symbols: struct device_call_generic { const char *name; void *args; }; int device_call_generic(device_t, const struct device_call_generic *); struct i2c_enumerate_devices_args { ... }; union i2c_enumerate_devices_binding { struct device_call_generic generic; struct { const char *name; struct i2c_enumerate_devices_args *args; } binding; }; #define I2C_ENUMERATE_DEVICES(_args_) \ &((const union i2c_enumerate_devices_binding){ \ .binding.name = "i2c-enumerate-devices",\ .binding.args = (_args_), \ }) #define device_call(dev, call) \ device_call_generic((dev), (call)->generic) struct i2c_attach_args ia = { .ia_tag = ic, }; struct i2c_enumerate_devices_args enumargs = { .ia = , .callback = iic_enumerate_devices_callback, }; rv = device_call(dev, I2C_ENUMERATE_DEVICES()); -- thorpej