Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 12, 2022, at 7:54 AM, Jason Thorpe wrote: > > The following patch rectifies this situation by having klist_fini() traverse > a list of knotes and substitute their filterops with no-op stubs. This > requires synchronizing with any calls into the filterops themselves. We have > convenient funnel-points for this in kern_event.c already, so it’s not > particularly difficult… the main issue is that the filterops themselves are > allowed to block. My solution to this was to have a single global rwlock, > knote_filterops_lock, that’s acquired for reading when a call though a > knote's filterops is made, and in klist_fini() is acquired for writing for > the short duration needed to stub out filterops. This lock should **very > rarely** be contended, but it will be *hot* (lots of acquire-for-read), so it > gets its own cache line. > > If someone has ideas about another synchronization mechanism, I’m all ears… > again, the main issue is that the filterops calls themselves are allowed to > block, so I’m not sure that any of our passive serialization mechanisms would > work here. FWIW, another alternative is to put the rwlock into the knote itself, if we think that lock will be Too Hot (which is not an unreasonable concern on a machine with many cores). The trade-off is memory. I could see putting the lock in the knote itself on amd64 and aarch64 (where we would expect to see high core count machines) and a global lock on everything else. Then again, on amd64, struct knote is 136 bytes, so it already spills into the kmem-00192 bucket, and on i386 it’s 84 bytes, so kmem-00128 bucket. I guess each of those could easily accommodate an additional pointer-sized struct member to alleviate any scalability concern about a global rwlock. -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 12, 2022, at 10:27 AM, Jason Thorpe wrote: > > >> On Jul 12, 2022, at 7:54 AM, Jason Thorpe wrote: >> >> If someone has ideas about another synchronization mechanism, I’m all ears… >> again, the main issue is that the filterops calls themselves are allowed to >> block, so I’m not sure that any of our passive serialization mechanisms >> would work here. >> >> > > Well, for some reason, I mistakenly posted an old version of this patch. > I’ll post a refresh later today. Oh, no, never mind, I’m wrong about this :-). NOTHING TO SEE HERE *whistles* -- thorpej
Re: Problem with outstanding knotes and device detach - and a fix
> On Jul 12, 2022, at 7:54 AM, Jason Thorpe wrote: > > If someone has ideas about another synchronization mechanism, I’m all ears… > again, the main issue is that the filterops calls themselves are allowed to > block, so I’m not sure that any of our passive serialization mechanisms would > work here. > > Well, for some reason, I mistakenly posted an old version of this patch. I’ll post a refresh later today. -- thorpej
Fix for PR kern/56713
When I changed vnode kqueue events to be posted from common code at the VFS layer, I introduced a regression in sending those events on stacked file systems like nullfs. The root cause of the problem is that when knotes are attached to the vnode, that operation (by way of the VOP bypass mechanism in layerfs) drills all the way down to the bottom of the vnode stack and attaches there (as intended), but with event delivery being handled now in vnode_if.c, it’s the upper vnode that’s being worked with, and no knotes are attached there. These stacked vnodes already share a bit of state with the bottom layer… (locks, etc.), so my solution to this problem is to allow these stacked vnodes to also refer to the bottom vnode’s klist. Each vnode always starts out referring to its own, but as a layerfs vnode is constructed, it is told to refer to the lower vnode's klist (just as it refers to the lower vnode’s interlock and uvm_obj lock). Various assertions are included to ensure that knotes are never attached to a vnode that refers to another vnode’s klist (they should always be attached to the bottom-most vnode). (After applying the patch, vnode_if.c must be rebuilt.) Index: sys/fs/union/union_subr.c === RCS file: /cvsroot/src/sys/fs/union/union_subr.c,v retrieving revision 1.81 diff -u -p -r1.81 union_subr.c --- sys/fs/union/union_subr.c 19 Mar 2022 13:53:32 - 1.81 +++ sys/fs/union/union_subr.c 12 Jul 2022 00:17:28 - @@ -232,10 +232,11 @@ union_newupper(struct union_node *un, st unlock_ap.a_desc = VDESC(vop_unlock); unlock_ap.a_vp = UNIONTOV(un); genfs_unlock(&unlock_ap); - /* Update union vnode interlock & vmobjlock. */ + /* Update union vnode interlock, vmobjlock, & klist. */ vshareilock(UNIONTOV(un), uppervp); rw_obj_hold(uppervp->v_uobj.vmobjlock); uvm_obj_setlock(&UNIONTOV(un)->v_uobj, uppervp->v_uobj.vmobjlock); + vshareklist(UNIONTOV(un), uppervp); mutex_exit(&un->un_lock); if (ohash != nhash) { LIST_INSERT_HEAD(&uhashtbl[nhash], un, un_cache); @@ -577,6 +578,7 @@ union_loadvnode(struct mount *mp, struct vshareilock(vp, svp); rw_obj_hold(svp->v_uobj.vmobjlock); uvm_obj_setlock(&vp->v_uobj, svp->v_uobj.vmobjlock); + vshareklist(vp, svp); /* detect the root vnode (and aliases) */ if ((un->un_uppervp == um->um_uppervp) && Index: sys/miscfs/genfs/layer_vfsops.c === RCS file: /cvsroot/src/sys/miscfs/genfs/layer_vfsops.c,v retrieving revision 1.54 diff -u -p -r1.54 layer_vfsops.c --- sys/miscfs/genfs/layer_vfsops.c 23 Feb 2020 15:46:41 - 1.54 +++ sys/miscfs/genfs/layer_vfsops.c 12 Jul 2022 00:17:28 - @@ -205,10 +205,11 @@ layerfs_loadvnode(struct mount *mp, stru xp = kmem_alloc(lmp->layerm_size, KM_SLEEP); - /* Share the interlock and vmobjlock with the lower node. */ + /* Share the interlock, vmobjlock, and klist with the lower node. */ vshareilock(vp, lowervp); rw_obj_hold(lowervp->v_uobj.vmobjlock); uvm_obj_setlock(&vp->v_uobj, lowervp->v_uobj.vmobjlock); + vshareklist(vp, lowervp); vp->v_tag = lmp->layerm_tag; vp->v_type = lowervp->v_type; Index: sys/kern/vfs_vnode.c === RCS file: /cvsroot/src/sys/kern/vfs_vnode.c,v retrieving revision 1.143 diff -u -p -r1.143 vfs_vnode.c --- sys/kern/vfs_vnode.c9 Apr 2022 23:45:45 - 1.143 +++ sys/kern/vfs_vnode.c12 Jul 2022 00:17:28 - @@ -457,7 +457,8 @@ vnalloc_marker(struct mount *mp) vp->v_mount = mp; vp->v_type = VBAD; vp->v_interlock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE); - klist_init(&vp->v_klist); + klist_init(&vip->vi_klist.vk_klist); + vp->v_klist = &vip->vi_klist; vip->vi_state = VS_MARKER; return vp; @@ -475,7 +476,7 @@ vnfree_marker(vnode_t *vp) KASSERT(vip->vi_state == VS_MARKER); mutex_obj_free(vp->v_interlock); uvm_obj_destroy(&vp->v_uobj, true); - klist_fini(&vp->v_klist); + klist_fini(&vip->vi_klist.vk_klist); pool_cache_put(vcache_pool, vip); } @@ -1391,7 +1392,8 @@ vcache_alloc(void) vp->v_interlock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE); uvm_obj_init(&vp->v_uobj, &uvm_vnodeops, true, 1); - klist_init(&vp->v_klist); + klist_init(&vip->vi_klist.vk_klist); + vp->v_klist = &vip->vi_klist; cv_init(&vp->v_cv, "vnode"); cache_vnode_init(vp); @@ -1453,7 +1455,9 @@ vcache_free(vnode_impl_t *vip) mutex_obj_free(vp->v_interlock); rw_destroy(&vip->vi_lock); uvm_obj_destroy(&vp->v_uobj, true); - klist_fini(&vp->v_klist); + KASSERT(vp->v_klist == &vip->vi_klist || + S
Problem with outstanding knotes and device detach - and a fix
Background: kqueue events are represented by a “knote” that is kept on a list of knotes (a klist) by its respective driver (usually indirectly in the “selinfo” structure). A similar situation exists for vnodes. Then, when activated, they are queued onto the kqueue’s active note list. The driver gets called to attach the knote in a generic fashion, but then provides its own ops vector for subsequent operations, including “detach”. When a file descriptor for a given knote is closed, this driver-supplied “detach” function is called to remove it from whatever driver-specific klist the knote is on. When a driver detaches, it frees its softc structure, which includes the selinfo / klist. This list might still have notes on it… but that isn’t inherently problematic, because the driver owns the list and no one will ever traverse it again — the generic kqueue code doesn’t care about it. But it is a problem when an outstanding file descriptor with a registered kqueue event (e.g. EVFILT_READ) for that device that may have been open when the device was detached. In this case, when the file descriptor is closed, the knote, which still points to the now-detached device’s filterops, will call the “detach” operation. For a driver that is still loaded into the kernel, we call code that’s still valid, but a use-after-free condition exists where the now-freed softc structure will be accessed to unlink the knote from the klist. The situation is worse for modular drivers that may have been unloaded after the device detaches — jumping through function pointers to code that no longer exists is bad, m’kay? The following patch rectifies this situation by having klist_fini() traverse a list of knotes and substitute their filterops with no-op stubs. This requires synchronizing with any calls into the filterops themselves. We have convenient funnel-points for this in kern_event.c already, so it’s not particularly difficult… the main issue is that the filterops themselves are allowed to block. My solution to this was to have a single global rwlock, knote_filterops_lock, that’s acquired for reading when a call though a knote's filterops is made, and in klist_fini() is acquired for writing for the short duration needed to stub out filterops. This lock should **very rarely** be contended, but it will be *hot* (lots of acquire-for-read), so it gets its own cache line. If someone has ideas about another synchronization mechanism, I’m all ears… again, the main issue is that the filterops calls themselves are allowed to block, so I’m not sure that any of our passive serialization mechanisms would work here. Index: kern/kern_event.c === RCS file: /cvsroot/src/sys/kern/kern_event.c,v retrieving revision 1.141 diff -u -p -r1.141 kern_event.c --- kern/kern_event.c 24 May 2022 20:50:19 - 1.141 +++ kern/kern_event.c 11 Jul 2022 20:53:40 - @@ -133,6 +133,31 @@ static const struct fileops kqueueops = .fo_restart = kqueue_restart, }; +static void +filt_nopdetach(struct knote *kn __unused) +{ +} + +static int +filt_nopevent(struct knote *kn __unused, long hint __unused) +{ + return 0; +} + +static const struct filterops nop_fd_filtops = { + .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + +static const struct filterops nop_filtops = { + .f_flags = FILTEROP_MPSAFE, + .f_attach = NULL, + .f_detach = filt_nopdetach, + .f_event = filt_nopevent, +}; + static const struct filterops kqread_filtops = { .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE, .f_attach = NULL, @@ -232,6 +257,9 @@ static size_t user_kfiltersz; /* size * -> object lock (e.g., device driver lock, &c.) * -> kn_kq->kq_lock * + * knote_filterops_lock (rw) + * -> whatever will be acquired in filter_*() + * * Locking rules: * * f_attach: fdp->fd_lock, KERNEL_LOCK @@ -279,6 +307,24 @@ static size_t user_kfiltersz; /* size */ static krwlock_t kqueue_filter_lock; /* lock on filter lists */ +/* + * Calls into the filterops need to be resilient against things which + * destroy a klist, e.g. device detach, freeing a vnode, etc., to avoid + * chasing garbage pointers (to data, or even potentially code in a + * module about to be unloaded). To that end, we acquire the + * knote_filterops_lock as a reader whenver calling into the filter + * ops (in the filter_*() functions). When a driver (or anything else) + * is tearing down its klist, klist_fini() acquires knote_filterops_lock + * as a writer, and replaces all of the filterops in each knote with a + * stub, allowing knote detach (when descriptors are closed) to safely + * proceed. + * + * This lock should be very rarely *contended* (klist tear-down is fairly + * rare), but i
Re: Language-neutral interface specifications (research)
> I am building an Ada compiler as a hobby project. [...] native code > for various platforms and architectures, but to help with > bootstrapping, I also want it to be able to generate portable > C++11/POSIX code. This means I have to think about the difference > between API and ABI on POSIX platforms. Well, if you want to generate C++ code and machine code for the same platform, then, yes, you will need to know how the C++ API maps into the underlying ABI. > As I see it, there are three levels of interface definition to consider: > 1. The API level, [...] > 2. The Pure-C level. [...] I don't see why this one is relevant. It exists, to the extent that it _does_ exist, only as an intermediate step during the conversion of API level 1 code to machine code. As you point out, it may not even be C. > 3. The ABI level. [...] Documents like the System V ABI describe how > Pure-C is translated to machine code calls. I thought they worked more at the level-1 API level: "a struct timeval looks like this in memory..." without any specification of what it looks like at the (as you say, largely fictional) pure-C level. > I am interested in an Interface Description Language that can be used > to: > a. Define the source-level API in a way that is detached from the >C language. This can be done, but only by attaching it to some other language instead. It doesn't even make sense to talk about an API without it being an API for some specific language; an API is an Application _Programming_ Interface, and you can't program without programming in some language. > c. Define data flow better than C allows: The struct stat* >argument to stat() is only meant to move data out of the >function. The pathname pointer isn't captured by the function >call. Doesn't the restrict qualifier get you at least part of this? > d. Describe how the API level gets translated to the Pure-C level >or similar. Why bother with this? The Pure-C level is often fictional and doesn't really matter anyway; for example, I can't see that it matters whether (for example) time_t is int, long, or __builtin_time_t; what matters is (a) what the API code looks like and (b) what it turns into in terms of bits in memory, registers, and syscall traps, and that can be described without needing to refer to any pure-C level anything. It sounds to me as though you want an ABI description language, not an API description language. A C - or C++, or any other language except Ada - API is of minimal value for you, except (a) to the extent that an ABI can be inferred from it or (b) to the extent that you are generating code in that "other language". Indeed, one could argue that an ABI is, essentially, an API for the machine-code level. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Language-neutral interface specifications (research)
I saw this project on the wiki, and it reminds me of a problem I have been trying to understand better: https://wiki.netbsd.org/projects/project/language-neutral-interfaces/ I am a retired compiler engineer. I used to work on LLVM and other compilers, including LLVM's code generator and register allocators. I brought up the sparc64 support and implemented the System V ABI for that architecture in Clang and LLVM. I haven't contributed directly to NetBSD before, but my name appears once or twice in src/external. I am building an Ada compiler as a hobby project. I realize this is a huge undertaking that I will probably never finish. It's fun anyway. The compiler will generate native code for various platforms and architectures, but to help with bootstrapping, I also want it to be able to generate portable C++11/POSIX code. This means I have to think about the difference between API and ABI on POSIX platforms. As I see it, there are three levels of interface definition to consider: 1. The API level, or source code compatibility level. Standards like POSIX tend to describe interfaces in terms of what your C source code should look like: #include #include int check_dir(const char *pathname) { struct stat buffer; if (stat(pathname, &buffer) != 0) return errno; if (S_ISDIR(buffer.st_mode)) return 0; else return ENOTDIR; } POSIX doesn't specify the exact contents of struct stat nor the value of ENOTDIR. It only defines that you can use those symbols and struct members in C code after including the right headers. 2. The Pure-C level. The C compiler lowers the C code to something self-contained. It will: - Run the preprocessor, - Expand typedefs, and - Expand inline functions. (I'm not saying compilers work this way, but they work as-if this happened). This leaves code consisting of only C primitives: struct stat { int st_mode; ... }; extern __tls int errno; int check_dir(const char *pathname) { struct stat buffer; if (stat(pathname, &buffer) != 0) return errno; if ((buffer.st_mode & 017) == 004) return 0; else return 20; } This Pure-C code is not portable. There are differences between platforms, architectures, and even some compiler flags can affect this code. Note that Pure-C also doesn't have to be standard C. It is common to use vendor-specific extensions like __attr__ to get the right alignments and calling conventions. I see NetBSD sources are using a __RENAME macro to change linkage names. 3. The ABI level. Documents like the System V ABI describe how Pure-C is translated to machine code calls. This includes size and alignment of primitive types, layout of structs, and how arguments and return values are passed in function calls. The translation from Pure-C to ABI is a pretty well understood problem. There are ABI documents describing the standard stuff, and you may have to deal with a few compiler extensions for alignment, SIMD types, and alternate calling conventions. This is all stuff that compilers need to deal with anyway. The translation from API level to Pure-C level is more difficult (for me, anyway). It seems like you basically have to run C code through the system C compiler to make sure you covered all the corner cases. It is not very satisfying for an Ada compiler to have to depend on the system C compiler in order to generate binary code that interacts with the system. Ada actually has a standardized foreign function interface that can be used to interface with C and Fortran. The problem is that it interfaces to the Pure-C level, not the API level. I can't use it to call stat() in a portable way. I am interested in an Interface Description Language that can be used to: a. Define the source-level API in a way that is detached from the C language. b. Define stronger types than C allows: S_ISDIR() is only supposed to work on a mode_t returned from stat(), not any old integer. c. Define data flow better than C allows: The struct stat* argument to stat() is only meant to move data out of the function. The pathname pointer isn't captured by the function call. d. Describe how the API level gets translated to the Pure-C level or similar. This is different for different platforms and architectures. This IDL would make it possible for me to generate portable Ada bindings for POSIX and other APIs without having to rely on the system C compiler. I think it could also be useful for a project like NetBSD to be able to track source compatibility and binary compatibility individually. I haven't done anything concrete yet, and I agree that it is a good idea to research prior art. There is a lot of it: - Wikipedia as a long list of IDLs: https://en.wikipedia.org/wiki/Interface_description_language - CORB