Re: Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe


> On Jul 12, 2022, at 7:54 AM, Jason Thorpe  wrote:
> 
> The following patch rectifies this situation by having klist_fini() traverse 
> a list of knotes and substitute their filterops with no-op stubs.  This 
> requires synchronizing with any calls into the filterops themselves.  We have 
> convenient funnel-points for this in kern_event.c already, so it’s not 
> particularly difficult… the main issue is that the filterops themselves are 
> allowed to block.  My solution to this was to have a single global rwlock, 
> knote_filterops_lock, that’s acquired for reading when a call though a 
> knote's filterops is made, and in klist_fini() is acquired for writing for 
> the short duration needed to stub out filterops.  This lock should **very 
> rarely** be contended, but it will be *hot* (lots of acquire-for-read), so it 
> gets its own cache line.
> 
> If someone has ideas about another synchronization mechanism, I’m all ears… 
> again, the main issue is that the filterops calls themselves are allowed to 
> block, so I’m not sure that any of our passive serialization mechanisms would 
> work here.

FWIW, another alternative is to put the rwlock into the knote itself, if we 
think that lock will be Too Hot (which is not an unreasonable concern on a 
machine with many cores).  The trade-off is memory.  I could see putting the 
lock in the knote itself on amd64 and aarch64 (where we would expect to see 
high core count machines) and a global lock on everything else.  Then again, on 
amd64, struct knote is 136 bytes, so it already spills into the kmem-00192 
bucket, and on i386 it’s 84 bytes, so kmem-00128 bucket.  I guess each of those 
could easily accommodate an additional pointer-sized struct member to alleviate 
any scalability concern about a global rwlock.

-- thorpej




Re: Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe


> On Jul 12, 2022, at 10:27 AM, Jason Thorpe  wrote:
> 
> 
>> On Jul 12, 2022, at 7:54 AM, Jason Thorpe  wrote:
>> 
>> If someone has ideas about another synchronization mechanism, I’m all ears… 
>> again, the main issue is that the filterops calls themselves are allowed to 
>> block, so I’m not sure that any of our passive serialization mechanisms 
>> would work here.
>> 
>> 
> 
> Well, for some reason, I mistakenly posted an old version of this patch.  
> I’ll post a refresh later today.

Oh, no, never mind, I’m wrong about this :-). NOTHING TO SEE HERE *whistles*

-- thorpej




Re: Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe


> On Jul 12, 2022, at 7:54 AM, Jason Thorpe  wrote:
> 
> If someone has ideas about another synchronization mechanism, I’m all ears… 
> again, the main issue is that the filterops calls themselves are allowed to 
> block, so I’m not sure that any of our passive serialization mechanisms would 
> work here.
> 
> 

Well, for some reason, I mistakenly posted an old version of this patch.  I’ll 
post a refresh later today.

-- thorpej




Fix for PR kern/56713

2022-07-12 Thread Jason Thorpe
When I changed vnode kqueue events to be posted from common code at the VFS 
layer, I introduced a regression in sending those events on stacked file 
systems like nullfs.  The root cause of the problem is that when knotes are 
attached to the vnode, that operation (by way of the VOP bypass mechanism in 
layerfs) drills all the way down to the bottom of the vnode stack and attaches 
there (as intended), but with event delivery being handled now in vnode_if.c, 
it’s the upper vnode that’s being worked with, and no knotes are attached there.

These stacked vnodes already share a bit of state with the bottom layer… 
(locks, etc.), so my solution to this problem is to allow these stacked vnodes 
to also refer to the bottom vnode’s klist.  Each vnode always starts out 
referring to its own, but as a layerfs vnode is constructed, it is told to 
refer to the lower vnode's klist (just as it refers to the lower vnode’s 
interlock and uvm_obj lock).  Various assertions are included to ensure that 
knotes are never attached to a vnode that refers to another vnode’s klist (they 
should always be attached to the bottom-most vnode).

(After applying the patch, vnode_if.c must be rebuilt.)

Index: sys/fs/union/union_subr.c
===
RCS file: /cvsroot/src/sys/fs/union/union_subr.c,v
retrieving revision 1.81
diff -u -p -r1.81 union_subr.c
--- sys/fs/union/union_subr.c   19 Mar 2022 13:53:32 -  1.81
+++ sys/fs/union/union_subr.c   12 Jul 2022 00:17:28 -
@@ -232,10 +232,11 @@ union_newupper(struct union_node *un, st
unlock_ap.a_desc = VDESC(vop_unlock);
unlock_ap.a_vp = UNIONTOV(un);
genfs_unlock(&unlock_ap);
-   /* Update union vnode interlock & vmobjlock. */
+   /* Update union vnode interlock, vmobjlock, & klist. */
vshareilock(UNIONTOV(un), uppervp);
rw_obj_hold(uppervp->v_uobj.vmobjlock);
uvm_obj_setlock(&UNIONTOV(un)->v_uobj, uppervp->v_uobj.vmobjlock);
+   vshareklist(UNIONTOV(un), uppervp);
mutex_exit(&un->un_lock);
if (ohash != nhash) {
LIST_INSERT_HEAD(&uhashtbl[nhash], un, un_cache);
@@ -577,6 +578,7 @@ union_loadvnode(struct mount *mp, struct
vshareilock(vp, svp);
rw_obj_hold(svp->v_uobj.vmobjlock);
uvm_obj_setlock(&vp->v_uobj, svp->v_uobj.vmobjlock);
+   vshareklist(vp, svp);
 
/* detect the root vnode (and aliases) */
if ((un->un_uppervp == um->um_uppervp) &&
Index: sys/miscfs/genfs/layer_vfsops.c
===
RCS file: /cvsroot/src/sys/miscfs/genfs/layer_vfsops.c,v
retrieving revision 1.54
diff -u -p -r1.54 layer_vfsops.c
--- sys/miscfs/genfs/layer_vfsops.c 23 Feb 2020 15:46:41 -  1.54
+++ sys/miscfs/genfs/layer_vfsops.c 12 Jul 2022 00:17:28 -
@@ -205,10 +205,11 @@ layerfs_loadvnode(struct mount *mp, stru
 
xp = kmem_alloc(lmp->layerm_size, KM_SLEEP);
 
-   /* Share the interlock and vmobjlock with the lower node. */
+   /* Share the interlock, vmobjlock, and klist with the lower node. */
vshareilock(vp, lowervp);
rw_obj_hold(lowervp->v_uobj.vmobjlock);
uvm_obj_setlock(&vp->v_uobj, lowervp->v_uobj.vmobjlock);
+   vshareklist(vp, lowervp);
 
vp->v_tag = lmp->layerm_tag;
vp->v_type = lowervp->v_type;
Index: sys/kern/vfs_vnode.c
===
RCS file: /cvsroot/src/sys/kern/vfs_vnode.c,v
retrieving revision 1.143
diff -u -p -r1.143 vfs_vnode.c
--- sys/kern/vfs_vnode.c9 Apr 2022 23:45:45 -   1.143
+++ sys/kern/vfs_vnode.c12 Jul 2022 00:17:28 -
@@ -457,7 +457,8 @@ vnalloc_marker(struct mount *mp)
vp->v_mount = mp;
vp->v_type = VBAD;
vp->v_interlock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE);
-   klist_init(&vp->v_klist);
+   klist_init(&vip->vi_klist.vk_klist);
+   vp->v_klist = &vip->vi_klist;
vip->vi_state = VS_MARKER;
 
return vp;
@@ -475,7 +476,7 @@ vnfree_marker(vnode_t *vp)
KASSERT(vip->vi_state == VS_MARKER);
mutex_obj_free(vp->v_interlock);
uvm_obj_destroy(&vp->v_uobj, true);
-   klist_fini(&vp->v_klist);
+   klist_fini(&vip->vi_klist.vk_klist);
pool_cache_put(vcache_pool, vip);
 }
 
@@ -1391,7 +1392,8 @@ vcache_alloc(void)
vp->v_interlock = mutex_obj_alloc(MUTEX_DEFAULT, IPL_NONE);
 
uvm_obj_init(&vp->v_uobj, &uvm_vnodeops, true, 1);
-   klist_init(&vp->v_klist);
+   klist_init(&vip->vi_klist.vk_klist);
+   vp->v_klist = &vip->vi_klist;
cv_init(&vp->v_cv, "vnode");
cache_vnode_init(vp);
 
@@ -1453,7 +1455,9 @@ vcache_free(vnode_impl_t *vip)
mutex_obj_free(vp->v_interlock);
rw_destroy(&vip->vi_lock);
uvm_obj_destroy(&vp->v_uobj, true);
-   klist_fini(&vp->v_klist);
+   KASSERT(vp->v_klist == &vip->vi_klist ||
+   S

Problem with outstanding knotes and device detach - and a fix

2022-07-12 Thread Jason Thorpe
Background: kqueue events are represented by a “knote” that is kept on a list 
of knotes (a klist) by its respective driver (usually indirectly in the 
“selinfo” structure).  A similar situation exists for vnodes.  Then, when 
activated, they are queued onto the kqueue’s active note list.  The driver gets 
called to attach the knote in a generic fashion, but then provides its own ops 
vector for subsequent operations, including “detach”.  When a file descriptor 
for a given knote is closed, this driver-supplied “detach” function is called 
to remove it from whatever driver-specific klist the knote is on.

When a driver detaches, it frees its softc structure, which includes the 
selinfo / klist.  This list might still have notes on it… but that isn’t 
inherently problematic, because the driver owns the list and no one will ever 
traverse it again — the generic kqueue code doesn’t care about it.

But it is a problem when an outstanding file descriptor with a registered 
kqueue event (e.g. EVFILT_READ) for that device that may have been open when 
the device was detached.  In this case, when the file descriptor is closed, the 
knote, which still points to the now-detached device’s filterops, will call the 
“detach” operation.  For a driver that is still loaded into the kernel, we call 
code that’s still valid, but a use-after-free condition exists where the 
now-freed softc structure will be accessed to unlink the knote from the klist.  
The situation is worse for modular drivers that may have been unloaded after 
the device detaches — jumping through function pointers to code that no longer 
exists is bad, m’kay?

The following patch rectifies this situation by having klist_fini() traverse a 
list of knotes and substitute their filterops with no-op stubs.  This requires 
synchronizing with any calls into the filterops themselves.  We have convenient 
funnel-points for this in kern_event.c already, so it’s not particularly 
difficult… the main issue is that the filterops themselves are allowed to 
block.  My solution to this was to have a single global rwlock, 
knote_filterops_lock, that’s acquired for reading when a call though a knote's 
filterops is made, and in klist_fini() is acquired for writing for the short 
duration needed to stub out filterops.  This lock should **very rarely** be 
contended, but it will be *hot* (lots of acquire-for-read), so it gets its own 
cache line.

If someone has ideas about another synchronization mechanism, I’m all ears… 
again, the main issue is that the filterops calls themselves are allowed to 
block, so I’m not sure that any of our passive serialization mechanisms would 
work here.

Index: kern/kern_event.c
===
RCS file: /cvsroot/src/sys/kern/kern_event.c,v
retrieving revision 1.141
diff -u -p -r1.141 kern_event.c
--- kern/kern_event.c   24 May 2022 20:50:19 -  1.141
+++ kern/kern_event.c   11 Jul 2022 20:53:40 -
@@ -133,6 +133,31 @@ static const struct fileops kqueueops = 
.fo_restart = kqueue_restart,
 };
 
+static void
+filt_nopdetach(struct knote *kn __unused)
+{
+}
+
+static int
+filt_nopevent(struct knote *kn __unused, long hint __unused)
+{
+   return 0;
+}
+
+static const struct filterops nop_fd_filtops = {
+   .f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE,
+   .f_attach = NULL,
+   .f_detach = filt_nopdetach,
+   .f_event = filt_nopevent,
+};
+
+static const struct filterops nop_filtops = {
+   .f_flags = FILTEROP_MPSAFE,
+   .f_attach = NULL,
+   .f_detach = filt_nopdetach,
+   .f_event = filt_nopevent,
+};
+
 static const struct filterops kqread_filtops = {
.f_flags = FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach = NULL,
@@ -232,6 +257,9 @@ static size_t   user_kfiltersz; /* size 
  * -> object lock (e.g., device driver lock, &c.)
  * -> kn_kq->kq_lock
  *
+ * knote_filterops_lock (rw)
+ * -> whatever will be acquired in filter_*()
+ *
  * Locking rules:
  *
  * f_attach: fdp->fd_lock, KERNEL_LOCK
@@ -279,6 +307,24 @@ static size_t  user_kfiltersz; /* size 
  */
 static krwlock_t   kqueue_filter_lock; /* lock on filter lists */
 
+/*
+ * Calls into the filterops need to be resilient against things which
+ * destroy a klist, e.g. device detach, freeing a vnode, etc., to avoid
+ * chasing garbage pointers (to data, or even potentially code in a
+ * module about to be unloaded).  To that end, we acquire the
+ * knote_filterops_lock as a reader whenver calling into the filter
+ * ops (in the filter_*() functions).  When a driver (or anything else)
+ * is tearing down its klist, klist_fini() acquires knote_filterops_lock
+ * as a writer, and replaces all of the filterops in each knote with a
+ * stub, allowing knote detach (when descriptors are closed) to safely
+ * proceed.
+ *
+ * This lock should be very rarely *contended* (klist tear-down is fairly
+ * rare), but i

Re: Language-neutral interface specifications (research)

2022-07-12 Thread Mouse
> I am building an Ada compiler as a hobby project.  [...] native code
> for various platforms and architectures, but to help with
> bootstrapping, I also want it to be able to generate portable
> C++11/POSIX code.  This means I have to think about the difference
> between API and ABI on POSIX platforms.

Well, if you want to generate C++ code and machine code for the same
platform, then, yes, you will need to know how the C++ API maps into
the underlying ABI.

> As I see it, there are three levels of interface definition to consider:

> 1. The API level, [...]

> 2. The Pure-C level.  [...]

I don't see why this one is relevant.  It exists, to the extent that it
_does_ exist, only as an intermediate step during the conversion of API
level 1 code to machine code.  As you point out, it may not even be C.

> 3. The ABI level.  [...] Documents like the System V ABI describe how
> Pure-C is translated to machine code calls.

I thought they worked more at the level-1 API level: "a struct timeval
looks like this in memory..." without any specification of what it
looks like at the (as you say, largely fictional) pure-C level.

> I am interested in an Interface Description Language that can be used
> to:

> a. Define the source-level API in a way that is detached from the
>C language.

This can be done, but only by attaching it to some other language
instead.  It doesn't even make sense to talk about an API without it
being an API for some specific language; an API is an Application
_Programming_ Interface, and you can't program without programming in
some language.

> c. Define data flow better than C allows: The struct stat*
>argument to stat() is only meant to move data out of the
>function.  The pathname pointer isn't captured by the function
>call.

Doesn't the restrict qualifier get you at least part of this?

> d. Describe how the API level gets translated to the Pure-C level
>or similar.

Why bother with this?  The Pure-C level is often fictional and doesn't
really matter anyway; for example, I can't see that it matters whether
(for example) time_t is int, long, or __builtin_time_t; what matters is
(a) what the API code looks like and (b) what it turns into in terms of
bits in memory, registers, and syscall traps, and that can be described
without needing to refer to any pure-C level anything.

It sounds to me as though you want an ABI description language, not an
API description language.  A C - or C++, or any other language except
Ada - API is of minimal value for you, except (a) to the extent that an
ABI can be inferred from it or (b) to the extent that you are
generating code in that "other language".

Indeed, one could argue that an ABI is, essentially, an API for the
machine-code level.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Language-neutral interface specifications (research)

2022-07-12 Thread Jakob Stoklund Olesen
I saw this project on the wiki, and it reminds me of a problem I have
been trying to understand better:
https://wiki.netbsd.org/projects/project/language-neutral-interfaces/

I am a retired compiler engineer. I used to work on LLVM and other
compilers, including LLVM's code generator and register allocators. I
brought up the sparc64 support and implemented the System V ABI for that
architecture in Clang and LLVM. I haven't contributed directly to NetBSD
before, but my name appears once or twice in src/external.

I am building an Ada compiler as a hobby project. I realize this is a
huge undertaking that I will probably never finish. It's fun anyway. The
compiler will generate native code for various platforms and
architectures, but to help with bootstrapping, I also want it to be able
to generate portable C++11/POSIX code. This means I have to think about
the difference between API and ABI on POSIX platforms.


As I see it, there are three levels of interface definition to consider:

1. The API level, or source code compatibility level. Standards like
POSIX tend to describe interfaces in terms of what your C source code
should look like:

#include 
#include 

int check_dir(const char *pathname)
{
struct stat buffer;

if (stat(pathname, &buffer) != 0)
return errno;

if (S_ISDIR(buffer.st_mode))
return 0;
else
return ENOTDIR;
}

POSIX doesn't specify the exact contents of struct stat nor the value of
ENOTDIR. It only defines that you can use those symbols and struct
members in C code after including the right headers.


2. The Pure-C level. The C compiler lowers the C code to something
self-contained. It will:

- Run the preprocessor,
- Expand typedefs, and
- Expand inline functions.

(I'm not saying compilers work this way, but they work as-if this
happened).

This leaves code consisting of only C primitives:

struct stat {
int st_mode;
...
};
extern __tls int errno;

int check_dir(const char *pathname)
{
struct stat buffer;

if (stat(pathname, &buffer) != 0)
return errno;

if ((buffer.st_mode & 017) == 004)
return 0;
else
return 20;
}

This Pure-C code is not portable. There are differences between
platforms, architectures, and even some compiler flags can affect this
code. Note that Pure-C also doesn't have to be standard C. It is common
to use vendor-specific extensions like __attr__ to get the right
alignments and calling conventions. I see NetBSD sources are using a
__RENAME macro to change linkage names.


3. The ABI level. Documents like the System V ABI describe how Pure-C is
translated to machine code calls. This includes size and alignment of
primitive types, layout of structs, and how arguments and return values
are passed in function calls.


The translation from Pure-C to ABI is a pretty well understood problem.
There are ABI documents describing the standard stuff, and you may have
to deal with a few compiler extensions for alignment, SIMD types, and
alternate calling conventions. This is all stuff that compilers need to
deal with anyway.

The translation from API level to Pure-C level is more difficult (for
me, anyway). It seems like you basically have to run C code through the
system C compiler to make sure you covered all the corner cases. It is
not very satisfying for an Ada compiler to have to depend on the system
C compiler in order to generate binary code that interacts with the
system.

Ada actually has a standardized foreign function interface that can be
used to interface with C and Fortran. The problem is that it interfaces
to the Pure-C level, not the API level. I can't use it to call stat() in
a portable way.

I am interested in an Interface Description Language that can be used
to:

a. Define the source-level API in a way that is detached from the C
   language.
b. Define stronger types than C allows: S_ISDIR() is only supposed
   to work on a mode_t returned from stat(), not any old integer.
c. Define data flow better than C allows: The struct stat* argument
   to stat() is only meant to move data out of the function. The
   pathname pointer isn't captured by the function call.
d. Describe how the API level gets translated to the Pure-C level or
   similar. This is different for different platforms and
   architectures.

This IDL would make it possible for me to generate portable Ada bindings
for POSIX and other APIs without having to rely on the system C
compiler. I think it could also be useful for a project like NetBSD to
be able to track source compatibility and binary compatibility
individually.


I haven't done anything concrete yet, and I agree that it is a good idea
to research prior art. There is a lot of it:

- Wikipedia as a long list of IDLs:
https://en.wikipedia.org/wiki/Interface_description_language
- CORB