Module Name:    src
Committed By:   thorpej
Date:           Sun Oct 10 18:07:52 UTC 2021

Modified Files:
        src/sys/kern: kern_event.c kern_exec.c kern_exit.c kern_fork.c
        src/sys/sys: event.h eventvar.h proc.h

Log Message:
Changes to make EVFILT_PROC MP-safe:

Because the locking protocol around processes is somewhat complex
compared to other events that can be posted on kqueues, introduce
new functions for posting NOTE_EXEC, NOTE_EXIT, and NOTE_FORK,
rather than just using the generic knote() function.  These functions
KASSERT() their locking expectations, and deal with other complexities
for each situation.
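
For reference, the new entry points (declared in sys/sys/proc.h; see
the diff below) and their locking expectations are roughly:

    /* Called from execve_runproc(); acquires p->p_lock itself. */
    void knote_proc_exec(struct proc *p);

    /* Called from fork1() as (parent, child); acquires p1->p_lock itself. */
    void knote_proc_fork(struct proc *p1, struct proc *p2);

    /* Called from exit1() with p->p_lock held and p already marked SDEAD. */
    void knote_proc_exit(struct proc *p);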

knote_proc_fork(), in particular, needs to handle NOTE_TRACK, which
requires allocation of a new knote to attach to the child process.  We
don't want to be allocating memory while holding the parent's p_lock.
Furthermore, we have to attach the tracking note to the child
process, which means we also have to acquire the child's p_lock.

So, to handle all this, we introduce some additional synchronization
infrastructure around the 'knote' structure:

- Add the ability to mark a knote as being in a state of flux.  Knotes
  in this state are guaranteed not to be detached/deleted, thus
  allowing a code path to drop other locks after putting a knote into
  this state.

- Code paths that wish to detach/delete a knote must first check if the
  knote is in-flux.  If so, they must wait for it to quiesce.  Because
  multiple threads of execution may attempt this concurrently, a mechanism
  exists for a single LWP to claim the detach responsibility; all other
  threads simply wait for the knote to disappear before they can make
  further progress.  (See the sketch below.)

- When kqueue_scan() encounters an in-flux knote, it treats the
  situation just like encountering another thread's queue marker: it
  waits for the flux to settle and then continues on.

(The "in-flux knote" idea was inspired by FreeBSD, but this works differently
from their implementation, as the two kqueue implementations have diverged
quite a bit.)
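
To make the detach side of this concrete, here is a simplified sketch
of the quiesce step (modeled on knote_detach_quiesce() in the
kern_event.c diff below; the preempt/re-validation details are
omitted):

    /* Called with fdp->fd_lock held; returns true if locks were dropped. */
    mutex_spin_enter(&kq->kq_lock);
    if ((kn->kn_status & KN_WILLDETACH) != 0 &&
        kn->kn_kevent.udata != curlwp) {
            /* Another LWP has claimed the detach; back off and let it finish. */
            mutex_exit(&fdp->fd_lock);
            if (kn_in_flux(kn))
                    kn_wait_flux(kn, false); /* must not touch kn afterwards */
            mutex_spin_exit(&kq->kq_lock);
            return true;             /* caller must re-validate everything */
    }
    KNOTE_WILLDETACH(kn);            /* claim it: records curlwp in udata */
    if (kn_in_flux(kn)) {
            mutex_exit(&fdp->fd_lock);
            kn_wait_flux(kn, true);  /* we own the detach, so kn stays valid */
            mutex_spin_exit(&kq->kq_lock);
            return true;
    }
    mutex_spin_exit(&kq->kq_lock);
    return false;                    /* safe to knote_detach() right away */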

knote_proc_fork() uses this infrastructure to implement NOTE_TRACK like so:

- Attempt to put the original tracking knote into a state of flux; if
  this fails (because the note has a detach pending), we skip all
  processing (the original process has lost interest, and the pending
  detach simply wins the race).

- Once the note is in-flux, drop the kq and forking process's locks, and
  allocate two knotes: one to post the NOTE_CHILD event, and one to attach
  a new NOTE_TRACK to the child process.  Notably, we do NOT go through
  kqueue_register() to do this, but rather do all of the work directly
  and KASSERT() our assumptions; this allows us to directly control our
  interaction with locks.  All memory allocations here are performed with
  KM_NOSLEEP, in order to prevent holding the original knote in-flux
  indefinitely.

- Because the NOTE_TRACK use case adds knotes to kqueues through a
  sort of back-door mechanism, we must serialize with the closing of
  the destination kqueue's file descriptor.  To do this, we steal
  another bit from the kq_count field to signal that a kqueue is on
  its way out, which prevents new knotes from being enqueued while the
  close path detaches the existing ones.  (A condensed sketch of this
  flow appears after this list.)
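
Condensed, knote_proc_fork_track() (see the kern_event.c diff below)
is shaped roughly like this, with knote initialization and error
handling reduced to comments:

    /* kq_lock and p1->p_lock are held on entry and re-acquired on return. */
    struct knote *knchild, *kntrack;
    int error = 0;

    if (!kn_enter_flux(okn))
            return 0;                /* knote is being detached; nothing to do */
    mutex_spin_exit(&kq->kq_lock);
    mutex_exit(p1->p_lock);

    knchild = kmem_zalloc(sizeof(*knchild), KM_NOSLEEP);
    kntrack = kmem_zalloc(sizeof(*kntrack), KM_NOSLEEP);
    /* ENOMEM if either allocation failed; otherwise initialize both knotes. */

    mutex_enter(&fdp->fd_lock);
    if (kq->kq_count & KQ_CLOSING) {
            /* kqueue_close() is draining this kq; do not attach new knotes. */
            mutex_exit(&fdp->fd_lock);
    } else {
            /* Attach kntrack to p2->p_klist, hash both knotes into the
             * kq's descriptor table, and activate knchild (NOTE_CHILD). */
            mutex_exit(&fdp->fd_lock);
    }

    mutex_enter(p1->p_lock);
    mutex_spin_enter(&kq->kq_lock);
    if (kn_leave_flux(okn))
            KQ_FLUX_WAKEUP(kq);      /* wake anyone waiting on this knote */
    return error;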

In addition to fixing EVFILT_PROC's reliance on KERNEL_LOCK, this also
fixes a long-standing bug whereby a NOTE_CHILD event could be dropped
if the child process exited before the interested process received it
(the same knote would be reused to deliver the NOTE_EXIT event,
clobbering the NOTE_CHILD's 'data' field).
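
For example, a watcher using NOTE_TRACK now always receives a distinct
NOTE_CHILD event whose 'data' field holds the parent's pid, even if
the child exits immediately.  A minimal userland sketch (error
handling omitted):

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stdio.h>

    static void
    watch(pid_t pid)
    {
            struct kevent ev;
            int kq = kqueue();

            /* Watch 'pid' and automatically track any children it forks. */
            EV_SET(&ev, pid, EVFILT_PROC, EV_ADD | EV_ENABLE,
                NOTE_TRACK | NOTE_FORK | NOTE_EXEC | NOTE_EXIT, 0, NULL);
            (void)kevent(kq, &ev, 1, NULL, 0, NULL);

            for (;;) {
                    if (kevent(kq, NULL, 0, &ev, 1, NULL) != 1)
                            break;
                    if (ev.fflags & NOTE_CHILD) {
                            /* ev.ident is the child's pid, ev.data the parent's. */
                            printf("child %ld of parent %ld\n",
                                (long)ev.ident, (long)ev.data);
                    }
                    if (ev.fflags & NOTE_EXIT) {
                            /* Delivered as its own event, separate from NOTE_CHILD. */
                            printf("pid %ld exited, status 0x%lx\n",
                                (long)ev.ident, (long)ev.data);
                    }
            }
    }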

Add a bunch of comments to explain what's going on in various critical
sections, and sprinkle additional KASSERT()s to validate assumptions
in several more locations.


To generate a diff of this commit:
cvs rdiff -u -r1.128 -r1.129 src/sys/kern/kern_event.c
cvs rdiff -u -r1.509 -r1.510 src/sys/kern/kern_exec.c
cvs rdiff -u -r1.291 -r1.292 src/sys/kern/kern_exit.c
cvs rdiff -u -r1.226 -r1.227 src/sys/kern/kern_fork.c
cvs rdiff -u -r1.43 -r1.44 src/sys/sys/event.h
cvs rdiff -u -r1.9 -r1.10 src/sys/sys/eventvar.h
cvs rdiff -u -r1.368 -r1.369 src/sys/sys/proc.h

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.

Modified files:

Index: src/sys/kern/kern_event.c
diff -u src/sys/kern/kern_event.c:1.128 src/sys/kern/kern_event.c:1.129
--- src/sys/kern/kern_event.c:1.128	Thu Sep 30 01:20:53 2021
+++ src/sys/kern/kern_event.c	Sun Oct 10 18:07:51 2021
@@ -1,7 +1,7 @@
-/*	$NetBSD: kern_event.c,v 1.128 2021/09/30 01:20:53 thorpej Exp $	*/
+/*	$NetBSD: kern_event.c,v 1.129 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
- * Copyright (c) 2008, 2009 The NetBSD Foundation, Inc.
+ * Copyright (c) 2008, 2009, 2021 The NetBSD Foundation, Inc.
  * All rights reserved.
  *
  * This code is derived from software contributed to The NetBSD Foundation
@@ -58,8 +58,10 @@
  * FreeBSD: src/sys/kern/kern_event.c,v 1.27 2001/07/05 17:10:44 rwatson Exp
  */
 
+#include "opt_ddb.h"
+
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: kern_event.c,v 1.128 2021/09/30 01:20:53 thorpej Exp $");
+__KERNEL_RCSID(0, "$NetBSD: kern_event.c,v 1.129 2021/10/10 18:07:51 thorpej Exp $");
 
 #include <sys/param.h>
 #include <sys/systm.h>
@@ -134,7 +136,7 @@ static const struct filterops kqread_fil
 };
 
 static const struct filterops proc_filtops = {
-	.f_flags = 0,
+	.f_flags = FILTEROP_MPSAFE,
 	.f_attach = filt_procattach,
 	.f_detach = filt_procdetach,
 	.f_event = filt_proc,
@@ -177,8 +179,6 @@ static int	kq_calloutmax = (4 * 1024);
 extern const struct filterops fs_filtops;	/* vfs_syscalls.c */
 extern const struct filterops sig_filtops;	/* kern_sig.c */
 
-#define KQ_FLUX_WAKEUP(kq)	cv_broadcast(&kq->kq_cv)
-
 /*
  * Table for for all system-defined filters.
  * These should be listed in the numeric order of the EVFILT_* defines.
@@ -234,10 +234,189 @@ static size_t		user_kfiltersz;		/* size 
  *		Typically,	f_event(NOTE_SUBMIT) via knote: object lock
  *				f_event(!NOTE_SUBMIT) via knote: nothing,
  *					acquires/releases object lock inside.
+ *
+ * Locking rules when detaching knotes:
+ *
+ * There are some situations where knote submission may require dropping
+ * locks (see knote_proc_fork()).  In order to support this, it's possible
+ * to mark a knote as being 'in-flux'.  Such a knote is guaranteed not to
+ * be detached while it remains in-flux.  Because it will not be detached,
+ * locks can be dropped so e.g. memory can be allocated, locks on other
+ * data structures can be acquired, etc.  During this time, any attempt to
+ * detach an in-flux knote must wait until the knote is no longer in-flux.
+ * When this happens, the knote is marked for death (KN_WILLDETACH) and the
+ * LWP who gets to finish the detach operation is recorded in the knote's
+ * 'udata' field (which is no longer required for its original purpose once
+ * a knote is so marked).  Code paths that lead to knote_detach() must ensure
+ * that their LWP is the one tasked with its final demise after waiting for
+ * the in-flux status of the knote to clear.  Note that once a knote is
+ * marked KN_WILLDETACH, no code paths may put it into an in-flux state.
+ *
+ * Once the special circumstances have been handled, the locks are re-
+ * acquired in the proper order (object lock -> kq_lock), the knote taken
+ * out of flux, and any waiters are notified.  Because waiters must have
+ * also dropped *their* locks in order to safely block, they must re-
+ * validate all of their assumptions; see knote_detach_quiesce().  See also
+ * the kqueue_register() (EV_ADD, EV_DELETE) and kqueue_scan() (EV_ONESHOT)
+ * cases.
+ *
+ * When kqueue_scan() encounters an in-flux knote, the situation is
+ * treated like another LWP's list marker.
+ *
+ * LISTEN WELL: It is important to not hold knotes in flux for an
+ * extended period of time! In-flux knotes effectively block any
+ * progress of the kqueue_scan() operation.  Any code paths that place
+ * knotes in-flux should be careful to not block for indefinite periods
+ * of time, such as for memory allocation (i.e. KM_NOSLEEP is OK, but
+ * KM_SLEEP is not).
  */
 static krwlock_t	kqueue_filter_lock;	/* lock on filter lists */
 static kmutex_t		kqueue_timer_lock;	/* for EVFILT_TIMER */
 
+#define	KQ_FLUX_WAIT(kq)	(void)cv_wait(&kq->kq_cv, &kq->kq_lock)
+#define	KQ_FLUX_WAKEUP(kq)	cv_broadcast(&kq->kq_cv)
+
+static inline bool
+kn_in_flux(struct knote *kn)
+{
+	KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+	return kn->kn_influx != 0;
+}
+
+static inline bool
+kn_enter_flux(struct knote *kn)
+{
+	KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+
+	if (kn->kn_status & KN_WILLDETACH) {
+		return false;
+	}
+
+	KASSERT(kn->kn_influx < UINT_MAX);
+	kn->kn_influx++;
+
+	return true;
+}
+
+static inline bool
+kn_leave_flux(struct knote *kn)
+{
+	KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+	KASSERT(kn->kn_influx > 0);
+	kn->kn_influx--;
+	return kn->kn_influx == 0;
+}
+
+static void
+kn_wait_flux(struct knote *kn, bool can_loop)
+{
+	bool loop;
+
+	KASSERT(mutex_owned(&kn->kn_kq->kq_lock));
+
+	/*
+	 * It may not be safe for us to touch the knote again after
+	 * dropping the kq_lock.  The caller has let us know in
+	 * 'can_loop'.
+	 */
+	for (loop = true; loop && kn->kn_influx != 0; loop = can_loop) {
+		KQ_FLUX_WAIT(kn->kn_kq);
+	}
+}
+
+#define	KNOTE_WILLDETACH(kn)						\
+do {									\
+	(kn)->kn_status |= KN_WILLDETACH;				\
+	(kn)->kn_kevent.udata = curlwp;					\
+} while (/*CONSTCOND*/0)
+
+/*
+ * Wait until the specified knote is in a quiescent state and
+ * safe to detach.  Returns true if we potentially blocked (and
+ * thus dropped our locks).
+ */
+static bool
+knote_detach_quiesce(struct knote *kn)
+{
+	struct kqueue *kq = kn->kn_kq;
+	filedesc_t *fdp = kq->kq_fdp;
+
+	KASSERT(mutex_owned(&fdp->fd_lock));
+
+	mutex_spin_enter(&kq->kq_lock);
+	/*
+	 * There are two cases where we might see KN_WILLDETACH here:
+	 *
+	 * 1. Someone else has already started detaching the knote but
+	 *    had to wait for it to settle first.
+	 *
+	 * 2. We had to wait for it to settle, and had to come back
+	 *    around after re-acquiring the locks.
+	 *
+	 * When KN_WILLDETACH is set, we also set the LWP that claimed
+	 * the prize of finishing the detach in the 'udata' field of the
+	 * knote (which will never be used again for its usual purpose
+	 * once the note is in this state).  If it doesn't point to us,
+	 * we must drop the locks and let them in to finish the job.
+	 *
+	 * Otherwise, once we have claimed the knote for ourselves, we
+	 * can finish waiting for it to settle.  This is the only scenario
+	 * where touching a detaching knote is safe after dropping the
+	 * locks.
+	 */
+	if ((kn->kn_status & KN_WILLDETACH) != 0 &&
+	    kn->kn_kevent.udata != curlwp) {
+		/*
+		 * N.B. it is NOT safe for us to touch the knote again
+		 * after dropping the locks here.  The caller must go
+		 * back around and re-validate everything.  However, if
+		 * the knote is in-flux, we want to block to minimize
+		 * busy-looping.
+		 */
+		mutex_exit(&fdp->fd_lock);
+		if (kn_in_flux(kn)) {
+			kn_wait_flux(kn, false);
+			mutex_spin_exit(&kq->kq_lock);
+			return true;
+		}
+		mutex_spin_exit(&kq->kq_lock);
+		preempt_point();
+		return true;
+	}
+	/*
+	 * If we get here, we know that we will be claiming the
+	 * detach responsibilities, or that we already have and
+	 * this is the second attempt after re-validation.
+	 */
+	KASSERT((kn->kn_status & KN_WILLDETACH) == 0 ||
+		kn->kn_kevent.udata == curlwp);
+	/*
+	 * Similarly, if we get here, either we are just claiming it
+	 * and may have to wait for it to settle, or this is the second
+	 * attempt after re-validation and no other code paths have
+	 * put it in-flux.
+	 */
+	KASSERT((kn->kn_status & KN_WILLDETACH) == 0 ||
+		kn_in_flux(kn) == false);
+	KNOTE_WILLDETACH(kn);
+	if (kn_in_flux(kn)) {
+		mutex_exit(&fdp->fd_lock);
+		kn_wait_flux(kn, true);
+		/*
+		 * It is safe for us to touch the knote again after
+		 * dropping the locks, but the caller must still
+		 * re-validate everything because other aspects of
+		 * the environment may have changed while we blocked.
+		 */
+		KASSERT(kn_in_flux(kn) == false);
+		mutex_spin_exit(&kq->kq_lock);
+		return true;
+	}
+	mutex_spin_exit(&kq->kq_lock);
+
+	return false;
+}
+
 static int
 filter_attach(struct knote *kn)
 {
@@ -577,24 +756,9 @@ static int
 filt_procattach(struct knote *kn)
 {
 	struct proc *p;
-	struct lwp *curl;
-
-	curl = curlwp;
 
 	mutex_enter(&proc_lock);
-	if (kn->kn_flags & EV_FLAG1) {
-		/*
-		 * NOTE_TRACK attaches to the child process too early
-		 * for proc_find, so do a raw look up and check the state
-		 * explicitly.
-		 */
-		p = proc_find_raw(kn->kn_id);
-		if (p != NULL && p->p_stat != SIDL)
-			p = NULL;
-	} else {
-		p = proc_find(kn->kn_id);
-	}
-
+	p = proc_find(kn->kn_id);
 	if (p == NULL) {
 		mutex_exit(&proc_lock);
 		return ESRCH;
@@ -606,7 +770,7 @@ filt_procattach(struct knote *kn)
 	 */
 	mutex_enter(p->p_lock);
 	mutex_exit(&proc_lock);
-	if (kauth_authorize_process(curl->l_cred,
+	if (kauth_authorize_process(curlwp->l_cred,
 	    KAUTH_PROCESS_KEVENT_FILTER, p, NULL, NULL, NULL) != 0) {
 	    	mutex_exit(p->p_lock);
 		return EACCES;
@@ -616,13 +780,11 @@ filt_procattach(struct knote *kn)
 	kn->kn_flags |= EV_CLEAR;	/* automatically set */
 
 	/*
-	 * internal flag indicating registration done by kernel
+	 * NOTE_CHILD is only ever generated internally; don't let it
+	 * leak in from user-space.  See knote_proc_fork_track().
 	 */
-	if (kn->kn_flags & EV_FLAG1) {
-		kn->kn_data = kn->kn_sdata;	/* ppid */
-		kn->kn_fflags = NOTE_CHILD;
-		kn->kn_flags &= ~EV_FLAG1;
-	}
+	kn->kn_sfflags &= ~NOTE_CHILD;
+
 	SLIST_INSERT_HEAD(&p->p_klist, kn, kn_selnext);
     	mutex_exit(p->p_lock);
 
@@ -642,91 +804,350 @@ filt_procattach(struct knote *kn)
 static void
 filt_procdetach(struct knote *kn)
 {
+	struct kqueue *kq = kn->kn_kq;
 	struct proc *p;
 
-	if (kn->kn_status & KN_DETACHED)
-		return;
-
-	p = kn->kn_obj;
-
-	mutex_enter(p->p_lock);
-	SLIST_REMOVE(&p->p_klist, kn, knote, kn_selnext);
-	mutex_exit(p->p_lock);
+	/*
+	 * We have to synchronize with knote_proc_exit(), but we
+	 * are forced to acquire the locks in the wrong order here
+	 * because we can't be sure kn->kn_obj is valid unless
+	 * KN_DETACHED is not set.
+	 */
+ again:
+	mutex_spin_enter(&kq->kq_lock);
+	if ((kn->kn_status & KN_DETACHED) == 0) {
+		p = kn->kn_obj;
+		if (!mutex_tryenter(p->p_lock)) {
+			mutex_spin_exit(&kq->kq_lock);
+			preempt_point();
+			goto again;
+		}
+		kn->kn_status |= KN_DETACHED;
+		SLIST_REMOVE(&p->p_klist, kn, knote, kn_selnext);
+		mutex_exit(p->p_lock);
+	}
+	mutex_spin_exit(&kq->kq_lock);
 }
 
 /*
  * Filter event method for EVFILT_PROC.
+ *
+ * Due to some of the complexities of process locking, we have special
+ * entry points for delivering knote submissions.  filt_proc() is used
+ * only to check for activation from kqueue_register() and kqueue_scan().
  */
 static int
 filt_proc(struct knote *kn, long hint)
 {
-	u_int event, fflag;
-	struct kevent kev;
-	struct kqueue *kq;
-	int error;
+	struct kqueue *kq = kn->kn_kq;
+	uint32_t fflags;
 
-	event = (u_int)hint & NOTE_PCTRLMASK;
-	kq = kn->kn_kq;
-	fflag = 0;
+	/*
+	 * Because we share the same klist with signal knotes, just
+	 * ensure that we're not being invoked for the proc-related
+	 * submissions.
+	 */
+	KASSERT((hint & (NOTE_EXEC | NOTE_EXIT | NOTE_FORK)) == 0);
 
-	/* If the user is interested in this event, record it. */
-	if (kn->kn_sfflags & event)
-		fflag |= event;
+	mutex_spin_enter(&kq->kq_lock);
+	fflags = kn->kn_fflags;
+	mutex_spin_exit(&kq->kq_lock);
 
-	if (event == NOTE_EXIT) {
-		struct proc *p = kn->kn_obj;
+	return fflags != 0;
+}
 
-		if (p != NULL)
-			kn->kn_data = P_WAITSTATUS(p);
-		/*
-		 * Process is gone, so flag the event as finished.
-		 *
-		 * Detach the knote from watched process and mark
-		 * it as such. We can't leave this to kqueue_scan(),
-		 * since the process might not exist by then. And we
-		 * have to do this now, since psignal KNOTE() is called
-		 * also for zombies and we might end up reading freed
-		 * memory if the kevent would already be picked up
-		 * and knote g/c'ed.
-		 */
-		filt_procdetach(kn);
+void
+knote_proc_exec(struct proc *p)
+{
+	struct knote *kn, *tmpkn;
+	struct kqueue *kq;
+	uint32_t fflags;
+
+	mutex_enter(p->p_lock);
 
+	SLIST_FOREACH_SAFE(kn, &p->p_klist, kn_selnext, tmpkn) {
+		/* N.B. EVFILT_SIGNAL knotes are on this same list. */
+		if (kn->kn_fop == &sig_filtops) {
+			continue;
+		}
+		KASSERT(kn->kn_fop == &proc_filtops);
+
+		kq = kn->kn_kq;
 		mutex_spin_enter(&kq->kq_lock);
-		kn->kn_status |= KN_DETACHED;
-		/* Mark as ONESHOT, so that the knote it g/c'ed when read */
-		kn->kn_flags |= (EV_EOF | EV_ONESHOT);
-		kn->kn_fflags |= fflag;
+		fflags = (kn->kn_fflags |= (kn->kn_sfflags & NOTE_EXEC));
 		mutex_spin_exit(&kq->kq_lock);
+		if (fflags) {
+			knote_activate(kn);
+		}
+	}
+
+	mutex_exit(p->p_lock);
+}
+
+static int __noinline
+knote_proc_fork_track(struct proc *p1, struct proc *p2, struct knote *okn)
+{
+	struct kqueue *kq = okn->kn_kq;
+
+	KASSERT(mutex_owned(&kq->kq_lock));
+	KASSERT(mutex_owned(p1->p_lock));
+
+	/*
+	 * We're going to put this knote into flux while we drop
+	 * the locks and create and attach a new knote to track the
+	 * child.  If we are not able to enter flux, then this knote
+	 * is about to go away, so skip the notification.
+	 */
+	if (!kn_enter_flux(okn)) {
+		return 0;
+	}
+
+	mutex_spin_exit(&kq->kq_lock);
+	mutex_exit(p1->p_lock);
 
-		return 1;
+	/*
+	 * We actually have to register *two* new knotes:
+	 *
+	 * ==> One for the NOTE_CHILD notification.  This is a forced
+	 *     ONESHOT note.
+	 *
+	 * ==> One to actually track the child process as it subsequently
+	 *     forks, execs, and, ultimately, exits.
+	 *
+	 * If we only register a single knote, then it's possible for
+	 * the NOTE_CHILD and NOTE_EXIT to be collapsed into a single
+	 * notification if the child exits before the tracking process
+	 * has received the NOTE_CHILD notification, which applications
+	 * aren't expecting (the event's 'data' field would be clobbered,
+	 * for example).
+	 *
+	 * To do this, what we have here is an **extremely** stripped-down
+	 * version of kqueue_register() that has the following properties:
+	 *
+	 * ==> Does not block to allocate memory.  If we are unable
+	 *     to allocate memory, we return ENOMEM.
+	 *
+	 * ==> Does not search for existing knotes; we know there
+	 *     are not any because this is a new process that isn't
+	 *     even visible to other processes yet.
+	 *
+	 * ==> Assumes that the knhash for our kq's descriptor table
+	 *     already exists (after all, we're already tracking
+	 *     processes with knotes if we got here).
+	 *
+	 * ==> Directly attaches the new tracking knote to the child
+	 *     process.
+	 *
+	 * The whole point is to do the minimum amount of work while the
+	 * knote is held in-flux, and to avoid doing extra work in general
+	 * (we already have the new child process; why bother looking it
+	 * up again?).
+	 */
+	filedesc_t *fdp = kq->kq_fdp;
+	struct knote *knchild, *kntrack;
+	int error = 0;
+
+	knchild = kmem_zalloc(sizeof(*knchild), KM_NOSLEEP);
+	kntrack = kmem_zalloc(sizeof(*knchild), KM_NOSLEEP);
+	if (__predict_false(knchild == NULL || kntrack == NULL)) {
+		error = ENOMEM;
+		goto out;
+	}
+
+	kntrack->kn_obj = p2;
+	kntrack->kn_id = p2->p_pid;
+	kntrack->kn_kq = kq;
+	kntrack->kn_fop = okn->kn_fop;
+	kntrack->kn_kfilter = okn->kn_kfilter;
+	kntrack->kn_sfflags = okn->kn_sfflags;
+	kntrack->kn_sdata = p1->p_pid;
+
+	kntrack->kn_kevent.ident = p2->p_pid;
+	kntrack->kn_kevent.filter = okn->kn_filter;
+	kntrack->kn_kevent.flags =
+	    okn->kn_flags | EV_ADD | EV_ENABLE | EV_CLEAR;
+	kntrack->kn_kevent.fflags = 0;
+	kntrack->kn_kevent.data = 0;
+	kntrack->kn_kevent.udata = okn->kn_kevent.udata; /* preserve udata */
+
+	/*
+	 * The child note does not need to be attached to the
+	 * new proc's klist at all.
+	 */
+	*knchild = *kntrack;
+	knchild->kn_status = KN_DETACHED;
+	knchild->kn_sfflags = 0;
+	knchild->kn_kevent.flags |= EV_ONESHOT;
+	knchild->kn_kevent.fflags = NOTE_CHILD;
+	knchild->kn_kevent.data = p1->p_pid;		 /* parent */
+
+	mutex_enter(&fdp->fd_lock);
+
+	/*
+	 * We need to check to see if the kq is closing, and skip
+	 * attaching the knote if so.  Normally, this isn't necessary
+	 * when coming in the front door because the file descriptor
+	 * layer will synchronize this.
+	 *
+	 * It's safe to test KQ_CLOSING without taking the kq_lock
+	 * here because that flag is only ever set when the fd_lock
+	 * is also held.
+	 */
+	if (__predict_false(kq->kq_count & KQ_CLOSING)) {
+		mutex_exit(&fdp->fd_lock);
+		goto out;
 	}
 
+	/*
+	 * We do the "insert into FD table" and "attach to klist" steps
+	 * in the opposite order of kqueue_register() here to avoid
+	 * having to take p2->p_lock twice.  But this is OK because we
+	 * hold fd_lock across the entire operation.
+	 */
+
+	mutex_enter(p2->p_lock);
+	error = kauth_authorize_process(curlwp->l_cred,
+	    KAUTH_PROCESS_KEVENT_FILTER, p2, NULL, NULL, NULL);
+	if (__predict_false(error != 0)) {
+		mutex_exit(p2->p_lock);
+		mutex_exit(&fdp->fd_lock);
+		error = EACCES;
+		goto out;
+	}
+	SLIST_INSERT_HEAD(&p2->p_klist, kntrack, kn_selnext);
+	mutex_exit(p2->p_lock);
+
+	KASSERT(fdp->fd_knhashmask != 0);
+	KASSERT(fdp->fd_knhash != NULL);
+	struct klist *list = &fdp->fd_knhash[KN_HASH(kntrack->kn_id,
+	    fdp->fd_knhashmask)];
+	SLIST_INSERT_HEAD(list, kntrack, kn_link);
+	SLIST_INSERT_HEAD(list, knchild, kn_link);
+
+	/* This adds references for knchild *and* kntrack. */
+	atomic_add_int(&kntrack->kn_kfilter->refcnt, 2);
+
+	knote_activate(knchild);
+
+	kntrack = NULL;
+	knchild = NULL;
+
+	mutex_exit(&fdp->fd_lock);
+
+ out:
+	if (__predict_false(knchild != NULL)) {
+		kmem_free(knchild, sizeof(*knchild));
+	}
+	if (__predict_false(kntrack != NULL)) {
+		kmem_free(kntrack, sizeof(*kntrack));
+	}
+	mutex_enter(p1->p_lock);
 	mutex_spin_enter(&kq->kq_lock);
-	if ((event == NOTE_FORK) && (kn->kn_sfflags & NOTE_TRACK)) {
+
+	if (kn_leave_flux(okn)) {
+		KQ_FLUX_WAKEUP(kq);
+	}
+
+	return error;
+}
+
+void
+knote_proc_fork(struct proc *p1, struct proc *p2)
+{
+	struct knote *kn;
+	struct kqueue *kq;
+	uint32_t fflags;
+
+	mutex_enter(p1->p_lock);
+
+	/*
+	 * N.B. We DO NOT use SLIST_FOREACH_SAFE() here because we
+	 * don't want to pre-fetch the next knote; in the event we
+	 * have to drop p_lock, we will have put the knote in-flux,
+	 * meaning that no one will be able to detach it until we
+	 * have taken the knote out of flux.  However, that does
+	 * NOT stop someone else from detaching the next note in the
+	 * list while we have it unlocked.  Thus, we want to fetch
+	 * the next note in the list only after we have re-acquired
+	 * the lock, and using SLIST_FOREACH() will satisfy that.
+	 */
+	SLIST_FOREACH(kn, &p1->p_klist, kn_selnext) {
+		/* N.B. EVFILT_SIGNAL knotes are on this same list. */
+		if (kn->kn_fop == &sig_filtops) {
+			continue;
+		}
+		KASSERT(kn->kn_fop == &proc_filtops);
+
+		kq = kn->kn_kq;
+		mutex_spin_enter(&kq->kq_lock);
+		kn->kn_fflags |= (kn->kn_sfflags & NOTE_FORK);
+		if (__predict_false(kn->kn_sfflags & NOTE_TRACK)) {
+			/*
+			 * This will drop kq_lock and p_lock and
+			 * re-acquire them before it returns.
+			 */
+			if (knote_proc_fork_track(p1, p2, kn)) {
+				kn->kn_fflags |= NOTE_TRACKERR;
+			}
+			KASSERT(mutex_owned(p1->p_lock));
+			KASSERT(mutex_owned(&kq->kq_lock));
+		}
+		fflags = kn->kn_fflags;
+		mutex_spin_exit(&kq->kq_lock);
+		if (fflags) {
+			knote_activate(kn);
+		}
+	}
+
+	mutex_exit(p1->p_lock);
+}
+
+void
+knote_proc_exit(struct proc *p)
+{
+	struct knote *kn;
+	struct kqueue *kq;
+
+	KASSERT(mutex_owned(p->p_lock));
+
+	while (!SLIST_EMPTY(&p->p_klist)) {
+		kn = SLIST_FIRST(&p->p_klist);
+		kq = kn->kn_kq;
+
+		KASSERT(kn->kn_obj == p);
+
+		mutex_spin_enter(&kq->kq_lock);
+		kn->kn_data = P_WAITSTATUS(p);
+		/*
+		 * Mark as ONESHOT, so that the knote is g/c'ed
+		 * when read.
+		 */
+		kn->kn_flags |= (EV_EOF | EV_ONESHOT);
+		kn->kn_fflags |= kn->kn_sfflags & NOTE_EXIT;
+
 		/*
-		 * Process forked, and user wants to track the new process,
-		 * so attach a new knote to it, and immediately report an
-		 * event with the parent's pid.  Register knote with new
-		 * process.
+		 * Detach the knote from the process and mark it as such.
+		 * N.B. EVFILT_SIGNAL knotes are also on p_klist, but by the
+		 * time we get here, all open file descriptors for this
+		 * process have been released, meaning that signal knotes
+		 * will have already been detached.
+		 *
+		 * We need to synchronize this with filt_procdetach().
 		 */
-		memset(&kev, 0, sizeof(kev));
-		kev.ident = hint & NOTE_PDATAMASK;	/* pid */
-		kev.filter = kn->kn_filter;
-		kev.flags = kn->kn_flags | EV_ADD | EV_ENABLE | EV_FLAG1;
-		kev.fflags = kn->kn_sfflags;
-		kev.data = kn->kn_id;			/* parent */
-		kev.udata = kn->kn_kevent.udata;	/* preserve udata */
+		KASSERT(kn->kn_fop == &proc_filtops);
+		if ((kn->kn_status & KN_DETACHED) == 0) {
+			kn->kn_status |= KN_DETACHED;
+			SLIST_REMOVE_HEAD(&p->p_klist, kn_selnext);
+		}
 		mutex_spin_exit(&kq->kq_lock);
-		error = kqueue_register(kq, &kev);
-		mutex_spin_enter(&kq->kq_lock);
-		if (error != 0)
-			kn->kn_fflags |= NOTE_TRACKERR;
-	}
-	kn->kn_fflags |= fflag;
-	fflag = kn->kn_fflags;
-	mutex_spin_exit(&kq->kq_lock);
 
-	return fflag != 0;
+		/*
+		 * Always activate the knote for NOTE_EXIT regardless
+		 * of whether or not the listener cares about it.
+		 * This matches historical behavior.
+		 */
+		knote_activate(kn);
+	}
 }
 
 static void
@@ -1220,6 +1641,10 @@ kqueue_register(struct kqueue *kq, struc
 		}
 	}
 
+	/* It's safe to test KQ_CLOSING while holding only the fd_lock. */
+	KASSERT(mutex_owned(&fdp->fd_lock));
+	KASSERT((kq->kq_count & KQ_CLOSING) == 0);
+
 	/*
 	 * kn now contains the matching knote, or NULL if no match
 	 */
@@ -1285,7 +1710,17 @@ kqueue_register(struct kqueue *kq, struc
 				    ft ? ft->f_ops->fo_name : "?", error);
 #endif
 
-				/* knote_detach() drops fdp->fd_lock */
+				/*
+				 * N.B. no need to check for this note to
+				 * be in-flux, since it was never visible
+				 * to the monitored object.
+				 *
+				 * knote_detach() drops fdp->fd_lock
+				 */
+				mutex_enter(&kq->kq_lock);
+				KNOTE_WILLDETACH(kn);
+				KASSERT(kn_in_flux(kn) == false);
+				mutex_exit(&kq->kq_lock);
 				knote_detach(kn, fdp, false);
 				goto done;
 			}
@@ -1299,6 +1734,36 @@ kqueue_register(struct kqueue *kq, struc
 	}
 
 	if (kev->flags & EV_DELETE) {
+		/*
+		 * Let the world know that this knote is about to go
+		 * away, and wait for it to settle if it's currently
+		 * in-flux.
+		 */
+		mutex_spin_enter(&kq->kq_lock);
+		if (kn->kn_status & KN_WILLDETACH) {
+			/*
+			 * This knote is already on its way out,
+			 * so just be done.
+			 */
+			mutex_spin_exit(&kq->kq_lock);
+			goto doneunlock;
+		}
+		KNOTE_WILLDETACH(kn);
+		if (kn_in_flux(kn)) {
+			mutex_exit(&fdp->fd_lock);
+			/*
+			 * It's safe for us to conclusively wait for
+			 * this knote to settle because we know we'll
+			 * be completing the detach.
+			 */
+			kn_wait_flux(kn, true);
+			KASSERT(kn_in_flux(kn) == false);
+			mutex_spin_exit(&kq->kq_lock);
+			mutex_enter(&fdp->fd_lock);
+		} else {
+			mutex_spin_exit(&kq->kq_lock);
+		}
+
 		/* knote_detach() drops fdp->fd_lock */
 		knote_detach(kn, fdp, true);
 		goto done;
@@ -1355,10 +1820,46 @@ doneunlock:
 	return (error);
 }
 
-#if defined(DEBUG)
 #define KN_FMT(buf, kn) \
     (snprintb((buf), sizeof(buf), __KN_FLAG_BITS, (kn)->kn_status), buf)
 
+#if defined(DDB)
+void
+kqueue_printit(struct kqueue *kq, bool full, void (*pr)(const char *, ...))
+{
+	const struct knote *kn;
+	u_int count;
+	int nmarker;
+	char buf[128];
+
+	count = 0;
+	nmarker = 0;
+
+	(*pr)("kqueue %p (restart=%d count=%u):\n", kq,
+	    !!(kq->kq_count & KQ_RESTART), KQ_COUNT(kq));
+	(*pr)("  Queued knotes:\n");
+	TAILQ_FOREACH(kn, &kq->kq_head, kn_tqe) {
+		if (kn->kn_status & KN_MARKER) {
+			nmarker++;
+		} else {
+			count++;
+		}
+		(*pr)("    knote %p: kq=%p status=%s\n",
+		    kn, kn->kn_kq, KN_FMT(buf, kn));
+		(*pr)("      id=0x%lx (%lu) filter=%d\n",
+		    (u_long)kn->kn_id, (u_long)kn->kn_id, kn->kn_filter);
+		if (kn->kn_kq != kq) {
+			(*pr)("      !!! kn->kn_kq != kq\n");
+		}
+	}
+	if (count != KQ_COUNT(kq)) {
+		(*pr)("  !!! count(%u) != KQ_COUNT(%u)\n",
+		    count, KQ_COUNT(kq));
+	}
+}
+#endif /* DDB */
+
+#if defined(DEBUG)
 static void
 kqueue_check(const char *func, size_t line, const struct kqueue *kq)
 {
@@ -1368,7 +1869,6 @@ kqueue_check(const char *func, size_t li
 	char buf[128];
 
 	KASSERT(mutex_owned(&kq->kq_lock));
-	KASSERT(KQ_COUNT(kq) < UINT_MAX / 2);
 
 	count = 0;
 	nmarker = 0;
@@ -1389,7 +1889,7 @@ kqueue_check(const char *func, size_t li
 			}
 			count++;
 			if (count > KQ_COUNT(kq)) {
-				panic("%s,%zu: kq=%p kq->kq_count(%d) != "
+				panic("%s,%zu: kq=%p kq->kq_count(%u) != "
 				    "count(%d), nmarker=%d",
 		    		    func, line, kq, KQ_COUNT(kq), count,
 				    nmarker);
@@ -1461,6 +1961,7 @@ kqueue_scan(file_t *fp, size_t maxevents
 
 	memset(&morker, 0, sizeof(morker));
 	marker = &morker;
+	marker->kn_kq = kq;
 	marker->kn_status = KN_MARKER;
 	mutex_spin_enter(&kq->kq_lock);
  retry:
@@ -1498,21 +1999,47 @@ kqueue_scan(file_t *fp, size_t maxevents
 	 * Acquire the fdp->fd_lock interlock to avoid races with
 	 * file creation/destruction from other threads.
 	 */
-relock:
 	mutex_spin_exit(&kq->kq_lock);
+relock:
 	mutex_enter(&fdp->fd_lock);
 	mutex_spin_enter(&kq->kq_lock);
 
 	while (count != 0) {
-		kn = TAILQ_FIRST(&kq->kq_head);	/* get next knote */
+		/*
+		 * Get next knote.  We are guaranteed this will never
+		 * be NULL because of the marker we inserted above.
+		 */
+		kn = TAILQ_FIRST(&kq->kq_head);
 
-		if ((kn->kn_status & KN_MARKER) != 0 && kn != marker) {
+		bool kn_is_other_marker =
+		    (kn->kn_status & KN_MARKER) != 0 && kn != marker;
+		bool kn_is_detaching = (kn->kn_status & KN_WILLDETACH) != 0;
+		bool kn_is_in_flux = kn_in_flux(kn);
+
+		/*
+		 * If we found a marker that's not ours, or this knote
+		 * is in a state of flux, then wait for everything to
+		 * settle down and go around again.
+		 */
+		if (kn_is_other_marker || kn_is_detaching || kn_is_in_flux) {
 			if (influx) {
 				influx = 0;
 				KQ_FLUX_WAKEUP(kq);
 			}
 			mutex_exit(&fdp->fd_lock);
-			(void)cv_wait(&kq->kq_cv, &kq->kq_lock);
+			if (kn_is_other_marker || kn_is_in_flux) {
+				KQ_FLUX_WAIT(kq);
+				mutex_spin_exit(&kq->kq_lock);
+			} else {
+				/*
+				 * Detaching but not in-flux?  Someone is
+				 * actively trying to finish the job; just
+				 * go around and try again.
+				 */
+				KASSERT(kn_is_detaching);
+				mutex_spin_exit(&kq->kq_lock);
+				preempt_point();
+			}
 			goto relock;
 		}
 
@@ -1553,14 +2080,22 @@ relock:
 			}
 			if (rv == 0) {
 				/*
-				 * non-ONESHOT event that hasn't
-				 * triggered again, so de-queue.
+				 * non-ONESHOT event that hasn't triggered
+				 * again, so it will remain de-queued.
 				 */
 				kn->kn_status &= ~(KN_ACTIVE|KN_BUSY);
 				kq->kq_count--;
 				influx = 1;
 				continue;
 			}
+		} else {
+			/*
+			 * This ONESHOT note is going to be detached
+			 * below.  Mark the knote as not long for this
+			 * world before we release the kq lock so that
+			 * no one else will put it in a state of flux.
+			 */
+			KNOTE_WILLDETACH(kn);
 		}
 		KASSERT(kn->kn_fop != NULL);
 		touch = (!(kn->kn_fop->f_flags & FILTEROP_ISFD) &&
@@ -1578,6 +2113,9 @@ relock:
 			/* delete ONESHOT events after retrieval */
 			kn->kn_status &= ~KN_BUSY;
 			kq->kq_count--;
+			KASSERT(kn_in_flux(kn) == false);
+			KASSERT((kn->kn_status & KN_WILLDETACH) != 0 &&
+				kn->kn_kevent.udata == curlwp);
 			mutex_spin_exit(&kq->kq_lock);
 			knote_detach(kn, fdp, true);
 			mutex_enter(&fdp->fd_lock);
@@ -1773,18 +2311,22 @@ kqueue_doclose(struct kqueue *kq, struct
 
 	KASSERT(mutex_owned(&fdp->fd_lock));
 
+ again:
 	for (kn = SLIST_FIRST(list); kn != NULL;) {
 		if (kq != kn->kn_kq) {
 			kn = SLIST_NEXT(kn, kn_link);
 			continue;
 		}
+		if (knote_detach_quiesce(kn)) {
+			mutex_enter(&fdp->fd_lock);
+			goto again;
+		}
 		knote_detach(kn, fdp, true);
 		mutex_enter(&fdp->fd_lock);
 		kn = SLIST_FIRST(list);
 	}
 }
 
-
 /*
  * fileops close method for a kqueue descriptor.
  */
@@ -1801,7 +2343,27 @@ kqueue_close(file_t *fp)
 	fp->f_type = 0;
 	fdp = curlwp->l_fd;
 
+	KASSERT(kq->kq_fdp == fdp);
+
 	mutex_enter(&fdp->fd_lock);
+
+	/*
+	 * We're going to drop the fd_lock multiple times while
+	 * we detach knotes.  During this time, attempts to register
+	 * knotes via the back door (e.g. knote_proc_fork_track())
+	 * need to fail, lest they sneak in to attach a knote after
+	 * we've already drained the list it's destined for.
+	 *
+	 * We must acquire kq_lock here to set KQ_CLOSING (to serialize
+	 * with other code paths that modify kq_count without holding
+	 * the fd_lock), but once this bit is set, it's only safe to
+	 * test it while holding the fd_lock, and holding kq_lock while
+	 * doing so is not necessary.
+	 */
+	mutex_enter(&kq->kq_lock);
+	kq->kq_count |= KQ_CLOSING;
+	mutex_exit(&kq->kq_lock);
+
 	for (i = 0; i <= fdp->fd_lastkqfile; i++) {
 		if ((ff = fdp->fd_dt->dt_ff[i]) == NULL)
 			continue;
@@ -1812,8 +2374,15 @@ kqueue_close(file_t *fp)
 			kqueue_doclose(kq, &fdp->fd_knhash[i], -1);
 		}
 	}
+
 	mutex_exit(&fdp->fd_lock);
 
+#if defined(DEBUG)
+	mutex_enter(&kq->kq_lock);
+	kq_check(kq);
+	mutex_exit(&kq->kq_lock);
+#endif /* DEBUG */
+	KASSERT(TAILQ_EMPTY(&kq->kq_head));
 	KASSERT(KQ_COUNT(kq) == 0);
 	mutex_destroy(&kq->kq_lock);
 	cv_destroy(&kq->kq_cv);
@@ -1875,10 +2444,14 @@ knote_fdclose(int fd)
 	struct knote *kn;
 	filedesc_t *fdp;
 
+ again:
 	fdp = curlwp->l_fd;
 	mutex_enter(&fdp->fd_lock);
 	list = (struct klist *)&fdp->fd_dt->dt_ff[fd]->ff_knlist;
 	while ((kn = SLIST_FIRST(list)) != NULL) {
+		if (knote_detach_quiesce(kn)) {
+			goto again;
+		}
 		knote_detach(kn, fdp, true);
 		mutex_enter(&fdp->fd_lock);
 	}
@@ -1898,9 +2471,10 @@ knote_detach(struct knote *kn, filedesc_
 	kq = kn->kn_kq;
 
 	KASSERT((kn->kn_status & KN_MARKER) == 0);
+	KASSERT((kn->kn_status & KN_WILLDETACH) != 0);
+	KASSERT(kn->kn_fop != NULL);
 	KASSERT(mutex_owned(&fdp->fd_lock));
 
-	KASSERT(kn->kn_fop != NULL);
 	/* Remove from monitored object. */
 	if (dofop) {
 		filter_detach(kn);
@@ -1917,8 +2491,10 @@ knote_detach(struct knote *kn, filedesc_
 	/* Remove from kqueue. */
 again:
 	mutex_spin_enter(&kq->kq_lock);
+	KASSERT(kn_in_flux(kn) == false);
 	if ((kn->kn_status & KN_QUEUED) != 0) {
 		kq_check(kq);
+		KASSERT(KQ_COUNT(kq) != 0);
 		kq->kq_count--;
 		TAILQ_REMOVE(&kq->kq_head, kn, kn_tqe);
 		kn->kn_status &= ~KN_QUEUED;
@@ -1949,6 +2525,10 @@ knote_enqueue(struct knote *kn)
 	kq = kn->kn_kq;
 
 	mutex_spin_enter(&kq->kq_lock);
+	if (__predict_false(kn->kn_status & KN_WILLDETACH)) {
+		/* Don't bother enqueueing a dying knote. */
+		goto out;
+	}
 	if ((kn->kn_status & KN_DISABLED) != 0) {
 		kn->kn_status &= ~KN_DISABLED;
 	}
@@ -1956,11 +2536,13 @@ knote_enqueue(struct knote *kn)
 		kq_check(kq);
 		kn->kn_status |= KN_QUEUED;
 		TAILQ_INSERT_TAIL(&kq->kq_head, kn, kn_tqe);
+		KASSERT(KQ_COUNT(kq) < KQ_MAXCOUNT);
 		kq->kq_count++;
 		kq_check(kq);
 		cv_broadcast(&kq->kq_cv);
 		selnotify(&kq->kq_sel, 0, NOTE_SUBMIT);
 	}
+ out:
 	mutex_spin_exit(&kq->kq_lock);
 }
 /*
@@ -1976,15 +2558,21 @@ knote_activate(struct knote *kn)
 	kq = kn->kn_kq;
 
 	mutex_spin_enter(&kq->kq_lock);
+	if (__predict_false(kn->kn_status & KN_WILLDETACH)) {
+		/* Don't bother enqueueing a dying knote. */
+		goto out;
+	}
 	kn->kn_status |= KN_ACTIVE;
 	if ((kn->kn_status & (KN_QUEUED | KN_DISABLED)) == 0) {
 		kq_check(kq);
 		kn->kn_status |= KN_QUEUED;
 		TAILQ_INSERT_TAIL(&kq->kq_head, kn, kn_tqe);
+		KASSERT(KQ_COUNT(kq) < KQ_MAXCOUNT);
 		kq->kq_count++;
 		kq_check(kq);
 		cv_broadcast(&kq->kq_cv);
 		selnotify(&kq->kq_sel, 0, NOTE_SUBMIT);
 	}
+ out:
 	mutex_spin_exit(&kq->kq_lock);
 }

Index: src/sys/kern/kern_exec.c
diff -u src/sys/kern/kern_exec.c:1.509 src/sys/kern/kern_exec.c:1.510
--- src/sys/kern/kern_exec.c:1.509	Tue Sep 28 15:35:44 2021
+++ src/sys/kern/kern_exec.c	Sun Oct 10 18:07:51 2021
@@ -1,4 +1,4 @@
-/*	$NetBSD: kern_exec.c,v 1.509 2021/09/28 15:35:44 thorpej Exp $	*/
+/*	$NetBSD: kern_exec.c,v 1.510 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
  * Copyright (c) 2008, 2019, 2020 The NetBSD Foundation, Inc.
@@ -62,7 +62,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: kern_exec.c,v 1.509 2021/09/28 15:35:44 thorpej Exp $");
+__KERNEL_RCSID(0, "$NetBSD: kern_exec.c,v 1.510 2021/10/10 18:07:51 thorpej Exp $");
 
 #include "opt_exec.h"
 #include "opt_execfmt.h"
@@ -1367,8 +1367,17 @@ execve_runproc(struct lwp *l, struct exe
 
 	pool_put(&exec_pool, data->ed_argp);
 
-	/* notify others that we exec'd */
-	KNOTE(&p->p_klist, NOTE_EXEC);
+	/*
+	 * Notify anyone who might care that we've exec'd.
+	 *
+	 * This is slightly racy; someone could sneak in and
+	 * attach a knote after we've decided not to notify,
+	 * or vice-versa, but that's not particularly bothersome.
+	 * knote_proc_exec() will acquire p->p_lock as needed.
+	 */
+	if (!SLIST_EMPTY(&p->p_klist)) {
+		knote_proc_exec(p);
+	}
 
 	kmem_free(epp->ep_hdr, epp->ep_hdrlen);
 

Index: src/sys/kern/kern_exit.c
diff -u src/sys/kern/kern_exit.c:1.291 src/sys/kern/kern_exit.c:1.292
--- src/sys/kern/kern_exit.c:1.291	Sat Dec  5 18:17:01 2020
+++ src/sys/kern/kern_exit.c	Sun Oct 10 18:07:51 2021
@@ -1,4 +1,4 @@
-/*	$NetBSD: kern_exit.c,v 1.291 2020/12/05 18:17:01 thorpej Exp $	*/
+/*	$NetBSD: kern_exit.c,v 1.292 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
  * Copyright (c) 1998, 1999, 2006, 2007, 2008, 2020 The NetBSD Foundation, Inc.
@@ -67,7 +67,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: kern_exit.c,v 1.291 2020/12/05 18:17:01 thorpej Exp $");
+__KERNEL_RCSID(0, "$NetBSD: kern_exit.c,v 1.292 2021/10/10 18:07:51 thorpej Exp $");
 
 #include "opt_ktrace.h"
 #include "opt_dtrace.h"
@@ -435,16 +435,6 @@ exit1(struct lwp *l, int exitcode, int s
 	proc_finispecific(p);
 
 	/*
-	 * Notify interested parties of our demise.
-	 */
-	KNOTE(&p->p_klist, NOTE_EXIT);
-
-	SDT_PROBE(proc, kernel, , exit,
-		((p->p_sflag & PS_COREDUMP) ? CLD_DUMPED :
-		 (p->p_xsig ? CLD_KILLED : CLD_EXITED)),
-		0,0,0,0);
-
-	/*
 	 * Reset p_opptr pointer of all former children which got
 	 * traced by another process and were reparented. We reset
 	 * it to NULL here; the trace detach code then reparents
@@ -509,6 +499,15 @@ exit1(struct lwp *l, int exitcode, int s
 	 */
 	p->p_stat = SDEAD;
 
+	/*
+	 * Let anyone watching this DTrace probe know that we're
+	 * on our way out.
+	 */
+	SDT_PROBE(proc, kernel, , exit,
+		((p->p_sflag & PS_COREDUMP) ? CLD_DUMPED :
+		 (p->p_xsig ? CLD_KILLED : CLD_EXITED)),
+		0,0,0,0);
+
 	/* Put in front of parent's sibling list for parent to collect it */
 	old_parent = p->p_pptr;
 	old_parent->p_nstopchild++;
@@ -559,6 +558,19 @@ exit1(struct lwp *l, int exitcode, int s
 	pcu_discard_all(l);
 
 	mutex_enter(p->p_lock);
+	/*
+	 * Notify other processes tracking us with a knote that
+	 * we're exiting.
+	 *
+	 * N.B. we do this here because the process is now SDEAD,
+	 * and thus cannot have any more knotes attached.  Also,
+	 * knote_proc_exit() expects that p->p_lock is already
+	 * held (and will assert so).
+	 */
+	if (!SLIST_EMPTY(&p->p_klist)) {
+		knote_proc_exit(p);
+	}
+
 	/* Free the LWP ID */
 	proc_free_lwpid(p, l->l_lid);
 	lwp_drainrefs(l);

Index: src/sys/kern/kern_fork.c
diff -u src/sys/kern/kern_fork.c:1.226 src/sys/kern/kern_fork.c:1.227
--- src/sys/kern/kern_fork.c:1.226	Sat May 23 23:42:43 2020
+++ src/sys/kern/kern_fork.c	Sun Oct 10 18:07:51 2021
@@ -1,4 +1,4 @@
-/*	$NetBSD: kern_fork.c,v 1.226 2020/05/23 23:42:43 ad Exp $	*/
+/*	$NetBSD: kern_fork.c,v 1.227 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
  * Copyright (c) 1999, 2001, 2004, 2006, 2007, 2008, 2019
@@ -68,7 +68,7 @@
  */
 
 #include <sys/cdefs.h>
-__KERNEL_RCSID(0, "$NetBSD: kern_fork.c,v 1.226 2020/05/23 23:42:43 ad Exp $");
+__KERNEL_RCSID(0, "$NetBSD: kern_fork.c,v 1.227 2021/10/10 18:07:51 thorpej Exp $");
 
 #include "opt_ktrace.h"
 #include "opt_dtrace.h"
@@ -547,7 +547,7 @@ fork1(struct lwp *l1, int flags, int exi
 	 */
 	if (!SLIST_EMPTY(&p1->p_klist)) {
 		mutex_exit(&proc_lock);
-		KNOTE(&p1->p_klist, NOTE_FORK | p2->p_pid);
+		knote_proc_fork(p1, p2);
 		mutex_enter(&proc_lock);
 	}
 

Index: src/sys/sys/event.h
diff -u src/sys/sys/event.h:1.43 src/sys/sys/event.h:1.44
--- src/sys/sys/event.h:1.43	Sun Sep 26 21:29:39 2021
+++ src/sys/sys/event.h	Sun Oct 10 18:07:51 2021
@@ -1,4 +1,4 @@
-/*	$NetBSD: event.h,v 1.43 2021/09/26 21:29:39 thorpej Exp $	*/
+/*	$NetBSD: event.h,v 1.44 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
  * Copyright (c) 1999,2000,2001 Jonathan Lemon <jle...@freebsd.org>
@@ -246,6 +246,7 @@ struct knote {
 	struct kfilter		*kn_kfilter;
 	void 			*kn_hook;
 	int			kn_hookid;
+	unsigned int		kn_influx;	/* q: in-flux counter */
 
 #define	KN_ACTIVE	0x01U			/* event has been triggered */
 #define	KN_QUEUED	0x02U			/* event is on queue */
@@ -253,6 +254,7 @@ struct knote {
 #define	KN_DETACHED	0x08U			/* knote is detached */
 #define	KN_MARKER	0x10U			/* is a marker */
 #define	KN_BUSY		0x20U			/* is being scanned */
+#define	KN_WILLDETACH	0x40U			/* being detached imminently */
 /* Toggling KN_BUSY also requires kn_kq->kq_fdp->fd_lock. */
 #define __KN_FLAG_BITS \
     "\20" \
@@ -261,7 +263,8 @@ struct knote {
     "\3DISABLED" \
     "\4DETACHED" \
     "\5MARKER" \
-    "\6BUSY"
+    "\6BUSY" \
+    "\7WILLDETACH"
 
 
 #define	kn_id		kn_kevent.ident

Index: src/sys/sys/eventvar.h
diff -u src/sys/sys/eventvar.h:1.9 src/sys/sys/eventvar.h:1.10
--- src/sys/sys/eventvar.h:1.9	Sun May  2 19:13:43 2021
+++ src/sys/sys/eventvar.h	Sun Oct 10 18:07:51 2021
@@ -1,4 +1,4 @@
-/*	$NetBSD: eventvar.h,v 1.9 2021/05/02 19:13:43 jdolecek Exp $	*/
+/*	$NetBSD: eventvar.h,v 1.10 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
  * Copyright (c) 1999,2000 Jonathan Lemon <jle...@freebsd.org>
@@ -51,9 +51,21 @@ struct kqueue {
 	filedesc_t	*kq_fdp;
 	struct selinfo	kq_sel;
 	kcondvar_t	kq_cv;
-	u_int		kq_count;		/* number of pending events */
-#define	KQ_RESTART	0x80000000		/* force ERESTART */
-#define KQ_COUNT(kq)	((kq)->kq_count & ~KQ_RESTART)
+	uint32_t	kq_count;		/* number of pending events */
 };
 
+#define	KQ_RESTART	__BIT(31)		/* force ERESTART */
+#define	KQ_CLOSING	__BIT(30)		/* kqueue is closing for good */
+#define	KQ_MAXCOUNT	__BITS(0,29)
+#define	KQ_COUNT(kq)	((unsigned int)((kq)->kq_count & KQ_MAXCOUNT))
+
+#ifdef _KERNEL
+
+#if defined(DDB)
+void	kqueue_printit(struct kqueue *, bool,
+	    void (*)(const char *, ...));
+#endif /* DDB */
+
+#endif /* _KERNEL */
+
 #endif /* !_SYS_EVENTVAR_H_ */

Index: src/sys/sys/proc.h
diff -u src/sys/sys/proc.h:1.368 src/sys/sys/proc.h:1.369
--- src/sys/sys/proc.h:1.368	Sat Dec  5 18:17:01 2020
+++ src/sys/sys/proc.h	Sun Oct 10 18:07:51 2021
@@ -1,4 +1,4 @@
-/*	$NetBSD: proc.h,v 1.368 2020/12/05 18:17:01 thorpej Exp $	*/
+/*	$NetBSD: proc.h,v 1.369 2021/10/10 18:07:51 thorpej Exp $	*/
 
 /*-
  * Copyright (c) 2006, 2007, 2008, 2020 The NetBSD Foundation, Inc.
@@ -562,6 +562,15 @@ void	proc_setspecific(struct proc *, spe
 int	proc_compare(const struct proc *, const struct lwp *,
     const struct proc *, const struct lwp *);
 
+/*
+ * Special handlers for delivering EVFILT_PROC notifications.  These
+ * exist to handle some of the special locking considerations around
+ * processes.
+ */
+void	knote_proc_exec(struct proc *);
+void	knote_proc_fork(struct proc *, struct proc *);
+void	knote_proc_exit(struct proc *);
+
 int	proclist_foreach_call(struct proclist *,
     int (*)(struct proc *, void *arg), void *);
 
