Re: Prevent race in single_thread_set()

2020-12-01 Thread Martin Pieuchot
On 01/12/20(Tue) 10:21, Claudio Jeker wrote:
> On Mon, Nov 30, 2020 at 07:19:28PM -0300, Martin Pieuchot wrote:
> > On 04/11/20(Wed) 11:19, Martin Pieuchot wrote:
> > > Here's a 3rd approach to solve the TOCTOU race in single_thread_set().
> > > The issue being that the lock serializing access to `ps_single' is not
> > > held when calling single_thread_check().
> > > 
> > > The approach below is controversial because it extends the scope of the
> > > SCHED_LOCK().  On the other hand, the other two approaches, which both
> > > add a new lock to avoid this race, ignore the fact that accesses to
> > > `ps_single' are currently not clearly serialized w/o KERNEL_LOCK().
> > > 
> > > So the diff below improves the situation in that regard and does not add
> > > more complexity due to the use of multiple locks.  After having looked
> > > for a way to split the SCHED_LOCK() I believe this is the simplest
> > > approach.
> > > 
> > > I deliberately used a *_locked() function to avoid grabbing the lock
> > > recursively as I'm trying to get rid of the recursion, see the other
> > > thread on tech@.
> > > 
> > > That said the uses of `ps_single' in ptrace_ctrl() are not covered by
> > > this diff and I'd be glad to hear some comments about them.  This is
> > > fine as long as all the code using `ps_single' runs under KERNEL_LOCK()
> > > but since we're trying to get the single_thread_* API out of it, this
> > > needs to be addressed.
> > > 
> > > Note that this diff introduces a helper for initializing ps_single*
> > > values in order to keep all the accesses of those fields in the same
> > > file.
> > 
> > Anyone?  With this, only the `ps_threads' iteration must receive some
> > love in order to take the single_thread_* API out of the KERNEL_LOCK().
> > For that I just sent an SMR_TAILQ conversion diff.
> > 
> > Combined with the diff to remove the recursive attribute of the
> > SCHED_LOCK() we're ready to split it into multiple mutexes.
> > 
> > Does that make any sense?  Comments?  Oks?
> > 
> > > Index: kern/kern_fork.c
> > > ===
> > > RCS file: /cvs/src/sys/kern/kern_fork.c,v
> > > retrieving revision 1.226
> > > diff -u -p -r1.226 kern_fork.c
> > > --- kern/kern_fork.c  25 Oct 2020 01:55:18 -  1.226
> > > +++ kern/kern_fork.c  4 Nov 2020 12:52:54 -
> > > @@ -563,10 +563,7 @@ thread_fork(struct proc *curp, void *sta
> > >* if somebody else wants to take us to single threaded mode,
> > >* count ourselves in.
> > >*/
> > > - if (pr->ps_single) {
> > > - atomic_inc_int(&pr->ps_singlecount);
> > > - atomic_setbits_int(&p->p_flag, P_SUSPSINGLE);
> > > - }
> > > + single_thread_init(p);
> 
> This is not the right name for this function. It does not initialize
> anything. Why is this indirection needed? I would just put the
> SCHED_LOCK() around this bit. It makes more sense especially with the
> comment above.

Updated diff does that.  I introduced a function because it helps me
keep all the locking in the same place.
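
To make the race concrete for readers of the archive, here is a minimal
userland analogue (my own illustration, not kernel code): the bug class is a
check-then-act sequence where the value of `ps_single' is tested without
holding the lock that serializes updates to it.

/*
 * Illustrative pthread sketch only; every name below is made up and the
 * mutex merely stands in for the SCHED_LOCK() discussed above.
 */
#include <pthread.h>
#include <stddef.h>

struct thread { int suspended; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* ~ SCHED_LOCK() */
static struct thread *single;                            /* ~ pr->ps_single */

/* Racy: the pointer tested may change before the action is taken. */
void
check_racy(struct thread *self)
{
	if (single != NULL && single != self) {  /* check, unlocked */
		pthread_mutex_lock(&lock);
		self->suspended = 1;             /* act on a possibly stale check */
		pthread_mutex_unlock(&lock);
	}
}

/* Fixed: check and act under the same lock, which is what the diff does. */
void
check_locked(struct thread *self)
{
	pthread_mutex_lock(&lock);
	if (single != NULL && single != self)
		self->suspended = 1;
	pthread_mutex_unlock(&lock);
}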

> > > Index: kern/kern_sig.c
> > > ===
> > > RCS file: /cvs/src/sys/kern/kern_sig.c,v
> > > retrieving revision 1.263
> > > diff -u -p -r1.263 kern_sig.c
> > > --- kern/kern_sig.c   16 Sep 2020 13:50:42 -  1.263
> > > +++ kern/kern_sig.c   4 Nov 2020 12:38:35 -
> > > @@ -1932,11 +1932,27 @@ userret(struct proc *p)
> > >   p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
> > >  }
> > >  
> > > +void
> > > +single_thread_init(struct proc *p)
> > > +{
> > > + struct process *pr = p->p_p;
> > > + int s;
> > > +
> > > + SCHED_LOCK(s);
> > > + if (pr->ps_single) {
> > > + atomic_inc_int(&pr->ps_singlecount);
> > > + atomic_setbits_int(&p->p_flag, P_SUSPSINGLE);
> > > + }
> > > + SCHED_UNLOCK(s);
> > > +}
> > > +
> > >  int
> > > -single_thread_check(struct proc *p, int deep)
> > > +_single_thread_check_locked(struct proc *p, int deep)
> 
> Please don't add the leading _ to this function. There is no need for it.

Done.

> > > @@ -2014,7 +2043,6 @@ single_thread_set(struct proc *p, enum s
> > >   panic("single_thread_mode = %d&q

Re: Prevent race in single_thread_set()

2020-11-30 Thread Martin Pieuchot
On 04/11/20(Wed) 11:19, Martin Pieuchot wrote:
> Here's a 3rd approach to solve the TOCTOU race in single_thread_set().
> The issue being that the lock serializing access to `ps_single' is not
> held when calling single_thread_check().
> 
> The approach below is controversial because it extends the scope of the
> SCHED_LOCK().  On the other hand, the other two approaches, which both
> add a new lock to avoid this race, ignore the fact that accesses to
> `ps_single' are currently not clearly serialized w/o KERNEL_LOCK().
> 
> So the diff below improves the situation in that regard and does not add
> more complexity due to the use of multiple locks.  After having looked
> for a way to split the SCHED_LOCK() I believe this is the simplest
> approach.
> 
> I deliberately used a *_locked() function to avoid grabbing the lock
> recursively as I'm trying to get rid of the recursion, see the other
> thread on tech@.
> 
> That said the uses of `ps_single' in ptrace_ctrl() are not covered by
> this diff and I'd be glad to hear some comments about them.  This is
> fine as long as all the code using `ps_single' runs under KERNEL_LOCK()
> but since we're trying to get the single_thread_* API out of it, this
> needs to be addressed.
> 
> Note that this diff introduces a helper for initializing ps_single*
> values in order to keep all the accesses of those fields in the same
> file.

Anyone?  With this, only the `ps_threads' iteration must receive some
love in order to take the single_thread_* API out of the KERNEL_LOCK().
For that I just sent an SMR_TAILQ conversion diff.

Combined with the diff to remove the recursive attribute of the
SCHED_LOCK() we're ready to split it into multiple mutexes.

Does that make any sense?  Comments?  Oks?

> Index: kern/kern_fork.c
> ===
> RCS file: /cvs/src/sys/kern/kern_fork.c,v
> retrieving revision 1.226
> diff -u -p -r1.226 kern_fork.c
> --- kern/kern_fork.c  25 Oct 2020 01:55:18 -  1.226
> +++ kern/kern_fork.c  4 Nov 2020 12:52:54 -
> @@ -563,10 +563,7 @@ thread_fork(struct proc *curp, void *sta
>* if somebody else wants to take us to single threaded mode,
>* count ourselves in.
>*/
> - if (pr->ps_single) {
> - atomic_inc_int(&pr->ps_singlecount);
> - atomic_setbits_int(&p->p_flag, P_SUSPSINGLE);
> - }
> + single_thread_init(p);
>  
>   /*
>* Return tid to parent thread and copy it out to userspace
> Index: kern/kern_sig.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sig.c,v
> retrieving revision 1.263
> diff -u -p -r1.263 kern_sig.c
> --- kern/kern_sig.c   16 Sep 2020 13:50:42 -  1.263
> +++ kern/kern_sig.c   4 Nov 2020 12:38:35 -
> @@ -1932,11 +1932,27 @@ userret(struct proc *p)
>   p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
>  }
>  
> +void
> +single_thread_init(struct proc *p)
> +{
> + struct process *pr = p->p_p;
> + int s;
> +
> + SCHED_LOCK(s);
> + if (pr->ps_single) {
> + atomic_inc_int(&pr->ps_singlecount);
> + atomic_setbits_int(&p->p_flag, P_SUSPSINGLE);
> + }
> + SCHED_UNLOCK(s);
> +}
> +
>  int
> -single_thread_check(struct proc *p, int deep)
> +_single_thread_check_locked(struct proc *p, int deep)
>  {
>   struct process *pr = p->p_p;
>  
> + SCHED_ASSERT_LOCKED();
> +
>   if (pr->ps_single != NULL && pr->ps_single != p) {
>   do {
>   int s;
> @@ -1949,14 +1965,12 @@ single_thread_check(struct proc *p, int 
>   return (EINTR);
>   }
>  
> - SCHED_LOCK(s);
> - if (pr->ps_single == NULL) {
> - SCHED_UNLOCK(s);
> + if (pr->ps_single == NULL)
>   continue;
> - }
>  
>   if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
>   wakeup(&pr->ps_singlecount);
> +
>   if (pr->ps_flags & PS_SINGLEEXIT) {
>   SCHED_UNLOCK(s);
>   KERNEL_LOCK();
> @@ -1967,13 +1981,24 @@ single_thread_check(struct proc *p, int 
>   /* not exiting and don't need to unwind, so suspend */
>   p->p_stat = SSTOP;
>   mi_switch();
> - SCHED_UNLOCK(s);
>   } while (pr->ps_single != NULL);
>   }
>  
>   return (

Use SMR_TAILQ for `ps_threads'

2020-11-30 Thread Martin Pieuchot
Every multi-threaded process keeps a list of threads in `ps_threads'.
This list is iterated in interrupt and process context which makes it
complicated to protect it with a rwlock.

One of the places where such iteration is done is inside the tsleep(9)
routines, directly in single_thread_check() or via CURSIG().  In order
to take this code path out of the KERNEL_LOCK(), claudio@ proposed to
use SMR_TAILQ.  This has the advantage of not introducing lock
dependencies and allows us to address every iteration one-by-one.

Diff below is a first step in this direction: it replaces the existing
TAILQ_* macros with the locked SMR_TAILQ* variants.  This is mostly lifted
from claudio@'s diff and should not introduce any side effect.

ok?
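
For context, here is a rough sketch of my own (not part of the diff) of where
this is heading once every iteration has been audited: a reader that no longer
needs the KERNEL_LOCK() would walk the list inside an smr(9) read-side
critical section, while insertions and removals stay serialized on the write
side.  Until then the *_LOCKED variants used below keep today's semantics.

/* Sketch only, assuming the smr(9) read side and the converted list head. */
#include <sys/param.h>
#include <sys/proc.h>
#include <sys/smr.h>

int
count_threads(struct process *pr)
{
	struct proc *p;
	int n = 0;

	smr_read_enter();
	SMR_TAILQ_FOREACH(p, &pr->ps_threads, p_thr_link)
		n++;
	smr_read_leave();

	return (n);
}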

diff --git lib/libkvm/kvm_proc2.c lib/libkvm/kvm_proc2.c
index 96f7dc91b92..1f4f9b914bb 100644
--- lib/libkvm/kvm_proc2.c
+++ lib/libkvm/kvm_proc2.c
@@ -341,8 +341,9 @@ kvm_proclist(kvm_t *kd, int op, int arg, struct process *pr,
kp.p_pctcpu = 0;
kp.p_stat = (process.ps_flags & PS_ZOMBIE) ? SDEAD :
SIDL;
-   for (p = TAILQ_FIRST(&process.ps_threads); p != NULL; 
-   p = TAILQ_NEXT(&proc, p_thr_link)) {
+   for (p = SMR_TAILQ_FIRST_LOCKED(&process.ps_threads);
+   p != NULL;
+   p = SMR_TAILQ_NEXT_LOCKED(&proc, p_thr_link)) {
if (KREAD(kd, (u_long)p, &proc)) {
_kvm_err(kd, kd->program,
"can't read proc at %lx",
@@ -376,8 +377,8 @@ kvm_proclist(kvm_t *kd, int op, int arg, struct process *pr,
if (!dothreads)
continue;
 
-   for (p = TAILQ_FIRST(&process.ps_threads); p != NULL; 
-   p = TAILQ_NEXT(&proc, p_thr_link)) {
+   for (p = SMR_TAILQ_FIRST_LOCKED(&process.ps_threads); p != NULL;
+   p = SMR_TAILQ_NEXT_LOCKED(&proc, p_thr_link)) {
if (KREAD(kd, (u_long)p, &proc)) {
_kvm_err(kd, kd->program,
"can't read proc at %lx",
diff --git sys/kern/exec_elf.c sys/kern/exec_elf.c
index 5e455208663..575273b306c 100644
--- sys/kern/exec_elf.c
+++ sys/kern/exec_elf.c
@@ -85,6 +85,7 @@
 #include 
 #include 
 #include 
+#include <sys/smr.h>
 
 #include 
 
@@ -1360,7 +1361,7 @@ coredump_notes_elf(struct proc *p, void *iocookie, size_t 
*sizep)
 * threads in the process have been stopped and the list can't
 * change.
 */
-   TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
+   SMR_TAILQ_FOREACH_LOCKED(q, &pr->ps_threads, p_thr_link) {
if (q == p) /* we've taken care of this thread */
continue;
error = coredump_note_elf(q, iocookie, );
diff --git sys/kern/init_main.c sys/kern/init_main.c
index fed6be19435..2b657ffe328 100644
--- sys/kern/init_main.c
+++ sys/kern/init_main.c
@@ -519,7 +519,7 @@ main(void *framep)
 */
LIST_FOREACH(pr, &allprocess, ps_list) {
nanouptime(&pr->ps_start);
-   TAILQ_FOREACH(p, &pr->ps_threads, p_thr_link) {
+   SMR_TAILQ_FOREACH_LOCKED(p, &pr->ps_threads, p_thr_link) {
nanouptime(&p->p_cpu->ci_schedstate.spc_runtime);
timespecclear(&p->p_rtime);
}
diff --git sys/kern/kern_exit.c sys/kern/kern_exit.c
index a20775419e3..ffc0158954c 100644
--- sys/kern/kern_exit.c
+++ sys/kern/kern_exit.c
@@ -63,6 +63,7 @@
 #ifdef SYSVSEM
 #include 
 #endif
+#include <sys/smr.h>
 #include 
 
 #include 
@@ -161,7 +162,7 @@ exit1(struct proc *p, int xexit, int xsig, int flags)
}
 
/* unlink ourselves from the active threads */
-   TAILQ_REMOVE(&pr->ps_threads, p, p_thr_link);
+   SMR_TAILQ_REMOVE_LOCKED(&pr->ps_threads, p, p_thr_link);
if ((p->p_flag & P_THREAD) == 0) {
/* main thread gotta wait because it has the pid, et al */
while (pr->ps_refcnt > 1)
@@ -724,7 +725,7 @@ process_zap(struct process *pr)
if (pr->ps_ptstat != NULL)
free(pr->ps_ptstat, M_SUBPROC, sizeof(*pr->ps_ptstat));
pool_put(&rusage_pool, pr->ps_ru);
-   KASSERT(TAILQ_EMPTY(&pr->ps_threads));
+   KASSERT(SMR_TAILQ_EMPTY_LOCKED(&pr->ps_threads));
lim_free(pr->ps_limit);
crfree(pr->ps_ucred);
pool_put(&process_pool, pr);
diff --git sys/kern/kern_fork.c sys/kern/kern_fork.c
index 9fb239bc8b4..e1cb587b2b8 100644
--- sys/kern/kern_fork.c
+++ sys/kern/kern_fork.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include <sys/smr.h>
 #include 
 #include 
 #include 
@@ -179,8 +180,8 @@ process_initialize(struct process *pr, struct proc *p)
 {
/* initialize the thread links */
pr->ps_mainproc = p;
-   TAILQ_INIT(&pr->ps_threads);
-   TAILQ_INSERT_TAIL(&pr->ps_threads, p, p_thr_link);
+   SMR_TAILQ_INIT(&pr->ps_threads);
+   

Re: Switch select(2) to kqueue-based implementation

2020-11-30 Thread Martin Pieuchot
On 30/11/20(Mon) 17:06, Visa Hankala wrote:
> On Mon, Nov 30, 2020 at 01:28:14PM -0300, Martin Pieuchot wrote:
> > I plan to commit this in 3 steps, to ease a possible revert:
> >  - kevent(2) refactoring
> >  - introduction of newer kq* APIs
> >  - dopselect rewrite
> 
> Please send a separate patch for the first step.

Here it is.  The diff below changes kevent(2) to possibly call kqueue_scan()
multiple times.  The same pattern is/will be used by select(2) and
poll(2).

The copyout(9) and ktrace entry points have been moved to the syscall
function.

Comments?  Oks?
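
Stripped of the kqueue specifics, the loop added below is the usual
chunked-drain pattern.  The toy userland program below (mine, for illustration
only; produce() is a stand-in for kqueue_scan()) shows the control flow: hand
each batch to the consumer as soon as it is gathered, and stop on a partial
batch, on an empty one, or once the destination is full.

#include <stdio.h>

#define BATCH	8			/* plays the role of KQ_NEVENTS */

static int
produce(int *buf, int max)
{
	static int left = 20;		/* pretend 20 events are pending */
	int i, n = left < max ? left : max;

	for (i = 0; i < n; i++)
		buf[i] = i;
	left -= n;
	return n;
}

int
main(void)
{
	int buf[BATCH], want = 64, total = 0, n, ready;

	while ((n = want - total) > 0) {
		if (n > BATCH)
			n = BATCH;
		ready = produce(buf, n);
		if (ready == 0)
			break;		/* nothing ready at all */
		total += ready;		/* here the kernel copies out + ktraces */
		if (ready < n)
			break;		/* partial batch: nothing more ready */
	}
	printf("collected %d\n", total);
	return 0;
}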

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.145
diff -u -p -r1.145 kern_event.c
--- kern/kern_event.c   25 Nov 2020 13:49:00 -  1.145
+++ kern/kern_event.c   30 Nov 2020 20:12:08 -
@@ -567,6 +567,7 @@ sys_kevent(struct proc *p, void *v, regi
struct timespec ts;
struct timespec *tsp = NULL;
int i, n, nerrors, error;
+   int ready, total;
struct kevent kev[KQ_NEVENTS];
 
if ((fp = fd_getfile(fdp, SCARG(uap, fd))) == NULL)
@@ -595,9 +596,9 @@ sys_kevent(struct proc *p, void *v, regi
kq = fp->f_data;
nerrors = 0;
 
-   while (SCARG(uap, nchanges) > 0) {
-   n = SCARG(uap, nchanges) > KQ_NEVENTS ?
-   KQ_NEVENTS : SCARG(uap, nchanges);
+   while ((n = SCARG(uap, nchanges)) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
error = copyin(SCARG(uap, changelist), kev,
n * sizeof(struct kevent));
if (error)
@@ -635,11 +636,36 @@ sys_kevent(struct proc *p, void *v, regi
 
kqueue_scan_setup(&scan, kq);
FRELE(fp, p);
-   error = kqueue_scan(&scan, SCARG(uap, nevents), SCARG(uap, eventlist),
-   tsp, kev, p, &n);
+   /*
+* Collect as many events as we can.  The timeout on successive
+* loops is disabled (kqueue_scan() becomes non-blocking).
+*/
+   total = 0;
+   error = 0;
+   while ((n = SCARG(uap, nevents) - total) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
+   ready = kqueue_scan(&scan, n, kev, tsp, p, &error);
+   if (ready == 0)
+   break;
+   error = copyout(kev, SCARG(uap, eventlist) + total,
+   sizeof(struct kevent) * ready);
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktrevent(p, kev, ready);
+#endif
+   total += ready;
+   if (error || ready < n)
+   break;
+   /*
+* Successive loops are only necessary if there are more
+* ready events to gather, so they don't need to block.
+*/
+   tsp = &ts;
+   timespecclear(tsp);
+   }
kqueue_scan_finish(&scan);
-
-   *retval = n;
+   *retval = total;
return (error);
 
  done:
@@ -893,22 +919,22 @@ kqueue_sleep(struct kqueue *kq, struct t
return (error);
 }
 
+/*
+ * Scan the kqueue, blocking if necessary until the target time is reached.
+ * If tsp is NULL we block indefinitely.  If tsp->ts_secs/nsecs are both
+ * 0 we do not block at all.
+ */
 int
 kqueue_scan(struct kqueue_scan_state *scan, int maxevents,
-struct kevent *ulistp, struct timespec *tsp, struct kevent *kev,
-struct proc *p, int *retval)
+struct kevent *kevp, struct timespec *tsp, struct proc *p, int *errorp)
 {
struct kqueue *kq = scan->kqs_kq;
-   struct kevent *kevp;
struct knote *kn;
-   int s, count, nkev, error = 0;
+   int s, count, nkev = 0, error = 0;
 
-   nkev = 0;
-   kevp = kev;
count = maxevents;
if (count == 0)
goto done;
-
 retry:
KASSERT(count == maxevents);
KASSERT(nkev == 0);
@@ -958,14 +984,8 @@ retry:
while (count) {
kn = TAILQ_NEXT(&scan->kqs_start, kn_tqe);
if (kn->kn_filter == EVFILT_MARKER) {
-   if (kn == &scan->kqs_end) {
-   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start,
-   kn_tqe);
-   splx(s);
-   if (scan->kqs_nevent == 0)
-   goto retry;
-   goto done;
-   }
+   if (kn == &scan->kqs_end)
+   break;
 
/* Move start marker past another thread's marker. */
TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start, kn_tqe);
@@ -1001,6 +1021,9 @@ retry:
count--;
scan->kqs_nevent++;
 
+   /*
+* Post-event action on the note
+   

Switch select(2) to kqueue-based implementation

2020-11-30 Thread Martin Pieuchot
Now that the kqueue refactoring has been committed, here's once again
the diff to modify the internal implementation of {p,}select(2) to query
kqfilter handlers instead of poll ones.

{p,}poll(2) are left untouched to ease the transition.

I plan to commit this in 3 steps, to ease a possible revert:
 - kevent(2) refactoring
 - introduction of newer kq* APIs
 - dopselect rewrite

A mid-term goal of this change would be to get rid of the poll handlers
in order to have a single event system in the kernel to maintain and
turn mp-safe.

The logic is as follows:

- With this change every thread gets a "private" kqueue, usable by the
  kernel only, to register events for select(2) and later poll(2).

- Events specified via FD_SET(2) are converted to their kqueue equivalents
  (see the sketch after this list).

- kqueue_scan() has been modified to be restartable and work with a given
  kqueue.

- At the end of every {p,}select(2) syscall the private kqueue is purged.
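
As a purely hypothetical illustration of the FD_SET conversion mentioned in
the list above (the helper name and the flags are mine, they do not come from
this diff, and exceptfds is left out), each bit set in readfds/writefds maps
to one change entry:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/select.h>
#include <stddef.h>

int
fdset_to_kevents(int nfds, fd_set *readfds, fd_set *writefds,
    struct kevent *kevp, int maxkev)
{
	int fd, n = 0;

	for (fd = 0; fd < nfds && n < maxkev; fd++) {
		if (readfds != NULL && FD_ISSET(fd, readfds))
			EV_SET(&kevp[n++], fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
		if (writefds != NULL && FD_ISSET(fd, writefds) && n < maxkev)
			EV_SET(&kevp[n++], fd, EVFILT_WRITE, EV_ADD, 0, 0, NULL);
	}
	return n;
}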

This version should include your previous feedback.

Comments, tests and oks are welcome!

Thanks,
Martin

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.145
diff -u -p -r1.145 kern_event.c
--- kern/kern_event.c   25 Nov 2020 13:49:00 -  1.145
+++ kern/kern_event.c   30 Nov 2020 15:30:40 -
@@ -57,6 +57,7 @@
 #include 
 #include 
 
+struct kqueue *kqueue_alloc(struct filedesc *);
 void   kqueue_terminate(struct proc *p, struct kqueue *);
 void   kqueue_free(struct kqueue *);
 void   kqueue_init(void);
@@ -504,6 +505,27 @@ const struct filterops dead_filtops = {
.f_event= filt_dead,
 };
 
+void
+kqpoll_init(struct proc *p)
+{
+   if (p->p_kq != NULL)
+   return;
+
+   p->p_kq = kqueue_alloc(p->p_fd);
+   p->p_kq_serial = arc4random();
+}
+
+void
+kqpoll_exit(struct proc *p)
+{
+   if (p->p_kq == NULL)
+   return;
+
+   kqueue_terminate(p, p->p_kq);
+   kqueue_free(p->p_kq);
+   p->p_kq = NULL;
+}
+
 struct kqueue *
 kqueue_alloc(struct filedesc *fdp)
 {
@@ -567,6 +589,7 @@ sys_kevent(struct proc *p, void *v, regi
struct timespec ts;
struct timespec *tsp = NULL;
int i, n, nerrors, error;
+   int ready, total;
struct kevent kev[KQ_NEVENTS];
 
if ((fp = fd_getfile(fdp, SCARG(uap, fd))) == NULL)
@@ -595,9 +618,9 @@ sys_kevent(struct proc *p, void *v, regi
kq = fp->f_data;
nerrors = 0;
 
-   while (SCARG(uap, nchanges) > 0) {
-   n = SCARG(uap, nchanges) > KQ_NEVENTS ?
-   KQ_NEVENTS : SCARG(uap, nchanges);
+   while ((n = SCARG(uap, nchanges)) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
error = copyin(SCARG(uap, changelist), kev,
n * sizeof(struct kevent));
if (error)
@@ -635,11 +658,36 @@ sys_kevent(struct proc *p, void *v, regi
 
kqueue_scan_setup(&scan, kq);
FRELE(fp, p);
-   error = kqueue_scan(&scan, SCARG(uap, nevents), SCARG(uap, eventlist),
-   tsp, kev, p, &n);
+   /*
+* Collect as many events as we can.  The timeout on successive
+* loops is disabled (kqueue_scan() becomes non-blocking).
+*/
+   total = 0;
+   error = 0;
+   while ((n = SCARG(uap, nevents) - total) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
+   ready = kqueue_scan(&scan, n, kev, tsp, p, &error);
+   if (ready == 0)
+   break;
+   error = copyout(kev, SCARG(uap, eventlist) + total,
+   sizeof(struct kevent) * ready);
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktrevent(p, kev, ready);
+#endif
+   total += ready;
+   if (error || ready < n)
+   break;
+   /*
+* Successive loops are only necessary if there are more
+* ready events to gather, so they don't need to block.
+*/
+   tsp = &ts;
+   timespecclear(tsp);
+   }
kqueue_scan_finish(&scan);
-
-   *retval = n;
+   *retval = total;
return (error);
 
  done:
@@ -893,22 +941,22 @@ kqueue_sleep(struct kqueue *kq, struct t
return (error);
 }
 
+/*
+ * Scan the kqueue, blocking if necessary until the target time is reached.
+ * If tsp is NULL we block indefinitely.  If tsp->ts_secs/nsecs are both
+ * 0 we do not block at all.
+ */
 int
 kqueue_scan(struct kqueue_scan_state *scan, int maxevents,
-struct kevent *ulistp, struct timespec *tsp, struct kevent *kev,
-struct proc *p, int *retval)
+struct kevent *kevp, struct timespec *tsp, struct proc *p, int *errorp)
 {
struct kqueue *kq = scan->kqs_kq;
-   struct kevent *kevp;
struct knote *kn;
-   int s, count, nkev, error = 0;
+   int s, count, nkev = 0, error = 0;
 

Correct IPL for UVM pageqlock

2020-11-26 Thread Martin Pieuchot
As reported by Aisha Tammy on bugs@, there's a recursion currently
possible with the pageqlock:

   ddb> trace
   db_enter() at db_enter+0x10
   panic(81dcd47c) at panic+0x12a
   mtx_enter(B219ed00) at mtx_enter+0x81
   uvm_objfree(fd8015f2c9a0) at uvm_objfree+0x61
   buf_dealloc_mem(fd8015f2c8e0) at buf_dealloc_mem+0x7c
   buf_put(fd8015f2c8e0) at buf_put+0xcd
   brelse(fd8015f2c8e0) at brelse+0x3df
   sd_buf_done(fd8014a27028) at sd_buf_done+0xf9
   vioblk_vq_done(800c7050) at vioblk_vq_done+0x6f
   virtio_check_vqs(8009ce00) at virtio_check_vqs+0xfe
   intr_handler(8e6dea70, 80047700) at intr_handler+0x38
   Xintr_ioapic_edge19_untramp() at Xintr_ioapic_edge19_untramp+0x18f
   mtx_enter(fd8002699680) at mtx_enter+0xb
   uvm_pagedeactivate(fd8002699680) at uvm_pagedeactivate+0x105
   uvmpd_scan() at uvmpd_scan+0x259
   uvm_pageout(80009718) at uvm_pageout+0x375
   end trace frame: 0x0, count: -16

Diff below should prevent that, ok?
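
The rule the one-liner applies: a mutex that is also taken from interrupt
context must be initialized with an IPL at least as high as that interrupt,
so that mtx_enter() keeps the interrupt blocked for as long as the lock is
held.  A minimal kernel-style sketch (illustrative only, not the diff itself):

#include <sys/param.h>
#include <sys/mutex.h>

struct mutex example_lock;	/* also taken from an interrupt handler */

void
example_init(void)
{
	/*
	 * With IPL_NONE an interrupt could fire while the lock is held and
	 * try to take it again, which is exactly the recursion in the trace
	 * above.  IPL_VM sits above the block I/O interrupt level, so it
	 * also covers the completion path shown there.
	 */
	mtx_init(&example_lock, IPL_VM);
}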

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.151
diff -u -p -r1.151 uvm_page.c
--- uvm/uvm_page.c  24 Nov 2020 13:49:09 -  1.151
+++ uvm/uvm_page.c  26 Nov 2020 17:17:55 -
@@ -180,7 +180,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
TAILQ_INIT(&uvm.page_active);
TAILQ_INIT(&uvm.page_inactive_swp);
TAILQ_INIT(&uvm.page_inactive_obj);
-   mtx_init(&uvm.pageqlock, IPL_NONE);
+   mtx_init(&uvm.pageqlock, IPL_VM);
mtx_init(&uvm.fpageqlock, IPL_VM);
uvm_pmr_init();
 



Re: Fewer uvmexp

2020-11-19 Thread Martin Pieuchot
On 19/11/20(Thu) 01:02, Jeremie Courreges-Anglas wrote:
> On Wed, Nov 18 2020, Martin Pieuchot  wrote:
> > While auditing the various uses of the uvmexp fields I came across
> > those under #ifdef notyet.  May I delete them so I don't have to give
> > them some MP love?  Ok?
> 
> ok jca@, but while here shouldn't the rest of cpu_vm_init() go too?
> Unless I'm missing something it doesn't have side effects except
> computing ncolors, and ncolors is meant to be used by the code you're
> removing.

Like that?

Index: arch/amd64//amd64/cpu.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.150
diff -u -p -r1.150 cpu.c
--- arch/amd64//amd64/cpu.c 13 Sep 2020 11:53:16 -  1.150
+++ arch/amd64//amd64/cpu.c 19 Nov 2020 17:07:37 -
@@ -419,44 +419,6 @@ cpu_match(struct device *parent, void *m
return 1;
 }
 
-static void
-cpu_vm_init(struct cpu_info *ci)
-{
-   int ncolors = 2, i;
-
-   for (i = CAI_ICACHE; i <= CAI_L2CACHE; i++) {
-   struct x86_cache_info *cai;
-   int tcolors;
-
-   cai = &ci->ci_cinfo[i];
-
-   tcolors = atop(cai->cai_totalsize);
-   switch(cai->cai_associativity) {
-   case 0xff:
-   tcolors = 1; /* fully associative */
-   break;
-   case 0:
-   case 1:
-   break;
-   default:
-   tcolors /= cai->cai_associativity;
-   }
-   ncolors = max(ncolors, tcolors);
-   }
-
-#ifdef notyet
-   /*
-* Knowing the size of the largest cache on this CPU, re-color
-* our pages.
-*/
-   if (ncolors <= uvmexp.ncolors)
-   return;
-   printf("%s: %d page colors\n", ci->ci_dev->dv_xname, ncolors);
-   uvm_page_recolor(ncolors);
-#endif
-}
-
-
 void   cpu_idle_mwait_cycle(void);
 void   cpu_init_mwait(struct cpu_softc *);
 
@@ -689,7 +651,6 @@ cpu_attach(struct device *parent, struct
default:
panic("unknown processor type??");
}
-   cpu_vm_init(ci);
 
 #if defined(MULTIPROCESSOR)
if (mp_verbose) {
Index: arch/luna88k/luna88k/isr.c
===
RCS file: /cvs/src/sys/arch/luna88k/luna88k/isr.c,v
retrieving revision 1.11
diff -u -p -r1.11 isr.c
--- arch/luna88k/luna88k/isr.c  28 Jun 2017 10:31:48 -  1.11
+++ arch/luna88k/luna88k/isr.c  18 Nov 2020 13:11:27 -
@@ -151,10 +151,6 @@ isrdispatch_autovec(int ipl)
panic("isrdispatch_autovec: bad ipl %d", ipl);
 #endif
 
-#if 0  /* XXX: already counted in machdep.c */
-   uvmexp.intrs++;
-#endif
-
list = &isr_autovec[ipl];
if (LIST_EMPTY(list)) {
printf("isrdispatch_autovec: ipl %d unexpected\n", ipl);



Re: dt: add kernel function boundary tracing provider

2020-11-19 Thread Martin Pieuchot
Hello Tom,

Thanks for sharing your work, that's awesome!

On 14/11/20(Sat) 13:13, Tom Rollet wrote:
> Here is a diff for dynamic tracing of kernel's functions boundaries.
> It's implemented as one of the dt's provider on i386 and amd64.
> To activate it, DDBPROF and pseudo device dt must be activated
> on GENERIC.

Could we replace the DDBPROF define by NDT > 0?  Is it still possible to
build a profiling kernel if we do so?

Being able to use the existing kernel profiling code is a nice way to
test prologue patching but if performances are really bad we might want
to keep a way to use the old profiling method.  If it's too complicated
to have both coexist I'd suggest we get rid of DBPROF and use the
intr-based patching code for dt(4) only.

> For now it's working like DTRACE fbt(function boundaries tracing).
> 
> We replace the prologue and epilogue of each function with a breakpoint.
> The replaced instruction on amd64 and 386 are respectively
> "push %rbp", "push %ebp" for prologue and "ret", "pop %ebp" for epilogue.
> These instructions are emulated at the end of the breakpoint handler,
> just after sending an event to userland (to be read by btrace).
> 
> For now the lookup to find the instruction to replace is hardcoded
> and needs to find if there is a retguard before the prologue of the
> function or if there are multiples int3 after the last ret.
> It allows to find 31896 entry points and 19513 return points on amd64.

That's a great start.

> ddb has also  a similar way of tracing all prologue of function. It now
> uses this new dt provider for doing tracing.
> 
> Perf wise, when all entry probes are enabled, the slow down is
> quite massive but expected since breakpoints are slow.
> On the kernel compilation with 10 threads on a linux qemu VM we go from:
> real   2m38.280s
> user   10m2.050s
> sys    14m10.360s
> 
> to:
> real   24m19.280s
> user   9m44.110s
> sys    220m26.980s

Did you try on i386 as well?  How is it?

> Any comments on the diff?

See inline.

> diff --git a/sys/arch/amd64/amd64/vector.S b/sys/arch/amd64/amd64/vector.S
> index dd2dfde3e3b..3a0a58ba3ac 100644
> --- a/sys/arch/amd64/amd64/vector.S
> +++ b/sys/arch/amd64/amd64/vector.S
> @@ -188,10 +188,11 @@ INTRENTRY_LABEL(trap03):
>  sti
>  cld
>  SMAP_CLAC
> -    movq    %rsp, %rdi
> -    call    _C_LABEL(db_prof_hook)
> -    cmpl    $1, %eax
> -    jne    .Lreal_kern_trap
> +    leaq    _C_LABEL(dt_prov_kprobe), %rdi
> +    movq    %rsp, %rsi
> +    call    _C_LABEL(dt_prov_kprobe_hook)

Why don't we call a function with no argument?  Maybe the current
interface is over designed and we should not use it.

> +    cmpl    $0, %eax
> +    je .Lreal_kern_trap
> 
>  cli
>  movq    TF_RDI(%rsp),%rdi
> @@ -210,6 +211,11 @@ INTRENTRY_LABEL(trap03):
>  movq    TF_R11(%rsp),%r11
>  /* %rax restored below, after being used to shift the stack */
> 
> +    cmpl    $2, %eax
> +    je  .Lemulate_ret
> +
> +.Lemulate_push_rbp:
> +
>  /*
>   * We are returning from a probe trap so we need to fix the
>   * stack layout and emulate the patched instruction.
> @@ -217,6 +223,9 @@ INTRENTRY_LABEL(trap03):
>   */
>  subq    $16, %rsp
> 
> +    movq    (TF_RAX + 16)(%rsp), %rax
> +    movq    %rax, TF_RAX(%rsp)
> +
>  /* Shift hardware-saved registers. */
>  movq    (TF_RIP + 16)(%rsp), %rax
>  movq    %rax, TF_RIP(%rsp)
> @@ -237,7 +246,20 @@ INTRENTRY_LABEL(trap03):
> 
>  /* Finally restore %rax */
>  movq    (TF_RAX + 16)(%rsp),%rax
> +    jmp .ret_int3

Shouldn't we copy the two instructions here instead of jumping?  If
you're after reducing code duplication you can look at macros like
the ones used in frameasm.h.

> +.Lemulate_ret:
> +
> +    /* Store a new return address in %rip */
> +    movq    TF_RSP(%rsp), %rax
> +    movq    (%rax), %rax
> +    movq    %rax, TF_RIP(%rsp)
> +    addq    $8, TF_RSP(%rsp)
> +
> +    /* Finally restore %rax */
> +    movq    (TF_RAX)(%rsp),%rax
> 
> +.ret_int3:
>  addq    $TF_RIP,%rsp
>  iretq
>  #endif /* !defined(GPROF) && defined(DDBPROF) */
> diff --git a/sys/arch/i386/i386/locore.s b/sys/arch/i386/i386/locore.s
> index 0c5607fe38a..e2d43b47eb3 100644
> --- a/sys/arch/i386/i386/locore.s
> +++ b/sys/arch/i386/i386/locore.s
> @@ -205,7 +205,8 @@ INTRENTRY_LABEL(label):    /* from kernel */    ; \
>  #define    INTRFASTEXIT \
>  jmp    intr_fast_exit
> 
> -#define    INTR_FAKE_TRAP    0xbadabada
> +#define    INTR_FAKE_TRAP_PUSH_RPB    0xbadabada
> +#define    INTR_FAKE_TRAP_POP_RBP    0xbcbcbcbc
> 
>  /*
>   * PTmap is recursive pagemap at top of virtual address space.
> @@ -1259,17 +1260,32 @@ calltrap:
>  jne    .Lreal_trap
> 
>  pushl    %esp
> -    call    _C_LABEL(db_prof_hook)
> -    addl    $4,%esp
> -    cmpl    $1,%eax
> -    jne    .Lreal_trap
> +    subl    $4, %esp
> +    pushl   %eax
> +    leal    _C_LABEL(dt_prov_kprobe), %eax
> +    movl    %eax, 4(%esp)
> +    popl    %eax
> +    call    

Locking of uvm_pageclean()

2020-11-18 Thread Martin Pieuchot
I found another race related to some missing locking, this time around
uvm_pageclean().

Diff below fixes the two places in /sys/uvm where the page queue lock
should be taken.  To prevent further corruption I added some assertions 
and documented some global data structures that are currently protected
by this lock.

Note that uvm_pagefree() is called by many pmaps most of the time
without the lock held.  The diff below doesn't fix them and that's why
some assertions are commented out.

ok?

Index: uvm/uvm.h
===
RCS file: /cvs/src/sys/uvm/uvm.h,v
retrieving revision 1.67
diff -u -p -r1.67 uvm.h
--- uvm/uvm.h   6 Dec 2019 08:33:25 -   1.67
+++ uvm/uvm.h   18 Nov 2020 23:22:15 -
@@ -44,18 +44,20 @@
 /*
  * uvm structure (vm global state: collected in one structure for ease
  * of reference...)
+ *
+ *  Locks used to protect struct members in this file:
+ * Q   uvm.pageqlock
  */
-
 struct uvm {
/* vm_page related parameters */
 
/* vm_page queues */
-   struct pglist page_active;  /* allocated pages, in use */
-   struct pglist page_inactive_swp;/* pages inactive (reclaim or free) */
-   struct pglist page_inactive_obj;/* pages inactive (reclaim or free) */
+   struct pglist page_active;  /* [Q] allocated pages, in use */
+   struct pglist page_inactive_swp;/* [Q] pages inactive (reclaim/free) */
+   struct pglist page_inactive_obj;/* [Q] pages inactive (reclaim/free) */
/* Lock order: pageqlock, then fpageqlock. */
-   struct mutex pageqlock; /* lock for active/inactive page q */
-   struct mutex fpageqlock;/* lock for free page q  + pdaemon */
+   struct mutex pageqlock; /* [] lock for active/inactive page q */
+   struct mutex fpageqlock;/* [] lock for free page q  + pdaemon */
boolean_t page_init_done;   /* TRUE if uvm_page_init() finished */
struct uvm_pmr_control pmr_control; /* pmemrange data */
 
Index: uvm/uvm_anon.c
===
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
retrieving revision 1.49
diff -u -p -r1.49 uvm_anon.c
--- uvm/uvm_anon.c  4 Jan 2020 16:17:29 -   1.49
+++ uvm/uvm_anon.c  18 Nov 2020 23:22:15 -
@@ -106,7 +106,9 @@ uvm_anfree_list(struct vm_anon *anon, st
 * clean page, and put on on pglist
 * for later freeing.
 */
+   uvm_lock_pageq();
uvm_pageclean(pg);
+   uvm_unlock_pageq();
TAILQ_INSERT_HEAD(pgl, pg, pageq);
} else {
uvm_lock_pageq();   /* lock out pagedaemon */
Index: uvm/uvm_object.c
===
RCS file: /cvs/src/sys/uvm/uvm_object.c,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_object.c
--- uvm/uvm_object.c21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_object.c18 Nov 2020 23:22:15 -
@@ -172,7 +172,9 @@ uvm_objfree(struct uvm_object *uobj)
 * this pg from the uobj we are throwing away
 */
atomic_clearbits_int(>pg_flags, PG_TABLED);
+   uvm_lock_pageq();
uvm_pageclean(pg);
+   uvm_unlock_pageq();
TAILQ_INSERT_TAIL(&pgl, pg, pageq);
}
uvm_pmr_freepageq(&pgl);
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_page.c
--- uvm/uvm_page.c  22 Sep 2020 14:31:08 -  1.150
+++ uvm/uvm_page.c  18 Nov 2020 23:22:15 -
@@ -973,6 +973,10 @@ uvm_pageclean(struct vm_page *pg)
 {
u_int flags_to_clear = 0;
 
+#if all_pmap_are_fixed
+   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
+#endif
+
 #ifdef DEBUG
if (pg->uobject == (void *)0xdeadbeef &&
pg->uanon == (void *)0xdeadbeef) {
@@ -1037,6 +1041,10 @@ uvm_pageclean(struct vm_page *pg)
 void
 uvm_pagefree(struct vm_page *pg)
 {
+#if all_pmap_are_fixed
+   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
+#endif
+
uvm_pageclean(pg);
uvm_pmr_freepages(pg, 1);
 }
@@ -1229,6 +1237,8 @@ uvm_pagelookup(struct uvm_object *obj, v
 void
 uvm_pagewire(struct vm_page *pg)
 {
+   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
+
if (pg->wire_count == 0) {
if (pg->pg_flags & PQ_ACTIVE) {
TAILQ_REMOVE(&uvm.page_active, pg, pageq);
@@ -1257,6 +1267,8 @@ uvm_pagewire(struct vm_page *pg)
 void
 uvm_pageunwire(struct vm_page *pg)
 {
+   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
+
pg->wire_count--;
if (pg->wire_count == 0) {
TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
@@ -1276,6 +1288,8 @@ uvm_pageunwire(struct vm_page *pg)
 void
 uvm_pagedeactivate(struct vm_page *pg)
 {
+   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
+
if 

Fewer uvmexp

2020-11-18 Thread Martin Pieuchot
While auditing the various uses of the uvmexp fields I came across
those under #ifdef notyet.  May I delete them so I don't have to give
them some MP love?  Ok?

Index: arch/amd64//amd64/cpu.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.150
diff -u -p -r1.150 cpu.c
--- arch/amd64//amd64/cpu.c 13 Sep 2020 11:53:16 -  1.150
+++ arch/amd64//amd64/cpu.c 18 Nov 2020 13:11:17 -
@@ -443,17 +443,6 @@ cpu_vm_init(struct cpu_info *ci)
}
ncolors = max(ncolors, tcolors);
}
-
-#ifdef notyet
-   /*
-* Knowing the size of the largest cache on this CPU, re-color
-* our pages.
-*/
-   if (ncolors <= uvmexp.ncolors)
-   return;
-   printf("%s: %d page colors\n", ci->ci_dev->dv_xname, ncolors);
-   uvm_page_recolor(ncolors);
-#endif
 }
 
 
Index: arch/luna88k/luna88k/isr.c
===
RCS file: /cvs/src/sys/arch/luna88k/luna88k/isr.c,v
retrieving revision 1.11
diff -u -p -r1.11 isr.c
--- arch/luna88k/luna88k/isr.c  28 Jun 2017 10:31:48 -  1.11
+++ arch/luna88k/luna88k/isr.c  18 Nov 2020 13:11:27 -
@@ -151,10 +151,6 @@ isrdispatch_autovec(int ipl)
panic("isrdispatch_autovec: bad ipl %d", ipl);
 #endif
 
-#if 0  /* XXX: already counted in machdep.c */
-   uvmexp.intrs++;
-#endif
-
list = &isr_autovec[ipl];
if (LIST_EMPTY(list)) {
printf("isrdispatch_autovec: ipl %d unexpected\n", ipl);



Re: uvm_pagealloc() & uvm.free accounting

2020-11-17 Thread Martin Pieuchot
On 17/11/20(Tue) 13:52, Mark Kettenis wrote:
> > Date: Tue, 17 Nov 2020 09:32:28 -0300
> > From: Martin Pieuchot 
> > 
> > On 17/11/20(Tue) 13:23, Mark Kettenis wrote:
> > > > Date: Mon, 16 Nov 2020 10:11:50 -0300
> > > > From: Martin Pieuchot 
> > > > 
> > > > On 13/11/20(Fri) 21:05, Mark Kettenis wrote:
> > > > > [...] 
> > > > > > Careful reviewers will spot an off-by-one change in the check for 
> > > > > > pagedaemon's reserved memory.  My understanding is that it's a 
> > > > > > bugfix,
> > > > > > is it correct?
> > > > > 
> > > > > You mean for uvm_pagealloc().  I'd say yes.  But this does mean that
> > > > > in some sense the pagedaemon reserve grows by a page at the cost of
> > > > > the kernel reserve.  So I think it would be a good idea to bump the
> > > > > kernel reserve a bit.  Maybe to 8 pages?
> > > > 
> > > > Fine with me.  Could you do it?
> > > 
> > > I think it should be done as part of this diff.
> > 
> > Fine with me, updated diff below.
> 
> Not quite.  Need to bump the kernel reserve such that we still have
> enough on top of the pagedaemon reserve because your fix means the
> page daemon may consume an additional page.

Here we go:

Index: uvm/uvm_extern.h
===
RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_extern.h
--- uvm/uvm_extern.h19 Oct 2020 08:19:46 -  1.154
+++ uvm/uvm_extern.h16 Nov 2020 13:09:39 -
@@ -142,14 +142,15 @@ typedef int   vm_prot_t;
#define UVM_PGA_ZERO   0x0002  /* returned page must be zeroed 
*/
 
 /*
- * flags for uvm_pglistalloc()
+ * flags for uvm_pglistalloc() also used by uvm_pmr_getpages()
  */
 #define UVM_PLA_WAITOK 0x0001  /* may sleep */
 #define UVM_PLA_NOWAIT 0x0002  /* can't sleep (need one of the two) */
 #define UVM_PLA_ZERO   0x0004  /* zero all pages before returning */
 #define UVM_PLA_TRYCONTIG  0x0008  /* try to allocate contig physmem */
 #define UVM_PLA_FAILOK 0x0010  /* caller can handle failure */
-#define UVM_PLA_NOWAKE 0x0020  /* don't wake the page daemon on 
failure */
+#define UVM_PLA_NOWAKE 0x0020  /* don't wake page daemon on failure */
+#define UVM_PLA_USERESERVE 0x0040  /* can allocate from kernel reserve */
 
 /*
  * lockflags that control the locking behavior of various functions.
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_page.c
--- uvm/uvm_page.c  22 Sep 2020 14:31:08 -  1.150
+++ uvm/uvm_page.c  17 Nov 2020 13:13:03 -
@@ -277,7 +277,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 * XXXCDC - values may need adjusting
 */
uvmexp.reserve_pagedaemon = 4;
-   uvmexp.reserve_kernel = 6;
+   uvmexp.reserve_kernel = 8;
uvmexp.anonminpct = 10;
uvmexp.vnodeminpct = 10;
uvmexp.vtextminpct = 5;
@@ -733,32 +733,11 @@ uvm_pglistalloc(psize_t size, paddr_t lo
size = atop(round_page(size));
 
/*
-* check to see if we need to generate some free pages waking
-* the pagedaemon.
-*/
-   if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freemin ||
-   ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg &&
-   (uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg))
-   wakeup(&uvm.pagedaemon);
-
-   /*
 * XXX uvm_pglistalloc is currently only used for kernel
 * objects. Unlike the checks in uvm_pagealloc, below, here
-* we are always allowed to use the kernel reserve. However, we
-* have to enforce the pagedaemon reserve here or allocations
-* via this path could consume everything and we can't
-* recover in the page daemon.
+* we are always allowed to use the kernel reserve.
 */
- again:
-   if ((uvmexp.free <= uvmexp.reserve_pagedaemon + size &&
-   !((curproc == uvm.pagedaemon_proc) ||
-   (curproc == syncerproc)))) {
-   if (flags & UVM_PLA_WAITOK) {
-   uvm_wait("uvm_pglistalloc");
-   goto again;
-   }
-   return (ENOMEM);
-   }
+   flags |= UVM_PLA_USERESERVE;
 
if ((high & PAGE_MASK) != PAGE_MASK) {
printf("uvm_pglistalloc: Upper boundary 0x%lx "
@@ -871,7 +850,7 @@ uvm_pagerealloc_multi(struct uvm_object 
 }
 
 /*
- * uvm_pagealloc_strat: allocate vm_page from a particular free list.
+ * uvm_pagealloc: allocate vm_page

Re: uvm_pagealloc() & uvm.free accounting

2020-11-17 Thread Martin Pieuchot
On 17/11/20(Tue) 13:23, Mark Kettenis wrote:
> > Date: Mon, 16 Nov 2020 10:11:50 -0300
> > From: Martin Pieuchot 
> > 
> > On 13/11/20(Fri) 21:05, Mark Kettenis wrote:
> > > [...] 
> > > > Careful reviewers will spot an off-by-one change in the check for 
> > > > pagedaemon's reserved memory.  My understanding is that it's a bugfix,
> > > > is it correct?
> > > 
> > > You mean for uvm_pagealloc().  I'd say yes.  But this does mean that
> > > in some sense the pagedaemon reserve grows by a page at the cost of
> > > the kernel reserve.  So I think it would be a good idea to bump the
> > > kernel reserve a bit.  Maybe to 8 pages?
> > 
> > Fine with me.  Could you do it?
> 
> I think it should be done as part of this diff.

Fine with me, updated diff below.

> > > > I also started to document the locking for some of the `uvmexp' fields.
> > > > I'm well aware that reading `uvmexp.free' is done without the correct
> > > > lock in the pagedaemon, this will be addressed soon.  The other accesses
> > > > shouldn't matter much as they are not used to make decisions.
> > > > 
> > > > This should hopefully be good enough to make uvm_pagealloc() completely
> > > > mp-safe, something that matters much because it's called from 
> > > > pmap_enter(9)
> > > > on some architectures.
> > > > 
> > > > ok?
> > > 
> > > Some additional comments below...
> > 
> > Updated diff addressing your comments below, ok?
> 
> Yes, looks good, but I'd like you to include the kernel reserve bump.
> 
> And maybe ping beck@ and ask him whether he is ok as well as this is
> related to the buffer cache and page daemon.

Bob are you ok with the diff below?

Index: uvm/uvm_extern.h
===
RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_extern.h
--- uvm/uvm_extern.h19 Oct 2020 08:19:46 -  1.154
+++ uvm/uvm_extern.h16 Nov 2020 13:09:39 -
@@ -142,14 +142,15 @@ typedef int   vm_prot_t;
#define UVM_PGA_ZERO   0x0002  /* returned page must be zeroed 
*/
 
 /*
- * flags for uvm_pglistalloc()
+ * flags for uvm_pglistalloc() also used by uvm_pmr_getpages()
  */
 #define UVM_PLA_WAITOK 0x0001  /* may sleep */
 #define UVM_PLA_NOWAIT 0x0002  /* can't sleep (need one of the two) */
 #define UVM_PLA_ZERO   0x0004  /* zero all pages before returning */
 #define UVM_PLA_TRYCONTIG  0x0008  /* try to allocate contig physmem */
 #define UVM_PLA_FAILOK 0x0010  /* caller can handle failure */
-#define UVM_PLA_NOWAKE 0x0020  /* don't wake the page daemon on 
failure */
+#define UVM_PLA_NOWAKE 0x0020  /* don't wake page daemon on failure */
+#define UVM_PLA_USERESERVE 0x0040  /* can allocate from kernel reserve */
 
 /*
  * lockflags that control the locking behavior of various functions.
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_page.c
--- uvm/uvm_page.c  22 Sep 2020 14:31:08 -  1.150
+++ uvm/uvm_page.c  17 Nov 2020 12:30:11 -
@@ -276,7 +276,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 * init reserve thresholds
 * XXXCDC - values may need adjusting
 */
-   uvmexp.reserve_pagedaemon = 4;
+   uvmexp.reserve_pagedaemon = 8;
uvmexp.reserve_kernel = 6;
uvmexp.anonminpct = 10;
uvmexp.vnodeminpct = 10;
@@ -733,32 +733,11 @@ uvm_pglistalloc(psize_t size, paddr_t lo
size = atop(round_page(size));
 
/*
-* check to see if we need to generate some free pages waking
-* the pagedaemon.
-*/
-   if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freemin ||
-   ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg &&
-   (uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg))
-   wakeup(&uvm.pagedaemon);
-
-   /*
 * XXX uvm_pglistalloc is currently only used for kernel
 * objects. Unlike the checks in uvm_pagealloc, below, here
-* we are always allowed to use the kernel reserve. However, we
-* have to enforce the pagedaemon reserve here or allocations
-* via this path could consume everything and we can't
-* recover in the page daemon.
+* we are always allowed to use the kernel reserve.
 */
- again:
-   if ((uvmexp.free <= uvmexp.reserve_pagedaemon + size &&
-   !((curproc == uvm.pagedaemon_proc) ||
-   (curproc == syncerproc)))) {
-   if (flags & UVM_PLA_WAITOK) {
-  

uvm_fault: refactoring for case 2 faults

2020-11-17 Thread Martin Pieuchot
Here's another refactoring that moves the remaining logic of uvm_fault(),
the part handling lower faults (case 2), to its own function.  This logic shouldn't
be modified in the first step of unlocking amap & anon and will still be
executed under KERNEL_LOCK().  Having a separate function will however
help to turn the 'ReFault' goto into a more readable loop.  This will be
the next step.

ok?
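
For the record, here is roughly what that next step could look like once both
halves return ERESTART instead of jumping back to the label.  This is only a
sketch of the shape: the helpers are stand-ins, not the real uvm_fault_*()
signatures, and ERESTART gets a fake value just so the snippet is
self-contained.

#define ERESTART	(-1)	/* stand-in value, the kernel has its own */

static int fault_check(int *shadowed) { *shadowed = 0; return 0; }
static int fault_upper(void) { return 0; }	/* case 1: anon layer */
static int fault_lower(void) { return 0; }	/* case 2: backing object */

int
fault_retry_loop(void)
{
	int error, shadowed;

	do {
		error = fault_check(&shadowed);
		if (error == ERESTART)
			continue;	/* state changed underneath us, retry */
		if (error != 0)
			break;		/* real failure, give up */
		error = shadowed ? fault_upper() : fault_lower();
	} while (error == ERESTART);

	return (error);
}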

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.107
diff -u -p -r1.107 uvm_fault.c
--- uvm/uvm_fault.c 16 Nov 2020 12:30:16 -  1.107
+++ uvm/uvm_fault.c 16 Nov 2020 13:27:32 -
@@ -484,6 +484,9 @@ struct uvm_faultctx {
paddr_t pa_flags;
 };
 
+intuvm_fault_lower(struct uvm_faultinfo *, struct uvm_faultctx *,
+   struct vm_page **, vm_fault_t, vm_prot_t);
+
 /*
  * uvm_fault_check: check prot, handle needs-copy, etc.
  *
@@ -901,19 +904,11 @@ uvm_fault(vm_map_t orig_map, vaddr_t vad
 {
struct uvm_faultinfo ufi;
struct uvm_faultctx flt;
-   boolean_t promote, locked, shadowed;
-   int result, lcv, gotpages;
-   vaddr_t currva;
-   voff_t uoff;
-   struct vm_amap *amap;
-   struct uvm_object *uobj;
-   struct vm_anon *anons_store[UVM_MAXRANGE], **anons, *anon;
-   struct vm_page *pages[UVM_MAXRANGE], *pg, *uobjpage;
+   boolean_t shadowed;
+   struct vm_anon *anons_store[UVM_MAXRANGE], **anons;
+   struct vm_page *pages[UVM_MAXRANGE];
int error;
 
-   anon = NULL;
-   pg = NULL;
-
uvmexp.faults++;/* XXX: locking? */
TRACEPOINT(uvm, fault, vaddr, fault_type, access_type, NULL);
 
@@ -957,8 +952,28 @@ ReFault:
}
}
 
-   amap = ufi.entry->aref.ar_amap;
-   uobj = ufi.entry->object.uvm_obj;
+   /* handle case 2: faulting on backing object or zero fill */
-   error = uvm_fault_lower(&ufi, &flt, pages, fault_type, access_type);
+   switch (error) {
+   case ERESTART:
+   goto ReFault;
+   default:
+   return error;
+   }
+}
+
+int
+uvm_fault_lower(struct uvm_faultinfo *ufi, struct uvm_faultctx *flt,
+   struct vm_page **pages, vm_fault_t fault_type, vm_prot_t access_type)
+{
+   struct vm_amap *amap = ufi->entry->aref.ar_amap;
+   struct uvm_object *uobj = ufi->entry->object.uvm_obj;
+   boolean_t promote, locked;
+   int result, lcv, gotpages;
+   struct vm_page *uobjpage, *pg = NULL;
+   struct vm_anon *anon = NULL;
+   vaddr_t currva;
+   voff_t uoff;
 
/*
 * if the desired page is not shadowed by the amap and we have a
@@ -967,15 +982,15 @@ ReFault:
 * with the usual pgo_get hook).  the backing object signals this by
 * providing a pgo_fault routine.
 */
-   if (uobj && shadowed == FALSE && uobj->pgops->pgo_fault != NULL) {
-   result = uobj->pgops->pgo_fault(&ufi, flt.startva, pages,
-   flt.npages, flt.centeridx, fault_type, access_type,
+   if (uobj != NULL && uobj->pgops->pgo_fault != NULL) {
+   result = uobj->pgops->pgo_fault(ufi, flt->startva, pages,
+   flt->npages, flt->centeridx, fault_type, access_type,
PGO_LOCKED);
 
if (result == VM_PAGER_OK)
return (0); /* pgo_fault did pmap enter */
else if (result == VM_PAGER_REFAULT)
-   goto ReFault;   /* try again! */
+   return ERESTART;/* try again! */
else
return (EACCES);
}
@@ -989,20 +1004,20 @@ ReFault:
 *
 * ("get" has the option of doing a pmap_enter for us)
 */
-   if (uobj && shadowed == FALSE) {
+   if (uobj != NULL) {
uvmexp.fltlget++;
-   gotpages = flt.npages;
-   (void) uobj->pgops->pgo_get(uobj, ufi.entry->offset +
-   (flt.startva - ufi.entry->start),
-   pages, &gotpages, flt.centeridx,
-   access_type & MASK(ufi.entry),
-   ufi.entry->advice, PGO_LOCKED);
+   gotpages = flt->npages;
+   (void) uobj->pgops->pgo_get(uobj, ufi->entry->offset +
+   (flt->startva - ufi->entry->start),
+   pages, &gotpages, flt->centeridx,
+   access_type & MASK(ufi->entry),
+   ufi->entry->advice, PGO_LOCKED);
 
/* check for pages to map, if we got any */
uobjpage = NULL;
if (gotpages) {
-   currva = flt.startva;
-   for (lcv = 0 ; lcv < flt.npages ;
+   currva = flt->startva;
+   for (lcv = 0 ; lcv < flt->npages ;
   

Re: uvm_pagealloc() & uvm.free accounting

2020-11-16 Thread Martin Pieuchot
On 13/11/20(Fri) 21:05, Mark Kettenis wrote:
> [...] 
> > Careful reviewers will spot an off-by-one change in the check for 
> > pagedaemon's reserved memory.  My understanding is that it's a bugfix,
> > is it correct?
> 
> You mean for uvm_pagealloc().  I'd say yes.  But this does mean that
> in some sense the pagedaemon reserve grows by a page at the cost of
> the kernel reserve.  So I think it would be a good idea to bump the
> kernel reserve a bit.  Maybe to 8 pages?

Fine with me.  Could you do it?

> > I also started to document the locking for some of the `uvmexp' fields.
> > I'm well aware that reading `uvmexp.free' is done without the correct
> > lock in the pagedaemon, this will be addressed soon.  The other accesses
> > shouldn't matter much as they are not used to make decisions.
> > 
> > This should hopefully be good enough to make uvm_pagealloc() completely
> > mp-safe, something that matters much because it's called from pmap_enter(9)
> > on some architectures.
> > 
> > ok?
> 
> Some additional comments below...

Updated diff addressing your comments below, ok?

Index: uvm/uvm_extern.h
===
RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_extern.h
--- uvm/uvm_extern.h19 Oct 2020 08:19:46 -  1.154
+++ uvm/uvm_extern.h16 Nov 2020 13:09:39 -
@@ -142,14 +142,15 @@ typedef int   vm_prot_t;
#define UVM_PGA_ZERO   0x0002  /* returned page must be zeroed 
*/
 
 /*
- * flags for uvm_pglistalloc()
+ * flags for uvm_pglistalloc() also used by uvm_pmr_getpages()
  */
 #define UVM_PLA_WAITOK 0x0001  /* may sleep */
 #define UVM_PLA_NOWAIT 0x0002  /* can't sleep (need one of the two) */
 #define UVM_PLA_ZERO   0x0004  /* zero all pages before returning */
 #define UVM_PLA_TRYCONTIG  0x0008  /* try to allocate contig physmem */
 #define UVM_PLA_FAILOK 0x0010  /* caller can handle failure */
-#define UVM_PLA_NOWAKE 0x0020  /* don't wake the page daemon on 
failure */
+#define UVM_PLA_NOWAKE 0x0020  /* don't wake page daemon on failure */
+#define UVM_PLA_USERESERVE 0x0040  /* can allocate from kernel reserve */
 
 /*
  * lockflags that control the locking behavior of various functions.
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_page.c
--- uvm/uvm_page.c  22 Sep 2020 14:31:08 -  1.150
+++ uvm/uvm_page.c  16 Nov 2020 13:11:19 -
@@ -733,32 +733,11 @@ uvm_pglistalloc(psize_t size, paddr_t lo
size = atop(round_page(size));
 
/*
-* check to see if we need to generate some free pages waking
-* the pagedaemon.
-*/
-   if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freemin ||
-   ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg &&
-   (uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg))
-   wakeup(&uvm.pagedaemon);
-
-   /*
 * XXX uvm_pglistalloc is currently only used for kernel
 * objects. Unlike the checks in uvm_pagealloc, below, here
-* we are always allowed to use the kernel reserve. However, we
-* have to enforce the pagedaemon reserve here or allocations
-* via this path could consume everything and we can't
-* recover in the page daemon.
+* we are always allowed to use the kernel reserve.
 */
- again:
-   if ((uvmexp.free <= uvmexp.reserve_pagedaemon + size &&
-   !((curproc == uvm.pagedaemon_proc) ||
-   (curproc == syncerproc)))) {
-   if (flags & UVM_PLA_WAITOK) {
-   uvm_wait("uvm_pglistalloc");
-   goto again;
-   }
-   return (ENOMEM);
-   }
+   flags |= UVM_PLA_USERESERVE;
 
if ((high & PAGE_MASK) != PAGE_MASK) {
printf("uvm_pglistalloc: Upper boundary 0x%lx "
@@ -871,7 +850,7 @@ uvm_pagerealloc_multi(struct uvm_object 
 }
 
 /*
- * uvm_pagealloc_strat: allocate vm_page from a particular free list.
+ * uvm_pagealloc: allocate vm_page from a particular free list.
  *
  * => return null if no pages free
  * => wake up pagedaemon if number of free pages drops below low water mark
@@ -886,37 +865,21 @@ uvm_pagealloc(struct uvm_object *obj, vo
struct vm_page *pg;
struct pglist pgl;
int pmr_flags;
-   boolean_t use_reserve;
 
KASSERT(obj == NULL || anon == NULL);
+   KASSERT(anon == NULL || off == 0);
KASSERT(off == trunc_page(off));
 
-   /*
-* check to see if we need to generate some free pages waking
-* the pagedaemon.
-*/
-   if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freemin ||
-   ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg &&
-   (uvmexp.inactive + BUFPAGES_INACT) < 

uvm_fault: Kill goto Case2

2020-11-13 Thread Martin Pieuchot
Another simple refactoring of uvm_fault() removing a goto, ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.106
diff -u -p -r1.106 uvm_fault.c
--- uvm/uvm_fault.c 13 Nov 2020 14:18:25 -  1.106
+++ uvm/uvm_fault.c 13 Nov 2020 15:01:41 -
@@ -942,12 +942,24 @@ ReFault:
return error;
}
 
-   amap = ufi.entry->aref.ar_amap;
-   uobj = ufi.entry->object.uvm_obj;
-
/* (shadowed == TRUE) if there is an anon at the faulting address */
shadowed = uvm_fault_upper_lookup(&ufi, &flt, anons, pages);
 
+   /* handle case 1: fault on an anon in our amap */
+   if (shadowed == TRUE) {
+   error = uvm_fault_upper(&ufi, &flt, anons, fault_type,
+   access_type);
+   switch (error) {
+   case ERESTART:
+   goto ReFault;
+   default:
+   return error;
+   }
+   }
+
+   amap = ufi.entry->aref.ar_amap;
+   uobj = ufi.entry->object.uvm_obj;
+
/*
 * if the desired page is not shadowed by the amap and we have a
 * backing object, then we check to see if the backing object would
@@ -1055,30 +1067,12 @@ ReFault:
/*
 * note that at this point we are done with any front or back pages.
 * we are now going to focus on the center page (i.e. the one we've
-* faulted on).  if we have faulted on the top (anon) layer
-* [i.e. case 1], then the anon we want is anons[centeridx] (we have
-* not touched it yet).  if we have faulted on the bottom (uobj)
+* faulted on).  if we have faulted on the bottom (uobj)
 * layer [i.e. case 2] and the page was both present and available,
 * then we've got a pointer to it as "uobjpage" and we've already
 * made it BUSY.
 */
-   /*
-* there are four possible cases we must address: 1A, 1B, 2A, and 2B
-*/
-   /* redirect case 2: if we are not shadowed, go to case 2. */
-   if (shadowed == FALSE)
-   goto Case2;
-
-   /* handle case 1: fault on an anon in our amap */
-   error = uvm_fault_upper(&ufi, &flt, anons, fault_type, access_type);
-   switch (error) {
-   case ERESTART:
-   goto ReFault;
-   default:
-   return error;
-   }
 
-Case2:
/* handle case 2: faulting on backing object or zero fill */
/*
 * note that uobjpage can not be PGO_DONTCARE at this point.  we now



uvm_pagealloc() & uvm.free accounting

2020-11-13 Thread Martin Pieuchot
The uvmexp.free* variables are read in uvm_pagealloc() & uvm_pglistalloc()
before grabbing the `fpageqlock'.

`uvmexp.free' is always modified by the pmemrange allocator under the
above motioned lock.  To avoid races and the chances of missing a wakeup,
the diff below move the checks inside the critical section.

Note that this doesn't solve the race with reading `freemin', `freetarg',
`inactive' and `inactarg'.  These are currently updated under another
lock and will be addressed in an upcoming diff.

To fix this race I introduced a new UVM_PLA_USERESERVE flag.  Note that those
flags are now for uvm_pmr_getpages(), so the comment is updated to reflect that.

Careful reviewers will spot an off-by-one change in the check for 
pagedaemon's reserved memory.  My understanding is that it's a bugfix,
is it correct?

I also started to document the locking for some of the `uvmexp' fields.
I'm well aware that reading `uvmexp.free' is done without the correct
lock in the pagedaemon, this will be addressed soon.  The other accesses
shouldn't matter much as they are not used to make decisions.

This should hopefully be good enough to make uvm_pagealloc() completely
mp-safe, something that matters much because it's called from pmap_enter(9)
on some architectures.

ok?
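
On the missed-wakeup half of the argument, a userland analogue (illustrative
only, all names are mine): the predicate a sleeper depends on must be tested
under the same lock the waker holds while changing it, which is what moving
the checks inside the fpageqlock critical section achieves.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* ~ fpageqlock */
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int nfree;		/* ~ uvmexp.free, only changed under `lock' */

void
wait_for_pages(int want)
{
	pthread_mutex_lock(&lock);	/* test and sleep under the lock ... */
	while (nfree < want)
		pthread_cond_wait(&cv, &lock);
	nfree -= want;
	pthread_mutex_unlock(&lock);
}

void
release_pages(int n)
{
	pthread_mutex_lock(&lock);	/* ... and update + wake under it too */
	nfree += n;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lock);
}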

Index: uvm/uvm_extern.h
===
RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_extern.h
--- uvm/uvm_extern.h19 Oct 2020 08:19:46 -  1.154
+++ uvm/uvm_extern.h13 Nov 2020 14:05:08 -
@@ -142,14 +142,15 @@ typedef int   vm_prot_t;
#define UVM_PGA_ZERO   0x0002  /* returned page must be zeroed 
*/
 
 /*
- * flags for uvm_pglistalloc()
+ * flags for uvm_pmr_getpages()
  */
 #define UVM_PLA_WAITOK 0x0001  /* may sleep */
 #define UVM_PLA_NOWAIT 0x0002  /* can't sleep (need one of the two) */
 #define UVM_PLA_ZERO   0x0004  /* zero all pages before returning */
 #define UVM_PLA_TRYCONTIG  0x0008  /* try to allocate contig physmem */
 #define UVM_PLA_FAILOK 0x0010  /* caller can handle failure */
-#define UVM_PLA_NOWAKE 0x0020  /* don't wake the page daemon on 
failure */
+#define UVM_PLA_NOWAKE 0x0020  /* don't wake page daemon on failure */
+#define UVM_PLA_USERESERVE 0x0040  /* can allocate from kernel reserve */
 
 /*
  * lockflags that control the locking behavior of various functions.
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_page.c
--- uvm/uvm_page.c  22 Sep 2020 14:31:08 -  1.150
+++ uvm/uvm_page.c  13 Nov 2020 13:52:25 -
@@ -733,32 +733,11 @@ uvm_pglistalloc(psize_t size, paddr_t lo
size = atop(round_page(size));
 
/*
-* check to see if we need to generate some free pages waking
-* the pagedaemon.
-*/
-   if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freemin ||
-   ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg &&
-   (uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg))
-   wakeup(&uvm.pagedaemon);
-
-   /*
 * XXX uvm_pglistalloc is currently only used for kernel
 * objects. Unlike the checks in uvm_pagealloc, below, here
-* we are always allowed to use the kernel reserve. However, we
-* have to enforce the pagedaemon reserve here or allocations
-* via this path could consume everything and we can't
-* recover in the page daemon.
+* we are always allowed to use the kernel reserve.
 */
- again:
-   if ((uvmexp.free <= uvmexp.reserve_pagedaemon + size &&
-   !((curproc == uvm.pagedaemon_proc) ||
-   (curproc == syncerproc {
-   if (flags & UVM_PLA_WAITOK) {
-   uvm_wait("uvm_pglistalloc");
-   goto again;
-   }
-   return (ENOMEM);
-   }
+   flags |= UVM_PLA_USERESERVE;
 
if ((high & PAGE_MASK) != PAGE_MASK) {
printf("uvm_pglistalloc: Upper boundary 0x%lx "
@@ -871,7 +850,7 @@ uvm_pagerealloc_multi(struct uvm_object 
 }
 
 /*
- * uvm_pagealloc_strat: allocate vm_page from a particular free list.
+ * uvm_pagealloc: allocate vm_page from a particular free list.
  *
  * => return null if no pages free
  * => wake up pagedaemon if number of free pages drops below low water mark
@@ -886,37 +865,16 @@ uvm_pagealloc(struct uvm_object *obj, vo
struct vm_page *pg;
struct pglist pgl;
int pmr_flags;
-   boolean_t use_reserve;
 
KASSERT(obj == NULL || anon == NULL);
+   KASSERT(anon == NULL || off == 0);
KASSERT(off == trunc_page(off));
 
-   /*
-* check to see if we need to generate some free pages waking
-* the pagedaemon.
-*/
-   if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freemin 

Re: uvm_fault: is there an anon?

2020-11-13 Thread Martin Pieuchot
On 04/11/20(Wed) 11:04, Martin Pieuchot wrote:
> Diff below introduces a helper that looks for existing mapping.  The
> value returned by this lookup function determines if there's an anon
> at the faulting address which tells us if we're dealing with a fault
> of type 1 or 2.
> 
> This small refactoring is part of the current work to separate the code
> handling faults of type 1 and 2.  The end goal being to move the type 1
> faults handling out of the KERNEL_LOCK().
> 
> The function name is taken from NetBSD to not introduce more difference
> than there's already.

New diff now that the other one has been committed, ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.105
diff -u -p -r1.105 uvm_fault.c
--- uvm/uvm_fault.c 13 Nov 2020 11:16:08 -  1.105
+++ uvm/uvm_fault.c 13 Nov 2020 11:17:52 -
@@ -801,6 +801,84 @@ uvm_fault_upper(struct uvm_faultinfo *uf
return 0;
 }
 
+
+/*
+ * uvm_fault_upper_lookup: look up existing h/w mapping and amap.
+ *
+ * iterate range of interest:
+ *  1. check if h/w mapping exists.  if yes, we don't care
+ *  2. check if anon exists.  if not, page is lower.
+ *  3. if anon exists, enter h/w mapping for neighbors.
+ */
+boolean_t
+uvm_fault_upper_lookup(struct uvm_faultinfo *ufi,
+const struct uvm_faultctx *flt, struct vm_anon **anons,
+struct vm_page **pages)
+{
+   struct vm_amap *amap = ufi->entry->aref.ar_amap;
+   struct vm_anon *anon;
+   boolean_t shadowed;
+   vaddr_t currva;
+   paddr_t pa;
+   int lcv;
+
+   /*
+* map in the backpages and frontpages we found in the amap in hopes
+* of preventing future faults.we also init the pages[] array as
+* we go.
+*/
+   currva = flt->startva;
+   shadowed = FALSE;
+   for (lcv = 0 ; lcv < flt->npages ; lcv++, currva += PAGE_SIZE) {
+   /*
+* dont play with VAs that are already mapped
+* except for center)
+*/
+   if (lcv != flt->centeridx &&
+   pmap_extract(ufi->orig_map->pmap, currva, )) {
+   pages[lcv] = PGO_DONTCARE;
+   continue;
+   }
+
+   /* unmapped or center page. check if any anon at this level. */
+   if (amap == NULL || anons[lcv] == NULL) {
+   pages[lcv] = NULL;
+   continue;
+   }
+
+   /* check for present page and map if possible. re-activate it */
+   pages[lcv] = PGO_DONTCARE;
+   if (lcv == flt->centeridx) {/* save center for later! */
+   shadowed = TRUE;
+   continue;
+   }
+   anon = anons[lcv];
+   if (anon->an_page &&
+   (anon->an_page->pg_flags & (PG_RELEASED|PG_BUSY)) == 0) {
+   uvm_lock_pageq();
+   uvm_pageactivate(anon->an_page);/* reactivate */
+   uvm_unlock_pageq();
+   uvmexp.fltnamap++;
+
+   /*
+* Since this isn't the page that's actually faulting,
+* ignore pmap_enter() failures; it's not critical
+* that we enter these right now.
+*/
+   (void) pmap_enter(ufi->orig_map->pmap, currva,
+   VM_PAGE_TO_PHYS(anon->an_page) | flt->pa_flags,
+   (anon->an_ref > 1) ?
+   (flt->enter_prot & ~PROT_WRITE) : flt->enter_prot,
+   PMAP_CANFAIL |
+(VM_MAPENT_ISWIRED(ufi->entry) ? PMAP_WIRED : 0));
+   }
+   }
+   if (flt->npages > 1)
+   pmap_update(ufi->orig_map->pmap);
+
+   return shadowed;
+}
+
 /*
  *   F A U L T   -   m a i n   e n t r y   p o i n t
  */
@@ -827,7 +905,6 @@ uvm_fault(vm_map_t orig_map, vaddr_t vad
int result, lcv, gotpages;
vaddr_t currva;
voff_t uoff;
-   paddr_t pa;
struct vm_amap *amap;
struct uvm_object *uobj;
struct vm_anon *anons_store[UVM_MAXRANGE], **anons, *anon;
@@ -868,61 +945,9 @@ ReFault:
amap = ufi.entry->aref.ar_amap;
uobj = ufi.entry->object.uvm_obj;
 
-   /*
-* map in the backpages and frontpages we found in the amap in hopes
-* of preventing future faults.we also init the pages[] array as
-* we go.
-*/
-   currva = flt.startva;
-   shadowed = FALSE;
-   for (lcv = 0 ; lcv < flt.npages ; lcv++, currva += PAGE_SIZE) {
-   /*
-   

Document art locking fields

2020-11-11 Thread Martin Pieuchot
While discussing the new source address mechanism with denis@, I figured
those ought to be documented.

Note that `ar_rtableid' is unused and can die.  The ART code is actually
free from any network knowledge.

ok?

Index: net/art.c
===
RCS file: /cvs/src/sys/net/art.c,v
retrieving revision 1.28
diff -u -p -r1.28 art.c
--- net/art.c   31 Mar 2019 19:29:27 -  1.28
+++ net/art.c   9 Nov 2020 19:52:48 -
@@ -115,7 +115,6 @@ art_alloc(unsigned int rtableid, unsigne
}
 
ar->ar_off = off;
-   ar->ar_rtableid = rtableid;
rw_init(>ar_lock, "art");
 
return (ar);
Index: net/art.h
===
RCS file: /cvs/src/sys/net/art.h,v
retrieving revision 1.19
diff -u -p -r1.19 art.h
--- net/art.h   29 Oct 2020 21:15:27 -  1.19
+++ net/art.h   9 Nov 2020 19:52:42 -
@@ -27,16 +27,22 @@
 
 /*
  * Root of the ART tables, equivalent to the radix head.
+ *
+ *  Locks used to protect struct members in this file:
+ * I   immutable after creation
+ * l   root's `ar_lock'
+ * K   kernel lock
+ *  For SRP related structures that allow lock-free reads, the write lock
+ *  is indicated below.
  */
 struct art_root {
-   struct srp   ar_root;   /* First table */
-   struct rwlockar_lock;   /* Serialise modifications */
-   uint8_t  ar_bits[ART_MAXLVL];   /* Per level stride */
-   uint8_t  ar_nlvl;   /* Number of levels */
-   uint8_t  ar_alen;   /* Address length in bits */
-   uint8_t  ar_off;/* Offset of the key in bytes */
-   unsigned int ar_rtableid;   /* ID of this routing table */
-   struct sockaddr *source;/* optional src addr to use */
+   struct srp   ar_root;   /* [l] First table */
+   struct rwlockar_lock;   /* [] Serialise modifications */
+   uint8_t  ar_bits[ART_MAXLVL]; /* [I] Per level stride */
+   uint8_t  ar_nlvl;   /* [I] Number of levels */
+   uint8_t  ar_alen;   /* [I] Address length in bits */
+   uint8_t  ar_off;/* [I] Offset of key in bytes */
+   struct sockaddr *source;/* [K] optional src addr to use 
*/
 };
 
 #define ISLEAF(e)  (((unsigned long)(e) & 1) == 0)



Re: Use selected source IP when replying to reflecting ICMP

2020-11-08 Thread Martin Pieuchot
On 08/11/20(Sun) 18:05, Denis Fondras wrote:
> ICMP error replies are sent from the IP of the interface the packet came in
> even when the source IP was forced with route(8).

icmp_reflect() is called without the KERNEL_LOCK().  rtable_getsource()
and ifa_ifwithaddr() are not safe to call without it.

So it would be wise to turn this interface mp-safe first.
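
To sketch the concern (this is not a proposed diff): until those two
functions become mp-safe, the source selection would have to run under
the kernel lock, along these lines:

	struct sockaddr *ip4_source;

	KERNEL_LOCK();
	ip4_source = rtable_getsource(rtableid, AF_INET);
	if (ip4_source != NULL) {
		struct ifaddr *ifa = ifa_ifwithaddr(ip4_source, rtableid);

		if (ifa != NULL && ISSET(ifa->ifa_ifp->if_flags, IFF_UP))
			src = satosin(ip4_source)->sin_addr;
	}
	KERNEL_UNLOCK();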

> Index: netinet/ip_icmp.c
> ===
> RCS file: /cvs/src/sys/netinet/ip_icmp.c,v
> retrieving revision 1.183
> diff -u -p -r1.183 ip_icmp.c
> --- netinet/ip_icmp.c 22 Aug 2020 17:55:54 -  1.183
> +++ netinet/ip_icmp.c 8 Nov 2020 16:48:15 -
> @@ -689,6 +689,8 @@ icmp_reflect(struct mbuf *m, struct mbuf
>   struct mbuf *opts = NULL;
>   struct sockaddr_in sin;
>   struct rtentry *rt = NULL;
> + struct sockaddr *ip4_source = NULL;
> + struct in_addr src;
>   int optlen = (ip->ip_hl << 2) - sizeof(struct ip);
>   u_int rtableid;
>  
> @@ -707,6 +709,7 @@ icmp_reflect(struct mbuf *m, struct mbuf
>   m_resethdr(m);
>   m->m_pkthdr.ph_rtableid = rtableid;
>  
> + memset(, 0, sizeof(struct in_addr));
>   /*
>* If the incoming packet was addressed directly to us,
>* use dst as the src for the reply.  For broadcast, use
> @@ -721,7 +724,7 @@ icmp_reflect(struct mbuf *m, struct mbuf
>   rt = rtalloc(sintosa(), 0, rtableid);
>   if (rtisvalid(rt) &&
>   ISSET(rt->rt_flags, RTF_LOCAL|RTF_BROADCAST))
> - ia = ifatoia(rt->rt_ifa);
> + src = ifatoia(rt->rt_ifa)->ia_addr.sin_addr;
>   }
>  
>   /*
> @@ -729,7 +732,7 @@ icmp_reflect(struct mbuf *m, struct mbuf
>* Use the new source address and do a route lookup. If it fails
>* drop the packet as there is no path to the host.
>*/
> - if (ia == NULL) {
> + if (src.s_addr == 0) {
>   rtfree(rt);
>  
>   memset(, 0, sizeof(sin));
> @@ -745,14 +748,23 @@ icmp_reflect(struct mbuf *m, struct mbuf
>   return (EHOSTUNREACH);
>   }
>  
> - ia = ifatoia(rt->rt_ifa);
> + ip4_source = rtable_getsource(rtableid, AF_INET);
> + if (ip4_source != NULL) {
> + struct ifaddr *ifa;
> + if ((ifa = ifa_ifwithaddr(ip4_source, rtableid)) !=
> + NULL && ISSET(ifa->ifa_ifp->if_flags, IFF_UP)) {
> + src = satosin(ip4_source)->sin_addr;
> + }
> + }
> + if (src.s_addr == 0)
> + src = ifatoia(rt->rt_ifa)->ia_addr.sin_addr;
>   }
>  
>   ip->ip_dst = ip->ip_src;
>   ip->ip_ttl = MAXTTL;
>  
>   /* It is safe to dereference ``ia'' iff ``rt'' is valid. */
> - ip->ip_src = ia->ia_addr.sin_addr;
> + ip->ip_src = src;
>   rtfree(rt);
>  
>   if (optlen > 0) {
> Index: netinet6/icmp6.c
> ===
> RCS file: /cvs/src/sys/netinet6/icmp6.c,v
> retrieving revision 1.233
> diff -u -p -r1.233 icmp6.c
> --- netinet6/icmp6.c  28 Oct 2020 17:27:35 -  1.233
> +++ netinet6/icmp6.c  8 Nov 2020 16:48:15 -
> @@ -1146,6 +1146,7 @@ icmp6_reflect(struct mbuf **mp, size_t o
>  
>   if (src == NULL) {
>   struct in6_ifaddr *ia6;
> + struct sockaddr *ip6_source = NULL;
>  
>   /*
>* This case matches to multicasts, our anycast, or unicasts
> @@ -1164,7 +1165,15 @@ icmp6_reflect(struct mbuf **mp, size_t o
>   goto bad;
>   }
>   ia6 = in6_ifawithscope(rt->rt_ifa->ifa_ifp, , rtableid);
> - if (ia6 != NULL)
> + ip6_source = rtable_getsource(rtableid, AF_INET6);
> + if (ip6_source != NULL) {
> + struct ifaddr *ifa;
> + if ((ifa = ifa_ifwithaddr(ip6_source, rtableid)) !=
> + NULL && ISSET(ifa->ifa_ifp->if_flags, IFF_UP)) {
> + src = (ip6_source)->sin6_addr;
> + }
> + }
> + if (src == NULL && ia6 != NULL)
>   src = >ia_addr.sin6_addr;
>   if (src == NULL)
>   src = (rt->rt_ifa)->ia_addr.sin6_addr;
> 



sendsig() & sigexit()

2020-11-06 Thread Martin Pieuchot
Diff below moves the various sigexit() calls from all the MD sendsig()
implementations to the MI trapsignal().  Apart from the obvious code
simplification, this will help with locking, as sigexit() does not return.
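
A minimal sketch of the MI side this enables; the argument names are
illustrative, only the int return value matches the diff below:

	/* in trapsignal() */
	if (sendsig(action, signum, mask, &si) != 0) {
		/*
		 * Process has trashed its stack; give it an illegal
		 * instruction to halt it in its tracks.
		 */
		sigexit(p, SIGILL);
		/* NOTREACHED */
	}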

ok?

Index: arch/alpha/alpha/machdep.c
===
RCS file: /cvs/src/sys/arch/alpha/alpha/machdep.c,v
retrieving revision 1.193
diff -u -p -r1.193 machdep.c
--- arch/alpha/alpha/machdep.c  26 Aug 2020 03:29:05 -  1.193
+++ arch/alpha/alpha/machdep.c  15 Sep 2020 08:34:45 -
@@ -1381,7 +1381,7 @@ regdump(framep)
 /*
  * Send an interrupt to process.
  */
-void
+int
 sendsig(sig_t catcher, int sig, sigset_t mask, const siginfo_t *ksip)
 {
struct proc *p = curproc;
@@ -1443,20 +1443,13 @@ sendsig(sig_t catcher, int sig, sigset_t
if (psp->ps_siginfo & sigmask(sig)) {
sip = (void *)scp + kscsize;
if (copyout(ksip, (caddr_t)sip, fsize - kscsize) != 0)
-   goto trash;
+   return 1;
} else
sip = NULL;
 
ksc.sc_cookie = (long)scp ^ p->p_p->ps_sigcookie;
-   if (copyout((caddr_t), (caddr_t)scp, kscsize) != 0) {
-trash:
-   /*
-* Process has trashed its stack; give it an illegal
-* instruction to halt it in its tracks.
-*/
-   sigexit(p, SIGILL);
-   /* NOTREACHED */
-   }
+   if (copyout((caddr_t), (caddr_t)scp, kscsize) != 0)
+   return 1;
 
/*
 * Set up the registers to return to sigcode.
@@ -1467,6 +1460,8 @@ trash:
frame->tf_regs[FRAME_A2] = (u_int64_t)scp;
frame->tf_regs[FRAME_T12] = (u_int64_t)catcher; /* t12 is pv */
alpha_pal_wrusp((unsigned long)scp);
+
+   return 0;
 }
 
 /*
Index: arch/amd64/amd64/machdep.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/machdep.c,v
retrieving revision 1.269
diff -u -p -r1.269 machdep.c
--- arch/amd64/amd64/machdep.c  20 Aug 2020 15:12:35 -  1.269
+++ arch/amd64/amd64/machdep.c  15 Sep 2020 08:35:30 -
@@ -566,7 +566,7 @@ cpu_sysctl(int *name, u_int namelen, voi
  * signal mask, the stack, and the frame pointer, it returns to the
  * user specified pc.
  */
-void
+int
 sendsig(sig_t catcher, int sig, sigset_t mask, const siginfo_t *ksip)
 {
struct proc *p = curproc;
@@ -618,7 +618,7 @@ sendsig(sig_t catcher, int sig, sigset_t
sp -= fpu_save_len;
ksc.sc_fpstate = (struct fxsave64 *)sp;
if (copyout(sfp, (void *)sp, fpu_save_len))
-   sigexit(p, SIGILL);
+   return 1;
 
/* Now reset the FPU state in PCB */
memcpy(>p_addr->u_pcb.pcb_savefpu,
@@ -630,13 +630,13 @@ sendsig(sig_t catcher, int sig, sigset_t
sss += (sizeof(*ksip) + 15) & ~15;
 
if (copyout(ksip, (void *)sip, sizeof(*ksip)))
-   sigexit(p, SIGILL);
+   return 1;
}
scp = sp - sss;
 
ksc.sc_cookie = (long)scp ^ p->p_p->ps_sigcookie;
if (copyout(, (void *)scp, sizeof(ksc)))
-   sigexit(p, SIGILL);
+   return 1;
 
/*
 * Build context to run handler in.
@@ -654,6 +654,8 @@ sendsig(sig_t catcher, int sig, sigset_t
 
/* The reset state _is_ the userspace state for this thread now */
curcpu()->ci_flags |= CPUF_USERXSTATE;
+
+   return 0;
 }
 
 /*
Index: arch/arm/arm/sig_machdep.c
===
RCS file: /cvs/src/sys/arch/arm/arm/sig_machdep.c,v
retrieving revision 1.18
diff -u -p -r1.18 sig_machdep.c
--- arch/arm/arm/sig_machdep.c  10 Jul 2018 04:19:59 -  1.18
+++ arch/arm/arm/sig_machdep.c  15 Sep 2020 08:36:11 -
@@ -74,7 +74,7 @@ process_frame(struct proc *p)
  * signal mask, the stack, and the frame pointer, it returns to the
  * user specified pc.
  */
-void
+int
 sendsig(sig_t catcher, int sig, sigset_t mask, const siginfo_t *ksip)
 {
struct proc *p = curproc;
@@ -145,14 +145,8 @@ sendsig(sig_t catcher, int sig, sigset_t
}
 
frame.sf_sc.sc_cookie = (long)>sf_sc ^ p->p_p->ps_sigcookie;
-   if (copyout(, fp, sizeof(frame)) != 0) {
-   /*
-* Process has trashed its stack; give it an illegal
-* instruction to halt it in its tracks.
-*/
-   sigexit(p, SIGILL);
-   /* NOTREACHED */
-   }
+   if (copyout(, fp, sizeof(frame)) != 0)
+   return 1;
 
/*
 * Build context to run handler in.  We invoke the handler
@@ -163,8 +157,10 @@ sendsig(sig_t catcher, int sig, sigset_t
tf->tf_r2 = (register_t)frame.sf_scp;
tf->tf_pc = (register_t)frame.sf_handler;
tf->tf_usr_sp = (register_t)fp;
-   
+
tf->tf_usr_lr = p->p_p->ps_sigcode;
+
+   return 0;
 }
 
 /*

uvm_fault: split out handling of case 1

2020-11-06 Thread Martin Pieuchot
Diff below moves the logic dealing with faults of case 1A & 1B to its
own function.

With this, the logic in uvm_fault() now only deals with case 2 and the
various if/else/goto dances can be simplified.

As for the previous refactoring diffs, the name is taken from NetBSD
but the implementation is left mostly untouched to ease reviews. 

Upcoming locking will assert that the given amap and anon are sharing
the same lock and that it is held in this function.
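
As a sketch only, the assertion could look like this; the lock member
names are assumptions and are not part of this diff:

	KASSERT(amap != NULL);
	KASSERT(anon->an_lock == amap->am_lock);
	rw_assert_wrlock(anon->an_lock);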

ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.104
diff -u -p -r1.104 uvm_fault.c
--- uvm/uvm_fault.c 6 Nov 2020 11:52:39 -   1.104
+++ uvm/uvm_fault.c 6 Nov 2020 12:30:01 -
@@ -636,6 +636,173 @@ uvm_fault_check(struct uvm_faultinfo *uf
 }
 
 /*
+ * uvm_fault_upper: handle upper fault (case 1A & 1B)
+ *
+ * 1. acquire anon lock.
+ * 2. get anon.  let uvmfault_anonget do the dirty work.
+ * 3. if COW, promote data to new anon
+ * 4. enter h/w mapping
+ */
+int
+uvm_fault_upper(struct uvm_faultinfo *ufi, struct uvm_faultctx *flt,
+   struct vm_anon **anons, vm_fault_t fault_type, vm_prot_t access_type)
+{
+   struct vm_amap *amap = ufi->entry->aref.ar_amap;
+   struct vm_anon *oanon, *anon = anons[flt->centeridx];
+   struct vm_page *pg = NULL;
+   int error, ret;
+
+   /*
+* no matter if we have case 1A or case 1B we are going to need to
+* have the anon's memory resident.   ensure that now.
+*/
+   /*
+* let uvmfault_anonget do the dirty work.
+* also, if it is OK, then the anon's page is on the queues.
+*/
+   error = uvmfault_anonget(ufi, amap, anon);
+   switch (error) {
+   case VM_PAGER_OK:
+   break;
+
+   case VM_PAGER_REFAULT:
+   return ERESTART;
+
+   case VM_PAGER_ERROR:
+   /*
+* An error occured while trying to bring in the
+* page -- this is the only error we return right
+* now.
+*/
+   return EACCES;  /* XXX */
+   default:
+#ifdef DIAGNOSTIC
+   panic("uvm_fault: uvmfault_anonget -> %d", error);
+#else
+   return EACCES;
+#endif
+   }
+
+   /*
+* if we are case 1B then we will need to allocate a new blank
+* anon to transfer the data into.   note that we have a lock
+* on anon, so no one can busy or release the page until we are done.
+* also note that the ref count can't drop to zero here because
+* it is > 1 and we are only dropping one ref.
+*
+* in the (hopefully very rare) case that we are out of RAM we
+* will wait for more RAM, and refault.
+*
+* if we are out of anon VM we wait for RAM to become available.
+*/
+
+   if ((access_type & PROT_WRITE) != 0 && anon->an_ref > 1) {
+   uvmexp.flt_acow++;
+   oanon = anon;   /* oanon = old */
+   anon = uvm_analloc();
+   if (anon) {
+   pg = uvm_pagealloc(NULL, 0, anon, 0);
+   }
+
+   /* check for out of RAM */
+   if (anon == NULL || pg == NULL) {
+   uvmfault_unlockall(ufi, amap, NULL);
+   if (anon == NULL)
+   uvmexp.fltnoanon++;
+   else {
+   uvm_anfree(anon);
+   uvmexp.fltnoram++;
+   }
+
+   if (uvm_swapisfull())
+   return ENOMEM;
+
+   /* out of RAM, wait for more */
+   if (anon == NULL)
+   uvm_anwait();
+   else
+   uvm_wait("flt_noram3");
+   return ERESTART;
+   }
+
+   /* got all resources, replace anon with nanon */
+   uvm_pagecopy(oanon->an_page, pg);   /* pg now !PG_CLEAN */
+   /* un-busy! new page */
+   atomic_clearbits_int(>pg_flags, PG_BUSY|PG_FAKE);
+   UVM_PAGE_OWN(pg, NULL);
+   ret = amap_add(>entry->aref,
+   ufi->orig_rvaddr - ufi->entry->start, anon, 1);
+   KASSERT(ret == 0);
+
+   /* deref: can not drop to zero here by defn! */
+   oanon->an_ref--;
+
+   /*
+* note: anon is _not_ locked, but we have the sole references
+* to in from amap.
+* thus, no one can get at it until we are done with it.
+*/
+   } else {
+   uvmexp.flt_anon++;
+   oanon = anon;
+   pg = anon->an_page;
+   if (anon->an_ref > 1) /* disallow writes to ref > 1 anons */
+  

uvmfault_unlockall: kill unused anon argument

2020-11-05 Thread Martin Pieuchot
One of the function calls in uvm_fault() passes an uninitialized `oanon'
argument.  This bug is harmless as long as there is no locking associated
with amaps & anons.  More importantly, an `amap' is already passed to the
function and any given anon should share its lock, so this parameter is
redundant.
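
The resulting prototype would then be (sketch):

	void	uvmfault_unlockall(struct uvm_faultinfo *, struct vm_amap *,
		    struct uvm_object *);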

ok to kill it?

Index: dev/pci/drm/drm_gem.c
===
RCS file: /cvs/src/sys/dev/pci/drm/drm_gem.c,v
retrieving revision 1.13
diff -u -p -r1.13 drm_gem.c
--- dev/pci/drm/drm_gem.c   21 Oct 2020 09:08:14 -  1.13
+++ dev/pci/drm/drm_gem.c   5 Nov 2020 12:13:32 -
@@ -101,7 +101,7 @@ drm_fault(struct uvm_faultinfo *ufi, vad
 */

if (UVM_ET_ISCOPYONWRITE(entry)) {
-   uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj, NULL);
+   uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj);
return(VM_PAGER_ERROR);
}
 
@@ -115,7 +115,7 @@ drm_fault(struct uvm_faultinfo *ufi, vad
mtx_enter(>quiesce_mtx);
if (dev->quiesce && dev->quiesce_count == 0) {
mtx_leave(>quiesce_mtx);
-   uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj, NULL);
+   uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj);
mtx_enter(>quiesce_mtx);
while (dev->quiesce) {
msleep_nsec(>quiesce, >quiesce_mtx,
Index: dev/pci/drm/ttm/ttm_bo_vm.c
===
RCS file: /cvs/src/sys/dev/pci/drm/ttm/ttm_bo_vm.c,v
retrieving revision 1.23
diff -u -p -r1.23 ttm_bo_vm.c
--- dev/pci/drm/ttm/ttm_bo_vm.c 21 Oct 2020 09:08:14 -  1.23
+++ dev/pci/drm/ttm/ttm_bo_vm.c 5 Nov 2020 12:12:49 -
@@ -750,7 +750,7 @@ ttm_bo_vm_fault(struct uvm_faultinfo *uf
break;
}
 
-   uvmfault_unlockall(ufi, NULL, NULL, NULL);
+   uvmfault_unlockall(ufi, NULL, NULL);
return ret;
}
 
@@ -769,7 +769,7 @@ ttm_bo_vm_fault(struct uvm_faultinfo *uf
 
dma_resv_unlock(bo->base.resv);
 
-   uvmfault_unlockall(ufi, NULL, NULL, NULL);
+   uvmfault_unlockall(ufi, NULL, NULL);
return ret;
 }
 EXPORT_SYMBOL(ttm_bo_vm_fault);
Index: dev/pci/drm/i915/gem/i915_gem_mman.c
===
RCS file: /cvs/src/sys/dev/pci/drm/i915/gem/i915_gem_mman.c,v
retrieving revision 1.2
diff -u -p -r1.2 i915_gem_mman.c
--- dev/pci/drm/i915/gem/i915_gem_mman.c21 Oct 2020 02:16:53 -  
1.2
+++ dev/pci/drm/i915/gem/i915_gem_mman.c5 Nov 2020 12:04:58 -
@@ -473,7 +473,7 @@ vm_fault_cpu(struct i915_mmap_offset *mm
 
/* Sanity check that we allow writing into this object */
if (unlikely(i915_gem_object_is_readonly(obj) && write)) {
-   uvmfault_unlockall(ufi, NULL, >base.uobj, NULL);
+   uvmfault_unlockall(ufi, NULL, >base.uobj);
return VM_PAGER_BAD;
}
 
@@ -518,7 +518,7 @@ vm_fault_cpu(struct i915_mmap_offset *mm
i915_gem_object_unpin_pages(obj);
 
 out:
-   uvmfault_unlockall(ufi, NULL, >base.uobj, NULL);
+   uvmfault_unlockall(ufi, NULL, >base.uobj);
return i915_error_to_vmf_fault(err);
 }
 
@@ -559,7 +559,7 @@ vm_fault_gtt(struct i915_mmap_offset *mm
 
/* Sanity check that we allow writing into this object */
if (i915_gem_object_is_readonly(obj) && write) {
-   uvmfault_unlockall(ufi, NULL, >base.uobj, NULL);
+   uvmfault_unlockall(ufi, NULL, >base.uobj);
return VM_PAGER_BAD;
}
 
@@ -664,7 +664,7 @@ err_rpm:
intel_runtime_pm_put(rpm, wakeref);
i915_gem_object_unpin_pages(obj);
 err:
-   uvmfault_unlockall(ufi, NULL, >base.uobj, NULL);
+   uvmfault_unlockall(ufi, NULL, >base.uobj);
return i915_error_to_vmf_fault(ret);
 }
 
@@ -687,7 +687,7 @@ i915_gem_fault(struct drm_gem_object *ge
mmo = container_of(node, struct i915_mmap_offset, vma_node);
drm_vma_offset_unlock_lookup(dev->vma_offset_manager);
if (!mmo) {
-   uvmfault_unlockall(ufi, NULL, _obj->uobj, NULL);
+   uvmfault_unlockall(ufi, NULL, _obj->uobj);
return VM_PAGER_BAD;
}
 
Index: uvm/uvm_device.c
===
RCS file: /cvs/src/sys/uvm/uvm_device.c,v
retrieving revision 1.59
diff -u -p -r1.59 uvm_device.c
--- uvm/uvm_device.c24 Oct 2020 21:07:53 -  1.59
+++ uvm/uvm_device.c5 Nov 2020 12:14:24 -
@@ -306,7 +306,7 @@ udv_fault(struct uvm_faultinfo *ufi, vad
 * so we kill any attempt to do so here.
 */
if (UVM_ET_ISCOPYONWRITE(entry)) {
-   uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj, NULL);
+   uvmfault_unlockall(ufi, ufi->entry->aref.ar_amap, uobj);

Prevent race in single_thread_set()

2020-11-04 Thread Martin Pieuchot
Here's a 3rd approach to solve the TOCTOU race in single_thread_set().
The issue being that the lock serializing access to `ps_single' is not
held when calling single_thread_check().

The approach below is controversial because it extends the scope of the
SCHED_LOCK().  On the other hand, the two others approaches that both
add a new lock to avoid this race ignore the fact that accesses to
`ps_single' are currently not clearly serialized w/o KERNEL_LOCK().

So the diff below improves the situation in that regard and do not add
more complexity due to the use of multiple locks.  After having looked
for a way to split the SCHED_LOCK() I believe this is the simplest
approach.

I deliberately used a *_locked() function to avoid grabbing the lock
recursively as I'm trying to get rid of the recursion, see the other
thread on tech@.

That said the uses of `ps_single' in ptrace_ctrl() are not covered by
this diff and I'd be glad to hear some comments about them.  This is
fine as long as all the code using `ps_single' runs under KERNEL_LOCK()
but since we're trying to get the single_thread_* API out of it, this
need to be addressed.

Note that this diff introduces a helper for initializing ps_single*
values in order to keep all the accesses of those fields in the same
file.

ok?

Index: kern/kern_fork.c
===
RCS file: /cvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.226
diff -u -p -r1.226 kern_fork.c
--- kern/kern_fork.c25 Oct 2020 01:55:18 -  1.226
+++ kern/kern_fork.c4 Nov 2020 12:52:54 -
@@ -563,10 +563,7 @@ thread_fork(struct proc *curp, void *sta
 * if somebody else wants to take us to single threaded mode,
 * count ourselves in.
 */
-   if (pr->ps_single) {
-   atomic_inc_int(>ps_singlecount);
-   atomic_setbits_int(>p_flag, P_SUSPSINGLE);
-   }
+   single_thread_init(p);
 
/*
 * Return tid to parent thread and copy it out to userspace
Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.263
diff -u -p -r1.263 kern_sig.c
--- kern/kern_sig.c 16 Sep 2020 13:50:42 -  1.263
+++ kern/kern_sig.c 4 Nov 2020 12:38:35 -
@@ -1932,11 +1932,27 @@ userret(struct proc *p)
p->p_cpu->ci_schedstate.spc_curpriority = p->p_usrpri;
 }
 
+void
+single_thread_init(struct proc *p)
+{
+   struct process *pr = p->p_p;
+   int s;
+
+   SCHED_LOCK(s);
+   if (pr->ps_single) {
+   atomic_inc_int(>ps_singlecount);
+   atomic_setbits_int(>p_flag, P_SUSPSINGLE);
+   }
+   SCHED_UNLOCK(s);
+}
+
 int
-single_thread_check(struct proc *p, int deep)
+_single_thread_check_locked(struct proc *p, int deep)
 {
struct process *pr = p->p_p;
 
+   SCHED_ASSERT_LOCKED();
+
if (pr->ps_single != NULL && pr->ps_single != p) {
do {
int s;
@@ -1949,14 +1965,12 @@ single_thread_check(struct proc *p, int 
return (EINTR);
}
 
-   SCHED_LOCK(s);
-   if (pr->ps_single == NULL) {
-   SCHED_UNLOCK(s);
+   if (pr->ps_single == NULL)
continue;
-   }
 
if (atomic_dec_int_nv(>ps_singlecount) == 0)
wakeup(>ps_singlecount);
+
if (pr->ps_flags & PS_SINGLEEXIT) {
SCHED_UNLOCK(s);
KERNEL_LOCK();
@@ -1967,13 +1981,24 @@ single_thread_check(struct proc *p, int 
/* not exiting and don't need to unwind, so suspend */
p->p_stat = SSTOP;
mi_switch();
-   SCHED_UNLOCK(s);
} while (pr->ps_single != NULL);
}
 
return (0);
 }
 
+int
+single_thread_check(struct proc *p, int deep)
+{
+   int s, error;
+
+   SCHED_LOCK(s);
+   error = _single_thread_check_locked(p, deep);
+   SCHED_UNLOCK(s);
+
+   return error;
+}
+
 /*
  * Stop other threads in the process.  The mode controls how and
  * where the other threads should stop:
@@ -1995,8 +2020,12 @@ single_thread_set(struct proc *p, enum s
KERNEL_ASSERT_LOCKED();
KASSERT(curproc == p);
 
-   if ((error = single_thread_check(p, deep)))
+   SCHED_LOCK(s);
+   error = _single_thread_check_locked(p, deep);
+   if (error) {
+   SCHED_UNLOCK(s);
return error;
+   }
 
switch (mode) {
case SINGLE_SUSPEND:
@@ -2014,7 +2043,6 @@ single_thread_set(struct proc *p, enum s
panic("single_thread_mode = %d", mode);
 #endif
}
-   SCHED_LOCK(s);
pr->ps_singlecount = 

uvm_fault: is there an anon?

2020-11-04 Thread Martin Pieuchot
Diff below introduces a helper that looks for existing mapping.  The
value returned by this lookup function determines if there's an anon
at the faulting address which tells us if we're dealing with a fault
of type 1 or 2.

This small refactoring is part of the current work to separate the code
handling faults of type 1 and 2.  The end goal being to move the type 1
faults handling out of the KERNEL_LOCK().

The function name is taken from NetBSD to not introduce more difference
than there's already.
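
For illustration, a sketch of how the caller consumes the result; this
is abbreviated and not part of the diff:

	shadowed = uvm_fault_upper_lookup(&ufi, &flt, anons, pages);
	if (shadowed == TRUE) {
		/* case 1: fault on an anon in our amap */
	} else {
		/* case 2: fault on backing object or zero fill */
	}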

ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_fault.c
--- uvm/uvm_fault.c 21 Oct 2020 08:55:40 -  1.103
+++ uvm/uvm_fault.c 4 Nov 2020 13:57:01 -
@@ -637,6 +637,84 @@ uvm_fault_check(struct uvm_faultinfo *uf
return 0;
 }
 
+
+/*
+ * uvm_fault_upper_lookup: look up existing h/w mapping and amap.
+ *
+ * iterate range of interest:
+ *  1. check if h/w mapping exists.  if yes, we don't care
+ *  2. check if anon exists.  if not, page is lower.
+ *  3. if anon exists, enter h/w mapping for neighbors.
+ */
+boolean_t
+uvm_fault_upper_lookup(struct uvm_faultinfo *ufi,
+const struct uvm_faultctx *flt, struct vm_anon **anons,
+struct vm_page **pages)
+{
+   struct vm_amap *amap = ufi->entry->aref.ar_amap;
+   struct vm_anon *anon;
+   boolean_t shadowed;
+   vaddr_t currva;
+   paddr_t pa;
+   int lcv;
+
+   /*
+* map in the backpages and frontpages we found in the amap in hopes
+* of preventing future faults.we also init the pages[] array as
+* we go.
+*/
+   currva = flt->startva;
+   shadowed = FALSE;
+   for (lcv = 0 ; lcv < flt->npages ; lcv++, currva += PAGE_SIZE) {
+   /*
+* dont play with VAs that are already mapped
+* except for center)
+*/
+   if (lcv != flt->centeridx &&
+   pmap_extract(ufi->orig_map->pmap, currva, )) {
+   pages[lcv] = PGO_DONTCARE;
+   continue;
+   }
+
+   /* unmapped or center page. check if any anon at this level. */
+   if (amap == NULL || anons[lcv] == NULL) {
+   pages[lcv] = NULL;
+   continue;
+   }
+
+   /* check for present page and map if possible. re-activate it */
+   pages[lcv] = PGO_DONTCARE;
+   if (lcv == flt->centeridx) {/* save center for later! */
+   shadowed = TRUE;
+   continue;
+   }
+   anon = anons[lcv];
+   if (anon->an_page &&
+   (anon->an_page->pg_flags & (PG_RELEASED|PG_BUSY)) == 0) {
+   uvm_lock_pageq();
+   uvm_pageactivate(anon->an_page);/* reactivate */
+   uvm_unlock_pageq();
+   uvmexp.fltnamap++;
+
+   /*
+* Since this isn't the page that's actually faulting,
+* ignore pmap_enter() failures; it's not critical
+* that we enter these right now.
+*/
+   (void) pmap_enter(ufi->orig_map->pmap, currva,
+   VM_PAGE_TO_PHYS(anon->an_page) | flt->pa_flags,
+   (anon->an_ref > 1) ?
+   (flt->enter_prot & ~PROT_WRITE) : flt->enter_prot,
+   PMAP_CANFAIL |
+(VM_MAPENT_ISWIRED(ufi->entry) ? PMAP_WIRED : 0));
+   }
+   }
+   if (flt->npages > 1)
+   pmap_update(ufi->orig_map->pmap);
+
+   return shadowed;
+}
+
 /*
  *   F A U L T   -   m a i n   e n t r y   p o i n t
  */
@@ -663,7 +741,6 @@ uvm_fault(vm_map_t orig_map, vaddr_t vad
int result, lcv, gotpages, ret;
vaddr_t currva;
voff_t uoff;
-   paddr_t pa;
struct vm_amap *amap;
struct uvm_object *uobj;
struct vm_anon *anons_store[UVM_MAXRANGE], **anons, *anon, *oanon;
@@ -704,61 +781,9 @@ ReFault:
amap = ufi.entry->aref.ar_amap;
uobj = ufi.entry->object.uvm_obj;
 
-   /*
-* map in the backpages and frontpages we found in the amap in hopes
-* of preventing future faults.we also init the pages[] array as
-* we go.
-*/
-   currva = flt.startva;
-   shadowed = FALSE;
-   for (lcv = 0 ; lcv < flt.npages ; lcv++, currva += PAGE_SIZE) {
-   /*
-* dont play with VAs that are already mapped
-* except for center)
-*/
-   if (lcv != flt.centeridx &&
-   pmap_extract(ufi.orig_map->pmap, currva, )) {
-   pages[lcv] = 

Turn SCHED_LOCK() into a mutex

2020-11-04 Thread Martin Pieuchot
Diff below removes the recursive attribute of the SCHED_LOCK() by
turning it into an IPL_NONE mutex.  I'm not intending to commit it
yet, as it raises multiple questions; see below.

This work was started by art@ more than a decade ago and I'm willing
to finish it, as I believe it's the easiest way to reduce the scope of
this lock.  Having a global mutex is the first step towards having
per-runqueue and per-sleepqueue mutexes.

This is also a way to avoid lock ordering problems exposed by the recent
races in single_thread_set().

About the diff:

 The diff below includes an (ugly) refactoring of rw_exit() to avoid a
 recursion on the SCHED_LOCK().  In this case the lock is used to protect
 the global sleepqueue and is grabbed in sleep_setup().

 The same pattern can be observed in single_thread_check().  However in
 this case the lock is used to protect different fields so there's no
 "recursive access" to the same data structure.

 assertwaitok() has been moved down in mi_switch() which isn't ideal.

 It becomes obvious that the per-CPU and per-thread accounting fields
 updated in mi_switch() won't need a separate mutex as proposed last
 year and that splitting this global mutex will be enough.
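
To make the direction concrete, the core of the change can be sketched
as follows; the macro details are simplified and do not match the diff
line for line:

extern struct mutex sched_lock;

#define SCHED_LOCK(s)		do { (void)(s); mtx_enter(&sched_lock); } while (0)
#define SCHED_UNLOCK(s)		do { (void)(s); mtx_leave(&sched_lock); } while (0)
#define SCHED_ASSERT_LOCKED()	MUTEX_ASSERT_LOCKED(&sched_lock)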

It's unclear to me if/how WITNESS should be modified to handle this lock
change.

This has been tested on sparc64 and amd64.  I'm not convinced it exposed
all the recursions.  So if you want to give it a go and can break it, it
is more than welcome.

Comments?  Questions?

Index: kern/kern_fork.c
===
RCS file: /cvs/src/sys/kern/kern_fork.c,v
retrieving revision 1.226
diff -u -p -r1.226 kern_fork.c
--- kern/kern_fork.c25 Oct 2020 01:55:18 -  1.226
+++ kern/kern_fork.c2 Nov 2020 10:50:24 -
@@ -665,7 +665,7 @@ void
 proc_trampoline_mp(void)
 {
SCHED_ASSERT_LOCKED();
-   __mp_unlock(_lock);
+   mtx_leave(_lock);
spl0();
SCHED_ASSERT_UNLOCKED();
KERNEL_ASSERT_UNLOCKED();
Index: kern/kern_lock.c
===
RCS file: /cvs/src/sys/kern/kern_lock.c,v
retrieving revision 1.71
diff -u -p -r1.71 kern_lock.c
--- kern/kern_lock.c5 Mar 2020 09:28:31 -   1.71
+++ kern/kern_lock.c2 Nov 2020 10:50:24 -
@@ -97,9 +97,6 @@ ___mp_lock_init(struct __mp_lock *mpl, c
if (mpl == _lock)
mpl->mpl_lock_obj.lo_flags = LO_WITNESS | LO_INITIALIZED |
LO_SLEEPABLE | (LO_CLASS_KERNEL_LOCK << LO_CLASSSHIFT);
-   else if (mpl == _lock)
-   mpl->mpl_lock_obj.lo_flags = LO_WITNESS | LO_INITIALIZED |
-   LO_RECURSABLE | (LO_CLASS_SCHED_LOCK << LO_CLASSSHIFT);
WITNESS_INIT(>mpl_lock_obj, type);
 #endif
 }
Index: kern/kern_rwlock.c
===
RCS file: /cvs/src/sys/kern/kern_rwlock.c,v
retrieving revision 1.45
diff -u -p -r1.45 kern_rwlock.c
--- kern/kern_rwlock.c  2 Mar 2020 17:07:49 -   1.45
+++ kern/kern_rwlock.c  2 Nov 2020 23:13:01 -
@@ -128,36 +128,6 @@ rw_enter_write(struct rwlock *rwl)
}
 }
 
-void
-rw_exit_read(struct rwlock *rwl)
-{
-   unsigned long owner;
-
-   rw_assert_rdlock(rwl);
-   WITNESS_UNLOCK(>rwl_lock_obj, 0);
-
-   membar_exit_before_atomic();
-   owner = rwl->rwl_owner;
-   if (__predict_false((owner & RWLOCK_WAIT) ||
-   rw_cas(>rwl_owner, owner, owner - RWLOCK_READ_INCR)))
-   rw_do_exit(rwl, 0);
-}
-
-void
-rw_exit_write(struct rwlock *rwl)
-{
-   unsigned long owner;
-
-   rw_assert_wrlock(rwl);
-   WITNESS_UNLOCK(>rwl_lock_obj, LOP_EXCLUSIVE);
-
-   membar_exit_before_atomic();
-   owner = rwl->rwl_owner;
-   if (__predict_false((owner & RWLOCK_WAIT) ||
-   rw_cas(>rwl_owner, owner, 0)))
-   rw_do_exit(rwl, RWLOCK_WRLOCK);
-}
-
 #ifdef DIAGNOSTIC
 /*
  * Put the diagnostic functions here to keep the main code free
@@ -314,9 +284,10 @@ retry:
 }
 
 void
-rw_exit(struct rwlock *rwl)
+_rw_exit(struct rwlock *rwl, int locked)
 {
unsigned long wrlock;
+   unsigned long owner, set;
 
/* Avoid deadlocks after panic or in DDB */
if (panicstr || db_active)
@@ -330,15 +301,6 @@ rw_exit(struct rwlock *rwl)
WITNESS_UNLOCK(>rwl_lock_obj, wrlock ? LOP_EXCLUSIVE : 0);
 
membar_exit_before_atomic();
-   rw_do_exit(rwl, wrlock);
-}
-
-/* membar_exit_before_atomic() has to precede call of this function. */
-void
-rw_do_exit(struct rwlock *rwl, unsigned long wrlock)
-{
-   unsigned long owner, set;
-
do {
owner = rwl->rwl_owner;
if (wrlock)
@@ -349,7 +311,13 @@ rw_do_exit(struct rwlock *rwl, unsigned 
} while (__predict_false(rw_cas(>rwl_owner, owner, set)));
 
if (owner & RWLOCK_WAIT)
-   wakeup(rwl);
+   wakeup_n(rwl, -1, locked);
+}
+
+void
+rw_exit(struct rwlock *rwl)
+{

Re: amap: introduce amap_adjref_anons()

2020-10-30 Thread Martin Pieuchot
On 23/10/20(Fri) 10:31, Martin Pieuchot wrote:
> More refactoring.  This time let's introduce a helper to manipulate
> references.  The goal is to reduce the upcoming diff adding locking.
> 
> This is extracted from a bigger diff from guenther@ as well as some
> bits from NetBSD.

Now with the correct diff, ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.85
diff -u -p -r1.85 uvm_amap.c
--- uvm/uvm_amap.c  12 Oct 2020 08:44:45 -  1.85
+++ uvm/uvm_amap.c  23 Oct 2020 08:23:59 -
@@ -68,7 +68,23 @@ static inline void amap_list_remove(stru
 
 struct vm_amap_chunk *amap_chunk_get(struct vm_amap *, int, int, int);
 void amap_chunk_free(struct vm_amap *, struct vm_amap_chunk *);
-void amap_wiperange_chunk(struct vm_amap *, struct vm_amap_chunk *, int, int);
+
+/*
+ * if we enable PPREF, then we have a couple of extra functions that
+ * we need to prototype here...
+ */
+
+#ifdef UVM_AMAP_PPREF
+
+#define PPREF_NONE ((int *) -1)/* not using ppref */
+
+void   amap_pp_adjref(struct vm_amap *, int, vsize_t, int);
+void   amap_pp_establish(struct vm_amap *);
+void   amap_wiperange_chunk(struct vm_amap *, struct vm_amap_chunk *, int,
+   int);
+void   amap_wiperange(struct vm_amap *, int, int);
+
+#endif /* UVM_AMAP_PPREF */
 
 static inline void
 amap_list_insert(struct vm_amap *amap)
@@ -1153,6 +1169,32 @@ amap_unadd(struct vm_aref *aref, vaddr_t
 }
 
 /*
+ * amap_adjref_anons: adjust the reference count(s) on amap and its anons.
+ */
+static void
+amap_adjref_anons(struct vm_amap *amap, vaddr_t offset, vsize_t len,
+int refv, boolean_t all)
+{
+#ifdef UVM_AMAP_PPREF
+   if (amap->am_ppref == NULL && !all && len != amap->am_nslot) {
+   amap_pp_establish(amap);
+   }
+#endif
+
+   amap->am_ref += refv;
+
+#ifdef UVM_AMAP_PPREF
+   if (amap->am_ppref && amap->am_ppref != PPREF_NONE) {
+   if (all) {
+   amap_pp_adjref(amap, 0, amap->am_nslot, refv);
+   } else {
+   amap_pp_adjref(amap, offset, len, refv);
+   }
+   }
+#endif
+}
+
+/*
  * amap_ref: gain a reference to an amap
  *
  * => "offset" and "len" are in units of pages
@@ -1162,51 +1204,36 @@ void
 amap_ref(struct vm_amap *amap, vaddr_t offset, vsize_t len, int flags)
 {
 
-   amap->am_ref++;
if (flags & AMAP_SHARED)
amap->am_flags |= AMAP_SHARED;
-#ifdef UVM_AMAP_PPREF
-   if (amap->am_ppref == NULL && (flags & AMAP_REFALL) == 0 &&
-   len != amap->am_nslot)
-   amap_pp_establish(amap);
-   if (amap->am_ppref && amap->am_ppref != PPREF_NONE) {
-   if (flags & AMAP_REFALL)
-   amap_pp_adjref(amap, 0, amap->am_nslot, 1);
-   else
-   amap_pp_adjref(amap, offset, len, 1);
-   }
-#endif
+   amap_adjref_anons(amap, offset, len, 1, (flags & AMAP_REFALL) != 0);
 }
 
 /*
  * amap_unref: remove a reference to an amap
  *
- * => caller must remove all pmap-level references to this amap before
- * dropping the reference
- * => called from uvm_unmap_detach [only]  ... note that entry is no
- * longer part of a map
+ * => All pmap-level references to this amap must be already removed.
+ * => Called from uvm_unmap_detach(); entry is already removed from the map.
  */
 void
 amap_unref(struct vm_amap *amap, vaddr_t offset, vsize_t len, boolean_t all)
 {
+   KASSERT(amap->am_ref > 0);
 
-   /* if we are the last reference, free the amap and return. */
-   if (amap->am_ref-- == 1) {
-   amap_wipeout(amap); /* drops final ref and frees */
+   if (amap->am_ref == 1) {
+   /*
+* If the last reference - wipeout and destroy the amap.
+*/
+   amap->am_ref--;
+   amap_wipeout(amap);
return;
}
 
-   /* otherwise just drop the reference count(s) */
-   if (amap->am_ref == 1 && (amap->am_flags & AMAP_SHARED) != 0)
-   amap->am_flags &= ~AMAP_SHARED; /* clear shared flag */
-#ifdef UVM_AMAP_PPREF
-   if (amap->am_ppref == NULL && all == 0 && len != amap->am_nslot)
-   amap_pp_establish(amap);
-   if (amap->am_ppref && amap->am_ppref != PPREF_NONE) {
-   if (all)
-   amap_pp_adjref(amap, 0, amap->am_nslot, -1);
-   else
-   amap_pp_adjref(amap, offset, len, -1);
+   /*
+* Otherwise, drop the reference count(s) on anons.
+*/
+   if (amap->am_ref == 2 && (amap->am_flags & AMAP_SHARED) != 0) {
+ 

Re: Please test: switch select(2) to kqfilters

2020-10-30 Thread Martin Pieuchot
On 26/10/20(Mon) 11:57, Scott Cheloha wrote:
> On Mon, Oct 12, 2020 at 11:11:36AM +0200, Martin Pieuchot wrote:
> [...]
> > +/*
> > + * Scan the kqueue, blocking if necessary until the target time is reached.
> > + * If tsp is NULL we block indefinitely.  If tsp->ts_secs/nsecs are both
> > + * 0 we do not block at all.
> > + */
> >  int
> >  kqueue_scan(struct kqueue_scan_state *scan, int maxevents,
> > -struct kevent *ulistp, struct timespec *tsp, struct kevent *kev,
> > -struct proc *p, int *retval)
> > +struct kevent *kevp, struct timespec *tsp, struct proc *p, int *errorp)
> 
> Is there any reason to change these argument names?

The array is no longer a user specified list because the interface is
now used by other kernel consumers.

> > @@ -618,6 +631,8 @@ dopselect(struct proc *p, int nd, fd_set
> > pobits[2] = (fd_set *)[5];
> > }
> >  
> > +   kqpoll_init(p);
> > +
> >  #definegetbits(name, x) \
> > if (name && (error = copyin(name, pibits[x], ni))) \
> > goto done;
> > @@ -636,43 +651,63 @@ dopselect(struct proc *p, int nd, fd_set
> > if (sigmask)
> > dosigsuspend(p, *sigmask &~ sigcantmask);
> >  
> > -retry:
> > -   ncoll = nselcoll;
> > -   atomic_setbits_int(>p_flag, P_SELECT);
> > -   error = selscan(p, pibits[0], pobits[0], nd, ni, retval);
> > -   if (error || *retval)
> > +   /* Register kqueue events */
> > +   if ((error = pselregister(p, pibits[0], nd, ni, ) != 0))
> > goto done;
> > -   if (timeout == NULL || timespecisset(timeout)) {
> > -   if (timeout != NULL) {
> > -   getnanouptime();
> > -   nsecs = MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP);
> > -   } else
> > -   nsecs = INFSLP;
> > -   s = splhigh();
> > -   if ((p->p_flag & P_SELECT) == 0 || nselcoll != ncoll) {
> > -   splx(s);
> > -   goto retry;
> > -   }
> > -   atomic_clearbits_int(>p_flag, P_SELECT);
> > -   error = tsleep_nsec(, PSOCK | PCATCH, "select", nsecs);
> 
> I need to clarify something.
> 
> My understanding of the current state of poll/select is that all
> threads wait on the same channel, selwait.  Sometimes, a thread will
> wakeup(9) *all* threads waiting on that channel.  When this happens,
> most of the sleeping threads will recheck their conditions, see that
> nothing has changed, and go back to sleep.
> 
> Right?
> 
> Is that spurious wakeup case going away with this diff?  That is, when
> a thread is sleeping in poll/select it will only be woken up by
> another thread if the condition for one of the descriptors of interest
> changes.

Your understanding matches mine; the removal of spurious wakeups might
be beneficial to some programs.  I see this as a pro of this
implementation.
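
For context, the scheme being replaced looks roughly like this (sketch,
simplified from the old select path):

	/* every selecting thread sleeps on the same global channel... */
	error = tsleep_nsec(&selwait, PSOCK | PCATCH, "select", nsecs);

	/* ...and, on a collision, the event side wakes all of them up */
	wakeup(&selwait);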

> 
> > -   splx(s);
> > -   if (timeout != NULL) {
> > -   getnanouptime();
> > -   timespecsub(, , );
> > -   timespecsub(timeout, , timeout);
> > -   if (timeout->tv_sec < 0)
> > -   timespecclear(timeout);
> > -   }
> > -   if (error == 0 || error == EWOULDBLOCK)
> > -   goto retry;
> > +
> > +   /*
> > +* The poll/select family of syscalls has been designed to
> > +* block when file descriptors are not available, even if
> > +* there's nothing to wait for.
> > +*/
> > +   if (nevents == 0) {
> > +   uint64_t nsecs = INFSLP;
> > +
> > +   if (timeout != NULL)
> > +   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> > +
> > +   error = tsleep_nsec(>p_kq, PSOCK | PCATCH, "kqsel", nsecs);
> > +   /* select is not restarted after signals... */
> > +   if (error == ERESTART)
> > +   error = EINTR;
> > +   if (error == EWOULDBLOCK)
> > +   error = 0;
> > +   goto done;
> > +   }
> 
> This is still wrong.  If you want to isolate this case you can't block
> when the timeout is empty:

Thanks, fixed.

> > Index: sys/proc.h
> > ===
> > RCS file: /cvs/src/sys/sys/proc.h,v
> > retrieving revision 1.300
> > diff -u -p -r1.300 proc.h
> > --- sys/proc.h  16 Sep 2020 08:01:15 -  1.300
> > +++ sys

amap: introduce amap_adjref_anons()

2020-10-23 Thread Martin Pieuchot
More refactoring.  This time let's introduce a helper to manipulate
references.  The goal is to reduce the upcoming diff adding locking.

This is extracted from a bigger diff from guenther@ as well as some
bits from NetBSD.

ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.85
diff -u -p -r1.85 uvm_amap.c
--- uvm/uvm_amap.c  12 Oct 2020 08:44:45 -  1.85
+++ uvm/uvm_amap.c  23 Oct 2020 08:21:38 -
@@ -68,7 +68,23 @@ static inline void amap_list_remove(stru
 
 struct vm_amap_chunk *amap_chunk_get(struct vm_amap *, int, int, int);
 void amap_chunk_free(struct vm_amap *, struct vm_amap_chunk *);
-void amap_wiperange_chunk(struct vm_amap *, struct vm_amap_chunk *, int, int);
+
+/*
+ * if we enable PPREF, then we have a couple of extra functions that
+ * we need to prototype here...
+ */
+
+#ifdef UVM_AMAP_PPREF
+
+#define PPREF_NONE ((int *) -1)/* not using ppref */
+
+void   amap_pp_adjref(struct vm_amap *, int, vsize_t, int, struct vm_anon **);
+void   amap_pp_establish(struct vm_amap *);
+void   amap_wiperange_chunk(struct vm_amap *, struct vm_amap_chunk *, int,
+   int, struct vm_anon **);
+void   amap_wiperange(struct vm_amap *, int, int, struct vm_anon **);
+
+#endif /* UVM_AMAP_PPREF */
 
 static inline void
 amap_list_insert(struct vm_amap *amap)
@@ -1153,6 +1169,32 @@ amap_unadd(struct vm_aref *aref, vaddr_t
 }
 
 /*
+ * amap_adjref_anons: adjust the reference count(s) on amap and its anons.
+ */
+static void
+amap_adjref_anons(struct vm_amap *amap, vaddr_t offset, vsize_t len,
+int refv, boolean_t all)
+{
+#ifdef UVM_AMAP_PPREF
+   if (amap->am_ppref == NULL && !all && len != amap->am_nslot) {
+   amap_pp_establish(amap);
+   }
+#endif
+
+   amap->am_ref += refv;
+
+#ifdef UVM_AMAP_PPREF
+   if (amap->am_ppref && amap->am_ppref != PPREF_NONE) {
+   if (all) {
+   amap_pp_adjref(amap, 0, amap->am_nslot, refv);
+   } else {
+   amap_pp_adjref(amap, offset, len, refv);
+   }
+   }
+#endif
+}
+
+/*
  * amap_ref: gain a reference to an amap
  *
  * => "offset" and "len" are in units of pages
@@ -1162,51 +1204,36 @@ void
 amap_ref(struct vm_amap *amap, vaddr_t offset, vsize_t len, int flags)
 {
 
-   amap->am_ref++;
if (flags & AMAP_SHARED)
amap->am_flags |= AMAP_SHARED;
-#ifdef UVM_AMAP_PPREF
-   if (amap->am_ppref == NULL && (flags & AMAP_REFALL) == 0 &&
-   len != amap->am_nslot)
-   amap_pp_establish(amap);
-   if (amap->am_ppref && amap->am_ppref != PPREF_NONE) {
-   if (flags & AMAP_REFALL)
-   amap_pp_adjref(amap, 0, amap->am_nslot, 1);
-   else
-   amap_pp_adjref(amap, offset, len, 1);
-   }
-#endif
+   amap_adjref_anons(amap, offset, len, 1, (flags & AMAP_REFALL) != 0);
 }
 
 /*
  * amap_unref: remove a reference to an amap
  *
- * => caller must remove all pmap-level references to this amap before
- * dropping the reference
- * => called from uvm_unmap_detach [only]  ... note that entry is no
- * longer part of a map
+ * => All pmap-level references to this amap must be already removed.
+ * => Called from uvm_unmap_detach(); entry is already removed from the map.
  */
 void
 amap_unref(struct vm_amap *amap, vaddr_t offset, vsize_t len, boolean_t all)
 {
+   KASSERT(amap->am_ref > 0);
 
-   /* if we are the last reference, free the amap and return. */
-   if (amap->am_ref-- == 1) {
-   amap_wipeout(amap); /* drops final ref and frees */
+   if (amap->am_ref == 1) {
+   /*
+* If the last reference - wipeout and destroy the amap.
+*/
+   amap->am_ref--;
+   amap_wipeout(amap);
return;
}
 
-   /* otherwise just drop the reference count(s) */
-   if (amap->am_ref == 1 && (amap->am_flags & AMAP_SHARED) != 0)
-   amap->am_flags &= ~AMAP_SHARED; /* clear shared flag */
-#ifdef UVM_AMAP_PPREF
-   if (amap->am_ppref == NULL && all == 0 && len != amap->am_nslot)
-   amap_pp_establish(amap);
-   if (amap->am_ppref && amap->am_ppref != PPREF_NONE) {
-   if (all)
-   amap_pp_adjref(amap, 0, amap->am_nslot, -1);
-   else
-   amap_pp_adjref(amap, offset, len, -1);
+   /*
+* Otherwise, drop the reference count(s) on anons.
+*/
+   if (amap->am_ref == 2 && (amap->am_flags & AMAP_SHARED) != 0) {
+   amap->am_flags &= ~AMAP_SHARED;
}
-#endif
+   amap_adjref_anons(amap, offset, len, -1, all);
 }
Index: uvm/uvm_amap.h
===
RCS file: /cvs/src/sys/uvm/uvm_amap.h,v
retrieving revision 1.31
diff -u -p 

const/C99 & locks for uvm_pagerops

2020-10-20 Thread Martin Pieuchot
Diff below uses C99 initializers and constifies the various
"struct uvm_pagerops" in tree.

While here, add some KERNEL_ASSERT_LOCKED() in places where the `uobj'
locking has been removed and that should be revisited.  This is to help
my future self or another developer see what needs some love.

ok?

Index: dev/pci/drm/drm_gem.c
===
RCS file: /cvs/src/sys/dev/pci/drm/drm_gem.c,v
retrieving revision 1.11
diff -u -p -r1.11 drm_gem.c
--- dev/pci/drm/drm_gem.c   22 Aug 2020 04:53:50 -  1.11
+++ dev/pci/drm/drm_gem.c   20 Oct 2020 09:01:08 -
@@ -58,12 +58,11 @@ boolean_t drm_flush(struct uvm_object *,
 int drm_fault(struct uvm_faultinfo *, vaddr_t, vm_page_t *, int, int,
 vm_fault_t, vm_prot_t, int);
 
-struct uvm_pagerops drm_pgops = {
-   NULL,
-   drm_ref,
-   drm_unref,
-   drm_fault,
-   drm_flush,
+const struct uvm_pagerops drm_pgops = {
+   .pgo_reference = drm_ref,
+   .pgo_detach = drm_unref,
+   .pgo_fault = drm_fault,
+   .pgo_flush = drm_flush,
 };
 
 void
Index: dev/pci/drm/ttm/ttm_bo_vm.c
===
RCS file: /cvs/src/sys/dev/pci/drm/ttm/ttm_bo_vm.c,v
retrieving revision 1.22
diff -u -p -r1.22 ttm_bo_vm.c
--- dev/pci/drm/ttm/ttm_bo_vm.c 18 Oct 2020 09:22:32 -  1.22
+++ dev/pci/drm/ttm/ttm_bo_vm.c 20 Oct 2020 09:01:08 -
@@ -903,7 +903,7 @@ ttm_bo_vm_detach(struct uvm_object *uobj
ttm_bo_put(bo);
 }
 
-struct uvm_pagerops ttm_bo_vm_ops = {
+const struct uvm_pagerops ttm_bo_vm_ops = {
.pgo_fault = ttm_bo_vm_fault,
.pgo_reference = ttm_bo_vm_reference,
.pgo_detach = ttm_bo_vm_detach
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.87
diff -u -p -r1.87 uvm_aobj.c
--- uvm/uvm_aobj.c  22 Sep 2020 14:31:08 -  1.87
+++ uvm/uvm_aobj.c  20 Oct 2020 09:01:08 -
@@ -181,16 +181,14 @@ int   uao_grow_convert(struct uvm_object *
 
 /*
  * aobj_pager
- * 
+ *
  * note that some functions (e.g. put) are handled elsewhere
  */
-struct uvm_pagerops aobj_pager = {
-   NULL,   /* init */
-   uao_reference,  /* reference */
-   uao_detach, /* detach */
-   NULL,   /* fault */
-   uao_flush,  /* flush */
-   uao_get,/* get */
+const struct uvm_pagerops aobj_pager = {
+   .pgo_reference = uao_reference,
+   .pgo_detach = uao_detach,
+   .pgo_flush = uao_flush,
+   .pgo_get = uao_get,
 };
 
 /*
@@ -810,6 +808,7 @@ uao_init(void)
 void
 uao_reference(struct uvm_object *uobj)
 {
+   KERNEL_ASSERT_LOCKED();
uao_reference_locked(uobj);
 }
 
@@ -834,6 +833,7 @@ uao_reference_locked(struct uvm_object *
 void
 uao_detach(struct uvm_object *uobj)
 {
+   KERNEL_ASSERT_LOCKED();
uao_detach_locked(uobj);
 }
 
@@ -908,6 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
struct vm_page *pp;
voff_t curoff;
 
+   KERNEL_ASSERT_LOCKED();
+
if (flags & PGO_ALLPAGES) {
start = 0;
stop = (voff_t)aobj->u_pages << PAGE_SHIFT;
@@ -1028,6 +1030,8 @@ uao_get(struct uvm_object *uobj, voff_t 
vm_page_t ptmp;
int lcv, gotpages, maxpages, swslot, rv, pageidx;
boolean_t done;
+
+   KERNEL_ASSERT_LOCKED();
 
/* get number of pages */
maxpages = *npagesp;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.16
diff -u -p -r1.16 uvm_aobj.h
--- uvm/uvm_aobj.h  11 Jul 2014 16:35:40 -  1.16
+++ uvm/uvm_aobj.h  20 Oct 2020 09:01:08 -
@@ -69,7 +69,7 @@ int uao_grow(struct uvm_object *, int);
  * globals
  */
 
-extern struct uvm_pagerops aobj_pager;
+extern const struct uvm_pagerops aobj_pager;
 
 #endif /* _KERNEL */
 
Index: uvm/uvm_device.c
===
RCS file: /cvs/src/sys/uvm/uvm_device.c,v
retrieving revision 1.57
diff -u -p -r1.57 uvm_device.c
--- uvm/uvm_device.c8 Dec 2019 12:37:45 -   1.57
+++ uvm/uvm_device.c20 Oct 2020 09:01:08 -
@@ -70,12 +70,11 @@ static boolean_tudv_flush(struct
 /*
  * master pager structure
  */
-struct uvm_pagerops uvm_deviceops = {
-   NULL,   /* inited statically */
-   udv_reference,
-   udv_detach,
-   udv_fault,
-   udv_flush,
+const struct uvm_pagerops uvm_deviceops = {
+   .pgo_reference = udv_reference,
+   .pgo_detach = udv_detach,
+   .pgo_fault = udv_fault,
+   .pgo_flush = udv_flush,
 };
 
 /*
@@ -213,7 +212,7 @@ udv_attach(dev_t device, vm_prot_t acces
 static void
 udv_reference(struct uvm_object *uobj)
 {
-
+   KERNEL_ASSERT_LOCKED();
uobj->uo_refs++;
 }
 
@@ 

kqueue_scan() refactoring

2020-10-19 Thread Martin Pieuchot
Diff below is the second part of the refactoring [0] that got committed
then reverted 6 months ago.

The idea of this refactoring is to move the accounting and markers used
when collecting events to a context (struct kqueue_scan_state) such that
the collecting routine (kqueue_scan()) can be called multiple times.

Please test this especially with X and report back.

[0] https://marc.info/?l=openbsd-tech=158739418817156=2
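
A sketch of the calling pattern this enables on the consumer side; the
setup helper name and the exact kqueue_scan() signature are assumptions
borrowed from the select(2) conversion posted in the same series:

	struct kqueue_scan_state scan;
	struct kevent kev[KQ_NEVENTS];
	int error = 0, nevents;

	kqueue_scan_setup(&scan, kq);
	do {
		nevents = kqueue_scan(&scan, KQ_NEVENTS, kev, tsp, p, &error);
		/* deliver the nevents entries collected in kev[] here */
	} while (nevents > 0 && error == 0);
	kqueue_scan_finish(&scan);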

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.143
diff -u -p -r1.143 kern_event.c
--- kern/kern_event.c   11 Oct 2020 07:11:59 -  1.143
+++ kern/kern_event.c   19 Oct 2020 08:47:00 -
@@ -905,7 +905,6 @@ kqueue_scan(struct kqueue_scan_state *sc
 
nkev = 0;
kevp = kev;
-
count = maxevents;
if (count == 0)
goto done;
@@ -921,7 +920,8 @@ retry:
 
s = splhigh();
if (kq->kq_count == 0) {
-   if (tsp != NULL && !timespecisset(tsp)) {
+   if ((tsp != NULL && !timespecisset(tsp)) ||
+   scan->kqs_nevent != 0) {
splx(s);
error = 0;
goto done;
@@ -937,18 +937,32 @@ retry:
goto done;
}
 
-   TAILQ_INSERT_TAIL(>kq_head, >kqs_end, kn_tqe);
+   /*
+* Put the end marker in the queue to limit the scan to the events
+* that are currently active.  This prevents events from being
+* recollected if they reactivate during scan.
+*
+* If a partial scan has been performed already but no events have
+* been collected, reposition the end marker to make any new events
+* reachable.
+*/
+   if (!scan->kqs_queued) {
+   TAILQ_INSERT_TAIL(>kq_head, >kqs_end, kn_tqe);
+   scan->kqs_queued = 1;
+   } else if (scan->kqs_nevent == 0) {
+   TAILQ_REMOVE(>kq_head, >kqs_end, kn_tqe);
+   TAILQ_INSERT_TAIL(>kq_head, >kqs_end, kn_tqe);
+   }
+
TAILQ_INSERT_HEAD(>kq_head, >kqs_start, kn_tqe);
while (count) {
kn = TAILQ_NEXT(>kqs_start, kn_tqe);
if (kn->kn_filter == EVFILT_MARKER) {
if (kn == >kqs_end) {
-   TAILQ_REMOVE(>kq_head, >kqs_end,
-   kn_tqe);
TAILQ_REMOVE(>kq_head, >kqs_start,
kn_tqe);
splx(s);
-   if (count == maxevents)
+   if (scan->kqs_nevent == 0)
goto retry;
goto done;
}
@@ -984,6 +998,9 @@ retry:
*kevp = kn->kn_kevent;
kevp++;
nkev++;
+   count--;
+   scan->kqs_nevent++;
+
if (kn->kn_flags & EV_ONESHOT) {
splx(s);
kn->kn_fop->f_detach(kn);
@@ -1009,7 +1026,6 @@ retry:
knote_release(kn);
}
kqueue_check(kq);
-   count--;
if (nkev == KQ_NEVENTS) {
splx(s);
 #ifdef KTRACE
@@ -1026,7 +1042,6 @@ retry:
break;
}
}
-   TAILQ_REMOVE(>kq_head, >kqs_end, kn_tqe);
TAILQ_REMOVE(>kq_head, >kqs_start, kn_tqe);
splx(s);
 done:
@@ -1059,15 +1074,21 @@ void
 kqueue_scan_finish(struct kqueue_scan_state *scan)
 {
struct kqueue *kq = scan->kqs_kq;
+   int s;
 
KASSERT(scan->kqs_start.kn_filter == EVFILT_MARKER);
KASSERT(scan->kqs_start.kn_status == KN_PROCESSING);
KASSERT(scan->kqs_end.kn_filter == EVFILT_MARKER);
KASSERT(scan->kqs_end.kn_status == KN_PROCESSING);
 
+   if (scan->kqs_queued) {
+   scan->kqs_queued = 0;
+   s = splhigh();
+   TAILQ_REMOVE(>kq_head, >kqs_end, kn_tqe);
+   splx(s);
+   }
KQRELE(kq);
 }
-
 
 /*
  * XXX
Index: sys/event.h
===
RCS file: /cvs/src/sys/sys/event.h,v
retrieving revision 1.46
diff -u -p -r1.46 event.h
--- sys/event.h 11 Oct 2020 07:11:59 -  1.46
+++ sys/event.h 19 Oct 2020 08:39:51 -
@@ -204,6 +204,9 @@ struct kqueue_scan_state {
struct kqueue   *kqs_kq;/* kqueue of this scan */
struct knote kqs_start; /* start marker */
struct knote kqs_end;   /* end marker */
+   int  kqs_nevent;/* number of events collected */
+   int  kqs_queued;/* if set, end marker is
+* in queue */
 };
 
 struct proc;
@@ -223,6 

UVM fault check

2020-10-19 Thread Martin Pieuchot
uvm_fault() is one of the most contended "entry points" of the kernel.
To reduce this contention I'm carefully refactoring this code to be able
to push the KERNEL_LOCK() inside the fault handler.

The first aim of this project would be to get the upper layer faults
(cases 1A and 1B) out of ze big lock.  As these faults do not involve
`uobj' the scope of this project should be limited to serializing amap
changes without the KERNEL_LOCK().

The diff below moves the first part of uvm_fault() into its own
function: uvm_fault_check().  It is inspired by, and imitates, the
current code structure of NetBSD's fault handler.

This diff should not have any functional change.

I hope it helps build better understanding of this area.
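
For orientation, the calling side ends up looking roughly like this
(sketch, error handling abbreviated):

ReFault:
	anons = anons_store;
	error = uvm_fault_check(&ufi, &flt, &anons, access_type);
	if (error != 0)
		return (error);	/* EFAULT or EACCES, everything unlocked */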

Comments?  Oks?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_fault.c
--- uvm/uvm_fault.c 29 Sep 2020 11:47:41 -  1.102
+++ uvm/uvm_fault.c 12 Oct 2020 09:01:05 -
@@ -472,114 +472,101 @@ uvmfault_update_stats(struct uvm_faultin
}
 }
 
-/*
- *   F A U L T   -   m a i n   e n t r y   p o i n t
- */
+struct uvm_faultctx {
+   /*
+* the following members are set up by uvm_fault_check() and
+* read-only after that.
+*/
+   vm_prot_t enter_prot;
+   vaddr_t startva;
+   int npages;
+   int centeridx;
+   boolean_t narrow;
+   boolean_t wired;
+   paddr_t pa_flags;
+};
 
 /*
- * uvm_fault: page fault handler
+ * uvm_fault_check: check prot, handle needs-copy, etc.
  *
- * => called from MD code to resolve a page fault
- * => VM data structures usually should be unlocked.   however, it is
- * possible to call here with the main map locked if the caller
- * gets a write lock, sets it recursive, and then calls us (c.f.
- * uvm_map_pageable).   this should be avoided because it keeps
- * the map locked off during I/O.
+ * 1. lookup entry.
+ * 2. check protection.
+ * 3. adjust fault condition (mainly for simulated fault).
+ * 4. handle needs-copy (lazy amap copy).
+ * 5. establish range of interest for neighbor fault (aka pre-fault).
+ * 6. look up anons (if amap exists).
+ * 7. flush pages (if MADV_SEQUENTIAL)
+ *
+ * => called with nothing locked.
+ * => if we fail (result != 0) we unlock everything.
+ * => initialize/adjust many members of flt.
  */
-#define MASK(entry) (UVM_ET_ISCOPYONWRITE(entry) ? \
-~PROT_WRITE : PROT_MASK)
 int
-uvm_fault(vm_map_t orig_map, vaddr_t vaddr, vm_fault_t fault_type,
-vm_prot_t access_type)
+uvm_fault_check(struct uvm_faultinfo *ufi, struct uvm_faultctx *flt,
+struct vm_anon ***ranons, vm_prot_t access_type)
 {
-   struct uvm_faultinfo ufi;
-   vm_prot_t enter_prot;
-   boolean_t wired, narrow, promote, locked, shadowed;
-   int npages, nback, nforw, centeridx, result, lcv, gotpages, ret;
-   vaddr_t startva, currva;
-   voff_t uoff;
-   paddr_t pa, pa_flags;
struct vm_amap *amap;
struct uvm_object *uobj;
-   struct vm_anon *anons_store[UVM_MAXRANGE], **anons, *anon, *oanon;
-   struct vm_page *pages[UVM_MAXRANGE], *pg, *uobjpage;
+   int nback, nforw;
 
-   anon = NULL;
-   pg = NULL;
-
-   uvmexp.faults++;/* XXX: locking? */
-   TRACEPOINT(uvm, fault, vaddr, fault_type, access_type, NULL);
-
-   /* init the IN parameters in the ufi */
-   ufi.orig_map = orig_map;
-   ufi.orig_rvaddr = trunc_page(vaddr);
-   ufi.orig_size = PAGE_SIZE;  /* can't get any smaller than this */
-   if (fault_type == VM_FAULT_WIRE)
-   narrow = TRUE;  /* don't look for neighborhood
-* pages on wire */
-   else
-   narrow = FALSE; /* normal fault */
-
-   /* "goto ReFault" means restart the page fault from ground zero. */
-ReFault:
/* lookup and lock the maps */
-   if (uvmfault_lookup(&ufi, FALSE) == FALSE) {
+   if (uvmfault_lookup(ufi, FALSE) == FALSE) {
return (EFAULT);
}
 
 #ifdef DIAGNOSTIC
-   if ((ufi.map->flags & VM_MAP_PAGEABLE) == 0)
+   if ((ufi->map->flags & VM_MAP_PAGEABLE) == 0)
panic("uvm_fault: fault on non-pageable map (%p, 0x%lx)",
-   ufi.map, vaddr);
+   ufi->map, ufi->orig_rvaddr);
 #endif
 
/* check protection */
-   if ((ufi.entry->protection & access_type) != access_type) {
-   uvmfault_unlockmaps(&ufi, FALSE);
+   if ((ufi->entry->protection & access_type) != access_type) {
+   uvmfault_unlockmaps(ufi, FALSE);
return (EACCES);
}
 
/*
 * "enter_prot" is the protection we want to enter the page in at.
 * for certain pages (e.g. copy-on-write pages) this protection can
-* be more strict than 

uao_init() cleanup

2020-10-19 Thread Martin Pieuchot
uao_init() is called from uvm_km_init() which itself is called by
uvm_init().  None of the *init() functions in UVM have a guard, so be
consistent and remove this one.

ok?

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.87
diff -u -p -r1.87 uvm_aobj.c
--- uvm/uvm_aobj.c  22 Sep 2020 14:31:08 -  1.87
+++ uvm/uvm_aobj.c  13 Oct 2020 09:25:20 -
@@ -788,12 +788,6 @@ uao_create(vsize_t size, int flags)
 void
 uao_init(void)
 {
-   static int uao_initialized;
-
-   if (uao_initialized)
-   return;
-   uao_initialized = TRUE;
-
/*
 * NOTE: Pages for this pool must not come from a pageable
 * kernel map!



uvm_grow(): serialize updates

2020-10-14 Thread Martin Pieuchot
Getting uvm_fault() out of the KERNEL_LOCK() alone is not enough to
reduce the contention due to page faults.  A single part of the handler
spinning on the lock is enough to hide bugs and increase latency.  One
recent example is the uvm_map_inentry() check.

uvm_grow() is another small function called in trap that currently needs
the KERNEL_LOCK().  Diff below changes this requirement without removing
the KERNEL_LOCK() yet. 

It uses the underlying vm_map lock to serialize writes to the fields
of "struct vmspace".

While here I also documented that the reference counting is currently
protected by the KERNEL_LOCK() and introduced a wrapper to help with
future changes and reduce the differences with NetBSD.

Once uvm_grow() is safe to be called without the KERNEL_LOCK() MD trap
functions can be adapted on a case-by-case basis.

Comments, Oks?
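
For completeness, the wrapper itself is expected to be nothing fancier
than the following sketch; the KERNEL_ASSERT_LOCKED() documents that the
reference counting still relies on the KERNEL_LOCK() for now:

	void
	uvmspace_addref(struct vmspace *vm)
	{
		KERNEL_ASSERT_LOCKED();
		KASSERT(vm->vm_refcnt > 0);

		vm->vm_refcnt++;
	}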

Index: kern/kern_sysctl.c
===
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
retrieving revision 1.379
diff -u -p -r1.379 kern_sysctl.c
--- kern/kern_sysctl.c  1 Sep 2020 01:53:50 -   1.379
+++ kern/kern_sysctl.c  14 Oct 2020 09:35:00 -
@@ -1783,7 +1783,7 @@ sysctl_proc_args(int *name, u_int namele
/* Execing - danger. */
if ((vpr->ps_flags & PS_INEXEC))
return (EBUSY);
-   
+
/* Only owner or root can get env */
if ((op == KERN_PROC_NENV || op == KERN_PROC_ENV) &&
(vpr->ps_ucred->cr_uid != cp->p_ucred->cr_uid &&
@@ -1792,7 +1792,7 @@ sysctl_proc_args(int *name, u_int namele
 
ps_strings = vpr->ps_strings;
vm = vpr->ps_vmspace;
-   vm->vm_refcnt++;
+   uvmspace_addref(vm);
vpr = NULL;
 
buf = malloc(PAGE_SIZE, M_TEMP, M_WAITOK);
Index: kern/sys_process.c
===
RCS file: /cvs/src/sys/kern/sys_process.c,v
retrieving revision 1.83
diff -u -p -r1.83 sys_process.c
--- kern/sys_process.c  16 Mar 2020 11:58:46 -  1.83
+++ kern/sys_process.c  14 Oct 2020 09:35:00 -
@@ -850,13 +850,12 @@ process_domem(struct proc *curp, struct 
if ((error = process_checkioperm(curp, tr)) != 0)
return error;
 
-   /* XXXCDC: how should locking work here? */
vm = tr->ps_vmspace;
if ((tr->ps_flags & PS_EXITING) || (vm->vm_refcnt < 1))
return EFAULT;
addr = uio->uio_offset;
 
-   vm->vm_refcnt++;
+   uvmspace_addref(vm);
 
error = uvm_io(&vm->vm_map, uio,
(uio->uio_rw == UIO_WRITE) ? UVM_IO_FIXPROT : 0);
@@ -892,7 +891,7 @@ process_auxv_offset(struct proc *curp, s
if ((tr->ps_flags & PS_EXITING) || (vm->vm_refcnt < 1))
return EFAULT;
 
-   vm->vm_refcnt++;
+   uvmspace_addref(vm);
error = uvm_io(&vm->vm_map, &uio, 0);
uvmspace_free(vm);
 
Index: uvm/uvm_extern.h
===
RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
retrieving revision 1.153
diff -u -p -r1.153 uvm_extern.h
--- uvm/uvm_extern.h13 Sep 2020 10:05:25 -  1.153
+++ uvm/uvm_extern.h14 Oct 2020 09:35:00 -
@@ -192,11 +192,13 @@ struct pmap;
  * Several fields are temporary (text, data stuff).
  *
  *  Locks used to protect struct members in this file:
+ * K   kernel lock
  * I   immutable after creation
+ * v   vm_map's lock
  */
 struct vmspace {
struct  vm_map vm_map;  /* VM address map */
-   int vm_refcnt;  /* number of references */
+   int vm_refcnt;  /* [K] number of references */
caddr_t vm_shm; /* SYS5 shared memory private data XXX */
 /* we copy from vm_startcopy to the end of the structure on fork */
 #define vm_startcopy vm_rssize
@@ -205,9 +207,9 @@ struct vmspace {
segsz_t vm_tsize;   /* text size (pages) XXX */
segsz_t vm_dsize;   /* data size (pages) XXX */
segsz_t vm_dused;   /* data segment length (pages) XXX */
-   segsz_t vm_ssize;   /* stack size (pages) */
-   caddr_t vm_taddr;   /* user virtual address of text XXX */
-   caddr_t vm_daddr;   /* user virtual address of data XXX */
+   segsz_t vm_ssize;   /* [v] stack size (pages) */
+   caddr_t vm_taddr;   /* [I] user virtual address of text */
+   caddr_t vm_daddr;   /* [I] user virtual address of data */
caddr_t vm_maxsaddr;/* [I] user VA at max stack growth */
caddr_t vm_minsaddr;/* [I] user VA at top of stack */
 };
@@ -413,6 +415,7 @@ voiduvmspace_init(struct vmspace *, 
s
vaddr_t, vaddr_t, boolean_t, boolean_t);
 void   uvmspace_exec(struct proc *, vaddr_t, vaddr_t);
 struct vmspace *uvmspace_fork(struct process *);
+void   uvmspace_addref(struct vmspace *);
 void   uvmspace_free(struct vmspace *);
 struct vmspace 

Re: Please test: switch select(2) to kqfilters

2020-10-12 Thread Martin Pieuchot
On 09/10/20(Fri) 10:38, Martin Pieuchot wrote:
> On 02/10/20(Fri) 12:19, Martin Pieuchot wrote:
> > Diff below modifies the internal implementation of {p,}select(2) to
> > query kqfilter handlers instead of poll ones.
> > 
> > I deliberately left {p,}poll(2) untouched to ease the transition.
> > 
> > This diff includes some kqueue refactoring from visa@.  It is built on
> > top of the changes that went in during the last release cycle notably
> > EVFILT_EXCEPT and NOTE_OOB.
> > 
> > A mid-term goal of this change would be to get rid of the poll handlers
> > in order to have a single event system in the kernel to maintain and
> > turn mp-safe.
> > 
> > The logic is as follow:
> > 
> > - With this change every thread get a "private" kqueue, usable by the
> >   kernel only, to register events for select(2) and later poll(2).
> > 
> > - Events specified via FD_SET(2) are converted to their kqueue equivalent.
> > 
> > - kqueue_scan() has been modified to be restartable and work with a given
> >   kqueue.
> > 
> > - At the end of every {p,}select(2) syscall the private kqueue is purged.
> > 
> > This version includes a fix for a previously reported regression triggered
> > by regress/usr.bin/ssh's keyscan test.
> > 
> > 
> > I'd like to get this in early in this release cycle, so please test and
> > report back :o)
> 
> Thanks for all the reports.  Here's an updated version including the
> following changes:
> 
> - Allocate the per-thread kqueue in the first {p,}select(2) syscall to
>   not waste resources as suggested by anton@
> 
> - Keep EWOULDBLOCK handling inside kqueue_scan(), pointed by cheloha@
> 
> - Add a comment to better explain why successive kqueue_scan() calls are
>   always non-blocking
> 
> I'm appreciate reviews/oks on the kqueue_scan() refactoring I sent to
> start shrinking this diff.
> 
> Tests are always welcome, especially on non-amd64 architectures.

Rebased diff on top of -current below:

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.143
diff -u -p -r1.143 kern_event.c
--- kern/kern_event.c   11 Oct 2020 07:11:59 -  1.143
+++ kern/kern_event.c   12 Oct 2020 08:56:21 -
@@ -57,6 +57,7 @@
 #include 
 #include 
 
+struct kqueue *kqueue_alloc(struct filedesc *);
 void   kqueue_terminate(struct proc *p, struct kqueue *);
 void   kqueue_free(struct kqueue *);
 void   kqueue_init(void);
@@ -504,6 +505,27 @@ const struct filterops dead_filtops = {
.f_event= filt_dead,
 };
 
+void
+kqpoll_init(struct proc *p)
+{
+   if (p->p_kq != NULL)
+   return;
+
+   p->p_kq = kqueue_alloc(p->p_fd);
+   p->p_kq_serial = arc4random();
+}
+
+void
+kqpoll_exit(struct proc *p)
+{
+   if (p->p_kq == NULL)
+   return;
+
+   kqueue_terminate(p, p->p_kq);
+   kqueue_free(p->p_kq);
+   p->p_kq = NULL;
+}
+
 struct kqueue *
 kqueue_alloc(struct filedesc *fdp)
 {
@@ -567,6 +589,7 @@ sys_kevent(struct proc *p, void *v, regi
struct timespec ts;
struct timespec *tsp = NULL;
int i, n, nerrors, error;
+   int ready, total;
struct kevent kev[KQ_NEVENTS];
 
if ((fp = fd_getfile(fdp, SCARG(uap, fd))) == NULL)
@@ -595,9 +618,9 @@ sys_kevent(struct proc *p, void *v, regi
kq = fp->f_data;
nerrors = 0;
 
-   while (SCARG(uap, nchanges) > 0) {
-   n = SCARG(uap, nchanges) > KQ_NEVENTS ?
-   KQ_NEVENTS : SCARG(uap, nchanges);
+   while ((n = SCARG(uap, nchanges)) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
error = copyin(SCARG(uap, changelist), kev,
n * sizeof(struct kevent));
if (error)
@@ -635,11 +658,36 @@ sys_kevent(struct proc *p, void *v, regi
 
kqueue_scan_setup(&scan, kq);
FRELE(fp, p);
-   error = kqueue_scan(&scan, SCARG(uap, nevents), SCARG(uap, eventlist),
-   tsp, kev, p, &n);
+   /*
+* Collect as many events as we can.  The timeout on successive
+* loops is disabled (kqueue_scan() becomes non-blocking).
+*/
+   total = 0;
+   error = 0;
+   while ((n = SCARG(uap, nevents) - total) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
+   ready = kqueue_scan(&scan, n, kev, tsp, p, &error);
+   if (ready == 0)
+   break;
+   error = copyout(kev, SCARG(uap, eventlist) + total,
+   sizeof(struct kevent) * ready);
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktreve

Re: xhci zero length transfers 'leak' one transfer buffer count

2020-10-11 Thread Martin Pieuchot
On 09/10/20(Fri) 12:37, Jonathon Fletcher wrote:
> In xhci_xfer_get_trb, the count of transfer buffers in the pipe 
> (xp->free_trbs) is always decremented but the count of transfer buffers used 
> in the transfer (xx->ntrb) is not incremented for zero-length transfers. The 
> result of this is that, at the end of a zero length transfer, xp->free_trbs 
> has 'lost' one.
> 
> Over time, this mismatch of unconditional decrement (xp->free_trbs) vs 
> conditional increment (xx->ntrb) results in xhci_device_*_start returning 
> USBD_NOMEM.
> 
> The patch below works around this by only decrementing xp->free_trbs in the 
> cases when xx->ntrb is incremented.

Did you consider incrementing xx->ntrb instead?

With the diff below the produced TRB isn't accounted for, which might
lead to an off-by-one.
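
Something along these lines, i.e. accounting for the zero-length TRB in
the branch that produces it (rough sketch only, the exact content of
that branch is from memory and the effect on the completion path would
need to be checked):

	case -1:	/* This will be a zero-length TD. */
		xp->pending_xfers[xp->ring.index] = NULL;
		xx->ntrb += 1;	/* account for the zero-length TRB too */
		break;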

> Index: xhci.c
> ===
> RCS file: /cvs/src/sys/dev/usb/xhci.c,v
> retrieving revision 1.119
> diff -u -p -u -r1.119 xhci.c
> --- xhci.c31 Jul 2020 19:27:57 -  1.119
> +++ xhci.c9 Oct 2020 19:11:45 -
> @@ -1836,7 +1836,6 @@ xhci_xfer_get_trb(struct xhci_softc *sc,
>   struct xhci_xfer *xx = (struct xhci_xfer *)xfer;
>  
>   KASSERT(xp->free_trbs >= 1);
> - xp->free_trbs--;
>   *togglep = xp->ring.toggle;
>  
>   switch (last) {
> @@ -1847,11 +1846,13 @@ xhci_xfer_get_trb(struct xhci_softc *sc,
>   xp->pending_xfers[xp->ring.index] = xfer;
>   xx->index = -2;
>   xx->ntrb += 1;
> + xp->free_trbs--;
>   break;
>   case 1: /* This will terminate a chain. */
>   xp->pending_xfers[xp->ring.index] = xfer;
>   xx->index = xp->ring.index;
>   xx->ntrb += 1;
> + xp->free_trbs--;
>   break;
>   }
>  
> 



Re: Please test: switch select(2) to kqfilters

2020-10-09 Thread Martin Pieuchot
On 02/10/20(Fri) 12:19, Martin Pieuchot wrote:
> Diff below modifies the internal implementation of {p,}select(2) to
> query kqfilter handlers instead of poll ones.
> 
> I deliberately left {p,}poll(2) untouched to ease the transition.
> 
> This diff includes some kqueue refactoring from visa@.  It is built on
> top of the changes that went in during the last release cycle notably
> EVFILT_EXCEPT and NOTE_OOB.
> 
> A mid-term goal of this change would be to get rid of the poll handlers
> in order to have a single event system in the kernel to maintain and
> turn mp-safe.
> 
> The logic is as follow:
> 
> - With this change every thread get a "private" kqueue, usable by the
>   kernel only, to register events for select(2) and later poll(2).
> 
> - Events specified via FD_SET(2) are converted to their kqueue equivalent.
> 
> - kqueue_scan() has been modified to be restartable and work with a given
>   kqueue.
> 
> - At the end of every {p,}select(2) syscall the private kqueue is purged.
> 
> This version includes a fix for a previously reported regression triggered
> by regress/usr.bin/ssh's keyscan test.
> 
> 
> I'd like to get this in early in this release cycle, so please test and
> report back :o)

Thanks for all the reports.  Here's an updated version including the
following changes:

- Allocate the per-thread kqueue in the first {p,}select(2) syscall to
  not waste resources as suggested by anton@

- Keep EWOULDBLOCK handling inside kqueue_scan(), pointed by cheloha@

- Add a comment to better explain why successive kqueue_scan() calls are
  always non-blocking

I'd appreciate reviews/oks on the kqueue_scan() refactoring I sent to
start shrinking this diff.

Tests are always welcome, especially on non-amd64 architectures.

diff --git sys/kern/kern_event.c sys/kern/kern_event.c
index 9bc469b1235..87f15831c8f 100644
--- sys/kern/kern_event.c
+++ sys/kern/kern_event.c
@@ -57,6 +57,7 @@
 #include 
 #include 
 
+struct kqueue *kqueue_alloc(struct filedesc *);
 void   kqueue_terminate(struct proc *p, struct kqueue *);
 void   kqueue_free(struct kqueue *);
 void   kqueue_init(void);
@@ -64,9 +65,6 @@ void  KQREF(struct kqueue *);
 void   KQRELE(struct kqueue *);
 
 intkqueue_sleep(struct kqueue *, struct timespec *);
-intkqueue_scan(struct kqueue *kq, int maxevents,
-   struct kevent *ulistp, struct timespec *timeout,
-   struct kevent *kev, struct proc *p, int *retval);
 
 intkqueue_read(struct file *, struct uio *, int);
 intkqueue_write(struct file *, struct uio *, int);
@@ -507,6 +505,27 @@ const struct filterops dead_filtops = {
.f_event= filt_dead,
 };
 
+void
+kqpoll_init(struct proc *p)
+{
+   if (p->p_kq != NULL)
+   return;
+
+   p->p_kq = kqueue_alloc(p->p_fd);
+   p->p_kq_serial = arc4random();
+}
+
+void
+kqpoll_exit(struct proc *p)
+{
+   if (p->p_kq == NULL)
+   return;
+
+   kqueue_terminate(p, p->p_kq);
+   kqueue_free(p->p_kq);
+   p->p_kq = NULL;
+}
+
 struct kqueue *
 kqueue_alloc(struct filedesc *fdp)
 {
@@ -554,6 +573,7 @@ out:
 int
 sys_kevent(struct proc *p, void *v, register_t *retval)
 {
+   struct kqueue_scan_state scan;
struct filedesc* fdp = p->p_fd;
struct sys_kevent_args /* {
syscallarg(int) fd;
@@ -569,6 +589,7 @@ sys_kevent(struct proc *p, void *v, register_t *retval)
struct timespec ts;
struct timespec *tsp = NULL;
int i, n, nerrors, error;
+   int ready, total;
struct kevent kev[KQ_NEVENTS];
 
if ((fp = fd_getfile(fdp, SCARG(uap, fd))) == NULL)
@@ -597,9 +618,9 @@ sys_kevent(struct proc *p, void *v, register_t *retval)
kq = fp->f_data;
nerrors = 0;
 
-   while (SCARG(uap, nchanges) > 0) {
-   n = SCARG(uap, nchanges) > KQ_NEVENTS ?
-   KQ_NEVENTS : SCARG(uap, nchanges);
+   while ((n = SCARG(uap, nchanges)) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
error = copyin(SCARG(uap, changelist), kev,
n * sizeof(struct kevent));
if (error)
@@ -635,12 +656,41 @@ sys_kevent(struct proc *p, void *v, register_t *retval)
goto done;
}
 
+
KQREF(kq);
FRELE(fp, p);
-   error = kqueue_scan(kq, SCARG(uap, nevents), SCARG(uap, eventlist),
-   tsp, kev, p, &n);
+   /*
+* Collect as many events as we can.  The timeout on successive
+* loops is disabled (kqueue_scan() becomes non-blocking).
+*/
+   total = 0;
+   error = 0;
+   kqueue_scan_setup(&scan, kq);
+   while ((n = SCARG(uap, nevents) - total) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
+   ready = kqueue_scan

Re: amap: KASSERT()s and local variables

2020-10-07 Thread Martin Pieuchot
On 01/10/20(Thu) 14:18, Martin Pieuchot wrote:
> Use more KASSERT()s instead of the "if (x) panic()" idiom for sanity
> checks and add a couple of local variables to reduce the difference
> with NetBSD and help for upcoming locking.

deraadt@ mentioned that KASSERT()s are not effective in RAMDISK kernels.

So the revisited diff below only converts checks that are redundant with
NULL dereferences.

ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.84
diff -u -p -r1.84 uvm_amap.c
--- uvm/uvm_amap.c  25 Sep 2020 08:04:48 -  1.84
+++ uvm/uvm_amap.c  7 Oct 2020 14:40:53 -
@@ -669,9 +669,7 @@ ReStart:
pg = anon->an_page;
 
/* page must be resident since parent is wired */
-   if (pg == NULL)
-   panic("amap_cow_now: non-resident wired page"
-   " in anon %p", anon);
+   KASSERT(pg != NULL);
 
/*
 * if the anon ref count is one, we are safe (the child
@@ -740,6 +738,7 @@ ReStart:
 void
 amap_splitref(struct vm_aref *origref, struct vm_aref *splitref, vaddr_t 
offset)
 {
+   struct vm_amap *amap = origref->ar_amap;
int leftslots;
 
AMAP_B2SLOT(leftslots, offset);
@@ -747,17 +746,18 @@ amap_splitref(struct vm_aref *origref, s
panic("amap_splitref: split at zero offset");
 
/* now: we have a valid am_mapped array. */
-   if (origref->ar_amap->am_nslot - origref->ar_pageoff - leftslots <= 0)
+   if (amap->am_nslot - origref->ar_pageoff - leftslots <= 0)
panic("amap_splitref: map size check failed");
 
 #ifdef UVM_AMAP_PPREF
-/* establish ppref before we add a duplicate reference to the amap */
-   if (origref->ar_amap->am_ppref == NULL)
-   amap_pp_establish(origref->ar_amap);
+/* Establish ppref before we add a duplicate reference to the amap. */
+   if (amap->am_ppref == NULL)
+   amap_pp_establish(amap);
 #endif
 
-   splitref->ar_amap = origref->ar_amap;
-   splitref->ar_amap->am_ref++;/* not a share reference */
+   /* Note: not a share reference. */
+   amap->am_ref++;
+   splitref->ar_amap = amap;
splitref->ar_pageoff = origref->ar_pageoff + leftslots;
 }
 
@@ -1104,12 +1104,11 @@ amap_add(struct vm_aref *aref, vaddr_t o
 
slot = UVM_AMAP_SLOTIDX(slot);
if (replace) {
-   if (chunk->ac_anon[slot] == NULL)
-   panic("amap_add: replacing null anon");
-   if (chunk->ac_anon[slot]->an_page != NULL &&
-   (amap->am_flags & AMAP_SHARED) != 0) {
-   pmap_page_protect(chunk->ac_anon[slot]->an_page,
-   PROT_NONE);
+   struct vm_anon *oanon  = chunk->ac_anon[slot];
+
+   KASSERT(oanon != NULL);
+   if (oanon->an_page && (amap->am_flags & AMAP_SHARED) != 0) {
+   pmap_page_protect(oanon->an_page, PROT_NONE);
/*
 * XXX: suppose page is supposed to be wired somewhere?
 */
@@ -1138,14 +1137,13 @@ amap_unadd(struct vm_aref *aref, vaddr_t
 
AMAP_B2SLOT(slot, offset);
slot += aref->ar_pageoff;
-   KASSERT(slot < amap->am_nslot);
+   if (chunk->ac_anon[slot] == NULL)
+   panic("amap_unadd: nothing there");
chunk = amap_chunk_get(amap, slot, 0, PR_NOWAIT);
-   if (chunk == NULL)
-   panic("amap_unadd: chunk for slot %d not present", slot);
+   KASSERT(chunk != NULL);
 
slot = UVM_AMAP_SLOTIDX(slot);
-   if (chunk->ac_anon[slot] == NULL)
-   panic("amap_unadd: nothing there");
+   KASSERT(chunk->ac_anon[slot] != NULL);
 
chunk->ac_anon[slot] = NULL;
chunk->ac_usedmap &= ~(1 << slot);



kqueue_scan() refactoring

2020-10-07 Thread Martin Pieuchot
The diff below has already been presented in August [0].  It is the
first step at splitting the kqueue refactoring required for the poll
and select rewrite.

This first iteration introduces the new API with only minimal changes.
The only change in behavior is that the markers' `kn_filter' and
`kn_status' are now initialized only once.

Ok?

[0] https://marc.info/?l=openbsd-tech=159740120705383=2
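
For reviewers, the calling pattern introduced by this API boils down to
the sketch below (mirroring what sys_kevent() does in the diff; `kq',
`ulistp', `tsp', `kev' and `p' are the same variables sys_kevent()
already has in hand):

	struct kqueue_scan_state scan;
	int error, n;

	kqueue_scan_setup(&scan, kq);	/* grabs a reference on kq */
	error = kqueue_scan(&scan, maxevents, ulistp, tsp, kev, p, &n);
	kqueue_scan_finish(&scan);	/* drops the reference */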

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.142
diff -u -p -r1.142 kern_event.c
--- kern/kern_event.c   12 Aug 2020 13:49:24 -  1.142
+++ kern/kern_event.c   7 Oct 2020 13:42:36 -
@@ -64,9 +64,6 @@ void  KQREF(struct kqueue *);
 void   KQRELE(struct kqueue *);
 
 intkqueue_sleep(struct kqueue *, struct timespec *);
-intkqueue_scan(struct kqueue *kq, int maxevents,
-   struct kevent *ulistp, struct timespec *timeout,
-   struct kevent *kev, struct proc *p, int *retval);
 
 intkqueue_read(struct file *, struct uio *, int);
 intkqueue_write(struct file *, struct uio *, int);
@@ -554,6 +551,7 @@ out:
 int
 sys_kevent(struct proc *p, void *v, register_t *retval)
 {
+   struct kqueue_scan_state scan;
struct filedesc* fdp = p->p_fd;
struct sys_kevent_args /* {
syscallarg(int) fd;
@@ -635,11 +633,12 @@ sys_kevent(struct proc *p, void *v, regi
goto done;
}
 
-   KQREF(kq);
+   kqueue_scan_setup(&scan, kq);
FRELE(fp, p);
-   error = kqueue_scan(kq, SCARG(uap, nevents), SCARG(uap, eventlist),
+   error = kqueue_scan(&scan, SCARG(uap, nevents), SCARG(uap, eventlist),
tsp, kev, p, &n);
-   KQRELE(kq);
+   kqueue_scan_finish(&scan);
+
*retval = n;
return (error);
 
@@ -895,11 +894,13 @@ kqueue_sleep(struct kqueue *kq, struct t
 }
 
 int
-kqueue_scan(struct kqueue *kq, int maxevents, struct kevent *ulistp,
-struct timespec *tsp, struct kevent *kev, struct proc *p, int *retval)
+kqueue_scan(struct kqueue_scan_state *scan, int maxevents,
+struct kevent *ulistp, struct timespec *tsp, struct kevent *kev,
+struct proc *p, int *retval)
 {
+   struct kqueue *kq = scan->kqs_kq;
struct kevent *kevp;
-   struct knote mend, mstart, *kn;
+   struct knote *kn;
int s, count, nkev, error = 0;
 
nkev = 0;
@@ -909,9 +910,6 @@ kqueue_scan(struct kqueue *kq, int maxev
if (count == 0)
goto done;
 
-   memset(&mstart, 0, sizeof(mstart));
-   memset(&mend, 0, sizeof(mend));
-
 retry:
KASSERT(count == maxevents);
KASSERT(nkev == 0);
@@ -939,18 +937,16 @@ retry:
goto done;
}
 
-   mstart.kn_filter = EVFILT_MARKER;
-   mstart.kn_status = KN_PROCESSING;
-   TAILQ_INSERT_HEAD(&kq->kq_head, &mstart, kn_tqe);
-   mend.kn_filter = EVFILT_MARKER;
-   mend.kn_status = KN_PROCESSING;
-   TAILQ_INSERT_TAIL(&kq->kq_head, &mend, kn_tqe);
+   TAILQ_INSERT_TAIL(&kq->kq_head, &scan->kqs_end, kn_tqe);
+   TAILQ_INSERT_HEAD(&kq->kq_head, &scan->kqs_start, kn_tqe);
while (count) {
-   kn = TAILQ_NEXT(&mstart, kn_tqe);
+   kn = TAILQ_NEXT(&scan->kqs_start, kn_tqe);
if (kn->kn_filter == EVFILT_MARKER) {
-   if (kn == &mend) {
-   TAILQ_REMOVE(&kq->kq_head, &mend, kn_tqe);
-   TAILQ_REMOVE(&kq->kq_head, &mstart, kn_tqe);
+   if (kn == &scan->kqs_end) {
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_end,
+   kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start,
+   kn_tqe);
splx(s);
if (count == maxevents)
goto retry;
@@ -958,8 +954,9 @@ retry:
}
 
/* Move start marker past another thread's marker. */
-   TAILQ_REMOVE(&kq->kq_head, &mstart, kn_tqe);
-   TAILQ_INSERT_AFTER(&kq->kq_head, kn, &mstart, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start, kn_tqe);
+   TAILQ_INSERT_AFTER(&kq->kq_head, kn, &scan->kqs_start,
+   kn_tqe);
continue;
}
 
@@ -1029,8 +1026,8 @@ retry:
break;
}
}
-   TAILQ_REMOVE(&kq->kq_head, &mend, kn_tqe);
-   TAILQ_REMOVE(&kq->kq_head, &mstart, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_end, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start, kn_tqe);
splx(s);
 done:
if (nkev != 0) {
@@ -1044,6 +1041,33 @@ done:
*retval = maxevents - count;
return (error);
 }
+
+void
+kqueue_scan_setup(struct kqueue_scan_state *scan, struct kqueue *kq)
+{
+   memset(scan, 0, sizeof(*scan));
+
+   KQREF(kq);
+   scan->kqs_kq = kq;
+   

uvm/uvm_map.h header cleanup

2020-10-07 Thread Martin Pieuchot
Now that all architectures have been fixed, it's time to remove this...

ok?

Index: uvm/uvm_map.h
===
RCS file: /cvs/src/sys/uvm/uvm_map.h,v
retrieving revision 1.67
diff -u -p -r1.67 uvm_map.h
--- uvm/uvm_map.h   18 Dec 2019 13:33:29 -  1.67
+++ uvm/uvm_map.h   17 Sep 2020 12:25:12 -
@@ -409,13 +409,6 @@ intuvm_map_fill_vmmap(struct vm_map *,
  *
  */
 
-/*
- * XXX: clean up later
- * Half the kernel seems to depend on them being included here.
- */
-#include 
-#include   /* for panic() */
-
 boolean_t  vm_map_lock_try_ln(struct vm_map*, char*, int);
 void   vm_map_lock_ln(struct vm_map*, char*, int);
 void   vm_map_lock_read_ln(struct vm_map*, char*, int);



Re: mmap: Do not push KERNEL_LOCK() too far

2020-10-05 Thread Martin Pieuchot
On 03/10/20(Sat) 12:59, Mark Kettenis wrote:
> > Date: Fri, 2 Oct 2020 10:32:27 +0200
> > From: Martin Pieuchot 
> > 
> > On 01/10/20(Thu) 21:44, Mark Kettenis wrote:
> > > > Date: Thu, 1 Oct 2020 14:10:56 +0200
> > > > From: Martin Pieuchot 
> > > > 
> > > > While studying a bug report from naddy@ in 2017 when testing guenther@'s
> > > > amap/anon locking diff I figured out that we have been too optimistic in
> > > > the !MAP_ANON case.
> > > > 
> > > > The reported panic involves, I'd guess, a race between fd_getfile() and
> > > > vref():
> > > > 
> > > >   panic: vref used where vget required
> > > >   db_enter() at db_enter+0x5
> > > >   panic() at panic+0x129
> > > >   vref(ff03b20d29e8) at vref+0x5d
> > > >   uvn_attach(101,ff03a5879dc0) at uvn_attach+0x11d
> > > >   uvm_mmapfile(7,ff03a5879dc0,2,1,13,10012) at 
> > > > uvm_mmapfile+0x12c
> > > >   sys_mmap(c50,8000225f82a0,1) at sys_mmap+0x604
> > > >   syscall() at syscall+0x279
> > > >   --- syscall (number 198) ---
> > > >   end of kernel
> > > > 
> > > > Removing the KERNEL_LOCK() from file mapping was out of the scope of 
> > > > this
> > > > previous work, so I'd like to go back to a single KERNEL_LOCK/UNLOCK 
> > > > dance
> > > > in this code path to remove any false positive.
> > > > 
> > > > Note that this code is currently always run under KERNEL_LOCK() so this
> > > > will only have effect once the syscall will be unlocked.
> > > > 
> > > > ok?
> > > 
> > > Hmm, I thought fd_getfile() was fully mpsafe.
> > 
> > It is to get a reference on `fp'.  However if the current thread
> > releases the KERNEL_LOCK() before calling vref(9) it might lose a
> > race.
> 
> I don't see the race.  The function returns a 'fp' with a reference,
> so 'fp' will be valid regardless of whether we hold the kernel lock or
> not.  So we should be able to take the kernel lock after the
> fd_getfile() call isn't it?

Should we?  I'd assume we can't unless somebody can explain the
contrary.  My point is the following: by releasing the KERNEL_LOCK() we
allow other parts of the kernel, syscalls and fault handlers, to mess
with this vnode.

> > > But I suppose the kernel lock needs to be grabbed before we start
> > > looking at the vnode?
> > 
> > Yes, or to say it otherwise not released.
> 
> So the problem is that while we have an 'fp', its f_data member points
> to a vnode that has already been put on the freelist and therefore has
> v_usecount set to zero?  How does that happen?

I don't know.  I'm trying to be conservative to be able to concentrate
on amaps & anons.  I'd rather keep all the rest under a single
KERNEL_LOCK().

Hopefully this can be revisited soon.

> > > Your diff makes the locking a bit convoluted, but I suppose adding a
> > > KERNEL_UNLOCK() before every "goto out" is worse?
> > 
> > I tried to keep the diff as small as possible to not obfuscate the change.
> > If we want cleaner code we can move the !ANON case in a different function.
> 
> Splitting would be hard because of the "goto is_anon".
> 
> > > > Index: uvm/uvm_mmap.c
> > > > ===
> > > > RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> > > > retrieving revision 1.161
> > > > diff -u -p -r1.161 uvm_mmap.c
> > > > --- uvm/uvm_mmap.c  4 Mar 2020 21:15:39 -   1.161
> > > > +++ uvm/uvm_mmap.c  28 Sep 2020 09:48:26 -
> > > > @@ -288,8 +288,11 @@ sys_mmap(struct proc *p, void *v, regist
> > > >  
> > > > /* check for file mappings (i.e. not anonymous) and verify 
> > > > file. */
> > > > if ((flags & MAP_ANON) == 0) {
> > > > -   if ((fp = fd_getfile(fdp, fd)) == NULL)
> > > > -   return (EBADF);
> > > > +   KERNEL_LOCK();
> > > > +   if ((fp = fd_getfile(fdp, fd)) == NULL) {
> > > > +   error = EBADF;
> > > > +   goto out;
> > > > +   }
> > > >  
> > > > if (fp->f_type != DTYPE_VNODE) {
> > > > error = ENODEV; /* only mmap vnodes! */
> >

Re: Please test: switch select(2) to kqfilters

2020-10-03 Thread Martin Pieuchot
On 02/10/20(Fri) 19:09, Scott Cheloha wrote:
> On Fri, Oct 02, 2020 at 12:19:35PM +0200, Martin Pieuchot wrote:
> > @@ -635,12 +642,39 @@ sys_kevent(struct proc *p, void *v, regi
> > goto done;
> > }
> >  
> > +
> > KQREF(kq);
> > FRELE(fp, p);
> > -   error = kqueue_scan(kq, SCARG(uap, nevents), SCARG(uap, eventlist),
> > -   tsp, kev, p, &n);
> > +   /*
> > +* Collect as many events as we can.  The timeout on successive
> > +* loops is disabled (kqueue_scan() becomes non-blocking).
> > +*/
> > +   total = 0;
> > +   error = 0;
> > +   kqueue_scan_setup(&scan, kq);
> > +   while ((n = SCARG(uap, nevents) - total) > 0) {
> > +   if (n > nitems(kev))
> > +   n = nitems(kev);
> > +   ready = kqueue_scan(&scan, n, kev, tsp, p, &error);
> > +   if (ready == 0)
> > +   break;
> > +   error = copyout(kev, SCARG(uap, eventlist) + total,
> > +   sizeof(struct kevent) * ready);
> > +#ifdef KTRACE
> > +   if (KTRPOINT(p, KTR_STRUCT))
> > +   ktrevent(p, kev, ready);
> > +#endif
> > +   total += ready;
> > +   if (error || ready < n)
> > +   break;
> > +   tsp = &ts;   /* successive loops non-blocking */
> > +   timespecclear(tsp);
> 
> Here, this.  Why do we force a non-blocking loop the second time?

If there's a second time, that implies the first time already reported
some events, so there's already something to return to userland.  In that
case we just want to gather the events that were not collected the first
time and not sleep again.
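
Condensed, the collection loop looks like this (sketch, the copyout and
KTRACE handling elided):

	total = 0;
	while ((n = SCARG(uap, nevents) - total) > 0) {
		ready = kqueue_scan(&scan, n, kev, tsp, p, &error);
		if (ready == 0)
			break;
		/* copyout() the `ready' collected events ... */
		total += ready;
		if (error || ready < n)
			break;
		tsp = &ts;		/* successive loops non-blocking */
		timespecclear(tsp);
	}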

> > +   }
> > +   kqueue_scan_finish(&scan);
> > KQRELE(kq);
> > -   *retval = n;
> > +   if (error == EWOULDBLOCK)
> > +   error = 0;
> > +   *retval = total;
> > return (error);
> >  
> >   done:
> > @@ -894,24 +928,22 @@ kqueue_sleep(struct kqueue *kq, struct t
> > return (error);
> >  }
> >  
> > +/*
> > + * Scan the kqueue, blocking if necessary until the target time is reached.
> > + * If tsp is NULL we block indefinitely.  If tsp->ts_secs/nsecs are both
> > + * 0 we do not block at all.
> > + */
> >  int
> > -kqueue_scan(struct kqueue *kq, int maxevents, struct kevent *ulistp,
> > -struct timespec *tsp, struct kevent *kev, struct proc *p, int *retval)
> > +kqueue_scan(struct kqueue_scan_state *scan, int maxevents,
> > +struct kevent *kevp, struct timespec *tsp, struct proc *p, int *errorp)
> >  {
> > -   struct kevent *kevp;
> > -   struct knote mend, mstart, *kn;
> > -   int s, count, nkev, error = 0;
> > -
> > -   nkev = 0;
> > -   kevp = kev;
> > +   struct knote *kn;
> > +   struct kqueue *kq = scan->kqs_kq;
> > +   int s, count, nkev = 0, error = 0;
> >  
> > count = maxevents;
> > if (count == 0)
> > goto done;
> > -
> > -   memset(&mstart, 0, sizeof(mstart));
> > -   memset(&mend, 0, sizeof(mend));
> > -
> >  retry:
> > KASSERT(count == maxevents);
> > KASSERT(nkev == 0);
> > @@ -923,7 +955,8 @@ retry:
> >  
> > s = splhigh();
> > if (kq->kq_count == 0) {
> > -   if (tsp != NULL && !timespecisset(tsp)) {
> > +   if ((tsp != NULL && !timespecisset(tsp)) ||
> > +   scan->kqs_nevent != 0) {
> > splx(s);
> > error = 0;
> > goto done;
> > @@ -931,7 +964,7 @@ retry:
> > kq->kq_state |= KQ_SLEEP;
> > error = kqueue_sleep(kq, tsp);
> > splx(s);
> > -   if (error == 0 || error == EWOULDBLOCK)
> > +   if (error == 0)
> > goto retry;
> 
> Why wouldn't we want to retry in the EWOULDBLOCK case?
> You have a check for
> 
>   tsp != NULL && !timespecisset(tsp)
> 
> e.g., when you time out.

I don't recall why or even if there was a reason.  I'll change it back,
thanks.

> > +
> > +   /*
> > +* The poll/select family of syscalls has been designed to
> > +* block when file descriptors are not available, even if
> > +* there's nothing to wait for.
> > +*/
> > +   if (nevents == 0) {
> > +   uint64_t nsecs = INFSLP;
> > +
> > +   if (timeout != NULL)
> > +   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> > +
> > +   error = 

Please test: switch select(2) to kqfilters

2020-10-02 Thread Martin Pieuchot
Diff below modifies the internal implementation of {p,}select(2) to
query kqfilter handlers instead of poll ones.

I deliberately left {p,}poll(2) untouched to ease the transition.

This diff includes some kqueue refactoring from visa@.  It is built on
top of the changes that went in during the last release cycle notably
EVFILT_EXCEPT and NOTE_OOB.

A mid-term goal of this change would be to get rid of the poll handlers
in order to have a single event system in the kernel to maintain and
turn mp-safe.

The logic is as follows:

- With this change every thread gets a "private" kqueue, usable by the
  kernel only, to register events for select(2) and later poll(2).

- Events specified via FD_SET(2) are converted to their kqueue equivalent.

- kqueue_scan() has been modified to be restartable and work with a given
  kqueue.

- At the end of every {p,}select(2) syscall the private kqueue is purged.
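
As a rough illustration of the conversion step (simplified sketch; the
flags and udata handling in the actual diff are a bit more involved, and
`nfds', `readfds' and `p' are the usual select(2) arguments and current
thread):

	struct kevent kev;
	int fd, error;

	for (fd = 0; fd < nfds; fd++) {
		if (readfds != NULL && FD_ISSET(fd, readfds)) {
			EV_SET(&kev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
			error = kqueue_register(p->p_kq, &kev, p);
			/* writefds -> EVFILT_WRITE, exceptfds -> EVFILT_EXCEPT */
		}
	}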

This version includes a fix for a previously reported regression triggered
by regress/usr.bin/ssh's keyscan test.


I'd like to get this in early in this release cycle, so please test and
report back :o)

Thanks,
Martin

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.142
diff -u -p -r1.142 kern_event.c
--- kern/kern_event.c   12 Aug 2020 13:49:24 -  1.142
+++ kern/kern_event.c   1 Oct 2020 12:53:54 -
@@ -64,9 +64,6 @@ void  KQREF(struct kqueue *);
 void   KQRELE(struct kqueue *);
 
 intkqueue_sleep(struct kqueue *, struct timespec *);
-intkqueue_scan(struct kqueue *kq, int maxevents,
-   struct kevent *ulistp, struct timespec *timeout,
-   struct kevent *kev, struct proc *p, int *retval);
 
 intkqueue_read(struct file *, struct uio *, int);
 intkqueue_write(struct file *, struct uio *, int);
@@ -521,6 +518,14 @@ kqueue_alloc(struct filedesc *fdp)
return (kq);
 }
 
+void
+kqueue_exit(struct proc *p)
+{
+   kqueue_terminate(p, p->p_kq);
+   kqueue_free(p->p_kq);
+   p->p_kq = NULL;
+}
+
 int
 sys_kqueue(struct proc *p, void *v, register_t *retval)
 {
@@ -554,6 +559,7 @@ out:
 int
 sys_kevent(struct proc *p, void *v, register_t *retval)
 {
+   struct kqueue_scan_state scan;
struct filedesc* fdp = p->p_fd;
struct sys_kevent_args /* {
syscallarg(int) fd;
@@ -569,6 +575,7 @@ sys_kevent(struct proc *p, void *v, regi
struct timespec ts;
struct timespec *tsp = NULL;
int i, n, nerrors, error;
+   int ready, total;
struct kevent kev[KQ_NEVENTS];
 
if ((fp = fd_getfile(fdp, SCARG(uap, fd))) == NULL)
@@ -597,9 +604,9 @@ sys_kevent(struct proc *p, void *v, regi
kq = fp->f_data;
nerrors = 0;
 
-   while (SCARG(uap, nchanges) > 0) {
-   n = SCARG(uap, nchanges) > KQ_NEVENTS ?
-   KQ_NEVENTS : SCARG(uap, nchanges);
+   while ((n = SCARG(uap, nchanges)) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
error = copyin(SCARG(uap, changelist), kev,
n * sizeof(struct kevent));
if (error)
@@ -635,12 +642,39 @@ sys_kevent(struct proc *p, void *v, regi
goto done;
}
 
+
KQREF(kq);
FRELE(fp, p);
-   error = kqueue_scan(kq, SCARG(uap, nevents), SCARG(uap, eventlist),
-   tsp, kev, p, &n);
+   /*
+* Collect as many events as we can.  The timeout on successive
+* loops is disabled (kqueue_scan() becomes non-blocking).
+*/
+   total = 0;
+   error = 0;
+   kqueue_scan_setup(&scan, kq);
+   while ((n = SCARG(uap, nevents) - total) > 0) {
+   if (n > nitems(kev))
+   n = nitems(kev);
+   ready = kqueue_scan(&scan, n, kev, tsp, p, &error);
+   if (ready == 0)
+   break;
+   error = copyout(kev, SCARG(uap, eventlist) + total,
+   sizeof(struct kevent) * ready);
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktrevent(p, kev, ready);
+#endif
+   total += ready;
+   if (error || ready < n)
+   break;
+   tsp = &ts;   /* successive loops non-blocking */
+   timespecclear(tsp);
+   }
+   kqueue_scan_finish(&scan);
KQRELE(kq);
-   *retval = n;
+   if (error == EWOULDBLOCK)
+   error = 0;
+   *retval = total;
return (error);
 
  done:
@@ -894,24 +928,22 @@ kqueue_sleep(struct kqueue *kq, struct t
return (error);
 }
 
+/*
+ * Scan the kqueue, blocking if necessary until the target time is reached.
+ * If tsp is NULL we block indefinitely.  If tsp->ts_secs/nsecs are both
+ * 0 we do not block at all.
+ */
 int
-kqueue_scan(struct kqueue *kq, int maxevents, struct kevent *ulistp,
-struct timespec *tsp, struct kevent 

Re: mmap: Do not push KERNEL_LOCK() too far

2020-10-02 Thread Martin Pieuchot
On 01/10/20(Thu) 21:44, Mark Kettenis wrote:
> > Date: Thu, 1 Oct 2020 14:10:56 +0200
> > From: Martin Pieuchot 
> > 
> > While studying a bug report from naddy@ in 2017 when testing guenther@'s
> > amap/anon locking diff I figured out that we have been too optimistic in
> > the !MAP_ANON case.
> > 
> > The reported panic involves, I'd guess, a race between fd_getfile() and
> > vref():
> > 
> >   panic: vref used where vget required
> >   db_enter() at db_enter+0x5
> >   panic() at panic+0x129
> >   vref(ff03b20d29e8) at vref+0x5d
> >   uvn_attach(101,ff03a5879dc0) at uvn_attach+0x11d
> >   uvm_mmapfile(7,ff03a5879dc0,2,1,13,10012) at uvm_mmapfile+0x12c
> >   sys_mmap(c50,8000225f82a0,1) at sys_mmap+0x604
> >   syscall() at syscall+0x279
> >   --- syscall (number 198) ---
> >   end of kernel
> > 
> > Removing the KERNEL_LOCK() from file mapping was out of the scope of this
> > previous work, so I'd like to go back to a single KERNEL_LOCK/UNLOCK dance
> > in this code path to remove any false positive.
> > 
> > Note that this code is currently always run under KERNEL_LOCK() so this
> > will only have effect once the syscall will be unlocked.
> > 
> > ok?
> 
> Hmm, I thought fd_getfile() was fully mpsafe.

It is to get a reference on `fp'.  However if the current thread
releases the KERNEL_LOCK() before calling vref(9) it might lose a
race.

> But I suppose the kernel lock needs to be grabbed before we start
> looking at the vnode?

Yes, or to say it otherwise not released.

> Your diff makes the locking a bit convoluted, but I suppose adding a
> KERNEL_UNLOCK() before every "goto out" is worse?

I tried to keep the diff as small as possible to not obfuscate the change.
If we want cleaner code we can move the !ANON case in a different function.

> > Index: uvm/uvm_mmap.c
> > ===
> > RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> > retrieving revision 1.161
> > diff -u -p -r1.161 uvm_mmap.c
> > --- uvm/uvm_mmap.c  4 Mar 2020 21:15:39 -   1.161
> > +++ uvm/uvm_mmap.c  28 Sep 2020 09:48:26 -
> > @@ -288,8 +288,11 @@ sys_mmap(struct proc *p, void *v, regist
> >  
> > /* check for file mappings (i.e. not anonymous) and verify file. */
> > if ((flags & MAP_ANON) == 0) {
> > -   if ((fp = fd_getfile(fdp, fd)) == NULL)
> > -   return (EBADF);
> > +   KERNEL_LOCK();
> > +   if ((fp = fd_getfile(fdp, fd)) == NULL) {
> > +   error = EBADF;
> > +   goto out;
> > +   }
> >  
> > if (fp->f_type != DTYPE_VNODE) {
> > error = ENODEV; /* only mmap vnodes! */
> > @@ -313,6 +316,7 @@ sys_mmap(struct proc *p, void *v, regist
> > flags |= MAP_ANON;
> > FRELE(fp, p);
> > fp = NULL;
> > +   KERNEL_UNLOCK();
> > goto is_anon;
> > }
> >  
> > @@ -362,9 +366,7 @@ sys_mmap(struct proc *p, void *v, regist
> >  * EPERM.
> >  */
> > if (fp->f_flag & FWRITE) {
> > -   KERNEL_LOCK();
> > error = VOP_GETATTR(vp, &va, p->p_ucred, p);
> > -   KERNEL_UNLOCK();
> > if (error)
> > goto out;
> > if ((va.va_flags & (IMMUTABLE|APPEND)) == 0)
> > @@ -390,9 +392,9 @@ sys_mmap(struct proc *p, void *v, regist
> > goto out;
> > }
> > }
> > -   KERNEL_LOCK();
> > error = uvm_mmapfile(&p->p_vmspace->vm_map, &addr, size, prot,
> > maxprot, flags, vp, pos, lim_cur(RLIMIT_MEMLOCK), p);
> > +   FRELE(fp, p);
> > KERNEL_UNLOCK();
> > } else {/* MAP_ANON case */
> > if (fd != -1)
> > @@ -428,7 +430,10 @@ is_anon:   /* label for SunOS style /dev/z
> > /* remember to add offset */
> > *retval = (register_t)(addr + pageoff);
> >  
> > +   return (error);
> > +
> >  out:
> > +   KERNEL_UNLOCK();
> > if (fp)
> > FRELE(fp, p);
> > return (error);
> > 
> > 



Re: KASSERT() for VOP_*

2020-10-01 Thread Martin Pieuchot
On 09/09/20(Wed) 08:41, Martin Pieuchot wrote:
> This is mostly the same diff that has been backed out months ago with
> the VOP_CLOSE() case fixed.  VOP_CLOSE() can accept a NULL argument
> instead of `curproc' when garbage collecting passed FDs.
> 
> The intent is to stop passing a "struct proc *" when a function applies
> only to `curproc'.  Synchronization/locking primitives are obviously
> different if a CPU can modify the fields of any thread or only of the
> current one.

Now that we're early in the release cycle I'd like to get this in.

Any ok?

> Index: kern/vfs_vops.c
> ===
> RCS file: /cvs/src/sys/kern/vfs_vops.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 vfs_vops.c
> --- kern/vfs_vops.c   8 Apr 2020 08:07:51 -   1.28
> +++ kern/vfs_vops.c   27 Apr 2020 08:10:02 -
> @@ -145,6 +145,8 @@ VOP_OPEN(struct vnode *vp, int mode, str
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
> +
>   if (vp->v_op->vop_open == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -164,6 +166,7 @@ VOP_CLOSE(struct vnode *vp, int fflag, s
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == NULL || p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_close == NULL)
> @@ -184,6 +187,7 @@ VOP_ACCESS(struct vnode *vp, int mode, s
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_access == NULL)
> @@ -202,6 +206,7 @@ VOP_GETATTR(struct vnode *vp, struct vat
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_getattr == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -219,6 +224,7 @@ VOP_SETATTR(struct vnode *vp, struct vat
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_setattr == NULL)
> @@ -282,6 +288,7 @@ VOP_IOCTL(struct vnode *vp, u_long comma
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_ioctl == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -300,6 +307,7 @@ VOP_POLL(struct vnode *vp, int fflag, in
>   a.a_events = events;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_poll == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -344,6 +352,7 @@ VOP_FSYNC(struct vnode *vp, struct ucred
>   a.a_waitfor = waitfor;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_fsync == NULL)
> @@ -565,6 +574,7 @@ VOP_INACTIVE(struct vnode *vp, struct pr
>   a.a_vp = vp;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_inactive == NULL)
> @@ -581,6 +591,7 @@ VOP_RECLAIM(struct vnode *vp, struct pro
>   a.a_vp = vp;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_reclaim == NULL)
>   return (EOPNOTSUPP);
>  
> 



amap: KASSERT()s and local variables

2020-10-01 Thread Martin Pieuchot
Use more KASSERT()s instead of the "if (x) panic()" idiom for sanity
checks and add a couple of local variables to reduce the difference
with NetBSD and help for upcoming locking.

ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.84
diff -u -p -r1.84 uvm_amap.c
--- uvm/uvm_amap.c  25 Sep 2020 08:04:48 -  1.84
+++ uvm/uvm_amap.c  1 Oct 2020 12:13:23 -
@@ -460,8 +460,7 @@ amap_wipeout(struct vm_amap *amap)
map ^= 1 << slot;
anon = chunk->ac_anon[slot];
 
-   if (anon == NULL || anon->an_ref == 0)
-   panic("amap_wipeout: corrupt amap");
+   KASSERT(anon != NULL && anon->an_ref != 0);
 
refs = --anon->an_ref;
if (refs == 0) {
@@ -669,9 +668,7 @@ ReStart:
pg = anon->an_page;
 
/* page must be resident since parent is wired */
-   if (pg == NULL)
-   panic("amap_cow_now: non-resident wired page"
-   " in anon %p", anon);
+   KASSERT(pg != NULL);
 
/*
 * if the anon ref count is one, we are safe (the child
@@ -740,24 +737,23 @@ ReStart:
 void
 amap_splitref(struct vm_aref *origref, struct vm_aref *splitref, vaddr_t 
offset)
 {
+   struct vm_amap *amap = origref->ar_amap;
int leftslots;
 
AMAP_B2SLOT(leftslots, offset);
-   if (leftslots == 0)
-   panic("amap_splitref: split at zero offset");
+   KASSERT(leftslots != 0);
 
-   /* now: we have a valid am_mapped array. */
-   if (origref->ar_amap->am_nslot - origref->ar_pageoff - leftslots <= 0)
-   panic("amap_splitref: map size check failed");
+   KASSERT(amap->am_nslot - origref->ar_pageoff - leftslots > 0);
 
 #ifdef UVM_AMAP_PPREF
-/* establish ppref before we add a duplicate reference to the amap */
-   if (origref->ar_amap->am_ppref == NULL)
-   amap_pp_establish(origref->ar_amap);
+/* Establish ppref before we add a duplicate reference to the amap. */
+   if (amap->am_ppref == NULL)
+   amap_pp_establish(amap);
 #endif
 
-   splitref->ar_amap = origref->ar_amap;
-   splitref->ar_amap->am_ref++;/* not a share reference */
+   /* Note: not a share reference. */
+   amap->am_ref++;
+   splitref->ar_amap = amap;
splitref->ar_pageoff = origref->ar_pageoff + leftslots;
 }
 
@@ -828,9 +824,7 @@ amap_pp_adjref(struct vm_amap *amap, int
 * now adjust reference counts in range.  merge the first
 * changed entry with the last unchanged entry if possible.
 */
-   if (lcv != curslot)
-   panic("amap_pp_adjref: overshot target");
-
+   KASSERT(lcv == curslot);
for (/* lcv already set */; lcv < stopslot ; lcv += len) {
pp_getreflen(ppref, lcv, &ref, &len);
if (lcv + len > stopslot) { /* goes past end? */
@@ -840,8 +834,7 @@ amap_pp_adjref(struct vm_amap *amap, int
len = stopslot - lcv;
}
ref += adjval;
-   if (ref < 0)
-   panic("amap_pp_adjref: negative reference count");
+   KASSERT(ref >= 0);
if (lcv == prevlcv + prevlen && ref == prevref) {
pp_setreflen(ppref, prevlcv, ref, prevlen + len);
} else {
@@ -1104,20 +1097,17 @@ amap_add(struct vm_aref *aref, vaddr_t o
 
slot = UVM_AMAP_SLOTIDX(slot);
if (replace) {
-   if (chunk->ac_anon[slot] == NULL)
-   panic("amap_add: replacing null anon");
-   if (chunk->ac_anon[slot]->an_page != NULL &&
-   (amap->am_flags & AMAP_SHARED) != 0) {
-   pmap_page_protect(chunk->ac_anon[slot]->an_page,
-   PROT_NONE);
+   struct vm_anon *oanon  = chunk->ac_anon[slot];
+
+   KASSERT(oanon != NULL);
+   if (oanon->an_page && (amap->am_flags & AMAP_SHARED) != 0) {
+   pmap_page_protect(oanon->an_page, PROT_NONE);
/*
 * XXX: suppose page is supposed to be wired somewhere?
 */
}
} else {   /* !replace */
-   if (chunk->ac_anon[slot] != NULL)
-   panic("amap_add: slot in use");
-
+   KASSERT(chunk->ac_anon[slot] == NULL);
chunk->ac_usedmap |= 1 << slot;
amap->am_nused++;
}
@@ -1140,12 +1130,10 @@ amap_unadd(struct vm_aref *aref, vaddr_t
slot += aref->ar_pageoff;
KASSERT(slot < amap->am_nslot);
chunk = 

mmap: Do not push KERNEL_LOCK() too far

2020-10-01 Thread Martin Pieuchot
While studying a bug report from naddy@ in 2017 when testing guenther@'s
amap/anon locking diff I figured out that we have been too optimistic in
the !MAP_ANON case.

The reported panic involves, I'd guess, a race between fd_getfile() and
vref():

  panic: vref used where vget required
  db_enter() at db_enter+0x5
  panic() at panic+0x129
  vref(ff03b20d29e8) at vref+0x5d
  uvn_attach(101,ff03a5879dc0) at uvn_attach+0x11d
  uvm_mmapfile(7,ff03a5879dc0,2,1,13,10012) at uvm_mmapfile+0x12c
  sys_mmap(c50,8000225f82a0,1) at sys_mmap+0x604
  syscall() at syscall+0x279
  --- syscall (number 198) ---
  end of kernel

Removing the KERNEL_LOCK() from file mapping was out of the scope of this
previous work, so I'd like to go back to a single KERNEL_LOCK/UNLOCK dance
in this code path to remove any false positive.

Note that this code is currently always run under KERNEL_LOCK() so this
will only have effect once the syscall will be unlocked.

ok?

Index: uvm/uvm_mmap.c
===
RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
retrieving revision 1.161
diff -u -p -r1.161 uvm_mmap.c
--- uvm/uvm_mmap.c  4 Mar 2020 21:15:39 -   1.161
+++ uvm/uvm_mmap.c  28 Sep 2020 09:48:26 -
@@ -288,8 +288,11 @@ sys_mmap(struct proc *p, void *v, regist
 
/* check for file mappings (i.e. not anonymous) and verify file. */
if ((flags & MAP_ANON) == 0) {
-   if ((fp = fd_getfile(fdp, fd)) == NULL)
-   return (EBADF);
+   KERNEL_LOCK();
+   if ((fp = fd_getfile(fdp, fd)) == NULL) {
+   error = EBADF;
+   goto out;
+   }
 
if (fp->f_type != DTYPE_VNODE) {
error = ENODEV; /* only mmap vnodes! */
@@ -313,6 +316,7 @@ sys_mmap(struct proc *p, void *v, regist
flags |= MAP_ANON;
FRELE(fp, p);
fp = NULL;
+   KERNEL_UNLOCK();
goto is_anon;
}
 
@@ -362,9 +366,7 @@ sys_mmap(struct proc *p, void *v, regist
 * EPERM.
 */
if (fp->f_flag & FWRITE) {
-   KERNEL_LOCK();
error = VOP_GETATTR(vp, &va, p->p_ucred, p);
-   KERNEL_UNLOCK();
if (error)
goto out;
if ((va.va_flags & (IMMUTABLE|APPEND)) == 0)
@@ -390,9 +392,9 @@ sys_mmap(struct proc *p, void *v, regist
goto out;
}
}
-   KERNEL_LOCK();
error = uvm_mmapfile(&p->p_vmspace->vm_map, &addr, size, prot,
maxprot, flags, vp, pos, lim_cur(RLIMIT_MEMLOCK), p);
+   FRELE(fp, p);
KERNEL_UNLOCK();
} else {/* MAP_ANON case */
if (fd != -1)
@@ -428,7 +430,10 @@ is_anon:   /* label for SunOS style /dev/z
/* remember to add offset */
*retval = (register_t)(addr + pageoff);
 
+   return (error);
+
 out:
+   KERNEL_UNLOCK();
if (fp)
FRELE(fp, p);
return (error);



Garbage fix for USB_GET_FULL_DESC

2020-09-28 Thread Martin Pieuchot
Copy with uiomove(9) the correct size of the descriptor and not a random
value from the stack.  This is Coverity CID 1497167.

As I understand it there's no security impact as the size is always
capped by `ufd_size'; however the returned descriptor might be corrupted,
which can explain why userland applications might randomly fail.

ok?

Index: ugen.c
===
RCS file: /cvs/src/sys/dev/usb/ugen.c,v
retrieving revision 1.107
diff -u -p -u -5 -r1.107 ugen.c
--- ugen.c  2 Sep 2020 12:36:12 -   1.107
+++ ugen.c  28 Sep 2020 09:12:47 -
@@ -1121,10 +1121,11 @@ ugen_do_ioctl(struct ugen_softc *sc, int
 
cdesc = usbd_get_cdesc(sc->sc_udev, fd->ufd_config_index,
&cdesc_len);
if (cdesc == NULL)
return (EINVAL);
+   len = cdesc_len;
if (len > fd->ufd_size)
len = fd->ufd_size;
iov.iov_base = (caddr_t)fd->ufd_data;
iov.iov_len = len;
uio.uio_iov = 



uvm_swapisfull()

2020-09-25 Thread Martin Pieuchot
Introduce a new helper to check global variables that will need to be
protected from concurrent access.

The name is taken from NetBSD in order to reduce the difference between
our trees.

Diff below introduces a syntax change in uvmpd_scan() where the boolean
test "<" becomes "!=".  This shouldn't matter because `uvmexp.swpgonly'
can never be larger than `uvmexp.swpages'.

Ok?
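
The helper itself is expected to be trivial, along these lines (sketch,
keeping the KASSERT that the callers used to perform):

	int
	uvm_swapisfull(void)
	{
		KASSERT(uvmexp.swpgonly <= uvmexp.swpages);

		return (uvmexp.swpgonly == uvmexp.swpages);
	}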

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.101
diff -u -p -r1.101 uvm_fault.c
--- uvm/uvm_fault.c 24 Sep 2020 09:51:07 -  1.101
+++ uvm/uvm_fault.c 25 Sep 2020 08:28:51 -
@@ -881,7 +881,6 @@ ReFault:
/* check for out of RAM */
if (anon == NULL || pg == NULL) {
uvmfault_unlockall(&ufi, amap, NULL, oanon);
-   KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
if (anon == NULL)
uvmexp.fltnoanon++;
else {
@@ -889,7 +888,7 @@ ReFault:
uvmexp.fltnoram++;
}
 
-   if (uvmexp.swpgonly == uvmexp.swpages)
+   if (uvm_swapisfull())
return (ENOMEM);
 
/* out of RAM, wait for more */
@@ -942,8 +941,7 @@ ReFault:
 * as the map may change while we're asleep.
 */
uvmfault_unlockall(&ufi, amap, NULL, oanon);
-   KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
-   if (uvmexp.swpgonly == uvmexp.swpages) {
+   if (uvm_swapisfull()) {
/* XXX instrumentation */
return (ENOMEM);
}
@@ -1137,7 +1135,6 @@ Case2:
 
/* unlock and fail ... */
uvmfault_unlockall(&ufi, amap, uobj, NULL);
-   KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
if (anon == NULL)
uvmexp.fltnoanon++;
else {
@@ -1145,7 +1142,7 @@ Case2:
uvmexp.fltnoram++;
}
 
-   if (uvmexp.swpgonly == uvmexp.swpages)
+   if (uvm_swapisfull())
return (ENOMEM);
 
/* out of RAM, wait for more */
@@ -1191,11 +1188,10 @@ Case2:
if (amap_add(&ufi.entry->aref,
ufi.orig_rvaddr - ufi.entry->start, anon, 0)) {
uvmfault_unlockall(&ufi, amap, NULL, oanon);
-   KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
uvm_anfree(anon);
uvmexp.fltnoamap++;
 
-   if (uvmexp.swpgonly == uvmexp.swpages)
+   if (uvm_swapisfull())
return (ENOMEM);
 
amap_populate(&ufi.entry->aref,
@@ -1225,8 +1221,7 @@ Case2:
atomic_clearbits_int(&pg->pg_flags, PG_BUSY|PG_FAKE|PG_WANTED);
UVM_PAGE_OWN(pg, NULL);
uvmfault_unlockall(&ufi, amap, uobj, NULL);
-   KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
-   if (uvmexp.swpgonly == uvmexp.swpages) {
+   if (uvm_swapisfull()) {
/* XXX instrumentation */
return (ENOMEM);
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.86
diff -u -p -r1.86 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   4 Apr 2020 22:08:02 -   1.86
+++ uvm/uvm_pdaemon.c   25 Sep 2020 08:28:52 -
@@ -522,9 +522,7 @@ uvmpd_scan_inactive(struct pglist *pglst
 * reactivate it so that we eventually cycle
 * all pages thru the inactive queue.
 */
-   KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
-   if ((p->pg_flags & PQ_SWAPBACKED) &&
-   uvmexp.swpgonly == uvmexp.swpages) {
+   if ((p->pg_flags & PQ_SWAPBACKED) && uvm_swapisfull()) {
dirtyreacts++;
uvm_pageactivate(p);
continue;
@@ -879,7 +877,7 @@ uvmpd_scan(void)
swap_shortage = 0;
if (uvmexp.free < uvmexp.freetarg &&
uvmexp.swpginuse == uvmexp.swpages &&
-   uvmexp.swpgonly < uvmexp.swpages &&
+   !uvm_swapisfull() &&
pages_freed == 0) {
swap_shortage = uvmexp.freetarg - uvmexp.free;
}
Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.146
diff -u -p -r1.146 uvm_swap.c
--- 

Re: Push back kernel lock a bit in amd64 pageflttrap()

2020-09-25 Thread Martin Pieuchot
On 24/09/20(Thu) 16:35, Mark Kettenis wrote:
> This avoids taking the kernel lock when ci_inatomic is set.  This
> might speed up inteldrm(4) a bit.  Since uvm_grow() still needs the
> kernel lock, some reorganization of the code is necessary.
> 
> I'm not sure this actaully has an impact.  If we end up here with
> ci_inatomic set we're going to return EFAULT and take a slow path
> anyway.  So maybe it is better to leave this until we make uvm_grow()
> mpsafe?

I think it's valuable in itself.  Some questions below:

> Index: arch/amd64/amd64/trap.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/trap.c,v
> retrieving revision 1.81
> diff -u -p -r1.81 trap.c
> --- arch/amd64/amd64/trap.c   14 Sep 2020 12:51:28 -  1.81
> +++ arch/amd64/amd64/trap.c   24 Sep 2020 14:21:24 -
> @@ -173,11 +173,10 @@ pageflttrap(struct trapframe *frame, uin
>   pcb = >p_addr->u_pcb;
>   va = trunc_page((vaddr_t)cr2);
>  
> - KERNEL_LOCK();
> -
>   if (!usermode) {
>   /* This will only trigger if SMEP is enabled */
>   if (cr2 <= VM_MAXUSER_ADDRESS && frame->tf_err & PGEX_I) {
> + KERNEL_LOCK();

This is necessary to prevent another thread from corrupting the global
array `faultbuf' if another fault occurs, right?

That opens a question about how to protect `faultstr'.  Maybe we could
even grab the KERNEL_LOCK() inside fault().

>   fault("attempt to execute user address %p "
>   "in supervisor mode", (void *)cr2);
>   /* retain kernel lock */
> @@ -186,6 +185,7 @@ pageflttrap(struct trapframe *frame, uin
>   /* This will only trigger if SMAP is enabled */
>   if (pcb->pcb_onfault == NULL && cr2 <= VM_MAXUSER_ADDRESS &&
>   frame->tf_err & PGEX_P) {
> + KERNEL_LOCK();
>   fault("attempt to access user address %p "
>   "in supervisor mode", (void *)cr2);
>   /* retain kernel lock */
> @@ -216,28 +216,29 @@ pageflttrap(struct trapframe *frame, uin
>   caddr_t onfault = pcb->pcb_onfault;
>  
>   pcb->pcb_onfault = NULL;

`pcb_onfault' is no longer modified under KERNEL_LOCK().  Is it safe
because this member is "owned" by the current thread?  Maybe we should
start documenting this; the "struct proc" already has some examples.
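
Purely as a sketch of what I have in mind (the "[o]" marker and the
wording are made up, following the annotation style we use elsewhere):

	caddr_t	pcb_onfault;	/* [o] owned by the thread this PCB
				       belongs to; only that thread
				       touches it, so no lock needed */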

> + KERNEL_LOCK();
>   error = uvm_fault(map, va, frame->tf_err & PGEX_P ?
>   VM_FAULT_PROTECT : VM_FAULT_INVALID, ftype);
> + if (error == 0 && map != kernel_map)
> + uvm_grow(p, va);

Is there anything preventing us from having such idiom inside
uvm_fault()?  Not now, but maybe in the future.

> + KERNEL_UNLOCK();
>   pcb->pcb_onfault = onfault;
>   } else
>   error = EFAULT;
>  
> - if (error == 0) {
> - if (map != kernel_map)
> - uvm_grow(p, va);
> - } else if (!usermode) {
> + if (error && !usermode) {
>   if (pcb->pcb_onfault != 0) {
> - KERNEL_UNLOCK();
>   frame->tf_rip = (u_int64_t)pcb->pcb_onfault;
>   return 1;
>   } else {
>   /* bad memory access in the kernel */
> + KERNEL_LOCK();
>   fault("uvm_fault(%p, 0x%llx, 0, %d) -> %x",
>   map, cr2, ftype, error);
>   /* retain kernel lock */
>   return 0;
>   }
> - } else {
> + } else if (error) {
>   union sigval sv;
>   int signal, sicode;
>  
> @@ -260,8 +261,6 @@ pageflttrap(struct trapframe *frame, uin
>   sv.sival_ptr = (void *)cr2;
>   trapsignal(p, signal, T_PAGEFLT, sicode, sv);
>   }
> -
> - KERNEL_UNLOCK();
>  
>   return 1;
>  }
> 



amap: panic -> KASSERT

2020-09-24 Thread Martin Pieuchot
Convert various "if (x) panic()" idioms into "KASSERT(!x)".  The panic
message isn't helpful for such sanity checks and this helps reduce the
diff with NetBSD.

ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.83
diff -u -p -r1.83 uvm_amap.c
--- uvm/uvm_amap.c  22 Sep 2020 14:31:08 -  1.83
+++ uvm/uvm_amap.c  24 Sep 2020 09:47:54 -
@@ -1019,9 +1019,7 @@ amap_lookup(struct vm_aref *aref, vaddr_
 
AMAP_B2SLOT(slot, offset);
slot += aref->ar_pageoff;
-
-   if (slot >= amap->am_nslot)
-   panic("amap_lookup: offset out of range");
+   KASSERT(slot < amap->am_nslot);
 
chunk = amap_chunk_get(amap, slot, 0, PR_NOWAIT);
if (chunk == NULL)
@@ -1046,8 +1044,7 @@ amap_lookups(struct vm_aref *aref, vaddr
AMAP_B2SLOT(slot, offset);
slot += aref->ar_pageoff;
 
-   if ((slot + (npages - 1)) >= amap->am_nslot)
-   panic("amap_lookups: offset out of range");
+   KASSERT((slot + (npages - 1)) < amap->am_nslot);
 
for (i = 0, lcv = slot; lcv < slot + npages; i += n, lcv += n) {
n = UVM_AMAP_CHUNK - UVM_AMAP_SLOTIDX(lcv);
@@ -1078,9 +1075,7 @@ amap_populate(struct vm_aref *aref, vadd
 
AMAP_B2SLOT(slot, offset);
slot += aref->ar_pageoff;
-
-   if (slot >= amap->am_nslot)
-   panic("amap_populate: offset out of range");
+   KASSERT(slot < amap->am_nslot);
 
chunk = amap_chunk_get(amap, slot, 1, PR_WAITOK);
KASSERT(chunk != NULL);
@@ -1101,9 +1096,8 @@ amap_add(struct vm_aref *aref, vaddr_t o
 
AMAP_B2SLOT(slot, offset);
slot += aref->ar_pageoff;
+   KASSERT(slot < amap->am_nslot);
 
-   if (slot >= amap->am_nslot)
-   panic("amap_add: offset out of range");
chunk = amap_chunk_get(amap, slot, 1, PR_NOWAIT);
if (chunk == NULL)
return 1;
@@ -1144,9 +1138,7 @@ amap_unadd(struct vm_aref *aref, vaddr_t
 
AMAP_B2SLOT(slot, offset);
slot += aref->ar_pageoff;
-
-   if (slot >= amap->am_nslot)
-   panic("amap_unadd: offset out of range");
+   KASSERT(slot < amap->am_nslot);
chunk = amap_chunk_get(amap, slot, 0, PR_NOWAIT);
if (chunk == NULL)
panic("amap_unadd: chunk for slot %d not present", slot);



Re: uvm: __inline -> inline

2020-09-22 Thread Martin Pieuchot
On 22/09/20(Tue) 10:20, Mark Kettenis wrote:
> > Date: Tue, 22 Sep 2020 09:15:00 +0200
> > From: Martin Pieuchot 
> > 
> > Spell inline correctly, also reduce the diff with NetBSD for uvm_amap.c
> > and uvm_fault.c.
> > 
> > ok?
> 
> In general, yes.  This might interfere with the diff that guenther@
> did a while ago to lock amaps and unlock more of uvm.  Now that the
> uvm_map_inentry() mystery is (largely) solved, it may be worth looking
> into that diff again.  Or is that what you're doing right now?

That's what I am doing right now without the knowledge of guenther@'s
prior work; could you share it?



pmap_enter(9) doesn't sleep

2020-09-22 Thread Martin Pieuchot
Allocations in the various pmap_enter(9) are done with uvm_pagealloc(9),
which sets the UVM_PLA_NOWAIT flag, and/or with pool_get(9) w/ PR_NOWAIT.

So the comment below seems outdated to me, ok to kill it?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.98
diff -u -p -r1.98 uvm_fault.c
--- uvm/uvm_fault.c 12 Sep 2020 17:08:49 -  1.98
+++ uvm/uvm_fault.c 22 Sep 2020 07:46:43 -
@@ -702,13 +702,6 @@ ReFault:
pmap_update(ufi.orig_map->pmap);
 
/* (shadowed == TRUE) if there is an anon at the faulting address */
-   /*
-* note that if we are really short of RAM we could sleep in the above
-* call to pmap_enter.   bad?
-*
-* XXX Actually, that is bad; pmap_enter() should just fail in that
-* XXX case.  --thorpej
-*/
/*
 * if the desired page is not shadowed by the amap and we have a
 * backing object, then we check to see if the backing object would



uvm: __inline -> inline

2020-09-22 Thread Martin Pieuchot
Spell inline correctly, also reduce the diff with NetBSD for uvm_amap.c
and uvm_fault.c.

ok?

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.28
diff -u -p -r1.28 uvm_addr.c
--- uvm/uvm_addr.c  13 Sep 2020 10:05:25 -  1.28
+++ uvm/uvm_addr.c  22 Sep 2020 07:12:10 -
@@ -186,7 +186,7 @@ uvm_addr_entrybyspace(struct uaddr_free_
 }
 #endif /* !SMALL_KERNEL */
 
-static __inline vaddr_t
+static inline vaddr_t
 uvm_addr_align_forward(vaddr_t addr, vaddr_t align, vaddr_t offset)
 {
vaddr_t adjusted;
@@ -201,7 +201,7 @@ uvm_addr_align_forward(vaddr_t addr, vad
return (adjusted < addr ? adjusted + align : adjusted);
 }
 
-static __inline vaddr_t
+static inline vaddr_t
 uvm_addr_align_backward(vaddr_t addr, vaddr_t align, vaddr_t offset)
 {
vaddr_t adjusted;
Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.82
diff -u -p -r1.82 uvm_amap.c
--- uvm/uvm_amap.c  4 Jan 2020 16:17:29 -   1.82
+++ uvm/uvm_amap.c  22 Sep 2020 07:07:45 -
@@ -63,20 +63,20 @@ static char amap_small_pool_names[UVM_AM
  */
 
 static struct vm_amap *amap_alloc1(int, int, int);
-static __inline void amap_list_insert(struct vm_amap *);
-static __inline void amap_list_remove(struct vm_amap *);   
+static inline void amap_list_insert(struct vm_amap *);
+static inline void amap_list_remove(struct vm_amap *);   
 
 struct vm_amap_chunk *amap_chunk_get(struct vm_amap *, int, int, int);
 void amap_chunk_free(struct vm_amap *, struct vm_amap_chunk *);
 void amap_wiperange_chunk(struct vm_amap *, struct vm_amap_chunk *, int, int);
 
-static __inline void
+static inline void
 amap_list_insert(struct vm_amap *amap)
 {
LIST_INSERT_HEAD(_list, amap, am_list);
 }
 
-static __inline void
+static inline void
 amap_list_remove(struct vm_amap *amap)
 { 
LIST_REMOVE(amap, am_list);
@@ -190,13 +190,10 @@ amap_chunk_free(struct vm_amap *amap, st
  * here are some in-line functions to help us.
  */
 
-static __inline void pp_getreflen(int *, int, int *, int *);
-static __inline void pp_setreflen(int *, int, int, int);
-
 /*
  * pp_getreflen: get the reference and length for a specific offset
  */
-static __inline void
+static inline void
 pp_getreflen(int *ppref, int offset, int *refp, int *lenp)
 {
 
@@ -212,7 +209,7 @@ pp_getreflen(int *ppref, int offset, int
 /*
  * pp_setreflen: set the reference and length for a specific offset
  */
-static __inline void
+static inline void
 pp_setreflen(int *ppref, int offset, int ref, int len)
 {
if (len == 1) {
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.86
diff -u -p -r1.86 uvm_aobj.c
--- uvm/uvm_aobj.c  18 Jul 2019 23:47:33 -  1.86
+++ uvm/uvm_aobj.c  22 Sep 2020 07:11:50 -
@@ -256,7 +256,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-__inline static int
+inline static int
 uao_find_swslot(struct uvm_aobj *aobj, int pageidx)
 {
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.98
diff -u -p -r1.98 uvm_fault.c
--- uvm/uvm_fault.c 12 Sep 2020 17:08:49 -  1.98
+++ uvm/uvm_fault.c 22 Sep 2020 07:07:59 -
@@ -159,7 +159,7 @@ static struct uvm_advice uvmadvice[MADV_
  * private prototypes
  */
 static void uvmfault_amapcopy(struct uvm_faultinfo *);
-static __inline void uvmfault_anonflush(struct vm_anon **, int);
+static inline void uvmfault_anonflush(struct vm_anon **, int);
 void   uvmfault_unlockmaps(struct uvm_faultinfo *, boolean_t);
 void   uvmfault_update_stats(struct uvm_faultinfo *);
 
@@ -171,7 +171,7 @@ voiduvmfault_update_stats(struct uvm_fa
  *
  * => does not have to deactivate page if it is busy
  */
-static __inline void
+static inline void
 uvmfault_anonflush(struct vm_anon **anons, int n)
 {
int lcv;
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.267
diff -u -p -r1.267 uvm_map.c
--- uvm/uvm_map.c   14 Sep 2020 20:31:09 -  1.267
+++ uvm/uvm_map.c   22 Sep 2020 07:11:47 -
@@ -167,7 +167,7 @@ boolean_tuvm_map_inentry_fix(struct p
  * Tree management functions.
  */
 
-static __inline voiduvm_mapent_copy(struct vm_map_entry*,
+static inline void  uvm_mapent_copy(struct vm_map_entry*,
struct vm_map_entry*);
 static inline int   uvm_mapentry_addrcmp(const struct vm_map_entry*,
const struct vm_map_entry*);
@@ -361,7 +361,7 @@ uvm_mapentry_addrcmp(const struct vm_map
 /*
  * Copy mapentry.
  */

Re: sigabort(), p_sigmask & p_siglist

2020-09-16 Thread Martin Pieuchot
On 16/09/20(Wed) 02:08, Theo de Raadt wrote:
> Something doesn't feel right.
> 
> db_kill_cmd finds a process, called p, then kills it.  In your new diff
> calling sigexit, take note of the comment at the top:
> 
>  * Force the current process to exit with the specified signal, dumping core
> 
> current process?  Doesn't look like it, it looks like it kills p.  But then
> it calls exit1?
> 
> Is this actually behaving the same for the db_kill_cmd() call?

Indeed, I messed up: in sigexit() the "struct proc *" argument means
`curproc'.



Re: KASSERT() for VOP_*

2020-09-16 Thread Martin Pieuchot
On 09/09/20(Wed) 08:41, Martin Pieuchot wrote:
> This is mostly the same diff that has been backed out months ago with
> the VOP_CLOSE() case fixed.  VOP_CLOSE() can accept a NULL argument
> instead of `curproc' when garbage collecting passed FDs.
> 
> The intent is to stop passing a "struct proc *" when a function applies
> only to `curproc'.  Synchronization/locking primitives are obviously
> different if a CPU can modify the fields of any thread or only of the
> current one.

Anyone?

> Index: kern/vfs_vops.c
> ===
> RCS file: /cvs/src/sys/kern/vfs_vops.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 vfs_vops.c
> --- kern/vfs_vops.c   8 Apr 2020 08:07:51 -   1.28
> +++ kern/vfs_vops.c   27 Apr 2020 08:10:02 -
> @@ -145,6 +145,8 @@ VOP_OPEN(struct vnode *vp, int mode, str
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
> +
>   if (vp->v_op->vop_open == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -164,6 +166,7 @@ VOP_CLOSE(struct vnode *vp, int fflag, s
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == NULL || p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_close == NULL)
> @@ -184,6 +187,7 @@ VOP_ACCESS(struct vnode *vp, int mode, s
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_access == NULL)
> @@ -202,6 +206,7 @@ VOP_GETATTR(struct vnode *vp, struct vat
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_getattr == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -219,6 +224,7 @@ VOP_SETATTR(struct vnode *vp, struct vat
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_setattr == NULL)
> @@ -282,6 +288,7 @@ VOP_IOCTL(struct vnode *vp, u_long comma
>   a.a_cred = cred;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_ioctl == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -300,6 +307,7 @@ VOP_POLL(struct vnode *vp, int fflag, in
>   a.a_events = events;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_poll == NULL)
>   return (EOPNOTSUPP);
>  
> @@ -344,6 +352,7 @@ VOP_FSYNC(struct vnode *vp, struct ucred
>   a.a_waitfor = waitfor;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_fsync == NULL)
> @@ -565,6 +574,7 @@ VOP_INACTIVE(struct vnode *vp, struct pr
>   a.a_vp = vp;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   ASSERT_VP_ISLOCKED(vp);
>  
>   if (vp->v_op->vop_inactive == NULL)
> @@ -581,6 +591,7 @@ VOP_RECLAIM(struct vnode *vp, struct pro
>   a.a_vp = vp;
>   a.a_p = p;
>  
> + KASSERT(p == curproc);
>   if (vp->v_op->vop_reclaim == NULL)
>   return (EOPNOTSUPP);
>  
> 



Re: sigabort(), p_sigmask & p_siglist

2020-09-16 Thread Martin Pieuchot
On 16/09/20(Wed) 06:09, Miod Vallat wrote:
> 
> > Diff below introduces an helper for sending an uncatchable SIGABRT and
> > annotate that `p_siglist' and `p_sigmask' are updated using atomic
> > operations.
> 
> Why not use sigexit(p, SIGABRT); for that purpose?

That's a better solution indeed.  deraadt@ pointed out something that goes
in this direction as well because sigexit() parks siblings earlier and
that reduces the amount of noise between the detection of corruption
and the content of the coredump.

Updated diff below, ok?

Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.262
diff -u -p -r1.262 kern_sig.c
--- kern/kern_sig.c 13 Sep 2020 13:33:37 -  1.262
+++ kern/kern_sig.c 16 Sep 2020 07:45:26 -
@@ -122,6 +122,8 @@ const int sigprop[NSIG + 1] = {
 #definestopsigmask (sigmask(SIGSTOP) | sigmask(SIGTSTP) | \
sigmask(SIGTTIN) | sigmask(SIGTTOU))
 
+void setsigvec(struct proc *, int, struct sigaction *);
+
 void proc_stop(struct proc *p, int);
 void proc_stop_sweep(void *);
 void *proc_stop_si;
Index: kern/kern_pledge.c
===
RCS file: /cvs/src/sys/kern/kern_pledge.c,v
retrieving revision 1.263
diff -u -p -r1.263 kern_pledge.c
--- kern/kern_pledge.c  17 Jul 2020 16:28:19 -  1.263
+++ kern/kern_pledge.c  16 Sep 2020 07:45:28 -
@@ -529,7 +529,6 @@ pledge_fail(struct proc *p, int error, u
 {
const char *codes = "";
int i;
-   struct sigaction sa;
 
/* Print first matching pledge */
for (i = 0; code && pledgenames[i].bits != 0; i++)
@@ -550,11 +549,7 @@ pledge_fail(struct proc *p, int error, u
p->p_p->ps_acflag |= APLEDGE;
 
/* Send uncatchable SIGABRT for coredump */
-   memset(, 0, sizeof sa);
-   sa.sa_handler = SIG_DFL;
-   setsigvec(p, SIGABRT, );
-   atomic_clearbits_int(>p_sigmask, sigmask(SIGABRT));
-   psignal(p, SIGABRT);
+   sigexit(p, SIGABRT);
 
p->p_p->ps_pledge = 0;  /* Disable all PLEDGE_ flags */
KERNEL_UNLOCK();
Index: kern/kern_proc.c
===
RCS file: /cvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.86
diff -u -p -r1.86 kern_proc.c
--- kern/kern_proc.c30 Jan 2020 08:51:27 -  1.86
+++ kern/kern_proc.c16 Sep 2020 07:45:26 -
@@ -494,7 +494,6 @@ void
 db_kill_cmd(db_expr_t addr, int have_addr, db_expr_t count, char *modif)
 {
struct process *pr;
-   struct sigaction sa;
struct proc *p;
 
pr = prfind(addr);
@@ -506,11 +505,7 @@ db_kill_cmd(db_expr_t addr, int have_add
p = TAILQ_FIRST(>ps_threads);
 
/* Send uncatchable SIGABRT for coredump */
-   memset(, 0, sizeof sa);
-   sa.sa_handler = SIG_DFL;
-   setsigvec(p, SIGABRT, );
-   atomic_clearbits_int(>p_sigmask, sigmask(SIGABRT));
-   psignal(p, SIGABRT);
+   sigexit(p, SIGABRT);
 }
 
 void
Index: sys/signalvar.h
===
RCS file: /cvs/src/sys/sys/signalvar.h,v
retrieving revision 1.43
diff -u -p -r1.43 signalvar.h
--- sys/signalvar.h 13 Sep 2020 13:33:37 -  1.43
+++ sys/signalvar.h 16 Sep 2020 07:45:22 -
@@ -128,7 +128,6 @@ voidtrapsignal(struct proc *p, int sig,
 void   sigexit(struct proc *, int);
 intsigismasked(struct proc *, int);
 intsigonstack(size_t);
-void   setsigvec(struct proc *, int, struct sigaction *);
 intkillpg1(struct proc *, int, int, int);
 
 void   signal_init(void);



sigabort(), p_sigmask & p_siglist

2020-09-15 Thread Martin Pieuchot
Diff below introduces a helper for sending an uncatchable SIGABRT and
annotate that `p_siglist' and `p_sigmask' are updated using atomic
operations.

As a result setsigvec() becomes local to kern/kern_sig.c.

Note that other places in the kernel use sigexit(p, SIGABRT) for the
same purpose and aren't converted by this change.

Ok?

Index: kern/kern_pledge.c
===
RCS file: /cvs/src/sys/kern/kern_pledge.c,v
retrieving revision 1.263
diff -u -p -r1.263 kern_pledge.c
--- kern/kern_pledge.c  17 Jul 2020 16:28:19 -  1.263
+++ kern/kern_pledge.c  15 Sep 2020 08:30:19 -
@@ -529,7 +529,6 @@ pledge_fail(struct proc *p, int error, u
 {
const char *codes = "";
int i;
-   struct sigaction sa;
 
/* Print first matching pledge */
for (i = 0; code && pledgenames[i].bits != 0; i++)
@@ -550,11 +549,7 @@ pledge_fail(struct proc *p, int error, u
p->p_p->ps_acflag |= APLEDGE;
 
/* Send uncatchable SIGABRT for coredump */
-   memset(, 0, sizeof sa);
-   sa.sa_handler = SIG_DFL;
-   setsigvec(p, SIGABRT, );
-   atomic_clearbits_int(>p_sigmask, sigmask(SIGABRT));
-   psignal(p, SIGABRT);
+   sigabort(p);
 
p->p_p->ps_pledge = 0;  /* Disable all PLEDGE_ flags */
KERNEL_UNLOCK();
Index: kern/kern_proc.c
===
RCS file: /cvs/src/sys/kern/kern_proc.c,v
retrieving revision 1.86
diff -u -p -r1.86 kern_proc.c
--- kern/kern_proc.c30 Jan 2020 08:51:27 -  1.86
+++ kern/kern_proc.c15 Sep 2020 08:52:57 -
@@ -494,7 +494,6 @@ void
 db_kill_cmd(db_expr_t addr, int have_addr, db_expr_t count, char *modif)
 {
struct process *pr;
-   struct sigaction sa;
struct proc *p;
 
pr = prfind(addr);
@@ -506,11 +505,7 @@ db_kill_cmd(db_expr_t addr, int have_add
p = TAILQ_FIRST(>ps_threads);
 
/* Send uncatchable SIGABRT for coredump */
-   memset(, 0, sizeof sa);
-   sa.sa_handler = SIG_DFL;
-   setsigvec(p, SIGABRT, );
-   atomic_clearbits_int(>p_sigmask, sigmask(SIGABRT));
-   psignal(p, SIGABRT);
+   sigabort(p);
 }
 
 void
Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.262
diff -u -p -r1.262 kern_sig.c
--- kern/kern_sig.c 13 Sep 2020 13:33:37 -  1.262
+++ kern/kern_sig.c 15 Sep 2020 08:33:04 -
@@ -122,6 +122,8 @@ const int sigprop[NSIG + 1] = {
 #definestopsigmask (sigmask(SIGSTOP) | sigmask(SIGTSTP) | \
sigmask(SIGTTIN) | sigmask(SIGTTOU))
 
+void setsigvec(struct proc *, int, struct sigaction *);
+
 void proc_stop(struct proc *p, int);
 void proc_stop_sweep(void *);
 void *proc_stop_si;
@@ -1483,6 +1493,21 @@ sigexit(struct proc *p, int signum)
}
exit1(p, 0, signum, EXIT_NORMAL);
/* NOTREACHED */
+}
+
+/*
+ * Send uncatchable SIGABRT for coredump.
+ */
+void
+sigabort(struct proc *p)
+{
+   struct sigaction sa;
+
+   memset(, 0, sizeof sa);
+   sa.sa_handler = SIG_DFL;
+   setsigvec(p, SIGABRT, );
+   atomic_clearbits_int(>p_sigmask, sigmask(SIGABRT));
+   psignal(p, SIGABRT);
 }
 
 /*
Index: sys/signalvar.h
===
RCS file: /cvs/src/sys/sys/signalvar.h,v
retrieving revision 1.43
diff -u -p -r1.43 signalvar.h
--- sys/signalvar.h 13 Sep 2020 13:33:37 -  1.43
+++ sys/signalvar.h 15 Sep 2020 08:34:15 -
@@ -126,9 +126,9 @@ voidsiginit(struct sigacts *);
 void   trapsignal(struct proc *p, int sig, u_long code, int type,
union sigval val);
 void   sigexit(struct proc *, int);
+void   sigabort(struct proc *);
 intsigismasked(struct proc *, int);
 intsigonstack(size_t);
-void   setsigvec(struct proc *, int, struct sigaction *);
 intkillpg1(struct proc *, int, int, int);
 
 void   signal_init(void);
Index: sys/proc.h
===
RCS file: /cvs/src/sys/sys/proc.h,v
retrieving revision 1.299
diff -u -p -r1.299 proc.h
--- sys/proc.h  26 Aug 2020 03:19:09 -  1.299
+++ sys/proc.h  15 Sep 2020 10:15:36 -
@@ -383,14 +383,14 @@ struct proc {
struct  kcov_dev *p_kd; /* kcov device handle */
struct  lock_list_entry *p_sleeplocks;  /* WITNESS lock tracking */ 
 
-   int  p_siglist; /* Signals arrived but not delivered. */
+   int  p_siglist; /* [a] Signals arrived & not delivered*/
 
 /* End area that is zeroed on creation. */
 #definep_endzero   p_startcopy
 
 /* The following fields are all copied upon creation in fork. */
 #definep_startcopy p_sigmask
-   sigset_t p_sigmask; /* Current signal mask. */
+   sigset_t p_sigmask; /* [a] Current signal 

curproc vs MP vs locking

2020-09-15 Thread Martin Pieuchot
Many functions in the kernel take a "struct proc *" as argument.  When
reviewing diffs or reading the signature of such functions it is not
clear if this pointer can be any thread or if it is, like in many cases,
pointing to `curproc'.

This distinction matters when it comes to reading/writing members of
this "struct proc" and that's why a growing number of functions start
with the following idiom:

KASSERT(p == curproc);

This is verbose and redundant, so I suggested to always use `curproc'
and stop passing a "struct proc *" as argument when a function isn't
meant to modify any thread.  claudio@ raised a performance concern,
claiming that `curproc' isn't always cheap.  Is that still true?  Do
the KASSERT()s make us pay the cost anyhow?

If that's the case, can we adopt a convention to help review functions
that take a "struct proc *" but only mean `curproc'?  What about naming
this parameter `curp' instead of `p'?
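
To make the question concrete, here is roughly what the two alternatives
could look like; the example_priority*() functions are made up for
illustration and don't exist in the tree:

	/* Option A: no thread argument at all, always act on `curproc'. */
	int
	example_priority(void)
	{
		return (curproc->p_usrpri);
	}

	/* Option B: keep the argument but call it `curp' to document that
	 * only the current thread may be passed. */
	int
	example_priority_curp(struct proc *curp)
	{
		KASSERT(curp == curproc);
		return (curp->p_usrpri);
	}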



Re: go/rust vs uvm_map_inentry()

2020-09-13 Thread Martin Pieuchot
On 13/09/20(Sun) 16:54, Mark Kettenis wrote:
> > Date: Sun, 13 Sep 2020 16:49:48 +0200
> > From: Sebastien Marie 
> > 
> > On Sun, Sep 13, 2020 at 03:29:57PM +0200, Martin Pieuchot wrote:
> > > I'm no longer able to reproduce the corruption while building lang/go
> > > with the diff below.  Something relevant to threading change in go since
> > > march?
> > > 
> > > Can someone try this diff and tell me if go and/or rust still fail?
> > 
> > quickly tested with rustc build (nightly here), and it is failing at
> > random places (not always at the same) with memory errors (signal
> > 11, compiler ICE signal 6...)
> 
> Is it failing when you don't have tracing enabled and not failing when
> the tracing is disabled perhaps?

It is failing even without tracing.



Re: pppoe: move softc list out of NET_LOCK() into new pppoe lock

2020-09-13 Thread Martin Pieuchot
On 13/09/20(Sun) 15:12, Klemens Nanni wrote:
> This is my first try trading global locks for interface specific ones.
> 
> pppoe(4) keeps a list of all its interfaces which is then obviously
> traversed during create and destroy.
> 
> Currently, the net lock is grabbed for this, but there seems to be no
> justification other than reusing^Wabusing an existing lock.
> 
> I run this diff with WITNESS and kern.witness=2 on my edgerouter 4
> providing my home uplink via pppoe0:  the kernel runs stable, there's
> not witness log showing up and creating and destroying hundreds of
> additional pppoe(4) devices works without disruption.
> 
> Is this the right direction?

I doubt it is, see below:

> Index: if_pppoe.c
> ===
> RCS file: /cvs/src/sys/net/if_pppoe.c,v
> retrieving revision 1.73
> diff -u -p -r1.73 if_pppoe.c
> --- if_pppoe.c13 Sep 2020 11:00:40 -  1.73
> +++ if_pppoe.c13 Sep 2020 11:31:12 -
> @@ -460,8 +463,10 @@ static void pppoe_dispatch_disc_pkt(stru
>   err_msg = "TAG HUNIQUE ERROR";
>   break;
>   }
> + rw_enter_read(_lock);
>   sc = pppoe_find_softc_by_hunique(mtod(n, caddr_t) + 
> noff,
>   len, m->m_pkthdr.ph_ifidx);
> + rw_exit_read(_lock);

This introduces a new sleeping point in the packet processing path,
something we are avoiding thanks to the NET_LOCK().

Plus this use of a lock alone is insufficient to guarantee the integrity
of `sc', because `sc' is still used outside of the lock/unlock dance.

The NET_LOCK() is fine here; please concentrate on the KERNEL_LOCK(),
which is where contention happens.



go/rust vs uvm_map_inentry()

2020-09-13 Thread Martin Pieuchot
I'm no longer able to reproduce the corruption while building lang/go
with the diff below.  Did something relevant to threading change in go
since March?

Can someone try this diff and tell me if go and/or rust still fail?

Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.266
diff -u -p -r1.266 uvm_map.c
--- uvm/uvm_map.c   12 Sep 2020 17:08:50 -  1.266
+++ uvm/uvm_map.c   13 Sep 2020 10:12:25 -
@@ -1893,16 +1893,16 @@ uvm_map_inentry(struct proc *p, struct p
boolean_t ok = TRUE;
 
if (uvm_map_inentry_recheck(serial, addr, ie)) {
-   KERNEL_LOCK();
ok = uvm_map_inentry_fix(p, ie, addr, fn, serial);
if (!ok) {
+   KERNEL_LOCK();
printf(fmt, p->p_p->ps_comm, p->p_p->ps_pid, p->p_tid,
addr, ie->ie_start, ie->ie_end);
p->p_p->ps_acflag |= AMAP;
sv.sival_ptr = (void *)PROC_PC(p);
trapsignal(p, SIGSEGV, 0, SEGV_ACCERR, sv);
+   KERNEL_UNLOCK();
}
-   KERNEL_UNLOCK();
}
return (ok);
 }



Re: pppoe: start documenting locks

2020-09-13 Thread Martin Pieuchot
On 13/09/20(Sun) 10:05, Klemens Nanni wrote:
> 
> Here's a start at struct pppoe_softc;  for every member I went through
> code paths looking for *_LOCK() or *_ASSERT_LOCKED().
> 
> Pretty much all members are under the net lock, some are proctected by
> both net and kernel lock, e.g. the start routine is called with
> KERNEL_LOCK().
> 
> I did not go through the sppp struct members yet.
> 
> Does this look correct?

Without doing another audit, but given that pseudo-devices are
generally run by a thread holding the NET_LOCK(), I'd assume it's ok.

> From here on, I think we can start and already pull out a few of those
> members and put them under a new pppoe(4) specific lock.

First we should remove the KERNEL_LOCK() from around pppoeintr().  The
NET_LOCK() is not an issue right now.

One comment below, either way ok mpi@

> Index: if_pppoe.c
> ===
> RCS file: /cvs/src/sys/net/if_pppoe.c,v
> retrieving revision 1.70
> diff -u -p -r1.70 if_pppoe.c
> --- if_pppoe.c28 Jul 2020 09:52:32 -  1.70
> +++ if_pppoe.c20 Aug 2020 15:27:09 -
> @@ -114,27 +115,34 @@ struct pppoetag {
>  #define  PPPOE_DISC_MAXPADI  4   /* retry PADI four times 
> (quickly) */
>  #define  PPPOE_DISC_MAXPADR  2   /* retry PADR twice */
>  
> +/*
> + * Locks used to protect struct members and global data
> + *   I   immutable after creation
> + *   K   kernel lock

I wouldn't bother repeating 'I' and 'K' if they are not used in the
description below.

> + *   N   net lock
> + */
> +
>  struct pppoe_softc {
>   struct sppp sc_sppp;/* contains a struct ifnet as first 
> element */
> - LIST_ENTRY(pppoe_softc) sc_list;
> - unsigned int sc_eth_ifidx;
> + LIST_ENTRY(pppoe_softc) sc_list;/* [N] */
> + unsigned int sc_eth_ifidx;  /* [N] */
>  
> - int sc_state;   /* discovery phase or session connected 
> */
> - struct ether_addr sc_dest;  /* hardware address of concentrator */
> - u_int16_t sc_session;   /* PPPoE session id */
> -
> - char *sc_service_name;  /* if != NULL: requested name of 
> service */
> - char *sc_concentrator_name; /* if != NULL: requested concentrator 
> id */
> - u_int8_t *sc_ac_cookie; /* content of AC cookie we must echo 
> back */
> - size_t sc_ac_cookie_len;/* length of cookie data */
> - u_int8_t *sc_relay_sid; /* content of relay SID we must echo 
> back */
> - size_t sc_relay_sid_len;/* length of relay SID data */
> - u_int32_t sc_unique;/* our unique id */
> - struct timeout sc_timeout;  /* timeout while not in session state */
> - int sc_padi_retried;/* number of PADI retries already done 
> */
> - int sc_padr_retried;/* number of PADR retries already done 
> */
> + int sc_state;   /* [N] discovery phase or session 
> connected */
> + struct ether_addr sc_dest;  /* [N] hardware address of concentrator 
> */
> + u_int16_t sc_session;   /* [N] PPPoE session id */
> +
> + char *sc_service_name;  /* [N] if != NULL: requested name of 
> service */
> + char *sc_concentrator_name; /* [N] if != NULL: requested 
> concentrator id */
> + u_int8_t *sc_ac_cookie; /* [N] content of AC cookie we must 
> echo back */
> + size_t sc_ac_cookie_len;/* [N] length of cookie data */
> + u_int8_t *sc_relay_sid; /* [N] content of relay SID we must 
> echo back */
> + size_t sc_relay_sid_len;/* [N] length of relay SID data */
> + u_int32_t sc_unique;/* [I] our unique id */
> + struct timeout sc_timeout;  /* [N] timeout while not in session 
> state */
> + int sc_padi_retried;/* [N] number of PADI retries already 
> done */
> + int sc_padr_retried;/* [N] number of PADR retries already 
> done */
>  
> - struct timeval sc_session_time; /* time the session was established */
> + struct timeval sc_session_time; /* [N] time the session was established 
> */
>  };
>  
>  /* incoming traffic will be queued here */
> 



Re: sppp: add free() sizes

2020-09-12 Thread Martin Pieuchot
On 12/09/20(Sat) 14:49, Klemens Nanni wrote:
> These are the last free(buf, 0) occurences in if_pppoe.c and
> if_spppsubr.c changing to non-zero sizes.
> 
> I've been running with this the last week without any issues.
> 
> Feedback? OK?

Maybe store `pwdlen' and `idlen' in "struct sppp" instead of recomputing
them every time?

Another approach would be to always use arrays of AUTHMAXLEN; I'm not sure
the size justifies two malloc(9) calls.

Anyway the diff is ok mpi@
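
A rough sketch of the first suggestion; `pp_idlen' and `pp_pwdlen' are
made-up "struct sppp" fields that would cache the sizes handed to
malloc(9) when the strings are set:

	if (sp->myauth.name != NULL)
		free(sp->myauth.name, M_DEVBUF, sp->pp_idlen);
	if (sp->myauth.secret != NULL)
		free(sp->myauth.secret, M_DEVBUF, sp->pp_pwdlen);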

> Index: if_spppsubr.c
> ===
> RCS file: /cvs/src/sys/net/if_spppsubr.c,v
> retrieving revision 1.186
> diff -u -p -r1.186 if_spppsubr.c
> --- if_spppsubr.c 22 Aug 2020 16:12:12 -  1.186
> +++ if_spppsubr.c 3 Sep 2020 21:43:54 -
> @@ -750,13 +750,15 @@ sppp_detach(struct ifnet *ifp)
>  
>   /* release authentication data */
>   if (sp->myauth.name != NULL)
> - free(sp->myauth.name, M_DEVBUF, 0);
> + free(sp->myauth.name, M_DEVBUF, strlen(sp->myauth.name) + 1);
>   if (sp->myauth.secret != NULL)
> - free(sp->myauth.secret, M_DEVBUF, 0);
> + free(sp->myauth.secret, M_DEVBUF,
> + strlen(sp->myauth.secret) + 1);
>   if (sp->hisauth.name != NULL)
> - free(sp->hisauth.name, M_DEVBUF, 0);
> + free(sp->hisauth.name, M_DEVBUF, strlen(sp->hisauth.name) + 1);
>   if (sp->hisauth.secret != NULL)
> - free(sp->hisauth.secret, M_DEVBUF, 0);
> + free(sp->hisauth.secret, M_DEVBUF,
> + strlen(sp->hisauth.secret) + 1);
>  }
>  
>  /*
> @@ -4579,9 +4587,11 @@ sppp_set_params(struct sppp *sp, struct 
>   if (spa->proto == 0) {
>   /* resetting auth */
>   if (auth->name != NULL)
> - free(auth->name, M_DEVBUF, 0);
> + free(auth->name, M_DEVBUF,
> + strlen(auth->name) + 1);
>   if (auth->secret != NULL)
> - free(auth->secret, M_DEVBUF, 0);
> + free(auth->secret, M_DEVBUF,
> + strlen(auth->secret) + 1);
>   bzero(auth, sizeof *auth);
>   explicit_bzero(sp->chap_challenge, sizeof 
> sp->chap_challenge);
>   } else {
> @@ -4594,7 +4604,8 @@ sppp_set_params(struct sppp *sp, struct 
>   p = malloc(len, M_DEVBUF, M_WAITOK);
>   strlcpy(p, spa->name, len);
>   if (auth->name != NULL)
> - free(auth->name, M_DEVBUF, 0);
> + free(auth->name, M_DEVBUF,
> + strlen(auth->name) + 1);
>   auth->name = p;
>  
>   if (spa->secret[0] != '\0') {
> @@ -4603,7 +4614,8 @@ sppp_set_params(struct sppp *sp, struct 
>   p = malloc(len, M_DEVBUF, M_WAITOK);
>   strlcpy(p, spa->secret, len);
>   if (auth->secret != NULL)
> - free(auth->secret, M_DEVBUF, 0);
> + free(auth->secret, M_DEVBUF,
> + strlen(auth->secret) + 1);
>   auth->secret = p;
>   } else if (!auth->secret) {
>   p = malloc(1, M_DEVBUF, M_WAITOK);
> 



UVM tracepoints for dt(4)

2020-09-11 Thread Martin Pieuchot
To investigate the race exposed by the last locking change in
uvm_map_inentry() [0], I'd like to add the following tracepoints.

The idea is to compare page fault addresses and permissions with
the insertion/removal of entries in a given map.  Diff below is
the first part of the puzzle, ok?

[0] https://marc.info/?l=openbsd-tech=157293690312531=2

An example of a bt(5) script using those tracepoints looks like this:

  tracepoint:uvm:fault {
      printf("%s:%d(%s) fault   0x%x type=0x%x, prot=0x%x\n",
          nsecs, tid, comm, arg0, arg1, arg2);
  }
  tracepoint:uvm:map_insert {
      printf("%s:%d(%s) insert [0x%x, 0x%x), prot=0x%x\n",
          nsecs, tid, comm, arg0, arg1, arg2);
  }
  tracepoint:uvm:map_remove {
      printf("%s:%d(%s) remove [0x%x, 0x%x) prot=0x%x\n",
          nsecs, tid, comm, arg0, arg1, arg2);
  }


Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.97
diff -u -p -r1.97 uvm_fault.c
--- uvm/uvm_fault.c 8 Dec 2019 12:37:45 -   1.97
+++ uvm/uvm_fault.c 11 Sep 2020 07:16:01 -
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -507,6 +508,7 @@ uvm_fault(vm_map_t orig_map, vaddr_t vad
pg = NULL;
 
uvmexp.faults++;/* XXX: locking? */
+   TRACEPOINT(uvm, fault, vaddr, fault_type, access_type, NULL);
 
/* init the IN parameters in the ufi */
ufi.orig_map = orig_map;
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.265
diff -u -p -r1.265 uvm_map.c
--- uvm/uvm_map.c   6 Jul 2020 19:22:40 -   1.265
+++ uvm/uvm_map.c   11 Sep 2020 07:41:53 -
@@ -95,6 +95,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef SYSVSHM
 #include 
@@ -455,6 +456,9 @@ uvm_mapent_addr_insert(struct vm_map *ma
KDASSERT((entry->start & (vaddr_t)PAGE_MASK) == 0 &&
(entry->end & (vaddr_t)PAGE_MASK) == 0);
 
+   TRACEPOINT(uvm, map_insert,
+   entry->start, entry->end, entry->protection, NULL);
+
UVM_MAP_REQ_WRITE(map);
res = RBT_INSERT(uvm_map_addr, >addr, entry);
if (res != NULL) {
@@ -474,6 +478,9 @@ void
 uvm_mapent_addr_remove(struct vm_map *map, struct vm_map_entry *entry)
 {
struct vm_map_entry *res;
+
+   TRACEPOINT(uvm, map_remove,
+   entry->start, entry->end, entry->protection, NULL);
 
UVM_MAP_REQ_WRITE(map);
res = RBT_REMOVE(uvm_map_addr, >addr, entry);
Index: dev/dt/dt_prov_static.c
===
RCS file: /cvs/src/sys/dev/dt/dt_prov_static.c,v
retrieving revision 1.2
diff -u -p -r1.2 dt_prov_static.c
--- dev/dt/dt_prov_static.c 25 Mar 2020 14:59:23 -  1.2
+++ dev/dt/dt_prov_static.c 11 Sep 2020 07:43:58 -
@@ -34,7 +34,7 @@ struct dt_provider dt_prov_static = {
 };
 
 /*
- * Scheduler provider
+ * Scheduler
  */
 DT_STATIC_PROBE2(sched, dequeue, "pid_t", "pid_t");
 DT_STATIC_PROBE2(sched, enqueue, "pid_t", "pid_t");
@@ -51,6 +51,13 @@ DT_STATIC_PROBE1(raw_syscalls, sys_enter
 DT_STATIC_PROBE1(raw_syscalls, sys_exit, "register_t");
 
 /*
+ * UVM
+ */
+DT_STATIC_PROBE3(uvm, fault, "vaddr_t", "vm_fault_t", "vm_prot_t");
+DT_STATIC_PROBE3(uvm, map_insert, "vaddr_t", "vaddr_t", "vm_prot_t");
+DT_STATIC_PROBE3(uvm, map_remove, "vaddr_t", "vaddr_t", "vm_prot_t");
+
+/*
  * List of all static probes
  */
 struct dt_probe *dtps_static[] = {
@@ -65,6 +72,10 @@ struct dt_probe *dtps_static[] = {
/* Raw syscalls */
&_DT_STATIC_P(raw_syscalls, sys_enter),
&_DT_STATIC_P(raw_syscalls, sys_exit),
+   /* UVM */
+   &_DT_STATIC_P(uvm, fault),
+   &_DT_STATIC_P(uvm, map_insert),
+   &_DT_STATIC_P(uvm, map_remove),
 };
 
 int



Re: issignal() w/o KERNEL_LOCK()

2020-09-09 Thread Martin Pieuchot
On 09/09/20(Wed) 10:02, Claudio Jeker wrote:
> On Wed, Sep 09, 2020 at 08:33:30AM +0200, Martin Pieuchot wrote:
> > Per-process data structures needed to suspend the execution of threads
> > are since recently protected by the SCHED_LOCK().  So the KERNEL_LOCK()
> > dance inside issignal() is no longer necessary and can be removed, ok?
> 
> This is not quite right. single_thread_set() still needs the
> KERNEL_LOCK() to avoid racing against itself.

Which data structure still requires the KERNEL_LOCK()?  Are you talking
about the per-process list of threads?

Isn't the SCHED_LOCK() enough to prevent racing against itself?



KASSERT() for VOP_*

2020-09-09 Thread Martin Pieuchot
This is mostly the same diff that was backed out months ago, with
the VOP_CLOSE() case fixed.  VOP_CLOSE() can accept a NULL argument
instead of `curproc' when garbage collecting passed FDs.

The intent is to stop passing a "struct proc *" when a function applies
only to `curproc'.  Synchronization/locking primitives are obviously
different if a CPU can modify the fields of any thread or only of the
current one.

Index: kern/vfs_vops.c
===
RCS file: /cvs/src/sys/kern/vfs_vops.c,v
retrieving revision 1.28
diff -u -p -r1.28 vfs_vops.c
--- kern/vfs_vops.c 8 Apr 2020 08:07:51 -   1.28
+++ kern/vfs_vops.c 27 Apr 2020 08:10:02 -
@@ -145,6 +145,8 @@ VOP_OPEN(struct vnode *vp, int mode, str
a.a_cred = cred;
a.a_p = p;
 
+   KASSERT(p == curproc);
+
if (vp->v_op->vop_open == NULL)
return (EOPNOTSUPP);
 
@@ -164,6 +166,7 @@ VOP_CLOSE(struct vnode *vp, int fflag, s
a.a_cred = cred;
a.a_p = p;
 
+   KASSERT(p == NULL || p == curproc);
ASSERT_VP_ISLOCKED(vp);
 
if (vp->v_op->vop_close == NULL)
@@ -184,6 +187,7 @@ VOP_ACCESS(struct vnode *vp, int mode, s
a.a_cred = cred;
a.a_p = p;
 
+   KASSERT(p == curproc);
ASSERT_VP_ISLOCKED(vp);
 
if (vp->v_op->vop_access == NULL)
@@ -202,6 +206,7 @@ VOP_GETATTR(struct vnode *vp, struct vat
a.a_cred = cred;
a.a_p = p;
 
+   KASSERT(p == curproc);
if (vp->v_op->vop_getattr == NULL)
return (EOPNOTSUPP);
 
@@ -219,6 +224,7 @@ VOP_SETATTR(struct vnode *vp, struct vat
a.a_cred = cred;
a.a_p = p;
 
+   KASSERT(p == curproc);
ASSERT_VP_ISLOCKED(vp);
 
if (vp->v_op->vop_setattr == NULL)
@@ -282,6 +288,7 @@ VOP_IOCTL(struct vnode *vp, u_long comma
a.a_cred = cred;
a.a_p = p;
 
+   KASSERT(p == curproc);
if (vp->v_op->vop_ioctl == NULL)
return (EOPNOTSUPP);
 
@@ -300,6 +307,7 @@ VOP_POLL(struct vnode *vp, int fflag, in
a.a_events = events;
a.a_p = p;
 
+   KASSERT(p == curproc);
if (vp->v_op->vop_poll == NULL)
return (EOPNOTSUPP);
 
@@ -344,6 +352,7 @@ VOP_FSYNC(struct vnode *vp, struct ucred
a.a_waitfor = waitfor;
a.a_p = p;
 
+   KASSERT(p == curproc);
ASSERT_VP_ISLOCKED(vp);
 
if (vp->v_op->vop_fsync == NULL)
@@ -565,6 +574,7 @@ VOP_INACTIVE(struct vnode *vp, struct pr
a.a_vp = vp;
a.a_p = p;
 
+   KASSERT(p == curproc);
ASSERT_VP_ISLOCKED(vp);
 
if (vp->v_op->vop_inactive == NULL)
@@ -581,6 +591,7 @@ VOP_RECLAIM(struct vnode *vp, struct pro
a.a_vp = vp;
a.a_p = p;
 
+   KASSERT(p == curproc);
if (vp->v_op->vop_reclaim == NULL)
return (EOPNOTSUPP);
 



sigismasked()

2020-09-09 Thread Martin Pieuchot
Simple helper function to centralize the manipulation of `ps_sigignore'
and `p_sigmask' in kern/kern_sig.c, so that the corresponding asserts
can be added later on, ok?

Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.260
diff -u -p -r1.260 kern_sig.c
--- kern/kern_sig.c 26 Aug 2020 03:16:53 -  1.260
+++ kern/kern_sig.c 8 Sep 2020 05:46:25 -
@@ -1486,6 +1486,22 @@ sigexit(struct proc *p, int signum)
/* NOTREACHED */
 }
 
+/*
+ * Return 1 if `sig', a given signal, is ignored or masked for `p', a given
+ * thread, and 0 otherwise.
+ */
+int
+sigismasked(struct proc *p, int sig)
+{
+   struct process *pr = p->p_p;
+
+   if ((pr->ps_sigacts->ps_sigignore & sigmask(sig)) ||
+   (p->p_sigmask & sigmask(sig)))
+   return 1;
+
+   return 0;
+}
+
 int nosuidcoredump = 1;
 
 struct coredump_iostate {
Index: kern/tty_pty.c
===
RCS file: /cvs/src/sys/kern/tty_pty.c,v
retrieving revision 1.103
diff -u -p -r1.103 tty_pty.c
--- kern/tty_pty.c  20 Jul 2020 14:34:16 -  1.103
+++ kern/tty_pty.c  8 Sep 2020 05:28:46 -
@@ -289,8 +289,7 @@ ptsread(dev_t dev, struct uio *uio, int 
 again:
if (pti->pt_flags & PF_REMOTE) {
while (isbackground(pr, tp)) {
-   if ((pr->ps_sigacts->ps_sigignore & sigmask(SIGTTIN)) ||
-   (p->p_sigmask & sigmask(SIGTTIN)) ||
+   if (sigismasked(p, SIGTTIN) ||
pr->ps_pgrp->pg_jobc == 0 ||
pr->ps_flags & PS_PPWAIT)
return (EIO);
Index: kern/tty.c
===
RCS file: /cvs/src/sys/kern/tty.c,v
retrieving revision 1.163
diff -u -p -r1.163 tty.c
--- kern/tty.c  22 Jul 2020 17:39:50 -  1.163
+++ kern/tty.c  8 Sep 2020 05:28:46 -
@@ -744,8 +744,7 @@ ttioctl(struct tty *tp, u_long cmd, cadd
case  TIOCSWINSZ:
while (isbackground(pr, tp) &&
(pr->ps_flags & PS_PPWAIT) == 0 &&
-   (pr->ps_sigacts->ps_sigignore & sigmask(SIGTTOU)) == 0 &&
-   (p->p_sigmask & sigmask(SIGTTOU)) == 0) {
+   !sigismasked(p, SIGTTOU)) {
if (pr->ps_pgrp->pg_jobc == 0)
return (EIO);
pgsignal(pr->ps_pgrp, SIGTTOU, 1);
@@ -1498,8 +1497,7 @@ loop: lflag = tp->t_lflag;
 * Hang process if it's in the background.
 */
if (isbackground(pr, tp)) {
-   if ((pr->ps_sigacts->ps_sigignore & sigmask(SIGTTIN)) ||
-  (p->p_sigmask & sigmask(SIGTTIN)) ||
+   if (sigismasked(p, SIGTTIN) ||
pr->ps_flags & PS_PPWAIT || pr->ps_pgrp->pg_jobc == 0) {
error = EIO;
goto out;
@@ -1749,8 +1747,7 @@ loop:
pr = p->p_p;
if (isbackground(pr, tp) &&
ISSET(tp->t_lflag, TOSTOP) && (pr->ps_flags & PS_PPWAIT) == 0 &&
-   (pr->ps_sigacts->ps_sigignore & sigmask(SIGTTOU)) == 0 &&
-   (p->p_sigmask & sigmask(SIGTTOU)) == 0) {
+   !sigismasked(p, SIGTTOU)) {
if (pr->ps_pgrp->pg_jobc == 0) {
error = EIO;
goto out;
Index: sys/signalvar.h
===
RCS file: /cvs/src/sys/sys/signalvar.h,v
retrieving revision 1.41
diff -u -p -r1.41 signalvar.h
--- sys/signalvar.h 10 May 2020 00:56:06 -  1.41
+++ sys/signalvar.h 8 Sep 2020 05:29:10 -
@@ -126,6 +126,7 @@ voidsiginit(struct process *);
 void   trapsignal(struct proc *p, int sig, u_long code, int type,
union sigval val);
 void   sigexit(struct proc *, int);
+intsigismasked(struct proc *, int);
 intsigonstack(size_t);
 void   setsigvec(struct proc *, int, struct sigaction *);
 intkillpg1(struct proc *, int, int, int);



issignal() w/o KERNEL_LOCK()

2020-09-09 Thread Martin Pieuchot
Per-process data structures needed to suspend the execution of threads
have recently been put under the SCHED_LOCK().  So the KERNEL_LOCK()
dance inside issignal() is no longer necessary and can be removed, ok?

Note that CURSIG() is currently always called with the KERNEL_LOCK()
held so the code below is redundant.

This is a step towards getting signal handling out of ze big lock.

Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.260
diff -u -p -r1.260 kern_sig.c
--- kern/kern_sig.c 26 Aug 2020 03:16:53 -  1.260
+++ kern/kern_sig.c 8 Sep 2020 05:48:51 -
@@ -1203,11 +1203,7 @@ issignal(struct proc *p)
signum != SIGKILL) {
pr->ps_xsig = signum;
 
-   if (dolock)
-   KERNEL_LOCK();
single_thread_set(p, SINGLE_PTRACE, 0);
-   if (dolock)
-   KERNEL_UNLOCK();
 
if (dolock)
SCHED_LOCK(s);
@@ -1215,11 +1211,7 @@ issignal(struct proc *p)
if (dolock)
SCHED_UNLOCK(s);
 
-   if (dolock)
-   KERNEL_LOCK();
single_thread_clear(p, 0);
-   if (dolock)
-   KERNEL_UNLOCK();
 
/*
 * If we are no longer being traced, or the parent
@@ -1484,6 +1476,22 @@ sigexit(struct proc *p, int signum)
}
exit1(p, 0, signum, EXIT_NORMAL);
/* NOTREACHED */
+}
+
+/*
+ * Return 1 if `sig', a given signal, is ignored or masked for `p', a given
+ * thread, and 0 otherwise.
+ */
+int
+sigismasked(struct proc *p, int sig)
+{
+   struct process *pr = p->p_p;
+
+   if ((pr->ps_sigacts->ps_sigignore & sigmask(sig)) ||
+   (p->p_sigmask & sigmask(sig)))
+   return 1;
+
+   return 0;
 }
 
 int nosuidcoredump = 1;



m_defrag(9) leak

2020-08-25 Thread Martin Pieuchot
Maxime Villard mentioned a leak due to a missing m_freem() in wg(4):
https://marc.info/?l=netbsd-tech-net=159827988018641=2

It seems that such a leak is present in other uses of m_defrag() in
the tree.  I won't take the time to go through all of them, but an audit
would be welcome.
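
For anyone doing that audit, the pattern to grep for looks roughly like
this generic sketch (not taken from any particular driver):

	/* Leaky: if m_defrag(9) fails, the caller still owns `m'. */
	if (m_defrag(m, M_NOWAIT) != 0)
		return (NULL);		/* `m' is leaked here */

	/* Fixed: free the chain before dropping the only reference. */
	if (m_defrag(m, M_NOWAIT) != 0) {
		m_freem(m);
		return (NULL);
	}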

Index: net//if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.12
diff -u -p -r1.12 if_wg.c
--- net//if_wg.c21 Aug 2020 22:59:27 -  1.12
+++ net//if_wg.c25 Aug 2020 06:34:32 -
@@ -2022,8 +2022,10 @@ wg_input(void *_sc, struct mbuf *m, stru
/* m has a IP/IPv6 header of hlen length, we don't need it anymore. */
m_adj(m, hlen);
 
-   if (m_defrag(m, M_NOWAIT) != 0)
+   if (m_defrag(m, M_NOWAIT) != 0) {
+   m_freem(m);
return NULL;
+   }
 
if ((m->m_pkthdr.len == sizeof(struct wg_pkt_initiation) &&
*mtod(m, uint32_t *) == WG_PKT_INITIATION) ||



Enable EVFILT_EXCEPT

2020-08-21 Thread Martin Pieuchot
The kqueue-based poll(2) backend is still a WIP due to regressions in
the kqueue layer.  In the meantime, should we expose EVFILT_EXCEPT to
userland?  The diff below should be enough to allow userland apps to
use the new code paths. 

ok?
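
For what it's worth, a userland consumer could look like the sketch
below, assuming a connected socket and that NOTE_OOB (out-of-band data)
is the condition of interest; error handling is kept minimal:

	#include <sys/types.h>
	#include <sys/event.h>
	#include <sys/time.h>

	#include <err.h>
	#include <unistd.h>

	/* Block until an exceptional condition is reported on `fd'. */
	void
	wait_for_oob(int fd)
	{
		struct kevent ev;
		int kq, n;

		if ((kq = kqueue()) == -1)
			err(1, "kqueue");
		EV_SET(&ev, fd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
		if (kevent(kq, &ev, 1, NULL, 0, NULL) == -1)
			err(1, "kevent: register");
		if ((n = kevent(kq, NULL, 0, &ev, 1, NULL)) == -1)
			err(1, "kevent: wait");
		if (n > 0)
			warnx("exceptional condition on fd %d", fd);
		close(kq);
	}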

Index: sys/event.h
===
RCS file: /cvs/src/sys/sys/event.h,v
retrieving revision 1.44
diff -u -p -r1.44 event.h
--- sys/event.h 22 Jun 2020 13:14:32 -  1.44
+++ sys/event.h 21 Aug 2020 07:09:31 -
@@ -41,7 +41,7 @@
 #define EVFILT_DEVICE  (-8)/* devices */
 #define EVFILT_EXCEPT  (-9)/* exceptional conditions */
 
-#define EVFILT_SYSCOUNT8
+#define EVFILT_SYSCOUNT9
 
 #define EV_SET(kevp, a, b, c, d, e, f) do {\
struct kevent *__kevp = (kevp); \



Re: Fewer pool_get() in kqueue_register()

2020-08-19 Thread Martin Pieuchot
On 18/08/20(Tue) 15:30, Visa Hankala wrote:
> On Tue, Aug 18, 2020 at 11:04:47AM +0200, Martin Pieuchot wrote:
> > Diff below changes the order of operations in kqueue_register() to get
> > rid of an unnecessary pool_get().  When an event is already present on
> > the list try to acquire it first.  Note that knote_acquire() may sleep
> > in which case the list might have changed so the lookup has to always
> > begin from the start.
> > 
> > This will help with lazy removal of knote in poll/select.  In this
> > scenario EV_ADD is generally always done with an knote already on the
> > list.
> 
> Some of the overhead could be absorbed by using a pool cache, as shown
> in the diff below. However, I am not suggesting that the cache should
> be taken into use right now. The frequency of knote pool usage is
> relatively low currently; there are other pools that would benefit more
> from caching.

Agreed, this is a nice idea to revisit.  Do you have a way to measure
which pool could benefit from caches?

I'm also not in a hurry.  That said I'd like to be able to re-use
descriptors that are already on the kqueue.  For a per-thread kqueue there's
no possible race.  And since we'll need some sort of serialization to
unlock kevent(2) this could be built on top of it.

> A related question is what implications the increased use of the pool
> cache feature would have under memory pressure.

Do you have a suggestion on how to measure this as well?  Could dt(4)
probes or evtcount() help us? 

> Index: kern/init_main.c
> ===
> RCS file: src/sys/kern/init_main.c,v
> retrieving revision 1.300
> diff -u -p -r1.300 init_main.c
> --- kern/init_main.c  16 Jun 2020 05:09:29 -  1.300
> +++ kern/init_main.c  18 Aug 2020 15:09:38 -
> @@ -71,6 +71,7 @@
>  #include 
>  #endif
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -148,7 +149,6 @@ void  crypto_init(void);
>  void db_ctf_init(void);
>  void prof_init(void);
>  void init_exec(void);
> -void kqueue_init(void);
>  void futex_init(void);
>  void taskq_init(void);
>  void timeout_proc_init(void);
> @@ -431,7 +431,9 @@ main(void *framep)
>   prof_init();
>  #endif
>  
> - mbcpuinit();/* enable per cpu mbuf data */
> + /* Enable per cpu data. */
> + mbcpuinit();
> + kqueue_init_percpu();
>  
>   /* init exec and emul */
>   init_exec();
> Index: kern/kern_event.c
> ===
> RCS file: src/sys/kern/kern_event.c,v
> retrieving revision 1.142
> diff -u -p -r1.142 kern_event.c
> --- kern/kern_event.c 12 Aug 2020 13:49:24 -  1.142
> +++ kern/kern_event.c 18 Aug 2020 15:09:38 -
> @@ -205,6 +205,12 @@ kqueue_init(void)
>   PR_WAITOK, "knotepl", NULL);
>  }
>  
> +void
> +kqueue_init_percpu(void)
> +{
> + pool_cache_init(_pool);
> +}
> +
>  int
>  filt_fileattach(struct knote *kn)
>  {
> Index: sys/event.h
> ===
> RCS file: src/sys/sys/event.h,v
> retrieving revision 1.44
> diff -u -p -r1.44 event.h
> --- sys/event.h   22 Jun 2020 13:14:32 -  1.44
> +++ sys/event.h   18 Aug 2020 15:09:38 -
> @@ -210,6 +210,8 @@ extern void   knote_activate(struct knote 
>  extern void  knote_remove(struct proc *p, struct knlist *list);
>  extern void  knote_fdclose(struct proc *p, int fd);
>  extern void  knote_processexit(struct proc *);
> +extern void  kqueue_init(void);
> +extern void  kqueue_init_percpu(void);
>  extern int   kqueue_register(struct kqueue *kq,
>   struct kevent *kev, struct proc *p);
>  extern int   filt_seltrue(struct knote *kn, long hint);



Re: Fewer pool_get() in kqueue_register()

2020-08-18 Thread Martin Pieuchot
On 18/08/20(Tue) 11:22, Mark Kettenis wrote:
> > Date: Tue, 18 Aug 2020 11:04:47 +0200
> > From: Martin Pieuchot 
> > 
> > Diff below changes the order of operations in kqueue_register() to get
> > rid of an unnecessary pool_get().  When an event is already present on
> > the list try to acquire it first.  Note that knote_acquire() may sleep
> > in which case the list might have changed so the lookup has to always
> > begin from the start.
> > 
> > This will help with lazy removal of knote in poll/select.  In this
> > scenario EV_ADD is generally always done with an knote already on the
> > list.
> > 
> > ok?
> 
> But pool_get() may sleep as well.  In my experience it is better to do
> the resource allocation up front and release afterwards if it turned
> out you didn't need the resource.  That's what the current code does.

There's indeed a possible race when multiple threads try to register the
same event on the same kqueue, which would lead to a double insert.  Thanks!

> > Index: kern/kern_event.c
> > ===
> > RCS file: /cvs/src/sys/kern/kern_event.c,v
> > retrieving revision 1.142
> > diff -u -p -r1.142 kern_event.c
> > --- kern/kern_event.c   12 Aug 2020 13:49:24 -  1.142
> > +++ kern/kern_event.c   18 Aug 2020 08:58:27 -
> > @@ -696,7 +696,7 @@ kqueue_register(struct kqueue *kq, struc
> > struct filedesc *fdp = kq->kq_fdp;
> > const struct filterops *fops = NULL;
> > struct file *fp = NULL;
> > -   struct knote *kn = NULL, *newkn = NULL;
> > +   struct knote *kn, *newkn = NULL;
> > struct knlist *list = NULL;
> > int s, error = 0;
> >  
> > @@ -721,22 +721,12 @@ kqueue_register(struct kqueue *kq, struc
> > return (EBADF);
> > }
> >  
> > -   if (kev->flags & EV_ADD)
> > -   newkn = pool_get(_pool, PR_WAITOK | PR_ZERO);
> > -
> >  again:
> > +   kn = NULL;
> > if (fops->f_flags & FILTEROP_ISFD) {
> > -   if ((fp = fd_getfile(fdp, kev->ident)) == NULL) {
> > -   error = EBADF;
> > -   goto done;
> > -   }
> > -   if (kev->flags & EV_ADD)
> > -   kqueue_expand_list(kq, kev->ident);
> > if (kev->ident < kq->kq_knlistsize)
> > list = >kq_knlist[kev->ident];
> > } else {
> > -   if (kev->flags & EV_ADD)
> > -   kqueue_expand_hash(kq);
> > if (kq->kq_knhashmask != 0) {
> > list = >kq_knhash[
> > KN_HASH((u_long)kev->ident, kq->kq_knhashmask)];
> > @@ -749,10 +739,6 @@ again:
> > s = splhigh();
> > if (!knote_acquire(kn)) {
> > splx(s);
> > -   if (fp != NULL) {
> > -   FRELE(fp, p);
> > -   fp = NULL;
> > -   }
> > goto again;
> > }
> > splx(s);
> > @@ -760,6 +746,21 @@ again:
> > }
> > }
> > }
> > +
> > +   if (kev->flags & EV_ADD && kn == NULL) {
> > +   newkn = pool_get(_pool, PR_WAITOK | PR_ZERO);
> > +   if (fops->f_flags & FILTEROP_ISFD) {
> > +   if ((fp = fd_getfile(fdp, kev->ident)) == NULL) {
> > +   error = EBADF;
> > +   goto done;
> > +   }
> > +   kqueue_expand_list(kq, kev->ident);
> > +   } else {
> > +   kqueue_expand_hash(kq);
> > +   }
> > +
> > +   }
> > +
> > KASSERT(kn == NULL || (kn->kn_status & KN_PROCESSING) != 0);
> >  
> > if (kn == NULL && ((kev->flags & EV_ADD) == 0)) {
> > 
> > 



Push KERNEL_LOCK/UNLOCK in trapsignal()

2020-08-18 Thread Martin Pieuchot
Taken from a larger diff from claudio@, this reduces the lock dances in
MD code and puts it where we should focus our effort, in kern/kern_sig.c.

ok?

Index: kern/kern_sig.c
===
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c 15 Jun 2020 13:18:33 -  1.258
+++ kern/kern_sig.c 18 Aug 2020 09:34:11 -
@@ -802,6 +802,7 @@ trapsignal(struct proc *p, int signum, u
struct sigacts *ps = pr->ps_sigacts;
int mask;
 
+   KERNEL_LOCK();
switch (signum) {
case SIGILL:
case SIGBUS:
@@ -842,6 +843,7 @@ trapsignal(struct proc *p, int signum, u
sigexit(p, signum);
ptsignal(p, signum, STHREAD);
}
+   KERNEL_UNLOCK();
 }
 
 /*
Index: arch/alpha/alpha/trap.c
===
RCS file: /cvs/src/sys/arch/alpha/alpha/trap.c,v
retrieving revision 1.88
diff -u -p -r1.88 trap.c
--- arch/alpha/alpha/trap.c 6 Sep 2019 12:22:01 -   1.88
+++ arch/alpha/alpha/trap.c 18 Aug 2020 09:18:54 -
@@ -488,9 +488,7 @@ do_fault:
printtrap(a0, a1, a2, entry, framep, 1, user);
 #endif
sv.sival_ptr = v;
-   KERNEL_LOCK();
trapsignal(p, i, ucode, typ, sv);
-   KERNEL_UNLOCK();
 out:
if (user) {
/* Do any deferred user pmap operations. */
Index: arch/amd64/amd64/trap.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/trap.c,v
retrieving revision 1.79
diff -u -p -r1.79 trap.c
--- arch/amd64/amd64/trap.c 21 Jan 2020 03:06:39 -  1.79
+++ arch/amd64/amd64/trap.c 18 Aug 2020 09:18:54 -
@@ -391,9 +391,7 @@ usertrap(struct trapframe *frame)
}
 
sv.sival_ptr = (void *)frame->tf_rip;
-   KERNEL_LOCK();
trapsignal(p, sig, type, code, sv);
-   KERNEL_UNLOCK();
 
 out:
userret(p);
Index: arch/arm/arm/fault.c
===
RCS file: /cvs/src/sys/arch/arm/arm/fault.c,v
retrieving revision 1.39
diff -u -p -r1.39 fault.c
--- arch/arm/arm/fault.c6 Sep 2019 12:22:01 -   1.39
+++ arch/arm/arm/fault.c18 Aug 2020 09:18:54 -
@@ -373,9 +373,7 @@ data_abort_handler(trapframe_t *tf)
sd.trap = fsr;
 do_trapsignal:
sv.sival_int = sd.addr;
-   KERNEL_LOCK();
trapsignal(p, sd.signo, sd.trap, sd.code, sv);
-   KERNEL_UNLOCK();
 out:
/* If returning to user mode, make sure to invoke userret() */
if (user)
@@ -596,13 +594,9 @@ prefetch_abort_handler(trapframe_t *tf)
printf("UVM: pid %d (%s), uid %d killed: "
"out of swap\n", p->p_p->ps_pid, p->p_p->ps_comm,
p->p_ucred ? (int)p->p_ucred->cr_uid : -1);
-   KERNEL_LOCK();
trapsignal(p, SIGKILL, 0, SEGV_MAPERR, sv);
-   KERNEL_UNLOCK();
} else {
-   KERNEL_LOCK();
trapsignal(p, SIGSEGV, 0, SEGV_MAPERR, sv);
-   KERNEL_UNLOCK();
}
 
 out:
Index: arch/arm/arm/undefined.c
===
RCS file: /cvs/src/sys/arch/arm/arm/undefined.c,v
retrieving revision 1.13
diff -u -p -r1.13 undefined.c
--- arch/arm/arm/undefined.c13 Mar 2019 09:28:21 -  1.13
+++ arch/arm/arm/undefined.c18 Aug 2020 09:18:54 -
@@ -113,9 +113,7 @@ gdb_trapper(u_int addr, u_int insn, stru
if (insn == GDB_BREAKPOINT || insn == GDB5_BREAKPOINT) {
if (code == FAULT_USER) {
sv.sival_int = addr;
-   KERNEL_LOCK();
trapsignal(p, SIGTRAP, 0, TRAP_BRKPT, sv);
-   KERNEL_UNLOCK();
return 0;
}
}
@@ -174,9 +172,7 @@ undefinedinstruction(trapframe_t *frame)
if (__predict_false((fault_pc & 3) != 0)) {
/* Give the user an illegal instruction signal. */
sv.sival_int = (u_int32_t) fault_pc;
-   KERNEL_LOCK();
trapsignal(p, SIGILL, 0, ILL_ILLOPC, sv);
-   KERNEL_UNLOCK();
userret(p);
return;
}
@@ -260,9 +256,7 @@ undefinedinstruction(trapframe_t *frame)
}
 
sv.sival_int = frame->tf_pc;
-   KERNEL_LOCK();
trapsignal(p, SIGILL, 0, ILL_ILLOPC, sv);
-   KERNEL_UNLOCK();
}
 
if ((fault_code & FAULT_USER) == 0)
Index: arch/arm64/arm64/trap.c
===
RCS file: /cvs/src/sys/arch/arm64/arm64/trap.c,v
retrieving revision 1.28
diff -u -p -r1.28 trap.c
--- arch/arm64/arm64/trap.c 17 Aug 2020 08:09:03 -  1.28
+++ arch/arm64/arm64/trap.c

Fewer pool_get() in kqueue_register()

2020-08-18 Thread Martin Pieuchot
Diff below changes the order of operations in kqueue_register() to get
rid of an unnecessary pool_get().  When an event is already present on
the list try to acquire it first.  Note that knote_acquire() may sleep
in which case the list might have changed so the lookup has to always
begin from the start.

This will help with lazy removal of knote in poll/select.  In this
scenario EV_ADD is generally always done with an knote already on the
list.

ok?

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.142
diff -u -p -r1.142 kern_event.c
--- kern/kern_event.c   12 Aug 2020 13:49:24 -  1.142
+++ kern/kern_event.c   18 Aug 2020 08:58:27 -
@@ -696,7 +696,7 @@ kqueue_register(struct kqueue *kq, struc
struct filedesc *fdp = kq->kq_fdp;
const struct filterops *fops = NULL;
struct file *fp = NULL;
-   struct knote *kn = NULL, *newkn = NULL;
+   struct knote *kn, *newkn = NULL;
struct knlist *list = NULL;
int s, error = 0;
 
@@ -721,22 +721,12 @@ kqueue_register(struct kqueue *kq, struc
return (EBADF);
}
 
-   if (kev->flags & EV_ADD)
-   newkn = pool_get(&knote_pool, PR_WAITOK | PR_ZERO);
-
 again:
+   kn = NULL;
if (fops->f_flags & FILTEROP_ISFD) {
-   if ((fp = fd_getfile(fdp, kev->ident)) == NULL) {
-   error = EBADF;
-   goto done;
-   }
-   if (kev->flags & EV_ADD)
-   kqueue_expand_list(kq, kev->ident);
if (kev->ident < kq->kq_knlistsize)
list = &kq->kq_knlist[kev->ident];
} else {
-   if (kev->flags & EV_ADD)
-   kqueue_expand_hash(kq);
if (kq->kq_knhashmask != 0) {
list = &kq->kq_knhash[
KN_HASH((u_long)kev->ident, kq->kq_knhashmask)];
@@ -749,10 +739,6 @@ again:
s = splhigh();
if (!knote_acquire(kn)) {
splx(s);
-   if (fp != NULL) {
-   FRELE(fp, p);
-   fp = NULL;
-   }
goto again;
}
splx(s);
@@ -760,6 +746,21 @@ again:
}
}
}
+
+   if (kev->flags & EV_ADD && kn == NULL) {
+   newkn = pool_get(&knote_pool, PR_WAITOK | PR_ZERO);
+   if (fops->f_flags & FILTEROP_ISFD) {
+   if ((fp = fd_getfile(fdp, kev->ident)) == NULL) {
+   error = EBADF;
+   goto done;
+   }
+   kqueue_expand_list(kq, kev->ident);
+   } else {
+   kqueue_expand_hash(kq);
+   }
+
+   }
+
KASSERT(kn == NULL || (kn->kn_status & KN_PROCESSING) != 0);
 
if (kn == NULL && ((kev->flags & EV_ADD) == 0)) {



kqueue_scan_setup/finish

2020-08-14 Thread Martin Pieuchot
The previous change, which introduced the kqueue_scan_setup()/finish()
API required to switch poll(2) internals to the kqueue mechanism, has
been backed out.  The reason for the regression is still unknown, so
let's take a baby-step approach.

Diff below introduces the new API with only minimal changes.  It should
not introduce any change in behavior.
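
For reference, the calling pattern in sys_kevent() then becomes roughly
(condensed sketch, argument names simplified):

        struct kqueue_scan_state scan;

        kqueue_scan_setup(&scan, kq);           /* takes a reference on kq */
        error = kqueue_scan(&scan, nevents, eventlist, tsp, kev, p, &n);
        kqueue_scan_finish(&scan);              /* drops the reference */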

Comments?  Oks?

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.142
diff -u -p -r1.142 kern_event.c
--- kern/kern_event.c   12 Aug 2020 13:49:24 -  1.142
+++ kern/kern_event.c   14 Aug 2020 10:13:38 -
@@ -64,9 +64,6 @@ void  KQREF(struct kqueue *);
 void   KQRELE(struct kqueue *);
 
 intkqueue_sleep(struct kqueue *, struct timespec *);
-intkqueue_scan(struct kqueue *kq, int maxevents,
-   struct kevent *ulistp, struct timespec *timeout,
-   struct kevent *kev, struct proc *p, int *retval);
 
 intkqueue_read(struct file *, struct uio *, int);
 intkqueue_write(struct file *, struct uio *, int);
@@ -554,6 +551,7 @@ out:
 int
 sys_kevent(struct proc *p, void *v, register_t *retval)
 {
+   struct kqueue_scan_state scan;
struct filedesc* fdp = p->p_fd;
struct sys_kevent_args /* {
syscallarg(int) fd;
@@ -635,11 +633,12 @@ sys_kevent(struct proc *p, void *v, regi
goto done;
}
 
-   KQREF(kq);
+   kqueue_scan_setup(&scan, kq);
FRELE(fp, p);
-   error = kqueue_scan(kq, SCARG(uap, nevents), SCARG(uap, eventlist),
+   error = kqueue_scan(&scan, SCARG(uap, nevents), SCARG(uap, eventlist),
tsp, kev, p, &n);
-   KQRELE(kq);
+   kqueue_scan_finish(&scan);
+
*retval = n;
return (error);
 
@@ -895,11 +894,13 @@ kqueue_sleep(struct kqueue *kq, struct t
 }
 
 int
-kqueue_scan(struct kqueue *kq, int maxevents, struct kevent *ulistp,
-struct timespec *tsp, struct kevent *kev, struct proc *p, int *retval)
+kqueue_scan(struct kqueue_scan_state *scan, int maxevents,
+struct kevent *ulistp, struct timespec *tsp, struct kevent *kev,
+struct proc *p, int *retval)
 {
+   struct kqueue *kq = scan->kqs_kq;
struct kevent *kevp;
-   struct knote mend, mstart, *kn;
+   struct knote *kn;
int s, count, nkev, error = 0;
 
nkev = 0;
@@ -909,9 +910,6 @@ kqueue_scan(struct kqueue *kq, int maxev
if (count == 0)
goto done;
 
-   memset(&mstart, 0, sizeof(mstart));
-   memset(&mend, 0, sizeof(mend));
-
 retry:
KASSERT(count == maxevents);
KASSERT(nkev == 0);
@@ -939,18 +937,16 @@ retry:
goto done;
}
 
-   mstart.kn_filter = EVFILT_MARKER;
-   mstart.kn_status = KN_PROCESSING;
-   TAILQ_INSERT_HEAD(&kq->kq_head, &mstart, kn_tqe);
-   mend.kn_filter = EVFILT_MARKER;
-   mend.kn_status = KN_PROCESSING;
-   TAILQ_INSERT_TAIL(&kq->kq_head, &mend, kn_tqe);
+   TAILQ_INSERT_TAIL(&kq->kq_head, &scan->kqs_end, kn_tqe);
+   TAILQ_INSERT_HEAD(&kq->kq_head, &scan->kqs_start, kn_tqe);
while (count) {
-   kn = TAILQ_NEXT(&mstart, kn_tqe);
+   kn = TAILQ_NEXT(&scan->kqs_start, kn_tqe);
if (kn->kn_filter == EVFILT_MARKER) {
-   if (kn == &mend) {
-   TAILQ_REMOVE(&kq->kq_head, &mend, kn_tqe);
-   TAILQ_REMOVE(&kq->kq_head, &mstart, kn_tqe);
+   if (kn == &scan->kqs_end) {
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_end,
+   kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start,
+   kn_tqe);
splx(s);
if (count == maxevents)
goto retry;
@@ -958,8 +954,9 @@ retry:
}
 
/* Move start marker past another thread's marker. */
-   TAILQ_REMOVE(&kq->kq_head, &mstart, kn_tqe);
-   TAILQ_INSERT_AFTER(&kq->kq_head, kn, &mstart, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start, kn_tqe);
+   TAILQ_INSERT_AFTER(&kq->kq_head, kn, &scan->kqs_start,
+   kn_tqe);
continue;
}
 
@@ -1029,8 +1026,8 @@ retry:
break;
}
}
-   TAILQ_REMOVE(&kq->kq_head, &mend, kn_tqe);
-   TAILQ_REMOVE(&kq->kq_head, &mstart, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_end, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &scan->kqs_start, kn_tqe);
splx(s);
 done:
if (nkev != 0) {
@@ -1044,6 +1041,33 @@ done:
*retval = maxevents - count;
return (error);
 }
+
+void
+kqueue_scan_setup(struct kqueue_scan_state *scan, struct kqueue *kq)
+{
+   memset(scan, 0, sizeof(*scan));
+
+   KQREF(kq);
+   scan->kqs_kq = kq;
+   scan->kqs_start.kn_filter = EVFILT_MARKER;
+   

Re: TCP congestion control progression

2020-08-14 Thread Martin Pieuchot
On 13/08/20(Thu) 10:14, Brian Brombacher wrote:
> 
> 
> >> On Aug 9, 2020, at 6:29 PM, Chris Cappuccio  wrote:
> > Brian Brombacher [br...@planetunix.net] wrote:
> >> 
> >> I am wondering what approach the project is planning to use to modernize 
> >> the congestion control algorithms.  I'm interested in assisting the 
> >> project 
> >> with development effort in this area.  I've spent time making modifications
> >> for my own purposes and would prefer to understand the projects goals 
> >> before
> >> continuing, if possible.
> > 
> > Various improvements have been made over the years for dynamic window size,
> > also further tweaks for higher bandwidth over high latency connections. I'd
> > recommend sharing your current modifications here to get feedback.
> > 
> > Chris
> 
> Hi Chris,
> 
> The modifications I’ve made are around parameters related to the slow start 
> threshold logic, ENOBUF behavior on interface write, and each Reno congestion 
> control logic segment has also received some changes.  The majority of the 
> changes are implementing sysctl tunables to experiment with varying degrees 
> of network behavior I’ve been witnessing on my infrastructure.
> 
> I’ve been evaluating potential patterns to allow optimal maintenance and 
> addition of future CC algorithms.

mikeb@ has been working on this in the past.  His work can be used as a
solid basis, see:

https://github.com/mbelop/src/tree/tcpcc
https://github.com/mbelop/src/tree/tcpcc2



Re: pppx(4): move ifnet out of KERNEL_LOCK()

2020-08-06 Thread Martin Pieuchot
On 05/08/20(Wed) 12:50, Vitaliy Makkoveev wrote:
> pipex(4) and pppx(4) are ready to became a little bit more MP capable.
> Diff below moves pppx(4) related `ifnet' out of KERNEL_LOCK().

Nice, one comment below.

> Index: sys/net/if_pppx.c
> ===
> RCS file: /cvs/src/sys/net/if_pppx.c,v
> retrieving revision 1.98
> diff -u -p -r1.98 if_pppx.c
> --- sys/net/if_pppx.c 28 Jul 2020 09:53:36 -  1.98
> +++ sys/net/if_pppx.c 5 Aug 2020 09:34:50 -
> @@ -864,28 +863,23 @@ pppx_if_destroy(struct pppx_dev *pxd, st
>  }
>  
>  void
> -pppx_if_start(struct ifnet *ifp)
> +pppx_if_qstart(struct ifqueue *ifq)
>  {
> + struct ifnet *ifp = ifq->ifq_if;
>   struct pppx_if *pxi = (struct pppx_if *)ifp->if_softc;
>   struct mbuf *m;
>   int proto;
>  
> - if (!ISSET(ifp->if_flags, IFF_RUNNING))
> - return;
> -
> - for (;;) {
> - m = ifq_dequeue(>if_snd);
> -
> - if (m == NULL)
> - break;
> -
> + while ((m = ifq_dequeue(ifq)) != NULL) {
>   proto = *mtod(m, int *);
>   m_adj(m, sizeof(proto));
>  
>   ifp->if_obytes += m->m_pkthdr.len;
>   ifp->if_opackets++;
>  
> + NET_LOCK();
>   pipex_ppp_output(m, pxi->pxi_session, proto);
> + NET_UNLOCK();

This means the lock is taken and released for every packet.  It would be
better to grab it outside the loop.
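
Something along these lines (untested sketch):

        NET_LOCK();
        while ((m = ifq_dequeue(ifq)) != NULL) {
                proto = *mtod(m, int *);
                m_adj(m, sizeof(proto));

                ifp->if_obytes += m->m_pkthdr.len;
                ifp->if_opackets++;

                pipex_ppp_output(m, pxi->pxi_session, proto);
        }
        NET_UNLOCK();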

>   }
>  }



Re: pipex(4): kill pipexintr()

2020-07-31 Thread Martin Pieuchot
On 31/07/20(Fri) 21:58, Vitaliy Makkoveev wrote:
> [...] 
> What denies us to move pipex(4) under it's own lock?

Such a question won't lead us anywhere.  It assumes it makes sense to move
pipex under its own lock.  This assumption has many drawbacks which clearly
haven't been studied and, more importantly, it doesn't explain what for.

What is your goal?  What are you trying to achieve?  Improve latency?
Improve performance?  Of which subsystem?  Where is the bottleneck?
What is the architecture of the system?

IMHO the KERNEL_LOCK() should be removed and anything else postponed at
least until one has a clear understanding of the whole subsystem under
the NET_LOCK().



Re: pipex(4): kill pipexintr()

2020-07-31 Thread Martin Pieuchot
On 31/07/20(Fri) 12:15, Vitaliy Makkoveev wrote:
> On Fri, Jul 31, 2020 at 09:36:32AM +0900, YASUOKA Masahiko wrote:
> > On Thu, 30 Jul 2020 22:43:10 +0300
> > Vitaliy Makkoveev  wrote:
> > > On Thu, Jul 30, 2020 at 10:05:13PM +0900, YASUOKA Masahiko wrote:
> > >> On Thu, 30 Jul 2020 15:34:09 +0300
> > >> Vitaliy Makkoveev  wrote:
> > >> > On Thu, Jul 30, 2020 at 09:13:46PM +0900, YASUOKA Masahiko wrote:
> > >> >> Hi,
> > >> >> 
> > >> >> sys/net/if_ethersubr.c:
> > >> >> 372 void
> > >> >> 373 ether_input(struct ifnet *ifp, struct mbuf *m)
> > >> >> (snip)
> > >> >> 519 #if NPPPOE > 0 || defined(PIPEX)
> > >> >> 520 case ETHERTYPE_PPPOEDISC:
> > >> >> 521 case ETHERTYPE_PPPOE:
> > >> >> 522 if (m->m_flags & (M_MCAST | M_BCAST))
> > >> >> 523 goto dropanyway;
> > >> >> 524 #ifdef PIPEX
> > >> >> 525 if (pipex_enable) {
> > >> >> 526 struct pipex_session *session;
> > >> >> 527 
> > >> >> 528 if ((session = 
> > >> >> pipex_pppoe_lookup_session(m)) != NULL) {
> > >> >> 529 pipex_pppoe_input(m, session);
> > >> >> 530 return;
> > >> >> 531 }
> > >> >> 532 }
> > >> >> 533 #endif
> > >> >> 
> > >> >> previously a packet which branchces to #529 is enqueued.
> > >> >> 
> > >> >> If the diff removes the queue, then the pipex input routine is
> > >> >> executed by the NIC's interrupt handler.
> > >> >> 
> > >> >> The queues had been made to avoid that kind of situations.
> > >> > 
> > >> > It's not enqueued in pppoe case. According pipex_pppoe_input() code we
> > >> > call pipex_common_input() with `useq' argument set to '0', so we don't
> > >> > enqueue mbuf(9) but pass it to pipex_ppp_input() which will pass it to
> > >> > ipv{4,6}_input().
> > >> 
> > >> You are right.  Sorry, I forgot about this which I did that by myself.
> > >> 
> > > 
> > > I'm interesting the reason why you did that.
> > > 
> > >> >> Also I don't see a relation of the use-after-free problem and killing
> > >> >> queues.  Can't we fix the problem unless we kill the queues?
> > >> > 
> > >> > Yes we can. Reference counters allow us to keep orphan sessions in 
> > >> > these
> > >> > queues without use after free issue.
> > >> > 
> > >> > I will wait your commentaries current enqueuing before to do something.
> > >> 
> > >> I have another concern.
> > >> 
> > >> You might know, when L2TP/IPsec is used heavily, the crypto thread
> > >> uses 100% of 1 CPU core.  In that case, that thread becomes like
> > >> below:
> > >> 
> > >>   crypto thread -> udp_userreq -> pipex_l2tp_input
> > >> 
> > >> some clients are using MPPE(RC4 encryption) on CCP.  It's not so
> > >> light.
> > >> 
> > >> How do we offload this for CPUs?  I am thinking that "pipex" can have
> > >> a dedicated thread.  Do we have another scenario?
> > >>
> > > 
> > > I suppose you mean udp_input(). What is you call "crypto thread"? I did
> > > a little backtrace but I didn't find this thread.
> > > 
> > > ether_resolve
> > >   if_input_local
> > > ipv4_input
> > >   ip_input_if
> > > ip_ours
> > >   ip_deliver
> > > udp_input (through pr_input)
> > >   pipex_l2tp_input
> > > 
> > > ipi{,6}_mloopback
> > >   if_input_local
> > > ipv4_input
> > >   ...
> > > udp_input (through pr_input)
> > >   pipex_l2tp_input
> > > 
> > > loinput
> > >   if_input_local
> > > ipv4_input
> > >   ...
> > > udp_input (through pr_input)
> > >   pipex_l2tp_input
> > > 
> > > Also various pseudo drivers call ipv{4,6}_input() and underlay
> > > udp_unput() too.
> > > 
> > > Except nfs, we call udp_usrreq() through socket layer only. Do you mean
> > > userland as "crypto thread"?
> > 
> > Sorry, udp_usrreq() should be usr_input() and crypto thread meant a
> > kthread for crypto_taskq_mp_safe, whose name is "crynlk" (see
> > crypto_init()).
> > 
> > A packet of L2TP/IPsec (encapsulated IP/PPP/L2TP/UDP/ESP/UDP/IP) is
> > processed like:
> > 
> >ipv4_input
> >  ...
> >udp_input
> >  ipsec_common_input
> >esp_input
> >  crypto_dispatch
> >=> crypto_taskq_mp_safe
> > 
> >kthread "crynlk"
> >  crypto_invoke
> >... (*1)
> >  crypto_done
> >esp_input_cb
> >  ipsec_common_input_cb
> >ip_deliver
> >  udp_input
> >pipex_l2tp_input
> >  pipex_common_input
> >(*2)
> >pipex_ppp_input
> >  pipex_mppe_input (*3)
> >pipex_ppp_input
> >  pipex_ip_input
> >ipv4_input
> >  ...
> > 
> > At *2 there was a queue.  "crynlk" is a busy thread, since it is doing
> > decryption at *1.  I think it's 

Re: usbd_abort_pipe(); usbd_close_pipe; dance

2020-07-31 Thread Martin Pieuchot
On 31/07/20(Fri) 11:22, Marcus Glocker wrote:
> Maybe I'm missing something here.

You aren't.  Historically usbd_close_pipe() wasn't aborting transfers.
We changed it to do so as it happened to be the easiest fix for some
issues in code that had been copy/pasted around.

It's just that nobody took the time to do the cleanup you're now
suggesting, thanks!

> But is there any specific reason why the most of our USB drivers are
> calling usbd_abort_pipe() right before usbd_close_pipe()?  Since
> usbd_close_pipe() already will call usbd_abort_pipe() if the pipe isn't
> empty, as documented in the man page:
> 
> DESCRIPTION
>  The usbd_abort_pipe() function aborts any transfers queued on pipe.
> 
>  The usbd_close_pipe() function aborts any transfers queued on pipe
>  then deletes it.
> 
> In case this happened because of an inherited copy/paste chain, can we
> nuke the superfluous usbd_abort_pipe() calls?

Yes please, ok mpi@

> Index: if_atu.c
> ===
> RCS file: /cvs/src/sys/dev/usb/if_atu.c,v
> retrieving revision 1.131
> diff -u -p -u -p -r1.131 if_atu.c
> --- if_atu.c  10 Jul 2020 13:26:40 -  1.131
> +++ if_atu.c  31 Jul 2020 08:26:24 -
> @@ -2252,7 +2252,6 @@ atu_stop(struct ifnet *ifp, int disable)
>  
>   /* Stop transfers. */
>   if (sc->atu_ep[ATU_ENDPT_RX] != NULL) {
> - usbd_abort_pipe(sc->atu_ep[ATU_ENDPT_RX]);
>   err = usbd_close_pipe(sc->atu_ep[ATU_ENDPT_RX]);
>   if (err) {
>   DPRINTF(("%s: close rx pipe failed: %s\n",
> @@ -2262,7 +2261,6 @@ atu_stop(struct ifnet *ifp, int disable)
>   }
>  
>   if (sc->atu_ep[ATU_ENDPT_TX] != NULL) {
> - usbd_abort_pipe(sc->atu_ep[ATU_ENDPT_TX]);
>   err = usbd_close_pipe(sc->atu_ep[ATU_ENDPT_TX]);
>   if (err) {
>   DPRINTF(("%s: close tx pipe failed: %s\n",
> Index: if_aue.c
> ===
> RCS file: /cvs/src/sys/dev/usb/if_aue.c,v
> retrieving revision 1.110
> diff -u -p -u -p -r1.110 if_aue.c
> --- if_aue.c  10 Jul 2020 13:26:40 -  1.110
> +++ if_aue.c  31 Jul 2020 08:26:25 -
> @@ -1518,7 +1518,6 @@ aue_stop(struct aue_softc *sc)
>  
>   /* Stop transfers. */
>   if (sc->aue_ep[AUE_ENDPT_RX] != NULL) {
> - usbd_abort_pipe(sc->aue_ep[AUE_ENDPT_RX]);
>   err = usbd_close_pipe(sc->aue_ep[AUE_ENDPT_RX]);
>   if (err) {
>   printf("%s: close rx pipe failed: %s\n",
> @@ -1528,7 +1527,6 @@ aue_stop(struct aue_softc *sc)
>   }
>  
>   if (sc->aue_ep[AUE_ENDPT_TX] != NULL) {
> - usbd_abort_pipe(sc->aue_ep[AUE_ENDPT_TX]);
>   err = usbd_close_pipe(sc->aue_ep[AUE_ENDPT_TX]);
>   if (err) {
>   printf("%s: close tx pipe failed: %s\n",
> @@ -1538,7 +1536,6 @@ aue_stop(struct aue_softc *sc)
>   }
>  
>   if (sc->aue_ep[AUE_ENDPT_INTR] != NULL) {
> - usbd_abort_pipe(sc->aue_ep[AUE_ENDPT_INTR]);
>   err = usbd_close_pipe(sc->aue_ep[AUE_ENDPT_INTR]);
>   if (err) {
>   printf("%s: close intr pipe failed: %s\n",
> Index: if_axe.c
> ===
> RCS file: /cvs/src/sys/dev/usb/if_axe.c,v
> retrieving revision 1.141
> diff -u -p -u -p -r1.141 if_axe.c
> --- if_axe.c  10 Jul 2020 13:26:40 -  1.141
> +++ if_axe.c  31 Jul 2020 08:26:25 -
> @@ -1473,7 +1473,6 @@ axe_stop(struct axe_softc *sc)
>  
>   /* Stop transfers. */
>   if (sc->axe_ep[AXE_ENDPT_RX] != NULL) {
> - usbd_abort_pipe(sc->axe_ep[AXE_ENDPT_RX]);
>   err = usbd_close_pipe(sc->axe_ep[AXE_ENDPT_RX]);
>   if (err) {
>   printf("axe%d: close rx pipe failed: %s\n",
> @@ -1483,7 +1482,6 @@ axe_stop(struct axe_softc *sc)
>   }
>  
>   if (sc->axe_ep[AXE_ENDPT_TX] != NULL) {
> - usbd_abort_pipe(sc->axe_ep[AXE_ENDPT_TX]);
>   err = usbd_close_pipe(sc->axe_ep[AXE_ENDPT_TX]);
>   if (err) {
>   printf("axe%d: close tx pipe failed: %s\n",
> @@ -1493,7 +1491,6 @@ axe_stop(struct axe_softc *sc)
>   }
>  
>   if (sc->axe_ep[AXE_ENDPT_INTR] != NULL) {
> - usbd_abort_pipe(sc->axe_ep[AXE_ENDPT_INTR]);
>   err = usbd_close_pipe(sc->axe_ep[AXE_ENDPT_INTR]);
>   if (err) {
>   printf("axe%d: close intr pipe failed: %s\n",
> Index: if_axen.c
> ===
> RCS file: /cvs/src/sys/dev/usb/if_axen.c,v
> retrieving revision 1.29
> diff -u -p -u -p -r1.29 if_axen.c
> --- if_axen.c 10 Jul 2020 13:26:40 -  1.29
> +++ if_axen.c 31 Jul 2020 08:26:25 -
> @@ -1426,7 +1426,6 @@ axen_stop(struct axen_softc *sc)
>  
>   /* Stop transfers. */
>   if 

Re: pipex(4): kill pipexintr()

2020-07-31 Thread Martin Pieuchot
On 30/07/20(Thu) 21:13, YASUOKA Masahiko wrote:
> sys/net/if_ethersubr.c:
> 372 void
> 373 ether_input(struct ifnet *ifp, struct mbuf *m)
> (snip)
> 519 #if NPPPOE > 0 || defined(PIPEX)
> 520 case ETHERTYPE_PPPOEDISC:
> 521 case ETHERTYPE_PPPOE:
> 522 if (m->m_flags & (M_MCAST | M_BCAST))
> 523 goto dropanyway;
> 524 #ifdef PIPEX
> 525 if (pipex_enable) {
> 526 struct pipex_session *session;
> 527 
> 528 if ((session = pipex_pppoe_lookup_session(m)) != 
> NULL) {
> 529 pipex_pppoe_input(m, session);
> 530 return;
> 531 }
> 532 }
> 533 #endif
> 
> previously a packet which branchces to #529 is enqueued.
> 
> If the diff removes the queue, then the pipex input routine is
> executed by the NIC's interrupt handler.
>
> The queues had been made to avoid that kind of situations.

For many releases now there has been a generic queue between every NIC
interrupt handler and the softnet thread.

With this diff the pipex routine is still executed in a task.  Previously
it was the `if_input_task_locked' which corresponds to all netisrs, now it
is the `ifiq_task' which calls if_input_process().



Re: pipex(4): kill pipexintr()

2020-07-30 Thread Martin Pieuchot
On 29/07/20(Wed) 23:04, Vitaliy Makkoveev wrote:
> Now pipex(4) is fully covered by NET_LOCK() and this is documented. But
> we still have an issue with pipex(4) session itself and I guess it's
> time to fix it.
> 
> We have `pipexinq' and `pipexoutq' mbuf(9) queues to store mbufs. Each
> mbuf(9) passed to these queues stores the pointer to corresponding
> session referenced as `m_pkthdr.ph_cookie'. We enqueue incoming mbufs for
> pppx(4) and incoming and outgoing mbufs for pppac(4). But we don't
> enqueue pppoe related mbufs. After packet was enqueued to corresponding
> queue we call schednetisr() which just schedules netisr() to run:
> 
>  cut begin 
> 
> 780 pipex_ppp_enqueue(struct mbuf *m0, struct pipex_session *session,
> 781 struct mbuf_queue *mq)
> 782 {
> 783 m0->m_pkthdr.ph_cookie = session;
> 784 /* XXX need to support other protocols */
> 785 m0->m_pkthdr.ph_ppp_proto = PPP_IP;
> 786 
> 787 if (mq_enqueue(mq, m0) != 0)
> 788 return (1);
> 789 
> 790 schednetisr(NETISR_PIPEX);
> 791 
> 792 return (0);
> 793 }
> 
>  cut end 
> 
> Also we have pipex_timer() which should destroy session in safe way, but
> it does this only for pppac(4) and only for sessions closed by
> `PIPEXDSESSION' command:
> 
>  cut begin 
> 
> 812 pipex_timer(void *ignored_arg)
> 813 {
>   /* skip */
> 846 case PIPEX_STATE_CLOSED:
> 847 /*
> 848  * mbuf queued in pipexinq or pipexoutq may have a
> 849* refererce to this session.
> 850  */
> 851 if (!mq_empty() || !mq_empty())
> 852 continue;
> 853 
> 854 pipex_destroy_session(session);
> 855 break;
> 
>  cut end 
> 
> While we destroy sessions through pipex_rele_session() or through
> pipex_iface_fini() or through `PIPEXSMODE' command we don't check
> `pipexinq' and `pipexoutq' state. This means we can break them.
> 
> It's not guaranteed that netisr() will start just after schednetisr()
> call. This means we can destroy session, but corresponding mbuf(9) is
> stored within `pipexinq' or `pipexoutq'. It's `m_pkthdr.ph_cookie' still
> stores pointer to destroyed session and we have use after free issue. I
> wonder why we didn't caught panic yet.
> 
> I propose to kill `pipexinq', `pipexoutq' and pipexintr(). There is
> absolutely no reason them to exist. This should not only fix issue
> described above but simplifies code too.

Time is fine.  Make sure you watch for possible fallout.  If you're
curious you can generate Flamegraphs or profile the kernel with and
without this diff to get an idea of what gets executed.  This change
should improve latency by reducing contention on the KERNEL_LOCK().

ok mpi@

Note that as a next step it could be beneficial to pass an `ifp' pointer
to pipex_ip{,6}_input() and maybe even pipex_ppp_input() to reduce the
number of if_get(9) calls.  This will make an interesting pattern appear ;)
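
Roughly such a prototype change (hypothetical, untested):

        -void   pipex_ip_input(struct mbuf *, struct pipex_session *);
        +void   pipex_ip_input(struct mbuf *, struct pipex_session *,
        +           struct ifnet *);

with the caller passing down the `ifp' it already holds a reference on.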

> Index: lib/libc/sys/sysctl.2
> ===
> RCS file: /cvs/src/lib/libc/sys/sysctl.2,v
> retrieving revision 1.40
> diff -u -p -r1.40 sysctl.2
> --- lib/libc/sys/sysctl.2 17 May 2020 05:48:39 -  1.40
> +++ lib/libc/sys/sysctl.2 29 Jul 2020 13:47:40 -
> @@ -2033,35 +2033,11 @@ The currently defined variable names are
>  .Bl -column "Third level name" "integer" "Changeable" -offset indent
>  .It Sy "Third level name" Ta Sy "Type" Ta Sy "Changeable"
>  .It Dv PIPEXCTL_ENABLE Ta integer Ta yes
> -.It Dv PIPEXCTL_INQ Ta node Ta not applicable
> -.It Dv PIPEXCTL_OUTQ Ta node Ta not applicable
>  .El
>  .Bl -tag -width "123456"
>  .It Dv PIPEXCTL_ENABLE
>  If set to 1, enable PIPEX processing.
>  The default is 0.
> -.It Dv PIPEXCTL_INQ Pq Va net.pipex.inq
> -Fourth level comprises an array of
> -.Vt struct ifqueue
> -structures containing information about the PIPEX packet input queue.
> -The forth level names for the elements of
> -.Vt struct ifqueue
> -are the same as described in
> -.Li ip.arpq
> -in the
> -.Dv PF_INET
> -section.
> -.It Dv PIPEXCTL_OUTQ Pq Va net.pipex.outq
> -Fourth level comprises an array of
> -.Vt struct ifqueue
> -structures containing information about PIPEX packet output queue.
> -The forth level names for the elements of
> -.Vt struct ifqueue
> -are the same as described in
> -.Li ip.arpq
> -in the
> -.Dv PF_INET
> -section.
>  .El
>  .El
>  .Ss CTL_VFS
> Index: sys/net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.616
> diff -u -p -r1.616 if.c
> --- sys/net/if.c  24 Jul 2020 18:17:14 -  1.616
> +++ sys/net/if.c  29 Jul 2020 13:47:44 -
> @@ -909,13 +909,6 @@ if_netisr(void *unused)
>   KERNEL_UNLOCK();
>   }
>  

Re: xhci(4) isoc: fix bogus handling of chained TRBs

2020-07-28 Thread Martin Pieuchot
On 26/07/20(Sun) 16:23, Marcus Glocker wrote:
> On Sun, 26 Jul 2020 13:27:34 +
> sc.dy...@gmail.com wrote:
> 
> > On 2020/07/26 10:54, Marcus Glocker wrote:
> > > On Sat, 25 Jul 2020 20:31:44 +
> > > sc.dy...@gmail.com wrote:
> > >   
> > >> On 2020/07/25 18:10, Marcus Glocker wrote:  
> > >>> On Sun, Jul 19, 2020 at 02:12:21PM +, sc.dy...@gmail.com
> > >>> wrote: 
> >  On 2020/07/19 11:25, Marcus Glocker wrote:
> > > On Sun, 19 Jul 2020 02:25:30 +
> > > sc.dy...@gmail.com wrote:
> > >
> > >> hi,
> > >>
> > >> It works on AMD Bolton xHCI (78141022), Intel PCH (1e318086),
> > >> and ASM1042 (10421b21).
> > >> I simply play with ffplay -f v4l2 /dev/video0 to test.
> > >
> > > If your cam supports MJPEG it's good to add '-input_format
> > > mjpeg' with higher resolutions like 1280x720, because that will
> > > generated varying image sizes, which hit the 64k memory boundary
> > > more often, and thus generate potentially more chained TDs.
> > 
> >  Thank you for useful information.
> >  My webcam supprots at most 640x480, but 1024 bytes/frame x (2+1)
> >  maxburst x 40 frames = 122880 bytes/xfer is enough to observe TD
> >  fragmentation.
> > 
> > 
> > >> At this moment it does not work on VL805, but I have no idea.
> > >> I'll investigate furthermore...
> > >>>
> > >>> Did you already had a chance to figure out something regarding the
> > >>> issue you faced on your VL805 controller?
> > >>>
> > >>> I'm running the diff here since then on the Intel xHCI controller
> > >>> and couldn't re-produce any errors using different uvideo(4) and
> > >>> uaudio(4) devices.
> > >>> 
> > >>
> > >> No, yet -- all I know about this problem is VL805 genegates
> > >> many MISSED_SRV Transfer Event for Isoch-IN pipe.
> > >>
> > >> xhci0: slot 3 missed srv with 123 TRB
> > >>  :
> > >>
> > >> Even if I increase UVIDEO_IXFERS in uvideo.h to 6, HC still detects
> > >> MISSED_SRV. When I disable splitting TD, it works well.
> > >> I added printf paddr in the event TRB but each paddr of MISSED_SRV
> > >> is 0, that does not meet 4.10.3.2.
> > >> Parameters in this endpoint context are
> > >>
> > >> xhci0: slot 3 dci 3 ival 0 mps 1024 maxb 2 mep 3072 atl 3072 mult 0
> > >>
> > >> looks sane.  
> > > 
> > > Hmm, I see.
> > > 
> > > I currently have also no idea what exactly is causing the missed
> > > service events.  I was reading a little bit yesterday about the
> > > VL805 and could find some statements where people say it's not
> > > fully compliant with the xHCI specs, and in Linux it took some
> > > cooperation with the vendor to make it work.
> > > 
> > > One thing I still wanted to ask you to understand whether the
> > > problem on your VL805 is only related with my last diff;  Are the
> > > multi-trb transfers working fine with your last diff on the VL805?  
> > 
> > On VL805 ffplay plays the movie sometimes smoothly, sometimes laggy.
> > The multi-TRB transfer itself works on VL805 with your patch.
> > Not all splitted TD fail to transfer. Successful splitted transfer
> > works as intended.
> > I think MISSED_SRV is caused by other reason, maybe isochronous
> > scheduling problem.
> > Thus, IMO your patch can be committed.
> > 
> > VL805 also has same problem that AMD Bolton has.
> > It may generate the 1. TRB event w/ cc = SHORT_PKT and
> > remain = requested length (that is, transferred length = 0),
> > but the 2. TRB w/ cc = SUCCESS and remain = 0.
> > remain of 2. TRB should be given length, and CC should be SHORT_PKT.
> > Your patch fixes this problem.
> 
> OK, that's what I wanted to understand.
> I also have the impression that the MISSED_SRV issue on the VL805 is
> related to another problem which we should trace separately from the
> multi-trb problem.  Thanks for that useful feedback.
> 
> Attached the latest version of my patch including all the inputs
> received (mostly related to malloc/free).
> 
> Patrick, Martin, would you be fine to OK that?

The logic looks fine.  I'd suggest you move the trb_processed array to
the xhci_pipe structure.  Command and Event rings do not need it, right?
This should also allow you to get rid of the malloc/free by always using
XHCI_MAX_XFER elements.
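
Roughly (sketch; the member name is only illustrative):

        struct xhci_pipe {
                /* ... existing members ... */
                /* one flag per TRB slot, no malloc/free per transfer */
                uint8_t         trb_processed[XHCI_MAX_XFER];
        };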

> Index: xhci.c
> ===
> RCS file: /cvs/src/sys/dev/usb/xhci.c,v
> retrieving revision 1.116
> diff -u -p -u -p -r1.116 xhci.c
> --- xhci.c30 Jun 2020 10:21:59 -  1.116
> +++ xhci.c19 Jul 2020 06:51:58 -
> @@ -82,7 +82,7 @@ voidxhci_event_xfer(struct xhci_softc *
>  int  xhci_event_xfer_generic(struct xhci_softc *, struct usbd_xfer *,
>   struct xhci_pipe *, uint32_t, int, uint8_t, uint8_t, uint8_t);
>  int  xhci_event_xfer_isoc(struct usbd_xfer *, struct xhci_pipe *,
> - uint32_t, int);
> + uint32_t, int, uint8_t);
>  void xhci_event_command(struct xhci_softc *, uint64_t);
>  void 

Re: net80211: skip input block ack window gaps faster

2020-07-28 Thread Martin Pieuchot
On 17/07/20(Fri) 18:15, Stefan Sperling wrote:
> On Fri, Jul 17, 2020 at 03:59:38PM +0200, Stefan Sperling wrote:
> > While measuring Tx performance at a fixed Tx rate with iwm(4) I observed
> > unexpected dips in throughput measured by tcpbench. These dips coincided
> > with one or more gap timeouts shown in 'netstat -W iwm0', such as:
> > 77 input block ack window gaps timed out
> > Which means lost frames on the receive side were stalling subsequent frames
> > and thus slowing tcpbench down.
> > 
> > I decided to disable the gap timeout entirely to see what would happen if
> > those missing frames were immediately skipped rather than waiting for them.
> > The result was stable throughput according to tcpbench.
> > 
> > I then wrote the patch below which keeps the gap timeout intact (it is 
> > needed
> > in case the peer stops sending anything) but skips missing frames at the 
> > head
> > of the Rx block window once a certain amount of frames have queued up. This
> > heuristics avoids having to wait for the timeout to fire in order to get
> > frames flowing again if we lose one of more frames during Rx traffic bursts.
> > 
> > I have picked a threshold of 16 outstanding frames based on local testing.
> > I have no idea if this is a good threshold for everyone. It would help to
> > get some feedback from tests in other RF environments and other types of 
> > access points. Any regressions?
> 
> Next version.
> 
> One problem with the previous patch was that it effectively limited the
> size of the BA window to the arbitrarily chosen limit of 16. We should not
> drop frames which arrive out of order but still fall within the BA window.
> 
> With this version, we allow the entire block ack window (usually 64 frames)
> to fill up beyond the missing frame at the head, and only then bypass the
> gap timeout handler and skip over the missing frame directly. I can still
> trigger this shortcut with tcpbench, and still see the timeout run sometimes.
> Direct skip should be faster than having to wait for the timeout to run,
> and missing just one out of 64 frames is a common case in my testing.
> 
> Also, I am not quite sure if calling if_input() from a timeout is such a
> good idea. Any opinions about that? This patch still lets the gap timeout
> handler clear the leading gap but avoids flushing buffered frames there.
> The peer will now need to send another frame to flush the buffer, but now
> if_input() will be called from network interrupt context only. Which is
> probably a good thing?

if_input() can be called in any context.  Using a timeout means you need
some extra logic to free the queue.  It might also add to the latency.

> This code still seems to recover well enough from occasional packet loss,
> which is what this is all about. If you are on a really bad link, none
> of this will help anyway.
> 
> diff refs/heads/master refs/heads/ba-gap
> blob - 098aa9bce19481ce09676ce3c4fc0040f14c9b93
> blob + 4f41b568311bf29e131a3f4802e0a238ba940fe0
> --- sys/net80211/ieee80211_input.c
> +++ sys/net80211/ieee80211_input.c
> @@ -67,6 +67,7 @@ voidieee80211_input_ba(struct ieee80211com *, 
> struct 
>   struct mbuf_list *);
>  void ieee80211_input_ba_flush(struct ieee80211com *, struct ieee80211_node *,
>   struct ieee80211_rx_ba *, struct mbuf_list *);
> +int  ieee80211_input_ba_gap_skip(struct ieee80211_rx_ba *);
>  void ieee80211_input_ba_gap_timeout(void *arg);
>  void ieee80211_ba_move_window(struct ieee80211com *,
>   struct ieee80211_node *, u_int8_t, u_int16_t, struct mbuf_list *);
> @@ -837,10 +838,29 @@ ieee80211_input_ba(struct ieee80211com *ic, struct mbu
>   rxi->rxi_flags |= IEEE80211_RXI_AMPDU_DONE;
>   ba->ba_buf[idx].rxi = *rxi;
>  
> - if (ba->ba_buf[ba->ba_head].m == NULL)
> - timeout_add_msec(>ba_gap_to, IEEE80211_BA_GAP_TIMEOUT);
> - else if (timeout_pending(>ba_gap_to))
> - timeout_del(>ba_gap_to);
> + if (ba->ba_buf[ba->ba_head].m == NULL) {
> + if (ba->ba_gapwait < (ba->ba_winsize - 1)) {
> + if (ba->ba_gapwait == 0) {
> + timeout_add_msec(>ba_gap_to,
> + IEEE80211_BA_GAP_TIMEOUT);
> + }
> + ba->ba_gapwait++;
> + } else {
> + /*
> +  * A full BA window worth of frames is now waiting.
> +  * Skip the missing frame at the head of the window.
> +  */
> + int skipped = ieee80211_input_ba_gap_skip(ba);
> + ic->ic_stats.is_ht_rx_ba_frame_lost += skipped;
> + ba->ba_gapwait = 0;
> + if (timeout_pending(>ba_gap_to))
> + timeout_del(>ba_gap_to);
> + }
> + } else {
> + ba->ba_gapwait = 0;
> + if (timeout_pending(>ba_gap_to))
> + 

Re: pipex(4): document global data locks

2020-07-28 Thread Martin Pieuchot
On 17/07/20(Fri) 17:04, Vitaliy Makkoveev wrote:
> Subj. Also add NET_ASSERT_LOCKED() to pipex_{link,unlink,rele}_session()
> to be sure they called under NET_LOCK().

pipex_rele_session() is freeing memory.  When this function is called
those chunks of memory shouldn't be referenced by any other CPU or by
any descriptor in the network stack.  So the NET_LOCK() shouldn't be
required.

Rest of the diff is fine.  I'd suggest you put the assertions just above
the LIST_INSERT or LIST_REMOVE like it is done in other parts of the stack.
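
That is, something like (sketch):

        /* in pipex_link_session() */
        NET_ASSERT_LOCKED();
        LIST_INSERT_HEAD(&pipex_session_list, session, session_list);

        /* in pipex_unlink_session() */
        NET_ASSERT_LOCKED();
        LIST_REMOVE(session, session_list);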

> Index: sys/net/pipex.c
> ===
> RCS file: /cvs/src/sys/net/pipex.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 pipex.c
> --- sys/net/pipex.c   17 Jul 2020 08:57:27 -  1.120
> +++ sys/net/pipex.c   17 Jul 2020 14:01:10 -
> @@ -83,19 +83,24 @@ struct pool pipex_session_pool;
>  struct pool mppe_key_pool;
>  
>  /*
> - * static/global variables
> + * Global data
> + * Locks used to protect global data
> + *   A   atomic operation
> + *   I   immutable after creation
> + *   N   net lock
>   */
> -int  pipex_enable = 0;
> +
> +int  pipex_enable = 0;   /* [N] */
>  struct pipex_hash_head
> -pipex_session_list,  /* master session list 
> */
> -pipex_close_wait_list,   /* expired session list */
> -pipex_peer_addr_hashtable[PIPEX_HASH_SIZE],  /* peer's address hash 
> */
> -pipex_id_hashtable[PIPEX_HASH_SIZE]; /* peer id hash */
> +pipex_session_list,  /* [N] master session 
> list */
> +pipex_close_wait_list,   /* [N] expired session list */
> +pipex_peer_addr_hashtable[PIPEX_HASH_SIZE],  /* [N] peer's address 
> hash */
> +pipex_id_hashtable[PIPEX_HASH_SIZE]; /* [N] peer id hash */
>  
> -struct radix_node_head   *pipex_rd_head4 = NULL;
> -struct radix_node_head   *pipex_rd_head6 = NULL;
> +struct radix_node_head   *pipex_rd_head4 = NULL; /* [N] */
> +struct radix_node_head   *pipex_rd_head6 = NULL; /* [N] */
>  struct timeout pipex_timer_ch;   /* callout timer context */
> -int pipex_prune = 1; /* walk list every seconds */
> +int pipex_prune = 1; /* [I] walk list every seconds */
>  
>  /* pipex traffic queue */
>  struct mbuf_queue pipexinq = MBUF_QUEUE_INITIALIZER(IFQ_MAXLEN, IPL_NET);
> @@ -105,7 +110,7 @@ struct mbuf_queue pipexoutq = MBUF_QUEUE
>  #define ph_ppp_proto ether_vtag
>  
>  #ifdef PIPEX_DEBUG
> -int pipex_debug = 0; /* systcl net.inet.ip.pipex_debug */
> +int pipex_debug = 0; /* [A] systcl net.inet.ip.pipex_debug */
>  #endif
>  
>  /* PPP compression == MPPE is assumed, so don't answer CCP Reset-Request. */
> @@ -419,6 +424,8 @@ pipex_init_session(struct pipex_session 
>  void
>  pipex_rele_session(struct pipex_session *session)
>  {
> + NET_ASSERT_LOCKED();
> +
>   if (session->mppe_recv.old_session_keys)
>   pool_put(_key_pool, session->mppe_recv.old_session_keys);
>   pool_put(_session_pool, session);
> @@ -430,6 +437,8 @@ pipex_link_session(struct pipex_session 
>  {
>   struct pipex_hash_head *chain;
>  
> + NET_ASSERT_LOCKED();
> +
>   if (!iface->pipexmode)
>   return (ENXIO);
>   if (pipex_lookup_by_session_id(session->protocol,
> @@ -463,6 +472,8 @@ pipex_link_session(struct pipex_session 
>  void
>  pipex_unlink_session(struct pipex_session *session)
>  {
> + NET_ASSERT_LOCKED();
> +
>   session->ifindex = 0;
>  
>   LIST_REMOVE(session, id_chain);
> 



Re: pipex_iface_fini() release multicast session under NET_LOCK()

2020-07-28 Thread Martin Pieuchot
On 17/07/20(Fri) 16:29, Vitaliy Makkoveev wrote:
> We are going to lock the whole pipex(4) by NET_LOCK(). So move
> `multicast_session' freeing undet NET_LOCK() too.

pipex_iface_fini() should be called on the last reference of the
descriptor.  So this shouldn't be necessary.  If there's an issue
with the current order of the operations, we should certainly fix
it differently.

> Index: sys/net/pipex.c
> ===
> RCS file: /cvs/src/sys/net/pipex.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 pipex.c
> --- sys/net/pipex.c   17 Jul 2020 08:57:27 -  1.120
> +++ sys/net/pipex.c   17 Jul 2020 13:23:16 -
> @@ -192,8 +192,8 @@ pipex_iface_stop(struct pipex_iface_cont
>  void
>  pipex_iface_fini(struct pipex_iface_context *pipex_iface)
>  {
> - pool_put(_session_pool, pipex_iface->multicast_session);
>   NET_LOCK();
> + pool_put(_session_pool, pipex_iface->multicast_session);
>   pipex_iface_stop(pipex_iface);
>   NET_UNLOCK();
>  }
> 



Re: fix races in if_clone_create()

2020-07-13 Thread Martin Pieuchot
On 06/07/20(Mon) 15:44, Vitaliy Makkoveev wrote:
> > On 6 Jul 2020, at 12:17, Martin Pieuchot  wrote:
> > Assertions and documentation are more important than preventing races
> > because they allow to build awareness and elegant solutions instead of
> > hacking diffs until stuff work without knowing why.
> > 
> > There are two cases where `ifp' are inserted into `ifnet':
> > 1. by autoconf during boot or hotplug
> > 2. by cloning ioctl
> > 
> > In the second case it is always about pseudo-devices.  So the assertion
> > should be conditional like:
> > 
> > if (ISSET(ifp->if_xflags, IFXF_CLONED))
> > rw_assert_wrlock(_lock);
> > 
> > In other words this fixes serializes insertions/removal on the global
> > list `ifnet', the KERNEL_LOCK() being still required for reading it.
> > 
> > Is there any other data structure which ends up being protected by this
> > approach and could be documented?
> 
> We should be sure there is no multiple `ifnet’s in `if_list’ with the same
> `if_xname’.

That's a symptom of a bug.  Checking for a symptom won't prevent another
type of corruption, maybe next time it will be a corrupted pointer?

> And the assertion you proposed looks not obvious here.

Why, is it because of the if() check?  That's required unless we put all
if_attach() calls under the lock, which would require changing every
driver in-tree.  However, since drivers for physical devices are attached
without multiple CPUs running, there's no possible race.

> Assertion like below looks more reasonable but introduces performance
> impact.

We should first aim for correctness, then performance.  In this case,
performance is not even an issue because interfaces are created far less
often than packets are processed.



Re: pppac(4): fix races in pppacopen()

2020-07-13 Thread Martin Pieuchot
On 11/07/20(Sat) 23:51, Vitaliy Makkoveev wrote:
> [...] 
> The way you suggest to go is to introduce rwlock and serialize
> pppacopen() and pppacclose(). This is bad idea because we will sleep
> while we are holding rwlock.

That's the whole point of a rwlock: being able to sleep while holding the
lock.  The goal is to prevent any other thread coming from userland from
entering any code path leading to the same data structure.

This is the same as what the KERNEL_LOCK() was supposed to do, assuming
there were no sleeping points in if_attach() and pipex_iface_init().

>  Also this is bad idea because you should
> prevent access to `sc' which is being destroyed because you can grab it
> by concurrent thread.

Which data structure is the other thread using to get a reference on `sc'?

If the data structure is protected by the rwlock, like I suggest for
ifunit(), there's no problem.  If it is protected by the KERNEL_LOCK()
then any sleeping point can lead to a race.  That's why we're doing such
changes.

In the case of a data structure protected by the KERNEL_LOCK() the easiest
way to deal with a sleeping point is to re-check when coming back from
sleep.  This works well if the sleeping point is not deep into another
layer.
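
The usual pattern is roughly (sketch; sc_lookup() stands for whatever
lookup the driver does on its global list):

        /* KERNEL_LOCK() held */
        if (sc_lookup(dev) != NULL)
                return (EBUSY);
        sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK|M_ZERO);    /* may sleep */
        if (sc_lookup(dev) != NULL) {   /* re-check, somebody raced us */
                free(sc, M_DEVBUF, sizeof(*sc));
                return (EBUSY);
        }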

Another way is to have a per-driver lock or serialization mechanism.

>   You must serialize *all* access to this `sc'
> elsewhere your "protection" is useless.

The question is not about access to `sc' but about which global data
structure holds a reference to this `sc'.  If the data structure is
common to the whole network stack, like ifunit()'s, then we should look
for a solution covering the whole network stack and all the usages of
that list.  That includes many drivers' *open() functions, the cloning
ioctls, etc.

If the data structure is per-driver then the locking/serialization
mechanism is per-driver.

> pppx(4) had no problems with unit protection. Also it had no problems
> to access incomplete `pxi'. Now pppx(4) has fixed access to `pxi' which
> is being destroyed. And this is the way to go in pppac(4) layer too.
> 
> We have pppx_dev2pxd() to obtain `pxd'. While we adding extra check to
> pppx_dev2pxd() this is not system wide. Also pppac(4) already has
> `sc_dead' to prevent concurrent pppac_ioctl() access to dying `sc'. You
> suggest to serialize pppac_ioctl() too?

The way `sc_dead' is used in other drivers is a way to prevent
per-driver ioctl(2) while a pseudo-device is being detached.  It assumes
the NET_LOCK() is held around pseudo-device ioctl(2) so the flag is
protected (but not documented) by the NET_LOCK().  If you want to do the
same go for it, but please do not add another meaning to the same mechanism.
Having drivers that work similarly reduces the maintenance effort.
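
For reference, the existing `sc_dead' pattern looks roughly like this
(sketch; it assumes the NET_LOCK() is held on both paths):

        /* in the clone destroy path */
        sc->sc_dead = 1;

        /* in the per-driver ioctl path */
        if (sc->sc_dead)
                return (ENXIO);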



Re: pppac(4): fix races in pppacopen()

2020-07-11 Thread Martin Pieuchot
On 10/07/20(Fri) 14:38, Vitaliy Makkoveev wrote:
> On Fri, Jul 10, 2020 at 01:22:40PM +0200, Martin Pieuchot wrote:
> > On 10/07/20(Fri) 14:07, Vitaliy Makkoveev wrote:
> > > We have some races in pppac(4)
> > > 1. malloc(9) can sleep so we must check `sc' presence after malloc(9)
> > 
> > Makes sense.
> > 
> > > 2. we can sleep between `sc' insertion to `sc_entry' list and 
> > > `sc_pipex_iface' initialization. Concurrent pppacioctl() can touch
> > > this incomplete `sc'.
> > 
> > Why not insert the descriptor at the end?  Shouldn't the order of
> > operations be:
> > 
> > pipex_iface_init();
> > if_attach();
> > LIST_INSERT_HEAD()
> > 
> > This way there's no need for a `ready' flag since the descriptor is only
> > added to global data structures once it is completely initialized.
> > 
> > Using a `sc_ready' or `sc_dead' approach is something that require
> > touching all drivers whereas serializing insertions to global data
> > structures can be done at once for all the kernel.
> 
> No, because we introduce the races with if_attach(). The similar races
> are in if_clone_attach(). We can do multiple `ifp' attachment with the
> same name.

Yes that's the same problem.  It is also present in other parts of the
userland/network stack boundary.  That's why I'm arguing that the best
approach is to use a lock and document which data structures it
protects.

We should concentrate on protecting access to data structures and not
code paths.



Re: pppac(4): fix races in pppacopen()

2020-07-10 Thread Martin Pieuchot
On 10/07/20(Fri) 14:07, Vitaliy Makkoveev wrote:
> We have some races in pppac(4)
> 1. malloc(9) can sleep so we must check `sc' presence after malloc(9)

Makes sense.

> 2. we can sleep between `sc' insertion to `sc_entry' list and 
> `sc_pipex_iface' initialization. Concurrent pppacioctl() can touch
> this incomplete `sc'.

Why not insert the descriptor at the end?  Shouldn't the order of
operations be:

pipex_iface_init();
if_attach();
LIST_INSERT_HEAD()

This way there's no need for a `ready' flag since the descriptor is only
added to global data structures once it is completely initialized.

Using a `sc_ready' or `sc_dead' approach is something that requires
touching all drivers, whereas serializing insertions to global data
structures can be done at once for the whole kernel.
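
Concretely, the tail of pppacopen() would then look roughly like this
(untested sketch):

        pipex_iface_init(&sc->sc_pipex_iface, ifp);
        if_attach(ifp);
        /* only now make the descriptor visible to other threads */
        LIST_INSERT_HEAD(&pppac_devs, sc, sc_entry);

        return (0);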

> Index: sys/net/if_pppx.c
> ===
> RCS file: /cvs/src/sys/net/if_pppx.c,v
> retrieving revision 1.91
> diff -u -p -r1.91 if_pppx.c
> --- sys/net/if_pppx.c 6 Jul 2020 20:37:51 -   1.91
> +++ sys/net/if_pppx.c 10 Jul 2020 11:04:53 -
> @@ -1019,7 +1019,7 @@ RBT_GENERATE(pppx_ifs, pppx_if, pxi_entr
>  
>  struct pppac_softc {
>   struct ifnetsc_if;
> - unsigned intsc_dead;
> + unsigned intsc_ready;
>   dev_t   sc_dev;
>   LIST_ENTRY(pppac_softc)
>   sc_entry;
> @@ -1072,8 +1072,12 @@ pppac_lookup(dev_t dev)
>   struct pppac_softc *sc;
>  
>   LIST_FOREACH(sc, _devs, sc_entry) {
> - if (sc->sc_dev == dev)
> - return (sc);
> + if (sc->sc_dev == dev) {
> + if (sc->sc_ready)
> + return (sc);
> + else
> + break;
> + }
>   }
>  
>   return (NULL);
> @@ -1088,22 +1092,25 @@ pppacattach(int n)
>  int
>  pppacopen(dev_t dev, int flags, int mode, struct proc *p)
>  {
> - struct pppac_softc *sc;
> + struct pppac_softc *sc, *sc_tmp;
>   struct ifnet *ifp;
>  
> - sc = pppac_lookup(dev);
> - if (sc != NULL)
> - return (EBUSY);
> -
>   sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK|M_ZERO);
> +
> + LIST_FOREACH(sc_tmp, _devs, sc_entry) {
> + if (sc_tmp->sc_dev == dev) {
> + free(sc, M_DEVBUF, sizeof(*sc));
> + return (EBUSY);
> + }
> + }
> +
>   sc->sc_dev = dev;
> + LIST_INSERT_HEAD(_devs, sc, sc_entry);
>  
>   mtx_init(>sc_rsel_mtx, IPL_SOFTNET);
>   mtx_init(>sc_wsel_mtx, IPL_SOFTNET);
>   mq_init(>sc_mq, IFQ_MAXLEN, IPL_SOFTNET);
>  
> - LIST_INSERT_HEAD(_devs, sc, sc_entry);
> -
>   ifp = >sc_if;
>   snprintf(ifp->if_xname, sizeof(ifp->if_xname), "pppac%u", minor(dev));
>  
> @@ -1129,6 +1136,7 @@ pppacopen(dev_t dev, int flags, int mode
>  #endif
>  
>   pipex_iface_init(>sc_pipex_iface, ifp);
> + sc->sc_ready = 1;
>  
>   return (0);
>  }
> @@ -1136,12 +1144,14 @@ pppacopen(dev_t dev, int flags, int mode
>  int
>  pppacread(dev_t dev, struct uio *uio, int ioflag)
>  {
> - struct pppac_softc *sc = pppac_lookup(dev);
> + struct pppac_softc *sc;
>   struct ifnet *ifp = >sc_if;
>   struct mbuf *m0, *m;
>   int error = 0;
>   size_t len;
>  
> + if ((sc = pppac_lookup(dev)) == NULL)
> + return (EBADF);
>   if (!ISSET(ifp->if_flags, IFF_RUNNING))
>   return (EHOSTDOWN);
>  
> @@ -1181,12 +1191,14 @@ pppacread(dev_t dev, struct uio *uio, in
>  int
>  pppacwrite(dev_t dev, struct uio *uio, int ioflag)
>  {
> - struct pppac_softc *sc = pppac_lookup(dev);
> + struct pppac_softc *sc;
>   struct ifnet *ifp = >sc_if;
>   uint32_t proto;
>   int error;
>   struct mbuf *m;
>  
> + if ((sc = pppac_lookup(dev)) == NULL)
> + return (EBADF);
>   if (!ISSET(ifp->if_flags, IFF_RUNNING))
>   return (EHOSTDOWN);
>  
> @@ -1258,9 +1270,12 @@ pppacwrite(dev_t dev, struct uio *uio, i
>  int
>  pppacioctl(dev_t dev, u_long cmd, caddr_t data, int flags, struct proc *p)
>  {
> - struct pppac_softc *sc = pppac_lookup(dev);
> + struct pppac_softc *sc;
>   int error = 0;
>  
> + if ((sc = pppac_lookup(dev)) == NULL)
> + return (EBADF);
> +
>   switch (cmd) {
>   case TUNSIFMODE: /* make npppd happy */
>   break;
> @@ -1282,9 +1297,12 @@ pppacioctl(dev_t dev, u_long cmd, caddr_
>  int
>  pppacpoll(dev_t dev, int events, struct proc *p)
>  {
> - struct pppac_softc *sc = pppac_lookup(dev);
> + struct pppac_softc *sc;
>   int revents = 0;
>  
> + if ((sc = pppac_lookup(dev)) == NULL)
> + goto out;
> +
>   if (events & (POLLIN | POLLRDNORM)) {
>   if (!mq_empty(>sc_mq))
>   revents |= events & (POLLIN | POLLRDNORM);
> @@ -1296,17 +1314,20 @@ pppacpoll(dev_t dev, int events, struct 
>   

Re: pipex(4): kill pipexintr()

2020-07-10 Thread Martin Pieuchot
On 07/07/20(Tue) 01:01, Vitaliy Makkoveev wrote:
> On Mon, Jul 06, 2020 at 08:47:23PM +0200, Martin Pieuchot wrote:
> > On 06/07/20(Mon) 19:23, Vitaliy Makkoveev wrote:
> > > > On 6 Jul 2020, at 17:36, Martin Pieuchot  wrote:
> > > [...] 
> > > Unfortunately you can’t be sure about NET_LOCK() status while you are
> > > in pppac_start(). It was described at this thread [1].
> > > 
> > > We have two cases:
> > > 1. pppac_start() called from pppac_output(). NET_LOCK() was inherited.
> > 
> > Such recursions should be avoided.  if_enqueue() should take care of
> > that.
> 
> I suggest to finish the route to if_get(9) before. Updated diff which
> removes pipexintr() below. Just against the most resent source tree.

The tasks are not orthogonal.  Making sure the NET_LOCK() is taken
inside the pipex boundaries helps for this task as well.

That said the current code is not ready for the proposed diff.  At
least `pppx_devs', `pipex_rd_head4' and `pipex_rd_head6' must be
protected/annotated. 

What about all the lists/hashtables?  They aren't annotated, are they
all protected by the NET_LOCK()?

What about `pppx_ifs'?  Is it only used under the KERNEL_LOCK()?

One comment below:

> Index: sys/net/pipex.c
> ===
> RCS file: /cvs/src/sys/net/pipex.c,v
> retrieving revision 1.119
> diff -u -p -r1.119 pipex.c
> --- sys/net/pipex.c   6 Jul 2020 20:37:51 -   1.119
> +++ sys/net/pipex.c   6 Jul 2020 21:55:17 -
> @@ -948,8 +861,26 @@ pipex_ip_output(struct mbuf *m0, struct 
>   m0->m_flags &= ~(M_BCAST|M_MCAST);
>  
>   /* output ip packets to the session tunnel */
> - if (pipex_ppp_enqueue(m0, session, ))
> - goto dropped;
> + if (session->is_multicast != 0) {
> + struct pipex_session *session_tmp;
> + struct mbuf *m;
> +

Please add a NET_ASSERT_LOCKED() here to indicate that `pipex_session_list'
needs it.

> + LIST_FOREACH(session_tmp, _session_list, session_list) {
> + if (session->pipex_iface != session_tmp->pipex_iface)
> + continue;
> + if (session_tmp->ip_forward == 0 &&
> + session_tmp->ip6_forward == 0)
> + continue;
> + m = m_copym(m0, 0, M_COPYALL, M_NOWAIT);
> + if (m == NULL) {
> + session->stat.oerrors++;
> + continue;
> + }
> + pipex_ppp_output(m, session_tmp, PPP_IP);
> + }
> + m_freem(m);
> + } else
> + pipex_ppp_output(m0, session, PPP_IP);
>  
>   return;
>  drop:



Re: pppx_if_output() don't lock `pppx_devs_lk'

2020-07-10 Thread Martin Pieuchot
On 08/07/20(Wed) 12:05, Vitaliy Makkoveev wrote:
> `pppx_devs_lk' used to protect `pxd_entry' list. We lock `pppx_devs_lk'
> in pppx_if_output() to be sure `pxd' is not destroyed by concurrent
> pppxclose() but it's useless. We destroy all corresponding `pxi' before
> `pxd' and `ifnet's are already detached.

This lock seems to only prevent races if malloc(9) sleeps inside
pppxopen().  Could you address that and remove the lock altogether?

What is really protecting the data structure and lifetime of its
elements is the KERNEL_LOCK() currently.

> Index: sys/net/if_pppx.c
> ===
> RCS file: /cvs/src/sys/net/if_pppx.c,v
> retrieving revision 1.91
> diff -u -p -r1.91 if_pppx.c
> --- sys/net/if_pppx.c 6 Jul 2020 20:37:51 -   1.91
> +++ sys/net/if_pppx.c 8 Jul 2020 09:04:31 -
> @@ -957,7 +957,6 @@ pppx_if_output(struct ifnet *ifp, struct
>   th = mtod(m, struct pppx_hdr *);
>   th->pppx_proto = 0; /* not used */
>   th->pppx_id = pxi->pxi_session->ppp_id;
> - rw_enter_read(_devs_lk);
>   error = mq_enqueue(>pxi_dev->pxd_svcq, m);
>   if (error == 0) {
>   if (pxi->pxi_dev->pxd_waiting) {
> @@ -966,7 +965,6 @@ pppx_if_output(struct ifnet *ifp, struct
>   }
>   selwakeup(>pxi_dev->pxd_rsel);
>   }
> - rw_exit_read(_devs_lk);
>   }
>  
>  out:
> 



Re: USB3 stack with async. transfers support

2020-07-08 Thread Martin Pieuchot
On 07/07/20(Tue) 11:13, Martin wrote:
> Hi tech@,
> 
> Not so long ago I've ported UHD driver to support Ettus USRP devices which 
> uses libusb and asynchronous USB3 data transfers.
> Is USB3 async. data stack implemented or planned to have some devices like 
> USRP working?

I'm not aware of anyone working on this at the moment.  A GSoC student
took a stab at implementing an asynchronous interface with the kernel to
submit transfers a couple of years ago.  That work might be a starting
point for someone interested in this task.



Re: pipex(4): kill pipexintr()

2020-07-06 Thread Martin Pieuchot
On 06/07/20(Mon) 19:23, Vitaliy Makkoveev wrote:
> > On 6 Jul 2020, at 17:36, Martin Pieuchot  wrote:
> [...] 
> Unfortunately you can’t be sure about NET_LOCK() status while you are
> in pppac_start(). It was described at this thread [1].
> 
> We have two cases:
> 1. pppac_start() called from pppac_output(). NET_LOCK() was inherited.

Such recursions should be avoided.  if_enqueue() should take care of
that.



Re: pipex(4): kill pipexintr()

2020-07-06 Thread Martin Pieuchot
On 06/07/20(Mon) 16:42, Vitaliy Makkoveev wrote:
> [...] 
> pipex(4) is simultaneously locked by NET_LOCK() and KERNEL_LOCK() but
> with two exceptions:
> 
> 1. As you pointed pipex_pppoe_input() called without KERNEL_LOCK() held.
> 2. pppac_start() called without NET_LOCK() held. Or with NET_LOCK()
>held. It depends on `if_snd' usage.
> 
> Diff below enforces pppac_start() to be called with NET_LOCK() held.
> Also all externally called pipex(4) input and output routines have
> NET_ASSERT_LOCKED() assertion.
> 
> Now pipex(4) is fully protected by NET_LOCK() so description of struct
> members chenget too.
> 
> Index: sys/net/if_pppx.c
> ===
> RCS file: /cvs/src/sys/net/if_pppx.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 if_pppx.c
> --- sys/net/if_pppx.c 24 Jun 2020 08:52:53 -  1.90
> +++ sys/net/if_pppx.c 6 Jul 2020 11:10:17 -
> @@ -1117,6 +1117,8 @@ pppacopen(dev_t dev, int flags, int mode
>   ifp->if_output = pppac_output;
>   ifp->if_start = pppac_start;
>   ifp->if_ioctl = pppac_ioctl;
> + /* XXXSMP: be sure pppac_start() called under NET_LOCK() */
> + IFQ_SET_MAXLEN(>if_snd, 1);

Is it possible to grab the NET_LOCK() inside pppac_start() instead of
grabbing it outside?  This should allow the *start() routine to be called
from any context.

It might be interesting to see that as a difference between the NET_LOCK()
used to protect the network stack internals and the NET_LOCK() used to
protect pipex(4) internals.  Such a distinction might help to convert the
latter into a different lock or primitive.
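
For instance (untested sketch; it assumes the start routine is no longer
entered with the NET_LOCK() already held, e.g. because it is always
deferred):

        void
        pppac_start(struct ifnet *ifp)
        {
                struct mbuf *m;

                NET_LOCK();
                while ((m = ifq_dequeue(&ifp->if_snd)) != NULL) {
                        /* ... existing per-packet processing ... */
                }
                NET_UNLOCK();
        }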

> Index: sys/net/pipex.c
> ===
> RCS file: /cvs/src/sys/net/pipex.c,v
> retrieving revision 1.117
> diff -u -p -r1.117 pipex.c
> --- sys/net/pipex.c   30 Jun 2020 14:05:13 -  1.117
> +++ sys/net/pipex.c   6 Jul 2020 11:10:17 -
> @@ -869,6 +869,7 @@ pipex_output(struct mbuf *m0, int af, in
>   struct ip ip;
>   struct mbuf *mret;
>  
> + NET_ASSERT_LOCKED();

This function doesn't touch any shared data structure; we'd better move
the NET_ASSERT_LOCKED() above rn_lookup() in pipex_lookup_by_ip_address().

Note that `pipex_rd_head4' and `pipex_rd_head6' are, with this diff,
also protected by the NET_LOCK() and should be annotated as such.

>   session = NULL;
>   mret = NULL;
>   switch (af) {
> @@ -962,6 +963,8 @@ pipex_ppp_output(struct mbuf *m0, struct
>  {
>   u_char *cp, hdr[16];
>  
> + NET_ASSERT_LOCKED();

Same here, it seems that the only reason the NET_LOCK() is necessary in
the output path is to prevent corruption of the `session' descriptor being
used.  So we'd rather put the assertion above the LIST_FOREACH(). 

Anyway all of those can be addressed later, your diff is ok mpi@



Re: fix races in if_clone_create()

2020-07-06 Thread Martin Pieuchot
On 01/07/20(Wed) 00:02, Vitaliy Makkoveev wrote:
> On Tue, Jun 30, 2020 at 03:48:22PM +0300, Vitaliy Makkoveev wrote:
> > On Tue, Jun 30, 2020 at 12:08:03PM +0200, Martin Pieuchot wrote:
> > > On 29/06/20(Mon) 11:59, Vitaliy Makkoveev wrote:
> > > > [...] 
> > > > I reworked tool for reproduce. Now I avoided fork()/exec() route and it
> > > > takes couple of minutes to take panic on 4 cores. Also some screenshots
> > > > attached.
> > > 
> > > Setting kern.pool_debug=2 makes the race reproducible in seconds.
> 
> Unfortunately you will catch splassert() caused by kern/sched_bsd.c:304.
> malloc() will call yield() while we are holding NET_LOCK(). I attached
> screenshot with splassertion to this mail.

With kern.splassert < 3 it is fine. 

> > > Could you turn this test into something committable in regress/?  We can
> > > link it to the build once a fix is committed.
> > > 
> > 
> > We have 3 races with cloned interfaces:
> > 1. if_clone_create() vs if_clone_create()
> > 2. if_clone_destroy() vs if_clone_destroy()
> > 3. if_clone_destroy() vs the rest of stack
> > 
> > It makes sences to commit unified test to regress/, so I suggest to wait
> > a little.
> 
> The another solution.
> 
> Diff below introduces per-`ifc' serialization for if_clone_create() and
> if_clone_destroy(). There is no index bitmap anymore.

I like the simplification.  More comments below:

> +/*
> + * Lock a clone network interface.
> + */
> +int
> +if_clone_lock(struct if_clone *ifc)
> +{
> + int error;
> +
> + rw_enter_write(>ifc_lock);
> +
> + while (ifc->ifc_flags & IFC_CREATE_LOCKED) {
> + ifc->ifc_flags |= IFC_CREATE_LOCKWAIT;
> + error = rwsleep_nsec(&ifc->ifc_flags, &ifc->ifc_lock,
> + PWAIT|PCATCH, "ifclk", INFSLP);
> + if(error != 0) {
> + ifc->ifc_flags &= ~IFC_CREATE_LOCKWAIT;
> + rw_exit_write(&ifc->ifc_lock);
> + return error;
> + }
> + }
> + ifc->ifc_flags |= IFC_CREATE_LOCKED;
> + ifc->ifc_flags &= ~IFC_CREATE_LOCKWAIT;
> +
> + rw_exit_write(&ifc->ifc_lock);
> + 
> + return 0;
> +}

This is like re-implementing a rwlock but losing the debugging ability of
WITNESS.
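
A rough sketch of what I mean, using a plain rwlock so WITNESS still sees it;
the lock name and the placeholder body are only illustrative:

struct rwlock if_cloners_lock = RWLOCK_INITIALIZER("ifclonerlk");

int
if_clone_create(const char *name, int rdomain)
{
	int error = 0;

	rw_enter_write(&if_cloners_lock);
	/* the existing lookup/create/insert-into-`ifnet' path would run here */
	rw_exit_write(&if_cloners_lock);

	return (error);
}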

I also don't see any reason for having a per-ifc lock.  If at least one
of the problems is a double insert into `ifnet' then we should be able to
assert that a lock is held when doing such an insertion.

Assertions and documentation are more important than preventing races
because they allow us to build awareness and elegant solutions instead of
hacking at diffs until stuff works without knowing why.

There are two cases where `ifp' are inserted into `ifnet':
 1. by autoconf during boot or hotplug
 2. by cloning ioctl

In the second case it is always about pseudo-devices.  So the assertion
should be conditional like:

	if (ISSET(ifp->if_xflags, IFXF_CLONED))
		rw_assert_wrlock(_lock);

In other words this fix serializes insertions/removals on the global
list `ifnet', the KERNEL_LOCK() still being required for reading it.

Is there any other data structure which ends up being protected by this
approach and could be documented?



Re: pipex(4): kill pipexintr()

2020-07-06 Thread Martin Pieuchot
On 01/07/20(Wed) 22:42, Vitaliy Makkoveev wrote:
> pipex(4) has 2 mbuf queues: `pipexinq' and `pipexoutq'. When an mbuf is
> passed to pipex it goes to one of these queues and pipexintr() is scheduled
> to process them. pipexintr() is called from the `netisr' context.
> 
> That's true for pppac(4), but for pppx(4) only incoming mbufs go to
> `pipexinq'. Outgoing mbufs go directly to the stack. pppx(4) is enabled in
> npppd.conf(5) by default so I guess it's the common case of pipex(4)
> usage.
> 
> The code looks like there is no requirement for this delayed mbuf
> processing; we can pass them directly to the stack as we do for pppx(4)
> outgoing traffic.
> 
> Also we have some troubles with pipexintr(), as described in [1], namely
> the protection of `ph_cookie'. We don't have this protection at the moment
> and we can't add it without breaking the if_get(9) logic.
> 
> The diff below removes pipexintr(). Now all mbufs are passed directly
> without enqueueing within pipex(4). We can also destroy sessions safely in
> all cases. We can also use if_get(9) instead of using unreferenced pointers
> to `ifnet' within pipex(4). We also avoid a context switch while processing
> mbufs within pipex(4), which decreases latency.
> 
> I'm seeding debian torrents with this diff and all goes well.

With this diff the content of pipexintr() is no longer executed with the
KERNEL_LOCK() held.  This can be seen by following the code starting in
ether_input().

Grabbing the KERNEL_LOCK() there is not a way forward.  The whole idea
of if_input_process() is to be free of the KERNEL_LOCK() so as not to
introduce latency.

So this changes implies that `pipex_session_list' and possibly other
global data structures as well as the elements linked in those are all
protected by the NET_LOCK().  I believe this is the easiest way forward.

That said I would be comfortable with this diff going in if an audit of
the data structures accessed in the code path starting at pipex_pppoe_input()
has been done.  That implies annotating/documenting which data structures
are now protected by the NET_LOCK() and adding the necessary
NET_ASSERT_LOCKED().  Such an audit might lead us to consider changing some
ioctl code paths to serialize on the NET_LOCK() instead of the
KERNEL_LOCK().
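
The outcome of such an audit would look something like the sketch below; the
list head, entry field and function names are simplified, not the real
pipex.c layout:

/*
 * Locks used to protect global data:
 *	N	net lock
 */
LIST_HEAD(, pipex_session) pipex_session_list;	/* [N] all sessions */

void
pipex_session_foreach(void (*fn)(struct pipex_session *))
{
	struct pipex_session *session;

	NET_ASSERT_LOCKED();	/* the list now relies on the NET_LOCK() */
	LIST_FOREACH(session, &pipex_session_list, session_entry)
		fn(session);
}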

> 1. https://marc.info/?t=15930080902=1=2
> 
> Index: lib/libc/sys/sysctl.2
> ===
> RCS file: /cvs/src/lib/libc/sys/sysctl.2,v
> retrieving revision 1.40
> diff -u -p -r1.40 sysctl.2
> --- lib/libc/sys/sysctl.2 17 May 2020 05:48:39 -  1.40
> +++ lib/libc/sys/sysctl.2 1 Jul 2020 19:20:22 -
> @@ -2033,35 +2033,11 @@ The currently defined variable names are
>  .Bl -column "Third level name" "integer" "Changeable" -offset indent
>  .It Sy "Third level name" Ta Sy "Type" Ta Sy "Changeable"
>  .It Dv PIPEXCTL_ENABLE Ta integer Ta yes
> -.It Dv PIPEXCTL_INQ Ta node Ta not applicable
> -.It Dv PIPEXCTL_OUTQ Ta node Ta not applicable
>  .El
>  .Bl -tag -width "123456"
>  .It Dv PIPEXCTL_ENABLE
>  If set to 1, enable PIPEX processing.
>  The default is 0.
> -.It Dv PIPEXCTL_INQ Pq Va net.pipex.inq
> -Fourth level comprises an array of
> -.Vt struct ifqueue
> -structures containing information about the PIPEX packet input queue.
> -The forth level names for the elements of
> -.Vt struct ifqueue
> -are the same as described in
> -.Li ip.arpq
> -in the
> -.Dv PF_INET
> -section.
> -.It Dv PIPEXCTL_OUTQ Pq Va net.pipex.outq
> -Fourth level comprises an array of
> -.Vt struct ifqueue
> -structures containing information about PIPEX packet output queue.
> -The forth level names for the elements of
> -.Vt struct ifqueue
> -are the same as described in
> -.Li ip.arpq
> -in the
> -.Dv PF_INET
> -section.
>  .El
>  .El
>  .Ss CTL_VFS
> Index: sys/net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.611
> diff -u -p -r1.611 if.c
> --- sys/net/if.c  30 Jun 2020 09:31:38 -  1.611
> +++ sys/net/if.c  1 Jul 2020 19:20:27 -
> @@ -1012,13 +1012,6 @@ if_netisr(void *unused)
>   KERNEL_UNLOCK();
>   }
>  #endif
> -#ifdef PIPEX
> - if (n & (1 << NETISR_PIPEX)) {
> - KERNEL_LOCK();
> - pipexintr();
> - KERNEL_UNLOCK();
> - }
> -#endif
>   t |= n;
>   }
>  
> Index: sys/net/netisr.h
> ===
> RCS file: /cvs/src/sys/net/netisr.h,v
> retrieving revision 1.51
> diff -u -p -r1.51 netisr.h
> --- sys/net/netisr.h  6 Aug 2019 22:57:54 -   1.51
> +++ sys/net/netisr.h  1 Jul 2020 19:20:27 -
> @@ -48,7 +48,6 @@
>  #define  NETISR_IPV6 24  /* same as AF_INET6 */
>  #define  NETISR_ISDN 26  /* same as AF_E164 */
>  #define  NETISR_PPP  28  /* for PPP processing */
> -#define  NETISR_PIPEX27  /* for 

Re: fix races in if_clone_create()

2020-06-30 Thread Martin Pieuchot
On 29/06/20(Mon) 11:59, Vitaliy Makkoveev wrote:
> [...] 
> I reworked the tool to reproduce it.  Now I avoided the fork()/exec() route
> and it takes a couple of minutes to trigger the panic on 4 cores.  Some
> screenshots are also attached.

Setting kern.pool_debug=2 makes the race reproducible in seconds.

Could you turn this test into something committable in regress/?  We can
link it to the build once a fix is committed.
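
For reference, that knob can simply be flipped at runtime, and turned back
off afterwards since it slows pool allocations down:

	# sysctl kern.pool_debug=2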

> #include <sys/types.h>
> #include <sys/socket.h>
> #include <sys/ioctl.h>
> #include <sys/sockio.h>
> #include <net/if.h>
> #include <err.h>
> #include <errno.h>
> #include <pthread.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> static struct ifreq ifr;
> 
> static void *clone_create(void *arg)
> {
>   int s;
> 
>   if((s=socket(AF_INET, SOCK_DGRAM, 0))<0)
>   err(1, "socket()");
>   while(1){
>   if(ioctl(s, SIOCIFCREATE, &ifr)<0)
>   if(errno==EINVAL)
>   exit(1);
>   }
> 
>   return NULL;
> }
> 
> static void *clone_destroy(void *arg)
> {
>   int s;
> 
>   if((s=socket(AF_INET, SOCK_DGRAM, 0))<0)
>   err(1, "socket()");
>   while(1){
>   if(ioctl(s, SIOCIFDESTROY, &ifr)<0)
>   if(errno==EINVAL)
>   exit(1);
>   }
> 
>   return NULL;
> }
> 
> int main(int argc, char *argv[])
> {
>   pthread_t thr;
>   int i;
> 
>   if(argc!=2){
>   fprintf(stderr, "usage: %s ifname\n", getprogname());
>   return 1;
>   }
> 
>   if(getuid()!=0){
>   fprintf(stderr, "should be root\n");
>   return 1;
>   }
> 
>   memset(&ifr, 0, sizeof(ifr));
>   strlcpy(ifr.ifr_name, argv[1], sizeof(ifr.ifr_name));
> 
>   for(i=0; i<8*4; ++i){
>   if(pthread_create(&thr, NULL, clone_create, NULL)!=0)
>   errx(1, "pthread_create(clone_create)");
>   }
> 
>   clone_destroy(NULL);
> 
>   return 0;
> }
> 
>  cut end 


Re: route add ::/0 ...

2020-06-29 Thread Martin Pieuchot
On 28/06/20(Sun) 20:41, YASUOKA Masahiko wrote:
> Hi,
> 
> When "::/0" is used as "default",
> 
>   # route add ::/0 fe80::1%em0
>   add net ::/0: gateway fe80::1%em0: Invalid argument
> 
> the route command trims the sockaddr to { .len = 2, .family = AF_INET6 }
> for "::/0", but rtable_satoplen() refuses it.  I think it should be
> accepted.

rtable_satoplen() is used in many places, not just in the socket parsing
code used by route(8).  I don't know what side effects can be introduced
by this change.

Why is IPv6 different from IPv4 when it comes to the default route?
Shouldn't we change route(8) to have a `sa_len' of 0?  That would make
the following true:

	mlen = mask->sa_len;

	/* Default route */
	if (mlen == 0)
		return (0);

> Allow sockaddr for prefix length be trimmed before the key(address)
> field.  Actually "route" command trims at the address family field for
> "::/0"
> 
> Index: sys/net/rtable.c
> ===
> RCS file: /cvs/src/sys/net/rtable.c,v
> retrieving revision 1.69
> diff -u -p -r1.69 rtable.c
> --- sys/net/rtable.c  21 Jun 2019 17:11:42 -  1.69
> +++ sys/net/rtable.c  28 Jun 2020 11:30:54 -
> @@ -887,8 +887,8 @@ rtable_satoplen(sa_family_t af, struct s
>  
>   ap = (uint8_t *)((uint8_t *)mask) + dp->dom_rtoffset;
>   ep = (uint8_t *)((uint8_t *)mask) + mlen;
> - if (ap > ep)
> - return (-1);
> + if (ap >= ep)
> + return (0);

That means the kernel now silently ignores sockaddrs with a short `sa_len'.
Are they supposed to be supported or are they symptoms of bugs?

>   /* Trim trailing zeroes. */
>   while (ap < ep && ep[-1] == 0)



Re: pipex(4): use reference counters for `ifnet'

2020-06-28 Thread Martin Pieuchot
On 27/06/20(Sat) 17:58, Vitaliy Makkoveev wrote:
> > [...] 
> > Look at r1.329 of net/if.c.  Prior to this change if_detach_queues() was
> > used to free all mbufs when an interface was removed.  Now lazy freeing
> > is used every time if_get(9) returns NULL.
> > 
> > This is possible because we store an index and not a pointer directly in
> > the mbuf.
> > 
> > The advantage of storing a session pointer in `ph_cookie' is that no
> > lookup is required in pipexintr(), right?  Maybe we could save an ID
> > instead and do a lookup.  How big can the `pipex_session_list' be?
> >
> 
> It's unlimited. In the pppac(4) case you create only one interface and
> you can share it between as many sessions as you wish. In my practice
> I had machines with 800+ active ppp interfaces in 2005. We can have
> dozens of cores and hundreds of gigs of RAM now. How big can the real
> count of active ppp interfaces on a VPN provider's NAS be?

With that number of items a linear list might not be the best fit if we
decide to stop using pointers.  So if we want to use a "lookup" a
different data structure might be more appropriate.
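
For example an RB tree keyed on the session ID; only a sketch, the
`tree_entry' field is hypothetical and would have to be added to
struct pipex_session as an RBT_ENTRY:

RBT_HEAD(pipex_session_tree, pipex_session);

static inline int
pipex_session_cmp(const struct pipex_session *a, const struct pipex_session *b)
{
	return (a->session_id - b->session_id);
}

RBT_PROTOTYPE(pipex_session_tree, pipex_session, tree_entry, pipex_session_cmp);
RBT_GENERATE(pipex_session_tree, pipex_session, tree_entry, pipex_session_cmp);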

> I looked at r1.328 of net/if.c.
> Imagine we have a lot of connected clients with high traffic. One of them
> starts connect/disconnect games. It's not a malicious hacker, just a mobile
> phone in an area with low signal. You need to:
> 
> 1. block pipexintr() (and netisr too).
> 2. block insertion to queues from concurrent threads.
> 3. Walk through very loaded queues and compare. And most or maybe all
>    packets are foreign.
> 4. Repeat it every time a connection is lost.
> 
> Yes, now it's all serialized. And pipexintr() is already blocked while
> we do session destruction. But what is the reason to make your future
> life harder? A pipex(4) session already has a pointer to its related
> `ifnet'. `ifnet' already has reference counters. Why not use them?

We decided as a team not to use reference counting because it is
hard to debug.  Using if_get(9)/if_put(9) allows us to check that our
changes were correct with static analysis tools.

If we start using reference counting for ifps in pipex(4) then we now
have an exception in the network stack.  That means the techniques
applied are not coherent, which makes it harder to work across a huge
amount of code.

But I don't care, if there's a consensus that it is the way to go, then
go for it.

> In the way I suggest to use refcounters for the pipex(4) session you don't
> need to block your packet processing. You don't need to do searches. You
> just need to be sure the `ifnet' you have is still alive. You already have
> the mechanics for that. Why not use them?

The same can be said of the actual machinery.  Using an if_get(9)-like
approach doesn't block packet processing, you don't need to search,
there's no reference counting bug that can lead to a deadlock, it is
consistent with the existing network stack and there are already examples
of implementations: if_get() and rtable_get().
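
The idiom looks like this; a generic sketch, the function name is made up
and not pipex-specific:

int
example_output(struct mbuf *m)
{
	struct ifnet *ifp;

	ifp = if_get(m->m_pkthdr.ph_ifidx);	/* NULL once the ifp is gone */
	if (ifp == NULL) {
		m_freem(m);
		return (ENXIO);
	}
	/* `ifp' can safely be dereferenced until if_put(). */
	if_put(ifp);

	return (0);
}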

> You don't like if_ref() being
> used outside net/if.c? Ok, what's wrong with referenced pointers
> obtained by if_get(9)? While a session is alive it *uses* its `ifnet'. It
> uses the `ifnet' for its whole lifetime, *not* only while doing output. And
> we should be sure we don't destroy the `ifnet' while someone is using it.
> What's wrong with referenced pointers?

If there's a bug, if the reference counting is messed up, it is hard to
find where that happened.  It is like a use-after-free: where the crash
happens is not where the bug is.  If you see a leak you don't know where
the reference drop is missing.

Sure, there are many ways to deal with that, but since we did not embrace
them, I'm not sure that it makes sense to change direction or introduce
a difference now.

>  The way I wish to go is used in file(9). Maybe it is
> totally wrong and we need a global in-kernel descriptor table where `fp'
> will be referenced by index too :) ?

That's not what I'm saying.  if_get(9) already exists and is already
used in a certain way; I don't see why we wouldn't embrace that and keep
the network stack coherent.

But once again, I don't want to block anyone's development; if you
and others agree that's the way to go, then let's go this way.


