Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-23 Thread Martin Pieuchot
On 23/11/22(Wed) 16:34, Mark Kettenis wrote:
> > Date: Wed, 23 Nov 2022 10:52:32 +0100
> > From: Martin Pieuchot 
> > 
> > On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > > > Date: Tue, 22 Nov 2022 17:47:44 +
> > > > From: Miod Vallat 
> > > > 
> > > > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
> > > > > triggered the original "vref used where vget required" problem?
> > > > 
> > > > On a similar machine it panics after a few hours with:
> > > > 
> > > > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > > > 
> > > > The trace (transcribed by hand) is
> > > > uvn_flush+0x820
> > > > uvm_vnp_terminate+0x79
> > > > vclean+0xdc
> > > > vgonel+0x70
> > > > getnewvnode+0x240
> > > > ffs_vget+0xcc
> > > > ffs_inode_alloc+0x13c
> > > > ufs_makeinode+0x94
> > > > ufs_create+0x58
> > > > VOP_CREATE+0x48
> > > > vn_open+0x188
> > > > doopenat+0x1b4
> > > 
> > > Ah right, there is another path where we end up with a refcount of
> > > zero.  Should be fixable, but I need to think about this for a bit.
> > 
> > I'm not sure I understand what you mean by a refcount of 0.  Could you
> > elaborate?
> 
> Sorry, I was thinking ahead a bit.  I'm pretty much convinced that the
> issue we're dealing with is a race between a vnode being
> recycled/cleaned and the pagedaemon paging out pages associated with
> that same vnode.
> 
> The crashes we've seen before were all in the pagedaemon path where we
> end up calling into the VFS layer with a vnode that has v_usecount ==
> 0.  My "fix" avoids that, but hits the issue that when we are in the
> codepath that is recycling/cleaning the vnode, we can't use vget() to
> get a reference to the vnode since it checks that the vnode isn't in
> the process of being cleaned.
> 
> But if we avoid that issue (by, for example, skipping the vget() call
> if the UVM_VNODE_DYING flag is set), we run into the same scenario
> where we call into the VFS layer with v_usecount == 0.  Now that may
> not actually be a problem, but I need to investigate this a bit more.

When the UVM_VNODE_DYING flag is set the caller always owns a valid
reference to the vnode: either it is in the process of cleaning it via
uvm_vnp_terminate(), or uvn_detach() has been called, which means the
reference to the vnode hasn't been dropped yet.  So I believe
`v_usecount' for such a vnode is positive.
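
(Purely as an illustration of that invariant, not something from any posted
diff: if uvn_io() were to special-case a dying uvn instead of calling
vget(), it could assert the reference it relies on.)

	/* sketch only: a DYING uvn implies its caller still holds a vnode ref */
	if (uvn->u_flags & UVM_VNODE_DYING) {
		/* uvm_vnp_terminate() or uvn_detach() hasn't dropped it yet */
		KASSERT(vp->v_usecount > 0);
	}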

> Or maybe calling into the VFS layer with a vnode that has v_usecount
> == 0 is perfectly fine and we should do the vget() dance I propose in
> uvm_vnp_uncache() instead of in uvn_put().

I'm not following.  uvm_vnp_uncache() is always called with a valid
vnode, no?



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-23 Thread Martin Pieuchot
On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > Date: Tue, 22 Nov 2022 17:47:44 +
> > From: Miod Vallat 
> > 
> > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
> > > triggered the original "vref used where vget required" problem?
> > 
> > On a similar machine it panics after a few hours with:
> > 
> > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > 
> > The trace (transcribed by hand) is
> > uvn_flush+0x820
> > uvm_vnp_terminate+0x79
> > vclean+0xdc
> > vgonel+0x70
> > getnewvnode+0x240
> > ffs_vget+0xcc
> > ffs_inode_alloc+0x13c
> > ufs_makeinode+0x94
> > ufs_create+0x58
> > VOP_CREATE+0x48
> > vn_open+0x188
> > doopenat+0x1b4
> 
> Ah right, there is another path where we end up with a refcount of
> zero.  Should be fixable, but I need to think about this for a bit.

I'm not sure I understand what you mean by a refcount of 0.  Could you
elaborate?

My understanding of the reported panic is that the proposed diff creates
a complicated relationship between the vnode and the UVM vnode layer.
The above problem occurs because VXLOCK is set on a vnode being recycled
*before* uvm_vnp_terminate() is called.  Now that uvn_io() calls vget(9),
the call fails because VXLOCK is set, which is what we want during
vclean(9).
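
(For context, a simplified sketch of the check in vget(9) that makes the
call fail here; the real code in kern/vfs_subr.c also handles waiting and
other flags, so treat this as illustrative only.)

	/* simplified: vget(9) refuses vnodes being cleaned out */
	if (vp->v_flag & VXLOCK) {
		/* vclean(9) is running; don't hand out a new reference */
		return (ENOENT);
	}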



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-22 Thread Martin Pieuchot
On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > From: Mark Kettenis 
> > 
> > > From: Jeremie Courreges-Anglas 
> > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > > 
> > > On Tue, Nov 15 2022, Martin Pieuchot  wrote:
> > > > UVM vnode objects include a reference count to keep track of the number
> > > > of processes that have the corresponding pages mapped in their VM space.
> > > >
> > > > When the last process referencing a given library or executable dies,
> > > > the reaper will munmap this object on its behalf.  When this happens it
> > > > doesn't free the associated pages to speed-up possible re-use of the
> > > > file.  Instead the pages are placed on the inactive list but stay ready
> > > > to be pmap_enter()'d without requiring I/O as soon as a new process
> > > > needs to access them.
> > > >
> > > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > > doesn't work well with swapping [0].  For some reason when the page 
> > > > daemon
> > > > wants to free pages on the inactive list it tries to flush the pages to
> > > > disk and calls panic(9) because it needs a valid reference to the vnode to do
> > > > so.
> > > >
> > > > This indicates that the mechanism described above, which seems to work
> > > > fine for RO mappings, is currently buggy in more complex situations.
> > > > Flushing the pages when the last reference of the UVM object is dropped
> > > > also doesn't seem to be enough as bluhm@ reported [1].
> > > >
> > > > The diff below, which has already been committed and reverted, gets rid of
> > > > the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
> > > > the arm64 caching bug has been found and fixed.
> > > >
> > > > Getting rid of this logic means more I/O will be generated and pages
> > > > might have a faster reuse cycle.  I'm aware this might introduce a small
> > > > slowdown,
> > > 
> > > Numbers for my usual make -j4 in libc,
> > > on an Unmatched riscv64 box, now:
> > >16m32.65s real21m36.79s user30m53.45s system
> > >16m32.37s real21m33.40s user31m17.98s system
> > >16m32.63s real21m35.74s user31m12.01s system
> > >16m32.13s real21m36.12s user31m06.92s system
> > > After:
> > >19m14.15s real21m09.39s user36m51.33s system
> > >19m19.11s real21m02.61s user36m58.46s system
> > >19m21.77s real21m09.23s user37m03.85s system
> > >19m09.39s real21m08.96s user36m36.00s system
> > > 
> > > 4 cores amd64 VM, before (-current plus an other diff):
> > >1m54.31s real 2m47.36s user 4m24.70s system
> > >1m52.64s real 2m45.68s user 4m23.46s system
> > >1m53.47s real 2m43.59s user 4m27.60s system
> > > After:
> > >2m34.12s real 2m51.15s user 6m20.91s system
> > >2m34.30s real 2m48.48s user 6m23.34s system
> > >2m37.07s real 2m49.60s user 6m31.53s system
> > > 
> > > > however I believe we should work towards loading files from the
> > > > buffer cache to save I/O cycles instead of having another layer of 
> > > > cache.
> > > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > > simple and well understood is the first step in this direction.
> > > >
> > > > I'd appreciate it if the diff below could be tested on many architectures,
> > > > including the offending rpi4.
> > > 
> > > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > > far as I can see this looks stable on my machine, but what I really care
> > > about is the riscv64 bulk build cluster (I'm going to start another
> > > bulk build soon).
> > > 
> > > > Comments?  Oks?
> > > 
> > > The performance drop in my microbenchmark kinda worries me but it's only
> > > a microbenchmark...
> > 
> > I wouldn't call this a microbenchmark.  I fear this is typical for
> > builds of anything on clang architectures.  And I expect it to be
> > worse on single-processor machines where *every* time we execute clang
> > or lld all the pages are thrown 

Get rid of UVM_VNODE_CANPERSIST

2022-11-15 Thread Martin Pieuchot
UVM vnode objects include a reference count to keep track of the number
of processes that have the corresponding pages mapped in their VM space.

When the last process referencing a given library or executable dies,
the reaper will munmap this object on its behalf.  When this happens it
doesn't free the associated pages to speed-up possible re-use of the
file.  Instead the pages are placed on the inactive list but stay ready
to be pmap_enter()'d without requiring I/O as soon as a new process
needs to access them.

The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
doesn't work well with swapping [0].  For some reason when the page daemon
wants to free pages on the inactive list it tries to flush the pages to
disk and calls panic(9) because it needs a valid reference to the vnode to do
so.

This indicates that the mechanism described above, which seems to work
fine for RO mappings, is currently buggy in more complex situations.
Flushing the pages when the last reference of the UVM object is dropped
also doesn't seem to be enough as bluhm@ reported [1].

The diff below, which has already been committed and reverted, gets rid of
the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
the arm64 caching bug has been found and fixed.

Getting rid of this logic means more I/O will be generated and pages
might have a faster reuse cycle.  I'm aware this might introduce a small
slowdown, however I believe we should work towards loading files from the
buffer cache to save I/O cycles instead of having another layer of cache.
Such work isn't trivial and making sure the vnode <-> UVM relation is
simple and well understood is the first step in this direction.

I'd appreciate it if the diff below could be tested on many architectures,
including the offending rpi4.

Comments?  Oks?

[0] https://marc.info/?l=openbsd-bugs&m=164846737707559&w=2
[1] https://marc.info/?l=openbsd-bugs&m=166843373415030&w=2

Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.130
diff -u -p -r1.130 uvm_vnode.c
--- uvm/uvm_vnode.c 20 Oct 2022 13:31:52 -  1.130
+++ uvm/uvm_vnode.c 15 Nov 2022 13:28:28 -
@@ -161,11 +161,8 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 * add it to the writeable list, and then return.
 */
if (uvn->u_flags & UVM_VNODE_VALID) {   /* already active? */
+   KASSERT(uvn->u_obj.uo_refs > 0);
 
-   /* regain vref if we were persisting */
-   if (uvn->u_obj.uo_refs == 0) {
-   vref(vp);
-   }
uvn->u_obj.uo_refs++;   /* bump uvn ref! */
 
/* check for new writeable uvn */
@@ -235,14 +232,14 @@ uvn_attach(struct vnode *vp, vm_prot_t a
KASSERT(uvn->u_obj.uo_refs == 0);
uvn->u_obj.uo_refs++;
oldflags = uvn->u_flags;
-   uvn->u_flags = UVM_VNODE_VALID|UVM_VNODE_CANPERSIST;
+   uvn->u_flags = UVM_VNODE_VALID;
uvn->u_nio = 0;
uvn->u_size = used_vnode_size;
 
/*
 * add a reference to the vnode.   this reference will stay as long
 * as there is a valid mapping of the vnode.   dropped when the
-* reference count goes to zero [and we either free or persist].
+* reference count goes to zero.
 */
vref(vp);
 
@@ -323,16 +320,6 @@ uvn_detach(struct uvm_object *uobj)
 */
vp->v_flag &= ~VTEXT;
 
-   /*
-* we just dropped the last reference to the uvn.   see if we can
-* let it "stick around".
-*/
-   if (uvn->u_flags & UVM_VNODE_CANPERSIST) {
-   /* won't block */
-   uvn_flush(uobj, 0, 0, PGO_DEACTIVATE|PGO_ALLPAGES);
-   goto out;
-   }
-
/* its a goner! */
uvn->u_flags |= UVM_VNODE_DYING;
 
@@ -382,7 +369,6 @@ uvn_detach(struct uvm_object *uobj)
/* wake up any sleepers */
if (oldflags & UVM_VNODE_WANTED)
wakeup(uvn);
-out:
rw_exit(uobj->vmobjlock);
 
/* drop our reference to the vnode. */
@@ -498,8 +484,8 @@ uvm_vnp_terminate(struct vnode *vp)
}
 
/*
-* done.   now we free the uvn if its reference count is zero
-* (true if we are zapping a persisting uvn).   however, if we are
+* done.   now we free the uvn if its reference count is zero.
+* however, if we are
 * terminating a uvn with active mappings we let it live ... future
 * calls down to the vnode layer will fail.
 */
@@ -507,14 +493,14 @@ uvm_vnp_terminate(struct vnode *vp)
if (uvn->u_obj.uo_refs) {
/*
 * uvn must live on it is dead-vnode state until all references
-* are gone.   restore flags.clear CANPERSIST state.
+* are gone.   restore flags.
 */
uvn->u_flags &= 

btrace: string comparison in filters

2022-11-11 Thread Martin Pieuchot
Diff below adds support for the following common idiom:

syscall:open:entry
/comm == "ksh"/
{
...
}

String comparison is tricky as it can be combined with any other
expression in filters, like:

syscall:mmap:entry
/comm == "cc" && pid != 4589/
{
...
}

I don't have the energy to change the parser, so I went for the easy
solution of treating any "stupid" string comparison as 'true', albeit
printing a warning.  I'd love it if somebody with some yacc knowledge
could come up with a better solution.

ok?

Index: usr.sbin/btrace/bt_parse.y
===
RCS file: /cvs/src/usr.sbin/btrace/bt_parse.y,v
retrieving revision 1.46
diff -u -p -r1.46 bt_parse.y
--- usr.sbin/btrace/bt_parse.y  28 Apr 2022 21:04:24 -  1.46
+++ usr.sbin/btrace/bt_parse.y  11 Nov 2022 14:34:37 -
@@ -218,6 +218,7 @@ variable: lvar  { $$ = bl_find($1); }
 factor : '(' expr ')'  { $$ = $2; }
| NUMBER{ $$ = ba_new($1, B_AT_LONG); }
| BUILTIN   { $$ = ba_new(NULL, $1); }
+   | CSTRING   { $$ = ba_new($1, B_AT_STR); }
| staticv
| variable
| mentry
Index: usr.sbin/btrace/btrace.c
===
RCS file: /cvs/src/usr.sbin/btrace/btrace.c,v
retrieving revision 1.64
diff -u -p -r1.64 btrace.c
--- usr.sbin/btrace/btrace.c11 Nov 2022 10:51:39 -  1.64
+++ usr.sbin/btrace/btrace.c11 Nov 2022 14:44:15 -
@@ -434,14 +434,23 @@ rules_setup(int fd)
struct bt_rule *r, *rbegin = NULL;
struct bt_probe *bp;
struct bt_stmt *bs;
+   struct bt_arg *ba;
int dokstack = 0, on = 1;
uint64_t evtflags;
 
TAILQ_FOREACH(r, &g_rules, br_next) {
evtflags = 0;
-   SLIST_FOREACH(bs, &r->br_action, bs_next) {
-   struct bt_arg *ba;
 
+   if (r->br_filter != NULL &&
+   r->br_filter->bf_condition != NULL)  {
+
+   bs = r->br_filter->bf_condition;
+   ba = SLIST_FIRST(&bs->bs_args);
+
+   evtflags |= ba2dtflags(ba);
+   }
+
+   SLIST_FOREACH(bs, &r->br_action, bs_next) {
SLIST_FOREACH(ba, &bs->bs_args, ba_next)
evtflags |= ba2dtflags(ba);
 
@@ -1175,6 +1184,36 @@ baexpr2long(struct bt_arg *ba, struct dt
lhs = ba->ba_value;
rhs = SLIST_NEXT(lhs, ba_next);
 
+   /*
+* String comparisons also use '==' and '!='.
+*/
+   if (lhs->ba_type == B_AT_STR ||
+   (rhs != NULL && rhs->ba_type == B_AT_STR)) {
+   char lstr[STRLEN], rstr[STRLEN];
+
+   strlcpy(lstr, ba2str(lhs, dtev), sizeof(lstr));
+   strlcpy(rstr, ba2str(rhs, dtev), sizeof(rstr));
+
+   result = strncmp(lstr, rstr, STRLEN) == 0;
+
+   switch (ba->ba_type) {
+   case B_AT_OP_EQ:
+   break;
+   case B_AT_OP_NE:
+   result = !result;
+   break;
+   default:
+   warnx("operation '%d' unsupported on strings",
+   ba->ba_type);
+   result = 1;
+   }
+
+   debug("ba=%p eval '(%s %s %s) = %d'\n", ba, lstr, ba_name(ba),
+  rstr, result);
+
+   goto out;
+   }
+
lval = ba2long(lhs, dtev);
if (rhs == NULL) {
rval = 0;
@@ -1233,9 +1272,10 @@ baexpr2long(struct bt_arg *ba, struct dt
xabort("unsupported operation %d", ba->ba_type);
}
 
-   debug("ba=%p eval '%ld %s %ld = %d'\n", ba, lval, ba_name(ba),
+   debug("ba=%p eval '(%ld %s %ld) = %d'\n", ba, lval, ba_name(ba),
   rval, result);
 
+out:
--recursions;
 
return result;
@@ -1245,10 +1285,15 @@ const char *
 ba_name(struct bt_arg *ba)
 {
switch (ba->ba_type) {
+   case B_AT_STR:
+   return (const char *)ba->ba_value;
+   case B_AT_LONG:
+   return ba2str(ba, NULL);
case B_AT_NIL:
return "0";
case B_AT_VAR:
case B_AT_MAP:
+   case B_AT_HIST:
break;
case B_AT_BI_PID:
return "pid";
@@ -1326,7 +1371,8 @@ ba_name(struct bt_arg *ba)
xabort("unsupported type %d", ba->ba_type);
}
 
-   assert(ba->ba_type == B_AT_VAR || ba->ba_type == B_AT_MAP);
+   assert(ba->ba_type == B_AT_VAR || ba->ba_type == B_AT_MAP ||
+   ba->ba_type == B_AT_HIST);
 
static char buf[64];
size_t sz;
@@ -1516,9 +1562,13 @@ ba2str(struct bt_arg *ba, struct dt_evt 
 int
 ba2dtflags(struct bt_arg *ba)
 {
+   static long recursions;
struct bt_arg *bval;
int flags = 0;
 
+   if (++recursions >= __MAXOPERANDS)
+ 

Re: push kernel lock inside ifioctl_get()

2022-11-08 Thread Martin Pieuchot
On 08/11/22(Tue) 15:28, Klemens Nanni wrote:
> After this mechanical move, I can unlock the individual SIOCG* in there.

I'd suggest grabbing the KERNEL_LOCK() after NET_LOCK_SHARED().
Otherwise you might spin for the first one and then have to release it
when going to sleep on the second.
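
Concretely, for the SIOCGIFCONF case that would look something like this
(sketch only, reusing the names from the quoted diff):

	case SIOCGIFCONF:
		NET_LOCK_SHARED();
		KERNEL_LOCK();
		error = ifconf(data);
		KERNEL_UNLOCK();
		NET_UNLOCK_SHARED();
		return (error);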

> OK?
> 
> Index: if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.667
> diff -u -p -r1.667 if.c
> --- if.c  8 Nov 2022 15:20:24 -   1.667
> +++ if.c  8 Nov 2022 15:26:07 -
> @@ -2426,33 +2426,43 @@ ifioctl_get(u_long cmd, caddr_t data)
>   size_t bytesdone;
>   const char *label;
>  
> - KERNEL_LOCK();
> -
>   switch(cmd) {
>   case SIOCGIFCONF:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = ifconf(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCIFGCLONERS:
> + KERNEL_LOCK();
>   error = if_clone_list((struct if_clonereq *)data);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGMEMB:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgroupmembers(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGATTR:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgroupattribs(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGLIST:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgrouplist(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   }
> +
> + KERNEL_LOCK();
>  
>   ifp = if_unit(ifr->ifr_name);
>   if (ifp == NULL) {
> 



Mark sched_yield(2) as NOLOCK

2022-11-08 Thread Martin Pieuchot
Now that mmap/munmap/mprotect(2) are no longer creating contention it is
possible to see that sched_yield(2) is one of the syscalls waiting for
the KERNEL_LOCK() to be released.  However the lock is not needed here.

Traversing `ps_threads' requires either the KERNEL_LOCK() or the
SCHED_LOCK() and we are holding both in this case.  So let's drop the
requirement for the KERNEL_LOCK().
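
For reference, the traversal in question looks roughly like this (a
simplified sketch of sys_sched_yield(); the point is that the whole walk of
`ps_threads' happens under SCHED_LOCK()):

	int
	sys_sched_yield(struct proc *p, void *v, register_t *retval)
	{
		struct proc *q;
		uint8_t newprio;
		int s;

		SCHED_LOCK(s);
		/* compute a priority based on all sibling threads */
		newprio = p->p_usrpri;
		TAILQ_FOREACH(q, &p->p_p->ps_threads, p_thr_link)
			newprio = max(newprio, q->p_runpri);
		setrunqueue(p->p_cpu, p, newprio);
		p->p_ru.ru_nvcsw++;
		mi_switch();
		SCHED_UNLOCK(s);

		return (0);
	}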

ok?

Index: kern/syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.235
diff -u -p -r1.235 syscalls.master
--- kern/syscalls.master8 Nov 2022 11:05:57 -   1.235
+++ kern/syscalls.master8 Nov 2022 13:09:10 -
@@ -531,7 +531,7 @@
 #else
 297UNIMPL
 #endif
-298STD { int sys_sched_yield(void); }
+298STD NOLOCK  { int sys_sched_yield(void); }
 299STD NOLOCK  { pid_t sys_getthrid(void); }
 300OBSOL   t32___thrsleep
 301STD NOLOCK  { int sys___thrwakeup(const volatile void *ident, \



Re: xenstore.c: return error number

2022-11-08 Thread Martin Pieuchot
On 01/11/22(Tue) 15:26, Masato Asou wrote:
> Hi,
> 
> Return error number instead of call panic().

Makes sense to me.  Do you know how this error can occur?  Is it a logic
error or are we trusting values produced by a third party?

> comment, ok?
> --
> ASOU Masato
> 
> diff --git a/sys/dev/pv/xenstore.c b/sys/dev/pv/xenstore.c
> index 1e4f15d30eb..dc89ba0fa6d 100644
> --- a/sys/dev/pv/xenstore.c
> +++ b/sys/dev/pv/xenstore.c
> @@ -118,6 +118,7 @@ struct xs_msg {
>   struct xs_msghdr xsm_hdr;
>   uint32_t xsm_read;
>   uint32_t xsm_dlen;
> + int  xsm_error;
>   uint8_t *xsm_data;
>   TAILQ_ENTRY(xs_msg)  xsm_link;
>  };
> @@ -566,9 +567,7 @@ xs_intr(void *arg)
>   }
>  
>   if (xsm->xsm_hdr.xmh_len > xsm->xsm_dlen)
> - panic("message too large: %d vs %d for type %d, rid %u",
> - xsm->xsm_hdr.xmh_len, xsm->xsm_dlen, xsm->xsm_hdr.xmh_type,
> - xsm->xsm_hdr.xmh_rid);
> + xsm->xsm_error = EMSGSIZE;
>  
>   len = MIN(xsm->xsm_hdr.xmh_len - xsm->xsm_read, avail);
>   if (len) {
> @@ -800,7 +799,9 @@ xs_cmd(struct xs_transaction *xst, int cmd, const char 
> *path,
>   error = xs_geterror(xsm);
>   DPRINTF("%s: xenstore request %d \"%s\" error %s\n",
>   xs->xs_sc->sc_dev.dv_xname, cmd, path, xsm->xsm_data);
> - } else if (mode == READ) {
> + } else if (xsm->xsm_error != 0)
> + error = xsm->xsm_error;
> + else if (mode == READ) {
>   KASSERT(iov && iov_cnt);
>   error = xs_parse(xst, xsm, iov, iov_cnt);
>   }
> 



Re: Please test: unlock mprotect/mmap/munmap

2022-11-08 Thread Martin Pieuchot
On 08/11/22(Tue) 11:12, Mark Kettenis wrote:
> > Date: Tue, 8 Nov 2022 10:32:14 +0100
> > From: Christian Weisgerber 
> > 
> > Martin Pieuchot:
> > 
> > > These 3 syscalls should now be ready to run w/o KERNEL_LOCK().  This
> > > will reduce contention a lot.  I'd be happy to hear from test reports
> > > on many architectures and possible workloads.
> > 
> > This survived a full amd64 package build.
> 
> \8/
> 
> I think that means it should be comitted.

I agree.  This has been tested on i386, riscv64, m88k, arm64, amd64 (of
course) and sparc64.  I'm pretty confident.



Re: push kernel lock down in ifioctl()

2022-11-07 Thread Martin Pieuchot
On 07/11/22(Mon) 15:16, Klemens Nanni wrote:
> Not all interface ioctls need the kernel lock, but they all grab it.
> 
> Here's a mechanical diff splitting the single lock/unlock around
> ifioctl() into individual lock/unlock dances inside ifioctl().
> 
> From there we can unlock individual ioctls piece by piece.
> 
> Survives regress on sparc64 and didn't blow up on my amd64 notebook yet.
> 
> Feedback? Objection? OK?

Makes sense.  Your diff is missing the kern/sys_socket.c chunk.

This stuff is hairy.  I'd suggest moving very, very carefully.  For
example, I wouldn't bother releasing the KERNEL_LOCK() before the
if_put().  Yes, what you're suggesting is correct.  Or at least it
should be...
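
In other words I'd keep the tail of ifioctl() looking something like this
(a sketch with the names from your diff, unlocking only after if_put()):

	if (((oif_flags ^ ifp->if_flags) & IFF_UP) != 0)
		getmicrotime(&ifp->if_lastchange);

	if_put(ifp);
	KERNEL_UNLOCK();

	return (error);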

> Index: net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.665
> diff -u -p -r1.665 if.c
> --- net/if.c  8 Sep 2022 10:22:06 -   1.665
> +++ net/if.c  7 Nov 2022 15:13:01 -
> @@ -1942,19 +1942,25 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCIFCREATE:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   error = if_clone_create(ifr->ifr_name, 0);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCIFDESTROY:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   error = if_clone_destroy(ifr->ifr_name);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCSIFGATTR:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   NET_LOCK();
>   error = if_setgroupattribs(data);
>   NET_UNLOCK();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFCONF:
>   case SIOCIFGCLONERS:
> @@ -1973,12 +1979,19 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCGIFRDOMAIN:
>   case SIOCGIFGROUP:
>   case SIOCGIFLLPRIO:
> - return (ifioctl_get(cmd, data));
> + KERNEL_LOCK();
> + error = ifioctl_get(cmd, data);
> + KERNEL_UNLOCK();
> + return (error);
>   }
>  
> + KERNEL_LOCK();
> +
>   ifp = if_unit(ifr->ifr_name);
> - if (ifp == NULL)
> + if (ifp == NULL) {
> + KERNEL_UNLOCK();
>   return (ENXIO);
> + }
>   oif_flags = ifp->if_flags;
>   oif_xflags = ifp->if_xflags;
>  
> @@ -2396,6 +2409,8 @@ forceup:
>  
>   if (((oif_flags ^ ifp->if_flags) & IFF_UP) != 0)
>   getmicrotime(&ifp->if_lastchange);
> +
> + KERNEL_UNLOCK();
>  
>   if_put(ifp);
>  
> 



Please test: unlock mprotect/mmap/munmap

2022-11-06 Thread Martin Pieuchot
These 3 syscalls should now be ready to run w/o KERNEL_LOCK().  This
will reduce contention a lot.  I'd be happy to hear from test reports
on many architectures and possible workloads.

Do not forget to run "make syscalls" before building the kernel.

Index: syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.234
diff -u -p -r1.234 syscalls.master
--- syscalls.master 25 Oct 2022 16:10:31 -  1.234
+++ syscalls.master 6 Nov 2022 10:50:45 -
@@ -126,7 +126,7 @@
struct sigaction *osa); }
 47 STD NOLOCK  { gid_t sys_getgid(void); }
 48 STD NOLOCK  { int sys_sigprocmask(int how, sigset_t mask); }
-49 STD { void *sys_mmap(void *addr, size_t len, int prot, \
+49 STD NOLOCK  { void *sys_mmap(void *addr, size_t len, int prot, \
int flags, int fd, off_t pos); }
 50 STD { int sys_setlogin(const char *namebuf); }
 #ifdef ACCOUNTING
@@ -171,8 +171,8 @@
const struct kevent *changelist, int nchanges, \
struct kevent *eventlist, int nevents, \
const struct timespec *timeout); }
-73 STD { int sys_munmap(void *addr, size_t len); }
-74 STD { int sys_mprotect(void *addr, size_t len, \
+73 STD NOLOCK  { int sys_munmap(void *addr, size_t len); }
+74 STD NOLOCK  { int sys_mprotect(void *addr, size_t len, \
int prot); }
 75 STD { int sys_madvise(void *addr, size_t len, \
int behav); }



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-30 Thread Martin Pieuchot
On 30/10/22(Sun) 12:45, Klemens Nanni wrote:
> On Sun, Oct 30, 2022 at 12:40:02PM +, Klemens Nanni wrote:
> > regress on i386/GENERIC.MP+WITNESS with this diff shows
> 
> Another one;  This machine has three read-only NFS mounts, but none of
> them are used during builds or regress.

It's the same.  See archives of bugs@ for discussion about this lock
order reversal and a potential fix from visa@.

> 
> This one is most certainly from the NFS regress tests themselves:
> 127.0.0.1:/mnt/regress-nfs-server  3548  2088  1284   
>  62%/mnt/regress-nfs-client
> 
> witness: lock order reversal:
>  1st 0xd6381eb8 vmmaplk (>lock)
>  2nd 0xf5c98d24 nfsnode (>n_lock)
>  1st 0xd6381eb8 vmmaplk (&map->lock)
>  2nd 0xf5c98d24 nfsnode (&np->n_lock)
> lock order data w2 -> w1 missing
> lock order "&map->lock"(rwlock) -> "&np->n_lock"(rrwlock) first seen at:
> #2  nfs_lock+0x27
> #3  VOP_LOCK+0x50
> #4  vn_lock+0x91
> #5  vn_rdwr+0x64
> #6  vndstrategy+0x2bd
> #7  physio+0x18f
> #8  vndwrite+0x1a
> #9  spec_write+0x74
> #10 VOP_WRITE+0x3f
> #11 vn_write+0xde
> #12 dofilewritev+0xbb
> #13 sys_pwrite+0x55
> #14 syscall+0x2ec
> #15 Xsyscall_untramp+0xa9
> 



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-30 Thread Martin Pieuchot
On 30/10/22(Sun) 12:40, Klemens Nanni wrote:
> On Fri, Oct 28, 2022 at 11:08:55AM +0200, Martin Pieuchot wrote:
> > On 20/10/22(Thu) 16:17, Martin Pieuchot wrote:
> > > On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> > > > Diff below adds a minimalist set of assertions to ensure proper locks
> > > > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > > > mmap(2) for anons and munmap(2).
> > > > 
> > > > Please test it with WITNESS enabled and report back.
> > > 
> > > New version of the diff that includes a lock/unlock dance in
> > > uvm_map_teardown().  While grabbing this lock should not be strictly
> > > necessary because no other reference to the map should exist when the
> > > reaper is holding it, it helps make progress with asserts.  Grabbing
> > > the lock is easy and it can also save us a lot of time if there are any
> > > reference counting bugs (like the ones we've discovered w/ vnode and swapping).
> > 
> > Here's an updated version that adds a lock/unlock dance in
> > uvm_map_deallocate() to satisfy the assert in uvm_unmap_remove().
> > Thanks to tb@ for pointing this out.
> > 
> > I received a lot of positive feedback and test reports, I'm now asking for
> > oks.
> 
> regress on i386/GENERIC.MP+WITNESS with this diff shows

This isn't related to this diff.



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-28 Thread Martin Pieuchot
On 20/10/22(Thu) 16:17, Martin Pieuchot wrote:
> On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> > Diff below adds a minimalist set of assertions to ensure proper locks
> > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > mmap(2) for anons and munmap(2).
> > 
> > Please test it with WITNESS enabled and report back.
> 
> New version of the diff that includes a lock/unlock dance in
> uvm_map_teardown().  While grabbing this lock should not be strictly
> necessary because no other reference to the map should exist when the
> reaper is holding it, it helps make progress with asserts.  Grabbing
> the lock is easy and it can also save us a lot of time if there are any
> reference counting bugs (like the ones we've discovered w/ vnode and swapping).

Here's an updated version that adds a lock/unlock dance in
uvm_map_deallocate() to satisfy the assert in uvm_unmap_remove().
Thanks to tb@ for pointing this out.

I received a lot of positive feedback and test reports, so I'm now asking for
oks.
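
The vm_map_assert_*() helpers aren't visible in the chunks below; they are
assumed to be thin wrappers around the rwlock assertions on `map->lock'
(sketch only, with interrupt-safe maps checked via their mutex instead):

	void
	vm_map_assert_anylock(struct vm_map *map)
	{
		if ((map->flags & VM_MAP_INTRSAFE) == 0)
			rw_assert_anylock(&map->lock);
		else
			MUTEX_ASSERT_LOCKED(&map->mtx);
	}

	void
	vm_map_assert_wrlock(struct vm_map *map)
	{
		if ((map->flags & VM_MAP_INTRSAFE) == 0)
			rw_assert_wrlock(&map->lock);
		else
			MUTEX_ASSERT_LOCKED(&map->mtx);
	}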


Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  28 Oct 2022 08:41:30 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 28 Oct 2022 08:41:30 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.301
diff -u -p -r1.301 uvm_map.c
--- uvm/uvm_map.c   24 Oct 2022 15:11:56 -  1.301
+++ uvm/uvm_map.c   28 Oct 2022 08:46:28 -
@@ -491,6 +491,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +572,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1457,6 +1461,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1584,6 +1590,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1704,6 +1712,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1868,6 +1878,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1996,10 +2008,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return 0;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected

Re: Towards unlocking mmap(2) & munmap(2)

2022-10-20 Thread Martin Pieuchot
On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> Diff below adds a minimalist set of assertions to ensure proper locks
> are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> mmap(2) for anons and munmap(2).
> 
> Please test it with WITNESS enabled and report back.

New version of the diff that includes a lock/unlock dance in
uvm_map_teardown().  While grabbing this lock should not be strictly
necessary because no other reference to the map should exist when the
reaper is holding it, it helps make progress with asserts.  Grabbing
the lock is easy and it can also save us a lot of time if there are any
reference counting bugs (like the ones we've discovered w/ vnode and swapping).

Please test and report back.

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  20 Oct 2022 14:09:30 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 20 Oct 2022 14:09:30 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.298
diff -u -p -r1.298 uvm_map.c
--- uvm/uvm_map.c   16 Oct 2022 16:16:37 -  1.298
+++ uvm/uvm_map.c   20 Oct 2022 14:09:31 -
@@ -491,6 +491,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +572,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1457,6 +1461,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1584,6 +1590,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1704,6 +1712,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1868,6 +1878,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1996,10 +2008,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return 0;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected entry. */
entry = uvm_map_entrybyaddr(&map->addr, start);
@@ -2526,6 +2535,8 @@ uvm_map_teardown(struct vm_map *map)
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
 
+   vm_map_lock(map);
+
/* Remove address selectors. */
uvm_addr_destroy(map->uaddr_e

Re: Towards unlocking mmap(2) & munmap(2)

2022-09-14 Thread Martin Pieuchot
On 14/09/22(Wed) 15:47, Klemens Nanni wrote:
> On 14.09.22 18:55, Mike Larkin wrote:
> > On Sun, Sep 11, 2022 at 12:26:31PM +0200, Martin Pieuchot wrote:
> > > Diff below adds a minimalist set of assertions to ensure proper locks
> > > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > > mmap(2) for anons and munmap(2).
> > > 
> > > Please test it with WITNESS enabled and report back.
> > > 
> > 
> > Do you want this tested in conjunction with the aiodoned diff or by itself?
> 
> This diff looks like a subset of the previous uvm lock assertion diff
> that came out of the previous "unlock mmap(2) for anonymous mappings"
> thread[0].
> 
> https://marc.info/?l=openbsd-tech&m=164423248318212&w=2
> 
> It didn't land in the end; I **think** syzkaller was a blocker, which we
> only realised once it was committed and picked up by syzkaller.
> 
> Now it's been some time and more UVM changes landed, but the majority
> (if not all) of the lock assertions and comments from the above linked diff
> should still hold true.
> 
> mpi, I can dust off and resend that diff, if you want.
> Nothing for release, but perhaps it helps testing your current efforts.

Please hold on, this diff is known to trigger a KASSERT() with witness.
I'll send an updated version soon.

Thank you for disregarding this diff for the moment.



Towards unlocking mmap(2) & munmap(2)

2022-09-11 Thread Martin Pieuchot
Diff below adds a minimalist set of assertions to ensure proper locks
are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
mmap(2) for anons and munmap(2).

Please test it with WITNESS enabled and report back.

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  11 Sep 2022 09:08:10 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 11 Sep 2022 08:57:35 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.294
diff -u -p -r1.294 uvm_map.c
--- uvm/uvm_map.c   15 Aug 2022 15:53:45 -  1.294
+++ uvm/uvm_map.c   11 Sep 2022 09:37:44 -
@@ -162,6 +162,8 @@ int  uvm_map_inentry_recheck(u_long, v
 struct p_inentry *);
 boolean_t   uvm_map_inentry_fix(struct proc *, struct p_inentry *,
 vaddr_t, int (*)(vm_map_entry_t), u_long);
+boolean_t   uvm_map_is_stack_remappable(struct vm_map *,
+vaddr_t, vsize_t);
 /*
  * Tree management functions.
  */
@@ -491,6 +493,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +574,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1446,6 +1452,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1573,6 +1581,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1692,6 +1702,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1843,6 +1855,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1971,10 +1985,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected entry. */
entry = uvm_map_entrybyaddr(&map->addr, start);
@@ -4027,6 +4038,8 @@ uvm_map_checkprot(struct vm_map *map, va
 {
struct vm_map_entry *entry;
 
+   vm_map_assert_anylock(map);
+
if (start < map->min_offset || end > map->max_offset || start > end)
return FALSE;
if (start == end)
@@ -4886,6 +4899,7 @@ uvm_map_freelist_update(struct vm_map *m
 vaddr_t b_start, vaddr_t b_end, vaddr_t s_start, vaddr_t s_end, int flags)
 {
KDASSERT(b_end >= b_start && s_end >= 

uvm_vnode locking & documentation

2022-09-10 Thread Martin Pieuchot
A previous fix from gnezdo@ pointed out that `u_flags' accesses should be
serialized by `vmobjlock'.  The diff below documents this and fixes the
remaining places where the lock isn't yet taken.  One exception still
remains, the first loop of uvm_vnp_sync().  This cannot be fixed right
now due to possible deadlocks, but that's not a reason for not documenting
& fixing the rest of this file.
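
The rule being documented is simply that every read or write of `u_flags'
happens with the object lock held, e.g. (sketch):

	rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
	uvn->u_flags |= UVM_VNODE_WRITEABLE;
	rw_exit(uvn->u_obj.vmobjlock);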

This has been tested on amd64 and arm64.

Comments?  Oks?

Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.128
diff -u -p -r1.128 uvm_vnode.c
--- uvm/uvm_vnode.c 10 Sep 2022 16:14:36 -  1.128
+++ uvm/uvm_vnode.c 10 Sep 2022 18:23:57 -
@@ -68,11 +68,8 @@
  * we keep a simpleq of vnodes that are currently being sync'd.
  */
 
-LIST_HEAD(uvn_list_struct, uvm_vnode);
-struct uvn_list_struct uvn_wlist;  /* writeable uvns */
-
-SIMPLEQ_HEAD(uvn_sq_struct, uvm_vnode);
-struct uvn_sq_struct uvn_sync_q;   /* sync'ing uvns */
+LIST_HEAD(, uvm_vnode) uvn_wlist;  /* [K] writeable uvns */
+SIMPLEQ_HEAD(, uvm_vnode)  uvn_sync_q; /* [S] sync'ing uvns */
 struct rwlock uvn_sync_lock;   /* locks sync operation */
 
 extern int rebooting;
@@ -144,41 +141,40 @@ uvn_attach(struct vnode *vp, vm_prot_t a
struct partinfo pi;
u_quad_t used_vnode_size = 0;
 
-   /* first get a lock on the uvn. */
-   while (uvn->u_flags & UVM_VNODE_BLOCKED) {
-   uvn->u_flags |= UVM_VNODE_WANTED;
-   tsleep_nsec(uvn, PVM, "uvn_attach", INFSLP);
-   }
-
/* if we're mapping a BLK device, make sure it is a disk. */
if (vp->v_type == VBLK && bdevsw[major(vp->v_rdev)].d_type != D_DISK) {
return NULL;
}
 
+   /* first get a lock on the uvn. */
+   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
+   while (uvn->u_flags & UVM_VNODE_BLOCKED) {
+   uvn->u_flags |= UVM_VNODE_WANTED;
+   rwsleep_nsec(uvn, uvn->u_obj.vmobjlock, PVM, "uvn_attach",
+   INFSLP);
+   }
+
/*
 * now uvn must not be in a blocked state.
 * first check to see if it is already active, in which case
 * we can bump the reference count, check to see if we need to
 * add it to the writeable list, and then return.
 */
-   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
if (uvn->u_flags & UVM_VNODE_VALID) {   /* already active? */
KASSERT(uvn->u_obj.uo_refs > 0);
 
uvn->u_obj.uo_refs++;   /* bump uvn ref! */
-   rw_exit(uvn->u_obj.vmobjlock);
 
/* check for new writeable uvn */
if ((accessprot & PROT_WRITE) != 0 &&
(uvn->u_flags & UVM_VNODE_WRITEABLE) == 0) {
-   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
-   /* we are now on wlist! */
uvn->u_flags |= UVM_VNODE_WRITEABLE;
+   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
}
+   rw_exit(uvn->u_obj.vmobjlock);
 
return (&uvn->u_obj);
}
-   rw_exit(uvn->u_obj.vmobjlock);
 
/*
 * need to call VOP_GETATTR() to get the attributes, but that could
@@ -189,6 +185,7 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 * it.
 */
uvn->u_flags = UVM_VNODE_ALOCK;
+   rw_exit(uvn->u_obj.vmobjlock);
 
if (vp->v_type == VBLK) {
/*
@@ -213,9 +210,11 @@ uvn_attach(struct vnode *vp, vm_prot_t a
}
 
if (result != 0) {
+   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
if (uvn->u_flags & UVM_VNODE_WANTED)
wakeup(uvn);
uvn->u_flags = 0;
+   rw_exit(uvn->u_obj.vmobjlock);
return NULL;
}
 
@@ -236,18 +235,19 @@ uvn_attach(struct vnode *vp, vm_prot_t a
uvn->u_nio = 0;
uvn->u_size = used_vnode_size;
 
-   /* if write access, we need to add it to the wlist */
-   if (accessprot & PROT_WRITE) {
-   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
-   uvn->u_flags |= UVM_VNODE_WRITEABLE;/* we are on wlist! */
-   }
-
/*
 * add a reference to the vnode.   this reference will stay as long
 * as there is a valid mapping of the vnode.   dropped when the
 * reference count goes to zero.
 */
vref(vp);
+
+   /* if write access, we need to add it to the wlist */
+   if (accessprot & PROT_WRITE) {
+   uvn->u_flags |= UVM_VNODE_WRITEABLE;
+   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
+   }
+
if (oldflags & UVM_VNODE_WANTED)
wakeup(uvn);
 
@@ -273,6 +273,7 @@ uvn_reference(struct uvm_object *uobj)
struct uvm_vnode *uvn = (struct uvm_vnode *) uobj;
 #endif
 
+   

Re: Unmap page in uvm_anon_release()

2022-09-10 Thread Martin Pieuchot
On 10/09/22(Sat) 15:12, Mark Kettenis wrote:
> > Date: Sat, 10 Sep 2022 14:18:02 +0200
> > From: Martin Pieuchot 
> > 
> > Diff below fixes a bug exposed when swapping on arm64.  When an anon is
> > released make sure all the pmap references to the related page are
> > removed.
> 
> I'm a little bit puzzled by this.  So these pages are still mapped
> even though there are no references to the anon anymore?

I don't know.  I just realised that all the code paths leading to
uvm_pagefree() get rid of the pmap references by calling
pmap_page_protect(), except a couple of them in the aiodone daemon and
the clustering code in the pager.

This can't hurt and it makes the existing code coherent.  Maybe it just
hides the bug, I don't know.



Unmap page in uvm_anon_release()

2022-09-10 Thread Martin Pieuchot
Diff below fixes a bug exposed when swapping on arm64.  When an anon is
released make sure all the pmap references to the related page are
removed.

We could move the pmap_page_protect(pg, PROT_NONE) inside uvm_pagefree()
to avoid future issues, but that's for a later refactoring.
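
Such a refactoring could look roughly like this (sketch only, not part of
the diff below):

	void
	uvm_pagefree(struct vm_page *pg)
	{
		/* drop all pmap references before the page is freed */
		pmap_page_protect(pg, PROT_NONE);
		/* ... existing uvm_pagefree() body ... */
	}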

With this diff I can no longer reproduce the SIGBUS issue on the
rockpro64 and swapping is stable as long as I/O from sdmmc(4) work.

This should be good enough to commit the diff that got reverted, but I'll
wait to be sure there's no regression.

ok?

Index: uvm/uvm_anon.c
===
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
retrieving revision 1.54
diff -u -p -r1.54 uvm_anon.c
--- uvm/uvm_anon.c  26 Mar 2021 13:40:05 -  1.54
+++ uvm/uvm_anon.c  10 Sep 2022 12:10:34 -
@@ -255,6 +255,7 @@ uvm_anon_release(struct vm_anon *anon)
KASSERT(anon->an_ref == 0);
 
uvm_lock_pageq();
+   pmap_page_protect(pg, PROT_NONE);
uvm_pagefree(pg);
uvm_unlock_pageq();
KASSERT(anon->an_page == NULL);
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 10 Sep 2022 12:10:34 -
@@ -396,7 +396,6 @@ uvmfault_anonget(struct uvm_faultinfo *u
 * anon and try again.
 */
if (pg->pg_flags & PG_RELEASED) {
-   pmap_page_protect(pg, PROT_NONE);
KASSERT(anon->an_ref == 0);
/*
 * Released while we had unlocked amap.



Re: ps(1): add -d (descendancy) option to display parent/child process relationships

2022-09-01 Thread Martin Pieuchot
On 01/09/22(Thu) 03:37, Job Snijders wrote:
> Dear all,
> 
> Some ps(1) implementations have an '-d' ('descendancy') option. Through
> ASCII art parent/child process relationships are grouped and displayed.
> Here is an example:
> 
> $ ps ad -O ppid,user
>   PID  PPID USER TT  STATTIME COMMAND
> 18180 12529 job  pb  I+p  0:00.01 `-- -sh (sh)
> 26689 56460 job  p3  Ip   0:00.01   `-- -ksh (ksh)
>  5153 26689 job  p3  I+p  0:40.18 `-- mutt
> 62046 25272 job  p4  Sp   0:00.25   `-- -ksh (ksh)
> 61156 62046 job  p4  R+/0 0:00.00 `-- ps -ad -O ppid
> 26816  2565 job  p5  Ip   0:00.01   `-- -ksh (ksh)
> 79431 26816 root p5  Ip   0:00.16 `-- /bin/ksh
> 43915 79431 _rpki-cl p5  S+pU 0:06.97   `-- rpki-client
> 70511 43915 _rpki-cl p5  I+pU 0:01.26 |-- rpki-client: parser 
> (rpki-client)
> 96992 43915 _rpki-cl p5  I+pU 0:00.00 |-- rpki-client: rsync 
> (rpki-client)
> 49160 43915 _rpki-cl p5  S+p  0:01.52 |-- rpki-client: http 
> (rpki-client)
> 99329 43915 _rpki-cl p5  S+p  0:03.20 `-- rpki-client: rrdp 
> (rpki-client)
> 
> The functionality is similar to pstree(1) in the ports collection.
> 
> The below changeset borrows heavily from the following two
> implementations:
> 
> 
> https://github.com/freebsd/freebsd-src/commit/044fce530f89a819827d351de364d208a30e9645.patch
> 
> https://github.com/NetBSD/src/commit/b82f6d00d93d880d3976c4f1e88c33d88a8054ad.patch
> 
> Thoughts?

I'd love to have such a feature in base.

> Index: extern.h
> ===
> RCS file: /cvs/src/bin/ps/extern.h,v
> retrieving revision 1.23
> diff -u -p -r1.23 extern.h
> --- extern.h  5 Jan 2022 04:10:36 -   1.23
> +++ extern.h  1 Sep 2022 03:31:36 -
> @@ -44,44 +44,44 @@ extern VAR var[];
>  extern VARENT *vhead;
>  
>  __BEGIN_DECLS
> -void  command(const struct kinfo_proc *, VARENT *);
> -void  cputime(const struct kinfo_proc *, VARENT *);
> +void  command(const struct pinfo *, VARENT *);
> +void  cputime(const struct pinfo *, VARENT *);
>  int   donlist(void);
> -void  elapsed(const struct kinfo_proc *, VARENT *);
> +void  elapsed(const struct pinfo *, VARENT *);
>  double  getpcpu(const struct kinfo_proc *);
> -double  getpmem(const struct kinfo_proc *);
> -void  gname(const struct kinfo_proc *, VARENT *);
> -void  supgid(const struct kinfo_proc *, VARENT *);
> -void  supgrp(const struct kinfo_proc *, VARENT *);
> -void  logname(const struct kinfo_proc *, VARENT *);
> -void  longtname(const struct kinfo_proc *, VARENT *);
> -void  lstarted(const struct kinfo_proc *, VARENT *);
> -void  maxrss(const struct kinfo_proc *, VARENT *);
> +double  getpmem(const struct pinfo *);
> +void  gname(const struct pinfo *, VARENT *);
> +void  supgid(const struct pinfo *, VARENT *);
> +void  supgrp(const struct pinfo *, VARENT *);
> +void  logname(const struct pinfo *, VARENT *);
> +void  longtname(const struct pinfo *, VARENT *);
> +void  lstarted(const struct pinfo *, VARENT *);
> +void  maxrss(const struct pinfo *, VARENT *);
>  void  nlisterr(struct nlist *);
> -void  p_rssize(const struct kinfo_proc *, VARENT *);
> -void  pagein(const struct kinfo_proc *, VARENT *);
> +void  p_rssize(const struct pinfo *, VARENT *);
> +void  pagein(const struct pinfo *, VARENT *);
>  void  parsefmt(char *);
> -void  pcpu(const struct kinfo_proc *, VARENT *);
> -void  pmem(const struct kinfo_proc *, VARENT *);
> -void  pri(const struct kinfo_proc *, VARENT *);
> +void  pcpu(const struct pinfo *, VARENT *);
> +void  pmem(const struct pinfo *, VARENT *);
> +void  pri(const struct pinfo *, VARENT *);
>  void  printheader(void);
> -void  pvar(const struct kinfo_proc *kp, VARENT *);
> -void  pnice(const struct kinfo_proc *kp, VARENT *);
> -void  rgname(const struct kinfo_proc *, VARENT *);
> -void  rssize(const struct kinfo_proc *, VARENT *);
> -void  runame(const struct kinfo_proc *, VARENT *);
> +void  pvar(const struct pinfo *, VARENT *);
> +void  pnice(const struct pinfo *, VARENT *);
> +void  rgname(const struct pinfo *, VARENT *);
> +void  rssize(const struct pinfo *, VARENT *);
> +void  runame(const struct pinfo *, VARENT *);
>  void  showkey(void);
> -void  started(const struct kinfo_proc *, VARENT *);
> -void  printstate(const struct kinfo_proc *, VARENT *);
> -void  printpledge(const struct kinfo_proc *, VARENT *);
> -void  tdev(const struct kinfo_proc *, VARENT *);
> -void  tname(const struct kinfo_proc *, VARENT *);
> -void  tsize(const struct kinfo_proc *, VARENT *);
> -void  dsize(const struct kinfo_proc *, VARENT *);
> -void  ssize(const struct kinfo_proc *, VARENT *);
> -void  ucomm(const struct kinfo_proc *, VARENT *);
> -void  curwd(const struct kinfo_proc *, VARENT *);
> -void  euname(const struct kinfo_proc *, VARENT *);
> -void  vsize(const struct kinfo_proc *, VARENT 

Re: pdaemon locking tweak

2022-08-30 Thread Martin Pieuchot
On 30/08/22(Tue) 15:28, Jonathan Gray wrote:
> On Mon, Aug 29, 2022 at 01:46:20PM +0200, Martin Pieuchot wrote:
> > Diff below refactors the pdaemon's locking by introducing a new *trylock()
> > function for a given page.  This is shamelessly stolen from NetBSD.
> > 
> > This is part of my ongoing effort to untangle the locks used by the page
> > daemon.
> > 
> > ok?
> 
> if (pmap_is_referenced(p)) {
>   uvm_pageactivate(p);
> 
> is no longer under the held slock.  Which I believe is intended,
> just not obvious looking at the diff.
> 
> The page queue is already locked on entry to uvmpd_scan_inactive()

Thanks for spotting this.  Indeed the locking required for
uvm_pageactivate() is different in my local tree.  For now
let's keep the existing order of operations.

Updated diff below.

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   30 Aug 2022 08:30:58 -  1.103
+++ uvm/uvm_pdaemon.c   30 Aug 2022 08:39:19 -
@@ -101,6 +101,7 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
+struct rwlock  *uvmpd_trylockowner(struct vm_page *);
 void   uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
@@ -367,6 +368,34 @@ uvm_aiodone_daemon(void *arg)
}
 }
 
+/*
+ * uvmpd_trylockowner: trylock the page's owner.
+ *
+ * => return the locked rwlock on success.  otherwise, return NULL.
+ */
+struct rwlock *
+uvmpd_trylockowner(struct vm_page *pg)
+{
+
+   struct uvm_object *uobj = pg->uobject;
+   struct rwlock *slock;
+
+   if (uobj != NULL) {
+   slock = uobj->vmobjlock;
+   } else {
+   struct vm_anon *anon = pg->uanon;
+
+   KASSERT(anon != NULL);
+   slock = anon->an_lock;
+   }
+
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
+   return NULL;
+   }
+
+   return slock;
+}
+
 
 /*
  * uvmpd_dropswap: free any swap allocated to this page.
@@ -474,51 +503,43 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 
anon = p->uanon;
uobj = p->uobject;
-   if (p->pg_flags & PQ_ANON) {
+
+   /*
+* first we attempt to lock the object that this page
+* belongs to.  if our attempt fails we skip on to
+* the next page (no harm done).  it is important to
+* "try" locking the object as we are locking in the
+* wrong order (pageq -> object) and we don't want to
+* deadlock.
+*/
+   slock = uvmpd_trylockowner(p);
+   if (slock == NULL) {
+   continue;
+   }
+
+   /*
+* move referenced pages back to active queue
+* and skip to next page.
+*/
+   if (pmap_is_referenced(p)) {
+   uvm_pageactivate(p);
+   rw_exit(slock);
+   uvmexp.pdreact++;
+   continue;
+   }
+
+   if (p->pg_flags & PG_BUSY) {
+   rw_exit(slock);
+   uvmexp.pdbusy++;
+   continue;
+   }
+
+   /* does the page belong to an object? */
+   if (uobj != NULL) {
+   uvmexp.pdobscan++;
+   } else {
KASSERT(anon != NULL);
-   slock = anon->an_lock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   /* lock failed, skip this page */
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to next page.
-*/
-   if (pmap_is_referenced(p)) {
-   uvm_pageactivate(p);
-   rw_exit(slock);
-   uvmexp.pdreact++;
-   continue;
-   }
-   if (p->pg_flags & PG_BUSY) {
- 

uvmpd_dropswap()

2022-08-29 Thread Martin Pieuchot
Small refactoring to introduce uvmpd_dropswap().  This will make an
upcoming rewrite of the pdaemon smaller & easier to review :o)

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   22 Aug 2022 12:03:32 -  1.102
+++ uvm/uvm_pdaemon.c   29 Aug 2022 11:55:52 -
@@ -105,6 +105,7 @@ void uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
+void   uvmpd_dropswap(struct vm_page *);
 
 /*
  * uvm_wait: wait (sleep) for the page daemon to free some pages
@@ -367,6 +368,23 @@ uvm_aiodone_daemon(void *arg)
 }
 
 
+/*
+ * uvmpd_dropswap: free any swap allocated to this page.
+ *
+ * => called with owner locked.
+ */
+void
+uvmpd_dropswap(struct vm_page *pg)
+{
+   struct vm_anon *anon = pg->uanon;
+
+   if ((pg->pg_flags & PQ_ANON) && anon->an_swslot) {
+   uvm_swap_free(anon->an_swslot, 1);
+   anon->an_swslot = 0;
+   } else if (pg->pg_flags & PQ_AOBJ) {
+   uao_dropswap(pg->uobject, pg->offset >> PAGE_SHIFT);
+   }
+}
 
 /*
  * uvmpd_scan_inactive: scan an inactive list for pages to clean or free.
@@ -566,16 +584,7 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
KASSERT(uvmexp.swpginuse <= uvmexp.swpages);
if ((p->pg_flags & PQ_SWAPBACKED) &&
uvmexp.swpginuse == uvmexp.swpages) {
-
-   if ((p->pg_flags & PQ_ANON) &&
-   p->uanon->an_swslot) {
-   uvm_swap_free(p->uanon->an_swslot, 1);
-   p->uanon->an_swslot = 0;
-   }
-   if (p->pg_flags & PQ_AOBJ) {
-   uao_dropswap(p->uobject,
-p->offset >> PAGE_SHIFT);
-   }
+   uvmpd_dropswap(p);
}
 
/*
@@ -599,16 +608,7 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
if (swap_backed) {
/* free old swap slot (if any) */
-   if (anon) {
-   if (anon->an_swslot) {
-   uvm_swap_free(anon->an_swslot,
-   1);
-   anon->an_swslot = 0;
-   }
-   } else {
-   uao_dropswap(uobj,
-p->offset >> PAGE_SHIFT);
-   }
+   uvmpd_dropswap(p);
 
/* start new cluster (if necessary) */
if (swslot == 0) {



pdaemon locking tweak

2022-08-29 Thread Martin Pieuchot
Diff below refactors the pdaemon's locking by introducing a new *trylock()
function for a given page.  This is shamelessly stolen from NetBSD.

This is part of my ongoing effort to untangle the locks used by the page
daemon.

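For reference, a caller is expected to use it like this (a minimal
sketch mirroring the hunk below, nothing beyond what the diff does):

	slock = uvmpd_trylockowner(p);
	if (slock == NULL) {
		/* owner lock is contended, skip this page */
		continue;
	}
	/* ... examine, reactivate or clean the page ... */
	rw_exit(slock);
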
ok?

Index: uvm//uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_pdaemon.c
--- uvm//uvm_pdaemon.c  22 Aug 2022 12:03:32 -  1.102
+++ uvm//uvm_pdaemon.c  29 Aug 2022 11:36:59 -
@@ -101,6 +101,7 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
+struct rwlock  *uvmpd_trylockowner(struct vm_page *);
 void   uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
@@ -367,6 +368,34 @@ uvm_aiodone_daemon(void *arg)
 }
 
 
+/*
+ * uvmpd_trylockowner: trylock the page's owner.
+ *
+ * => return the locked rwlock on success.  otherwise, return NULL.
+ */
+struct rwlock *
+uvmpd_trylockowner(struct vm_page *pg)
+{
+
+   struct uvm_object *uobj = pg->uobject;
+   struct rwlock *slock;
+
+   if (uobj != NULL) {
+   slock = uobj->vmobjlock;
+   } else {
+   struct vm_anon *anon = pg->uanon;
+
+   KASSERT(anon != NULL);
+   slock = anon->an_lock;
+   }
+
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
+   return NULL;
+   }
+
+   return slock;
+}
+
 
 /*
  * uvmpd_scan_inactive: scan an inactive list for pages to clean or free.
@@ -454,53 +483,44 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
uvmexp.pdscans++;
nextpg = TAILQ_NEXT(p, pageq);
 
+   /*
+* move referenced pages back to active queue
+* and skip to next page.
+*/
+   if (pmap_is_referenced(p)) {
+   uvm_pageactivate(p);
+   uvmexp.pdreact++;
+   continue;
+   }
+
anon = p->uanon;
uobj = p->uobject;
-   if (p->pg_flags & PQ_ANON) {
+
+   /*
+* first we attempt to lock the object that this page
+* belongs to.  if our attempt fails we skip on to
+* the next page (no harm done).  it is important to
+* "try" locking the object as we are locking in the
+* wrong order (pageq -> object) and we don't want to
+* deadlock.
+*/
+   slock = uvmpd_trylockowner(p);
+   if (slock == NULL) {
+   continue;
+   }
+
+   if (p->pg_flags & PG_BUSY) {
+   rw_exit(slock);
+   uvmexp.pdbusy++;
+   continue;
+   }
+
+   /* does the page belong to an object? */
+   if (uobj != NULL) {
+   uvmexp.pdobscan++;
+   } else {
KASSERT(anon != NULL);
-   slock = anon->an_lock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   /* lock failed, skip this page */
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to next page.
-*/
-   if (pmap_is_referenced(p)) {
-   uvm_pageactivate(p);
-   rw_exit(slock);
-   uvmexp.pdreact++;
-   continue;
-   }
-   if (p->pg_flags & PG_BUSY) {
-   rw_exit(slock);
-   uvmexp.pdbusy++;
-   continue;
-   }
uvmexp.pdanscan++;
-   } else {
-   KASSERT(uobj != NULL);
-   slock = uobj->vmobjlock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip 

Simplify locking code in pdaemon

2022-08-18 Thread Martin Pieuchot
Use a "slock" variable as done in multiple places to simplify the code.
The locking stays the same.  This is just a first step to simplify this
mess.

Also get rid of the return value of the function, it is never checked.

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.101
diff -u -p -r1.101 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   28 Jun 2022 19:31:30 -  1.101
+++ uvm/uvm_pdaemon.c   18 Aug 2022 10:44:52 -
@@ -102,7 +102,7 @@ extern void drmbackoff(long);
  */
 
 void   uvmpd_scan(struct uvm_pmalloc *);
-boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
+void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -377,17 +377,16 @@ uvm_aiodone_daemon(void *arg)
  * => we handle the building of swap-backed clusters
  * => we return TRUE if we are exiting because we met our target
  */
-
-boolean_t
+void
 uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
-   boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
+   struct rwlock *slock;
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -402,7 +401,6 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
swslot = 0;
swnpages = swcpages = 0;
-   free = 0;
dirtyreacts = 0;
p = NULL;
 
@@ -431,18 +429,14 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
uobj = NULL;
anon = NULL;
-
if (p) {
/*
-* update our copy of "free" and see if we've met
-* our target
+* see if we've met our target
 */
free = uvmexp.free - BUFPAGES_DEFICIT;
if (((pma == NULL || (pma->pm_flags & UVM_PMA_FREED)) &&
(free + uvmexp.paging >= uvmexp.freetarg << 2)) ||
dirtyreacts == UVMPD_NUMDIRTYREACTS) {
-   retval = TRUE;
-
if (swslot == 0) {
/* exit now if no swap-i/o pending */
break;
@@ -450,9 +444,9 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 
/* set p to null to signal final swap i/o */
p = NULL;
+   nextpg = NULL;
}
}
-
if (p) {/* if (we have a new page to consider) */
/*
 * we are below target and have a new page to consider.
@@ -460,11 +454,12 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
uvmexp.pdscans++;
nextpg = TAILQ_NEXT(p, pageq);
 
+   anon = p->uanon;
+   uobj = p->uobject;
if (p->pg_flags & PQ_ANON) {
-   anon = p->uanon;
KASSERT(anon != NULL);
-   if (rw_enter(anon->an_lock,
-   RW_WRITE|RW_NOSLEEP)) {
+   slock = anon->an_lock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
/* lock failed, skip this page */
continue;
}
@@ -474,23 +469,20 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
if (pmap_is_referenced(p)) {
uvm_pageactivate(p);
-   rw_exit(anon->an_lock);
+   rw_exit(slock);
uvmexp.pdreact++;
continue;
}
if (p->pg_flags & PG_BUSY) {
-   rw_exit(anon->an_lock);
+   rw_exit(slock);
uvmexp.pdbusy++;
-   /* someone else owns page, skip it */
continue;
}
uvmexp.pdanscan++;
} else {
-   uobj = p->uobject;
  

Fix a race in uvm_pseg_release()

2022-08-18 Thread Martin Pieuchot
The lock must be grabbed before iterating on the global array, ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.88
diff -u -p -r1.88 uvm_pager.c
--- uvm/uvm_pager.c 15 Aug 2022 03:21:04 -  1.88
+++ uvm/uvm_pager.c 18 Aug 2022 10:31:16 -
@@ -209,6 +209,7 @@ uvm_pseg_release(vaddr_t segaddr)
struct uvm_pseg *pseg;
vaddr_t va = 0;
 
+   mtx_enter(&uvm_pseg_lck);
for (pseg = &psegs[0]; pseg != &psegs[PSEG_NUMSEGS]; pseg++) {
if (pseg->start <= segaddr &&
segaddr < pseg->start + MAX_PAGER_SEGS * MAXBSIZE)
@@ -222,7 +223,6 @@ uvm_pseg_release(vaddr_t segaddr)
/* test for no remainder */
KDASSERT(segaddr == pseg->start + id * MAXBSIZE);
 
-   mtx_enter(&uvm_pseg_lck);
 
KASSERT(UVM_PSEG_INUSE(pseg, id));
 



Re: uvm_swap: introduce uvm_swap_data_lock

2022-08-17 Thread Martin Pieuchot
On 16/01/22(Sun) 15:35, Martin Pieuchot wrote:
> On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> > The diff below does two things: it adds a uvm_swap_data_lock mutex and
> > trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()
> 
> Why is it enough?  Which fields is the lock protecting in these
> function?  Is it `uvmexp.swpages', could that be documented?  

It is documented in the diff below.

> 
> What about `nswapdev'?  Why is the rwlock grabbed before reading it in
> sys_swapctl()?i

Because it is always modified with the lock, I added some documentation.

> What about `swpginuse'?

This is still under KERNEL_LOCK(), documented below.

> If the mutex/rwlock are used to protect the global `swap_priority' could
> that be also documented?  Once this is documented it should be trivial to
> see that some places are missing some locking.  Is it intentional?
> 
> > The uvm_swap_data_lock protects all swap data structures, so needs to be
> > grabbed a few times, many of them already documented in the comments.
> > 
> > For review, I suggest comparing to what NetBSD did and also going
> > through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> > and check that they are properly locked when called, or that there is
> > the KERNEL_LOCK() in place when swap data structures are manipulated.
> 
> I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
> the differences with NetBSD.

Done.

> > In swapmount() I introduced locking since that's needed to be able to
> > assert that the proper locks are held in swaplist_{insert,find,trim}.
> 
> Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
> What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

This has been done as part of another change.  This diff uses an atomic
operation to increase `nswget' in case multiple threads fault on a page
in swap at the same time.

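For the record, the `nswget' bump mentioned above boils down to a
one-liner of the following form (sketch; the uvm_swap_get() hunk is not
part of the diff shown here):

	atomic_inc_int(&uvmexp.nswget);
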
Updated diff below, ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_swap.c
--- uvm/uvm_swap.c  6 Aug 2022 13:44:04 -   1.163
+++ uvm/uvm_swap.c  17 Aug 2022 11:46:20 -
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -84,13 +85,16 @@
  * the system maintains a global data structure describing all swap
  * partitions/files.   there is a sorted LIST of "swappri" structures
  * which describe "swapdev"'s at that priority.   this LIST is headed
- * by the "swap_priority" global var.each "swappri" contains a 
+ * by the "swap_priority" global var.each "swappri" contains a
  * TAILQ of "swapdev" structures at that priority.
  *
  * locking:
  *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
  *system call and prevents the swap priority list from changing
  *while we are in the middle of a system call (e.g. SWAP_STATS).
+ *  - uvm_swap_data_lock (mutex): this lock protects all swap data
+ *structures including the priority list, the swapdev structures,
+ *and the swapmap arena.
  *
  * each swap device has the following info:
  *  - swap device in use (could be disabled, preventing future use)
@@ -106,7 +110,7 @@
  * userland controls and configures swap with the swapctl(2) system call.
  * the sys_swapctl performs the following operations:
  *  [1] SWAP_NSWAP: returns the number of swap devices currently configured
- *  [2] SWAP_STATS: given a pointer to an array of swapent structures 
+ *  [2] SWAP_STATS: given a pointer to an array of swapent structures
  * (passed in via "arg") of a size passed in via "misc" ... we load
  * the current swap config into the array.
  *  [3] SWAP_ON: given a pathname in arg (could be device or file) and a
@@ -208,9 +212,10 @@ struct extent *swapmap;/* controls the
 
 /* list of all active swap devices [by priority] */
 LIST_HEAD(swap_priority, swappri);
-struct swap_priority swap_priority;
+struct swap_priority swap_priority;/* [S] */
 
 /* locks */
+struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
 struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
 
 struct mutex oommtx = MUTEX_INITIALIZER(IPL_VM);
@@ -224,7 +229,7 @@ void swapdrum_add(struct swapdev *, in
 struct swapdev *swapdrum_getsdp(int);
 
 struct swapdev *swaplist_find(struct vnode *, int);
-voidswaplist_insert(struct swapdev *, 
+voidswaplist_insert(struct swapdev *,
 struct swappri *, int);
 voidswaplist_trim(void);
 
@@ -472,16 +477,19 @@ uvm_swap_finicrypt_all(void)
 /*
  * swaplist_insert: insert swap device "sdp" into the 

Re: patch: change swblk_t type and use it in blist

2022-08-05 Thread Martin Pieuchot
On 05/08/22(Fri) 18:10, Sebastien Marie wrote:
> Hi,
> 
> When initially ported blist from DragonFlyBSD, we used custom type bsblk_t 
> and 
> bsbmp_t instead of the one used by DragonFlyBSD (swblk_t and u_swblk_t).
> 
> The reason was swblk_t is already defined on OpenBSD, and was incompatible 
> with 
> blist (int32_t). It is defined, but not used (outside some regress file which 
> seems to be not affected by type change).
> 
> This diff changes the __swblk_t definition in sys/_types.h to be 'unsigned 
> long', and switch back blist to use swblk_t (and u_swblk_t, even if it isn't 
> 'unsigned swblk_t').
> 
> It makes the diff with DragonFlyBSD more thin. I added a comment with the git 
> id 
> used for the initial port.
> 
> I tested it on i386 and amd64 (kernel and userland).
> 
> By changing bitmap type from 'u_long' to 'u_swblk_t' ('u_int64_t'), it makes 
> the 
> regress the same on 64 and 32bits archs (and it success on both).
> 
> Comments or OK ?

Makes sense to me.  I'm not a standard/type lawyer so I don't know if
this is fine for userland.  So I'm ok with it.

> diff /home/semarie/repos/openbsd/src
> commit - 73f52ef7130cefbe5a8fe028eedaad0e54be7303
> path + /home/semarie/repos/openbsd/src
> blob - e05867429cdd81c434f9ca589c1fb8c6d25957f8
> file + sys/sys/_types.h
> --- sys/sys/_types.h
> +++ sys/sys/_types.h
> @@ -60,7 +60,7 @@ typedef __uint8_t   __sa_family_t;  /* sockaddr 
> address f
>  typedef  __int32_t   __segsz_t;  /* segment size */
>  typedef  __uint32_t  __socklen_t;/* length type for network 
> syscalls */
>  typedef  long__suseconds_t;  /* microseconds (signed) */
> -typedef  __int32_t   __swblk_t;  /* swap offset */
> +typedef  unsigned long   __swblk_t;  /* swap offset */
>  typedef  __int64_t   __time_t;   /* epoch time */
>  typedef  __int32_t   __timer_t;  /* POSIX timer identifiers */
>  typedef  __uint32_t  __uid_t;/* user id */
> blob - 102ca95dd45ba6d9cab0f3fcbb033d6043ec1606
> file + sys/sys/blist.h
> --- sys/sys/blist.h
> +++ sys/sys/blist.h
> @@ -1,4 +1,5 @@
>  /* $OpenBSD: blist.h,v 1.1 2022/07/29 17:47:12 semarie Exp $ */
> +/* DragonFlyBSD:7b80531f545c7d3c51c1660130c71d01f6bccbe0:/sys/sys/blist.h */
>  /*
>   * Copyright (c) 2003,2004 The DragonFly Project.  All rights reserved.
>   * 
> @@ -65,15 +66,13 @@
>  #include 
>  #endif
>  
> -#define  SWBLK_BITS 64
> -typedef u_long bsbmp_t;
> -typedef u_long bsblk_t;
> +typedef u_int64_tu_swblk_t;
>  
>  /*
>   * note: currently use SWAPBLK_NONE as an absolute value rather then
>   * a flag bit.
>   */
> -#define SWAPBLK_NONE ((bsblk_t)-1)
> +#define SWAPBLK_NONE ((swblk_t)-1)
>  
>  /*
>   * blmeta and bl_bitmap_t MUST be a power of 2 in size.
> @@ -81,39 +80,39 @@ typedef u_long bsblk_t;
>  
>  typedef struct blmeta {
>   union {
> - bsblk_t bmu_avail;  /* space available under us */
> - bsbmp_t bmu_bitmap; /* bitmap if we are a leaf  */
> + swblk_t bmu_avail;  /* space available under us */
> + u_swblk_t   bmu_bitmap; /* bitmap if we are a leaf  */
>   } u;
> - bsblk_t bm_bighint; /* biggest contiguous block hint*/
> + swblk_t bm_bighint; /* biggest contiguous block hint*/
>  } blmeta_t;
>  
>  typedef struct blist {
> - bsblk_t bl_blocks;  /* area of coverage */
> + swblk_t bl_blocks;  /* area of coverage */
>   /* XXX int64_t bl_radix */
> - bsblk_t bl_radix;   /* coverage radix   */
> - bsblk_t bl_skip;/* starting skip*/
> - bsblk_t bl_free;/* number of free blocks*/
> + swblk_t bl_radix;   /* coverage radix   */
> + swblk_t bl_skip;/* starting skip*/
> + swblk_t bl_free;/* number of free blocks*/
>   blmeta_t*bl_root;   /* root of radix tree   */
> - bsblk_t bl_rootblks;/* bsblk_t blks allocated for tree */
> + swblk_t bl_rootblks;/* swblk_t blks allocated for tree */
>  } *blist_t;
>  
> -#define BLIST_META_RADIX (sizeof(bsbmp_t)*8/2)   /* 2 bits per */
> -#define BLIST_BMAP_RADIX (sizeof(bsbmp_t)*8) /* 1 bit per */
> +#define BLIST_META_RADIX (sizeof(u_swblk_t)*8/2) /* 2 bits per */
> +#define BLIST_BMAP_RADIX (sizeof(u_swblk_t)*8)   /* 1 bit per */
>  
>  /*
>   * The radix may exceed the size of a 64 bit signed (or unsigned) int
> - * when the maximal number of blocks is allocated.  With a 32-bit bsblk_t
> + * when the maximal number of blocks is allocated.  With a 32-bit swblk_t
>   * this corresponds to ~1G x PAGE_SIZE = 4096GB.  The swap code usually
>   * divides this by 4, leaving us with a capability of up to four 1TB swap
>   * devices.
>   *
> - * With a 

Re: Introduce uvm_pagewait()

2022-07-11 Thread Martin Pieuchot
On 28/06/22(Tue) 14:13, Martin Pieuchot wrote:
> I'd like to abstract the use of PG_WANTED to start unifying & cleaning
> the various cases where a code path is waiting for a busy page.  Here's
> the first step.
> 
> ok?

Anyone?

> Index: uvm/uvm_amap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 uvm_amap.c
> --- uvm/uvm_amap.c30 Aug 2021 16:59:17 -  1.90
> +++ uvm/uvm_amap.c28 Jun 2022 11:53:08 -
> @@ -781,9 +781,7 @@ ReStart:
>* it and then restart.
>*/
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
> - "cownow", INFSLP);
> + uvm_pagewait(pg, amap->am_lock, "cownow");
>   goto ReStart;
>   }
>  
> Index: uvm/uvm_aobj.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
> retrieving revision 1.103
> diff -u -p -r1.103 uvm_aobj.c
> --- uvm/uvm_aobj.c29 Dec 2021 20:22:06 -  1.103
> +++ uvm/uvm_aobj.c28 Jun 2022 11:53:08 -
> @@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
>   while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
>   pmap_page_protect(pg, PROT_NONE);
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;
>   }
>   uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
> @@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
>  
>   /* Make sure page is unbusy, else wait for it. */
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE;
>   continue;
>   }
> @@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
>  
>   /* page is there, see if we need to wait on it */
>   if ((ptmp->pg_flags & PG_BUSY) != 0) {
> - atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
> - rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
> - "uao_get", INFSLP);
> + uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;   /* goto top of pps while loop */
>   }
>  
> Index: uvm/uvm_km.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_km.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uvm_km.c
> --- uvm/uvm_km.c  7 Jun 2022 12:07:45 -   1.150
> +++ uvm/uvm_km.c  28 Jun 2022 11:53:08 -
> @@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
>   for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
>   pp = uvm_pagelookup(uobj, curoff);
>   if (pp && pp->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pp->pg_flags, PG_WANTED);
> - rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
> - INFSLP);
> + uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE; /* loop back to us */
>   continue;
>   }
> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.166
> diff -u -p -r1.166 uvm_page.c
> --- uvm/uvm_page.c12 May 2022 12:48:36 -  1.166
> +++ uvm/uv

Faster M operation for the swapper to be great again

2022-06-30 Thread Martin Pieuchot
Diff below uses two tricks to make uvm_pagermapin/out() faster and less
likely to fail in OOM situations.

These functions are used to map buffers when swapping pages in/out and
when faulting on mmaped files.  robert@ even measured a 75% improvement
when populating pages related to files that aren't yet in the buffer
cache.

The first trick is to use the direct map when available.  I'm doing this
for single pages but km_alloc(9) also does that for single segment...
uvm_io() only maps one page at a time for the moment so this should be
enough.

The second trick is to use pmap_kenter_pa() which doesn't fail and is
faster.

With these changes the "freeze" happening on my server when entering many
pages to swap in an OOM situation is much shorter and the machine quickly
becomes responsive again.

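In other words, for a single page the map/unmap pair now reduces to the
following (a simplified sketch of the diff below):

	/* map: use the direct map, no KVA allocation, cannot fail */
	kva = pmap_map_direct(pg);
	...
	/* unmap */
	pmap_unmap_direct(kva);
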
ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.81
diff -u -p -r1.81 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 19:07:40 -  1.81
+++ uvm/uvm_pager.c 30 Jun 2022 13:34:46 -
@@ -258,6 +258,16 @@ uvm_pagermapin(struct vm_page **pps, int
vsize_t size;
struct vm_page *pp;
 
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   KASSERT(pps[0]);
+   KASSERT(pps[0]->pg_flags & PG_BUSY);
+   kva = pmap_map_direct(pps[0]);
+   return kva;
+   }
+#endif
+
prot = PROT_READ;
if (flags & UVMPAGER_MAPIN_READ)
prot |= PROT_WRITE;
@@ -273,14 +283,7 @@ uvm_pagermapin(struct vm_page **pps, int
pp = *pps++;
KASSERT(pp);
KASSERT(pp->pg_flags & PG_BUSY);
-   /* Allow pmap_enter to fail. */
-   if (pmap_enter(pmap_kernel(), cva, VM_PAGE_TO_PHYS(pp),
-   prot, PMAP_WIRED | PMAP_CANFAIL | prot) != 0) {
-   pmap_remove(pmap_kernel(), kva, cva);
-   pmap_update(pmap_kernel());
-   uvm_pseg_release(kva);
-   return 0;
-   }
+   pmap_kenter_pa(cva, VM_PAGE_TO_PHYS(pp), prot);
}
pmap_update(pmap_kernel());
return kva;
@@ -294,8 +297,15 @@ uvm_pagermapin(struct vm_page **pps, int
 void
 uvm_pagermapout(vaddr_t kva, int npages)
 {
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   pmap_unmap_direct(kva);
+   return;
+   }
+#endif
 
-   pmap_remove(pmap_kernel(), kva, kva + ((vsize_t)npages << PAGE_SHIFT));
+   pmap_kremove(kva, (vsize_t)npages << PAGE_SHIFT);
pmap_update(pmap_kernel());
uvm_pseg_release(kva);
 



Re: Use SMR instead of SRP list in rtsock.c

2022-06-30 Thread Martin Pieuchot
On 30/06/22(Thu) 11:56, Claudio Jeker wrote:
> On Thu, Jun 30, 2022 at 12:34:33PM +0300, Vitaliy Makkoveev wrote:
> > On Thu, Jun 30, 2022 at 11:08:48AM +0200, Claudio Jeker wrote:
> > > This diff converts the SRP list to a SMR list in rtsock.c
> > > SRP is a bit strange with how it works and the SMR code is a bit easier to
> > > understand. Since we can sleep in the SMR_TAILQ_FOREACH() we need to grab
> > > a refcount on the route pcb so that we can leave the SMR critical section
> > > and then enter the SMR critical section at the end of the loop before
> > > dropping the refcount again.
> > > 
> > > The diff does not immeditaly explode but I doubt we can exploit
> > > parallelism in route_input() so this may fail at some later stage if it is
> > > wrong.
> > > 
> > > Comments from the lock critics welcome
> > 
> > We use `so_lock' rwlock(9) to protect route domain sockets. We can't
> > convert this SRP list to SMR list because we call solock() within
> > foreach loop.

We shouldn't use SRP list either, no?  Or are we allowed to sleep
holding a SRP reference?  That's the question that triggered this diff.

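As I understand the description, the loop ends up looking roughly like
this (sketch only, the route_input() hunk is truncated below):

	smr_read_enter();
	SMR_TAILQ_FOREACH(rop, &rtptable.rtp_list, rop_list) {
		refcnt_take(&rop->rop_refcnt);
		smr_read_leave();

		/* ... solock(), deliver the message, possibly sleep ... */

		smr_read_enter();
		refcnt_rele_wake(&rop->rop_refcnt);
	}
	smr_read_leave();
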
> because of the so_lock the code uses a refcnt on the route pcb to make
> sure that the object is not freed while we sleep. So that is handled by
> this diff.
>  
> > We can easily crash kernel by running in parallel some "route monitor"
> > commands and "while true; ifconfig vether0 create ; ifconfig vether0
> > destroy; done".
> 
> That does not cause problem on my system.
>  
> > > -- 
> > > :wq Claudio
> > > 
> > > Index: sys/net/rtsock.c
> > > ===
> > > RCS file: /cvs/src/sys/net/rtsock.c,v
> > > retrieving revision 1.334
> > > diff -u -p -r1.334 rtsock.c
> > > --- sys/net/rtsock.c  28 Jun 2022 10:01:13 -  1.334
> > > +++ sys/net/rtsock.c  30 Jun 2022 08:02:09 -
> > > @@ -71,7 +71,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -107,8 +107,6 @@ struct walkarg {
> > >  };
> > >  
> > >  void route_prinit(void);
> > > -void rcb_ref(void *, void *);
> > > -void rcb_unref(void *, void *);
> > >  int  route_output(struct mbuf *, struct socket *, struct sockaddr *,
> > >   struct mbuf *);
> > >  int  route_ctloutput(int, struct socket *, int, int, struct mbuf *);
> > > @@ -149,7 +147,7 @@ intrt_setsource(unsigned int, struct 
> > >  struct rtpcb {
> > >   struct socket   *rop_socket;/* [I] */
> > >  
> > > - SRPL_ENTRY(rtpcb)   rop_list;
> > > + SMR_TAILQ_ENTRY(rtpcb)  rop_list;
> > >   struct refcnt   rop_refcnt;
> > >   struct timeout  rop_timeout;
> > >   unsigned introp_msgfilter;  /* [s] */
> > > @@ -162,8 +160,7 @@ struct rtpcb {
> > >  #define  sotortpcb(so)   ((struct rtpcb *)(so)->so_pcb)
> > >  
> > >  struct rtptable {
> > > - SRPL_HEAD(, rtpcb)  rtp_list;
> > > - struct srpl_rc  rtp_rc;
> > > + SMR_TAILQ_HEAD(, rtpcb) rtp_list;
> > >   struct rwlock   rtp_lk;
> > >   unsigned intrtp_count;
> > >  };
> > > @@ -185,29 +182,12 @@ struct rtptable rtptable;
> > >  void
> > >  route_prinit(void)
> > >  {
> > > - srpl_rc_init(_rc, rcb_ref, rcb_unref, NULL);
> > >   rw_init(_lk, "rtsock");
> > > - SRPL_INIT(_list);
> > > + SMR_TAILQ_INIT(_list);
> > >   pool_init(_pool, sizeof(struct rtpcb), 0,
> > >   IPL_SOFTNET, PR_WAITOK, "rtpcb", NULL);
> > >  }
> > >  
> > > -void
> > > -rcb_ref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_take(>rop_refcnt);
> > > -}
> > > -
> > > -void
> > > -rcb_unref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_rele_wake(>rop_refcnt);
> > > -}
> > > -
> > >  int
> > >  route_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf 
> > > *nam,
> > >  struct mbuf *control, struct proc *p)
> > > @@ -325,8 +305,7 @@ route_attach(struct socket *so, int prot
> > >   so->so_options |= SO_USELOOPBACK;
> > >  
> > >   rw_enter(_lk, RW_WRITE);
> > > - SRPL_INSERT_HEAD_LOCKED(_rc, _list, rop,
> > > - rop_list);
> > > + SMR_TAILQ_INSERT_HEAD_LOCKED(_list, rop, rop_list);
> > >   rtptable.rtp_count++;
> > >   rw_exit(_lk);
> > >  
> > > @@ -347,8 +326,7 @@ route_detach(struct socket *so)
> > >   rw_enter(_lk, RW_WRITE);
> > >  
> > >   rtptable.rtp_count--;
> > > - SRPL_REMOVE_LOCKED(_rc, _list, rop, rtpcb,
> > > - rop_list);
> > > + SMR_TAILQ_REMOVE_LOCKED(_list, rop, rop_list);
> > >   rw_exit(_lk);
> > >  
> > >   sounlock(so);
> > > @@ -356,6 +334,7 @@ route_detach(struct socket *so)
> > >   /* wait for all references to drop */
> > >   refcnt_finalize(>rop_refcnt, "rtsockrefs");
> > >   timeout_del_barrier(>rop_timeout);
> > > + smr_barrier();
> > >  
> > >   solock(so);
> > >  
> > > @@ -501,7 +480,6 @@ route_input(struct mbuf *m0, struct sock
> > >   struct rtpcb *rop;
> > >   

Re: arp llinfo mutex

2022-06-29 Thread Martin Pieuchot
On 29/06/22(Wed) 19:40, Alexander Bluhm wrote:
> Hi,
> 
> To fix the KASSERT(la != NULL) we have to protect the rt_llinfo
> with a mutex.  The idea is to keep rt_llinfo and RTF_LLINFO consistent.
> Also do not put the mutex in the fast path.

Losing the RTM_ADD/DELETE race is not a bug.  I would not add a printf
in these cases.  I understand you might want one for debugging purposes
but I don't see any value in committing it.  Do you agree?

Note that sometimes the code checks for the RTF_LLINFO flag and
sometimes for rt_llinfo != NULL.  This is inconsistent and a bit confusing
now that we use a mutex to protect those states.

Could you document that rt_llinfo is now protected by the mutex (or
KERNEL_LOCK())?

Anyway this is an improvement ok mpi@

PS: What about ND6?

> Index: netinet/if_ether.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/if_ether.c,v
> retrieving revision 1.250
> diff -u -p -r1.250 if_ether.c
> --- netinet/if_ether.c27 Jun 2022 20:47:10 -  1.250
> +++ netinet/if_ether.c28 Jun 2022 14:00:12 -
> @@ -101,6 +101,8 @@ void arpreply(struct ifnet *, struct mbu
>  unsigned int);
>  
>  struct niqueue arpinq = NIQUEUE_INITIALIZER(50, NETISR_ARP);
> +
> +/* llinfo_arp live time, rt_llinfo and RTF_LLINFO are protected by arp_mtx */
>  struct mutex arp_mtx = MUTEX_INITIALIZER(IPL_SOFTNET);
>  
>  LIST_HEAD(, llinfo_arp) arp_list; /* [mN] list of all llinfo_arp structures 
> */
> @@ -155,7 +157,7 @@ void
>  arp_rtrequest(struct ifnet *ifp, int req, struct rtentry *rt)
>  {
>   struct sockaddr *gate = rt->rt_gateway;
> - struct llinfo_arp *la = (struct llinfo_arp *)rt->rt_llinfo;
> + struct llinfo_arp *la;
>   time_t uptime;
>  
>   NET_ASSERT_LOCKED();
> @@ -171,7 +173,7 @@ arp_rtrequest(struct ifnet *ifp, int req
>   rt->rt_expire = 0;
>   break;
>   }
> - if ((rt->rt_flags & RTF_LOCAL) && !la)
> + if ((rt->rt_flags & RTF_LOCAL) && rt->rt_llinfo == NULL)
>   rt->rt_expire = 0;
>   /*
>* Announce a new entry if requested or warn the user
> @@ -192,44 +194,54 @@ arp_rtrequest(struct ifnet *ifp, int req
>   }
>   satosdl(gate)->sdl_type = ifp->if_type;
>   satosdl(gate)->sdl_index = ifp->if_index;
> - if (la != NULL)
> - break; /* This happens on a route change */
>   /*
>* Case 2:  This route may come from cloning, or a manual route
>* add with a LL address.
>*/
>   la = pool_get(&arp_pool, PR_NOWAIT | PR_ZERO);
> - rt->rt_llinfo = (caddr_t)la;
>   if (la == NULL) {
>   log(LOG_DEBUG, "%s: pool get failed\n", __func__);
>   break;
>   }
>  
> + mtx_enter(&arp_mtx);
> + if (rt->rt_llinfo != NULL) {
> + /* we lost the race, another thread has entered it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo exists\n", __func__);
> + pool_put(&arp_pool, la);
> + break;
> + }
>   mq_init(&la->la_mq, LA_HOLD_QUEUE, IPL_SOFTNET);
> + rt->rt_llinfo = (caddr_t)la;
>   la->la_rt = rt;
>   rt->rt_flags |= RTF_LLINFO;
> + LIST_INSERT_HEAD(&arp_list, la, la_list);
>   if ((rt->rt_flags & RTF_LOCAL) == 0)
>   rt->rt_expire = uptime;
> - mtx_enter(&arp_mtx);
> - LIST_INSERT_HEAD(&arp_list, la, la_list);
>   mtx_leave(&arp_mtx);
> +
>   break;
>  
>   case RTM_DELETE:
> - if (la == NULL)
> - break;
>   mtx_enter(&arp_mtx);
> + la = (struct llinfo_arp *)rt->rt_llinfo;
> + if (la == NULL) {
> + /* we lost the race, another thread has removed it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo missing\n", __func__);
> + break;
> + }
>   LIST_REMOVE(la, la_list);
> - mtx_leave(&arp_mtx);
>   rt->rt_llinfo = NULL;
>   rt->rt_flags &= ~RTF_LLINFO;
>   atomic_sub_int(&la_hold_total, mq_purge(&la->la_mq));
> + mtx_leave(&arp_mtx);
> +
>   pool_put(&arp_pool, la);
>   break;
>  
>   case RTM_INVALIDATE:
> - if (la == NULL)
> - break;
>   if (!ISSET(rt->rt_flags, RTF_LOCAL))
>   arpinvalidate(rt);
>   break;
> @@ -363,8 +375,6 @@ arpresolve(struct ifnet *ifp, struct rte
>   goto bad;
>   }
>  
> - la = (struct llinfo_arp *)rt->rt_llinfo;
> - KASSERT(la != NULL);
>  
>   /*
>* Check the 

Simplify aiodone daemon

2022-06-29 Thread Martin Pieuchot
The aiodone daemon accounts for and frees/releases pages that were
written to swap.  It is only used for asynchronous writes.  The diff
below uses this knowledge to:

- Stop suggesting that uvm_swap_get() can be asynchronous.  There's an
  assert for PGO_SYNCIO 3 lines above.

- Remove unused support for asynchronous read, including error
  conditions, from uvm_aio_aiodone_pages().

- Grab the proper lock for each page that has been written to swap
  (see the sketch below).  This allows us to enable an assert in
  uvm_page_unbusy().

- Move the uvm_anon_release() call outside of uvm_page_unbusy() and
  assert for the different anon cases.  This will allow us to unify
  code paths waiting for busy pages.

This is adapted/simplified from what is in NetBSD.

ok?

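For the third bullet, the shape of the change in uvm_aio_aiodone_pages()
is roughly the following (a sketch assuming swap-backed pages are always
anon- or aobj-owned; the uvm_pager.c hunk itself is not visible below):

	if (pg->uobject != NULL)
		slock = pg->uobject->vmobjlock;
	else
		slock = pg->uanon->an_lock;
	rw_enter(slock, RW_WRITE);
	uvm_page_unbusy(&pg, 1);
	rw_exit(slock);
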
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  29 Jun 2022 11:16:35 -
@@ -143,7 +143,6 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +240,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  29 Jun 2022 11:16:35 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  29 Jun 2022 11:47:55 -
@@ -1036,13 +1036,14 @@ uvm_pagefree(struct vm_page *pg)
  * uvm_page_unbusy: unbusy an array of pages.
  *
  * => pages must either all belong to the same object, or all belong to anons.
+ * => if pages are object-owned, object must be locked.
  * => if pages are anon-owned, anons must have 0 refcount.
+ * => caller must make sure that anon-owned pages are not PG_RELEASED.
  */
 void
 uvm_page_unbusy(struct vm_page **pgs, int npgs)
 {
struct vm_page *pg;
-   struct uvm_object *uobj;
int i;
 
for (i = 0; i < npgs; i++) {
@@ -1052,35 +1053,19 @@ uvm_page_unbusy(struct vm_page **pgs, in
continue;
}
 
-#if notyet
-   /*
- * XXX swap case in uvm_aio_aiodone() is not holding the lock.
-*
-* This isn't compatible with the PG_RELEASED anon case below.
-*/
KASSERT(uvm_page_owner_locked_p(pg));
-#endif
KASSERT(pg->pg_flags & PG_BUSY);
 
if (pg->pg_flags & PG_WANTED) {
wakeup(pg);
}
if (pg->pg_flags & PG_RELEASED) {
-   uobj = pg->uobject;
-   if (uobj != NULL) {
-   uvm_lock_pageq();
-   pmap_page_protect(pg, PROT_NONE);
-   /* XXX won't happen right now */
-   if (pg->pg_flags & PQ_AOBJ)
-   uao_dropswap(uobj,
-   pg->offset >> PAGE_SHIFT);
-   uvm_pagefree(pg);
-   uvm_unlock_pageq();
-   } else {
-   rw_enter(pg->uanon->an_lock, RW_WRITE);
-   uvm_anon_release(pg->uanon);
-   }
+   KASSERT(pg->uobject != NULL ||
+   (pg->uanon != NULL && pg->uanon->an_ref > 0));
+   atomic_clearbits_int(&pg->pg_flags, PG_RELEASED);
+   uvm_pagefree(pg);
} else {
+   KASSERT((pg->pg_flags & PG_FAKE) == 0);
atomic_clearbits_int(&pg->pg_flags, PG_WANTED|PG_BUSY);
UVM_PAGE_OWN(pg, NULL);
   

Re: Unlocking pledge(2)

2022-06-28 Thread Martin Pieuchot
On 28/06/22(Tue) 18:17, Jeremie Courreges-Anglas wrote:
> 
> Initially I just wandered in syscall_mi.h and found the locking scheme
> super weird, even if technically correct.  pledge_syscall() better be
> safe to call without the kernel lock so I don't understand why we're
> sometimes calling it with the kernel lock and sometimes not.
> 
> ps_pledge is 64 bits so it's not possible to unset bits in an atomic
> manner on all architectures.  Even if we're only removing bits and there
> is probably no way to see a completely garbage value, it makes sense to
> just protect ps_pledge (and ps_execpledge) in the usual manner so that
> we can unlock the syscall.  The diff below protects the fields using
> ps_mtx even though I initially used a dedicated ps_pledge_mtx.
> unveil_destroy() needs to be moved after the critical section.
> regress/sys/kern/pledge looks happy with this.  The sys/syscall_mi.h
> change can be committed in a separate step.
> 
> Input and oks welcome.

This looks nice.  I doubt there's any existing program where you can
really test this.  Even firefox and chromium should do things
correctly.

Maybe you should write a regress test that tries to break the kernel.

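Something along these lines would already exercise the race (an
untested sketch, not an existing regress test):

	#include <err.h>
	#include <pthread.h>
	#include <unistd.h>

	static void *
	hammer(void *arg)
	{
		for (;;)
			(void)getpid();		/* any pledged syscall */
		return NULL;
	}

	int
	main(void)
	{
		pthread_t t;
		int i;

		if (pthread_create(&t, NULL, hammer, NULL) != 0)
			err(1, "pthread_create");
		for (i = 0; i < 100000; i++)
			if (pledge("stdio rpath inet unix dns", NULL) == -1)
				err(1, "pledge");
		return 0;
	}
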
> Index: arch/amd64/amd64/vmm.c
> ===
> RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.315
> diff -u -p -r1.315 vmm.c
> --- arch/amd64/amd64/vmm.c27 Jun 2022 15:12:14 -  1.315
> +++ arch/amd64/amd64/vmm.c28 Jun 2022 13:54:25 -
> @@ -713,7 +713,7 @@ pledge_ioctl_vmm(struct proc *p, long co
>   case VMM_IOC_CREATE:
>   case VMM_IOC_INFO:
>   /* The "parent" process in vmd forks and manages VMs */
> - if (p->p_p->ps_pledge & PLEDGE_PROC)
> + if (pledge_get(p->p_p) & PLEDGE_PROC)
>   return (0);
>   break;
>   case VMM_IOC_TERM:
> @@ -1312,7 +1312,7 @@ vm_find(uint32_t id, struct vm **res)
>* The managing vmm parent process can lookup all
>* all VMs and is indicated by PLEDGE_PROC.
>*/
> - if (((p->p_p->ps_pledge &
> + if (((pledge_get(p->p_p) &
>   (PLEDGE_VMM | PLEDGE_PROC)) == PLEDGE_VMM) &&
>   (vm->vm_creator_pid != p->p_p->ps_pid))
>   return (pledge_fail(p, EPERM, PLEDGE_VMM));
> Index: kern/init_sysent.c
> ===
> RCS file: /home/cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.238
> diff -u -p -r1.238 init_sysent.c
> --- kern/init_sysent.c27 Jun 2022 14:26:05 -  1.238
> +++ kern/init_sysent.c28 Jun 2022 15:18:25 -
> @@ -1,10 +1,10 @@
> -/*   $OpenBSD: init_sysent.c,v 1.238 2022/06/27 14:26:05 cheloha Exp $   
> */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
>   *
>   * DO NOT EDIT-- this file is automatically generated.
> - * created from; OpenBSD: syscalls.master,v 1.224 2022/05/16 07:36:04 
> mvs Exp 
> + * created from; OpenBSD: syscalls.master,v 1.225 2022/06/27 14:26:05 
> cheloha Exp 
>   */
>  
>  #include 
> @@ -248,7 +248,7 @@ const struct sysent sysent[] = {
>   sys_listen },   /* 106 = listen */
>   { 4, s(struct sys_chflagsat_args), 0,
>   sys_chflagsat },/* 107 = chflagsat */
> - { 2, s(struct sys_pledge_args), 0,
> + { 2, s(struct sys_pledge_args), SY_NOLOCK | 0,
>   sys_pledge },   /* 108 = pledge */
>   { 4, s(struct sys_ppoll_args), 0,
>   sys_ppoll },/* 109 = ppoll */
> Index: kern/kern_event.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_event.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 kern_event.c
> --- kern/kern_event.c 27 Jun 2022 13:35:21 -  1.191
> +++ kern/kern_event.c 28 Jun 2022 13:55:18 -
> @@ -331,7 +331,7 @@ filt_procattach(struct knote *kn)
>   int s;
>  
>   if ((curproc->p_p->ps_flags & PS_PLEDGE) &&
> - (curproc->p_p->ps_pledge & PLEDGE_PROC) == 0)
> + (pledge_get(curproc->p_p) & PLEDGE_PROC) == 0)
>   return pledge_fail(curproc, EPERM, PLEDGE_PROC);
>  
>   if (kn->kn_id > PID_MAX)
> Index: kern/kern_pledge.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_pledge.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 kern_pledge.c
> --- kern/kern_pledge.c26 Jun 2022 06:11:49 -  1.282
> +++ kern/kern_pledge.c28 Jun 2022 15:21:46 -
> @@ -21,6 +21,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -465,13 +466,26 @@ sys_pledge(struct proc *p, void *v, regi
>   struct process *pr = p->p_p;
>   uint64_t 

Re: Fix the swapper

2022-06-28 Thread Martin Pieuchot
On 27/06/22(Mon) 15:44, Martin Pieuchot wrote:
> Diff below contain 3 parts that can be committed independently.  The 3
> of them are necessary to allow the pagedaemon to make progress in OOM
> situation and to satisfy all the allocations waiting for pages in
> specific ranges.
> 
> * uvm/uvm_pager.c part reserves a second segment for the page daemon.
>   This is necessary to ensure the two uvm_pagermapin() calls needed by
>   uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
>   necessary when encryption or bouncing is required)
> 
> * uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
>   for the same reason.  Note that a sleeping point is introduced because
>   the pagedaemon is faster than the asynchronous I/O and in OOM
>   situation it tends to stay busy building cluster that it then discard
>   because no memory is available.
> 
> * uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
>   list of pages to account for a given memory range.  Without this the
>   daemon could spin infinitely doing nothing because the global limits
>   are reached.

Here's an updated diff with a fix on top:

 * in uvm/uvm_swap.c make sure uvm_swap_allocpages() is allowed to sleep
   when coming from uvm_fault().  This makes the faulting process wait
   instead of dying when there aren't any free pages to do the bouncing.

I'd appreciate more reviews and tests!

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.80
diff -u -p -r1.80 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 12:10:37 -  1.80
+++ uvm/uvm_pager.c 28 Jun 2022 15:25:30 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup();
 
-   if (pseg != [0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != [0] && pseg != [1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   28 Jun 2022 13:59:49 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-

Introduce uvm_pagewait()

2022-06-28 Thread Martin Pieuchot
I'd like to abstract the use of PG_WANTED to start unifying & cleaning
the various cases where a code path is waiting for a busy page.  Here's
the first step.

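Callers then follow the pattern below (sketch, matching the converted
call sites in the diff):

	if (pg->pg_flags & PG_BUSY) {
		/* drops the object lock while sleeping */
		uvm_pagewait(pg, uobj->vmobjlock, "pgwait");
		rw_enter(uobj->vmobjlock, RW_WRITE);
		continue;
	}
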
ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.90
diff -u -p -r1.90 uvm_amap.c
--- uvm/uvm_amap.c  30 Aug 2021 16:59:17 -  1.90
+++ uvm/uvm_amap.c  28 Jun 2022 11:53:08 -
@@ -781,9 +781,7 @@ ReStart:
 * it and then restart.
 */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
-   "cownow", INFSLP);
+   uvm_pagewait(pg, amap->am_lock, "cownow");
goto ReStart;
}
 
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  28 Jun 2022 11:53:08 -
@@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
pmap_page_protect(pg, PROT_NONE);
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;
}
uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
@@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
 
/* Make sure page is unbusy, else wait for it. */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE;
continue;
}
@@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
 
/* page is there, see if we need to wait on it */
if ((ptmp->pg_flags & PG_BUSY) != 0) {
-   atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
-   rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
-   "uao_get", INFSLP);
+   uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;   /* goto top of pps while loop */
}
 
Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_km.c
--- uvm/uvm_km.c7 Jun 2022 12:07:45 -   1.150
+++ uvm/uvm_km.c28 Jun 2022 11:53:08 -
@@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pp->pg_flags, PG_WANTED);
-   rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
-   INFSLP);
+   uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE; /* loop back to us */
continue;
}
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  28 Jun 2022 11:57:42 -
@@ -1087,6 +1087,23 @@ uvm_page_unbusy(struct vm_page **pgs, in
}
 }
 
+/*
+ * uvm_pagewait: wait for a busy page
+ *
+ * => page must be known PG_BUSY
+ * => object must be locked
+ * => object will be unlocked on return
+ */
+void
+uvm_pagewait(struct vm_page *pg, struct rwlock *lock, const char *wmesg)
+{
+   KASSERT(rw_lock_held(lock));
+   KASSERT((pg->pg_flags & PG_BUSY) != 0);
+
+   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
+   rwsleep_nsec(pg, lock, PVM | PNORELOCK, wmesg, INFSLP);
+}
+
 #if defined(UVM_PAGE_TRKOWN)
 /*
  * uvm_page_own: set or 

Re: kernel lock in arp

2022-06-27 Thread Martin Pieuchot
On 27/06/22(Mon) 19:11, Alexander Bluhm wrote:
> On Mon, Jun 27, 2022 at 11:49:23AM +0200, Alexander Bluhm wrote:
> > On Sat, May 21, 2022 at 10:50:28PM +0300, Vitaliy Makkoveev wrote:
> > > This diff looks good, except the re-check after kernel lock. It???s
> > > supposed `rt??? could became inconsistent, right? But what stops to
> > > make it inconsistent after first unlocked RTF_LLINFO flag check?
> > >
> > > I think this re-check should gone.
> >
> > I have copied the re-check from intenal genua code.  I am not sure
> > if it is really needed.  We know from Hrvoje that the diff with
> > re-check is stable.  And we know that it crashes without kernel
> > lock at all.
> >
> > I have talked with mpi@ about it.  The main problem is that we have
> > no write lock when we change RTF_LLINFO.  Then rt_llinfo can get
> > NULL or inconsistent.
> >
> > Plan is that I put some lock asserts into route add and delete.
> > This helps to find the parts that modify RTF_LLINFO and rt_llinfo
> > without exclusive lock.
> >
> > Maybe we need some kernel lock somewhere else.  Or we want to use
> > some ARP mutex.  We could also add some comment and commit the diff
> > that I have.  We know that it is faster and stable.  Pushing the
> > kernel lock down or replacing it with something clever can always
> > be done later.
> 
> We need the re-check.  I have tested it with a printf.  It is
> triggered by running arp -d in a loop while forwarding.
> 
> The concurrent threads are these:
> 
> rtrequest_delete(8000246b7428,3,80775048,8000246b7510,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a23550,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246b7678,3,8000246b7718,0) at rtrequest+0x55c
> rt_clone(8000246b7780,8000246b78f8,0) at rt_clone+0x73
> rtalloc_mpath(8000246b78f8,fd8003169ad8,0) at rtalloc_mpath+0x4c
> ip_forward(fd80b8cc7e00,8077d048,fd8834a230f0,0) at 
> ip_forward+0x137
> ip_input_if(8000246b7a28,8000246b7a34,4,0,8077d048) at 
> ip_input_if+0x353
> ipv4_input(8077d048,fd80b8cc7e00) at ipv4_input+0x39
> ether_input(8077d048,fd80b8cc7e00) at ether_input+0x3ad
> if_input_process(8077d048,8000246b7b18) at if_input_process+0x6f
> ifiq_process(8077d458) at ifiq_process+0x69
> taskq_thread(80036080) at taskq_thread+0x100
> 
> rtrequest_delete(8000246c8d08,3,80775048,8000246c8df0,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a230f0,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246c8f58,3,8000246c8ff8,0) at rtrequest+0x55c
> rt_clone(8000246c9060,8000246c90b8,0) at rt_clone+0x73
> rtalloc_mpath(8000246c90b8,fd8002c754d8,0) at rtalloc_mpath+0x4c
> in_ouraddr(fd8094771b00,8077d048,8000246c9138) at 
> in_ouraddr+0x84
> ip_input_if(8000246c91d8,8000246c91e4,4,0,8077d048) at 
> ip_input_if+0x1cd
> ipv4_input(8077d048,fd8094771b00) at ipv4_input+0x39
> ether_input(8077d048,fd8094771b00) at ether_input+0x3ad
> if_input_process(8077d048,8000246c92c8) at if_input_process+0x6f
> ifiq_process(80781400) at ifiq_process+0x69
> taskq_thread(80036200) at taskq_thread+0x100
> 
> I have added a comment why kernel lock protects us.  I would like
> to get this in.  It has been tested, reduces the kernel lock and
> is faster.  A more clever lock can be done later.
> 
> ok?

I don't understand how the KERNEL_LOCK() there prevents rtdeletemsg()
from running.  rtrequest_delete() seems completely broken: it assumes it
holds an exclusive lock.

To "fix" arp the KERNEL_LOCK() should also be taken in RTM_DELETE and
RTM_RESOLVE inside arp_rtrequest().  Or maybe around ifp->if_rtrequest()

But it doesn't mean there isn't another problem in rtdeletemsg()...

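To be explicit, the second option would look roughly like this at the
ifp->if_rtrequest() call sites (untested sketch):

	KERNEL_LOCK();
	ifp->if_rtrequest(ifp, req, rt);
	KERNEL_UNLOCK();
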
> Index: net/if_ethersubr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> retrieving revision 1.281
> diff -u -p -r1.281 if_ethersubr.c
> --- net/if_ethersubr.c26 Jun 2022 21:19:53 -  1.281
> +++ net/if_ethersubr.c27 Jun 2022 16:55:15 -
> @@ -221,10 +221,7 @@ ether_resolve(struct ifnet *ifp, struct 
>  
>   switch (af) {
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> - KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IP);
> @@ -285,10 +282,7 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #endif
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> -

CoW & neighbor pages

2022-06-27 Thread Martin Pieuchot
When faulting a page after COW, neighboring pages are likely to already
be entered.  So speed up the fault by doing a narrow fault (do not try
to map in adjacent pages).

This is stolen from NetBSD.

ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.129
diff -u -p -r1.129 uvm_fault.c
--- uvm/uvm_fault.c 4 Apr 2022 09:27:05 -   1.129
+++ uvm/uvm_fault.c 27 Jun 2022 17:05:26 -
@@ -737,6 +737,16 @@ uvm_fault_check(struct uvm_faultinfo *uf
}
 
/*
+* for a case 2B fault waste no time on adjacent pages because
+* they are likely already entered.
+*/
+   if (uobj != NULL && amap != NULL &&
+   (flt->access_type & PROT_WRITE) != 0) {
+   /* wide fault (!narrow) */
+   flt->narrow = TRUE;
+   }
+
+   /*
 * establish range of interest based on advice from mapper
 * and then clip to fit map entry.   note that we only want
 * to do this the first time through the fault.   if we



Fix the swapper

2022-06-27 Thread Martin Pieuchot
Diff below contain 3 parts that can be committed independently.  The 3
of them are necessary to allow the pagedaemon to make progress in OOM
situation and to satisfy all the allocations waiting for pages in
specific ranges.

* uvm/uvm_pager.c part reserves a second segment for the page daemon.
  This is necessary to ensure the two uvm_pagermapin() calls needed by
  uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
  necessary when encryption or bouncing is required)

* uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
  for the same reason.  Note that a sleeping point is introduced because
  the pagedaemon is faster than the asynchronous I/O and in OOM
  situations it tends to stay busy building clusters that it then discards
  because no memory is available.

* uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
  list of pages to account for a given memory range.  Without this the
  daemon could spin infinitely doing nothing because the global limits
  are reached.

A lot could be improved, but this at least makes swapping work in OOM
situations.

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.78
diff -u -p -r1.78 uvm_pager.c
--- uvm/uvm_pager.c 18 Feb 2022 09:04:38 -  1.78
+++ uvm/uvm_pager.c 27 Jun 2022 08:44:41 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup(&psegs);
 
-   if (pseg != &psegs[0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != &psegs[0] && pseg != &psegs[1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   27 Jun 2022 13:24:54 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-   struct vm_page *swpps[MAXBSIZE >> PAGE_SHIFT];  /* XXX: see below */
+   struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -404,8 +404,27 @@ uvmpd_scan_inactive(struct pglist *pglst
swnpages = swcpages = 0;
free = 0;
dirtyreacts = 0;
+   p = NULL;
 
-   for (p 

pdaemon: reserve memory for swapping

2022-06-26 Thread Martin Pieuchot
uvm_swap_io() needs to perform up to 4 allocations to write pages to
disk.  In OOM situations uvm_swap_allocpages() always fails because the
kernel doesn't reserve enough pages.

Diff below sets `uvmexp.reserve_pagedaemon' to the number of pages needed
to write a cluster of pages to disk.  With this my machine does not
deadlock and can push pages to swap in the OOM case.
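
For scale: assuming MAXBSIZE is 64KB and the page size is 4KB (as on
amd64), MAXBSIZE >> PAGE_SHIFT is 16, so `uvmexp.reserve_pagedaemon'
goes from 4 to 16 pages and `uvmexp.reserve_kernel' from 8 to 20.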

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  26 Jun 2022 08:17:34 -
@@ -280,10 +280,13 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 
/*
 * init reserve thresholds
-* XXXCDC - values may need adjusting
+*
+* The pagedaemon needs to always be able to write pages to disk,
+* Reserve the minimum amount of pages, a cluster, required by
+* uvm_swap_allocpages()
 */
-   uvmexp.reserve_pagedaemon = 4;
-   uvmexp.reserve_kernel = 8;
+   uvmexp.reserve_pagedaemon = (MAXBSIZE >> PAGE_SHIFT);
+   uvmexp.reserve_kernel = uvmexp.reserve_pagedaemon + 4;
uvmexp.anonminpct = 10;
uvmexp.vnodeminpct = 10;
uvmexp.vtextminpct = 5;



Re: set RTF_DONE in sysctl_dumpentry for the routing table

2022-06-08 Thread Martin Pieuchot
On 08/06/22(Wed) 16:13, Claudio Jeker wrote:
> Notice while hacking in OpenBGPD. Unlike routing socket messages the
> messages from the sysctl interface have RTF_DONE not set.
> I think it would make sense to set RTF_DONE also in this case since it
> makes reusing code easier.
> 
> All messages sent out via sysctl_dumpentry() have been processed by the
> kernel so setting RTF_DONE kind of makes sense.

I agree, ok mpi@

> -- 
> :wq Claudio
> 
> Index: rtsock.c
> ===
> RCS file: /cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.328
> diff -u -p -r1.328 rtsock.c
> --- rtsock.c  6 Jun 2022 14:45:41 -   1.328
> +++ rtsock.c  8 Jun 2022 14:10:20 -
> @@ -1987,7 +1987,7 @@ sysctl_dumpentry(struct rtentry *rt, voi
>   struct rt_msghdr *rtm = (struct rt_msghdr *)w->w_tmem;
>  
>   rtm->rtm_pid = curproc->p_p->ps_pid;
> - rtm->rtm_flags = rt->rt_flags;
> + rtm->rtm_flags = RTF_DONE | rt->rt_flags;
>   rtm->rtm_priority = rt->rt_priority & RTP_MASK;
>   rtm_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx);
>   /* Do not account the routing table's reference. */
> 



Re: Fix clearing of sleep timeouts

2022-06-06 Thread Martin Pieuchot
On 06/06/22(Mon) 06:47, David Gwynne wrote:
> On Sun, Jun 05, 2022 at 03:57:39PM +, Visa Hankala wrote:
> > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote:
> > > On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> > > > Encountered the following panic:
> > > > 
> > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" 
> > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373
> > > > Stopped at  db_enter+0x10:  popq%rbp
> > > > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > > >  423109  57118 55 0x3  02  link
> > > >  330695  30276 55 0x3  03  link
> > > > * 46366  85501 55  0x1003  0x40804001  link
> > > >  188803  85501 55  0x1003  0x40820000K link
> > > > db_enter() at db_enter+0x10
> > > > panic(81f25d2b) at panic+0xbf
> > > > __assert(81f9a186,81f372c8,175,81f87c6c) at 
> > > > __assert+0x25
> > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> > > > sleep_setup+0x1d8
> > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> > > > sys_nanosleep+0x12d
> > > > syscall(800022d64f60) at syscall+0x374
> > > > 
> > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> > > > soft-interrupt-driven timeouts could be deleted synchronously without
> > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type
> > > > of the timeout.
> > > > 
> > > > It looks that with small adjustments timeout_del_barrier() can sleep
> > > > in sleep_finish(). The management of run queues is not affected because
> > > > the timeout clearing happens after it. As timeout_del_barrier() does not
> > > > rely on a timeout or signal catching, there should be no risk of
> > > > unbounded recursion or unwanted signal side effects within the sleep
> > > > machinery. In a way, a sleep with a timeout is higher-level than
> > > > one without.
> > > 
> > > I trust you on the analysis.  However this looks very fragile to me.
> > > 
> > > The use of timeout_del_barrier() which can sleep using the global sleep
> > > queue is worrying me.  
> > 
> > I think the queue handling ends in sleep_finish() when SCHED_LOCK()
> > is released. The timeout clearing is done outside of it.
> 
> That's ok.
> 
> > The extra sleeping point inside sleep_finish() is subtle. It should not
> > matter in typical use. But is it permissible with the API? Also, if
> > timeout_del_barrier() sleeps, the thread's priority can change.
> 
> What other options do we have at this point? Spin? Allocate the timeout
> dynamically so sleep_finish doesn't have to wait for it and let the
> handler clean up? How would you stop the timeout handler waking up the
> thread if it's gone back to sleep again for some other reason?
> 
> Sleeping here is the least worst option.

I agree.  I don't think sleeping is bad here.  My concern is about how
sleeping is implemented.  There's a single API built on top of a single
global data structure which now calls itself recursively.  

I'm not sure how much work it would be to make cond_wait(9) use its own
sleep queue...  This is something independent from this fix though.

> As for timeout_del_barrier, if prio is a worry we can provide an
> advanced version of it that lets you pass the prio in. I'd also
> like to change timeout_barrier so it queues the barrier task at the
> head of the pending lists rather than at the tail.

I doubt prio matter.



Re: Fix clearing of sleep timeouts

2022-06-05 Thread Martin Pieuchot
On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> Encountered the following panic:
> 
> panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: 
> file "/usr/src/sys/kern/kern_synch.c", line 373
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>  423109  57118 55 0x3  02  link
>  330695  30276 55 0x3  03  link
> * 46366  85501 55  0x1003  0x40804001  link
>  188803  85501 55  0x1003  0x40820000K link
> db_enter() at db_enter+0x10
> panic(81f25d2b) at panic+0xbf
> __assert(81f9a186,81f372c8,175,81f87c6c) at 
> __assert+0x25
> sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> sleep_setup+0x1d8
> cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> sys_nanosleep+0x12d
> syscall(800022d64f60) at syscall+0x374
> 
> The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> soft-interrupt-driven timeouts could be deleted synchronously without
> blocking. Now, timeout_del_barrier() can sleep regardless of the type
> of the timeout.
> 
> It looks that with small adjustments timeout_del_barrier() can sleep
> in sleep_finish(). The management of run queues is not affected because
> the timeout clearing happens after it. As timeout_del_barrier() does not
> rely on a timeout or signal catching, there should be no risk of
> unbounded recursion or unwanted signal side effects within the sleep
> machinery. In a way, a sleep with a timeout is higher-level than
> one without.

I trust you on the analysis.  However this looks very fragile to me.

The use of timeout_del_barrier() which can sleep using the global sleep
queue is worrying me.  

> Note that endtsleep() can run and set P_TIMEOUT during
> timeout_del_barrier() when the thread is blocked in cond_wait().
> To avoid unnecessary atomic read-modify-write operations, the clearing
> of P_TIMEOUT could be conditional, but maybe that is an unnecessary
> optimization at this point.

I agree this optimization seems unnecessary at the moment.

> While it should be possible to make the code use timeout_del() instead
> of timeout_del_barrier(), the outcome might not be outright better. For
> example, sleep_setup() and endtsleep() would have to coordinate so that
> a late-running timeout from previous sleep cycle would not disturb the
> new cycle.

So that's the price for not having to sleep in sleep_finish(), right?

> To test the barrier path reliably, I made the code call
> timeout_del_barrier() twice in a row. The second call is guaranteed
> to sleep. Of course, this is not part of the patch.

ok mpi@

> Index: kern/kern_synch.c
> ===
> RCS file: src/sys/kern/kern_synch.c,v
> retrieving revision 1.187
> diff -u -p -r1.187 kern_synch.c
> --- kern/kern_synch.c 13 May 2022 15:32:00 -  1.187
> +++ kern/kern_synch.c 5 Jun 2022 05:04:45 -
> @@ -370,8 +370,8 @@ sleep_setup(struct sleep_state *sls, con
>   p->p_slppri = prio & PRIMASK;
>   TAILQ_INSERT_TAIL(&slpque[LOOKUP(ident)], p, p_runq);
>  
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   if (timo) {
> + KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   sls->sls_timeout = 1;
>   timeout_add(&p->p_sleep_to, timo);
>   }
> @@ -432,13 +432,12 @@ sleep_finish(struct sleep_state *sls, in
>  
>   if (sls->sls_timeout) {
>   if (p->p_flag & P_TIMEOUT) {
> - atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   error1 = EWOULDBLOCK;
>   } else {
> - /* This must not sleep. */
> + /* This can sleep. It must not use timeouts. */
>   timeout_del_barrier(&p->p_sleep_to);
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   }
> + atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   }
>  
>   /* Check if thread was woken up because of a unwind or signal */
> 



Re: start unlocking kbind(2)

2022-05-31 Thread Martin Pieuchot
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > +   cookie = SCARG(uap, proc_cookie);
> > +   if (pr->ps_kbind_addr == pc) {
> > +   membar_datadep_consumer();
> > +   if (pr->ps_kbind_cookie != cookie)
> > +   goto sigill;
> > +   } else {
> 
> You must use membar_consumer() here.  membar_datadep_consumer() is
> a barrier between reading a pointer and the data it points to.  Only alpha
> requires membar_datadep_consumer() for that, everywhere else it is
> a NOP.
> 
> > +   mtx_enter(&pr->ps_mtx);
> > +   kpc = pr->ps_kbind_addr;
> 
> Do we need the kpc variable?  I would prefer to read the explicit
> pr->ps_kbind_addr in the two places where we use it.
> 
> I think the logic of barriers and mutexes is correct.
> 
> with the suggestions above OK bluhm@

I believe you should go ahead with the current diff.  ok with me.  Moving
the field under the scope of another lock can be easily done afterward.
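
For reference, the pairing bluhm@ describes looks roughly like this
(only a sketch using the fields from the diff, not the final code):

	/* writer side: publish the cookie before the address */
	pr->ps_kbind_cookie = cookie;
	membar_producer();
	pr->ps_kbind_addr = pc;

	/* reader side: check the address first, then the cookie */
	if (pr->ps_kbind_addr == pc) {
		membar_consumer();	/* order the two loads */
		if (pr->ps_kbind_cookie != cookie)
			sigexit(p, SIGILL);
	}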



Re: ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
On 24/05/22(Tue) 15:24, Mark Kettenis wrote:
> > Date: Tue, 24 May 2022 14:28:39 +0200
> > From: Martin Pieuchot 
> > 
> > The softdep code path is missing a UVM cache invalidation compared to
> > the !softdep one.  This is necessary to flush pages of a persisting
> > vnode.
> > 
> > Since uvm_vnp_setsize() is also called later in this function for the
> > !softdep case move it to not call it twice.
> > 
> > ok?
> 
> I'm not sure this is correct.  I'm trying to understand why you're
> moving the uvm_uvn_setsize() call.  Are you just trying to call it
> twice?  Or are you trying to avoid calling it at all when we end up in
> an error path?
>
> The way you moved it means we'll still call it twice for "partially
> truncated" files with softdeps.  At least the way I understand the
> code is that the code will fsync the vnode and dropping down in the
> "normal" non-softdep code that will call uvm_vnp_setsize() (and
> uvn_vnp_uncache()) again.  So maybe you should move the
> uvm_uvn_setsize() call into the else case?

We might want to do that indeed.  I'm not sure what the implications are
of calling uvm_vnp_setsize/uncache() after VOP_FSYNC(), which might fail.
So I'd rather play it safe and go with that diff.

> > Index: ufs/ffs/ffs_inode.c
> > ===
> > RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
> > retrieving revision 1.81
> > diff -u -p -r1.81 ffs_inode.c
> > --- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
> > +++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
> > @@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
> > if (length > fs->fs_maxfilesize)
> > return (EFBIG);
> >  
> > -   uvm_vnp_setsize(ovp, length);
> > oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
> > = oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
> >  
> > if (DOINGSOFTDEP(ovp)) {
> > +   uvm_vnp_setsize(ovp, length);
> > +   (void) uvm_vnp_uncache(ovp);
> > if (length > 0 || softdep_slowdown(ovp)) {
> > /*
> >  * If a file is only partially truncated, then
> > 
> > 



Please test: rewrite of pdaemon

2022-05-24 Thread Martin Pieuchot
Diff below brings in & adapts most of the changes from NetBSD's r1.37 of
uvm_pdaemon.c.  My motivation for doing this is to untangle the inner
loop of uvmpd_scan_inactive() which will allow us to split the global
`pageqlock' mutex in a next step.

The idea behind this change is to get rid of the too-complex uvm_pager*
abstraction by checking early if a page is going to be flushed or
swapped to disk.  The loop is then clearly divided into two cases which
makes it more readable. 

This also opens the door to a better integration between UVM's vnode
layer and the buffer cache.

The main loop of uvmpd_scan_inactive() can be understood as below (a
rough sketch follows the two cases):

. If a page can be flushed we can call "uvn_flush()" directly and pass the
  PGO_ALLPAGES flag instead of building a cluster beforehand.  Note that,
  in its current form uvn_flush() is synchronous.

. If the page needs to be swapped, mark it as PG_PAGEOUT, build a cluster
  and once it is full call uvm_swap_put(). 
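
In pseudocode, the reworked loop looks roughly like this (a sketch of
the structure only, it does not match the diff line for line):

	for each page p on the inactive list:
		lock p's owner (uobject or anon), skip p if it is busy
		if p is clean:
			reclaim it (or reactivate it if it was referenced)
		else if p is object-backed:
			uvn_flush(uobj, start, end, PGO_ALLPAGES)	/* sync */
		else:						/* swap-backed */
			set PG_PAGEOUT on p and add it to the cluster
			if the cluster is full:
				uvm_swap_put(swslot, swpps, swcpages, flags)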

Please test this diff, do not hesitate to play with the `vm.swapencrypt.enable'
sysctl(2).

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  24 May 2022 12:31:34 -
@@ -143,7 +143,7 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
+int uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +241,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  24 May 2022 12:31:34 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.291
diff -u -p -r1.291 uvm_map.c
--- uvm/uvm_map.c   4 May 2022 14:58:26 -   1.291
+++ uvm/uvm_map.c   24 May 2022 12:31:34 -
@@ -3215,8 +3215,9 @@ uvm_object_printit(struct uvm_object *uo
  * uvm_page_printit: actually print the page
  */
 static const char page_flagbits[] =
-   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5CLEANCHK\6RELEASED\7FAKE\10RDONLY"
-   "\11ZERO\12DEV\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
+   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5PAGEOUT\6RELEASED\7FAKE\10RDONLY"
+   "\11ZERO\12DEV\13CLEANCHK"
+   "\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
"\27ENCRYPT\31PMAP0\32PMAP1\33PMAP2\34PMAP3\35PMAP4\36PMAP5";
 
 void
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  24 May 2022 12:32:54 -
@@ -960,6 +960,7 @@ uvm_pageclean(struct vm_page *pg)
 {
u_int flags_to_clear = 0;
 
+   KASSERT((pg->pg_flags & PG_PAGEOUT) == 0);
if ((pg->pg_flags & (PG_TABLED|PQ_ACTIVE|PQ_INACTIVE)) &&
(pg->uobject == NULL || !UVM_OBJ_IS_PMAP(pg->uobject)))
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
@@ -978,11 +979,14 @@ uvm_pageclean(struct vm_page *pg)
rw_write_held(pg->uanon->an_lock));
 
/*
-* if the page was an object page (and thus "TABLED"), remove it
-* from the object.
+* remove page from its object or anon.
 */
-   if (pg->pg_flags & PG_TABLED)
+   if (pg->pg_flags & PG_TABLED) {
uvm_pageremove(pg);
+   } else if (pg->uanon != NULL) {
+   pg->uanon->an_page = NULL;
+   pg->uanon = NULL;
+   }
 
/*
 * now remove the page from the queues
@@ -996,10 +1000,6 @@ uvm_pageclean(struct vm_page *pg)
pg->wire_count = 0;
uvmexp.wired--;
}
-   if (pg->uanon) {
-   pg->uanon->an_page = NULL;
-   pg->uanon = NULL;
-   }
 
   

ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
The softdep code path is missing a UVM cache invalidation compared to
the !softdep one.  This is necessary to flush pages of a persisting
vnode.

Since uvm_vnp_setsize() is also called later in this function for the
!softdep case, move it so it is not called twice.

ok?

Index: ufs/ffs/ffs_inode.c
===
RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.81
diff -u -p -r1.81 ffs_inode.c
--- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
+++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
@@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
if (length > fs->fs_maxfilesize)
return (EFBIG);
 
-   uvm_vnp_setsize(ovp, length);
oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
= oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
 
if (DOINGSOFTDEP(ovp)) {
+   uvm_vnp_setsize(ovp, length);
+   (void) uvm_vnp_uncache(ovp);
if (length > 0 || softdep_slowdown(ovp)) {
/*
 * If a file is only partially truncated, then



Re: Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-24 Thread Martin Pieuchot
On 17/05/22(Tue) 16:55, Martin Pieuchot wrote:
> nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
> possibly mmap'ed file before calling VOP_RENAME().
> 
> ok?

Anyone?

> Index: nfs/nfs_serv.c
> ===
> RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 nfs_serv.c
> --- nfs/nfs_serv.c11 Mar 2021 13:31:35 -  1.120
> +++ nfs/nfs_serv.c4 May 2022 15:29:06 -
> @@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
>   error = -1;
>  out:
>   if (!error) {
> + if (tvp) {
> + (void)uvm_vnp_uncache(tvp);
> + }
>   error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, _cnd,
>  tond.ni_dvp, tond.ni_vp, _cnd);
>   } else {
> 



Re: start unlocking kbind(2)

2022-05-17 Thread Martin Pieuchot
On 17/05/22(Tue) 10:44, David Gwynne wrote:
> this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> argument checks can be done without the kernel lock.
> 
> care is taken to allow the pc/cookie checks to run without any lock by
> being careful with the order of the checks. all modifications to the
> pc/cookie state are serialised by the per process mutex.

I don't understand why it is safe to do the following check without
holding a mutex:

if (pr->ps_kbind_addr == pc)
...

Is there much difference when always grabbing the per-process mutex?

> i dont know enough about uvm to say whether it is safe to unlock the
> actual memory updates too, but even if i was confident i would still
> prefer to change it as a separate step.

I agree.

> Index: kern/init_sysent.c
> ===
> RCS file: /cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.236
> diff -u -p -r1.236 init_sysent.c
> --- kern/init_sysent.c1 May 2022 23:00:04 -   1.236
> +++ kern/init_sysent.c17 May 2022 00:36:03 -
> @@ -1,4 +1,4 @@
> -/*   $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
> @@ -204,7 +204,7 @@ const struct sysent sysent[] = {
>   sys_utimensat },/* 84 = utimensat */
>   { 2, s(struct sys_futimens_args), 0,
>   sys_futimens }, /* 85 = futimens */
> - { 3, s(struct sys_kbind_args), 0,
> + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
>   sys_kbind },/* 86 = kbind */
>   { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
>   sys_clock_gettime },/* 87 = clock_gettime */
> Index: kern/syscalls.master
> ===
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.223
> diff -u -p -r1.223 syscalls.master
> --- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
> +++ kern/syscalls.master  17 May 2022 00:36:03 -
> @@ -194,7 +194,7 @@
>   const struct timespec *times, int flag); }
>  85   STD { int sys_futimens(int fd, \
>   const struct timespec *times); }
> -86   STD { int sys_kbind(const struct __kbind *param, \
> +86   STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
>   size_t psize, int64_t proc_cookie); }
>  87   STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
>   struct timespec *tp); }
> Index: uvm/uvm_mmap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> retrieving revision 1.169
> diff -u -p -r1.169 uvm_mmap.c
> --- uvm/uvm_mmap.c19 Jan 2022 10:43:48 -  1.169
> +++ uvm/uvm_mmap.c17 May 2022 00:36:03 -
> @@ -70,6 +70,7 @@
>  #include 
>  #include   /* for KBIND* */
>  #include 
> +#include 
>  
>  #include /* for __LDPGSZ */
>  
> @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
>   const char *data;
>   vaddr_t baseva, last_baseva, endva, pageoffset, kva;
>   size_t psize, s;
> - u_long pc;
> + u_long pc, kpc;
>   int count, i, extra;
> + uint64_t cookie;
>   int error;
>  
>   /*
>* extract syscall args from uap
>*/
>   paramp = SCARG(uap, param);
> - psize = SCARG(uap, psize);
>  
>   /* a NULL paramp disables the syscall for the process */
>   if (paramp == NULL) {
> + mtx_enter(&pr->ps_mtx);
>   if (pr->ps_kbind_addr != 0)
> - sigexit(p, SIGILL);
> + goto leave_sigill;
>   pr->ps_kbind_addr = BOGO_PC;
> + mtx_leave(&pr->ps_mtx);
>   return 0;
>   }
>  
>   /* security checks */
> +
> + /*
> +  * ps_kbind_addr can only be set to 0 or BOGO_PC by the
> +  * kernel, not by a call from userland.
> +  */
>   pc = PROC_PC(p);
> - if (pr->ps_kbind_addr == 0) {
> - pr->ps_kbind_addr = pc;
> - pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
> - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
> - sigexit(p, SIGILL);
> - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
> - sigexit(p, SIGILL);
> + if (pc == 0 || pc == BOGO_PC)
> + goto sigill;
> +
> + cookie = SCARG(uap, proc_cookie);
> + if (pr->ps_kbind_addr == pc) {
> + membar_datadep_consumer();
> + if (pr->ps_kbind_cookie != cookie)
> + goto sigill;
> + } else {
> + mtx_enter(&pr->ps_mtx);
> + kpc = pr->ps_kbind_addr;
> +
> + /*
> +  * If we're the first thread in (kpc is 0), then
> +  * 

Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-17 Thread Martin Pieuchot
nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
possibly mmap'ed file before calling VOP_RENAME().

ok?

Index: nfs/nfs_serv.c
===
RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
retrieving revision 1.120
diff -u -p -r1.120 nfs_serv.c
--- nfs/nfs_serv.c  11 Mar 2021 13:31:35 -  1.120
+++ nfs/nfs_serv.c  4 May 2022 15:29:06 -
@@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
error = -1;
 out:
if (!error) {
+   if (tvp) {
+   (void)uvm_vnp_uncache(tvp);
+   }
error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
   tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
} else {



Re: uvm_pagedequeue()

2022-05-12 Thread Martin Pieuchot
On 10/05/22(Tue) 20:23, Mark Kettenis wrote:
> > Date: Tue, 10 May 2022 18:45:21 +0200
> > From: Martin Pieuchot 
> > 
> > On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> > > Diff below introduces a new wrapper to manipulate active/inactive page
> > > queues. 
> > > 
> > > ok?
> > 
> > Anyone?
> 
> Sorry I started looking at this and got distracted.
> 
> I'm not sure about the changes to uvm_pageactivate().  It doesn't
> quite match what NetBSD does, but I guess NetBSD assumes that
> uvm_pageactiave() isn't called for a page that is already active?  And
> that's something we can't guarantee?

It is based on what NetBSD did 15 years ago.  We're not ready to synchronize
with NetBSD -current yet. 

We're getting there!

> The diff is correct though in the sense that it is equivalent to the
> code we already have.  So if this definitely is the direction you want
> to go:
> 
> ok kettenis@
> 
> > > Index: uvm/uvm_page.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> > > retrieving revision 1.165
> > > diff -u -p -r1.165 uvm_page.c
> > > --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> > > +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> > > @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
> > >   /*
> > >* now remove the page from the queues
> > >*/
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(_active, pg, pageq);
> > > - flags_to_clear |= PQ_ACTIVE;
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(_inactive, pg, pageq);
> > > - flags_to_clear |= PQ_INACTIVE;
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >  
> > >   /*
> > >* if the page was wired, unwire it now.
> > > @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
> > >   MUTEX_ASSERT_LOCKED();
> > >  
> > >   if (pg->wire_count == 0) {
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(_active, pg, pageq);
> > > - atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(_inactive, pg, pageq);
> > > - atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >   uvmexp.wired++;
> > >   }
> > >   pg->wire_count++;
> > > @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
> > >   KASSERT(uvm_page_owner_locked_p(pg));
> > >   MUTEX_ASSERT_LOCKED();
> > >  
> > > + uvm_pagedequeue(pg);
> > > + if (pg->wire_count == 0) {
> > > + TAILQ_INSERT_TAIL(_active, pg, pageq);
> > > + atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> > > + uvmexp.active++;
> > > +
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * uvm_pagedequeue: remove a page from any paging queue
> > > + */
> > > +void
> > > +uvm_pagedequeue(struct vm_page *pg)
> > > +{
> > > + if (pg->pg_flags & PQ_ACTIVE) {
> > > + TAILQ_REMOVE(_active, pg, pageq);
> > > + atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> > > + uvmexp.active--;
> > > + }
> > >   if (pg->pg_flags & PQ_INACTIVE) {
> > >   TAILQ_REMOVE(_inactive, pg, pageq);
> > >   atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
> > >   uvmexp.inactive--;
> > >   }
> > > - if (pg->wire_count == 0) {
> > > - /*
> > > -  * if page is already active, remove it from list so we
> > > -  * can put it at tail.  if it wasn't active, then mark
> > > -  * it active and bump active count
> > > -  */
> > > - if (pg->pg_flags & PQ_ACTIVE)
> > > - TAILQ_REMOVE(_active, pg, pageq);
> > > - else {
> > > - atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> > > - uvmexp.active++;
> > > - }
> > > -
> > > - TAILQ_INSERT_TAIL(_active, pg, pageq);
> > > - }
> > >  }
> > > -
> > >  /*
> > >   * uvm_pagezero: zero fill a page
> > >   */
> > > Index: uvm/uvm_page.h
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> > > retrieving revision 1.67
> > > diff -u -p -r1.67 uvm_page.h
> > > --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> > > +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> > > @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
> > >  #endif
> > >  
> > >  void uvm_pageactivate(struct vm_page *);
> > > +void uvm_pagedequeue(struct vm_page *);
> > >  vaddr_t  uvm_pageboot_alloc(vsize_t);
> > >  void uvm_pagecopy(struct vm_page *, struct vm_page *);
> > >  void uvm_pagedeactivate(struct vm_page *);
> > > 
> > 
> > 



Re: uvm_pagedequeue()

2022-05-10 Thread Martin Pieuchot
On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> Diff below introduces a new wrapper to manipulate active/inactive page
> queues. 
> 
> ok?

Anyone?

> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 uvm_page.c
> --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
>   /*
>* now remove the page from the queues
>*/
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(_active, pg, pageq);
> - flags_to_clear |= PQ_ACTIVE;
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(_inactive, pg, pageq);
> - flags_to_clear |= PQ_INACTIVE;
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>  
>   /*
>* if the page was wired, unwire it now.
> @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
>   MUTEX_ASSERT_LOCKED();
>  
>   if (pg->wire_count == 0) {
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(_active, pg, pageq);
> - atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(_inactive, pg, pageq);
> - atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>   uvmexp.wired++;
>   }
>   pg->wire_count++;
> @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
>   KASSERT(uvm_page_owner_locked_p(pg));
>   MUTEX_ASSERT_LOCKED();
>  
> + uvm_pagedequeue(pg);
> + if (pg->wire_count == 0) {
> + TAILQ_INSERT_TAIL(_active, pg, pageq);
> + atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> + uvmexp.active++;
> +
> + }
> +}
> +
> +/*
> + * uvm_pagedequeue: remove a page from any paging queue
> + */
> +void
> +uvm_pagedequeue(struct vm_page *pg)
> +{
> + if (pg->pg_flags & PQ_ACTIVE) {
> + TAILQ_REMOVE(_active, pg, pageq);
> + atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> + uvmexp.active--;
> + }
>   if (pg->pg_flags & PQ_INACTIVE) {
>   TAILQ_REMOVE(_inactive, pg, pageq);
>   atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
>   uvmexp.inactive--;
>   }
> - if (pg->wire_count == 0) {
> - /*
> -  * if page is already active, remove it from list so we
> -  * can put it at tail.  if it wasn't active, then mark
> -  * it active and bump active count
> -  */
> - if (pg->pg_flags & PQ_ACTIVE)
> - TAILQ_REMOVE(_active, pg, pageq);
> - else {
> - atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> - uvmexp.active++;
> - }
> -
> - TAILQ_INSERT_TAIL(_active, pg, pageq);
> - }
>  }
> -
>  /*
>   * uvm_pagezero: zero fill a page
>   */
> Index: uvm/uvm_page.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> retrieving revision 1.67
> diff -u -p -r1.67 uvm_page.h
> --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
>  #endif
>  
>  void uvm_pageactivate(struct vm_page *);
> +void uvm_pagedequeue(struct vm_page *);
>  vaddr_t  uvm_pageboot_alloc(vsize_t);
>  void uvm_pagecopy(struct vm_page *, struct vm_page *);
>  void uvm_pagedeactivate(struct vm_page *);
> 



Re: uvm: Consider BUFPAGES_DEFICIT in swap_shortage

2022-05-09 Thread Martin Pieuchot
On 05/05/22(Thu) 10:56, Bob Beck wrote:
> On Thu, May 05, 2022 at 10:16:23AM -0600, Bob Beck wrote:
> > Ugh. You're digging in the most perilous parts of the pile.
> > 
> > I will go look with you... sigh. (This is not yet an ok for that.)
> > 
> > > On May 5, 2022, at 7:53 AM, Martin Pieuchot  wrote:
> > > 
> > > When considering the amount of free pages in the page daemon a small
> > > amount is always kept for the buffer cache... except in one place.
> > > 
> > > The diff below gets rid of this exception.  This is important because
> > > uvmpd_scan() is called conditionally using the following check:
> > > 
> > >  if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
> > >...
> > >  }
> > > 
> > > So in case of swap shortage we might end up freeing fewer pages than
> > > wanted.
> 
> So a bit of background.
> 
> I am pretty much of the belief that this "low water mark" for pages is 
> nonsense now.  I was in the midst of trying to prove that
> to myself and therefore rip down some of the crazy accounting and
> very arbitrary limits in the buffer cache and got distracted.
> 
> Maybe something like this to start? (but failing that I think
> your current diff is probably ok).

Thanks.  I'll commit my diff then to make the current code coherent and
let me progress with my refactoring.  Then we can consider changing this
magic.

> Index: sys/sys/mount.h
> ===
> RCS file: /cvs/src/sys/sys/mount.h,v
> retrieving revision 1.148
> diff -u -p -u -p -r1.148 mount.h
> --- sys/sys/mount.h   6 Apr 2021 14:17:35 -   1.148
> +++ sys/sys/mount.h   5 May 2022 16:50:50 -
> @@ -488,10 +488,8 @@ struct bcachestats {
>  #ifdef _KERNEL
>  extern struct bcachestats bcstats;
>  extern long buflowpages, bufhighpages, bufbackpages;
> -#define BUFPAGES_DEFICIT (((buflowpages - bcstats.numbufpages) < 0) ? 0 \
> -: buflowpages - bcstats.numbufpages)
> -#define BUFPAGES_INACT (((bcstats.numcleanpages - buflowpages) < 0) ? 0 \
> -: bcstats.numcleanpages - buflowpages)
> +#define BUFPAGES_DEFICIT 0
> +#define BUFPAGES_INACT bcstats.numcleanpages
>  extern int bufcachepercent;
>  extern void bufadjust(int);
>  struct uvm_constraint_range;
> 
> 
> > > 
> > > ok?
> > > 
> > > Index: uvm/uvm_pdaemon.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> > > retrieving revision 1.98
> > > diff -u -p -r1.98 uvm_pdaemon.c
> > > --- uvm/uvm_pdaemon.c 4 May 2022 14:58:26 -   1.98
> > > +++ uvm/uvm_pdaemon.c 5 May 2022 13:40:28 -
> > > @@ -923,12 +923,13 @@ uvmpd_scan(void)
> > >* detect if we're not going to be able to page anything out
> > >* until we free some swap resources from active pages.
> > >*/
> > > + free = uvmexp.free - BUFPAGES_DEFICIT;
> > >   swap_shortage = 0;
> > > - if (uvmexp.free < uvmexp.freetarg &&
> > > + if (free < uvmexp.freetarg &&
> > >   uvmexp.swpginuse == uvmexp.swpages &&
> > >   !uvm_swapisfull() &&
> > >   pages_freed == 0) {
> > > - swap_shortage = uvmexp.freetarg - uvmexp.free;
> > > + swap_shortage = uvmexp.freetarg - free;
> > >   }
> > > 
> > >   for (p = TAILQ_FIRST(_active);
> > > 
> > 



uvm: Consider BUFPAGES_DEFICIT in swap_shortage

2022-05-05 Thread Martin Pieuchot
When considering the amount of free pages in the page daemon a small
amount is always kept for the buffer cache... except in one place.

The diff below gets rid of this exception.  This is important because
uvmpd_scan() is called conditionally using the following check:
  
  if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
...
  }

So in case of swap shortage we might end up freeing fewer pages than
wanted.
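
For illustration, assume uvmexp.freetarg is 512, uvmexp.free is 600 and
BUFPAGES_DEFICIT is 100: the daemon is woken up because 600 - 100 < 512,
yet the current code computes no swap_shortage since 600 < 512 is false.
With the diff below, free is 500 and swap_shortage becomes 12 pages.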

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.98
diff -u -p -r1.98 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   4 May 2022 14:58:26 -   1.98
+++ uvm/uvm_pdaemon.c   5 May 2022 13:40:28 -
@@ -923,12 +923,13 @@ uvmpd_scan(void)
 * detect if we're not going to be able to page anything out
 * until we free some swap resources from active pages.
 */
+   free = uvmexp.free - BUFPAGES_DEFICIT;
swap_shortage = 0;
-   if (uvmexp.free < uvmexp.freetarg &&
+   if (free < uvmexp.freetarg &&
uvmexp.swpginuse == uvmexp.swpages &&
!uvm_swapisfull() &&
pages_freed == 0) {
-   swap_shortage = uvmexp.freetarg - uvmexp.free;
+   swap_shortage = uvmexp.freetarg - free;
}
 
for (p = TAILQ_FIRST(_active);



uvm_pagedequeue()

2022-05-05 Thread Martin Pieuchot
Diff below introduces a new wrapper to manipulate active/inactive page
queues. 

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.165
diff -u -p -r1.165 uvm_page.c
--- uvm/uvm_page.c  4 May 2022 14:58:26 -   1.165
+++ uvm/uvm_page.c  5 May 2022 12:49:13 -
@@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
/*
 * now remove the page from the queues
 */
-   if (pg->pg_flags & PQ_ACTIVE) {
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   flags_to_clear |= PQ_ACTIVE;
-   uvmexp.active--;
-   }
-   if (pg->pg_flags & PQ_INACTIVE) {
-   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
-   flags_to_clear |= PQ_INACTIVE;
-   uvmexp.inactive--;
-   }
+   uvm_pagedequeue(pg);
 
/*
 * if the page was wired, unwire it now.
@@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
if (pg->wire_count == 0) {
-   if (pg->pg_flags & PQ_ACTIVE) {
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
-   uvmexp.active--;
-   }
-   if (pg->pg_flags & PQ_INACTIVE) {
-   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
-   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
-   uvmexp.inactive--;
-   }
+   uvm_pagedequeue(pg);
uvmexp.wired++;
}
pg->wire_count++;
@@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
KASSERT(uvm_page_owner_locked_p(pg));
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
+   uvm_pagedequeue(pg);
+   if (pg->wire_count == 0) {
+   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
+   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvmexp.active++;
+
+   }
+}
+
+/*
+ * uvm_pagedequeue: remove a page from any paging queue
+ */
+void
+uvm_pagedequeue(struct vm_page *pg)
+{
+   if (pg->pg_flags & PQ_ACTIVE) {
+   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
+   atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvmexp.active--;
+   }
if (pg->pg_flags & PQ_INACTIVE) {
TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
-   if (pg->wire_count == 0) {
-   /*
-* if page is already active, remove it from list so we
-* can put it at tail.  if it wasn't active, then mark
-* it active and bump active count
-*/
-   if (pg->pg_flags & PQ_ACTIVE)
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   else {
-   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
-   uvmexp.active++;
-   }
-
-   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-   }
 }
-
 /*
  * uvm_pagezero: zero fill a page
  */
Index: uvm/uvm_page.h
===
RCS file: /cvs/src/sys/uvm/uvm_page.h,v
retrieving revision 1.67
diff -u -p -r1.67 uvm_page.h
--- uvm/uvm_page.h  29 Jan 2022 06:25:33 -  1.67
+++ uvm/uvm_page.h  5 May 2022 12:49:13 -
@@ -224,6 +224,7 @@ boolean_t   uvm_page_physget(paddr_t *);
 #endif
 
 void   uvm_pageactivate(struct vm_page *);
+void   uvm_pagedequeue(struct vm_page *);
 vaddr_tuvm_pageboot_alloc(vsize_t);
 void   uvm_pagecopy(struct vm_page *, struct vm_page *);
 void   uvm_pagedeactivate(struct vm_page *);



Merge swap-backed and object-backed inactive lists

2022-05-02 Thread Martin Pieuchot
Let's simplify the existing logic and use a single list for inactive
pages.  uvmpd_scan_inactive() already does a lot of checks if it finds
a page that is swap-backed.  This will be improved in a later change.
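
With a single list the scan can still tell the two cases apart by
looking at the page itself, e.g. with (pg->pg_flags & PQ_SWAPBACKED),
which covers both anon and aobj pages.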

ok?

Index: uvm/uvm.h
===
RCS file: /cvs/src/sys/uvm/uvm.h,v
retrieving revision 1.68
diff -u -p -r1.68 uvm.h
--- uvm/uvm.h   24 Nov 2020 13:49:09 -  1.68
+++ uvm/uvm.h   2 May 2022 16:32:16 -
@@ -53,8 +53,7 @@ struct uvm {
 
/* vm_page queues */
struct pglist page_active;  /* [Q] allocated pages, in use */
-   struct pglist page_inactive_swp;/* [Q] pages inactive (reclaim/free) */
-   struct pglist page_inactive_obj;/* [Q] pages inactive (reclaim/free) */
+   struct pglist page_inactive;/* [Q] pages inactive (reclaim/free) */
/* Lock order: pageqlock, then fpageqlock. */
struct mutex pageqlock; /* [] lock for active/inactive page q */
struct mutex fpageqlock;/* [] lock for free page q  + pdaemon */
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.290
diff -u -p -r1.290 uvm_map.c
--- uvm/uvm_map.c   12 Mar 2022 08:11:07 -  1.290
+++ uvm/uvm_map.c   2 May 2022 16:32:16 -
@@ -3281,8 +3281,7 @@ uvm_page_printit(struct vm_page *pg, boo
(*pr)("  >>> page not found in uvm_pmemrange <<<\n");
pgl = NULL;
} else if (pg->pg_flags & PQ_INACTIVE) {
-   pgl = (pg->pg_flags & PQ_SWAPBACKED) ?
-   &uvm.page_inactive_swp : &uvm.page_inactive_obj;
+   pgl = &uvm.page_inactive;
} else if (pg->pg_flags & PQ_ACTIVE) {
pgl = &uvm.page_active;
} else {
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.164
diff -u -p -r1.164 uvm_page.c
--- uvm/uvm_page.c  28 Apr 2022 09:59:28 -  1.164
+++ uvm/uvm_page.c  2 May 2022 16:32:16 -
@@ -185,8 +185,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 */
 
TAILQ_INIT(&uvm.page_active);
-   TAILQ_INIT(&uvm.page_inactive_swp);
-   TAILQ_INIT(&uvm.page_inactive_obj);
+   TAILQ_INIT(&uvm.page_inactive);
mtx_init(&uvm.pageqlock, IPL_VM);
mtx_init(&uvm.fpageqlock, IPL_VM);
uvm_pmr_init();
@@ -994,10 +993,7 @@ uvm_pageclean(struct vm_page *pg)
uvmexp.active--;
}
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
flags_to_clear |= PQ_INACTIVE;
uvmexp.inactive--;
}
@@ -1253,10 +1249,7 @@ uvm_pagewire(struct vm_page *pg)
uvmexp.active--;
}
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
@@ -1304,10 +1297,7 @@ uvm_pagedeactivate(struct vm_page *pg)
}
if ((pg->pg_flags & PQ_INACTIVE) == 0) {
KASSERT(pg->wire_count == 0);
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_INSERT_TAIL(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_INSERT_TAIL(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_INSERT_TAIL(&uvm.page_inactive, pg, pageq);
atomic_setbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive++;
pmap_clear_reference(pg);
@@ -1335,10 +1325,7 @@ uvm_pageactivate(struct vm_page *pg)
MUTEX_ASSERT_LOCKED();
 
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.97
diff -u -p -r1.97 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   30 Apr 2022 17:58:43 -  1.97
+++ uvm/uvm_pdaemon.c   2 May 2022 16:32:16 -
@@ -396,13 +396,6 @@ uvmpd_scan_inactive(struct pglist *pglst
int dirtyreacts;
 
/*
-* note: we currently keep swap-backed pages on a 

uvmpd_scan(): Recheck PG_BUSY after locking the page

2022-04-28 Thread Martin Pieuchot
rw_enter(9) can sleep.  When the lock is finally acquired by the
pagedaemon the previous check might no longer be true and the page
could be busy.  In this case we shouldn't touch it.

Diff below rechecks PG_BUSY after acquiring the lock and also
uses a variable for the lock to reduce the differences with NetBSD.

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.96
diff -u -p -r1.96 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   11 Apr 2022 16:43:49 -  1.96
+++ uvm/uvm_pdaemon.c   28 Apr 2022 10:22:52 -
@@ -879,6 +879,8 @@ uvmpd_scan(void)
int free, inactive_shortage, swap_shortage, pages_freed;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
+   struct vm_anon *anon;
+   struct rwlock *slock;
boolean_t got_it;
 
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
@@ -947,20 +949,34 @@ uvmpd_scan(void)
 p != NULL && (inactive_shortage > 0 || swap_shortage > 0);
 p = nextpg) {
nextpg = TAILQ_NEXT(p, pageq);
-
-   /* skip this page if it's busy. */
-   if (p->pg_flags & PG_BUSY)
+   if (p->pg_flags & PG_BUSY) {
continue;
+   }
 
-   if (p->pg_flags & PQ_ANON) {
-   KASSERT(p->uanon != NULL);
-   if (rw_enter(p->uanon->an_lock, RW_WRITE|RW_NOSLEEP))
+   /*
+* lock the page's owner.
+*/
+   if (p->uobject != NULL) {
+   uobj = p->uobject;
+   slock = uobj->vmobjlock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
continue;
+   }
} else {
-   KASSERT(p->uobject != NULL);
-   if (rw_enter(p->uobject->vmobjlock,
-   RW_WRITE|RW_NOSLEEP))
+   anon = p->uanon;
+   KASSERT(p->uanon != NULL);
+   slock = anon->an_lock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
continue;
+   }
+   }
+
+   /*
+* skip this page if it's busy.
+*/
+   if ((p->pg_flags & PG_BUSY) != 0) {
+   rw_exit(slock);
+   continue;
}
 
/*
@@ -997,10 +1013,11 @@ uvmpd_scan(void)
uvmexp.pddeact++;
inactive_shortage--;
}
-   if (p->pg_flags & PQ_ANON)
-   rw_exit(p->uanon->an_lock);
-   else
-   rw_exit(p->uobject->vmobjlock);
+
+   /*
+* we're done with this page.
+*/
+   rw_exit(slock);
}
 }
 



Call uvm_pageactivate() from uvm_pageunwire()

2022-04-26 Thread Martin Pieuchot
I'd like to use a proper interface to add/remove pages on the
active/inactive queues.  This will help with lock assertions and help
improve the existing LRU limitations.

Diff below makes uvm_pageunwire() call uvm_pageactivate() instead of
inserting the page itself.

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_page.c
--- uvm/uvm_page.c  12 Mar 2022 12:34:22 -  1.163
+++ uvm/uvm_page.c  26 Apr 2022 12:13:59 -
@@ -1279,9 +1279,7 @@ uvm_pageunwire(struct vm_page *pg)
 
pg->wire_count--;
if (pg->wire_count == 0) {
-   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-   uvmexp.active++;
-   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvm_pageactivate(pg);
uvmexp.wired--;
}
 }



Decrement uvmexp.swpgonly

2022-04-26 Thread Martin Pieuchot
Small diff to decrement the counter only if the I/O succeeds.  This
prevents a false positive if a check is performed before an error is
returned.
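
For example, uvm_swapisfull() compares `uvmexp.swpgonly' against
`uvmexp.swpages'; a caller running that check between the early
decrement and the error return would see one swap-only page too few
even though the page is, in fact, still only in swap.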

ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_swap.c
--- uvm/uvm_swap.c  17 Mar 2022 10:15:13 -  1.154
+++ uvm/uvm_swap.c  26 Apr 2022 12:04:52 -
@@ -1572,17 +1572,16 @@ uvm_swap_get(struct vm_page *page, int s
}
 
KERNEL_LOCK();
-   /* this page is (about to be) no longer only in swap. */
-   atomic_dec_int(&uvmexp.swpgonly);
-
result = uvm_swap_io(&page, swslot, 1, B_READ |
((flags & PGO_SYNCIO) ? 0 : B_ASYNC));
+   KERNEL_UNLOCK();
 
-   if (result != VM_PAGER_OK && result != VM_PAGER_PEND) {
-   /* oops, the read failed so it really is still only in swap. */
-   atomic_inc_int(&uvmexp.swpgonly);
+   if (result == VM_PAGER_OK || result == VM_PAGER_PEND) {
+   /*
+* this page is no longer only in swap.
+*/
+   atomic_dec_int(&uvmexp.swpgonly);
}
-   KERNEL_UNLOCK();
return (result);
 }
 



Re: refcount btrace

2022-04-11 Thread Martin Pieuchot
On 08/04/22(Fri) 12:16, Alexander Bluhm wrote:
> On Fri, Apr 08, 2022 at 02:39:34AM +, Visa Hankala wrote:
> > On Thu, Apr 07, 2022 at 07:55:11PM +0200, Alexander Bluhm wrote:
> > > On Wed, Mar 23, 2022 at 06:13:27PM +0100, Alexander Bluhm wrote:
> > > > In my opinion tracepoints give insight at minimal cost.  It is worth
> > > > it to have it in GENERIC to make it easy to use.
> > > 
> > > After release I want to revive the btrace of refcounts discussion.
> > > 
> > > As mpi@ mentioned the idea of dt(4) is to have these trace points
> > > in GENERIC kernel.  If you want to hunt a bug, just turn it on.
> > > Refounting is a common place for bugs, leaks can be detected easily.
> > > 
> > > The alternative are some defines that you can compile in and access
> > > from ddb.  This is more work and you would have to implement it for
> > > every recount.
> > > https://marc.info/?l=openbsd-tech=163786435916039=2
> > > 
> > > There is no measuarable performance difference.  dt(4) is written
> > > in a way that is is only one additional branch.  At least my goal
> > > is to add trace points to useful places when we identify them.
> > 
> > DT_INDEX_ENTER() still checks the index first, so it has two branches
> > in practice.
> > 
> > I think dt_tracing should be checked first so that it serves as
> > a gateway to the trace code. Under normal operation, the check's
> > outcome is always the same, which is easy even for simple branch
> > predictors.
> 
> Reordering the check is easy.  Now dt_tracing is first.
> 
> > I have a slight suspicion that dt(4) is now becoming a way to add code
> > that would be otherwise unacceptable. Also, how "durable" are trace
> > points perceived? Is an added trace point an achieved advantage that
> > is difficult to remove even when its utility has diminished? There is
> > a similarity to (ad hoc) debug printfs too.
> 
> As I understand dt(4) it is a replacement for debug printfs.  But
> it has advantages.  It can be turned on selectively from userland.
> It does not spam the console, but can be processed in userland.  It
> is always there, you don't have to recompile.
> 
> Of course you always have the printf or tracepoint at the wrong
> place.  I think people debugging the code should move them to
> the useful places.  Then we may end up with a generally useful tool.
> At least that is my hope.
> 
> There are obvious places to debug.  We have syscall entry and return.
> And I think reference counting is also generally interesting.

I'm happy if this can help debugging real reference counting issues.  Do
you have a script that could be committed to /usr/share/btrace to show
how to track reference counting using these probes?

> Index: dev/dt/dt_prov_static.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dt_prov_static.c,v
> retrieving revision 1.13
> diff -u -p -r1.13 dt_prov_static.c
> --- dev/dt/dt_prov_static.c   17 Mar 2022 14:53:59 -  1.13
> +++ dev/dt/dt_prov_static.c   8 Apr 2022 09:40:29 -
> @@ -87,6 +87,12 @@ DT_STATIC_PROBE1(smr, barrier_exit, "int
>  DT_STATIC_PROBE0(smr, wakeup);
>  DT_STATIC_PROBE2(smr, thread, "uint64_t", "uint64_t");
>  
> +/*
> + * reference counting
> + */
> +DT_STATIC_PROBE0(refcnt, none);
> +DT_STATIC_PROBE3(refcnt, inpcb, "void *", "int", "int");
> +DT_STATIC_PROBE3(refcnt, tdb, "void *", "int", "int");
>  
>  /*
>   * List of all static probes
> @@ -127,15 +133,24 @@ struct dt_probe *const dtps_static[] = {
>   &_DT_STATIC_P(smr, barrier_exit),
>   &_DT_STATIC_P(smr, wakeup),
>   &_DT_STATIC_P(smr, thread),
> + /* refcnt */
> + &_DT_STATIC_P(refcnt, none),
> + &_DT_STATIC_P(refcnt, inpcb),
> + &_DT_STATIC_P(refcnt, tdb),
>  };
>  
> +struct dt_probe *const *dtps_index_refcnt;
> +
>  int
>  dt_prov_static_init(void)
>  {
>   int i;
>  
> - for (i = 0; i < nitems(dtps_static); i++)
> + for (i = 0; i < nitems(dtps_static); i++) {
> + if (dtps_static[i] == &_DT_STATIC_P(refcnt, none))
> + dtps_index_refcnt = &dtps_static[i];
>   dt_dev_register_probe(dtps_static[i]);
> + }
>  
>   return i;
>  }
> Index: dev/dt/dtvar.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dtvar.h,v
> retrieving revision 1.13
> diff -u -p -r1.13 dtvar.h
> --- dev/dt/dtvar.h27 Feb 2022 10:14:01 -  1.13
> +++ dev/dt/dtvar.h8 Apr 2022 09:42:19 -
> @@ -313,11 +313,30 @@ extern volatile uint32_tdt_tracing; /* 
>  #define  DT_STATIC_ENTER(func, name, args...) do {   
> \
>   extern struct dt_probe _DT_STATIC_P(func, name);\
>   struct dt_probe *dtp = &_DT_STATIC_P(func, name);   \
> - struct dt_provider *dtpv = dtp->dtp_prov;   \
>   \
>   if 

Kill selrecord()

2022-04-11 Thread Martin Pieuchot
Now that poll(2) & select(2) use the kqueue backend under the hood we
can start retiring the old machinery. 

The diff below does not touch driver definitions, however it:

- kills selrecord() & doselwakeup()

- makes it obvious that `kern.nselcoll' is now always 0

- changes all poll/select hooks to return 0

- kills a seltid check in wsdisplaystart() which is now always true

In a later step we could remove the *_poll() requirement from device
drivers and kill seltrue() & selfalse().
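
For drivers that want real events rather than seltrue()-style stubs, the
replacement for selrecord()/selwakeup() is a klist plus a kqueue filter.
A minimal sketch with made-up names (my_softc, sc_klist, sc_ready and
my_lookup() are illustrative, not taken from any driver):

	void
	filt_mydetach(struct knote *kn)
	{
		struct my_softc *sc = kn->kn_hook;

		klist_remove(&sc->sc_klist, kn);
	}

	int
	filt_myread(struct knote *kn, long hint)
	{
		struct my_softc *sc = kn->kn_hook;

		kn->kn_data = sc->sc_ready;	/* whatever "readable" means */
		return (kn->kn_data > 0);
	}

	const struct filterops myread_filtops = {
		.f_flags	= FILTEROP_ISFD,
		.f_attach	= NULL,
		.f_detach	= filt_mydetach,
		.f_event	= filt_myread,
	};

	int
	mykqfilter(dev_t dev, struct knote *kn)
	{
		struct my_softc *sc = my_lookup(dev);

		switch (kn->kn_filter) {
		case EVFILT_READ:
			kn->kn_fop = &myread_filtops;
			break;
		default:
			return (EINVAL);
		}
		kn->kn_hook = sc;
		klist_insert(&sc->sc_klist, kn);
		return (0);
	}

The code that used to call selwakeup() then does KNOTE(&sc->sc_klist, 0)
instead.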

ok?

Index: arch/sparc64/dev/vldcp.c
===
RCS file: /cvs/src/sys/arch/sparc64/dev/vldcp.c,v
retrieving revision 1.22
diff -u -p -r1.22 vldcp.c
--- arch/sparc64/dev/vldcp.c24 Oct 2021 17:05:04 -  1.22
+++ arch/sparc64/dev/vldcp.c11 Apr 2022 16:38:32 -
@@ -581,44 +581,7 @@ vldcpioctl(dev_t dev, u_long cmd, caddr_
 int
 vldcppoll(dev_t dev, int events, struct proc *p)
 {
-   struct vldcp_softc *sc;
-   struct ldc_conn *lc;
-   uint64_t head, tail, state;
-   int revents = 0;
-   int s, err;
-
-   sc = vldcp_lookup(dev);
-   if (sc == NULL)
-   return (POLLERR);
-   lc = &sc->sc_lc;
-
-   s = spltty();
-   if (events & (POLLIN | POLLRDNORM)) {
-   err = hv_ldc_rx_get_state(lc->lc_id, &head, &tail, &state);
-
-   if (err == 0 && state == LDC_CHANNEL_UP && head != tail)
-   revents |= events & (POLLIN | POLLRDNORM);
-   }
-   if (events & (POLLOUT | POLLWRNORM)) {
-   err = hv_ldc_tx_get_state(lc->lc_id, &head, &tail, &state);
-
-   if (err == 0 && state == LDC_CHANNEL_UP && head != tail)
-   revents |= events & (POLLOUT | POLLWRNORM);
-   }
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM)) {
-   cbus_intr_setenabled(sc->sc_bustag, sc->sc_rx_ino,
-   INTR_ENABLED);
-   selrecord(p, &sc->sc_rsel);
-   }
-   if (events & (POLLOUT | POLLWRNORM)) {
-   cbus_intr_setenabled(sc->sc_bustag, sc->sc_tx_ino,
-   INTR_ENABLED);
-   selrecord(p, &sc->sc_wsel);
-   }
-   }
-   splx(s);
-   return revents;
+   return 0;
 }
 
 void
Index: dev/audio.c
===
RCS file: /cvs/src/sys/dev/audio.c,v
retrieving revision 1.198
diff -u -p -r1.198 audio.c
--- dev/audio.c 21 Mar 2022 19:22:39 -  1.198
+++ dev/audio.c 11 Apr 2022 16:38:52 -
@@ -2053,17 +2053,7 @@ audio_mixer_read(struct audio_softc *sc,
 int
 audio_mixer_poll(struct audio_softc *sc, int events, struct proc *p)
 {
-   int revents = 0;
-
-   mtx_enter(&audio_lock);
-   if (sc->mix_isopen && sc->mix_pending)
-   revents |= events & (POLLIN | POLLRDNORM);
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM))
-   selrecord(p, &sc->mix_sel);
-   }
-   mtx_leave(&audio_lock);
-   return revents;
+   return 0;
 }
 
 int
@@ -2101,21 +2091,7 @@ audio_mixer_close(struct audio_softc *sc
 int
 audio_poll(struct audio_softc *sc, int events, struct proc *p)
 {
-   int revents = 0;
-
-   mtx_enter(&audio_lock);
-   if ((sc->mode & AUMODE_RECORD) && sc->rec.used > 0)
-   revents |= events & (POLLIN | POLLRDNORM);
-   if ((sc->mode & AUMODE_PLAY) && sc->play.used < sc->play.len)
-   revents |= events & (POLLOUT | POLLWRNORM);
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM))
-   selrecord(p, &sc->rec.sel);
-   if (events & (POLLOUT | POLLWRNORM))
-   selrecord(p, &sc->play.sel);
-   }
-   mtx_leave(&audio_lock);
-   return revents;
+   return 0;
 }
 
 int
Index: dev/hotplug.c
===
RCS file: /cvs/src/sys/dev/hotplug.c,v
retrieving revision 1.21
diff -u -p -r1.21 hotplug.c
--- dev/hotplug.c   25 Dec 2020 12:59:52 -  1.21
+++ dev/hotplug.c   11 Apr 2022 16:39:24 -
@@ -183,16 +183,7 @@ hotplugioctl(dev_t dev, u_long cmd, cadd
 int
 hotplugpoll(dev_t dev, int events, struct proc *p)
 {
-   int revents = 0;
-
-   if (events & (POLLIN | POLLRDNORM)) {
-   if (evqueue_count > 0)
-   revents |= events & (POLLIN | POLLRDNORM);
-   else
-   selrecord(p, &hotplug_sel);
-   }
-
-   return (revents);
+   return (0);
 }
 
 int
Index: dev/midi.c
===
RCS file: /cvs/src/sys/dev/midi.c,v
retrieving revision 1.54
diff -u -p -r1.54 midi.c
--- dev/midi.c  6 Apr 2022 18:59:27 -   1.54
+++ dev/midi.c  11 Apr 2022 16:39:31 -
@@ -334,31 +334,7 @@ done:
 int
 midipoll(dev_t dev, int events, struct proc *p)
 {
-   struct midi_softc *sc;
-   int 

Re: refcount btrace

2022-03-21 Thread Martin Pieuchot
On 20/03/22(Sun) 05:39, Visa Hankala wrote:
> On Sat, Mar 19, 2022 at 12:10:11AM +0100, Alexander Bluhm wrote:
> > On Thu, Mar 17, 2022 at 07:25:27AM +, Visa Hankala wrote:
> > > On Thu, Mar 17, 2022 at 12:42:13AM +0100, Alexander Bluhm wrote:
> > > > I would like to use btrace to debug refernce counting.  The idea
> > > > is to a a tracepoint for every type of refcnt we have.  When it
> > > > changes, print the actual object, the current counter and the change
> > > > value.
> > > 
> > > > Do we want that feature?
> > > 
> > > I am against this in its current form. The code would become more
> > > complex, and the trace points can affect timing. There is a risk that
> > > the kernel behaves slightly differently when dt has been compiled in.
> > 
> > On our main architectures dt(4) is in GENERIC.  I see your timing
> > point for uvm structures.
> 
> In my opinion, having dt(4) enabled by default is another reason why
> there should be no carte blanche for adding trace points. Each trace
> point adds a tiny amount of bloat. Few users will use the tracing
> facility.
> 
> Maybe high-rate trace points could be behind a build option...

The whole point of dt(4) is to be able to debug a GENERIC kernel.  I doubt
the cost of an additional if () block matters.
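
Roughly what I have in mind, as an untested sketch: `dt_tracing' and the
dt_probe_refcnt() hook below are assumed names, the point is only that the
cost without a tracer attached is a single predicted branch.

static inline void
refcnt_trace(void *obj, u_int oldcnt, int diff)
{
	extern volatile uint32_t dt_tracing;	/* assumption, not real glue */

	if (__predict_false(dt_tracing))	/* the additional if () block */
		dt_probe_refcnt(obj, oldcnt, diff);	/* hypothetical hook */
}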



Swap encrypt under memory pressure

2022-03-12 Thread Martin Pieuchot
Try to allocate the buffer before doing the encryption; if it fails we
do not spend time doing the encryption.  This reduces the pressure when
swapping with low memory.

ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.153
diff -u -p -r1.153 uvm_swap.c
--- uvm/uvm_swap.c  22 Feb 2022 01:15:02 -  1.153
+++ uvm/uvm_swap.c  12 Mar 2022 10:30:26 -
@@ -1690,6 +1690,26 @@ uvm_swap_io(struct vm_page **pps, int st
}
}
 
+   /*
+* now allocate a buf for the i/o.
+* [make sure we don't put the pagedaemon to sleep...]
+*/
+   pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT :
+   PR_WAITOK;
+   bp = pool_get(&bufpool, pflag | PR_ZERO);
+
+   /*
+* if we failed to get a swapbuf, return "try again"
+*/
+   if (bp == NULL) {
+   if (bounce) {
+   uvm_pagermapout(bouncekva, npages);
+   uvm_swap_freepages(tpps, npages);
+   }
+   uvm_pagermapout(kva, npages);
+   return (VM_PAGER_AGAIN);
+   }
+
/* encrypt to swap */
if (write && bounce) {
int i, opages;
@@ -1729,35 +1749,6 @@ uvm_swap_io(struct vm_page **pps, int st
  PGO_PDFREECLUST);
 
kva = bouncekva;
-   }
-
-   /*
-* now allocate a buf for the i/o.
-* [make sure we don't put the pagedaemon to sleep...]
-*/
-   pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT :
-   PR_WAITOK;
-   bp = pool_get(&bufpool, pflag | PR_ZERO);
-
-   /*
-* if we failed to get a swapbuf, return "try again"
-*/
-   if (bp == NULL) {
-   if (write && bounce) {
-#ifdef UVM_SWAP_ENCRYPT
-   int i;
-
-   /* swap encrypt needs cleanup */
-   if (encrypt)
-   for (i = 0; i < npages; i++)
-   SWAP_KEY_PUT(sdp, SWD_KEY(sdp,
-   startslot + i));
-#endif
-
-   uvm_pagermapout(kva, npages);
-   uvm_swap_freepages(tpps, npages);
-   }
-   return (VM_PAGER_AGAIN);
}
 
/*



Re: more MAKEDEV cleanup

2022-02-10 Thread Martin Pieuchot
On 05/04/21(Mon) 09:25, Miod Vallat wrote:
> The following diff attempts to clean up a few loose ends in the current
> MAKEDEV files:
> 
> - remove no-longer applicable device definitions (MSCP and SMD disks,
>   this kind of thing).
> - makes sure all platforms use the same `ramdisk' target for
>   installation media devices, rather than a mix of `ramd' and `ramdisk'.
> - moves as many `ramdisk' devices to MI land (bio, diskmap, random,
>   etc).
> - reduces the number of block devices in `ramdisk' targets to only one
>   per device, since the installer script will invoke MAKEDEV by itself
>   for the devices it needs to use.
> - sort device names in `all' and `ramdisk' MI lists to make maintenance
>   easier. This causes some ordering change in the `all' target in the
>   generated MAKEDEVs.

What happened to this?

> Index: MAKEDEV.common
> ===
> RCS file: /OpenBSD/src/etc/MAKEDEV.common,v
> retrieving revision 1.113
> diff -u -p -r1.113 MAKEDEV.common
> --- MAKEDEV.common12 Feb 2021 10:26:33 -  1.113
> +++ MAKEDEV.common5 Apr 2021 09:18:49 -
> @@ -114,7 +114,7 @@ dnl make a 'disktgt' macro that automati
>  dnl disktgt(rd, {-rd-})
>  dnl
>  dnl  target(all,rd,0)
> -dnl  target(ramd,rd,0)
> +dnl  target(ramdisk,rd,0)
>  dnl  disk_q(rd)
>  dnl  __devitem(rd, {-rd*-}, {-rd-})dnl
>  dnl
> @@ -122,62 +122,60 @@ dnl  Note: not all devices are generated
>  dnlits own extra list.
>  dnl
>  divert(1)dnl
> +target(all, acpi)dnl
> +target(all, apm)dnl
> +target(all, bio)dnl
> +target(all, bpf)dnl
> +twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl
> +twrget(all, czs, cua, a, b, c, d)dnl
> +target(all, diskmap)dnl
> +target(all, dt)dnl
>  twrget(all, fdesc, fd)dnl
> -target(all, st, 0, 1)dnl
> -target(all, std)dnl
> -target(all, ra, 0, 1, 2, 3)dnl
> -target(all, rx, 0, 1)dnl
> -target(all, wd, 0, 1, 2, 3)dnl
> -target(all, xd, 0, 1, 2, 3)dnl
> +target(all, fuse)dnl
> +target(all, hotplug)dnl
> +target(all, joy, 0, 1)dnl
> +target(all, kcov)dnl
> +target(all, kstat)dnl
> +target(all, local)dnl
> +target(all, lpt, 0, 1, 2)dnl
> +twrget(all, lpt, lpa, 0, 1, 2)dnl
> +target(all, par, 0)dnl
> +target(all, pci, 0, 1, 2, 3)dnl
>  target(all, pctr)dnl
>  target(all, pctr0)dnl
>  target(all, pf)dnl
> -target(all, apm)dnl
> -target(all, acpi)dnl
> +target(all, pppac)dnl
> +target(all, pppx)dnl
> +target(all, ptm)dnl
> +target(all, pty, 0)dnl
> +target(all, pvbus, 0, 1)dnl
> +target(all, radio, 0)dnl
> +target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> +twrget(all, rnd, random)dnl
> +twrget(all, speak, speaker)dnl
> +target(all, st, 0, 1)dnl
> +target(all, std)dnl
> +target(all, switch, 0, 1, 2, 3)dnl
> +target(all, tap, 0, 1, 2, 3)dnl
>  twrget(all, tth, ttyh, 0, 1)dnl
>  target(all, ttyA, 0, 1)dnl
> -twrget(all, mac_tty0, tty0, 0, 1)dnl
> -twrget(all, tzs, tty, a, b, c, d)dnl
> -twrget(all, czs, cua, a, b, c, d)dnl
>  target(all, ttyc, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> -twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl
> -twrget(all, mmcl, mmclock)dnl
> -target(all, lpt, 0, 1, 2)dnl
> -twrget(all, lpt, lpa, 0, 1, 2)dnl
> -target(all, joy, 0, 1)dnl
> -twrget(all, rnd, random)dnl
> -target(all, uk, 0)dnl
> -twrget(all, vi, video, 0, 1)dnl
> -twrget(all, speak, speaker)dnl
> -target(all, asc, 0)dnl
> -target(all, radio, 0)dnl
> +target(all, tun, 0, 1, 2, 3)dnl
>  target(all, tuner, 0)dnl
> -target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> +twrget(all, tzs, tty, a, b, c, d)dnl
>  target(all, uall)dnl
> -target(all, pci, 0, 1, 2, 3)dnl
> -twrget(all, wsmouse, wscons)dnl
> -target(all, par, 0)dnl
> -target(all, apci, 0)dnl
> -target(all, local)dnl
> -target(all, ptm)dnl
> -target(all, hotplug)dnl
> -target(all, pppx)dnl
> -target(all, pppac)dnl
> -target(all, fuse)dnl
> +target(all, uk, 0)dnl
> +twrget(all, vi, video, 0, 1)dnl
>  target(all, vmm)dnl
> -target(all, pvbus, 0, 1)dnl
> -target(all, bpf)dnl
> -target(all, kcov)dnl
> -target(all, dt)dnl
> -target(all, kstat)dnl
> +target(all, vnd, 0, 1, 2, 3)dnl
> +target(all, vscsi, 0)dnl
> +target(all, wd, 0, 1, 2, 3)dnl
> +twrget(all, wsmouse, wscons)dnl
>  dnl
>  _mkdev(all, {-all-}, {-dnl
>  show_target(all)dnl
>  -})dnl
>  dnl
> -dnl XXX some arches use ramd, others ramdisk - needs to be fixed eventually
> -__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl
> -dnl
>  target(usb, usb, 0, 1, 2, 3, 4, 5, 6, 7)dnl
>  target(usb, uhid, 0, 1, 2, 3, 4, 5, 6, 7)dnl
>  twrget(usb, fido, fido)dnl
> @@ -208,26 +206,26 @@ __devitem(ch, {-ch*-}, SCSI media change
>  _mcdev(ch, ch*, ch, {-major_ch_c-}, 660, operator)dnl
>  __devitem(uk, uk*, Unknown SCSI devices)dnl
>  _mcdev(uk, uk*, uk, {-major_uk_c-}, 640, operator)dnl
> -dnl XXX see ramdisk above
> -__devitem(ramd, ramdisk, Ramdisk kernel devices,nothing)dnl
>  dnl
> -_mkdev(ramd, ramdisk, {-dnl
> -show_target(ramd)dnl
> +__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl
> +_mkdev(ramdisk, ramdisk, {-dnl
> 

Re: uvm_unmap_kill_entry(): unwire with map lock held

2022-02-04 Thread Martin Pieuchot
On 04/02/22(Fri) 03:39, Klemens Nanni wrote:
> [...] 
> ... with the lock grabbed in uvm_map_teardown() that is, otherwise
> the first call path can lock against itself (regress/misc/posixtestsuite
> is a reproducer for this):
> 
>   vm_map_lock_read_ln+0x38
>   uvm_fault_unwire+0x58
>   uvm_unmap_kill_entry_withlock+0x68
>   uvm_unmap_remove+0x2d4
>   sys_munmap+0x11c
> 
> which is obvious in hindsight.
> 
> So grabbing the lock in uvm_map_teardown() instead avoids that while
> still ensuring a locked map in the path missing a lock.

This should be fine since this function should only be called when the
last reference of a map is dropped.  In other words the locking here is
necessary to satisfy the assertions.

I wonder if the lock shouldn't be taken & released around uvm_map_teardown(),
which would make it easier to see that this is called after the last
refcount decrement.
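
Something in this direction is what I mean (sketch only, refcount handling
shown naively and the deferred teardown ignored; the point is where the
lock/unlock would sit):

void
uvmspace_free(struct vmspace *vm)
{
	if (--vm->vm_refcnt > 0)
		return;

	/* Last reference gone: lock the map only for the teardown. */
	vm_map_lock(&vm->vm_map);
	uvm_map_teardown(&vm->vm_map);
	vm_map_unlock(&vm->vm_map);
	pool_put(&uvm_vmspace_pool, vm);
}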

> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 4 Feb 2022 02:51:00 -
> @@ -2734,6 +2751,7 @@ uvm_map_teardown(struct vm_map *map)
>   KERNEL_ASSERT_UNLOCKED();
>  
>   KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
> + vm_map_lock(map);
>  
>   /* Remove address selectors. */
>   uvm_addr_destroy(map->uaddr_exe);
> 



Re: uvm_unmap_kill_entry(): unwire with map lock held

2022-01-31 Thread Martin Pieuchot
On 31/01/22(Mon) 10:24, Klemens Nanni wrote:
> Running with my uvm assert diff showed that uvm_fault_unwire_locked()
> was called without any locks held.
> 
> This happened when I rebooted my machine:
> 
>   uvm_fault_unwire_locked()
>   uvm_unmap_kill_entry_withlock()
>   uvm_unmap_kill_entry()
>   uvm_map_teardown()
>   uvmspace_free()
> 
> This code does not corrupt anything because
> uvm_unmap_kill_entry_withlock() is grabbing the kernel lock around its
> uvm_fault_unwire_locked() call.
> 
> But regardless of the kernel lock dances in this code path, the uvm map
> ought to be protected by its own lock.  uvm_fault_unwire() does that.
> 
> uvm_fault_unwire_locked()'s comment says the map must at least be read
> locked, which is what all other code paths to that function do.
> 
> This makes my latest assert diff happy in the reboot case (it did not
> always hit that assert).

I'm happy your asserts found a first bug.

I'm afraid calling this function below could result in grabbing the
lock twice.  Can we be sure this doesn't happen?

uvm_unmap_kill_entry() is called in many different contexts and this is
currently a mess.  I don't know what NetBSD did in this area but it is
worth looking at and see if there isn't a good idea to untangle this.

> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 31 Jan 2022 09:28:04 -
> @@ -2132,7 +2143,7 @@ uvm_unmap_kill_entry_withlock(struct vm_
>   if (VM_MAPENT_ISWIRED(entry)) {
>   KERNEL_LOCK();
>   entry->wired_count = 0;
> - uvm_fault_unwire_locked(map, entry->start, entry->end);
> + uvm_fault_unwire(map, entry->start, entry->end);
>   KERNEL_UNLOCK();
>   }
>  
> 



Re: unlock mmap(2) for anonymous mappings

2022-01-24 Thread Martin Pieuchot
On 24/01/22(Mon) 12:06, Klemens Nanni wrote:
> On Sun, Jan 16, 2022 at 09:22:50AM -0300, Martin Pieuchot wrote:
> > IMHO this approach of let's try if it works now and revert if it isn't
> > doesn't help us make progress.  I'd be more confident seeing diffs that
> > assert for the right lock in the functions called by uvm_mapanon() and
> > documentation about which lock is protecting which field of the data
> > structures.
> 
> I picked `vm_map's `size' as first underdocumented member.
> All accesses to it are protected by vm_map_lock(), either through the
> function itself or implicitly by already requiring the calling function
> to lock the map.

Could we use a vm_map_assert_locked() or something similar that reflects
the exclusive or shared (read) lock comments?  I don't trust comments.
It's too easy to miss a lock in a code path.
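
For example (sketch only, assuming the map's rwlock is the `lock' member
and that intrsafe maps are skipped as elsewhere):

void
vm_map_assert_anylock(struct vm_map *map)
{
	if ((map->flags & VM_MAP_INTRSAFE) == 0)
		KASSERT(rw_lock_held(&map->lock));
}

void
vm_map_assert_wrlock(struct vm_map *map)
{
	if ((map->flags & VM_MAP_INTRSAFE) == 0)
		KASSERT(rw_write_held(&map->lock));
}

uvm_map_isavail() & friends could then call the assert matching their
locking comment instead of only documenting it.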

> So annotate functions using `size' wrt. the required lock.
> 
> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 21 Jan 2022 13:03:05 -
> @@ -805,6 +805,8 @@ uvm_map_sel_limits(vaddr_t *min, vaddr_t
>   * Fills in *start_ptr and *end_ptr to be the first and last entry describing
>   * the space.
>   * If called with prefilled *start_ptr and *end_ptr, they are to be correct.
> + *
> + * map must be at least read-locked.
>   */
>  int
>  uvm_map_isavail(struct vm_map *map, struct uvm_addr_state *uaddr,
> @@ -2206,6 +2208,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
>   * If remove_holes, then remove ET_HOLE entries as well.
>   * If markfree, entry will be properly marked free, otherwise, no replacement
>   * entry will be put in the tree (corrupting the tree).
> + *
> + * map must be locked.
>   */
>  void
>  uvm_unmap_remove(struct vm_map *map, vaddr_t start, vaddr_t end,
> @@ -2976,6 +2980,9 @@ uvm_tree_sanity(struct vm_map *map, char
>   UVM_ASSERT(map, addr == vm_map_max(map), file, line);
>  }
>  
> +/*
> + * map must be at least read-locked.
> + */
>  void
>  uvm_tree_size_chk(struct vm_map *map, char *file, int line)
>  {
> Index: uvm_map.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.h,v
> retrieving revision 1.71
> diff -u -p -r1.71 uvm_map.h
> --- uvm_map.h 15 Dec 2021 12:53:53 -  1.71
> +++ uvm_map.h 21 Jan 2022 12:51:26 -
> @@ -272,7 +272,7 @@ struct vm_map {
>  
>   struct uvm_map_addr addr;   /* [v] Entry tree, by addr */
>  
> - vsize_t size;   /* virtual size */
> + vsize_t size;   /* [v] virtual size */
>   int ref_count;  /* [a] Reference count */
>   int flags;  /* flags */
>   struct mutexflags_lock; /* flags lock */
> 



Re: DDBPROF: move option to amd64,i386 GENERIC

2022-01-18 Thread Martin Pieuchot
On 18/01/22(Tue) 04:38, Klemens Nanni wrote:
> While intended for more architectures, DDBPROF is strictly amd64 and
> i386 only, so the machine-independent sys/conf/GENERIC does not seem fit
> (until all architectures are supported).

This define should die.  There's no need to polish this turd.  Somebody
has to stand up and turn this into a "#if NDT > 0" with the audit that
goes with it.
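
The shape would be the usual config(8) count pattern, sketched below; the
hook names are made up, only the #if NDT > 0 guard is the point:

#include "dt.h"			/* generated by config(8), defines NDT */

#if NDT > 0
#include <dev/dt/dtvar.h>
#endif

void
db_prof_hook(void)		/* illustrative name only */
{
#if NDT > 0
	dt_prof_tick();		/* hypothetical dt(4) entry point */
#endif
}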

> Index: conf/GENERIC
> ===
> RCS file: /cvs/src/sys/conf/GENERIC,v
> retrieving revision 1.281
> diff -u -p -r1.281 GENERIC
> --- conf/GENERIC  23 Dec 2021 10:04:14 -  1.281
> +++ conf/GENERIC  18 Jan 2022 04:28:29 -
> @@ -4,7 +4,6 @@
>  #GENERIC kernel
>  
>  option   DDB # in-kernel debugger
> -#option  DDBPROF # ddb(4) based profiling
>  #option  DDB_SAFE_CONSOLE # allow break into ddb during boot
>  #makeoptions DEBUG=""# do not compile full symbol table
>  #makeoptions PROF="-pg"  # build profiled kernel
> Index: arch/amd64/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v
> retrieving revision 1.510
> diff -u -p -r1.510 GENERIC
> --- arch/amd64/conf/GENERIC   4 Jan 2022 05:50:43 -   1.510
> +++ arch/amd64/conf/GENERIC   18 Jan 2022 04:28:04 -
> @@ -13,6 +13,8 @@ machine amd64
>  include  "../../../conf/GENERIC"
>  maxusers 80  # estimated number of users
>  
> +#option  DDBPROF # ddb(4) based profiling
> +
>  option   USER_PCICONF# user-space PCI configuration
>  
>  option   APERTURE# in-kernel aperture driver for XFree86
> Index: arch/i386/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/i386/conf/GENERIC,v
> retrieving revision 1.860
> diff -u -p -r1.860 GENERIC
> --- arch/i386/conf/GENERIC2 Jan 2022 23:14:27 -   1.860
> +++ arch/i386/conf/GENERIC18 Jan 2022 04:28:26 -
> @@ -13,6 +13,8 @@ machine i386
>  include  "../../../conf/GENERIC"
>  maxusers 80  # estimated number of users
>  
> +#option  DDBPROF # ddb(4) based profiling
> +
>  option   USER_PCICONF# user-space PCI configuration
>  
>  option   APERTURE# in-kernel aperture driver for XFree86
> 



Re: uvm_swap: introduce uvm_swap_data_lock

2022-01-16 Thread Martin Pieuchot
Nice!

On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> The diff below does two things: it adds a uvm_swap_data_lock mutex and
> trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()

Why is it enough?  Which fields does the lock protect in these
functions?  Is it `uvmexp.swpages', and could that be documented?

What about `nswapdev'?  Why is the rwlock grabbed before reading it in
sys_swapctl()?

What about `swpginuse'?

If the mutex/rwlock are used to protect the global `swap_priority' could
that be also documented?  Once this is documented it should be trivial to
see that some places are missing some locking.  Is it intentional?

> The uvm_swap_data_lock protects all swap data structures, so needs to be
> grabbed a few times, many of them already documented in the comments.
> 
> For review, I suggest comparing to what NetBSD did and also going
> through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> and check that they are properly locked when called, or that there is
> the KERNEL_LOCK() in place when swap data structures are manipulated.

I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
the differences with NetBSD.
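
Concretely, something like this at the top of swaplist_insert(),
swaplist_find() and swaplist_trim() (sketch of the idiom; same checks your
diff already adds, with the rwlock one expressed as a KASSERT):

	KASSERT(rw_write_held(&swap_syscall_lock));
	MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);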

> In swapmount() I introduced locking since that's needed to be able to
> assert that the proper locks are held in swaplist_{insert,find,trim}.

Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

> Index: uvm/uvm_swap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
> retrieving revision 1.152
> diff -u -p -r1.152 uvm_swap.c
> --- uvm/uvm_swap.c12 Dec 2021 09:14:59 -  1.152
> +++ uvm/uvm_swap.c30 Dec 2021 15:47:20 -
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include <sys/mutex.h>
>  #include 
>  #include 
>  #include 
> @@ -91,6 +92,9 @@
>   *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
>   *system call and prevents the swap priority list from changing
>   *while we are in the middle of a system call (e.g. SWAP_STATS).
> + *  - uvm_swap_data_lock (mutex): this lock protects all swap data
> + *structures including the priority list, the swapdev structures,
> + *and the swapmap arena.
>   *
>   * each swap device has the following info:
>   *  - swap device in use (could be disabled, preventing future use)
> @@ -212,6 +216,7 @@ LIST_HEAD(swap_priority, swappri);
>  struct swap_priority swap_priority;
>  
>  /* locks */
> +struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
>  struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
>  
>  /*
> @@ -442,7 +447,7 @@ uvm_swap_finicrypt_all(void)
>  /*
>   * swaplist_insert: insert swap device "sdp" into the global list
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   * => caller must provide a newly malloc'd swappri structure (we will
>   *   FREE it if we don't need it... this it to prevent malloc blocking
>   *   here while adding swap)
> @@ -452,6 +457,9 @@ swaplist_insert(struct swapdev *sdp, str
>  {
>   struct swappri *spp, *pspp;
>  
> + rw_assert_wrlock(&swap_syscall_lock);
> + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   /*
>* find entry at or after which to insert the new device.
>*/
> @@ -493,7 +501,7 @@ swaplist_insert(struct swapdev *sdp, str
>   * swaplist_find: find and optionally remove a swap device from the
>   *   global list.
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   * => we return the swapdev we found (and removed)
>   */
>  struct swapdev *
> @@ -502,6 +510,9 @@ swaplist_find(struct vnode *vp, boolean_
>   struct swapdev *sdp;
>   struct swappri *spp;
>  
> + rw_assert_wrlock(&swap_syscall_lock);
> + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   /*
>* search the lists for the requested vp
>*/
> @@ -524,13 +535,16 @@ swaplist_find(struct vnode *vp, boolean_
>   * swaplist_trim: scan priority list for empty priority entries and kill
>   *   them.
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   */
>  void
>  swaplist_trim(void)
>  {
>   struct swappri *spp, *nextspp;
>  
> + rw_assert_wrlock(&swap_syscall_lock);
> + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   LIST_FOREACH_SAFE(spp, &swap_priority, spi_swappri, nextspp) {
>   if (!TAILQ_EMPTY(&spp->spi_swapdev))
>   continue;
> @@ -543,7 +557,7 @@ swaplist_trim(void)
>   * swapdrum_add: add a "swapdev"'s blocks into /dev/drum's area.
>   *
>   * => caller must hold swap_syscall_lock
> - * => uvm.swap_data_lock should be unlocked (we may sleep)
> + * => uvm_swap_data_lock should be unlocked 

Re: unlock mmap(2) for anonymous mappings

2022-01-16 Thread Martin Pieuchot
On 14/01/22(Fri) 23:01, Mark Kettenis wrote:
> > Date: Tue, 11 Jan 2022 23:13:20 +
> > From: Klemens Nanni 
> > 
> > On Tue, Jan 11, 2022 at 09:54:44AM -0700, Theo de Raadt wrote:
> > > > Now this is clearly a "slow" path.  I don't think there is any reason
> > > > not to put all the code in that if (uvw_wxabort) block under the
> > > > kernel lock.  For now I think making access to ps_wxcounter atomic is
> > > > just too fine grained.
> > > 
> > > Right.  Lock the whole block.
> > 
> > Thanks everyone, here's the combined diff for that.
> 
> I think mpi@ should be involved in the actual unlocking of mmap(2),
> munmap(2) and mprotect(2).  But the changes to uvm_mmap.c are ok
> kettenis@ and can be committed now.

It isn't clear to me what changed since the last time this has been
tried.  Why is it safe now?  What are the locking assumptions?  

IMHO this approach of let's try if it works now and revert if it isn't
doesn't help us make progress.  I'd be more confident seeing diffs that
assert for the right lock in the functions called by uvm_mapanon() and
documentation about which lock is protecting which field of the data
structures.

NetBSD has done much of this and the code bases do not diverge much so
it can be useful to look there as well.



Re: patch: add a new ktrace facility for replacing some printf-debug

2022-01-07 Thread Martin Pieuchot
On 07/01/22(Fri) 10:54, Sebastien Marie wrote:
> Hi,
> 
> Debugging some code paths is complex: for example, unveil(2) code is
> involved inside VFS, and using DEBUG_UNVEIL means that the kernel is
> spamming printf() for all processes using unveil(2) (a lot of
> processes) instead of just the interested cases I want to follow.
> 
> So I cooked the following diff to add a KTRLOG() facility to be able
> to replace printf()-like debugging with a more process-limited method.

I wish you could debug such issues without having to change any kernel
code; that's why I started btrace(8).

You should already be able to filter probes per thread/process; maybe you
want to replace some debug printfs with static probes.  Alternatively you
could define DDBPROF and use the kprobe provider, which allows you to
inspect the prologue and epilogue of ELF functions.

Maybe btrace(8) is not yet fully functional to debug this particular
problem, but improving it should hopefully give us a tool to debug most
of the kernel issues without having to write a diff and boot a custom
kernel.  At least that's the goal.



Re: gprof: Profiling a multi-threaded application

2021-12-10 Thread Martin Pieuchot
On 10/12/21(Fri) 09:56, Yuichiro NAITO wrote:
> Any comments about this topic?

I'm ok with this approach.  I would appreciate it if somebody
else could take it over; I'm too busy with other stuff.

> On 7/12/21 18:09, Yuichiro NAITO wrote:
> > Hi, Martin
> > 
> > On 2021/07/10 16:55, Martin Pieuchot wrote:
> > > Hello Yuichiro, thanks for your work !
> > 
> > Thanks for the response.
> > 
> > > > On 2021/06/16 16:34, Yuichiro NAITO wrote:
> > > > > When I compile a multi-threaded application with '-pg' option, I 
> > > > > always get no
> > > > > results in gmon.out. With bad luck, my application gets SIGSEGV while 
> > > > > running.
> > > > > Because the buffer that holds number of caller/callee count is the 
> > > > > only one
> > > > > in the process and will be broken by multi-threaded access.
> > > > > 
> > > > > I get the idea to solve this problem from NetBSD. NetBSD has 
> > > > > individual buffers
> > > > > for each thread and merges them at the end of profiling.
> > > 
> > > Note that the kernel use a similar approach but doesn't merge the buffer,
> > > instead it generates a file for each CPU.
> > 
> > Yes, so the number of output files is limited by the number of CPUs in 
> > case of
> > the kernel profiling. I think number of application threads can be 
> > increased more
> > casually. I don't want to see dozens of 'gmon.out' files.
> > 
> > > > > NetBSD stores the reference to the individual buffer by 
> > > > > pthread_setspecific(3).
> > > > > I think it causes infinite recursive call if whole libc library 
> > > > > (except
> > > > > mcount.c) is compiled with -pg.
> > > > > 
> > > > > The compiler generates a '_mcount' function call at the beginning of 
> > > > > every
> > > > > function. If '_mcount' calls pthread_getspecific(3) to get the 
> > > > > individual
> > > > > buffer, pthread_getspecific() calls '_mcount' again and causes 
> > > > > infinite
> > > > > recursion.
> > > > > 
> > > > > NetBSD prevents from infinite recursive call by checking a global 
> > > > > variable. But
> > > > > this approach also prevents simultaneously call of '_mcount' on a 
> > > > > multi-threaded
> > > > > application. It makes a little bit wrong results of profiling.
> > > > > 
> > > > > So I added a pointer to the buffer in `struct pthread` that can be 
> > > > > accessible
> > > > > via macro call as same as pthread_self(3). This approach can prevent 
> > > > > of
> > > > > infinite recursive call of '_mcount'.
> > > 
> > > Not calling a libc function for this makes sense.  However I'm not
> > > convinced that accessing `tib_thread' before _rthread_init() has been
> > > called is safe.
> > 
> > Before '_rthread_init’ is called, '__isthreaded' global variable is kept to 
> > be 0.
> > My patch doesn't access tib_thread in this case.
> > After calling `_rthread_init`, `pthread_create()` changes `__isthreaded` to 
> > 1.
> > Tib_thread will be accessed by all threads that are newly created and the 
> > initial one.
> > 
> > I believe tib of the initial thread has been initialized in `_libc_preinit' 
> > function
> > in 'lib/libc/dlfcn/init.c'.
> > 
> > > I'm not sure where is the cleanest way to place the per-thread buffer,
> > > I'd suggest you ask guenther@ about this.
> > 
> > I added guenther@ in CC of this mail.
> > I hope I can get an advise about per-thread buffer.
> > 
> > > Maybe the initialization can be done outside of _mcount()?
> > 
> > Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`.
> > 
> > > > > I obtained merging function from NetBSD that is called in '_mcleanup' 
> > > > > function.
> > > > > Merging function needs to walk through all the individual buffers,
> > > > > I added SLIST_ENTRY member in 'struct gmonparam' to make a list of 
> > > > > the buffers.
> > > > > And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be 
> > > > > used for
> > > > > the kernel.
> > > > > 
> > > > > But I still use pthread_getspecific(3) for that can call destructor 
&

Re: net write in pcb hash

2021-12-09 Thread Martin Pieuchot
On 08/12/21(Wed) 22:39, Alexander Bluhm wrote:
> On Wed, Dec 08, 2021 at 03:28:34PM -0300, Martin Pieuchot wrote:
> > On 04/12/21(Sat) 01:02, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > As I want a read-only net lock for forwarding, all paths should be
> > > checked for the correct net lock and asserts.  I found code in
> > > in_pcbhashlookup() where reading the PCB table results in a write
> > > to optimize the cache.
> > > 
> > > Properly protecting PCB hashes is out of scope for parallel forwarding.
> > > Can we get away with this hack?  Only update the cache when we are
> > > in the TCP or UDP stack with the write lock.  The access from pf is
> > > read-only.
> > > 
> > > NET_WLOCKED() indicates whether we may change data structures.
> > 
> > I recall that we currently do not want to introduce such idiom: change
> > the behavior of the code depending on which lock is held by the caller.
> > 
> > Can we instead assert that a write-lock is held before modifying the
> > list?
> 
> We could also pass down the kind of lock that is used.  Goal is
> that pf uses shared net lock.  TCP and UDP will keep the exclusive
> net lock for a while.

Changing the logic of a function based on the type of a lock is not
different from the previous approach.

> Diff gets longer but perhaps a bit clearer what is going on.

I believe we want to split in_pcblookup_listen() into two functions.
One which is read-only and one which modifies the head of the hash
chain.
The read-only one asserts for any lock; the one that modifies the hash
calls the former and asserts for a write lock.
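
Rough sketch of the split (names, the hypothetical NET_ASSERT_WLOCKED()
and the abbreviated argument lists are illustrative only):

struct inpcb *
in_pcblookup_listen_locked(struct inpcbtable *table, struct in_addr laddr,
    u_int lport, u_int rtable, struct inpcbhead **headp)
{
	struct inpcbhead *head;
	struct inpcb *inp;

	NET_ASSERT_LOCKED();	/* shared or exclusive, both are fine */
	head = in_pcbhash(table, rtable_l2(rtable), &zeroin_addr, 0,
	    &laddr, lport);
	LIST_FOREACH(inp, head, inp_hash) {
		/*
		 * Same matching as the current code (wildcard laddr,
		 * rdomain, pf divert) -- elided in this sketch.
		 */
	}
	if (headp != NULL)
		*headp = head;
	return (inp);
}

struct inpcb *
in_pcblookup_listen(struct inpcbtable *table, struct in_addr laddr,
    u_int lport, u_int rtable)
{
	struct inpcbhead *head;
	struct inpcb *inp;

	NET_ASSERT_WLOCKED();	/* hypothetical exclusive assert */
	inp = in_pcblookup_listen_locked(table, laddr, lport, rtable, &head);
	/* the self-organizing chain trick needs the exclusive lock */
	if (inp != NULL && inp != LIST_FIRST(head)) {
		LIST_REMOVE(inp, inp_hash);
		LIST_INSERT_HEAD(head, inp, inp_hash);
	}
	return (inp);
}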

Alternatively, we could protect the PCB hash with a mutex.  This has the
advantage of not making the scope of the NET_LOCK() more complex.  In
the end we all know something like that will be done.  I don't know how
other BSD did this but I'm sure this will help getting the remaining
socket layer out of the NET_LOCK().
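
For the mutex variant I picture something along these lines (the field
name and its placement in "struct inpcbtable" are assumptions):

struct inpcbtable {
	/* ... existing members ... */
	struct mutex inpt_mtx;		/* protects the hash chains */
};

static inline void
in_pcbtable_lock(struct inpcbtable *table)
{
	mtx_enter(&table->inpt_mtx);
}

static inline void
in_pcbtable_unlock(struct inpcbtable *table)
{
	mtx_leave(&table->inpt_mtx);
}

The hash walk and the head-of-chain reordering would then both run under
this mutex, independently of whether the caller holds the net lock shared
or exclusive.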

> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1122
> diff -u -p -r1.1122 pf.c
> --- net/pf.c  7 Jul 2021 18:38:25 -   1.1122
> +++ net/pf.c  8 Dec 2021 21:16:16 -
> @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd)
>   sport = pd->hdr.tcp.th_sport;
>   dport = pd->hdr.tcp.th_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &tcbtable;
>   break;
>   case IPPROTO_UDP:
>   sport = pd->hdr.udp.uh_sport;
>   dport = pd->hdr.udp.uh_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &udbtable;
>   break;
>   default:
> @@ -3348,22 +3346,24 @@ pf_socket_lookup(struct pf_pdesc *pd)
>* Fails when rtable is changed while evaluating the ruleset
>* The socket looked up will not match the one hit in the end.
>*/
> - inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport,
> - pd->rdomain);
> + NET_ASSERT_LOCKED();
> + inp = in_pcbhashlookup_wlocked(tb, saddr->v4, sport, daddr->v4,
> + dport, pd->rdomain, 0);
>   if (inp == NULL) {
> - inp = in_pcblookup_listen(tb, daddr->v4, dport,
> - NULL, pd->rdomain);
> + inp = in_pcblookup_listen_wlocked(tb, daddr->v4, dport,
> + NULL, pd->rdomain, 0);
>   if (inp == NULL)
>   return (-1);
>   }
>   break;
>  #ifdef INET6
>   case AF_INET6:
> - inp = in6_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6,
> - dport, pd->rdomain);
> + NET_ASSERT_LOCKED();
> + inp = in6_pcbhashlookup_wlocked(tb, &saddr->v6, sport,
> + &daddr->v6, dport, pd->rdomain, 0);
>   if (inp == NULL) {
> - inp = in6_pcblookup_listen(tb, &daddr->v6, dport,
> - NULL, pd->rdomain);
> + inp = in6_pcblookup_listen_wlocked(tb, &daddr->v6,
> + dport, NULL, pd->rdomain, 0);
>   if (inp == NULL)
>   return (-1);
>   }
> Index: netinet/in_pcb.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 in_pcb.c
> 

Re: net write in pcb hash

2021-12-08 Thread Martin Pieuchot
On 04/12/21(Sat) 01:02, Alexander Bluhm wrote:
> Hi,
> 
> As I want a read-only net lock for forwarding, all paths should be
> checked for the correct net lock and asserts.  I found code in
> in_pcbhashlookup() where reading the PCB table results in a write
> to optimize the cache.
> 
> Properly protecting PCB hashes is out of scope for parallel forwarding.
> Can we get away with this hack?  Only update the cache when we are
> in the TCP or UDP stack with the write lock.  The access from pf is
> read-only.
> 
> NET_WLOCKED() indicates whether we may change data structures.

I recall that we currently do not want to introduce such idiom: change
the behavior of the code depending on which lock is held by the caller.

Can we instead assert that a write-lock is held before modifying the
list?

> Also move the assert from pf to in_pcb where the critical section
> is.
> 
> bluhm
> 
> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1122
> diff -u -p -r1.1122 pf.c
> --- net/pf.c  7 Jul 2021 18:38:25 -   1.1122
> +++ net/pf.c  3 Dec 2021 22:20:32 -
> @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd)
>   sport = pd->hdr.tcp.th_sport;
>   dport = pd->hdr.tcp.th_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &tcbtable;
>   break;
>   case IPPROTO_UDP:
>   sport = pd->hdr.udp.uh_sport;
>   dport = pd->hdr.udp.uh_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &udbtable;
>   break;
>   default:
> Index: netinet/in_pcb.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 in_pcb.c
> --- netinet/in_pcb.c  25 Oct 2021 22:20:47 -  1.256
> +++ netinet/in_pcb.c  3 Dec 2021 22:20:32 -
> @@ -1069,6 +1069,8 @@ in_pcbhashlookup(struct inpcbtable *tabl
>   u_int16_t fport = fport_arg, lport = lport_arg;
>   u_int rdomain;
>  
> + NET_ASSERT_LOCKED();
> +
>   rdomain = rtable_l2(rtable);
>   head = in_pcbhash(table, rdomain, &faddr, fport, &laddr, lport);
>   LIST_FOREACH(inp, head, inp_hash) {
> @@ -1085,7 +1087,7 @@ in_pcbhashlookup(struct inpcbtable *tabl
>* repeated accesses are quicker.  This is analogous to
>* the historic single-entry PCB cache.
>*/
> - if (inp != LIST_FIRST(head)) {
> + if (NET_WLOCKED() && inp != LIST_FIRST(head)) {
>   LIST_REMOVE(inp, inp_hash);
>   LIST_INSERT_HEAD(head, inp, inp_hash);
>   }
> @@ -1119,6 +1121,8 @@ in_pcblookup_listen(struct inpcbtable *t
>   u_int16_t lport = lport_arg;
>   u_int rdomain;
>  
> + NET_ASSERT_LOCKED();
> +
>   key1 = &laddr;
>   key2 = &zeroin_addr;
>  #if NPF > 0
> @@ -1185,7 +1189,7 @@ in_pcblookup_listen(struct inpcbtable *t
>* repeated accesses are quicker.  This is analogous to
>* the historic single-entry PCB cache.
>*/
> - if (inp != NULL && inp != LIST_FIRST(head)) {
> + if (NET_WLOCKED() && inp != NULL && inp != LIST_FIRST(head)) {
>   LIST_REMOVE(inp, inp_hash);
>   LIST_INSERT_HEAD(head, inp, inp_hash);
>   }
> Index: sys/systm.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/sys/systm.h,v
> retrieving revision 1.154
> diff -u -p -r1.154 systm.h
> --- sys/systm.h   2 Jun 2021 00:39:25 -   1.154
> +++ sys/systm.h   3 Dec 2021 22:20:32 -
> @@ -344,6 +344,8 @@ extern struct rwlock netlock;
>  #define  NET_RLOCK_IN_IOCTL()do { rw_enter_read(&netlock); } while 
> (0)
>  #define  NET_RUNLOCK_IN_IOCTL()  do { rw_exit_read(&netlock); } while (0)
>  
> +#define  NET_WLOCKED()   (rw_status(&netlock) == RW_WRITE)
> +
>  #ifdef DIAGNOSTIC
>  
>  #define  NET_ASSERT_UNLOCKED()   
> \
> 



Re: kbind(2): push kernel lock down as needed

2021-12-08 Thread Martin Pieuchot
On 06/12/21(Mon) 14:58, Scott Cheloha wrote:
> On Mon, Dec 06, 2021 at 08:35:15PM +0100, Mark Kettenis wrote:
> > > Date: Sun, 5 Dec 2021 18:01:04 -0600
> > > From: Scott Cheloha 
> > > 
> > > Two things in sys_kbind() need an explicit kernel lock.  First,
> > > sigexit().  Second, uvm_map_extract(), because the following call
> > > chain panics without it:
> > > 
> > > [...]
> > > 
> > > With this committed we can unlock kbind(2).
> > > 
> > > Thoughts?  ok?
> > 
> > To be honest, I don't think this makes sense unless you can make the
> > "normal" code path lock free.  You're replacing a single
> > KERNEL_LOCK/UNLOCK pair with (potentially) a bunch of them.  That may
> > actually make things worse.  So I think we need to make
> > uvm_map_extract() mpsafe first.
> 
> Unlocking uvm_map_extract() would improve things, yes.

Yes, please.  What's missing?



Re: Rework UNIX sockets locking to be fine grained

2021-12-01 Thread Martin Pieuchot
On 24/11/21(Wed) 15:04, Vitaliy Makkoveev wrote:
> [...] 
> Really, this is the simplest way. The shared lock for the pair of
> sockets moves re-lock dances to the connect and disconnect stages which
> should be also protected by locks. And not for the pair. Many SOCK_DGRAM
> sockets could be connected to one socket. This could be done, but this
> is the hell. And there is absolutely NO profit.

It's not clear to me why sharing a lock isn't simpler.  Currently all
unix(4) sockets share a lock and the locking semantic is simpler.

If two sockets are linked couldn't they use the same rwlock?  Did you
consider this approach?  Is it complicated to know which lock to pick?

This is what is done in UVM and that's why the rwlock is allocated outside
of "struct uvm_object" with rw_obj_alloc().  Having a 'rwlock pointer' in
"struct socket" could also help by setting this pointer to  in the
case of UDP/TCP sockets. 
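
To be concrete, I am thinking of something in this direction (sketch only:
the `so_lock' pointer and the helper do not exist, and how the old lock of
`so2' gets disposed of is left out):

struct socket {
	/* ... existing members ... */
	struct rwlock *so_lock;		/* own lock, the peer's, or &netlock */
};

/* when two unix(4) sockets get connected, make them share one rwlock */
void
unp_share_lock(struct socket *so, struct socket *so2)
{
	so2->so_lock = so->so_lock;
}

For TCP/UDP sockets `so_lock' could simply point to &netlock at creation
time, so solock() becomes a single rw_enter_write(so->so_lock) in every
case.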

I'm not saying your approach isn't working.  I'm just not convinced it is
the simplest path forward and I wish we could do without refcounting.



Re: Please test: UVM fault unlocking (aka vmobjlock)

2021-11-29 Thread Martin Pieuchot
On 24/11/21(Wed) 11:16, Martin Pieuchot wrote:
> Diff below unlock the bottom part of the UVM fault handler.  I'm
> interested in squashing the remaining bugs.  Please test with your usual
> setup & report back.

Thanks to all the testers, here's a new version that includes a bug fix.

Tests on !x86 architectures are much appreciated!

Thanks a lot,
Martin

diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP
index bb842f6d96e..e5334c19eac 100644
--- sys/arch/amd64/conf/GENERIC.MP
+++ sys/arch/amd64/conf/GENERIC.MP
@@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC"
 
 option MULTIPROCESSOR
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP
index 980a572b8fd..ef7ded61501 100644
--- sys/arch/i386/conf/GENERIC.MP
+++ sys/arch/i386/conf/GENERIC.MP
@@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC"
 
 option MULTIPROCESSOR  # Multiple processor support
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c 
sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
index ce8e2eca141..47b567087e7 100644
--- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
+++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
@@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj)
 #ifdef __linux__
shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 #else
+   rw_enter(obj->base.uao->vmobjlock, RW_WRITE);
obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size,
PGO_ALLPAGES | PGO_FREE);
+   rw_exit(obj->base.uao->vmobjlock);
 #endif
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
diff --git sys/dev/pci/drm/radeon/radeon_ttm.c 
sys/dev/pci/drm/radeon/radeon_ttm.c
index eb879b5c72c..837a9f94298 100644
--- sys/dev/pci/drm/radeon/radeon_ttm.c
+++ sys/dev/pci/drm/radeon/radeon_ttm.c
@@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t 
vaddr, vm_page_t *pps,
struct radeon_device *rdev;
int r;
 
+   KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock));
+
bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj;
rdev = bo->dev->dev_private;
	down_read(&rdev->pm.mclk_lock);
diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c
index 20051d95dc1..a5c403ab67d 100644
--- sys/uvm/uvm_aobj.c
+++ sys/uvm/uvm_aobj.c
@@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = {
  * deadlock.
  */
 static LIST_HEAD(aobjlist, uvm_aobj) uao_list = 
LIST_HEAD_INITIALIZER(uao_list);
-static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE);
+static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
 
 
 /*
@@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx)
  * uao_set_swslot: set the swap slot for a page in an aobj.
  *
  * => setting a slot to zero frees the slot
+ * => object must be locked by caller
  * => we return the old slot number, or -1 if we failed to allocate
  *memory to record the new slot number
  */
@@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int 
slot)
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
int oldslot;
 
-   KERNEL_ASSERT_LOCKED();
+   KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0);
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
/*
@@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj)
	struct uvm_object *uobj = &aobj->u_obj;
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
+   KASSERT(rw_write_held(uobj->vmobjlock));
uao_dropswap_range(uobj, 0, 0);
+   rw_exit(uobj->vmobjlock);
 
if (UAO_USES_SWHASH(aobj)) {
/*
@@ -671,6 +674,7 @@ struct uvm_object *
 uao_create(vsize_t size, int flags)
 {
static struct uvm_aobj kernel_object_store;
+   static struct rwlock bootstrap_kernel_object_lock;
static int kobj_alloced = 0;
int pages = round_page(size) >> PAGE_SHIFT;
struct uvm_aobj *aobj;
@@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags)
 * Initialise UVM object.
 */
	uvm_obj_init(&aobj->u_obj, &aobj_pager, refs);
+   if (flags & UAO_FLAG_KERNOBJ) {
+   /* Use a temporary static lock for kernel_object. */
+   rw_init(&bootstrap_kernel_object_lock, "kobjlk");
+   uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock);
+   }
 
/*
 * now that aobj is ready, add it to the global list
@@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj)
 * involved in is complete), release any swap resources and free
 * the page itself.
 */
-   uvm_lock_pageq();
-   while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   

Re: Rework UNIX sockets locking to be fine grained

2021-11-24 Thread Martin Pieuchot
On 22/11/21(Mon) 14:42, Vitaliy Makkoveev wrote:
> On Sat, Nov 20, 2021 at 03:12:31AM +0300, Vitaliy Makkoveev wrote:
> > Updated diff. Re-lock dances were simplified in the unix(4) sockets
> > layer.
> > 
> > Reference counters added to unix(4) sockets layer too. This makes 
> > pointer dereference of peer's control block always safe after re-lock.
> > 
> > The `unp_refs' list cleanup done in the unp_detach(). This removes the
> > case where the socket connected to our dying socket could be passed to
> > unp_disconnect() and the check of it's connection state became much
> > easier.
> >
> 
> Another re-lock simplification. We could enforce the lock order between
> the listening socket `head' and the socket `so' linked to its `so_q0'
> or `so_q' to solock(head) -> solock(so).
> 
> This removes re-lock from accept(2) and the accepting socket couldn't be
> stolen by concurrent accept(2) thread. This removes re-lock from `so_q0'
> and `so_q' cleanup on dying listening socket.
> 
> The previous incarnation of this diff does re-lock in a half of
> doaccept(), soclose(), sofree() and soisconnected() calls. The current
> diff does not re-lock in doaccept() and soclose() and always so re-lock
> in sofree() and soisconnected().
> 
> I guess this is the latest simplification and this diff could be pushed
> forward.

This diff is really interesting.  It shows that the current locking
design needs to be reworked.

I don't think we should expose the locking strategy with a `persocket'
variable then use if/else dances to decide if one of two locks need to
be taken/released.  Instead could we fold the TCP/UDP locking into more
generic functions?  For example connect() could be:

int
soconnect2(struct socket *so1, struct socket *so2)
{
int s, error;

s = solock_pair(so1, so2);
error = (*so1->so_proto->pr_usrreq)(so1, PRU_CONNECT2, NULL,
(struct mbuf *)so2, NULL, curproc);
sounlock_pair(so1, so2, s);
return (error);
}

And solock_pair() would do the right thing(tm) based on the socket type.
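
A sketch of what I mean (solock_persocket() is taken from your diff, the
rest is illustrative and the return value handling is hand-waved):

int
solock_pair(struct socket *so1, struct socket *so2)
{
	if (!solock_persocket(so1))
		return solock(so1);	/* TCP/UDP: the net lock covers both */

	/* per-socket locks: always lock in a fixed (address) order */
	if (so1 == so2) {
		solock(so1);
	} else if (so1 < so2) {
		solock(so1);
		solock(so2);
	} else {
		solock(so2);
		solock(so1);
	}
	return (SL_LOCKED);
}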

Because in the end we want to prepare this layer to use per-socket locks
with TCP/UDP sockets as well.

Could something similar be done for doaccept()?

I'm afraid about introducing reference counting.  Once there is reference
counting it tends to be abused.  It's not clear to me for which reason it
is added.  It looks like to work around lock ordering issues, could you
talk a bit about this?  Is there any alternative?

I also don't understand the problem behind:

> + unp_ref(unp2);
> + sounlock(so, SL_LOCKED);
> + solock(so2);
> + solock(so);
> +
> + /* Datagram socket could be reconnected due to re-lock. */
> + if (unp->unp_conn != unp2) {
> + sounlock(so2, SL_LOCKED);
> + unp_rele(unp2);
> + goto again;
> + }
> +
> + unp_rele(unp2);


It seems that doing an unlock/relock dance requires a lot of added
complexity, why is it done this way?

Thanks for dealing with this!

> Index: sys/kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.269
> diff -u -p -r1.269 uipc_socket.c
> --- sys/kern/uipc_socket.c11 Nov 2021 16:35:09 -  1.269
> +++ sys/kern/uipc_socket.c22 Nov 2021 11:36:40 -
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include <sys/refcnt.h>
>  
>  #ifdef DDB
>  #include 
> @@ -156,7 +157,9 @@ soalloc(int prflags)
>   so = pool_get(_pool, prflags);
>   if (so == NULL)
>   return (NULL);
> - rw_init(&so->so_lock, "solock");
> + rw_init_flags(&so->so_lock, "solock", RWL_DUPOK);
> + refcnt_init(&so->so_refcnt);
> +
>   return (so);
>  }
>  
> @@ -257,6 +260,8 @@ solisten(struct socket *so, int backlog)
>  void
>  sofree(struct socket *so, int s)
>  {
> + int persocket = solock_persocket(so);
> +
>   soassertlocked(so);
>  
>   if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0) {
> @@ -264,16 +269,53 @@ sofree(struct socket *so, int s)
>   return;
>   }
>   if (so->so_head) {
> + struct socket *head = so->so_head;
> +
>   /*
>* We must not decommission a socket that's on the accept(2)
>* queue.  If we do, then accept(2) may hang after select(2)
>* indicated that the listening socket was ready.
>*/
> - if (!soqremque(so, 0)) {
> + if (so->so_onq == >so_q) {
>   sounlock(so, s);
>   return;
>   }
> +
> + if (persocket) {
> + /*
> +  * Concurrent close of `head' could
> +  * abort `so' due to re-lock.
> +  */
> + soref(so);
> + soref(head);
> +  

Please test: UVM fault unlocking (aka vmobjlock)

2021-11-24 Thread Martin Pieuchot
Diff below unlock the bottom part of the UVM fault handler.  I'm
interested in squashing the remaining bugs.  Please test with your usual
setup & report back.

Thanks,
Martin

diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP
index bb842f6d96e..e5334c19eac 100644
--- sys/arch/amd64/conf/GENERIC.MP
+++ sys/arch/amd64/conf/GENERIC.MP
@@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC"
 
 option MULTIPROCESSOR
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP
index 980a572b8fd..ef7ded61501 100644
--- sys/arch/i386/conf/GENERIC.MP
+++ sys/arch/i386/conf/GENERIC.MP
@@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC"
 
 option MULTIPROCESSOR  # Multiple processor support
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c 
sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
index ce8e2eca141..47b567087e7 100644
--- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
+++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
@@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj)
 #ifdef __linux__
shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 #else
+   rw_enter(obj->base.uao->vmobjlock, RW_WRITE);
obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size,
PGO_ALLPAGES | PGO_FREE);
+   rw_exit(obj->base.uao->vmobjlock);
 #endif
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
diff --git sys/dev/pci/drm/radeon/radeon_ttm.c 
sys/dev/pci/drm/radeon/radeon_ttm.c
index eb879b5c72c..837a9f94298 100644
--- sys/dev/pci/drm/radeon/radeon_ttm.c
+++ sys/dev/pci/drm/radeon/radeon_ttm.c
@@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t 
vaddr, vm_page_t *pps,
struct radeon_device *rdev;
int r;
 
+   KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock));
+
bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj;
rdev = bo->dev->dev_private;
	down_read(&rdev->pm.mclk_lock);
diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c
index 20051d95dc1..127218c4c40 100644
--- sys/uvm/uvm_aobj.c
+++ sys/uvm/uvm_aobj.c
@@ -31,7 +31,7 @@
 /*
  * uvm_aobj.c: anonymous memory uvm_object pager
  *
- * author: Chuck Silvers 
+* author: Chuck Silvers 
  * started: Jan-1998
  *
  * - design mostly from Chuck Cranor
@@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = {
  * deadlock.
  */
 static LIST_HEAD(aobjlist, uvm_aobj) uao_list = 
LIST_HEAD_INITIALIZER(uao_list);
-static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE);
+static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
 
 
 /*
@@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx)
  * uao_set_swslot: set the swap slot for a page in an aobj.
  *
  * => setting a slot to zero frees the slot
+ * => object must be locked by caller
  * => we return the old slot number, or -1 if we failed to allocate
  *memory to record the new slot number
  */
@@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int 
slot)
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
int oldslot;
 
-   KERNEL_ASSERT_LOCKED();
+   KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0);
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
/*
@@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj)
	struct uvm_object *uobj = &aobj->u_obj;
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
+   KASSERT(rw_write_held(uobj->vmobjlock));
uao_dropswap_range(uobj, 0, 0);
+   rw_exit(uobj->vmobjlock);
 
if (UAO_USES_SWHASH(aobj)) {
/*
@@ -671,6 +674,7 @@ struct uvm_object *
 uao_create(vsize_t size, int flags)
 {
static struct uvm_aobj kernel_object_store;
+   static struct rwlock bootstrap_kernel_object_lock;
static int kobj_alloced = 0;
int pages = round_page(size) >> PAGE_SHIFT;
struct uvm_aobj *aobj;
@@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags)
 * Initialise UVM object.
 */
	uvm_obj_init(&aobj->u_obj, &aobj_pager, refs);
+   if (flags & UAO_FLAG_KERNOBJ) {
+   /* Use a temporary static lock for kernel_object. */
+   rw_init(&bootstrap_kernel_object_lock, "kobjlk");
+   uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock);
+   }
 
/*
 * now that aobj is ready, add it to the global list
@@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj)
 * involved in is complete), release any swap resources and free
 * the page itself.
 */
-   uvm_lock_pageq();
-   while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   rw_enter(uobj->vmobjlock, RW_WRITE);
+   while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   pmap_page_protect(pg, PROT_NONE);

Re: Retry sleep in poll/select

2021-11-18 Thread Martin Pieuchot
On 17/11/21(Wed) 09:51, Scott Cheloha wrote:
> > On Nov 17, 2021, at 03:22, Martin Pieuchot  wrote:
> > 
> > On 16/11/21(Tue) 13:55, Visa Hankala wrote:
> >> Currently, dopselect() and doppoll() call tsleep_nsec() without retry.
> >> cheloha@ asked if the functions should handle spurious wakeups. I guess
> >> such wakeups are unlikely with the nowake wait channel, but I am not
> >> sure if that is a safe guess.
> > 
I'm not sure I understand: are we afraid that a thread sleeping on `nowake'
can be awakened?  Is that the assumption here?
> 
> Yes, but I don't know how.

Then I'd suggest we start with understanding how this can happen, otherwise
I fear we are adding more complexity for reasons we don't understand.

> kettenis@ said spurious wakeups were
> possible on a similar loop in sigsuspend(2)
> so I mentioned this to visa@ off-list.

I don't understand how this can happen.

> If we added an assert to panic in wakeup(9)
> if the channel is &nowake, would that be
> sufficient?

I guess so.
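
Sketched, assuming `nowake' is reachable from where wakeup(9) lives:

	/* in wakeup_n(): nobody is ever allowed to wake this channel */
	KASSERT(ident != &nowake);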

> Ideally if you sleep on &nowake you should
> never get a zero status from the sleep
> functions.  It should be impossible… if that
> is possible to ensure.



Re: Retry sleep in poll/select

2021-11-17 Thread Martin Pieuchot
On 16/11/21(Tue) 13:55, Visa Hankala wrote:
> Currently, dopselect() and doppoll() call tsleep_nsec() without retry.
> cheloha@ asked if the functions should handle spurious wakeups. I guess
> such wakeups are unlikely with the nowake wait channel, but I am not
> sure if that is a safe guess.

I'm not sure I understand: are we afraid that a thread sleeping on `nowake'
can be awakened?  Is that the assumption here?

> The following diff adds the retrying. The code is a bit arduous, so the
> retry loop is put in a separate function that both poll and select use.

Using a separate function makes sense anyway.

> Index: kern/sys_generic.c
> ===
> RCS file: src/sys/kern/sys_generic.c,v
> retrieving revision 1.141
> diff -u -p -r1.141 sys_generic.c
> --- kern/sys_generic.c16 Nov 2021 13:48:23 -  1.141
> +++ kern/sys_generic.c16 Nov 2021 13:50:08 -
> @@ -90,6 +90,7 @@ int dopselect(struct proc *, int, fd_set
>  int doppoll(struct proc *, struct pollfd *, u_int, struct timespec *,
>  const sigset_t *, register_t *);
>  void doselwakeup(struct selinfo *);
> +int selsleep(struct timespec *);
>  
>  int
>  iovec_copyin(const struct iovec *uiov, struct iovec **iovp, struct iovec 
> *aiov,
> @@ -664,19 +665,7 @@ dopselect(struct proc *p, int nd, fd_set
>* there's nothing to wait for.
>*/
>   if (nevents == 0 && ncollected == 0) {
> - uint64_t nsecs = INFSLP;
> -
> - if (timeout != NULL) {
> - if (!timespecisset(timeout))
> - goto done;
> - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> - }
> - error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqsel", nsecs);
> - /* select is not restarted after signals... */
> - if (error == ERESTART)
> - error = EINTR;
> - if (error == EWOULDBLOCK)
> - error = 0;
> + error = selsleep(timeout);
>   goto done;
>   }
>  
> @@ -849,6 +838,46 @@ selfalse(dev_t dev, int events, struct p
>  }
>  
>  /*
> + * Sleep until a signal arrives or the optional timeout expires.
> + */
> +int
> +selsleep(struct timespec *timeout)
> +{
> + uint64_t end, now, nsecs;
> + int error;
> +
> + if (timeout != NULL) {
> + if (!timespecisset(timeout))
> + return (0);
> + now = getnsecuptime();
> + end = MIN(now + TIMESPEC_TO_NSEC(timeout), MAXTSLP);
> + if (end < now)
> + end = MAXTSLP;
> + }
> +
> + do {
> + if (timeout != NULL)
> + nsecs = MAX(1, end - now);
> + else
> + nsecs = INFSLP;
> + error = tsleep_nsec(&nowake, PSOCK | PCATCH, "selslp", nsecs);
> + if (timeout != NULL) {
> + now = getnsecuptime();
> + if (now >= end)
> + break;
> + }
> + } while (error == 0);
> +
> + /* poll/select is not restarted after signals... */
> + if (error == ERESTART)
> + error = EINTR;
> + if (error == EWOULDBLOCK)
> + error = 0;
> +
> + return (error);
> +}
> +
> +/*
>   * Record a select request.
>   */
>  void
> @@ -1158,19 +1187,7 @@ doppoll(struct proc *p, struct pollfd *f
>* there's nothing to wait for.
>*/
>   if (nevents == 0 && ncollected == 0) {
> - uint64_t nsecs = INFSLP;
> -
> - if (timeout != NULL) {
> - if (!timespecisset(timeout))
> - goto done;
> - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> - }
> -
> - error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqpoll", nsecs);
> - if (error == ERESTART)
> - error = EINTR;
> - if (error == EWOULDBLOCK)
> - error = 0;
> + error = selsleep(timeout);
>   goto done;
>   }
>  
> 



Re: bt.5 document count()

2021-11-16 Thread Martin Pieuchot
On 16/11/21(Tue) 11:07, Claudio Jeker wrote:
> This documents count(). This function only works when used like this
>   @map[key] = count();
> But it is implemented and works. If used differently you get a syntax
> error which is not helpful. This is why I chose to document it like this.
> Another option would be to document the language (so it is clear where it
> is possible to use what). 

ok mpi@

> max(), min() and sum() are other functions that behave like this. Their
> documentation should also be adjusted IMO.
> 
> -- 
> :wq Claudio
> 
> Index: bt.5
> ===
> RCS file: /cvs/src/usr.sbin/btrace/bt.5,v
> retrieving revision 1.13
> diff -u -p -r1.13 bt.5
> --- bt.5  12 Nov 2021 16:57:24 -  1.13
> +++ bt.5  16 Nov 2021 09:50:52 -
> @@ -120,6 +120,11 @@ Functions:
>  .It Fn clear "@map"
>  Delete all (key, value) pairs from
>  .Va @map .
> +.It "@map[key]" = Fn count
> +Increment the value of
> +.Va key
> +from
> +.Va @map .
>  .It Fn delete "@map[key]"
>  Delete the pair indexed by
>  .Va key
> 



Re: poll/select: Lazy removal of knotes

2021-11-06 Thread Martin Pieuchot
On 06/11/21(Sat) 15:53, Visa Hankala wrote:
> On Fri, Nov 05, 2021 at 10:04:50AM +0100, Martin Pieuchot wrote:
> > The new poll/select(2) implementation converts 'struct pollfd' and 'fdset' to
> > knotes (kqueue event descriptors), then passes them to the kqueue subsystem.
> > A knote is allocated, with kqueue_register(), for every read, write and
> > except condition watched on a given FD.  That means at most 3 allocations
> > might be necessary per FD.
> > 
> > The diff below reduces the overhead of per-syscall allocation/free of those
> > descriptors by leaving those which didn't trigger on the kqueue across
> > syscalls.  Leaving knotes on the kqueue allows kqueue_register() to re-use
> > an existing descriptor instead of allocating a new one.
> > 
> > With this knotes are now lazily removed.  The mechanism uses a serial
> > number which is incremented for every syscall that indicates if a knote
> > sitting in the kqueue is still valid or should be freed.
> > 
> > Note that performance improvements might not be visible with this diff
> > alone because kqueue_register() still pre-allocates a descriptor and then drops
> > it.
> > 
> > visa@ already pointed out that the lazy removal logic could be integrated
> > in kqueue_scan() which would reduce the complexity of those two syscalls.
> > I'm arguing for doing this in a next step in-tree.
> 
> I think it would make more sense to add the removal logic to the scan
> function first as doing so would keep the code modifications more
> logical and simpler. This would also avoid the need to go through
> a temporary removal approach.

I totally support your effort and your design; however, I don't have the
time to do another round of test/debugging.  So please, can you take
care of doing these cleanups afterward?  If not, please send a full diff
and take over this feature, it's too much effort for me to work out of
tree.

> Index: kern/kern_event.c
> ===
> RCS file: src/sys/kern/kern_event.c,v
> retrieving revision 1.170
> diff -u -p -r1.170 kern_event.c
> --- kern/kern_event.c 6 Nov 2021 05:48:47 -   1.170
> +++ kern/kern_event.c 6 Nov 2021 15:31:04 -
> @@ -73,6 +73,7 @@ voidkqueue_terminate(struct proc *p, st
>  void KQREF(struct kqueue *);
>  void KQRELE(struct kqueue *);
>  
> +void kqueue_purge(struct proc *, struct kqueue *);
>  int  kqueue_sleep(struct kqueue *, struct timespec *);
>  
>  int  kqueue_read(struct file *, struct uio *, int);
> @@ -806,6 +807,22 @@ kqpoll_exit(void)
>  }
>  
>  void
> +kqpoll_done(unsigned int num)
> +{
> + struct proc *p = curproc;
> +
> + KASSERT(p->p_kq != NULL);
> +
> + if (p->p_kq_serial + num >= p->p_kq_serial) {
> + p->p_kq_serial += num;
> + } else {
> + /* Clear all knotes after serial wraparound. */
> + kqueue_purge(p, p->p_kq);
> + p->p_kq_serial = 1;
> + }
> +}
> +
> +void
>  kqpoll_dequeue(struct proc *p, int all)
>  {
>   struct knote marker;
> @@ -1383,6 +1400,15 @@ retry:
>  
>   mtx_leave(>kq_lock);
>  
> + /* Drop spurious events. */
> + if (p->p_kq == kq &&
> + p->p_kq_serial > (unsigned long)kn->kn_udata) {
> + filter_detach(kn);
> + knote_drop(kn, p);
> + mtx_enter(>kq_lock);
> + continue;
> + }
> +
>   memset(kevp, 0, sizeof(*kevp));
>   if (filter_process(kn, kevp) == 0) {
>   mtx_enter(>kq_lock);
> Index: kern/sys_generic.c
> ===
> RCS file: src/sys/kern/sys_generic.c,v
> retrieving revision 1.139
> diff -u -p -r1.139 sys_generic.c
> --- kern/sys_generic.c29 Oct 2021 15:52:44 -  1.139
> +++ kern/sys_generic.c6 Nov 2021 15:31:04 -
> @@ -730,8 +730,7 @@ done:
>   if (pibits[0] != (fd_set *)[0])
>   free(pibits[0], M_TEMP, 6 * ni);
>  
> - kqueue_purge(p, p->p_kq);
> - p->p_kq_serial += nd;
> + kqpoll_done(nd);
>  
>   return (error);
>  }
> @@ -1230,8 +1229,7 @@ bad:
>   if (pl != pfds)
>   free(pl, M_TEMP, sz);
>  
> - kqueue_purge(p, p->p_kq);
> - p->p_kq_serial += nfds;
> + kqpoll_done(nfds);
>  
>   return (error);
>  }
> @@ -1251,8 +1249,7 @@ ppollcollect(struct proc *p, struct keve
>   /*
>* Lazily delete spurious events.
>*
> -   

poll/select: Lazy removal of knotes

2021-11-05 Thread Martin Pieuchot
The new poll/select(2) implementation converts 'struct pollfd' and 'fdset'
entries to knotes (kqueue event descriptors) and then passes them to the
kqueue subsystem.  A knote is allocated, with kqueue_register(), for every
read, write and except condition watched on a given FD.  That means at
most 3 allocations might be necessary per FD.

The diff below reduces the overhead of the per-syscall allocation/free of
those descriptors by leaving those which didn't trigger on the kqueue
across syscalls.  Leaving knotes on the kqueue allows kqueue_register() to
re-use existing descriptors instead of re-allocating new ones.

With this, knotes are now lazily removed.  The mechanism uses a serial
number which is incremented for every syscall and indicates whether a
knote sitting in the kqueue is still valid or should be freed.
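
For illustration, here is a minimal sketch of that check, mirroring the
kqueue_scan() hunk further down: a knote records the registering
syscall's serial in kn_udata, and anything older than the thread's
current serial is a leftover to drop.

	/* Sketch only: drop knotes left over from previous syscalls. */
	if (p->p_kq == kq &&
	    p->p_kq_serial > (unsigned long)kn->kn_udata) {
		filter_detach(kn);
		knote_drop(kn, p);	/* lazy removal happens here */
	}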

Note that performance improvements might not be visible with this diff
alone because kqueue_register() still pre-allocates a descriptor and then
drops it.

visa@ already pointed out that the lazy removal logic could be integrated
in kqueue_scan(), which would reduce the complexity of those two syscalls.
I'm arguing for doing this as a next step, in-tree.

Please test and review :)

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.139
diff -u -p -r1.139 sys_generic.c
--- kern/sys_generic.c  29 Oct 2021 15:52:44 -  1.139
+++ kern/sys_generic.c  5 Nov 2021 08:11:05 -
@@ -598,7 +598,7 @@ sys_pselect(struct proc *p, void *v, reg
 
 int
 dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex,
-struct timespec *timeout, const sigset_t *sigmask, register_t *retval)
+struct timespec *tsp, const sigset_t *sigmask, register_t *retval)
 {
struct kqueue_scan_state scan;
fd_mask bits[6];
@@ -666,10 +666,10 @@ dopselect(struct proc *p, int nd, fd_set
if (nevents == 0 && ncollected == 0) {
uint64_t nsecs = INFSLP;
 
-   if (timeout != NULL) {
-   if (!timespecisset(timeout))
+   if (tsp != NULL) {
+   if (!timespecisset(tsp))
goto done;
-   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
+   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(tsp), MAXTSLP));
}
error = tsleep_nsec(>p_kq, PSOCK | PCATCH, "kqsel", nsecs);
/* select is not restarted after signals... */
@@ -682,28 +682,37 @@ dopselect(struct proc *p, int nd, fd_set
 
/* Collect at most `nevents' possibly waiting in kqueue_scan() */
kqueue_scan_setup(, p->p_kq);
-   while (nevents > 0) {
+   while ((nevents - ncollected) > 0) {
struct kevent kev[KQ_NEVENTS];
int i, ready, count;
 
-   /* Maximum number of events per iteration */
-   count = MIN(nitems(kev), nevents);
-   ready = kqueue_scan(, count, kev, timeout, p, );
+   /*
+* Maximum number of events per iteration.  Use the whole
+* array to gather as many spurious events as possible.
+*/
+   count = nitems(kev);
+   ready = kqueue_scan(, count, kev, tsp, p, );
 #ifdef KTRACE
if (KTRPOINT(p, KTR_STRUCT))
ktrevent(p, kev, ready);
 #endif
-   /* Convert back events that are ready. */
+   /* Convert back events that are ready/delete spurious ones. */
for (i = 0; i < ready && error == 0; i++)
error = pselcollect(p, [i], pobits, );
+
/*
-* Stop if there was an error or if we had enough
-* space to collect all events that were ready.
+* Stop if there was an error or if we had enough space
+* to collect all non-spurious events that were ready.
 */
-   if (error || ready < count)
+   if (error || !ready || (ncollected > 0 && ready < count))
break;
 
-   nevents -= ready;
+   /*
+* If we only got spurious events try again repositioning
+* the marker.
+*/
+   if (ncollected == 0 && ((tsp == NULL) || timespecisset(tsp)))
+   scan.kqs_nevent = 0;
}
kqueue_scan_finish();
*retval = ncollected;
@@ -730,7 +739,7 @@ done:
if (pibits[0] != (fd_set *)[0])
free(pibits[0], M_TEMP, 6 * ni);
 
-   kqueue_purge(p, p->p_kq);
+   /* Needed to remove events lazily. */
p->p_kq_serial += nd;
 
return (error);
@@ -759,7 +768,7 @@ pselregister(struct proc *p, fd_set *pib
DPRINTFN(2, "select fd %d mask %d serial %lu\n",
fd, msk, p->p_kq_serial);

Re: UNIX sockets: use vnode(9) lock to protect `v_socket' dereference

2021-11-05 Thread Martin Pieuchot
On 26/10/21(Tue) 14:12, Vitaliy Makkoveev wrote:
> Another step to make UNIX sockets locking fine grained.
> 
> The listening socket has the references from file descriptors layer and
> from the vnode(9) layer. This means when we close(2)'ing such socket it
> still referenced by concurrent thread through connect(2) path.
> 
> When we bind(2) UNIX socket we link it to vnode(9) by assigning
> `v_socket'. When we connect(2)'ing socket to the socket we previously
> bind(2)'ed we finding it by namei(9) and obtain it's reference through
> `v_socket'. This socket has no extra reference in file descriptor
> layer and could be closed by concurrent thread.
> 
> This time we have `unp_lock' rwlock(9) which protects the whole layer
> and the dereference of `v_socket' is safe. But with the fine grained
> locking the `v_socket' will not be protected by global lock. When we
> obtain the vnode(9) by namei(9) in connect(9) or bind(9) paths it is
> already exclusively locked by vlode(9) lock. But in unp_detach() which
> is called on the close(2)'ing socket we assume `unp_lock' protects
> `v_socket'.
> 
> I propose to use exclusive vnode(9) lock to protect `v_socket'. With the
> fine grained locking, the `v_socket' dereference in unp_bind() or
> unp_connect() threads will be safe because unp_detach() thread will wait
> the vnode(9) lock release. The vnode referenced by `unp_vnod' has
> reference counter bumped so it's dereference is also safe without
> `unp_lock' held.

This makes sense to me.  Using the vnode lock here seems the simplest
approach.

> The `i_lock' should be take before `unp_lock' and unp_detach() should
> release solock(). To prevent connections on this socket the
> 'SO_ACCEPTCONN' bit cleared in soclose().

This is done to prevent races when solock() is released inside soabort(),
right?  Is it the only such case or is some more care needed?
Will this stay with per-socket locks or is this only necessary because of
the global `unp_lock'?

> Index: sys/kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 uipc_socket.c
> --- sys/kern/uipc_socket.c14 Oct 2021 23:05:10 -  1.265
> +++ sys/kern/uipc_socket.c26 Oct 2021 11:05:59 -
> @@ -315,6 +315,8 @@ soclose(struct socket *so, int flags)
>   /* Revoke async IO early. There is a final revocation in sofree(). */
>   sigio_free(>so_sigio);
>   if (so->so_options & SO_ACCEPTCONN) {
> + so->so_options &= ~SO_ACCEPTCONN;
> +
>   while ((so2 = TAILQ_FIRST(>so_q0)) != NULL) {
>   (void) soqremque(so2, 0);
>   (void) soabort(so2);
> Index: sys/kern/uipc_usrreq.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uipc_usrreq.c
> --- sys/kern/uipc_usrreq.c21 Oct 2021 22:11:07 -  1.150
> +++ sys/kern/uipc_usrreq.c26 Oct 2021 11:05:59 -
> @@ -474,20 +474,30 @@ void
>  unp_detach(struct unpcb *unp)
>  {
>   struct socket *so = unp->unp_socket;
> - struct vnode *vp = NULL;
> + struct vnode *vp = unp->unp_vnode;
>  
>   rw_assert_wrlock(_lock);
>  
>   LIST_REMOVE(unp, unp_link);
> - if (unp->unp_vnode) {
> +
> + if (vp) {
> + unp->unp_vnode = NULL;
> +
>   /*
> -  * `v_socket' is only read in unp_connect and
> -  * unplock prevents concurrent access.
> +  * Enforce `i_lock' -> `unp_lock' because fifo
> +  * subsystem requires it.
>*/
>  
> - unp->unp_vnode->v_socket = NULL;
> - vp = unp->unp_vnode;
> - unp->unp_vnode = NULL;
> + sounlock(so, SL_LOCKED);
> +
> + VOP_LOCK(vp, LK_EXCLUSIVE);
> + vp->v_socket = NULL;
> +
> + KERNEL_LOCK();
> + vput(vp);
> + KERNEL_UNLOCK();
> +
> + solock(so);
>   }
>  
>   if (unp->unp_conn)
> @@ -500,21 +510,6 @@ unp_detach(struct unpcb *unp)
>   pool_put(_pool, unp);
>   if (unp_rights)
>   task_add(systqmp, _gc_task);
> -
> - if (vp != NULL) {
> - /*
> -  * Enforce `i_lock' -> `unplock' because fifo subsystem
> -  * requires it. The socket can't be closed concurrently
> -  * because the file descriptor reference is
> -  * still hold.
> -  */
> -
> - sounlock(so, SL_LOCKED);
> - KERNEL_LOCK();
> - vrele(vp);
> - KERNEL_UNLOCK();
> - solock(so);
> - }
>  }
>  
>  int
> 



Re: UNIX sockets: make `unp_rights', `unp_msgcount' and `unp_file' atomic

2021-11-05 Thread Martin Pieuchot
On 30/10/21(Sat) 21:22, Vitaliy Makkoveev wrote:
> This completely removes global rwlock(9) from the unp_internalize() and
> unp_externalize() normal paths but only leaves it in unp_externalize()
> error path. Also we don't need to simultaneously hold both fdplock()
> and `unp_lock' in unp_internalize(). As non obvious profit this
> simplifies the future lock dances in the UNIX sockets layer.
> 
> It's safe to call fptounp() without `unp_lock' held. We always got this
> file descriptor by fd_getfile(9) so we always have the extra reference
> and this descriptor can't be closed by concurrent thread. Some sockets
> could be destroyed through 'PRU_ABORT' path but they don't have
> associated file descriptor and they are not accessible in the
> unp_internalize() path.
> 
> The `unp_file' access without `unp_lock' held is also safe. Each socket
> could have the only associated file descriptor and each file descriptor
> could have the only associated socket. We only assign `unp_file' in the
> unp_internalize() path where we got the socket by fd_getfile(9). This
> descriptor has the extra reference and couldn't be closed concurrently.
> We could override `unp_file' but with the same address because the
> associated file descriptor can't be changed so the address will be also
> the same. So while unp_gc() concurrently runs the dereference of
> non-NULL `unp_file' is always safe.

Using an atomic operation for `unp_msgcount' is ok with me, one comment
about `unp_rights' below.

> Index: sys/kern/uipc_usrreq.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
> retrieving revision 1.153
> diff -u -p -r1.153 uipc_usrreq.c
> --- sys/kern/uipc_usrreq.c30 Oct 2021 16:35:31 -  1.153
> +++ sys/kern/uipc_usrreq.c30 Oct 2021 18:41:25 -
> @@ -58,6 +58,7 @@
>   * Locks used to protect global data and struct members:
>   *  I   immutable after creation
>   *  U   unp_lock
> + *  a   atomic
>   */
>  struct rwlock unp_lock = RWLOCK_INITIALIZER("unplock");
>  
> @@ -99,7 +100,7 @@ SLIST_HEAD(,unp_deferral)  unp_deferred =
>   SLIST_HEAD_INITIALIZER(unp_deferred);
>  
>  ino_tunp_ino;/* [U] prototype for fake inode numbers */
> -int  unp_rights; /* [U] file descriptors in flight */
> +int  unp_rights; /* [a] file descriptors in flight */
>  int  unp_defer;  /* [U] number of deferred fp to close by the GC task */
>  int  unp_gcing;  /* [U] GC task currently running */
>  
> @@ -927,17 +928,16 @@ restart:
>*/
>   rp = (struct fdpass *)CMSG_DATA(cm);
>  
> - rw_enter_write(_lock);
>   for (i = 0; i < nfds; i++) {
>   struct unpcb *unp;
>  
>   fp = rp->fp;
>   rp++;
>   if ((unp = fptounp(fp)) != NULL)
> - unp->unp_msgcount--;
> - unp_rights--;
> + atomic_dec_long(>unp_msgcount);
>   }
> - rw_exit_write(_lock);
> +
> + atomic_sub_int(_rights, nfds);
>  
>   /*
>* Copy temporary array to message and adjust length, in case of
> @@ -985,13 +985,10 @@ unp_internalize(struct mbuf *control, st
>   return (EINVAL);
>   nfds = (cm->cmsg_len - CMSG_ALIGN(sizeof(*cm))) / sizeof (int);
>  
> - rw_enter_write(_lock);
> - if (unp_rights + nfds > maxfiles / 10) {
> - rw_exit_write(_lock);
> + if (atomic_add_int_nv(_rights, nfds) > maxfiles / 10) {
> + atomic_sub_int(_rights, nfds);

I don't believe this is race free.  If two threads, T1 and T2, call
atomic_add at the same time, both might end up returning EMFILE even
though only one of them actually exceeds the limit.  This could happen
if T1 exceeds the limit and T2 does its atomic_add on the already-exceeded
`unp_rights' before T1 could do its atomic_sub.

I suggest using a mutex to protect `unp_rights' instead to solve this
issue.
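
A minimal sketch of that suggestion, assuming a mutex dedicated to the
in-flight descriptor accounting (the `unp_rights_mtx' name is only
illustrative):

	/* Sketch only: serialize the limit check and the increment. */
	struct mutex unp_rights_mtx = MUTEX_INITIALIZER(IPL_NONE);

	mtx_enter(&unp_rights_mtx);
	if (unp_rights + nfds > maxfiles / 10) {
		mtx_leave(&unp_rights_mtx);
		return (EMFILE);
	}
	unp_rights += nfds;
	mtx_leave(&unp_rights_mtx);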

>   return (EMFILE);
>   }
> - unp_rights += nfds;
> - rw_exit_write(_lock);
>  
>   /* Make sure we have room for the struct file pointers */
>  morespace:
> @@ -1031,7 +1028,6 @@ morespace:
>   ip = ((int *)CMSG_DATA(cm)) + nfds - 1;
>   rp = ((struct fdpass *)CMSG_DATA(cm)) + nfds - 1;
>   fdplock(fdp);
> - rw_enter_write(_lock);
>   for (i = 0; i < nfds; i++) {
>   memcpy(, ip, sizeof fd);
>   ip--;
> @@ -1056,15 +1052,13 @@ morespace:
>   rp->flags = fdp->fd_ofileflags[fd] & UF_PLEDGED;
>   rp--;
>   if ((unp = fptounp(fp)) != NULL) {
> + atomic_inc_long(>unp_msgcount);
>   unp->unp_file = fp;
> - unp->unp_msgcount++;
>   }
>   }
> - rw_exit_write(_lock);
>   fdpunlock(fdp);
>   return (0);
>  fail:
> - rw_exit_write(_lock);
>   fdpunlock(fdp);
>   if (fp != NULL)
>   FRELE(fp, p);
> @@ -1072,17 +1066,13 @@ fail:
>   for ( ; i > 

Re: Please test: full poll/select(2) switch

2021-10-29 Thread Martin Pieuchot
On 29/10/21(Fri) 14:48, Alexandre Ratchov wrote:
> On Fri, Oct 29, 2021 at 01:12:06PM +0100, Martin Pieuchot wrote:
> > On 29/10/21(Fri) 13:12, Alexandre Ratchov wrote:
> > > On Sat, Oct 23, 2021 at 10:40:56AM +0100, Martin Pieuchot wrote:
> > > > Diff below switches both poll(2) and select(2) to the kqueue-based
> > > > implementation.
> > > > 
> > > > In addition it switches libevent(3) to use poll(2) by default for
> > > > testing purposes.
> > > > 
> > > > I don't have any open bug left with this diff and I'm happily running
> > > > GNOME with it.  So I'd be happy if you could try to break it and report
> > > > back.
> > > > 
> > > 
> > > Without the below diff (copied from audio(4) driver), kernel panics
> > > upon the first MIDI input byte.
> > 
> > What is the panic?  The mutex is taken recursively, right?
> >  
> 
> Exactly, this is the "locking against myself", panic.
> 
> AFAIU, the interrupt handler grabs the audio_lock and calls
> midi_iintr(). It calls selwakeup(), which in turn calls
> filt_midiread(), which attempts to grab the audio_lock a second time.
> 
> > > ok? suggestion for a better fix?
> > 
> > Without seeing the panic, I'm guessing this is correct.
> > 
> > That suggest kevent(2) wasn't safe to use with midi(4).
> > 
> 
> Yes, this is the very first time midi(4) is used with kevent(2).

Then this is correct, thanks a lot.  Please go ahead, ok mpi@



Re: Please test: full poll/select(2) switch

2021-10-29 Thread Martin Pieuchot
On 29/10/21(Fri) 13:12, Alexandre Ratchov wrote:
> On Sat, Oct 23, 2021 at 10:40:56AM +0100, Martin Pieuchot wrote:
> > Diff below switches both poll(2) and select(2) to the kqueue-based
> > implementation.
> > 
> > In addition it switches libevent(3) to use poll(2) by default for
> > testing purposes.
> > 
> > I don't have any open bug left with this diff and I'm happily running
> > GNOME with it.  So I'd be happy if you could try to break it and report
> > back.
> > 
> 
> Without the below diff (copied from audio(4) driver), kernel panics
> upon the first MIDI input byte.

What is the panic?  The mutex is taken recursively, right?
 
> ok? suggestion for a better fix?

Without seeing the panic, I'm guessing this is correct.

That suggests kevent(2) wasn't safe to use with midi(4).

> Index: midi.c
> ===
> RCS file: /cvs/src/sys/dev/midi.c,v
> retrieving revision 1.48
> diff -u -p -r1.48 midi.c
> --- midi.c25 Dec 2020 12:59:52 -  1.48
> +++ midi.c29 Oct 2021 11:09:47 -
> @@ -386,9 +386,11 @@ filt_midiread(struct knote *kn, long hin
>   struct midi_softc *sc = (struct midi_softc *)kn->kn_hook;
>   int retval;
>  
> - mtx_enter(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_enter(_lock);
>   retval = !MIDIBUF_ISEMPTY(>inbuf);
> - mtx_leave(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_leave(_lock);
>  
>   return (retval);
>  }
> @@ -409,9 +411,11 @@ filt_midiwrite(struct knote *kn, long hi
>   struct midi_softc *sc = (struct midi_softc *)kn->kn_hook;
>   intretval;
>  
> - mtx_enter(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_enter(_lock);
>   retval = !MIDIBUF_ISFULL(>outbuf);
> - mtx_leave(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_leave(_lock);
>  
>   return (retval);
>  }
> 
> 



Re: uvm_km_pgremove() tweak

2021-10-24 Thread Martin Pieuchot
On 24/10/21(Sun) 14:49, Martin Pieuchot wrote:
> Here's another small tweak I could extract from the UVM unlocking diff.
> This doesn't introduce any functional change. uvm_km_pgremove() is used
> in only one place.

Updated diff that also moves pmap_kremove() into the intrsafe variant to
be coherent, as pointed out by kettenis@.  This also reduces differences
with NetBSD.

ok?

Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.145
diff -u -p -r1.145 uvm_km.c
--- uvm/uvm_km.c15 Jun 2021 16:38:09 -  1.145
+++ uvm/uvm_km.c24 Oct 2021 14:08:42 -
@@ -239,8 +239,10 @@ uvm_km_suballoc(struct vm_map *map, vadd
  *the pages right away.(this gets called from uvm_unmap_...).
  */
 void
-uvm_km_pgremove(struct uvm_object *uobj, vaddr_t start, vaddr_t end)
+uvm_km_pgremove(struct uvm_object *uobj, vaddr_t startva, vaddr_t endva)
 {
+   const voff_t start = startva - vm_map_min(kernel_map);
+   const voff_t end = endva - vm_map_min(kernel_map);
struct vm_page *pp;
voff_t curoff;
int slot;
@@ -248,6 +250,7 @@ uvm_km_pgremove(struct uvm_object *uobj,
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
+   pmap_remove(pmap_kernel(), startva, endva);
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
@@ -301,6 +304,7 @@ uvm_km_pgremove_intrsafe(vaddr_t start, 
panic("uvm_km_pgremove_intrsafe: no page");
uvm_pagefree(pg);
}
+   pmap_kremove(start, end - start);
 }
 
 /*
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.278
diff -u -p -r1.278 uvm_map.c
--- uvm/uvm_map.c   5 Oct 2021 15:37:21 -   1.278
+++ uvm/uvm_map.c   24 Oct 2021 14:09:13 -
@@ -2116,8 +2116,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
/* Nothing to be done for holes. */
} else if (map->flags & VM_MAP_INTRSAFE) {
KASSERT(vm_map_pmap(map) == pmap_kernel());
+
uvm_km_pgremove_intrsafe(entry->start, entry->end);
-   pmap_kremove(entry->start, entry->end - entry->start);
} else if (UVM_ET_ISOBJ(entry) &&
UVM_OBJ_IS_KERN_OBJECT(entry->object.uvm_obj)) {
KASSERT(vm_map_pmap(map) == pmap_kernel());
@@ -2155,10 +2155,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
 * from the object.  offsets are always relative
 * to vm_map_min(kernel_map).
 */
-   pmap_remove(pmap_kernel(), entry->start, entry->end);
-   uvm_km_pgremove(entry->object.uvm_obj,
-   entry->start - vm_map_min(kernel_map),
-   entry->end - vm_map_min(kernel_map));
+   uvm_km_pgremove(entry->object.uvm_obj, entry->start,
+   entry->end);
 
/*
 * null out kernel_object reference, we've just



uvm_km_pgremove() tweak

2021-10-24 Thread Martin Pieuchot
Here's another small tweak I could extract from the UVM unlocking diff.
This doesn't introduce any functional change. uvm_km_pgremove() is used
in only one place.

Ok?

Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.145
diff -u -p -r1.145 uvm_km.c
--- uvm/uvm_km.c15 Jun 2021 16:38:09 -  1.145
+++ uvm/uvm_km.c24 Oct 2021 13:23:22 -
@@ -239,8 +239,10 @@ uvm_km_suballoc(struct vm_map *map, vadd
  *the pages right away.(this gets called from uvm_unmap_...).
  */
 void
-uvm_km_pgremove(struct uvm_object *uobj, vaddr_t start, vaddr_t end)
+uvm_km_pgremove(struct uvm_object *uobj, vaddr_t startva, vaddr_t endva)
 {
+   const voff_t start = startva - vm_map_min(kernel_map);
+   const voff_t end = endva - vm_map_min(kernel_map);
struct vm_page *pp;
voff_t curoff;
int slot;
@@ -248,6 +250,7 @@ uvm_km_pgremove(struct uvm_object *uobj,
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
+   pmap_remove(pmap_kernel(), startva, endva);
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.278
diff -u -p -r1.278 uvm_map.c
--- uvm/uvm_map.c   5 Oct 2021 15:37:21 -   1.278
+++ uvm/uvm_map.c   24 Oct 2021 13:24:21 -
@@ -2155,10 +2155,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
 * from the object.  offsets are always relative
 * to vm_map_min(kernel_map).
 */
-   pmap_remove(pmap_kernel(), entry->start, entry->end);
-   uvm_km_pgremove(entry->object.uvm_obj,
-   entry->start - vm_map_min(kernel_map),
-   entry->end - vm_map_min(kernel_map));
+   uvm_km_pgremove(entry->object.uvm_obj, entry->start,
+   entry->end);
 
/*
 * null out kernel_object reference, we've just



More uvm_obj_destroy()

2021-10-23 Thread Martin Pieuchot
Diff below is extracted from the current UVM unlocking diff.  It adds a
couple of uvm_obj_destroy() calls and moves some uvm_obj_init() calls
around.

uvm_obj_destroy() will be used to release the memory of the, possibly
shared, lock allocated in uvm_obj_init().  When it is called the object
should no longer have any pages attached to it, that's why I added the
corresponding KASSERT().

The uvm_obj_init() calls have been moved to satisfy lock assertions and
to reduce differences with NetBSD.  The tricky one is for vnodes, which
are never freed.

Comments?  Oks?

Index: kern/vfs_subr.c
===
RCS file: /cvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.309
diff -u -p -r1.309 vfs_subr.c
--- kern/vfs_subr.c 21 Oct 2021 09:59:14 -  1.309
+++ kern/vfs_subr.c 23 Oct 2021 09:53:48 -
@@ -410,6 +410,7 @@ getnewvnode(enum vtagtype tag, struct mo
vp = pool_get(_pool, PR_WAITOK | PR_ZERO);
vp->v_uvm = pool_get(_vnode_pool, PR_WAITOK | PR_ZERO);
vp->v_uvm->u_vnode = vp;
+   uvm_obj_init(>v_uvm->u_obj, _vnodeops, 0);
RBT_INIT(buf_rb_bufs, >v_bufs_tree);
cache_tree_init(>v_nc_tree);
TAILQ_INIT(>v_cache_dst);
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_aobj.c
--- uvm/uvm_aobj.c  28 Jun 2021 11:19:01 -  1.99
+++ uvm/uvm_aobj.c  23 Oct 2021 09:52:02 -
@@ -372,6 +372,7 @@ uao_free(struct uvm_aobj *aobj)
/*
 * finally free the aobj itself
 */
+   uvm_obj_destroy(uobj);
pool_put(_aobj_pool, aobj);
 }
 
Index: uvm/uvm_device.c
===
RCS file: /cvs/src/sys/uvm/uvm_device.c,v
retrieving revision 1.64
diff -u -p -r1.64 uvm_device.c
--- uvm/uvm_device.c29 Jun 2021 01:46:35 -  1.64
+++ uvm/uvm_device.c23 Oct 2021 09:49:16 -
@@ -182,6 +182,7 @@ udv_attach(dev_t device, vm_prot_t acces
mtx_leave(_lock);
/* NOTE: we could sleep in the following malloc() */
udv = malloc(sizeof(*udv), M_TEMP, M_WAITOK);
+   uvm_obj_init(>u_obj, _deviceops, 1);
mtx_enter(_lock);
 
/*
@@ -199,6 +200,7 @@ udv_attach(dev_t device, vm_prot_t acces
 */
if (lcv) {
mtx_leave(_lock);
+   uvm_obj_destroy(>u_obj);
free(udv, M_TEMP, sizeof(*udv));
continue;
}
@@ -207,7 +209,6 @@ udv_attach(dev_t device, vm_prot_t acces
 * we have it!   init the data structures, add to list
 * and return.
 */
-   uvm_obj_init(>u_obj, _deviceops, 1);
udv->u_flags = 0;
udv->u_device = device;
LIST_INSERT_HEAD(_list, udv, u_list);
@@ -275,6 +276,8 @@ again:
if (udv->u_flags & UVM_DEVICE_WANTED)
wakeup(udv);
mtx_leave(_lock);
+
+   uvm_obj_destroy(uobj);
free(udv, M_TEMP, sizeof(*udv));
 }
 
Index: uvm/uvm_object.c
===
RCS file: /cvs/src/sys/uvm/uvm_object.c,v
retrieving revision 1.21
diff -u -p -r1.21 uvm_object.c
--- uvm/uvm_object.c12 Oct 2021 18:16:51 -  1.21
+++ uvm/uvm_object.c23 Oct 2021 09:49:57 -
@@ -66,9 +66,13 @@ uvm_obj_init(struct uvm_object *uobj, co
uobj->uo_refs = refs;
 }
 
+/*
+ * uvm_obj_destroy: destroy UVM memory object.
+ */
 void
 uvm_obj_destroy(struct uvm_object *uo)
 {
+   KASSERT(RBT_EMPTY(uvm_objtree, >memt));
 }
 
 #ifndef SMALL_KERNEL
Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.118
diff -u -p -r1.118 uvm_vnode.c
--- uvm/uvm_vnode.c 20 Oct 2021 06:35:40 -  1.118
+++ uvm/uvm_vnode.c 23 Oct 2021 09:56:32 -
@@ -229,7 +229,8 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 #endif
 
/* now set up the uvn. */
-   uvm_obj_init(>u_obj, _vnodeops, 1);
+   KASSERT(uvn->u_obj.uo_refs == 0);
+   uvn->u_obj.uo_refs++;
oldflags = uvn->u_flags;
uvn->u_flags = UVM_VNODE_VALID|UVM_VNODE_CANPERSIST;
uvn->u_nio = 0;



Please test: full poll/select(2) switch

2021-10-23 Thread Martin Pieuchot
Diff below switches both poll(2) and select(2) to the kqueue-based
implementation.

In addition it switches libevent(3) to use poll(2) by default for
testing purposes.

I don't have any open bug left with this diff and I'm happily running
GNOME with it.  So I'd be happy if you could try to break it and report
back.

Index: lib/libevent/event.c
===
RCS file: /cvs/src/lib/libevent/event.c,v
retrieving revision 1.41
diff -u -p -r1.41 event.c
--- lib/libevent/event.c1 May 2019 19:14:25 -   1.41
+++ lib/libevent/event.c23 Oct 2021 09:36:10 -
@@ -53,9 +53,9 @@ extern const struct eventop kqops;
 
 /* In order of preference */
 static const struct eventop *eventops[] = {
-   ,
,
,
+   ,
NULL
 };
 
Index: sys/kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.137
diff -u -p -r1.137 sys_generic.c
--- sys/kern/sys_generic.c  15 Oct 2021 06:59:57 -  1.137
+++ sys/kern/sys_generic.c  23 Oct 2021 09:14:59 -
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef KTRACE
 #include 
 #endif
@@ -66,8 +67,23 @@
 
 #include 
 
-int selscan(struct proc *, fd_set *, fd_set *, int, int, register_t *);
-void pollscan(struct proc *, struct pollfd *, u_int, register_t *);
+/*
+ * Debug values:
+ *  1 - print implementation errors, things that should not happen.
+ *  2 - print ppoll(2) information, somewhat verbose
+ *  3 - print pselect(2) and ppoll(2) information, very verbose
+ */
+int kqpoll_debug = 0;
+#define DPRINTFN(v, x...) if (kqpoll_debug > v) {  \
+   printf("%s(%d): ", curproc->p_p->ps_comm, curproc->p_tid);  \
+   printf(x);  \
+}
+
+int pselregister(struct proc *, fd_set *[], fd_set *[], int, int *, int *);
+int pselcollect(struct proc *, struct kevent *, fd_set *[], int *);
+int ppollregister(struct proc *, struct pollfd *, int, int *);
+int ppollcollect(struct proc *, struct kevent *, struct pollfd *, u_int);
+
 int pollout(struct pollfd *, struct pollfd *, u_int);
 int dopselect(struct proc *, int, fd_set *, fd_set *, fd_set *,
 struct timespec *, const sigset_t *, register_t *);
@@ -584,11 +600,10 @@ int
 dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex,
 struct timespec *timeout, const sigset_t *sigmask, register_t *retval)
 {
+   struct kqueue_scan_state scan;
fd_mask bits[6];
fd_set *pibits[3], *pobits[3];
-   struct timespec elapsed, start, stop;
-   uint64_t nsecs;
-   int s, ncoll, error = 0;
+   int error, ncollected = 0, nevents = 0;
u_int ni;
 
if (nd < 0)
@@ -618,6 +633,8 @@ dopselect(struct proc *p, int nd, fd_set
pobits[2] = (fd_set *)[5];
}
 
+   kqpoll_init();
+
 #definegetbits(name, x) \
if (name && (error = copyin(name, pibits[x], ni))) \
goto done;
@@ -636,43 +653,61 @@ dopselect(struct proc *p, int nd, fd_set
if (sigmask)
dosigsuspend(p, *sigmask &~ sigcantmask);
 
-retry:
-   ncoll = nselcoll;
-   atomic_setbits_int(>p_flag, P_SELECT);
-   error = selscan(p, pibits[0], pobits[0], nd, ni, retval);
-   if (error || *retval)
+   /* Register kqueue events */
+   error = pselregister(p, pibits, pobits, nd, , );
+   if (error != 0)
goto done;
-   if (timeout == NULL || timespecisset(timeout)) {
-   if (timeout != NULL) {
-   getnanouptime();
-   nsecs = MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP);
-   } else
-   nsecs = INFSLP;
-   s = splhigh();
-   if ((p->p_flag & P_SELECT) == 0 || nselcoll != ncoll) {
-   splx(s);
-   goto retry;
-   }
-   atomic_clearbits_int(>p_flag, P_SELECT);
-   error = tsleep_nsec(, PSOCK | PCATCH, "select", nsecs);
-   splx(s);
+
+   /*
+* The poll/select family of syscalls has been designed to
+* block when file descriptors are not available, even if
+* there's nothing to wait for.
+*/
+   if (nevents == 0 && ncollected == 0) {
+   uint64_t nsecs = INFSLP;
+
if (timeout != NULL) {
-   getnanouptime();
-   timespecsub(, , );
-   timespecsub(timeout, , timeout);
-   if (timeout->tv_sec < 0)
-   timespecclear(timeout);
+   if (!timespecisset(timeout))
+   goto done;
+   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
}
-   if (error == 0 || error == EWOULDBLOCK)
-  

Re: xhci uhub on arm64: handle device in SS_INACTIVE state

2021-10-22 Thread Martin Pieuchot
On 17/10/21(Sun) 09:06, Christopher Zimmermann wrote:
> Hi,
> 
> on my RK3399, a usb device connected to the USB 3 port is not detected
> during boot because it is in SS_INACTIVE (0x00c0) state:
> 
> uhub3 at usb3 configuration 1 interface 0 "Generic xHCI root hub" rev
> 3.00/1.00 addr 1
> uhub3: uhub_attach
> uhub3: 2 ports with 2 removable, self powered
> uhub3: intr status=0
> usb_needs_explore: usb3: not exploring before first explore
> uhub3: uhub_explore
> uhub3: port 1 status=0x02a0 change=0x
> uhub3: port 2 status=0x02c0 change=0x0040
> usb_explore: usb3: first explore done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x02c0 change=0x0040
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x02c0 change=0x0040
> 
> [...]
> 
> [turn the usb device off and on again]
> 
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x0203 change=0x0001
> usbd_reset_port: port 2 reset done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: port 2 status=0x0203 change=0x
> umass0 at uhub3 port 2 configuration 1 interface 0 "ATEC Dual Disk Drive" rev 
> 3.00/1.08 addr 2
> 
> This might be because u-boot-aarch64-2021.10 from packages left it in that
> state.
> I added this code to reset a device locked in such a state:

It's not clear to me whether this is a warm reset or not.  If the port
is in SS.Inactive it needs a warm reset, no?

If so, could you add a comment on top of the block?  I'd also suggest
moving the block after the "warm reset change" (BH_PORT_RESET); this
might matter and would at least match Linux's logic.

I wish you could add the logic to properly check if a warm reset is
required by checking the proper bits against the port number, but we
can rely on UPS_C_PORT_LINK_STATE for now and do that in a second step.
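
A minimal sketch of what such a check could look like, written as a
hypothetical helper and reusing only the status bits already mentioned
in this thread:

	/* Hypothetical helper, sketch only: does this port need a warm reset? */
	static int
	uhub_port_needs_warm_reset(struct uhub_softc *sc, int status)
	{
		int ls = UPS_PORT_LS_GET(status);

		return (sc->sc_hub->speed == USB_SPEED_SUPER &&
		    (ls == UPS_PORT_LS_SS_INACTIVE ||
		     ls == UPS_PORT_LS_COMP_MOD));
	}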

Comments below:

> Index: uhub.c
> ===
> RCS file: /cvs/src/sys/dev/usb/uhub.c,v
> retrieving revision 1.95
> diff -u -p -r1.95 uhub.c
> --- uhub.c  31 Jul 2020 10:49:33 -  1.95
> +++ uhub.c  17 Oct 2021 06:44:14 -
> @@ -414,6 +414,24 @@ uhub_explore(struct usbd_device *dev)
> change |= UPS_C_CONNECT_STATUS;
> }
> 
> +   if (change & UPS_C_PORT_LINK_STATE &&
> +   UPS_PORT_LS_GET(status) == UPS_PORT_LS_SS_INACTIVE &&

This should check for the speed of the HUB: 

sc->sc_hub->speed == USB_SPEED_SUPER

Should we also check if the link state is UPS_PORT_LS_COMP_MOD?

> +   ! (status & UPS_CURRENT_CONNECT_STATUS)) {
   ^
 Please drop the space here

> +   DPRINTF("%s: port %d is in in SS_INACTIVE.Quiet 
> state. "
> + "Reset port.\n",
> + sc->sc_dev.dv_xname, port);
> +   usbd_clear_port_feature(sc->sc_hub, port,
> +   UHF_C_PORT_RESET);
> +
> +   if (usbd_reset_port(sc->sc_hub, port)) {
> +   printf("%s: port %d reset failed\n",
> + DEVNAME(sc), port);
> +   return (-1);
> +   }
> +
> +   change |= UPS_C_CONNECT_STATUS;
> +   }
> +
> if (change & UPS_C_BH_PORT_RESET &&
> sc->sc_hub->speed == USB_SPEED_SUPER) {
> usbd_clear_port_feature(sc->sc_hub, port,
> 
> 
> Now the device attaches during boot. A redundant second reset of the device
> is performed during uhub_port_connect():
> 
> uhub3 at usb3 configuration 1 interface 0 "Generic xHCI root hub" rev
> 3.00/1.00 addr 1
> uhub3: uhub_attach
> uhub3: 2 ports with 2 removable, self powered
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> usb_needs_explore: usb3: not exploring before first explore
> uhub3: uhub_explore
> uhub3: port 1 status=0x02a0 change=0x
> uhub3: port 2 status=0x02c0 change=0x0040
> uhub3: port 2 is in in SS_INACTIVE.Quiet state. Reset port.
> usbd_reset_port: port 2 reset done
> usb_explore: usb3: first explore done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x0203 change=0x0031
> uhub3: uhub_port_connect
> usbd_reset_port: port 2 reset done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: port 2 status=0x0203 change=0x
> umass0 at uhub3 port 2 configuration 1 interface 0 "ATEC Dual Disk Drive" rev 
> 3.00/1.08 addr 2
> 
> 
> OK to commit this diff? Or should this be done some other way?
> 
> 
> Christopher
> 



Re: Make pipe event filters MP-safe

2021-10-22 Thread Martin Pieuchot
On 22/10/21(Fri) 13:15, Visa Hankala wrote:
> This diff makes pipe event filters ready to run without the kernel lock.
> The code pattern in the callbacks is the same as in sockets. Pipes
> have a klist lock already.
> 
> So far, pipe event filters have used read-locking. The patch changes
> that to write-locking for clarity. This should not be a real loss,
> though, because the lock is fine-grained and there is little multiple-
> readers parallelism to be utilized.

The removal of the KERNEL_LOCK() in pipeselwakeup() makes me very happy.
As found with patrick@, this accounted for a non-negligible amount of
spinning time in:
  https://undeadly.org/features/2021/09/2ytHD+googlemap_arm64.svg

ok mpi@

> Index: kern/sys_pipe.c
> ===
> RCS file: src/sys/kern/sys_pipe.c,v
> retrieving revision 1.127
> diff -u -p -r1.127 sys_pipe.c
> --- kern/sys_pipe.c   22 Oct 2021 05:00:26 -  1.127
> +++ kern/sys_pipe.c   22 Oct 2021 12:17:57 -
> @@ -78,20 +78,30 @@ static const struct fileops pipeops = {
>  
>  void filt_pipedetach(struct knote *kn);
>  int  filt_piperead(struct knote *kn, long hint);
> +int  filt_pipereadmodify(struct kevent *kev, struct knote *kn);
> +int  filt_pipereadprocess(struct knote *kn, struct kevent *kev);
> +int  filt_piperead_common(struct knote *kn, struct pipe *rpipe);
>  int  filt_pipewrite(struct knote *kn, long hint);
> +int  filt_pipewritemodify(struct kevent *kev, struct knote *kn);
> +int  filt_pipewriteprocess(struct knote *kn, struct kevent *kev);
> +int  filt_pipewrite_common(struct knote *kn, struct pipe *rpipe);
>  
>  const struct filterops pipe_rfiltops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_pipedetach,
>   .f_event= filt_piperead,
> + .f_modify   = filt_pipereadmodify,
> + .f_process  = filt_pipereadprocess,
>  };
>  
>  const struct filterops pipe_wfiltops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_pipedetach,
>   .f_event= filt_pipewrite,
> + .f_modify   = filt_pipewritemodify,
> + .f_process  = filt_pipewriteprocess,
>  };
>  
>  /*
> @@ -362,9 +372,7 @@ pipeselwakeup(struct pipe *cpipe)
>   cpipe->pipe_state &= ~PIPE_SEL;
>   selwakeup(>pipe_sel);
>   } else {
> - KERNEL_LOCK();
> - KNOTE(>pipe_sel.si_note, NOTE_SUBMIT);
> - KERNEL_UNLOCK();
> + KNOTE(>pipe_sel.si_note, 0);
>   }
>  
>   if (cpipe->pipe_state & PIPE_ASYNC)
> @@ -929,45 +937,76 @@ filt_pipedetach(struct knote *kn)
>  }
>  
>  int
> -filt_piperead(struct knote *kn, long hint)
> +filt_piperead_common(struct knote *kn, struct pipe *rpipe)
>  {
> - struct pipe *rpipe = kn->kn_fp->f_data, *wpipe;
> - struct rwlock *lock = rpipe->pipe_lock;
> + struct pipe *wpipe;
> +
> + rw_assert_wrlock(rpipe->pipe_lock);
>  
> - if ((hint & NOTE_SUBMIT) == 0)
> - rw_enter_read(lock);
>   wpipe = pipe_peer(rpipe);
>  
>   kn->kn_data = rpipe->pipe_buffer.cnt;
>  
>   if ((rpipe->pipe_state & PIPE_EOF) || wpipe == NULL) {
> - if ((hint & NOTE_SUBMIT) == 0)
> - rw_exit_read(lock);
>   kn->kn_flags |= EV_EOF; 
>   if (kn->kn_flags & __EV_POLL)
>   kn->kn_flags |= __EV_HUP;
>   return (1);
>   }
>  
> - if ((hint & NOTE_SUBMIT) == 0)
> - rw_exit_read(lock);
> -
>   return (kn->kn_data > 0);
>  }
>  
>  int
> -filt_pipewrite(struct knote *kn, long hint)
> +filt_piperead(struct knote *kn, long hint)
>  {
> - struct pipe *rpipe = kn->kn_fp->f_data, *wpipe;
> - struct rwlock *lock = rpipe->pipe_lock;
> + struct pipe *rpipe = kn->kn_fp->f_data;
> +
> + return (filt_piperead_common(kn, rpipe));
> +}
> +
> +int
> +filt_pipereadmodify(struct kevent *kev, struct knote *kn)
> +{
> + struct pipe *rpipe = kn->kn_fp->f_data;
> + int active;
> +
> + rw_enter_write(rpipe->pipe_lock);
> + knote_modify(kev, kn);
> + active = filt_piperead_common(kn, rpipe);
> + rw_exit_write(rpipe->pipe_lock);
> +
> + return (active);
> +}
> +
> +int
> +filt_pipereadprocess(struct knote *kn, struct kevent *kev)
> +{
> + struct pipe *rpipe = kn->kn_fp->f_data;
> + int active;
> +
> + rw_enter_write(rpipe->pipe_lock);
> + if (kev != NULL && (kn->kn_flags & EV_ONESHOT))
> + active = 1;
> + else
> + active = filt_piperead_common(kn, rpipe);
> + if (active)
> + knote_submit(kn, kev);
> + rw_exit_write(rpipe->pipe_lock);
> +
> + return (active);
> +}
> +
> +int
> +filt_pipewrite_common(struct knote *kn, struct pipe *rpipe)
> +{
> + struct pipe *wpipe;
> +
> + 

Re: Set klist lock for sockets, v2

2021-10-22 Thread Martin Pieuchot
On 22/10/21(Fri) 13:11, Visa Hankala wrote:
> Here is another attempt to set klist lock for sockets. This is a revised
> version of a patch that I posted in January [1].
> 
> Using solock() for the klists is probably the easiest way at the time
> being. However, the lock is a potential point of contention because of
> the underlying big-lock design. The increase of overhead is related to
> adding and removing event registrations. With persistent registrations
> the overhead is unchanged.
> 
> As a result, socket and named FIFO event filters should be ready to run
> without the kernel lock. The f_event, f_modify and f_process callbacks
> should be MP-safe already.
> 
> [1] https://marc.info/?l=openbsd-tech=160986578724696
> 
> OK?

I've been running with this and unlocked sowakeup() for quite some time
now.

ok mpi@

> Index: kern/uipc_socket.c
> ===
> RCS file: src/sys/kern/uipc_socket.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 uipc_socket.c
> --- kern/uipc_socket.c14 Oct 2021 23:05:10 -  1.265
> +++ kern/uipc_socket.c22 Oct 2021 12:17:57 -
> @@ -84,7 +84,7 @@ int filt_solistenprocess(struct knote *k
>  int  filt_solisten_common(struct knote *kn, struct socket *so);
>  
>  const struct filterops solisten_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sordetach,
>   .f_event= filt_solisten,
> @@ -93,7 +93,7 @@ const struct filterops solisten_filtops 
>  };
>  
>  const struct filterops soread_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sordetach,
>   .f_event= filt_soread,
> @@ -102,7 +102,7 @@ const struct filterops soread_filtops = 
>  };
>  
>  const struct filterops sowrite_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sowdetach,
>   .f_event= filt_sowrite,
> @@ -111,7 +111,7 @@ const struct filterops sowrite_filtops =
>  };
>  
>  const struct filterops soexcept_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sordetach,
>   .f_event= filt_soread,
> @@ -169,6 +169,8 @@ socreate(int dom, struct socket **aso, i
>   return (EPROTOTYPE);
>   so = pool_get(_pool, PR_WAITOK | PR_ZERO);
>   rw_init(>so_lock, "solock");
> + klist_init(>so_rcv.sb_sel.si_note, _klistops, so);
> + klist_init(>so_snd.sb_sel.si_note, _klistops, so);
>   sigio_init(>so_sigio);
>   TAILQ_INIT(>so_q0);
>   TAILQ_INIT(>so_q);
> @@ -258,6 +260,8 @@ sofree(struct socket *so, int s)
>   }
>   }
>   sigio_free(>so_sigio);
> + klist_free(>so_rcv.sb_sel.si_note);
> + klist_free(>so_snd.sb_sel.si_note);
>  #ifdef SOCKET_SPLICE
>   if (so->so_sp) {
>   if (issplicedback(so)) {
> @@ -2038,9 +2042,9 @@ soo_kqfilter(struct file *fp, struct kno
>  {
>   struct socket *so = kn->kn_fp->f_data;
>   struct sockbuf *sb;
> + int s;
>  
> - KERNEL_ASSERT_LOCKED();
> -
> + s = solock(so);
>   switch (kn->kn_filter) {
>   case EVFILT_READ:
>   if (so->so_options & SO_ACCEPTCONN)
> @@ -2058,10 +2062,12 @@ soo_kqfilter(struct file *fp, struct kno
>   sb = >so_rcv;
>   break;
>   default:
> + sounlock(so, s);
>   return (EINVAL);
>   }
>  
>   klist_insert_locked(>sb_sel.si_note, kn);
> + sounlock(so, s);
>  
>   return (0);
>  }
> @@ -2071,9 +2077,7 @@ filt_sordetach(struct knote *kn)
>  {
>   struct socket *so = kn->kn_fp->f_data;
>  
> - KERNEL_ASSERT_LOCKED();
> -
> - klist_remove_locked(>so_rcv.sb_sel.si_note, kn);
> + klist_remove(>so_rcv.sb_sel.si_note, kn);
>  }
>  
>  int
> @@ -2159,9 +2163,7 @@ filt_sowdetach(struct knote *kn)
>  {
>   struct socket *so = kn->kn_fp->f_data;
>  
> - KERNEL_ASSERT_LOCKED();
> -
> - klist_remove_locked(>so_snd.sb_sel.si_note, kn);
> + klist_remove(>so_snd.sb_sel.si_note, kn);
>  }
>  
>  int
> @@ -2284,6 +2286,36 @@ filt_solistenprocess(struct knote *kn, s
>   return (rv);
>  }
>  
> +void
> +klist_soassertlk(void *arg)
> +{
> + struct socket *so = arg;
> +
> + soassertlocked(so);
> +}
> +
> +int
> +klist_solock(void *arg)
> +{
> + struct socket *so = arg;
> +
> + return (solock(so));
> +}
> +
> +void
> +klist_sounlock(void *arg, int ls)
> +{
> + struct socket *so = arg;
> +
> + sounlock(so, ls);
> +}
> +
> +const struct klistops socket_klistops = {
> + .klo_assertlk   = klist_soassertlk,
> + .klo_lock   = 

POLLHUP vs EVFILT_EXCEPT semantic

2021-10-22 Thread Martin Pieuchot
Last year we added the new EVFILT_EXCEPT filter type to kqueue in
order to report conditions currently available via POLLPRI/POLLRDBAND
in poll(2) and select(2).

This new filter has been implemented in tty and socket by re-using the
existing kqueue "read" filter.  This has a downside: the filter will also
trigger if any data is available for reading.

This "feature" makes it impossible to correctly implement poll(2)'s
"empty" condition mode.  If no bit are set in the `events' pollfd
structure we still need to return POLLHUP.  But if the filter triggers
when there's data to read, it means POLLIN not POLLHUP.

So I'd like to change the existing EVFILT_EXCEPT filters to no longer
fire if there is something to read.  Diff below does that and adds a
new filter for FIFOs necessary for poll(2) support.
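
With that semantic, a userland consumer interested only in exceptional
conditions would register the filter along these lines (hypothetical
snippet; `fd' could be e.g. a pty master, `kq' an existing kqueue, and
it needs <sys/types.h>, <sys/event.h> and <err.h>):

	struct kevent kev;

	/* Watch for OOB/exceptional conditions only, not readability. */
	EV_SET(&kev, fd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
	if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
		err(1, "kevent");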

Ok?

Index: kern/tty_pty.c
===
RCS file: /cvs/src/sys/kern/tty_pty.c,v
retrieving revision 1.108
diff -u -p -r1.108 tty_pty.c
--- kern/tty_pty.c  8 Feb 2021 09:18:30 -   1.108
+++ kern/tty_pty.c  22 Oct 2021 12:49:12 -
@@ -107,6 +107,7 @@ voidfilt_ptcrdetach(struct knote *);
 intfilt_ptcread(struct knote *, long);
 void   filt_ptcwdetach(struct knote *);
 intfilt_ptcwrite(struct knote *, long);
+intfilt_ptcexcept(struct knote *, long);
 
 static struct pt_softc **ptyarralloc(int);
 static int check_pty(int);
@@ -670,16 +671,6 @@ filt_ptcread(struct knote *kn, long hint
tp = pti->pt_tty;
kn->kn_data = 0;
 
-   if (kn->kn_sfflags & NOTE_OOB) {
-   /* If in packet or user control mode, check for data. */
-   if (((pti->pt_flags & PF_PKT) && pti->pt_send) ||
-   ((pti->pt_flags & PF_UCNTL) && pti->pt_ucntl)) {
-   kn->kn_fflags |= NOTE_OOB;
-   kn->kn_data = 1;
-   return (1);
-   }
-   return (0);
-   }
if (ISSET(tp->t_state, TS_ISOPEN)) {
if (!ISSET(tp->t_state, TS_TTSTOP))
kn->kn_data = tp->t_outq.c_cc;
@@ -731,6 +722,34 @@ filt_ptcwrite(struct knote *kn, long hin
return (kn->kn_data > 0);
 }
 
+int
+filt_ptcexcept(struct knote *kn, long hint)
+{
+   struct pt_softc *pti = (struct pt_softc *)kn->kn_hook;
+   struct tty *tp;
+
+   tp = pti->pt_tty;
+
+   if (kn->kn_sfflags & NOTE_OOB) {
+   /* If in packet or user control mode, check for data. */
+   if (((pti->pt_flags & PF_PKT) && pti->pt_send) ||
+   ((pti->pt_flags & PF_UCNTL) && pti->pt_ucntl)) {
+   kn->kn_fflags |= NOTE_OOB;
+   kn->kn_data = 1;
+   return (1);
+   }
+   return (0);
+   }
+   if (!ISSET(tp->t_state, TS_CARR_ON)) {
+   kn->kn_flags |= EV_EOF;
+   if (kn->kn_flags & __EV_POLL)
+   kn->kn_flags |= __EV_HUP;
+   return (1);
+   }
+
+   return (0);
+}
+
 const struct filterops ptcread_filtops = {
.f_flags= FILTEROP_ISFD,
.f_attach   = NULL,
@@ -749,7 +768,7 @@ const struct filterops ptcexcept_filtops
.f_flags= FILTEROP_ISFD,
.f_attach   = NULL,
.f_detach   = filt_ptcrdetach,
-   .f_event= filt_ptcread,
+   .f_event= filt_ptcexcept,
 };
 
 int
Index: kern/uipc_socket.c
===
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.265
diff -u -p -r1.265 uipc_socket.c
--- kern/uipc_socket.c  14 Oct 2021 23:05:10 -  1.265
+++ kern/uipc_socket.c  22 Oct 2021 12:49:12 -
@@ -78,6 +78,10 @@ int  filt_sowrite(struct knote *kn, long 
 intfilt_sowritemodify(struct kevent *kev, struct knote *kn);
 intfilt_sowriteprocess(struct knote *kn, struct kevent *kev);
 intfilt_sowrite_common(struct knote *kn, struct socket *so);
+intfilt_soexcept(struct knote *kn, long hint);
+intfilt_soexceptmodify(struct kevent *kev, struct knote *kn);
+intfilt_soexceptprocess(struct knote *kn, struct kevent *kev);
+intfilt_soexcept_common(struct knote *kn, struct socket *so);
 intfilt_solisten(struct knote *kn, long hint);
 intfilt_solistenmodify(struct kevent *kev, struct knote *kn);
 intfilt_solistenprocess(struct knote *kn, struct kevent *kev);
@@ -114,9 +118,9 @@ const struct filterops soexcept_filtops 
.f_flags= FILTEROP_ISFD,
.f_attach   = NULL,
.f_detach   = filt_sordetach,
-   .f_event= filt_soread,
-   .f_modify   = filt_soreadmodify,
-   .f_process  = filt_soreadprocess,
+   .f_event= filt_soexcept,
+   .f_modify   = filt_soexceptmodify,
+   .f_process  = filt_soexceptprocess,
 };
 
 #ifndef SOMINCONN
@@ -2089,13 +2093,7 @@ 

Re: vnode lock: remove VLOCKSWORK flag

2021-10-15 Thread Martin Pieuchot
On 15/10/21(Fri) 09:27, Sebastien Marie wrote:
> Hi,
> 
> The following diff removes VLOCKSWORK flag.

Nice.

> This flag is currently used to mark or unmark a vnode to actively
> check vnode locking semantic (when compiled with VFSLCKDEBUG).
>  
> Currently, VLOCKSWORK flag isn't properly set for several FS
> implementation which have full locking support, specially:
>  - cd9660
>  - udf
>  - fuse
>  - msdosfs
>  - tmpfs
> 
> Instead of using a particular flag, I propose to directly check if
> v_op->vop_islocked is nullop or not to activate or not the vnode
> locking checks.

I wonder if we shouldn't get rid of those checks and instead make
VOP_ISLOCKED() deal with that.

VOP_ISLOCKED() is inconsistent.  It returns the value of rrw_status(9)
or EOPNOTSUPP if `vop_islocked' is NULL.

But this is a change in behavior that has a broader scope, so it should
be done separately.

> Some alternate methods might be possible, like having a specific
> member inside struct vops. But it will only duplicate the fact that
> nullop is used as lock mecanism.
> 
> I also slightly changed ASSERT_VP_ISLOCKED(vp) macro:
> - evaluate vp argument only once
> - explicitly check if VOP_ISLOCKED() != LK_EXCLUSIVE (it might returns
>   error or 'locked by some else', and it doesn't mean "locked by me")
> - show the VOP_ISLOCKED returned code in panic message
> 
> Some code are using ASSERT_VP_ISLOCKED() like code. I kept them simple.
> 
> The direct impact on snapshots should be low as VFSLCKDEBUG isn't set
> by default.
> 
> Comments or OK ?

ok mpi@

> diff e44725a8dd99f82f94f37ecff5c0e710c4dba97e 
> /home/semarie/repos/openbsd/sys-clean
> blob - c752dd99e9ef62b05162cfeda67913ab5bccf06e
> file + kern/vfs_subr.c
> --- kern/vfs_subr.c
> +++ kern/vfs_subr.c
> @@ -1075,9 +1075,6 @@ vclean(struct vnode *vp, int flags, struct proc *p)
>   vp->v_op = _vops;
>   VN_KNOTE(vp, NOTE_REVOKE);
>   vp->v_tag = VT_NON;
> -#ifdef VFSLCKDEBUG
> - vp->v_flag &= ~VLOCKSWORK;
> -#endif
>   mtx_enter(_mtx);
>   vp->v_lflag &= ~VXLOCK;
>   if (vp->v_lflag & VXWANT) {
> @@ -1930,7 +1927,7 @@ vinvalbuf(struct vnode *vp, int flags, struct ucred *c
>   int s, error;
>  
>  #ifdef VFSLCKDEBUG
> - if ((vp->v_flag & VLOCKSWORK) && !VOP_ISLOCKED(vp))
> + if ((vp->v_op->vop_islocked != nullop) && !VOP_ISLOCKED(vp))
>   panic("%s: vp isn't locked, vp %p", __func__, vp);
>  #endif
>  
> blob - caf2dc327bfc2f5a001bcee80edd90938497ef99
> file + kern/vfs_vops.c
> --- kern/vfs_vops.c
> +++ kern/vfs_vops.c
> @@ -48,11 +48,15 @@
>  #include 
>  
>  #ifdef VFSLCKDEBUG
> -#define ASSERT_VP_ISLOCKED(vp) do {  \
> - if (((vp)->v_flag & VLOCKSWORK) && !VOP_ISLOCKED(vp)) { \
> - VOP_PRINT(vp);  \
> - panic("vp not locked"); \
> - }   \
> +#define ASSERT_VP_ISLOCKED(vp) do {  \
> + struct vnode *_vp = (vp);   \
> + int r;  \
> + if (_vp->v_op->vop_islocked == nullop)  \
> + break;  \
> + if ((r = VOP_ISLOCKED(_vp)) != LK_EXCLUSIVE) {  \
> + VOP_PRINT(_vp); \
> + panic("%s: vp not locked, vp %p, %d", __func__, _vp, r);\
> + }   \
>  } while (0)
>  #else
>  #define ASSERT_VP_ISLOCKED(vp)  /* nothing */
> blob - 81b900e83d2071d8450f35cfae42c6cb91f1a414
> file + nfs/nfs_node.c
> --- nfs/nfs_node.c
> +++ nfs/nfs_node.c
> @@ -133,9 +133,6 @@ loop:
>   }
>  
>   vp = nvp;
> -#ifdef VFSLCKDEBUG
> - vp->v_flag |= VLOCKSWORK;
> -#endif
>   rrw_init_flags(>n_lock, "nfsnode", RWL_DUPOK | RWL_IS_VNODE);
>   vp->v_data = np;
>   /* we now have an nfsnode on this vnode */
> blob - 3668f954a9aab3fd49ed5e41e7d4ab51b4bf0a90
> file + sys/vnode.h
> --- sys/vnode.h
> +++ sys/vnode.h
> @@ -146,8 +146,7 @@ struct vnode {
>  #define  VCLONED 0x0400  /* vnode was cloned */
>  #define  VALIASED0x0800  /* vnode has an alias */
>  #define  VLARVAL 0x1000  /* vnode data not yet set up by higher 
> level */
> -#define  VLOCKSWORK  0x4000  /* FS supports locking discipline */
> -#define  VCLONE  0x8000  /* vnode is a clone */
> +#define  VCLONE  0x4000  /* vnode is a clone */
>  
>  /*
>   * (v_bioflag) Flags that may be manipulated by interrupt handlers
> blob - d859d216b40ebb2f5cce1eb5cf0becbfff21a638
> file + ufs/ext2fs/ext2fs_subr.c
> --- ufs/ext2fs/ext2fs_subr.c
> +++ ufs/ext2fs/ext2fs_subr.c
> @@ -170,9 +170,6 @@ ext2fs_vinit(struct mount *mp, struct vnode **vpp)
>   nvp->v_data = vp->v_data;
>  

poll(2) on top of kqueue

2021-10-14 Thread Martin Pieuchot
Diff below is the counterpart of the select(2) one I just committed to
make poll(2) and ppoll(2) use kqueue internally.

They use the same logic as select(2): convert each pollfd into kqueue
events with EV_SET(2), then wait in kqueue_scan().
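
As an illustration, here is a minimal sketch of that conversion for a
single pollfd.  It is simplified: the real registration, including the
FIFO and error handling, lives in ppollregister()/ppollregister_evts()
in the diff below and also sets internal flags omitted here.

	/* Sketch only: map the basic pollfd events to kqueue filters. */
	struct kevent kev[2];
	int nkev = 0;

	if (pl->events & (POLLIN | POLLRDNORM))
		EV_SET(&kev[nkev++], pl->fd, EVFILT_READ,
		    EV_ADD | EV_ENABLE, 0, 0,
		    (void *)(p->p_kq_serial));
	if (pl->events & (POLLOUT | POLLWRNORM))
		EV_SET(&kev[nkev++], pl->fd, EVFILT_WRITE,
		    EV_ADD | EV_ENABLE, 0, 0,
		    (void *)(p->p_kq_serial));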

To make this implementation compatible with the existing poll(2) semantics
I added a new, FIFO-specific kqueue filter to handle the case where POLLOUT
is specified on a FIFO open in read-only mode.  Thanks to millert@ for the
idea.  The sys/fifofs regress tests pass with that.

As for the select(2) diff, I'm currently interested in knowing whether
you find any incompatibility with the current behavior.

Thanks for testing,
Martin

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.136
diff -u -p -r1.136 sys_generic.c
--- kern/sys_generic.c  14 Oct 2021 08:46:01 -  1.136
+++ kern/sys_generic.c  14 Oct 2021 09:00:22 -
@@ -81,6 +81,8 @@ int kqpoll_debug = 0;
 
 int pselregister(struct proc *, fd_set *[], int, int *);
 int pselcollect(struct proc *, struct kevent *, fd_set *[], int *);
+int ppollregister(struct proc *, struct pollfd *, int, int *);
+int ppollcollect(struct proc *, struct kevent *, struct pollfd *, u_int);
 
 int pollout(struct pollfd *, struct pollfd *, u_int);
 int dopselect(struct proc *, int, fd_set *, fd_set *, fd_set *,
@@ -769,6 +771,7 @@ pselregister(struct proc *p, fd_set *pib
/* FALLTHROUGH */
case EOPNOTSUPP:/* No underlying kqfilter */
case EINVAL:/* Unimplemented filter */
+   case EPERM: /* Specific to FIFO */
error = 0;
break;
case ENXIO: /* Device has been detached */
@@ -899,31 +902,132 @@ doselwakeup(struct selinfo *sip)
}
 }
 
-void
-pollscan(struct proc *p, struct pollfd *pl, u_int nfd, register_t *retval)
+int
+ppollregister_evts(struct proc *p, struct kevent *kevp, int nkev,
+struct pollfd *pl)
 {
-   struct filedesc *fdp = p->p_fd;
-   struct file *fp;
-   u_int i;
-   int n = 0;
+   int i, error, nevents = 0;
 
-   for (i = 0; i < nfd; i++, pl++) {
-   /* Check the file descriptor. */
-   if (pl->fd < 0) {
-   pl->revents = 0;
-   continue;
+   KASSERT(pl->revents == 0);
+
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktrevent(p, kevp, nkev);
+#endif
+   for (i = 0; i < nkev; i++, kevp++) {
+again:
+   error = kqueue_register(p->p_kq, kevp, p);
+   switch (error) {
+   case 0:
+   nevents++;
+   break;
+   case EOPNOTSUPP:/* No underlying kqfilter */
+   case EINVAL:/* Unimplemented filter */
+   break;
+   case EBADF: /* Bad file descriptor */
+   pl->revents |= POLLNVAL;
+   break;
+   case EPERM: /* Specific to FIFO */
+   KASSERT(kevp->filter == EVFILT_WRITE);
+   if (nkev == 1) {
+   /*
+* If this is the only filter make sure
+* POLLHUP is passed to userland.
+*/
+   kevp->filter = EVFILT_EXCEPT;
+   goto again;
+   }
+   break;
+   case EPIPE: /* Specific to pipes */
+   KASSERT(kevp->filter == EVFILT_WRITE);
+   pl->revents |= POLLHUP;
+   break;
+   default:
+#ifdef DIAGNOSTIC
+   DPRINTFN(0, "poll err %lu fd %d revents %02x serial"
+   " %lu filt %d ERROR=%d\n",
+   ((unsigned long)kevp->udata - p->p_kq_serial),
+   pl->fd, pl->revents, p->p_kq_serial, kevp->filter,
+   error);
+#endif
+   /* FALLTHROUGH */
+   case ENXIO: /* Device has been detached */
+   pl->revents |= POLLERR;
+   break;
}
-   if ((fp = fd_getfile(fdp, pl->fd)) == NULL) {
-   pl->revents = POLLNVAL;
-   n++;
+   }
+
+   return (nevents);
+}
+
+/*
+ * Convert pollfd into kqueue events and register them on the
+ * per-thread queue.
+ *
+ * Return the number of pollfd that triggered at least one error and aren't
+ * completly monitored.  These pollfd should have the correponding error bit
+ * set in `revents'.
+ *
+ * At most 3 events can correspond to a single pollfd.
+ */
+int

Re: Switch to kqueue based select(2)

2021-10-13 Thread Martin Pieuchot
On 13/10/21(Wed) 11:41, Alexander Bluhm wrote:
> On Sat, Oct 02, 2021 at 09:10:13AM +0200, Martin Pieuchot wrote:
> > ok?
> 
> OK bluhm@
> 
> > +   /* Maxium number of events per iteration */
> 
> Maximum
> 
> > +int
> > +pselcollect(struct proc *p, struct kevent *kevp, fd_set *pobits[3],
> > +int *ncollected)
> > +{
> > +#ifdef DIAGNOSTIC
> > +   /* Filter out and lazily delete spurious events */
> > +   if ((unsigned long)kevp->udata != p->p_kq_serial) {
> > +   DPRINTFN(0, "select fd %u mismatched serial %lu\n",
> > +   (int)kevp->ident, p->p_kq_serial);
> > +   kevp->flags = EV_DISABLE|EV_DELETE;
> > +   kqueue_register(p->p_kq, kevp, p);
> > +   return (0);
> > +   }
> > +#endif
> 
> Why is it DIAGNOSTIC?  Either it should not happen, then call panic().
> Or it is a valid corner case, then remove #ifdef DIAGNOSTIC.
> 
> Different behavior with and without DIAGNOSTIC seems bad.

Indeed.  It should not be in DIAGNOSTIC; that's a leftover from a previous
iteration of the diff.  I'll fix both points before committing.
Thanks for the review.



Re: mi_switch() & setting `p_stat'

2021-10-03 Thread Martin Pieuchot
On 02/10/21(Sat) 21:09, Mark Kettenis wrote:
> > Date: Sat, 2 Oct 2021 20:35:41 +0200
> > From: Martin Pieuchot 
> > [...] 
> > There's no sleeping point but a call to wakeup().  This wakeup() is
> > supposed to wake a btrace(8) process.  But if the curproc, which just
> > added itself to the global sleep queue, ends up in the same bucket as
> > the btrace process, the KASSERT() line 565 of kern/kern_synch.c will
> > trigger:
> > 
> > /*
> >  * If the rwlock passed to rwsleep() is contended, the
> >  * CPU will end up calling wakeup() between sleep_setup()
> >  * and sleep_finish().
> >  */
> > if (p == curproc) {
> > KASSERT(p->p_stat == SONPROC);
> > continue;
> > }
> 
> Ah, right.  But that means the comment isn't accurate.  At least there
> are other cases that make us hit that codepath.
> 
> How useful is that KASSERT in catching actual bugs?

I added the KASSERT() to limit the scope of the check.  If the test is
true, `curproc' is obviously on the CPU.  Its usefulness is questionable.

So a simpler fix would be to remove the assert; the diff below does that
and updates the comment, ok?

Index: kern/kern_synch.c
===
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.179
diff -u -p -r1.179 kern_synch.c
--- kern/kern_synch.c   9 Sep 2021 18:41:39 -   1.179
+++ kern/kern_synch.c   3 Oct 2021 08:48:28 -
@@ -558,14 +558,11 @@ wakeup_n(const volatile void *ident, int
for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
pnext = TAILQ_NEXT(p, p_runq);
/*
-* If the rwlock passed to rwsleep() is contended, the
-* CPU will end up calling wakeup() between sleep_setup()
-* and sleep_finish().
+* This happens if wakeup(9) is called after enqueuing
+* itself on the sleep queue and both `ident' collide.
 */
-   if (p == curproc) {
-   KASSERT(p->p_stat == SONPROC);
+   if (p == curproc)
continue;
-   }
 #ifdef DIAGNOSTIC
if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
panic("wakeup: p_stat is %d", (int)p->p_stat);


