Re: uvm_swap: introduce uvm_swap_data_lock

2022-08-17 Thread Martin Pieuchot
On 16/01/22(Sun) 15:35, Martin Pieuchot wrote:
> On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> > The diff below does two things: it adds a uvm_swap_data_lock mutex and
> > trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()
> 
> Why is it enough?  Which fields is the lock protecting in these
> functions?  Is it `uvmexp.swpages', could that be documented?

It is documented in the diff below.

> 
> What about `nswapdev'?  Why is the rwlock grabbed before reading it in
> sys_swapctl()?

Because it is always modified with the lock, I added some documentation.

> What about `swpginuse'?

This is still under KERNEL_LOCK(), documented below.

> If the mutex/rwlock are used to protect the global `swap_priority' could
> that be also documented?  Once this is documented it should be trivial to
> see that some places are missing some locking.  Is it intentional?
> 
> > The uvm_swap_data_lock protects all swap data structures, so needs to be
> > grabbed a few times, many of them already documented in the comments.
> > 
> > For review, I suggest comparing to what NetBSD did and also going
> > through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> > and check that they are properly locked when called, or that there is
> > the KERNEL_LOCK() in place when swap data structures are manipulated.
> 
> I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
> the differences with NetBSD.

Done.
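
For illustration, the idiom amounts to a line like the following at the
top of swaplist_insert/find/trim (a sketch; the diff below is truncated
before those hunks):

	KASSERT(rw_write_held(&swap_syscall_lock));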

> > In swapmount() I introduced locking since that's needed to be able to
> > assert that the proper locks are held in swaplist_{insert,find,trim}.
> 
> Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
> What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

This has been done as part of another change.  This diff uses an atomic
operation to increment `nswget' in case multiple threads fault on a page
in swap at the same time.
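
For illustration, that amounts to something like the following instead
of an unlocked `uvmexp.nswget++' (a sketch; the hunk is not visible in
the truncated diff below):

	atomic_inc_int(&uvmexp.nswget);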

Updated diff below, ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_swap.c
--- uvm/uvm_swap.c  6 Aug 2022 13:44:04 -   1.163
+++ uvm/uvm_swap.c  17 Aug 2022 11:46:20 -
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -84,13 +85,16 @@
  * the system maintains a global data structure describing all swap
  * partitions/files.   there is a sorted LIST of "swappri" structures
  * which describe "swapdev"'s at that priority.   this LIST is headed
- * by the "swap_priority" global var.each "swappri" contains a 
+ * by the "swap_priority" global var.each "swappri" contains a
  * TAILQ of "swapdev" structures at that priority.
  *
  * locking:
  *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
  *system call and prevents the swap priority list from changing
  *while we are in the middle of a system call (e.g. SWAP_STATS).
+ *  - uvm_swap_data_lock (mutex): this lock protects all swap data
+ *structures including the priority list, the swapdev structures,
+ *and the swapmap arena.
  *
  * each swap device has the following info:
  *  - swap device in use (could be disabled, preventing future use)
@@ -106,7 +110,7 @@
  * userland controls and configures swap with the swapctl(2) system call.
  * the sys_swapctl performs the following operations:
  *  [1] SWAP_NSWAP: returns the number of swap devices currently configured
- *  [2] SWAP_STATS: given a pointer to an array of swapent structures 
+ *  [2] SWAP_STATS: given a pointer to an array of swapent structures
  * (passed in via "arg") of a size passed in via "misc" ... we load
  * the current swap config into the array.
  *  [3] SWAP_ON: given a pathname in arg (could be device or file) and a
@@ -208,9 +212,10 @@ struct extent *swapmap;/* controls the
 
 /* list of all active swap devices [by priority] */
 LIST_HEAD(swap_priority, swappri);
-struct swap_priority swap_priority;
+struct swap_priority swap_priority;/* [S] */
 
 /* locks */
+struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
 struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
 
 struct mutex oommtx = MUTEX_INITIALIZER(IPL_VM);
@@ -224,7 +229,7 @@ void swapdrum_add(struct swapdev *, in
 struct swapdev *swapdrum_getsdp(int);
 
 struct swapdev *swaplist_find(struct vnode *, int);
-voidswaplist_insert(struct swapdev *, 
+voidswaplist_insert(struct swapdev *,
 struct swappri *, int);
 voidswaplist_trim(void);
 
@@ -472,16 +477,19 @@ uvm_swap_finicrypt_all(void)
 /*
  * swaplist_insert: insert swap device "sdp" into the 

Re: patch: change swblk_t type and use it in blist

2022-08-05 Thread Martin Pieuchot
On 05/08/22(Fri) 18:10, Sebastien Marie wrote:
> Hi,
> 
> When blist was initially ported from DragonFlyBSD, we used the custom
> types bsblk_t and bsbmp_t instead of the ones used by DragonFlyBSD
> (swblk_t and u_swblk_t).
> 
> The reason was that swblk_t is already defined on OpenBSD, and was
> incompatible with blist (int32_t).  It is defined, but not used
> (outside of some regress files which seem to be unaffected by the type
> change).
> 
> This diff changes the __swblk_t definition in sys/_types.h to be
> 'unsigned long', and switches blist back to using swblk_t (and
> u_swblk_t, even if it isn't 'unsigned swblk_t').
> 
> It makes the diff with DragonFlyBSD thinner.  I added a comment with
> the git id used for the initial port.
> 
> I tested it on i386 and amd64 (kernel and userland).
> 
> By changing the bitmap type from 'u_long' to 'u_swblk_t' ('u_int64_t'),
> the regress behaves the same on 64- and 32-bit archs (and it succeeds
> on both).
> 
> Comments or OK ?

Makes sense to me.  I'm not a standard/type lawyer so I don't know if
this is fine for userland.  So I'm ok with it.
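
(For what it's worth: with u_swblk_t fixed to u_int64_t, the diff's
BLIST_BMAP_RADIX is always 64 and BLIST_META_RADIX always 32, which is
presumably why the regress now behaves the same on 32- and 64-bit
archs.)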

> diff /home/semarie/repos/openbsd/src
> commit - 73f52ef7130cefbe5a8fe028eedaad0e54be7303
> path + /home/semarie/repos/openbsd/src
> blob - e05867429cdd81c434f9ca589c1fb8c6d25957f8
> file + sys/sys/_types.h
> --- sys/sys/_types.h
> +++ sys/sys/_types.h
> @@ -60,7 +60,7 @@ typedef __uint8_t   __sa_family_t;  /* sockaddr 
> address f
>  typedef  __int32_t   __segsz_t;  /* segment size */
>  typedef  __uint32_t  __socklen_t;/* length type for network 
> syscalls */
>  typedef  long__suseconds_t;  /* microseconds (signed) */
> -typedef  __int32_t   __swblk_t;  /* swap offset */
> +typedef  unsigned long   __swblk_t;  /* swap offset */
>  typedef  __int64_t   __time_t;   /* epoch time */
>  typedef  __int32_t   __timer_t;  /* POSIX timer identifiers */
>  typedef  __uint32_t  __uid_t;/* user id */
> blob - 102ca95dd45ba6d9cab0f3fcbb033d6043ec1606
> file + sys/sys/blist.h
> --- sys/sys/blist.h
> +++ sys/sys/blist.h
> @@ -1,4 +1,5 @@
>  /* $OpenBSD: blist.h,v 1.1 2022/07/29 17:47:12 semarie Exp $ */
> +/* DragonFlyBSD:7b80531f545c7d3c51c1660130c71d01f6bccbe0:/sys/sys/blist.h */
>  /*
>   * Copyright (c) 2003,2004 The DragonFly Project.  All rights reserved.
>   * 
> @@ -65,15 +66,13 @@
>  #include 
>  #endif
>  
> -#define  SWBLK_BITS 64
> -typedef u_long bsbmp_t;
> -typedef u_long bsblk_t;
> +typedef u_int64_tu_swblk_t;
>  
>  /*
>   * note: currently use SWAPBLK_NONE as an absolute value rather then
>   * a flag bit.
>   */
> -#define SWAPBLK_NONE ((bsblk_t)-1)
> +#define SWAPBLK_NONE ((swblk_t)-1)
>  
>  /*
>   * blmeta and bl_bitmap_t MUST be a power of 2 in size.
> @@ -81,39 +80,39 @@ typedef u_long bsblk_t;
>  
>  typedef struct blmeta {
>   union {
> - bsblk_t bmu_avail;  /* space available under us */
> - bsbmp_t bmu_bitmap; /* bitmap if we are a leaf  */
> + swblk_t bmu_avail;  /* space available under us */
> + u_swblk_t   bmu_bitmap; /* bitmap if we are a leaf  */
>   } u;
> - bsblk_t bm_bighint; /* biggest contiguous block hint*/
> + swblk_t bm_bighint; /* biggest contiguous block hint*/
>  } blmeta_t;
>  
>  typedef struct blist {
> - bsblk_t bl_blocks;  /* area of coverage */
> + swblk_t bl_blocks;  /* area of coverage */
>   /* XXX int64_t bl_radix */
> - bsblk_t bl_radix;   /* coverage radix   */
> - bsblk_t bl_skip;/* starting skip*/
> - bsblk_t bl_free;/* number of free blocks*/
> + swblk_t bl_radix;   /* coverage radix   */
> + swblk_t bl_skip;/* starting skip*/
> + swblk_t bl_free;/* number of free blocks*/
>   blmeta_t*bl_root;   /* root of radix tree   */
> - bsblk_t bl_rootblks;/* bsblk_t blks allocated for tree */
> + swblk_t bl_rootblks;/* swblk_t blks allocated for tree */
>  } *blist_t;
>  
> -#define BLIST_META_RADIX (sizeof(bsbmp_t)*8/2)   /* 2 bits per */
> -#define BLIST_BMAP_RADIX (sizeof(bsbmp_t)*8) /* 1 bit per */
> +#define BLIST_META_RADIX (sizeof(u_swblk_t)*8/2) /* 2 bits per */
> +#define BLIST_BMAP_RADIX (sizeof(u_swblk_t)*8)   /* 1 bit per */
>  
>  /*
>   * The radix may exceed the size of a 64 bit signed (or unsigned) int
> - * when the maximal number of blocks is allocated.  With a 32-bit bsblk_t
> + * when the maximal number of blocks is allocated.  With a 32-bit swblk_t
>   * this corresponds to ~1G x PAGE_SIZE = 4096GB.  The swap code usually
>   * divides this by 4, leaving us with a capability of up to four 1TB swap
>   * devices.
>   *
> - * With a 

pf.conf(5): document new anchors limit

2022-07-21 Thread Martin Vahlensieck
Hi

This is a diff to document the new anchors limit in pf.conf(5).  I
inserted it as second-to-last item, as the following paragraph talks
about NMBCLUSTERS.  While here: Is the double entry for table-entries
intentional?

Best,

Martin

Index: pf.conf.5
===
RCS file: /cvs/src/share/man/man5/pf.conf.5,v
retrieving revision 1.596
diff -u -p -r1.596 pf.conf.5
--- pf.conf.5   27 May 2022 15:45:02 -  1.596
+++ pf.conf.5   21 Jul 2022 17:00:53 -
@@ -1287,6 +1287,7 @@ has the following defaults:
 .It tables Ta Dv PFR_KTABLE_HIWAT Ta Pq 1000
 .It table-entries Ta Dv PFR_KENTRY_HIWAT Ta Pq 20
 .It table-entries Ta Dv PFR_KENTRY_HIWAT_SMALL Ta Pq 10
+.It anchors Ta Dv PF_ANCHOR_HIWAT Ta Pq 512
 .It frags Ta Dv NMBCLUSTERS Ns /32 Ta Pq platform dependent
 .El
 .Pp



ypconnect(2): mention correct return value

2022-07-21 Thread Martin Vahlensieck
Hi

While looking at the recent YP changes I noticed that the RETURN
VALUES section of the man page is incorrect.  Here is an update (I
just copied the text from socket(2) and adjusted the function name).

Best,

Martin

Index: ypconnect.2
===
RCS file: /cvs/src/lib/libc/sys/ypconnect.2,v
retrieving revision 1.2
diff -u -p -r1.2 ypconnect.2
--- ypconnect.2 17 Jul 2022 05:48:26 -  1.2
+++ ypconnect.2 21 Jul 2022 17:08:57 -
@@ -45,7 +45,12 @@ general purpose.
 .Nm
 is only intended for use by internal libc YP functions.
 .Sh RETURN VALUES
-.Rv -std
+If successful,
+.Fn ypconnect
+returns a non-negative integer, the socket file descriptor.
+Otherwise, a value of \-1 is returned and
+.Va errno
+is set to indicate the error.
 .Sh ERRORS
 .Fn ypconnect
 will fail if:



Re: Introduce uvm_pagewait()

2022-07-11 Thread Martin Pieuchot
On 28/06/22(Tue) 14:13, Martin Pieuchot wrote:
> I'd like to abstract the use of PG_WANTED to start unifying & cleaning
> the various cases where a code path is waiting for a busy page.  Here's
> the first step.
> 
> ok?

Anyone?

> Index: uvm/uvm_amap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 uvm_amap.c
> --- uvm/uvm_amap.c30 Aug 2021 16:59:17 -  1.90
> +++ uvm/uvm_amap.c28 Jun 2022 11:53:08 -
> @@ -781,9 +781,7 @@ ReStart:
>* it and then restart.
>*/
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
> - "cownow", INFSLP);
> + uvm_pagewait(pg, amap->am_lock, "cownow");
>   goto ReStart;
>   }
>  
> Index: uvm/uvm_aobj.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
> retrieving revision 1.103
> diff -u -p -r1.103 uvm_aobj.c
> --- uvm/uvm_aobj.c29 Dec 2021 20:22:06 -  1.103
> +++ uvm/uvm_aobj.c28 Jun 2022 11:53:08 -
> @@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
>   while ((pg = RBT_ROOT(uvm_objtree, >memt)) != NULL) {
>   pmap_page_protect(pg, PROT_NONE);
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;
>   }
>   uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
> @@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
>  
>   /* Make sure page is unbusy, else wait for it. */
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE;
>   continue;
>   }
> @@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
>  
>   /* page is there, see if we need to wait on it */
>   if ((ptmp->pg_flags & PG_BUSY) != 0) {
> - atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
> - rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
> - "uao_get", INFSLP);
> + uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;   /* goto top of pps while loop */
>   }
>  
> Index: uvm/uvm_km.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_km.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uvm_km.c
> --- uvm/uvm_km.c  7 Jun 2022 12:07:45 -   1.150
> +++ uvm/uvm_km.c  28 Jun 2022 11:53:08 -
> @@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
>   for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
>   pp = uvm_pagelookup(uobj, curoff);
>   if (pp && pp->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pp->pg_flags, PG_WANTED);
> - rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
> - INFSLP);
> + uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE; /* loop back to us */
>   continue;
>   }
> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.166
> diff -u -p -r1.166 uvm_page.c
> --- uvm/uvm_page.c12 May 2022 12:48:36 -  1.166
> +++ uvm/uv

PATCH: better prime testing for libressl

2022-07-08 Thread Martin Grenouilloux
Hello,

I'm suggesting a diff against master implementing the Baillie-PSW
algorithm for primality testing.
The code itself is commented and explains what is being done
and why.

The reason for this change is that in this paper from 2018
(https://eprint.iacr.org/2018/749.pdf), researchers pointed out
weaknesses in the main primality testing algorithms: one can generate
pseudoprimes for those tests, fooling them into accepting composite
numbers as primes.
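
To illustrate the failure mode with a standalone sketch (not part of
the diff): 341 = 11 * 31 is composite, yet it passes the base-2 Fermat
test, since 2^340 = 1 (mod 341).

	#include <stdio.h>
	#include <stdint.h>

	/* square-and-multiply modular exponentiation; operands stay far
	 * below 2^32 here, so the products cannot overflow uint64_t */
	static uint64_t
	modexp(uint64_t b, uint64_t e, uint64_t m)
	{
		uint64_t r = 1;

		b %= m;
		while (e > 0) {
			if (e & 1)
				r = r * b % m;
			b = b * b % m;
			e >>= 1;
		}
		return r;
	}

	int
	main(void)
	{
		/* prints 1: a composite number passing the Fermat test */
		printf("2^340 mod 341 = %llu\n",
		    (unsigned long long)modexp(2, 340, 341));
		return 0;
	}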

One algorithm that has been studied and has no known pseudoprime is the
Baillie-PSW algorithm.  It is this one Theo Buehler and I implemented,
as it is safer.

Theo Buehler and I have been actively working on it for the past month
to finally present it to you.

Hoping this finds you well,

Regards,

Martin Grenouilloux.
EPITA - LSE
diff --git a/lib/libcrypto/Makefile b/lib/libcrypto/Makefile
index d6432cdc518..7e3a07c5fba 100644
--- a/lib/libcrypto/Makefile
+++ b/lib/libcrypto/Makefile
@@ -89,6 +89,7 @@ SRCS+= bn_print.c bn_rand.c bn_shift.c bn_word.c bn_blind.c
 SRCS+= bn_kron.c bn_sqrt.c bn_gcd.c bn_prime.c bn_err.c bn_sqr.c
 SRCS+= bn_recp.c bn_mont.c bn_mpi.c bn_exp2.c bn_gf2m.c bn_nist.c
 SRCS+= bn_depr.c bn_const.c bn_x931p.c
+SRCS+= bn_bpsw.c bn_isqrt.c
 
 # buffer/
 SRCS+= buffer.c buf_err.c buf_str.c
diff --git a/lib/libcrypto/bn/bn_bpsw.c b/lib/libcrypto/bn/bn_bpsw.c
new file mode 100644
index 000..d198899b2a4
--- /dev/null
+++ b/lib/libcrypto/bn/bn_bpsw.c
@@ -0,0 +1,401 @@
+/*	$OpenBSD$ */
+/*
+ * Copyright (c) 2022 Martin Grenouilloux 
+ * Copyright (c) 2022 Theo Buehler 
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include 
+
+#include "bn_lcl.h"
+#include "bn_prime.h"
+
+/*
+ * For an odd n compute a / 2 (mod n). If a is even, we can do a plain
+ * division, otherwise calculate (a + n) / 2. Then reduce (mod n).
+ */
+static int
+bn_division_by_two_mod_n(BIGNUM *r, BIGNUM *a, const BIGNUM *n, BN_CTX *ctx)
+{
+	if (!BN_is_odd(n))
+		return 0;
+
+	if (!BN_mod_ct(r, a, n, ctx))
+		return 0;
+
+	if (BN_is_odd(r)) {
+		if (!BN_add(r, r, n))
+			return 0;
+	}
+
+	if (!BN_rshift1(r, r))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * Given the next binary digit of k and the current Lucas terms U and V, this
+ * helper computes the next terms in the Lucas sequence defined as follows:
+ *
+ *   U' = U * V  (mod n)
+ *   V' = (V^2 + D * U^2) / 2(mod n)
+ *
+ * If digit == 0, bn_lucas_step() returns U' and V'. If digit == 1, it returns
+ *
+ *   U'' = (U' + V') / 2 (mod n)
+ *   V'' = (V' + D * U') / 2 (mod n)
+ *
+ * Compare with FIPS 186-4, Appendix C.3.3, step 6.
+ */
+static int
+bn_lucas_step(BIGNUM *U, BIGNUM *V, int digit, const BIGNUM *D,
+const BIGNUM *n, BN_CTX *ctx)
+{
+	BIGNUM *tmp;
+	int ret = 0;
+
+	BN_CTX_start(ctx);
+
+	if ((tmp = BN_CTX_get(ctx)) == NULL)
+		goto done;
+
+	/* Store D * U^2 before computing U'. */
+	if (!BN_sqr(tmp, U, ctx))
+		goto done;
+	if (!BN_mul(tmp, D, tmp, ctx))
+		goto done;
+
+	/* U' = U * V (mod n). */
+	if (!BN_mod_mul(U, U, V, n, ctx))
+		goto done;
+
+	/* V' = (V^2 + D * U^2) / 2 (mod n). */
+	if (!BN_sqr(V, V, ctx))
+		goto done;
+	if (!BN_add(V, V, tmp))
+		goto done;
+	if (!bn_division_by_two_mod_n(V, V, n, ctx))
+		goto done;
+
+	if (digit == 1) {
+		/* Store D * U' before computing U''. */
+		if (!BN_mul(tmp, D, U, ctx))
+			goto done;
+
+		/* U'' = (U' + V') / 2 (mod n). */
+		if (!BN_add(U, U, V))
+			goto done;
+		if (!bn_division_by_two_mod_n(U, U, n, ctx))
+			goto done;
+
+		/* V'' = (V' + D * U') / 2 (mod n). */
+		if (!BN_add(V, V, tmp))
+			goto done;
+		if (!bn_division_by_two_mod_n(V, V, n, ctx))
+			goto done;
+	}
+
+	ret = 1;
+
+ done:
+	BN_CTX_end(ctx);
+
+	return ret;
+}
+
+/*
+ * Compute the Lucas terms U_k, V_k, see FIPS 186-4, Appendix C.3.3, steps 4-6.
+ */
+static int
+bn_lucas(BIGNUM *U, BIGNUM *V, const BIGNUM *k, const BIGNUM *D,
+const BIGNUM *n, BN_CTX *ctx)
+{
+	int digit, i;
+	int ret = 0;
+
+	if (!BN_one(U))
+		goto done;
+	if (!BN_one(V))
+		goto done;
+
+	/*
+	 * Iterate over the digits of k from MSB to LSB. Start at digit 2
+	 * since the first digit is dealt with by setting U = 1 and V = 1.
+	 */
+	for (i =

Faster M operation for the swapper to be great again

2022-06-30 Thread Martin Pieuchot
Diff below uses two tricks to make uvm_pagermapin/out() faster and less
likely to fail in OOM situations.

These functions are used to map buffers when swapping pages in/out and
when faulting on mmaped files.  robert@ even measured a 75% improvement
when populating pages related to files that aren't yet in the buffer
cache.

The first trick is to use the direct map when available.  I'm doing this
for single pages but km_alloc(9) also does that for a single segment...
uvm_io() only maps one page at a time for the moment so this should be
enough.

The second trick is to use pmap_kenter_pa() which doesn't fail and is
faster.

With these changes the "freeze" happening on my server when entering
many pages to swap in an OOM situation is much shorter and the machine
quickly becomes responsive again.

ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.81
diff -u -p -r1.81 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 19:07:40 -  1.81
+++ uvm/uvm_pager.c 30 Jun 2022 13:34:46 -
@@ -258,6 +258,16 @@ uvm_pagermapin(struct vm_page **pps, int
vsize_t size;
struct vm_page *pp;
 
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   KASSERT(pps[0]);
+   KASSERT(pps[0]->pg_flags & PG_BUSY);
+   kva = pmap_map_direct(pps[0]);
+   return kva;
+   }
+#endif
+
prot = PROT_READ;
if (flags & UVMPAGER_MAPIN_READ)
prot |= PROT_WRITE;
@@ -273,14 +283,7 @@ uvm_pagermapin(struct vm_page **pps, int
pp = *pps++;
KASSERT(pp);
KASSERT(pp->pg_flags & PG_BUSY);
-   /* Allow pmap_enter to fail. */
-   if (pmap_enter(pmap_kernel(), cva, VM_PAGE_TO_PHYS(pp),
-   prot, PMAP_WIRED | PMAP_CANFAIL | prot) != 0) {
-   pmap_remove(pmap_kernel(), kva, cva);
-   pmap_update(pmap_kernel());
-   uvm_pseg_release(kva);
-   return 0;
-   }
+   pmap_kenter_pa(cva, VM_PAGE_TO_PHYS(pp), prot);
}
pmap_update(pmap_kernel());
return kva;
@@ -294,8 +297,15 @@ uvm_pagermapin(struct vm_page **pps, int
 void
 uvm_pagermapout(vaddr_t kva, int npages)
 {
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   pmap_unmap_direct(kva);
+   return;
+   }
+#endif
 
-   pmap_remove(pmap_kernel(), kva, kva + ((vsize_t)npages << PAGE_SHIFT));
+   pmap_kremove(kva, (vsize_t)npages << PAGE_SHIFT);
pmap_update(pmap_kernel());
uvm_pseg_release(kva);
 



Re: Use SMR instead of SRP list in rtsock.c

2022-06-30 Thread Martin Pieuchot
On 30/06/22(Thu) 11:56, Claudio Jeker wrote:
> On Thu, Jun 30, 2022 at 12:34:33PM +0300, Vitaliy Makkoveev wrote:
> > On Thu, Jun 30, 2022 at 11:08:48AM +0200, Claudio Jeker wrote:
> > > This diff converts the SRP list to a SMR list in rtsock.c
> > > SRP is a bit strange with how it works and the SMR code is a bit easier to
> > > understand. Since we can sleep in the SMR_TAILQ_FOREACH() we need to grab
> > > a refcount on the route pcb so that we can leave the SMR critical section
> > > and then enter the SMR critical section at the end of the loop before
> > > dropping the refcount again.
> > > 
> > > The diff does not immediately explode but I doubt we can exploit
> > > parallelism in route_input() so this may fail at some later stage if it is
> > > wrong.
> > > 
> > > Comments from the lock critics welcome
> > 
> > We use `so_lock' rwlock(9) to protect route domain sockets. We can't
> > convert this SRP list to SMR list because we call solock() within
> > foreach loop.

We shouldn't use SRP list either, no?  Or are we allowed to sleep
holding a SRP reference?  That's the question that triggered this diff.
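
FWIW the pattern being discussed looks roughly like this (a sketch
only; the route_input() hunk is truncated in the quoted diff below):

	smr_read_enter();
	SMR_TAILQ_FOREACH(rop, &rtptable.rtp_list, rop_list) {
		refcnt_take(&rop->rop_refcnt);
		smr_read_leave();

		/* solock(), deliver the routing message, may sleep */

		smr_read_enter();
		refcnt_rele_wake(&rop->rop_refcnt);
	}
	smr_read_leave();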

> because of the so_lock the code uses a refcnt on the route pcb to make
> sure that the object is not freed while we sleep. So that is handled by
> this diff.
>  
> > We can easily crash kernel by running in parallel some "route monitor"
> > commands and "while true; ifconfig vether0 create ; ifconfig vether0
> > destroy; done".
> 
> That does not cause problem on my system.
>  
> > > -- 
> > > :wq Claudio
> > > 
> > > Index: sys/net/rtsock.c
> > > ===
> > > RCS file: /cvs/src/sys/net/rtsock.c,v
> > > retrieving revision 1.334
> > > diff -u -p -r1.334 rtsock.c
> > > --- sys/net/rtsock.c  28 Jun 2022 10:01:13 -  1.334
> > > +++ sys/net/rtsock.c  30 Jun 2022 08:02:09 -
> > > @@ -71,7 +71,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include <sys/srp.h>
> > > +#include <sys/smr.h>
> > >  
> > >  #include 
> > >  #include 
> > > @@ -107,8 +107,6 @@ struct walkarg {
> > >  };
> > >  
> > >  void route_prinit(void);
> > > -void rcb_ref(void *, void *);
> > > -void rcb_unref(void *, void *);
> > >  int  route_output(struct mbuf *, struct socket *, struct sockaddr *,
> > >   struct mbuf *);
> > >  int  route_ctloutput(int, struct socket *, int, int, struct mbuf *);
> > > @@ -149,7 +147,7 @@ intrt_setsource(unsigned int, struct 
> > >  struct rtpcb {
> > >   struct socket   *rop_socket;/* [I] */
> > >  
> > > - SRPL_ENTRY(rtpcb)   rop_list;
> > > + SMR_TAILQ_ENTRY(rtpcb)  rop_list;
> > >   struct refcnt   rop_refcnt;
> > >   struct timeout  rop_timeout;
> > >   unsigned introp_msgfilter;  /* [s] */
> > > @@ -162,8 +160,7 @@ struct rtpcb {
> > >  #define  sotortpcb(so)   ((struct rtpcb *)(so)->so_pcb)
> > >  
> > >  struct rtptable {
> > > - SRPL_HEAD(, rtpcb)  rtp_list;
> > > - struct srpl_rc  rtp_rc;
> > > + SMR_TAILQ_HEAD(, rtpcb) rtp_list;
> > >   struct rwlock   rtp_lk;
> > >   unsigned intrtp_count;
> > >  };
> > > @@ -185,29 +182,12 @@ struct rtptable rtptable;
> > >  void
> > >  route_prinit(void)
> > >  {
> > > - srpl_rc_init(&rtptable.rtp_rc, rcb_ref, rcb_unref, NULL);
> > >   rw_init(&rtptable.rtp_lk, "rtsock");
> > > - SRPL_INIT(&rtptable.rtp_list);
> > > + SMR_TAILQ_INIT(&rtptable.rtp_list);
> > >   pool_init(&rtpcb_pool, sizeof(struct rtpcb), 0,
> > >   IPL_SOFTNET, PR_WAITOK, "rtpcb", NULL);
> > >  }
> > >  
> > > -void
> > > -rcb_ref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_take(&rop->rop_refcnt);
> > > -}
> > > -
> > > -void
> > > -rcb_unref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_rele_wake(&rop->rop_refcnt);
> > > -}
> > > -
> > >  int
> > >  route_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf 
> > > *nam,
> > >  struct mbuf *control, struct proc *p)
> > > @@ -325,8 +305,7 @@ route_attach(struct socket *so, int prot
> > >   so->so_options |= SO_USELOOPBACK;
> > >  
> > >   rw_enter(&rtptable.rtp_lk, RW_WRITE);
> > > - SRPL_INSERT_HEAD_LOCKED(&rtptable.rtp_rc, &rtptable.rtp_list, rop,
> > > - rop_list);
> > > + SMR_TAILQ_INSERT_HEAD_LOCKED(&rtptable.rtp_list, rop, rop_list);
> > >   rtptable.rtp_count++;
> > >   rw_exit(&rtptable.rtp_lk);
> > >  
> > > @@ -347,8 +326,7 @@ route_detach(struct socket *so)
> > >   rw_enter(&rtptable.rtp_lk, RW_WRITE);
> > >  
> > >   rtptable.rtp_count--;
> > > - SRPL_REMOVE_LOCKED(&rtptable.rtp_rc, &rtptable.rtp_list, rop, rtpcb,
> > > - rop_list);
> > > + SMR_TAILQ_REMOVE_LOCKED(&rtptable.rtp_list, rop, rop_list);
> > >   rw_exit(&rtptable.rtp_lk);
> > >  
> > >   sounlock(so);
> > > @@ -356,6 +334,7 @@ route_detach(struct socket *so)
> > >   /* wait for all references to drop */
> > >   refcnt_finalize(&rop->rop_refcnt, "rtsockrefs");
> > >   timeout_del_barrier(&rop->rop_timeout);
> > > + smr_barrier();
> > >  
> > >   solock(so);
> > >  
> > > @@ -501,7 +480,6 @@ route_input(struct mbuf *m0, struct sock
> > >   struct rtpcb *rop;
> > >   

Re: arp llinfo mutex

2022-06-29 Thread Martin Pieuchot
On 29/06/22(Wed) 19:40, Alexander Bluhm wrote:
> Hi,
> 
> To fix the KASSERT(la != NULL) we have to protect the rt_llinfo
> with a mutex.  The idea is to keep rt_llinfo and RTF_LLINFO consistent.
> Also do not put the mutex in the fast path.

Losing the RTM_ADD/DELETE race is not a bug.  I would not add a printf
in these cases.  I understand you might want one for debugging purposes
but I don't see any value in committing it.  Do you agree?

Note that sometimes the code checks the RTF_LLINFO flag and sometimes
rt_llinfo != NULL.  This is inconsistent and a bit confusing now that
we use a mutex to protect those states.

Could you document that rt_llinfo is now protected by the mutex (or
KERNEL_LOCK())?

Anyway this is an improvement ok mpi@

PS: What about ND6?

> Index: netinet/if_ether.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/if_ether.c,v
> retrieving revision 1.250
> diff -u -p -r1.250 if_ether.c
> --- netinet/if_ether.c27 Jun 2022 20:47:10 -  1.250
> +++ netinet/if_ether.c28 Jun 2022 14:00:12 -
> @@ -101,6 +101,8 @@ void arpreply(struct ifnet *, struct mbu
>  unsigned int);
>  
>  struct niqueue arpinq = NIQUEUE_INITIALIZER(50, NETISR_ARP);
> +
> +/* llinfo_arp live time, rt_llinfo and RTF_LLINFO are protected by arp_mtx */
>  struct mutex arp_mtx = MUTEX_INITIALIZER(IPL_SOFTNET);
>  
>  LIST_HEAD(, llinfo_arp) arp_list; /* [mN] list of all llinfo_arp structures 
> */
> @@ -155,7 +157,7 @@ void
>  arp_rtrequest(struct ifnet *ifp, int req, struct rtentry *rt)
>  {
>   struct sockaddr *gate = rt->rt_gateway;
> - struct llinfo_arp *la = (struct llinfo_arp *)rt->rt_llinfo;
> + struct llinfo_arp *la;
>   time_t uptime;
>  
>   NET_ASSERT_LOCKED();
> @@ -171,7 +173,7 @@ arp_rtrequest(struct ifnet *ifp, int req
>   rt->rt_expire = 0;
>   break;
>   }
> - if ((rt->rt_flags & RTF_LOCAL) && !la)
> + if ((rt->rt_flags & RTF_LOCAL) && rt->rt_llinfo == NULL)
>   rt->rt_expire = 0;
>   /*
>* Announce a new entry if requested or warn the user
> @@ -192,44 +194,54 @@ arp_rtrequest(struct ifnet *ifp, int req
>   }
>   satosdl(gate)->sdl_type = ifp->if_type;
>   satosdl(gate)->sdl_index = ifp->if_index;
> - if (la != NULL)
> - break; /* This happens on a route change */
>   /*
>* Case 2:  This route may come from cloning, or a manual route
>* add with a LL address.
>*/
>   la = pool_get(&arp_pool, PR_NOWAIT | PR_ZERO);
> - rt->rt_llinfo = (caddr_t)la;
>   if (la == NULL) {
>   log(LOG_DEBUG, "%s: pool get failed\n", __func__);
>   break;
>   }
>  
> + mtx_enter(&arp_mtx);
> + if (rt->rt_llinfo != NULL) {
> + /* we lost the race, another thread has entered it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo exists\n", __func__);
> + pool_put(&arp_pool, la);
> + break;
> + }
>   mq_init(&la->la_mq, LA_HOLD_QUEUE, IPL_SOFTNET);
> + rt->rt_llinfo = (caddr_t)la;
>   la->la_rt = rt;
>   rt->rt_flags |= RTF_LLINFO;
> + LIST_INSERT_HEAD(&arp_list, la, la_list);
>   if ((rt->rt_flags & RTF_LOCAL) == 0)
>   rt->rt_expire = uptime;
> - mtx_enter(&arp_mtx);
> - LIST_INSERT_HEAD(&arp_list, la, la_list);
>   mtx_leave(&arp_mtx);
> +
>   break;
>  
>   case RTM_DELETE:
> - if (la == NULL)
> - break;
>   mtx_enter(&arp_mtx);
> + la = (struct llinfo_arp *)rt->rt_llinfo;
> + if (la == NULL) {
> + /* we lost the race, another thread has removed it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo missing\n", __func__);
> + break;
> + }
>   LIST_REMOVE(la, la_list);
> - mtx_leave(&arp_mtx);
>   rt->rt_llinfo = NULL;
>   rt->rt_flags &= ~RTF_LLINFO;
>   atomic_sub_int(&la_hold_total, mq_purge(&la->la_mq));
> + mtx_leave(&arp_mtx);
> +
>   pool_put(&arp_pool, la);
>   break;
>  
>   case RTM_INVALIDATE:
> - if (la == NULL)
> - break;
>   if (!ISSET(rt->rt_flags, RTF_LOCAL))
>   arpinvalidate(rt);
>   break;
> @@ -363,8 +375,6 @@ arpresolve(struct ifnet *ifp, struct rte
>   goto bad;
>   }
>  
> - la = (struct llinfo_arp *)rt->rt_llinfo;
> - KASSERT(la != NULL);
>  
>   /*
>* Check the 

Simplify aiodone daemon

2022-06-29 Thread Martin Pieuchot
The aiodone daemon accounts for and frees/releases pages that were
written to swap.  It is only used for asynchronous writes.  The diff
below uses this knowledge to:

- Stop suggesting that uvm_swap_get() can be asynchronous.  There's an
  assert for PGO_SYNCIO 3 lines above.

- Remove unused support for asynchronous read, including error
  conditions, from uvm_aio_aiodone_pages().

- Grab the proper lock for each page that has been written to swap.
  This allows enabling an assert in uvm_page_unbusy().

- Move the uvm_anon_release() call outside of uvm_page_unbusy() and
  assert for the different anon cases.  This will allow us to unify
  code paths waiting for busy pages.

This is adapted/simplified from what is in NetBSD.

ok?

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  29 Jun 2022 11:16:35 -
@@ -143,7 +143,6 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +240,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  29 Jun 2022 11:16:35 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  29 Jun 2022 11:47:55 -
@@ -1036,13 +1036,14 @@ uvm_pagefree(struct vm_page *pg)
  * uvm_page_unbusy: unbusy an array of pages.
  *
  * => pages must either all belong to the same object, or all belong to anons.
+ * => if pages are object-owned, object must be locked.
  * => if pages are anon-owned, anons must have 0 refcount.
+ * => caller must make sure that anon-owned pages are not PG_RELEASED.
  */
 void
 uvm_page_unbusy(struct vm_page **pgs, int npgs)
 {
struct vm_page *pg;
-   struct uvm_object *uobj;
int i;
 
for (i = 0; i < npgs; i++) {
@@ -1052,35 +1053,19 @@ uvm_page_unbusy(struct vm_page **pgs, in
continue;
}
 
-#if notyet
-   /*
- * XXX swap case in uvm_aio_aiodone() is not holding the lock.
-*
-* This isn't compatible with the PG_RELEASED anon case below.
-*/
KASSERT(uvm_page_owner_locked_p(pg));
-#endif
KASSERT(pg->pg_flags & PG_BUSY);
 
if (pg->pg_flags & PG_WANTED) {
wakeup(pg);
}
if (pg->pg_flags & PG_RELEASED) {
-   uobj = pg->uobject;
-   if (uobj != NULL) {
-   uvm_lock_pageq();
-   pmap_page_protect(pg, PROT_NONE);
-   /* XXX won't happen right now */
-   if (pg->pg_flags & PQ_AOBJ)
-   uao_dropswap(uobj,
-   pg->offset >> PAGE_SHIFT);
-   uvm_pagefree(pg);
-   uvm_unlock_pageq();
-   } else {
-   rw_enter(pg->uanon->an_lock, RW_WRITE);
-   uvm_anon_release(pg->uanon);
-   }
+   KASSERT(pg->uobject != NULL ||
+   (pg->uanon != NULL && pg->uanon->an_ref > 0));
+   atomic_clearbits_int(&pg->pg_flags, PG_RELEASED);
+   uvm_pagefree(pg);
} else {
+   KASSERT((pg->pg_flags & PG_FAKE) == 0);
atomic_clearbits_int(&pg->pg_flags, PG_WANTED|PG_BUSY);
UVM_PAGE_OWN(pg, NULL);
   

Re: Unlocking pledge(2)

2022-06-28 Thread Martin Pieuchot
On 28/06/22(Tue) 18:17, Jeremie Courreges-Anglas wrote:
> 
> Initially I just wandered in syscall_mi.h and found the locking scheme
> super weird, even if technically correct.  pledge_syscall() better be
> safe to call without the kernel lock so I don't understand why we're
> sometimes calling it with the kernel lock and sometimes not.
> 
> ps_pledge is 64 bits so it's not possible to unset bits in an atomic
> manner on all architectures.  Even if we're only removing bits and there
> is probably no way to see a completely garbage value, it makes sense to
> just protect ps_pledge (and ps_execpledge) in the usual manner so that
> we can unlock the syscall.  The diff below protects the fields using
> ps_mtx even though I initially used a dedicated ps_pledge_mtx.
> unveil_destroy() needs to be moved after the critical section.
> regress/sys/kern/pledge looks happy with this.  The sys/syscall_mi.h
> change can be committed in a separate step.
> 
> Input and oks welcome.

This looks nice.  I doubt there's any existing program where you can
really test this.  Even firefox and chromium should do things
correctly.

Maybe you should write a regress test that tries to break the kernel.
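
For reference, the accessor presumably looks something like this (a
sketch only; the quoted diff below is truncated before the
kern_pledge.c hunk that adds pledge_get()):

	uint64_t
	pledge_get(struct process *pr)
	{
		uint64_t pledge;

		mtx_enter(&pr->ps_mtx);
		pledge = pr->ps_pledge;
		mtx_leave(&pr->ps_mtx);

		return pledge;
	}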

> Index: arch/amd64/amd64/vmm.c
> ===
> RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.315
> diff -u -p -r1.315 vmm.c
> --- arch/amd64/amd64/vmm.c27 Jun 2022 15:12:14 -  1.315
> +++ arch/amd64/amd64/vmm.c28 Jun 2022 13:54:25 -
> @@ -713,7 +713,7 @@ pledge_ioctl_vmm(struct proc *p, long co
>   case VMM_IOC_CREATE:
>   case VMM_IOC_INFO:
>   /* The "parent" process in vmd forks and manages VMs */
> - if (p->p_p->ps_pledge & PLEDGE_PROC)
> + if (pledge_get(p->p_p) & PLEDGE_PROC)
>   return (0);
>   break;
>   case VMM_IOC_TERM:
> @@ -1312,7 +1312,7 @@ vm_find(uint32_t id, struct vm **res)
>* The managing vmm parent process can lookup all
>* all VMs and is indicated by PLEDGE_PROC.
>*/
> - if (((p->p_p->ps_pledge &
> + if (((pledge_get(p->p_p) &
>   (PLEDGE_VMM | PLEDGE_PROC)) == PLEDGE_VMM) &&
>   (vm->vm_creator_pid != p->p_p->ps_pid))
>   return (pledge_fail(p, EPERM, PLEDGE_VMM));
> Index: kern/init_sysent.c
> ===
> RCS file: /home/cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.238
> diff -u -p -r1.238 init_sysent.c
> --- kern/init_sysent.c27 Jun 2022 14:26:05 -  1.238
> +++ kern/init_sysent.c28 Jun 2022 15:18:25 -
> @@ -1,10 +1,10 @@
> -/*   $OpenBSD: init_sysent.c,v 1.238 2022/06/27 14:26:05 cheloha Exp $   
> */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
>   *
>   * DO NOT EDIT-- this file is automatically generated.
> - * created from; OpenBSD: syscalls.master,v 1.224 2022/05/16 07:36:04 
> mvs Exp 
> + * created from; OpenBSD: syscalls.master,v 1.225 2022/06/27 14:26:05 
> cheloha Exp 
>   */
>  
>  #include 
> @@ -248,7 +248,7 @@ const struct sysent sysent[] = {
>   sys_listen },   /* 106 = listen */
>   { 4, s(struct sys_chflagsat_args), 0,
>   sys_chflagsat },/* 107 = chflagsat */
> - { 2, s(struct sys_pledge_args), 0,
> + { 2, s(struct sys_pledge_args), SY_NOLOCK | 0,
>   sys_pledge },   /* 108 = pledge */
>   { 4, s(struct sys_ppoll_args), 0,
>   sys_ppoll },/* 109 = ppoll */
> Index: kern/kern_event.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_event.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 kern_event.c
> --- kern/kern_event.c 27 Jun 2022 13:35:21 -  1.191
> +++ kern/kern_event.c 28 Jun 2022 13:55:18 -
> @@ -331,7 +331,7 @@ filt_procattach(struct knote *kn)
>   int s;
>  
>   if ((curproc->p_p->ps_flags & PS_PLEDGE) &&
> - (curproc->p_p->ps_pledge & PLEDGE_PROC) == 0)
> + (pledge_get(curproc->p_p) & PLEDGE_PROC) == 0)
>   return pledge_fail(curproc, EPERM, PLEDGE_PROC);
>  
>   if (kn->kn_id > PID_MAX)
> Index: kern/kern_pledge.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_pledge.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 kern_pledge.c
> --- kern/kern_pledge.c26 Jun 2022 06:11:49 -  1.282
> +++ kern/kern_pledge.c28 Jun 2022 15:21:46 -
> @@ -21,6 +21,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -465,13 +466,26 @@ sys_pledge(struct proc *p, void *v, regi
>   struct process *pr = p->p_p;
>   uint64_t 

Re: Fix the swapper

2022-06-28 Thread Martin Pieuchot
On 27/06/22(Mon) 15:44, Martin Pieuchot wrote:
> Diff below contains 3 parts that can be committed independently.  The 3
> of them are necessary to allow the pagedaemon to make progress in OOM
> situation and to satisfy all the allocations waiting for pages in
> specific ranges.
> 
> * uvm/uvm_pager.c part reserves a second segment for the page daemon.
>   This is necessary to ensure the two uvm_pagermapin() calls needed by
>   uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
>   necessary when encryption or bouncing is required)
> 
> * uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
>   for the same reason.  Note that a sleeping point is introduced because
>   the pagedaemon is faster than the asynchronous I/O and in OOM
>   situation it tends to stay busy building cluster that it then discard
>   because no memory is available.
> 
> * uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
>   list of pages to account for a given memory range.  Without this the
>   daemon could spin infinitely doing nothing because the global limits
>   are reached.

Here's an updated diff with a fix on top:

 * in uvm/uvm_swap.c make sure uvm_swap_allocpages() is allowed to sleep
   when coming from uvm_fault().  This makes the faulting process wait
   instead of dying when there aren't any free pages to do the bouncing.

I'd appreciate more reviews and tests !
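
For the uvm_pdaemon.c part, the range check in the inner loop amounts
to something like this (a sketch only; the diff below is truncated
before that hunk, and the uvm_pmalloc field names are my assumption
from uvm_pmemrange.h):

	/* skip pages outside the range the waiting allocation needs */
	if (pma != NULL) {
		paddr_t pa = VM_PAGE_TO_PHYS(p);
		if (pa < pma->pm_constraint.ucr_low ||
		    pa >= pma->pm_constraint.ucr_high)
			continue;
	}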

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.80
diff -u -p -r1.80 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 12:10:37 -  1.80
+++ uvm/uvm_pager.c 28 Jun 2022 15:25:30 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == [0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == [0] || pseg == [1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup(&psegs);
 
-   if (pseg != &psegs[0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != &psegs[0] && pseg != &psegs[1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   28 Jun 2022 13:59:49 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-

Introduce uvm_pagewait()

2022-06-28 Thread Martin Pieuchot
I'd like to abstract the use of PG_WANTED to start unifying & cleaning
the various cases where a code path is waiting for a busy page.  Here's
the first step.
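
The resulting caller pattern, taken from the diff below, is:

	if (pg->pg_flags & PG_BUSY) {
		uvm_pagewait(pg, uobj->vmobjlock, "...");
		rw_enter(uobj->vmobjlock, RW_WRITE);
		continue;
	}

Note the PNORELOCK in uvm_pagewait(): it returns with the lock
released, so looping callers re-take it themselves.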

ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.90
diff -u -p -r1.90 uvm_amap.c
--- uvm/uvm_amap.c  30 Aug 2021 16:59:17 -  1.90
+++ uvm/uvm_amap.c  28 Jun 2022 11:53:08 -
@@ -781,9 +781,7 @@ ReStart:
 * it and then restart.
 */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
-   "cownow", INFSLP);
+   uvm_pagewait(pg, amap->am_lock, "cownow");
goto ReStart;
}
 
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  28 Jun 2022 11:53:08 -
@@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
while ((pg = RBT_ROOT(uvm_objtree, >memt)) != NULL) {
pmap_page_protect(pg, PROT_NONE);
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;
}
uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
@@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
 
/* Make sure page is unbusy, else wait for it. */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE;
continue;
}
@@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
 
/* page is there, see if we need to wait on it */
if ((ptmp->pg_flags & PG_BUSY) != 0) {
-   atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
-   rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
-   "uao_get", INFSLP);
+   uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;   /* goto top of pps while loop */
}
 
Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_km.c
--- uvm/uvm_km.c7 Jun 2022 12:07:45 -   1.150
+++ uvm/uvm_km.c28 Jun 2022 11:53:08 -
@@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pp->pg_flags, PG_WANTED);
-   rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
-   INFSLP);
+   uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE; /* loop back to us */
continue;
}
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  28 Jun 2022 11:57:42 -
@@ -1087,6 +1087,23 @@ uvm_page_unbusy(struct vm_page **pgs, in
}
 }
 
+/*
+ * uvm_pagewait: wait for a busy page
+ *
+ * => page must be known PG_BUSY
+ * => object must be locked
+ * => object will be unlocked on return
+ */
+void
+uvm_pagewait(struct vm_page *pg, struct rwlock *lock, const char *wmesg)
+{
+   KASSERT(rw_lock_held(lock));
+   KASSERT((pg->pg_flags & PG_BUSY) != 0);
+
+   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
+   rwsleep_nsec(pg, lock, PVM | PNORELOCK, wmesg, INFSLP);
+}
+
 #if defined(UVM_PAGE_TRKOWN)
 /*
  * uvm_page_own: set or 

Re: kernel lock in arp

2022-06-27 Thread Martin Pieuchot
On 27/06/22(Mon) 19:11, Alexander Bluhm wrote:
> On Mon, Jun 27, 2022 at 11:49:23AM +0200, Alexander Bluhm wrote:
> > On Sat, May 21, 2022 at 10:50:28PM +0300, Vitaliy Makkoveev wrote:
> > > This diff looks good, except the re-check after the kernel lock.
> > > It's supposed `rt' could become inconsistent, right?  But what
> > > stops it from becoming inconsistent after the first unlocked
> > > RTF_LLINFO flag check?
> > >
> > > I think this re-check should be gone.
> >
> > I have copied the re-check from internal genua code.  I am not sure
> > if it is really needed.  We know from Hrvoje that the diff with
> > re-check is stable.  And we know that it crashes without kernel
> > lock at all.
> >
> > I have talked with mpi@ about it.  The main problem is that we have
> > no write lock when we change RTF_LLINFO.  Then rt_llinfo can get
> > NULL or inconsistent.
> >
> > Plan is that I put some lock asserts into route add and delete.
> > This helps to find the parts that modify RTF_LLINFO and rt_llinfo
> > without exclusive lock.
> >
> > Maybe we need some kernel lock somewhere else.  Or we want to use
> > some ARP mutex.  We could also add some comment and commit the diff
> > that I have.  We know that it is faster and stable.  Pushing the
> > kernel lock down or replacing it with something clever can always
> > be done later.
> 
> We need the re-check.  I have tested it with a printf.  It is
> triggered by running arp -d in a loop while forwarding.
> 
> The concurrent threads are these:
> 
> rtrequest_delete(8000246b7428,3,80775048,8000246b7510,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a23550,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246b7678,3,8000246b7718,0) at rtrequest+0x55c
> rt_clone(8000246b7780,8000246b78f8,0) at rt_clone+0x73
> rtalloc_mpath(8000246b78f8,fd8003169ad8,0) at rtalloc_mpath+0x4c
> ip_forward(fd80b8cc7e00,8077d048,fd8834a230f0,0) at 
> ip_forward+0x137
> ip_input_if(8000246b7a28,8000246b7a34,4,0,8077d048) at 
> ip_input_if+0x353
> ipv4_input(8077d048,fd80b8cc7e00) at ipv4_input+0x39
> ether_input(8077d048,fd80b8cc7e00) at ether_input+0x3ad
> if_input_process(8077d048,8000246b7b18) at if_input_process+0x6f
> ifiq_process(8077d458) at ifiq_process+0x69
> taskq_thread(80036080) at taskq_thread+0x100
> 
> rtrequest_delete(8000246c8d08,3,80775048,8000246c8df0,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a230f0,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246c8f58,3,8000246c8ff8,0) at rtrequest+0x55c
> rt_clone(8000246c9060,8000246c90b8,0) at rt_clone+0x73
> rtalloc_mpath(8000246c90b8,fd8002c754d8,0) at rtalloc_mpath+0x4c
> in_ouraddr(fd8094771b00,8077d048,8000246c9138) at 
> in_ouraddr+0x84
> ip_input_if(8000246c91d8,8000246c91e4,4,0,8077d048) at 
> ip_input_if+0x1cd
> ipv4_input(8077d048,fd8094771b00) at ipv4_input+0x39
> ether_input(8077d048,fd8094771b00) at ether_input+0x3ad
> if_input_process(8077d048,8000246c92c8) at if_input_process+0x6f
> ifiq_process(80781400) at ifiq_process+0x69
> taskq_thread(80036200) at taskq_thread+0x100
> 
> I have added a comment why kernel lock protects us.  I would like
> to get this in.  It has been tested, reduces the kernel lock and
> is faster.  A more clever lock can be done later.
> 
> ok?

I don't understand how the KERNEL_LOCK() there prevents rtdeletemsg()
from running.  rtrequest_delete() seems completely broken; it assumes
it holds an exclusive lock.

To "fix" arp the KERNEL_LOCK() should also be taken in RTM_DELETE and
RTM_RESOLVE inside arp_rtrequest().  Or maybe around ifp->if_rtrequest().

But it doesn't mean there isn't another problem in rtdeletemsg()...
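
The second option would amount to something like this at the
ifp->if_rtrequest() call sites (a sketch only, untested):

	KERNEL_LOCK();
	ifp->if_rtrequest(ifp, req, rt);
	KERNEL_UNLOCK();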

> Index: net/if_ethersubr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> retrieving revision 1.281
> diff -u -p -r1.281 if_ethersubr.c
> --- net/if_ethersubr.c26 Jun 2022 21:19:53 -  1.281
> +++ net/if_ethersubr.c27 Jun 2022 16:55:15 -
> @@ -221,10 +221,7 @@ ether_resolve(struct ifnet *ifp, struct 
>  
>   switch (af) {
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> - KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IP);
> @@ -285,10 +282,7 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #endif
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> -

CoW & neighbor pages

2022-06-27 Thread Martin Pieuchot
When faulting a page after COW, neighboring pages are likely to be
entered already.  So speed up the fault by doing a narrow fault (do not
try to map in adjacent pages).

This is stolen from NetBSD.

ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.129
diff -u -p -r1.129 uvm_fault.c
--- uvm/uvm_fault.c 4 Apr 2022 09:27:05 -   1.129
+++ uvm/uvm_fault.c 27 Jun 2022 17:05:26 -
@@ -737,6 +737,16 @@ uvm_fault_check(struct uvm_faultinfo *uf
}
 
/*
+* for a case 2B fault waste no time on adjacent pages because
+* they are likely already entered.
+*/
+   if (uobj != NULL && amap != NULL &&
+   (flt->access_type & PROT_WRITE) != 0) {
+   /* wide fault (!narrow) */
+   flt->narrow = TRUE;
+   }
+
+   /*
 * establish range of interest based on advice from mapper
 * and then clip to fit map entry.   note that we only want
 * to do this the first time through the fault.   if we



Fix the swapper

2022-06-27 Thread Martin Pieuchot
Diff below contains 3 parts that can be committed independently.  The 3
of them are necessary to allow the pagedaemon to make progress in OOM
situation and to satisfy all the allocations waiting for pages in
specific ranges.

* uvm/uvm_pager.c part reserves a second segment for the page daemon.
  This is necessary to ensure the two uvm_pagermapin() calls needed by
  uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
  necessary when encryption or bouncing is required)

* uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
  for the same reason.  Note that a sleeping point is introduced because
  the pagedaemon is faster than the asynchronous I/O and in OOM
  situation it tends to stay busy building cluster that it then discard
  because no memory is available.

* uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
  list of pages to account for a given memory range.  Without this the
  daemon could spin infinitely doing nothing because the global limits
  are reached.

A lot could be improved, but this at least makes swapping work in OOM
situations.

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.78
diff -u -p -r1.78 uvm_pager.c
--- uvm/uvm_pager.c 18 Feb 2022 09:04:38 -  1.78
+++ uvm/uvm_pager.c 27 Jun 2022 08:44:41 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup(&psegs);
 
-   if (pseg != &psegs[0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != &psegs[0] && pseg != &psegs[1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   27 Jun 2022 13:24:54 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-   struct vm_page *swpps[MAXBSIZE >> PAGE_SHIFT];  /* XXX: see below */
+   struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -404,8 +404,27 @@ uvmpd_scan_inactive(struct pglist *pglst
swnpages = swcpages = 0;
free = 0;
dirtyreacts = 0;
+   p = NULL;
 
-   for (p 

pdaemon: reserve memory for swapping

2022-06-26 Thread Martin Pieuchot
uvm_swap_io() needs to perform up to 4 allocations to write pages to
disk.  In OOM situations uvm_swap_allocpages() always fails because the
kernel doesn't reserve enough pages.

Diff below sets `uvmexp.reserve_pagedaemon' to the number of pages
needed to write a cluster of pages to disk.  With this my machine does
not deadlock and can push pages to swap in the OOM case.
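
For instance on amd64, MAXBSIZE is 64KB and pages are 4KB, so
MAXBSIZE >> PAGE_SHIFT reserves 16 pages for the page daemon (and the
diff below then reserves 16 + 4 = 20 for the kernel).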

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  26 Jun 2022 08:17:34 -
@@ -280,10 +280,13 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 
/*
 * init reserve thresholds
-* XXXCDC - values may need adjusting
+*
+* The pagedaemon needs to always be able to write pages to disk.
+* Reserve the minimum amount of pages, a cluster, required by
+* uvm_swap_allocpages().
 */
-   uvmexp.reserve_pagedaemon = 4;
-   uvmexp.reserve_kernel = 8;
+   uvmexp.reserve_pagedaemon = (MAXBSIZE >> PAGE_SHIFT);
+   uvmexp.reserve_kernel = uvmexp.reserve_pagedaemon + 4;
uvmexp.anonminpct = 10;
uvmexp.vnodeminpct = 10;
uvmexp.vtextminpct = 5;



Re: ssh-add(1): fix NULL in fprintf

2022-06-16 Thread Martin Vahlensieck
ping, diff attached

On Mon, May 16, 2022 at 09:21:42PM +0200, Martin Vahlensieck wrote:
> Hi
> 
> What's the status on this?  Anything required from my side?  I have
> reattached the patch (with the changes Theo suggested).
> 
> Best,
> 
> Martin
> 
> On Mon, May 09, 2022 at 08:39:38PM +0200, Martin Vahlensieck wrote:
> > On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> > > Martin Vahlensieck  wrote:
> > > 
> > > > if (!qflag) {
> > > > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > > -   sshkey_type(key), comment);
> > > > +   fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > > > +   sshkey_type(key), comment ? " (" : "",
> > > > +   comment ? comment : "", comment ? ")" : "");
> > > 
> > > this is probably better as something like
> > > 
> > > > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > > -   sshkey_type(key), comment ? comment : "no comment");
> > > 
> > > Which has a minor ambiguity, but probably harms noone.
> > > 
> > 
> > Index: ssh-add.c
> > ===
> > RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> > retrieving revision 1.165
> > diff -u -p -r1.165 ssh-add.c
> > --- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
> > +++ ssh-add.c   9 May 2022 18:36:54 -
> > @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
> > }
> > if (!qflag) {
> > fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -   sshkey_type(key), comment);
> > +   sshkey_type(key), comment ? comment : "no comment");
> > }
> > return 0;
> >  }
> > @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
> > certpath, filename);
> > sshkey_free(cert);
> > goto out;
> > -   } 
> > +   }
> >  
> > /* Graft with private bits */
> > if ((r = sshkey_to_certified(private)) != 0) {
> 
> Index: ssh-add.c
> ===
> RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 ssh-add.c
> --- ssh-add.c 4 Feb 2022 02:49:17 -   1.165
> +++ ssh-add.c 9 May 2022 18:36:54 -
> @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
>   }
>   if (!qflag) {
>   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> - sshkey_type(key), comment);
> + sshkey_type(key), comment ? comment : "no comment");
>   }
>   return 0;
>  }
> @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
>   certpath, filename);
>   sshkey_free(cert);
>   goto out;
> - } 
> + }
>  
>   /* Graft with private bits */
>   if ((r = sshkey_to_certified(private)) != 0) {
> 

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 18:36:54 -
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
}
if (!qflag) {
fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   sshkey_type(key), comment ? comment : "no comment");
}
return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



Re: set RTF_DONE in sysctl_dumpentry for the routing table

2022-06-08 Thread Martin Pieuchot
On 08/06/22(Wed) 16:13, Claudio Jeker wrote:
> Noticed while hacking on OpenBGPD.  Unlike routing socket messages, the
> messages from the sysctl interface have RTF_DONE not set.
> I think it would make sense to set RTF_DONE also in this case since it
> makes reusing code easier.
> 
> All messages sent out via sysctl_dumpentry() have been processed by the
> kernel so setting RTF_DONE kind of makes sense.

I agree, ok mpi@
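
For the archive, a hedged sketch of the consumer side this helps (my
illustration; handle_rtm() is a hypothetical helper, not code from the
diff): a daemon can now use the same guard for messages read from the
routing socket and from the sysctl dump.

#include <net/route.h>

void
handle_rtm(struct rt_msghdr *rtm)
{
	if ((rtm->rtm_flags & RTF_DONE) == 0)
		return;		/* kernel has not finished this message */
	/* ... identical parsing path for both message sources ... */
}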

> -- 
> :wq Claudio
> 
> Index: rtsock.c
> ===
> RCS file: /cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.328
> diff -u -p -r1.328 rtsock.c
> --- rtsock.c  6 Jun 2022 14:45:41 -   1.328
> +++ rtsock.c  8 Jun 2022 14:10:20 -
> @@ -1987,7 +1987,7 @@ sysctl_dumpentry(struct rtentry *rt, voi
>   struct rt_msghdr *rtm = (struct rt_msghdr *)w->w_tmem;
>  
>   rtm->rtm_pid = curproc->p_p->ps_pid;
> - rtm->rtm_flags = rt->rt_flags;
> + rtm->rtm_flags = RTF_DONE | rt->rt_flags;
>   rtm->rtm_priority = rt->rt_priority & RTP_MASK;
>   rtm_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx);
>   /* Do not account the routing table's reference. */
> 



Re: Fix clearing of sleep timeouts

2022-06-06 Thread Martin Pieuchot
On 06/06/22(Mon) 06:47, David Gwynne wrote:
> On Sun, Jun 05, 2022 at 03:57:39PM +, Visa Hankala wrote:
> > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote:
> > > On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> > > > Encountered the following panic:
> > > > 
> > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" 
> > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373
> > > > Stopped at  db_enter+0x10:  popq%rbp
> > > > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > > >  423109  57118 55 0x3  02  link
> > > >  330695  30276 55 0x3  03  link
> > > > * 46366  85501 55  0x1003  0x40804001  link
> > > >  188803  85501 55  0x1003  0x40820000K link
> > > > db_enter() at db_enter+0x10
> > > > panic(81f25d2b) at panic+0xbf
> > > > __assert(81f9a186,81f372c8,175,81f87c6c) at 
> > > > __assert+0x25
> > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> > > > sleep_setup+0x1d8
> > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> > > > sys_nanosleep+0x12d
> > > > syscall(800022d64f60) at syscall+0x374
> > > > 
> > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> > > > soft-interrupt-driven timeouts could be deleted synchronously without
> > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type
> > > > of the timeout.
> > > > 
> > > > It looks that with small adjustments timeout_del_barrier() can sleep
> > > > in sleep_finish(). The management of run queues is not affected because
> > > > the timeout clearing happens after it. As timeout_del_barrier() does not
> > > > rely on a timeout or signal catching, there should be no risk of
> > > > unbounded recursion or unwanted signal side effects within the sleep
> > > > machinery. In a way, a sleep with a timeout is higher-level than
> > > > one without.
> > > 
> > > I trust you on the analysis.  However this looks very fragile to me.
> > > 
> > > The use of timeout_del_barrier() which can sleep using the global sleep
> > > queue is worrying me.  
> > 
> > I think the queue handling ends in sleep_finish() when SCHED_LOCK()
> > is released. The timeout clearing is done outside of it.
> 
> That's ok.
> 
> > The extra sleeping point inside sleep_finish() is subtle. It should not
> > matter in typical use. But is it permissible with the API? Also, if
> > timeout_del_barrier() sleeps, the thread's priority can change.
> 
> What other options do we have at this point? Spin? Allocate the timeout
> dynamically so sleep_finish doesn't have to wait for it and let the
> handler clean up? How would you stop the timeout handler waking up the
> thread if it's gone back to sleep again for some other reason?
> 
> Sleeping here is the least worst option.

I agree.  I don't think sleeping is bad here.  My concern is about how
sleeping is implemented.  There's a single API built on top of a single
global data structure which now calls itself recursively.  

I'm not sure how much work it would be to make cond_wait(9) use its own
sleep queue...  This is something independent from this fix though.

> As for timeout_del_barrier, if prio is a worry we can provide an
> advanced version of it that lets you pass the prio in. I'd also
> like to change timeout_barrier so it queues the barrier task at the
> head of the pending lists rather than at the tail.

I doubt prio matter.



Re: Fix clearing of sleep timeouts

2022-06-05 Thread Martin Pieuchot
On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> Encountered the following panic:
> 
> panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: 
> file "/usr/src/sys/kern/kern_synch.c", line 373
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>  423109  57118 55 0x3  02  link
>  330695  30276 55 0x3  03  link
> * 46366  85501 55  0x1003  0x40804001  link
>  188803  85501 55  0x1003  0x40820000K link
> db_enter() at db_enter+0x10
> panic(81f25d2b) at panic+0xbf
> __assert(81f9a186,81f372c8,175,81f87c6c) at 
> __assert+0x25
> sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> sleep_setup+0x1d8
> cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> sys_nanosleep+0x12d
> syscall(800022d64f60) at syscall+0x374
> 
> The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> soft-interrupt-driven timeouts could be deleted synchronously without
> blocking. Now, timeout_del_barrier() can sleep regardless of the type
> of the timeout.
> 
> It looks that with small adjustments timeout_del_barrier() can sleep
> in sleep_finish(). The management of run queues is not affected because
> the timeout clearing happens after it. As timeout_del_barrier() does not
> rely on a timeout or signal catching, there should be no risk of
> unbounded recursion or unwanted signal side effects within the sleep
> machinery. In a way, a sleep with a timeout is higher-level than
> one without.

I trust you on the analysis.  However this looks very fragile to me.

The use of timeout_del_barrier() which can sleep using the global sleep
queue is worrying me.  

> Note that endtsleep() can run and set P_TIMEOUT during
> timeout_del_barrier() when the thread is blocked in cond_wait().
> To avoid unnecessary atomic read-modify-write operations, the clearing
> of P_TIMEOUT could be conditional, but maybe that is an unnecessary
> optimization at this point.

I agree this optimization seems unnecessary at the moment.

> While it should be possible to make the code use timeout_del() instead
> of timeout_del_barrier(), the outcome might not be outright better. For
> example, sleep_setup() and endtsleep() would have to coordinate so that
> a late-running timeout from previous sleep cycle would not disturb the
> new cycle.

So that's the price for not having to sleep in sleep_finish(), right?

> To test the barrier path reliably, I made the code call
> timeout_del_barrier() twice in a row. The second call is guaranteed
> to sleep. Of course, this is not part of the patch.

ok mpi@

> Index: kern/kern_synch.c
> ===
> RCS file: src/sys/kern/kern_synch.c,v
> retrieving revision 1.187
> diff -u -p -r1.187 kern_synch.c
> --- kern/kern_synch.c 13 May 2022 15:32:00 -  1.187
> +++ kern/kern_synch.c 5 Jun 2022 05:04:45 -
> @@ -370,8 +370,8 @@ sleep_setup(struct sleep_state *sls, con
>   p->p_slppri = prio & PRIMASK;
>   TAILQ_INSERT_TAIL(&slpque[LOOKUP(ident)], p, p_runq);
>  
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   if (timo) {
> + KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   sls->sls_timeout = 1;
>   timeout_add(&p->p_sleep_to, timo);
>   }
> @@ -432,13 +432,12 @@ sleep_finish(struct sleep_state *sls, in
>  
>   if (sls->sls_timeout) {
>   if (p->p_flag & P_TIMEOUT) {
> - atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   error1 = EWOULDBLOCK;
>   } else {
> - /* This must not sleep. */
> + /* This can sleep. It must not use timeouts. */
>   timeout_del_barrier(&p->p_sleep_to);
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   }
> + atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   }
>  
>   /* Check if thread was woken up because of a unwind or signal */
> 



Re: start unlocking kbind(2)

2022-05-31 Thread Martin Pieuchot
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > +   cookie = SCARG(uap, proc_cookie);
> > +   if (pr->ps_kbind_addr == pc) {
> > +   membar_datadep_consumer();
> > +   if (pr->ps_kbind_cookie != cookie)
> > +   goto sigill;
> > +   } else {
> 
> You must use membar_consumer() here.  membar_datadep_consumer() is
> a barrier between reading pointer and pointed data.  Only alpha
> requires membar_datadep_consumer() for that, everywhere else it is
> a NOP.
> 
> > +   mtx_enter(&pr->ps_mtx);
> > +   kpc = pr->ps_kbind_addr;
> 
> Do we need kpc variable?  I would prefer to read explicit
> pr->ps_kbind_addr in the two places where we use it.
> 
> I think the logic of barriers and mutexes is correct.
> 
> with the suggestions above OK bluhm@

I believe you should go ahead with the current diff.  ok with me.  Moving
the field under the scope of another lock can be easily done afterward.
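
For the archive, a minimal sketch of the ordering being relied on (my
own illustration with hypothetical names, not the diff itself): the
writer publishes the cookie before the address, and the lock-free
reader checks them in the opposite order.

#include <sys/atomic.h>	/* membar_producer(), membar_consumer() */

struct kbind_state {
	u_long	addr;
	u_long	cookie;
};

/* writer side, serialized by the per-process mutex */
void
kbind_publish(struct kbind_state *ks, u_long pc, u_long cookie)
{
	ks->cookie = cookie;
	membar_producer();	/* cookie must be visible before addr */
	ks->addr = pc;
}

/* lock-free fast path */
int
kbind_match(struct kbind_state *ks, u_long pc, u_long cookie)
{
	if (ks->addr != pc)
		return 0;	/* caller falls back to the mutex path */
	membar_consumer();	/* order the addr read before the cookie read */
	return ks->cookie == cookie;
}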



Re: allow 240/4 in various network daemons

2022-05-28 Thread Martin Schröder
On Sat, May 28, 2022 at 22:46, Seth David Schoen
wrote:
> We're also interested in talking about whether there's an appropriate
> path for supporting non-broadcast use of addresses within 127/8, our
> most controversial change.  In Linux and FreeBSD, we're experimenting

IPv6 is now older than IPv4 was when v6 was introduced.
You are beating a very dead horse.

Best
    Martin



Re: ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
On 24/05/22(Tue) 15:24, Mark Kettenis wrote:
> > Date: Tue, 24 May 2022 14:28:39 +0200
> > From: Martin Pieuchot 
> > 
> > The softdep code path is missing a UVM cache invalidation compared to
> > the !softdep one.  This is necessary to flush pages of a persisting
> > vnode.
> > 
> > Since uvm_vnp_setsize() is also called later in this function for the
> > !softdep case move it to not call it twice.
> > 
> > ok?
> 
> I'm not sure this is correct.  I'm trying to understand why you're
> moving the uvm_uvn_setsize() call.  Are you just trying to avoid calling it
> twice?  Or are you trying to avoid calling it at all when we end up in
> an error path?
>
> The way you moved it means we'll still call it twice for "partially
> truncated" files with softdeps.  At least the way I understand the
> code is that the code will fsync the vnode and dropping down in the
> "normal" non-softdep code that will call uvm_vnp_setsize() (and
> uvn_vnp_uncache()) again.  So maybe you should move the
> uvm_uvn_setsize() call into the else case?

We might want to do that indeed.  I'm not sure what are the implications
of calling uvm_vnp_setsize/uncache() after VOP_FSYNC(), which might fail.
So I'd rather play safe and go with that diff.

> > Index: ufs/ffs/ffs_inode.c
> > ===
> > RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
> > retrieving revision 1.81
> > diff -u -p -r1.81 ffs_inode.c
> > --- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
> > +++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
> > @@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
> > if (length > fs->fs_maxfilesize)
> > return (EFBIG);
> >  
> > -   uvm_vnp_setsize(ovp, length);
> > oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
> > = oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
> >  
> > if (DOINGSOFTDEP(ovp)) {
> > +   uvm_vnp_setsize(ovp, length);
> > +   (void) uvm_vnp_uncache(ovp);
> > if (length > 0 || softdep_slowdown(ovp)) {
> > /*
> >  * If a file is only partially truncated, then
> > 
> > 



Please test: rewrite of pdaemon

2022-05-24 Thread Martin Pieuchot
Diff below brings in & adapts most of the changes from NetBSD's r1.37 of
uvm_pdaemon.c.  My motivation for doing this is to untangle the inner
loop of uvmpd_scan_inactive() which will allow us to split the global
`pageqlock' mutex in a next step.

The idea behind this change is to get rid of the too-complex uvm_pager*
abstraction by checking early if a page is going to be flushed or
swapped to disk.  The loop is then clearly divided into two cases which
makes it more readable. 

This also opens the door to a better integration between UVM's vnode
layer and the buffer cache.

The main loop of uvmpd_scan_inactive() can be understood as below:

. If a page can be flushed we can call "uvn_flush()" directly and pass the
  PGO_ALLPAGES flag instead of building a cluster beforehand.  Note that,
  in its current form uvn_flush() is synchronous.

. If the page needs to be swapped, mark it as PG_PAGEOUT, build a cluster
  and once it is full call uvm_swap_put(). 

Please test this diff, do not hesitate to play with the `vm.swapencrypt.enable'
sysctl(2).
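
As a rough sketch, the reworked inner loop has this shape (illustrative
only: swap_backed() is a hypothetical helper standing in for the real
PQ_SWAPBACKED tests, and all locking and error handling is omitted):

for (p = TAILQ_FIRST(pglst); p != NULL; p = nextpg) {
	nextpg = TAILQ_NEXT(p, pageq);

	if (!swap_backed(p)) {
		/* case 1: flush through the vnode pager, synchronously */
		(void)uvn_flush(p->uobject, 0, 0,
		    PGO_CLEANIT|PGO_FREE|PGO_ALLPAGES);
		continue;
	}

	/* case 2: mark for pageout and accumulate a swap cluster */
	atomic_setbits_int(&p->pg_flags, PG_PAGEOUT);
	swpps[swcpages++] = p;
	if (swcpages == swnpages) {
		uvm_swap_put(swslot, swpps, swcpages, 0);
		swcpages = 0;
	}
}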

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  24 May 2022 12:31:34 -
@@ -143,7 +143,7 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
+int uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +241,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  24 May 2022 12:31:34 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.291
diff -u -p -r1.291 uvm_map.c
--- uvm/uvm_map.c   4 May 2022 14:58:26 -   1.291
+++ uvm/uvm_map.c   24 May 2022 12:31:34 -
@@ -3215,8 +3215,9 @@ uvm_object_printit(struct uvm_object *uo
  * uvm_page_printit: actually print the page
  */
 static const char page_flagbits[] =
-   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5CLEANCHK\6RELEASED\7FAKE\10RDONLY"
-   "\11ZERO\12DEV\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
+   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5PAGEOUT\6RELEASED\7FAKE\10RDONLY"
+   "\11ZERO\12DEV\13CLEANCHK"
+   "\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
"\27ENCRYPT\31PMAP0\32PMAP1\33PMAP2\34PMAP3\35PMAP4\36PMAP5";
 
 void
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  24 May 2022 12:32:54 -
@@ -960,6 +960,7 @@ uvm_pageclean(struct vm_page *pg)
 {
u_int flags_to_clear = 0;
 
+   KASSERT((pg->pg_flags & PG_PAGEOUT) == 0);
if ((pg->pg_flags & (PG_TABLED|PQ_ACTIVE|PQ_INACTIVE)) &&
(pg->uobject == NULL || !UVM_OBJ_IS_PMAP(pg->uobject)))
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
@@ -978,11 +979,14 @@ uvm_pageclean(struct vm_page *pg)
rw_write_held(pg->uanon->an_lock));
 
/*
-* if the page was an object page (and thus "TABLED"), remove it
-* from the object.
+* remove page from its object or anon.
 */
-   if (pg->pg_flags & PG_TABLED)
+   if (pg->pg_flags & PG_TABLED) {
uvm_pageremove(pg);
+   } else if (pg->uanon != NULL) {
+   pg->uanon->an_page = NULL;
+   pg->uanon = NULL;
+   }
 
/*
 * now remove the page from the queues
@@ -996,10 +1000,6 @@ uvm_pageclean(struct vm_page *pg)
pg->wire_count = 0;
uvmexp.wired--;
}
-   if (pg->uanon) {
-   pg->uanon->an_page = NULL;
-   pg->uanon = NULL;
-   }
 
   

ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
The softdep code path is missing a UVM cache invalidation compared to
the !softdep one.  This is necessary to flush pages of a persisting
vnode.

Since uvm_vnp_setsize() is also called later in this function for the
!softdep case, move it so it is not called twice.

ok?

Index: ufs/ffs/ffs_inode.c
===
RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.81
diff -u -p -r1.81 ffs_inode.c
--- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
+++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
@@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
if (length > fs->fs_maxfilesize)
return (EFBIG);
 
-   uvm_vnp_setsize(ovp, length);
oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
= oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
 
if (DOINGSOFTDEP(ovp)) {
+   uvm_vnp_setsize(ovp, length);
+   (void) uvm_vnp_uncache(ovp);
if (length > 0 || softdep_slowdown(ovp)) {
/*
 * If a file is only partially truncated, then



Re: Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-24 Thread Martin Pieuchot
On 17/05/22(Tue) 16:55, Martin Pieuchot wrote:
> nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
> possibly mmap'ed file before calling VOP_RENAME().
> 
> ok?

Anyone?

> Index: nfs/nfs_serv.c
> ===
> RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 nfs_serv.c
> --- nfs/nfs_serv.c11 Mar 2021 13:31:35 -  1.120
> +++ nfs/nfs_serv.c4 May 2022 15:29:06 -
> @@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
>   error = -1;
>  out:
>   if (!error) {
> + if (tvp) {
> + (void)uvm_vnp_uncache(tvp);
> + }
>   error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
>  tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
>   } else {
> 



Re: start unlocking kbind(2)

2022-05-17 Thread Martin Pieuchot
On 17/05/22(Tue) 10:44, David Gwynne wrote:
> this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> argument checks can be done without the kernel lock.
> 
> care is taken to allow the pc/cookie checks to run without any lock by
> being careful with the order of the checks. all modifications to the
> pc/cookie state are serialised by the per process mutex.

I don't understand why it is safe to do the following check without
holding a mutex:

if (pr->ps_kbind_addr == pc)
...

Is there much difference when always grabbing the per-process mutex?

> i dont know enough about uvm to say whether it is safe to unlock the
> actual memory updates too, but even if i was confident i would still
> prefer to change it as a separate step.

I agree.

> Index: kern/init_sysent.c
> ===
> RCS file: /cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.236
> diff -u -p -r1.236 init_sysent.c
> --- kern/init_sysent.c1 May 2022 23:00:04 -   1.236
> +++ kern/init_sysent.c17 May 2022 00:36:03 -
> @@ -1,4 +1,4 @@
> -/*   $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
> @@ -204,7 +204,7 @@ const struct sysent sysent[] = {
>   sys_utimensat },/* 84 = utimensat */
>   { 2, s(struct sys_futimens_args), 0,
>   sys_futimens }, /* 85 = futimens */
> - { 3, s(struct sys_kbind_args), 0,
> + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
>   sys_kbind },/* 86 = kbind */
>   { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
>   sys_clock_gettime },/* 87 = clock_gettime */
> Index: kern/syscalls.master
> ===
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.223
> diff -u -p -r1.223 syscalls.master
> --- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
> +++ kern/syscalls.master  17 May 2022 00:36:03 -
> @@ -194,7 +194,7 @@
>   const struct timespec *times, int flag); }
>  85   STD { int sys_futimens(int fd, \
>   const struct timespec *times); }
> -86   STD { int sys_kbind(const struct __kbind *param, \
> +86   STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
>   size_t psize, int64_t proc_cookie); }
>  87   STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
>   struct timespec *tp); }
> Index: uvm/uvm_mmap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> retrieving revision 1.169
> diff -u -p -r1.169 uvm_mmap.c
> --- uvm/uvm_mmap.c19 Jan 2022 10:43:48 -  1.169
> +++ uvm/uvm_mmap.c17 May 2022 00:36:03 -
> @@ -70,6 +70,7 @@
>  #include 
>  #include   /* for KBIND* */
> +#include <sys/atomic.h>
> +#include 
>  
>  #include /* for __LDPGSZ */
>  
> @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
>   const char *data;
>   vaddr_t baseva, last_baseva, endva, pageoffset, kva;
>   size_t psize, s;
> - u_long pc;
> + u_long pc, kpc;
>   int count, i, extra;
> + uint64_t cookie;
>   int error;
>  
>   /*
>* extract syscall args from uap
>*/
>   paramp = SCARG(uap, param);
> - psize = SCARG(uap, psize);
>  
>   /* a NULL paramp disables the syscall for the process */
>   if (paramp == NULL) {
> + mtx_enter(&pr->ps_mtx);
>   if (pr->ps_kbind_addr != 0)
> - sigexit(p, SIGILL);
> + goto leave_sigill;
>   pr->ps_kbind_addr = BOGO_PC;
> + mtx_leave(&pr->ps_mtx);
>   return 0;
>   }
>  
>   /* security checks */
> +
> + /*
> +  * ps_kbind_addr can only be set to 0 or BOGO_PC by the
> +  * kernel, not by a call from userland.
> +  */
>   pc = PROC_PC(p);
> - if (pr->ps_kbind_addr == 0) {
> - pr->ps_kbind_addr = pc;
> - pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
> - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
> - sigexit(p, SIGILL);
> - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
> - sigexit(p, SIGILL);
> + if (pc == 0 || pc == BOGO_PC)
> + goto sigill;
> +
> + cookie = SCARG(uap, proc_cookie);
> + if (pr->ps_kbind_addr == pc) {
> + membar_datadep_consumer();
> + if (pr->ps_kbind_cookie != cookie)
> + goto sigill;
> + } else {
> + mtx_enter(&pr->ps_mtx);
> + kpc = pr->ps_kbind_addr;
> +
> + /*
> +  * If we're the first thread in (kpc is 0), then
> +  * 

Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-17 Thread Martin Pieuchot
nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
possibly mmap'ed file before calling VOP_RENAME().

ok?

Index: nfs/nfs_serv.c
===
RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
retrieving revision 1.120
diff -u -p -r1.120 nfs_serv.c
--- nfs/nfs_serv.c  11 Mar 2021 13:31:35 -  1.120
+++ nfs/nfs_serv.c  4 May 2022 15:29:06 -
@@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
error = -1;
 out:
if (!error) {
+   if (tvp) {
+   (void)uvm_vnp_uncache(tvp);
+   }
error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
   tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
} else {



Re: ssh-add(1): fix NULL in fprintf

2022-05-16 Thread Martin Vahlensieck
Hi

What's the status on this?  Anything required from my side?  I have
reattached the patch (with the changes Theo suggested).

Best,

Martin

On Mon, May 09, 2022 at 08:39:38PM +0200, Martin Vahlensieck wrote:
> On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> > Martin Vahlensieck  wrote:
> > 
> > >   if (!qflag) {
> > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > - sshkey_type(key), comment);
> > > + fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > > + sshkey_type(key), comment ? " (" : "",
> > > + comment ? comment : "", comment ? ")" : "");
> > 
> > this is probably better as something like
> > 
> > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > - sshkey_type(key), comment ? comment : "no comment");
> > 
> > Which has a minor ambiguity, but probably harms noone.
> > 
> 
> Index: ssh-add.c
> ===
> RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 ssh-add.c
> --- ssh-add.c 4 Feb 2022 02:49:17 -   1.165
> +++ ssh-add.c 9 May 2022 18:36:54 -
> @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
>   }
>   if (!qflag) {
>   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> - sshkey_type(key), comment);
> + sshkey_type(key), comment ? comment : "no comment");
>   }
>   return 0;
>  }
> @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
>   certpath, filename);
>   sshkey_free(cert);
>   goto out;
> - } 
> + }
>  
>   /* Graft with private bits */
>   if ((r = sshkey_to_certified(private)) != 0) {

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 18:36:54 -
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
}
if (!qflag) {
fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   sshkey_type(key), comment ? comment : "no comment");
}
return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Matthew Martin
#include <stdio.h>
#include <stdlib.h>

/* arc4random_uniform_fast_simple() is the function proposed earlier in
 * this thread; it is not a libc routine. */
int
main() {
	int results[3] = { 0, 0, 0 };
	for (int i = 0; i < 100000; i++) {
		results[arc4random_uniform_fast_simple(3)]++;
	}
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);

	return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.
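
To make the failure mode concrete, here is a minimal rejection-sampling
sketch (my illustration, not code from the proposal): with upper_bound
3, a two-bit draw has four equally likely values, so the fourth value
has to be discarded outright; remapping it onto 0-2 produces exactly
the skew shown above.

#include <stdint.h>
#include <stdlib.h>

uint32_t
uniform3(void)
{
	uint32_t v;

	do {
		v = arc4random() & 3;	/* two fresh random bits */
	} while (v == 3);		/* reject; never reuse the draw */
	return v;
}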



libcrypto/err_prn.c: skip BIO*

2022-05-12 Thread Martin Vahlensieck
Hi

As far as I can tell, this ends up calling vprintf eventually, so
skip the steps in between.

Best,

Martin

Index: err_prn.c
===
RCS file: /home/reposync/cvs/src/lib/libcrypto/err/err_prn.c,v
retrieving revision 1.19
diff -u -p -r1.19 err_prn.c
--- err_prn.c   7 Jan 2022 09:02:18 -   1.19
+++ err_prn.c   7 Jan 2022 16:13:48 -
@@ -92,12 +92,7 @@ ERR_print_errors_cb(int (*cb)(const char
 static int
 print_fp(const char *str, size_t len, void *fp)
 {
-   BIO bio;
-
-   BIO_set(&bio, BIO_s_file());
-   BIO_set_fp(&bio, fp, BIO_NOCLOSE);
-
-   return BIO_printf(&bio, "%s", str);
+   return fprintf(fp, "%s", str);
 }
 
 void



apply(1): constify two arguments

2022-05-12 Thread Martin Vahlensieck
Index: apply.c
===
RCS file: /cvs/src/usr.bin/apply/apply.c,v
retrieving revision 1.29
diff -u -p -r1.29 apply.c
--- apply.c 1 Apr 2018 17:45:05 -   1.29
+++ apply.c 12 May 2022 21:14:04 -
@@ -54,7 +54,7 @@ char  *str;
 size_t  sz;
 
 void
-stradd(char *p)
+stradd(const char *p)
 {
size_t n;
 
@@ -73,7 +73,7 @@ stradd(char *p)
 }
 
 void
-strset(char *p)
+strset(const char *p)
 {
if (str != NULL)
str[0] = '\0';



Re: uvm_pagedequeue()

2022-05-12 Thread Martin Pieuchot
On 10/05/22(Tue) 20:23, Mark Kettenis wrote:
> > Date: Tue, 10 May 2022 18:45:21 +0200
> > From: Martin Pieuchot 
> > 
> > On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> > > Diff below introduces a new wrapper to manipulate active/inactive page
> > > queues. 
> > > 
> > > ok?
> > 
> > Anyone?
> 
> Sorry I started looking at this and got distracted.
> 
> I'm not sure about the changes to uvm_pageactivate().  It doesn't
> quite match what NetBSD does, but I guess NetBSD assumes that
> uvm_pageactiave() isn't called for a page that is already active?  And
> that's something we can't guarantee?

It does match what NetBSD did 15 years ago.  We're not ready to synchronize
with NetBSD -current yet. 

We're getting there!

> The diff is correct though in the sense that it is equivalent to the
> code we already have.  So if this definitely is the direction you want
> to go:
> 
> ok kettenis@
> 
> > > Index: uvm/uvm_page.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> > > retrieving revision 1.165
> > > diff -u -p -r1.165 uvm_page.c
> > > --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> > > +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> > > @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
> > >   /*
> > >* now remove the page from the queues
> > >*/
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > - flags_to_clear |= PQ_ACTIVE;
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> > > - flags_to_clear |= PQ_INACTIVE;
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >  
> > >   /*
> > >* if the page was wired, unwire it now.
> > > @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
> > >   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
> > >  
> > >   if (pg->wire_count == 0) {
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > - atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> > > - atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >   uvmexp.wired++;
> > >   }
> > >   pg->wire_count++;
> > > @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
> > >   KASSERT(uvm_page_owner_locked_p(pg));
> > >   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
> > >  
> > > + uvm_pagedequeue(pg);
> > > + if (pg->wire_count == 0) {
> > > + TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> > > + atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > + uvmexp.active++;
> > > +
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * uvm_pagedequeue: remove a page from any paging queue
> > > + */
> > > +void
> > > +uvm_pagedequeue(struct vm_page *pg)
> > > +{
> > > + if (pg->pg_flags & PQ_ACTIVE) {
> > > + TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > + atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > + uvmexp.active--;
> > > + }
> > >   if (pg->pg_flags & PQ_INACTIVE) {
> > >   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> > >   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
> > >   uvmexp.inactive--;
> > >   }
> > > - if (pg->wire_count == 0) {
> > > - /*
> > > -  * if page is already active, remove it from list so we
> > > -  * can put it at tail.  if it wasn't active, then mark
> > > -  * it active and bump active count
> > > -  */
> > > - if (pg->pg_flags & PQ_ACTIVE)
> > > - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > - else {
> > > - atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > - uvmexp.active++;
> > > - }
> > > -
> > > - TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> > > - }
> > >  }
> > > -
> > >  /*
> > >   * uvm_pagezero: zero fill a page
> > >   */
> > > Index: uvm/uvm_page.h
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> > > retrieving revision 1.67
> > > diff -u -p -r1.67 uvm_page.h
> > > --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> > > +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> > > @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
> > >  #endif
> > >  
> > >  void uvm_pageactivate(struct vm_page *);
> > > +void uvm_pagedequeue(struct vm_page *);
> > >  vaddr_t  uvm_pageboot_alloc(vsize_t);
> > >  void uvm_pagecopy(struct vm_page *, struct vm_page *);
> > >  void uvm_pagedeactivate(struct vm_page *);
> > > 
> > 
> > 



Re: Mark pw_error __dead in util.h

2022-05-10 Thread Matthew Martin
On Tue, May 03, 2022 at 10:37:36PM -0500, Matthew Martin wrote:
> The function is already marked __dead in passwd.c, so appears to just be
> an oversight.

ping

diff --git util.h util.h
index dd64f478e23..752f8bb9fc5 100644
--- util.h
+++ util.h
@@ -97,7 +97,7 @@ void  pw_edit(int, const char *);
 void   pw_prompt(void);
 void   pw_copy(int, int, const struct passwd *, const struct passwd *);
 intpw_scan(char *, struct passwd *, int *);
-void   pw_error(const char *, int, int);
+__dead voidpw_error(const char *, int, int);
 intgetptmfd(void);
 intopenpty(int *, int *, char *, const struct termios *,
const struct winsize *);



Re: uvm_pagedequeue()

2022-05-10 Thread Martin Pieuchot
On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> Diff below introduces a new wrapper to manipulate active/inactive page
> queues. 
> 
> ok?

Anyone?

> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 uvm_page.c
> --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
>   /*
>* now remove the page from the queues
>*/
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> - flags_to_clear |= PQ_ACTIVE;
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> - flags_to_clear |= PQ_INACTIVE;
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>  
>   /*
>* if the page was wired, unwire it now.
> @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
>   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
>  
>   if (pg->wire_count == 0) {
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> - atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> - atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>   uvmexp.wired++;
>   }
>   pg->wire_count++;
> @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
>   KASSERT(uvm_page_owner_locked_p(pg));
>   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
>  
> + uvm_pagedequeue(pg);
> + if (pg->wire_count == 0) {
> + TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> + atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> + uvmexp.active++;
> +
> + }
> +}
> +
> +/*
> + * uvm_pagedequeue: remove a page from any paging queue
> + */
> +void
> +uvm_pagedequeue(struct vm_page *pg)
> +{
> + if (pg->pg_flags & PQ_ACTIVE) {
> + TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> + atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> + uvmexp.active--;
> + }
>   if (pg->pg_flags & PQ_INACTIVE) {
>   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
>   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
>   uvmexp.inactive--;
>   }
> - if (pg->wire_count == 0) {
> - /*
> -  * if page is already active, remove it from list so we
> -  * can put it at tail.  if it wasn't active, then mark
> -  * it active and bump active count
> -  */
> - if (pg->pg_flags & PQ_ACTIVE)
> - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> - else {
> - atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> - uvmexp.active++;
> - }
> -
> - TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> - }
>  }
> -
>  /*
>   * uvm_pagezero: zero fill a page
>   */
> Index: uvm/uvm_page.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> retrieving revision 1.67
> diff -u -p -r1.67 uvm_page.h
> --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
>  #endif
>  
>  void uvm_pageactivate(struct vm_page *);
> +void uvm_pagedequeue(struct vm_page *);
>  vaddr_t  uvm_pageboot_alloc(vsize_t);
>  void uvm_pagecopy(struct vm_page *, struct vm_page *);
>  void uvm_pagedeactivate(struct vm_page *);
> 



Re: ssh-add(1): fix NULL in fprintf

2022-05-09 Thread Martin Vahlensieck
On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> Martin Vahlensieck  wrote:
> 
> > if (!qflag) {
> > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -   sshkey_type(key), comment);
> > +   fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > +   sshkey_type(key), comment ? " (" : "",
> > +   comment ? comment : "", comment ? ")" : "");
> 
> this is probably better as something like
> 
> > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -   sshkey_type(key), comment ? comment : "no comment");
> 
> Which has a minor ambiguity, but probably harms noone.
> 

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 18:36:54 -
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
}
if (!qflag) {
fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   sshkey_type(key), comment ? comment : "no comment");
}
return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



ssh-add(1): fix NULL in fprintf

2022-05-09 Thread Martin Vahlensieck
Hi

When removing an identity from the agent using the private key file,
ssh-add first tries to find the public key file.  If that fails,
it loads the public key from the private key file, but no comment
is loaded.  This means comment is NULL when it is used inside
delete_one to print `Identity removed: ...'

Below is a diff which only prints the braces and the comment if it
is not NULL.  Something similar is done in ssh-keygen.c line
2423-2425.

So with the following setup:
$ ssh-keygen -t ed25519 -f demo -C demo -N ''
$ mv demo.pub demo_pub
$ ssh-add demo
Identity added: demo (demo)
Before:
$ ssh-add -d demo
Identity removed: demo ED25519 ((null))
$ tail -n 1 /var/log/messages
May  9 18:15:53 demo ssh-add: vfprintf %s NULL in "Identity removed: %s %s 
(%s) "
After:
$ ssh-add -d demo
Identity removed: demo ED25519

Best,

Martin

P.S.: While here remove a trailing space as well.

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 16:04:14 -
@@ -117,8 +117,9 @@ delete_one(int agent_fd, const struct ss
return r;
}
if (!qflag) {
-   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
+   sshkey_type(key), comment ? " (" : "",
+   comment ? comment : "", comment ? ")" : "");
}
return 0;
 }
@@ -392,7 +393,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



Re: uvm: Consider BUFPAGES_DEFICIT in swap_shortage

2022-05-09 Thread Martin Pieuchot
On 05/05/22(Thu) 10:56, Bob Beck wrote:
> On Thu, May 05, 2022 at 10:16:23AM -0600, Bob Beck wrote:
> > Ugh. You're digging in the most perilous parts of the pile.
> > 
> > I will go look with you... sigh. (This is not yet an ok for that.)
> > 
> > > On May 5, 2022, at 7:53 AM, Martin Pieuchot  wrote:
> > > 
> > > When considering the amount of free pages in the page daemon a small
> > > amount is always kept for the buffer cache... except in one place.
> > > 
> > > The diff below gets rid of this exception.  This is important because
> > > uvmpd_scan() is called conditionally using the following check:
> > > 
> > >  if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
> > >...
> > >  }
> > > 
> > > So in case of swap shortage we might end up freeing fewer pages than
> > > wanted.
> 
> So a bit of background.
> 
> I am pretty much of the belief that this "low water mark" for pages is 
> nonsense now.  I was in the midst of trying to prove that
> to myself and therefore rip down some of the crazy accounting and
> very arbitrary limits in the buffer cache and got distracted.
> 
> Maybe something like this to start? (but failing that I think
> your current diff is probably ok).

Thanks.  I'll commit my diff then to make the current code coherent and
let me progress in my refactoring.  Then we can consider changing this
magic.

> Index: sys/sys/mount.h
> ===
> RCS file: /cvs/src/sys/sys/mount.h,v
> retrieving revision 1.148
> diff -u -p -u -p -r1.148 mount.h
> --- sys/sys/mount.h   6 Apr 2021 14:17:35 -   1.148
> +++ sys/sys/mount.h   5 May 2022 16:50:50 -
> @@ -488,10 +488,8 @@ struct bcachestats {
>  #ifdef _KERNEL
>  extern struct bcachestats bcstats;
>  extern long buflowpages, bufhighpages, bufbackpages;
> -#define BUFPAGES_DEFICIT (((buflowpages - bcstats.numbufpages) < 0) ? 0 \
> -: buflowpages - bcstats.numbufpages)
> -#define BUFPAGES_INACT (((bcstats.numcleanpages - buflowpages) < 0) ? 0 \
> -: bcstats.numcleanpages - buflowpages)
> +#define BUFPAGES_DEFICIT 0
> +#define BUFPAGES_INACT bcstats.numcleanpages
>  extern int bufcachepercent;
>  extern void bufadjust(int);
>  struct uvm_constraint_range;
> 
> 
> > > 
> > > ok?
> > > 
> > > Index: uvm/uvm_pdaemon.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> > > retrieving revision 1.98
> > > diff -u -p -r1.98 uvm_pdaemon.c
> > > --- uvm/uvm_pdaemon.c 4 May 2022 14:58:26 -   1.98
> > > +++ uvm/uvm_pdaemon.c 5 May 2022 13:40:28 -
> > > @@ -923,12 +923,13 @@ uvmpd_scan(void)
> > >* detect if we're not going to be able to page anything out
> > >* until we free some swap resources from active pages.
> > >*/
> > > + free = uvmexp.free - BUFPAGES_DEFICIT;
> > >   swap_shortage = 0;
> > > - if (uvmexp.free < uvmexp.freetarg &&
> > > + if (free < uvmexp.freetarg &&
> > >   uvmexp.swpginuse == uvmexp.swpages &&
> > >   !uvm_swapisfull() &&
> > >   pages_freed == 0) {
> > > - swap_shortage = uvmexp.freetarg - uvmexp.free;
> > > + swap_shortage = uvmexp.freetarg - free;
> > >   }
> > > 
> > >   for (p = TAILQ_FIRST(&uvm.page_active);
> > > 
> > 



uvm: Consider BUFPAGES_DEFICIT in swap_shortage

2022-05-05 Thread Martin Pieuchot
When considering the amount of free pages in the page daemon a small
amount is always kept for the buffer cache... except in one place.

The diff below gets rid of this exception.  This is important because
uvmpd_scan() is called conditionally using the following check:
  
  if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
...
  }

So in case of swap shortage we might end up freeing fewer pages than
wanted.

ok?
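
A made-up numeric example of the difference (values purely
illustrative):

#include <stdio.h>

int
main(void)
{
	int freetarg = 512, uvmexp_free = 600, deficit = 200;
	int free = uvmexp_free - deficit;	/* 400 */

	/* old: 512 - 600 = -88, so no swap shortage is declared */
	printf("old shortage: %d\n", freetarg - uvmexp_free);
	/* new: 512 - 400 = 112 pages worth of swap gets reclaimed */
	printf("new shortage: %d\n", freetarg - free);
	return 0;
}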

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.98
diff -u -p -r1.98 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   4 May 2022 14:58:26 -   1.98
+++ uvm/uvm_pdaemon.c   5 May 2022 13:40:28 -
@@ -923,12 +923,13 @@ uvmpd_scan(void)
 * detect if we're not going to be able to page anything out
 * until we free some swap resources from active pages.
 */
+   free = uvmexp.free - BUFPAGES_DEFICIT;
swap_shortage = 0;
-   if (uvmexp.free < uvmexp.freetarg &&
+   if (free < uvmexp.freetarg &&
uvmexp.swpginuse == uvmexp.swpages &&
!uvm_swapisfull() &&
pages_freed == 0) {
-   swap_shortage = uvmexp.freetarg - uvmexp.free;
+   swap_shortage = uvmexp.freetarg - free;
}
 
for (p = TAILQ_FIRST(&uvm.page_active);



uvm_pagedequeue()

2022-05-05 Thread Martin Pieuchot
Diff below introduces a new wrapper to manipulate active/inactive page
queues. 

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.165
diff -u -p -r1.165 uvm_page.c
--- uvm/uvm_page.c  4 May 2022 14:58:26 -   1.165
+++ uvm/uvm_page.c  5 May 2022 12:49:13 -
@@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
/*
 * now remove the page from the queues
 */
-   if (pg->pg_flags & PQ_ACTIVE) {
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   flags_to_clear |= PQ_ACTIVE;
-   uvmexp.active--;
-   }
-   if (pg->pg_flags & PQ_INACTIVE) {
-   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
-   flags_to_clear |= PQ_INACTIVE;
-   uvmexp.inactive--;
-   }
+   uvm_pagedequeue(pg);
 
/*
 * if the page was wired, unwire it now.
@@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
if (pg->wire_count == 0) {
-   if (pg->pg_flags & PQ_ACTIVE) {
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
-   uvmexp.active--;
-   }
-   if (pg->pg_flags & PQ_INACTIVE) {
-   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
-   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
-   uvmexp.inactive--;
-   }
+   uvm_pagedequeue(pg);
uvmexp.wired++;
}
pg->wire_count++;
@@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
KASSERT(uvm_page_owner_locked_p(pg));
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
+   uvm_pagedequeue(pg);
+   if (pg->wire_count == 0) {
+   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
+   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvmexp.active++;
+
+   }
+}
+
+/*
+ * uvm_pagedequeue: remove a page from any paging queue
+ */
+void
+uvm_pagedequeue(struct vm_page *pg)
+{
+   if (pg->pg_flags & PQ_ACTIVE) {
+   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
+   atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvmexp.active--;
+   }
if (pg->pg_flags & PQ_INACTIVE) {
TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
-   if (pg->wire_count == 0) {
-   /*
-* if page is already active, remove it from list so we
-* can put it at tail.  if it wasn't active, then mark
-* it active and bump active count
-*/
-   if (pg->pg_flags & PQ_ACTIVE)
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   else {
-   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
-   uvmexp.active++;
-   }
-
-   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-   }
 }
-
 /*
  * uvm_pagezero: zero fill a page
  */
Index: uvm/uvm_page.h
===
RCS file: /cvs/src/sys/uvm/uvm_page.h,v
retrieving revision 1.67
diff -u -p -r1.67 uvm_page.h
--- uvm/uvm_page.h  29 Jan 2022 06:25:33 -  1.67
+++ uvm/uvm_page.h  5 May 2022 12:49:13 -
@@ -224,6 +224,7 @@ boolean_t   uvm_page_physget(paddr_t *);
 #endif
 
 void   uvm_pageactivate(struct vm_page *);
+void   uvm_pagedequeue(struct vm_page *);
 vaddr_tuvm_pageboot_alloc(vsize_t);
 void   uvm_pagecopy(struct vm_page *, struct vm_page *);
 void   uvm_pagedeactivate(struct vm_page *);



ssh: sshkey.c: reduce code duplication

2022-05-04 Thread Martin Vahlensieck
Hi

I noticed that sshkey_unshield_private contains an exact duplicate
of the code in private2_check_padding.  So by pulling
private2_check_padding up, the code can be reused.  Or is there
a reason for this split?

Best,

Martin

P.S.: This diff also removes two trailing spaces while here.

Index: sshkey.c
===
RCS file: /home/reposync/cvs/src/usr.bin/ssh/sshkey.c,v
retrieving revision 1.120
diff -u -p -r1.120 sshkey.c
--- sshkey.c6 Jan 2022 22:05:42 -   1.120
+++ sshkey.c4 May 2022 19:12:16 -
@@ -2079,14 +2079,38 @@ sshkey_shield_private(struct sshkey *k)
return r;
 }
 
+/* Check deterministic padding after private key */
+static int
+private2_check_padding(struct sshbuf *decrypted)
+{
+   u_char pad;
+   size_t i;
+   int r;
+
+   i = 0;
+   while (sshbuf_len(decrypted)) {
+   if ((r = sshbuf_get_u8(decrypted, &pad)) != 0)
+   goto out;
+   if (pad != (++i & 0xff)) {
+   r = SSH_ERR_INVALID_FORMAT;
+   goto out;
+   }
+   }
+   /* success */
+   r = 0;
+ out:
+   explicit_bzero(&pad, sizeof(pad));
+   explicit_bzero(&i, sizeof(i));
+   return r;
+}
+
 int
 sshkey_unshield_private(struct sshkey *k)
 {
struct sshbuf *prvbuf = NULL;
-   u_char pad, *cp, keyiv[SSH_DIGEST_MAX_LENGTH];
+   u_char *cp, keyiv[SSH_DIGEST_MAX_LENGTH];
struct sshcipher_ctx *cctx = NULL;
const struct sshcipher *cipher;
-   size_t i;
struct sshkey *kswap = NULL, tmp;
int r = SSH_ERR_INTERNAL_ERROR;
 
@@ -2148,16 +2172,9 @@ sshkey_unshield_private(struct sshkey *k
/* Parse private key */
if ((r = sshkey_private_deserialize(prvbuf, &kswap)) != 0)
goto out;
-   /* Check deterministic padding */
-   i = 0;
-   while (sshbuf_len(prvbuf)) {
-   if ((r = sshbuf_get_u8(prvbuf, &pad)) != 0)
-   goto out;
-   if (pad != (++i & 0xff)) {
-   r = SSH_ERR_INVALID_FORMAT;
-   goto out;
-   }
-   }
+
+   if ((r = private2_check_padding(prvbuf)) != 0)
+   goto out;
 
/* Swap the parsed key back into place */
tmp = *kswap;
@@ -3966,9 +3983,9 @@ sshkey_private_to_blob2(struct sshkey *p
explicit_bzero(salt, sizeof(salt));
if (key != NULL)
freezero(key, keylen + ivlen);
-   if (pubkeyblob != NULL) 
+   if (pubkeyblob != NULL)
freezero(pubkeyblob, pubkeylen);
-   if (b64 != NULL) 
+   if (b64 != NULL)
freezero(b64, strlen(b64));
return r;
 }
@@ -4192,31 +4209,6 @@ private2_decrypt(struct sshbuf *decoded,
}
sshbuf_free(kdf);
sshbuf_free(decrypted);
-   return r;
-}
-
-/* Check deterministic padding after private key */
-static int
-private2_check_padding(struct sshbuf *decrypted)
-{
-   u_char pad;
-   size_t i;
-   int r = SSH_ERR_INTERNAL_ERROR;
-
-   i = 0;
-   while (sshbuf_len(decrypted)) {
-   if ((r = sshbuf_get_u8(decrypted, &pad)) != 0)
-   goto out;
-   if (pad != (++i & 0xff)) {
-   r = SSH_ERR_INVALID_FORMAT;
-   goto out;
-   }
-   }
-   /* success */
-   r = 0;
- out:
-   explicit_bzero(&pad, sizeof(pad));
-   explicit_bzero(&i, sizeof(i));
return r;
 }
 



ssh: mux.c: mark argument as const

2022-05-04 Thread Martin Vahlensieck
Index: mux.c
===
RCS file: /home/reposync/cvs/src/usr.bin/ssh/mux.c,v
retrieving revision 1.92
diff -u -p -r1.92 mux.c
--- mux.c   11 Jan 2022 01:26:47 -  1.92
+++ mux.c   13 Jan 2022 16:27:14 -
@@ -227,7 +227,7 @@ mux_master_control_cleanup_cb(struct ssh
 
 /* Check mux client environment variables before passing them to mux master. */
 static int
-env_permitted(char *env)
+env_permitted(const char *env)
 {
int i, ret;
char name[1024], *cp;



ssh: channels.c: Fix comment and add a const

2022-05-04 Thread Martin Vahlensieck
Hi

channel_new no longer frees remote_name.  So update the comment
accordingly.  As remote_name is not modified, it can be const
as well.

Best,

Martin

Index: channels.c
===
RCS file: /home/reposync/cvs/src/usr.bin/ssh/channels.c,v
retrieving revision 1.418
diff -u -p -r1.418 channels.c
--- channels.c  4 May 2022 07:31:22 -   1.418
+++ channels.c  4 May 2022 19:02:14 -
@@ -349,12 +349,11 @@ channel_register_fds(struct ssh *ssh, Ch
 }
 
 /*
- * Allocate a new channel object and set its type and socket. This will cause
- * remote_name to be freed.
+ * Allocate a new channel object and set its type and socket.
  */
 Channel *
 channel_new(struct ssh *ssh, char *ctype, int type, int rfd, int wfd, int efd,
-u_int window, u_int maxpack, int extusage, char *remote_name, int nonblock)
+u_int window, u_int maxpack, int extusage, const char *remote_name, int 
nonblock)
 {
struct ssh_channels *sc = ssh->chanctxt;
u_int i, found;
Index: channels.h
===
RCS file: /home/reposync/cvs/src/usr.bin/ssh/channels.h,v
retrieving revision 1.142
diff -u -p -r1.142 channels.h
--- channels.h  30 Mar 2022 21:10:25 -  1.142
+++ channels.h  6 Apr 2022 20:26:56 -
@@ -272,7 +272,7 @@ Channel *channel_by_id(struct ssh *, int
 Channel*channel_by_remote_id(struct ssh *, u_int);
 Channel*channel_lookup(struct ssh *, int);
 Channel *channel_new(struct ssh *, char *, int, int, int, int,
-   u_int, u_int, int, char *, int);
+   u_int, u_int, int, const char *, int);
 voidchannel_set_fds(struct ssh *, int, int, int, int, int,
int, int, u_int);
 voidchannel_free(struct ssh *, Channel *);



Mark pw_error __dead in util.h

2022-05-03 Thread Matthew Martin
The function is already marked __dead in passwd.c, so appears to just be
an oversight.
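
A quick illustration of why the annotation matters to callers of the
header (hypothetical caller, mine):

#include <pwd.h>
#include <util.h>

uid_t
uid_of(const char *name)
{
	struct passwd *pw;

	if ((pw = getpwnam(name)) == NULL)
		pw_error(name, 0, 1);	/* known to never return */
	/* without __dead in util.h the compiler may warn that control
	 * can reach the end of the function without returning */
	return pw->pw_uid;
}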

diff --git util.h util.h
index dd64f478e23..752f8bb9fc5 100644
--- util.h
+++ util.h
@@ -97,7 +97,7 @@ void  pw_edit(int, const char *);
 void   pw_prompt(void);
 void   pw_copy(int, int, const struct passwd *, const struct passwd *);
 intpw_scan(char *, struct passwd *, int *);
-void   pw_error(const char *, int, int);
+__dead voidpw_error(const char *, int, int);
 intgetptmfd(void);
 intopenpty(int *, int *, char *, const struct termios *,
const struct winsize *);



Merge swap-backed and object-backed inactive lists

2022-05-02 Thread Martin Pieuchot
Let's simplify the existing logic and use a single list for inactive
pages.  uvmpd_scan_inactive() already does a lot of checks when it finds
a page which is swap-backed.  This will be improved in a next change.

ok?
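
The distinction doesn't disappear, it just moves into the scan; a
sketch of the flag test that keeps the two kinds of pages apart on the
single list (illustrative, the PQ_* bits are the existing ones):

if (p->pg_flags & PQ_SWAPBACKED) {
	/* anon/aobj page: reclaimed through swap I/O */
} else {
	/* object page: flushed through its pager */
}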

Index: uvm/uvm.h
===
RCS file: /cvs/src/sys/uvm/uvm.h,v
retrieving revision 1.68
diff -u -p -r1.68 uvm.h
--- uvm/uvm.h   24 Nov 2020 13:49:09 -  1.68
+++ uvm/uvm.h   2 May 2022 16:32:16 -
@@ -53,8 +53,7 @@ struct uvm {
 
/* vm_page queues */
struct pglist page_active;  /* [Q] allocated pages, in use */
-   struct pglist page_inactive_swp;/* [Q] pages inactive (reclaim/free) */
-   struct pglist page_inactive_obj;/* [Q] pages inactive (reclaim/free) */
+   struct pglist page_inactive;/* [Q] pages inactive (reclaim/free) */
/* Lock order: pageqlock, then fpageqlock. */
struct mutex pageqlock; /* [] lock for active/inactive page q */
struct mutex fpageqlock;/* [] lock for free page q  + pdaemon */
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.290
diff -u -p -r1.290 uvm_map.c
--- uvm/uvm_map.c   12 Mar 2022 08:11:07 -  1.290
+++ uvm/uvm_map.c   2 May 2022 16:32:16 -
@@ -3281,8 +3281,7 @@ uvm_page_printit(struct vm_page *pg, boo
(*pr)("  >>> page not found in uvm_pmemrange <<<\n");
pgl = NULL;
} else if (pg->pg_flags & PQ_INACTIVE) {
-   pgl = (pg->pg_flags & PQ_SWAPBACKED) ?
-   &uvm.page_inactive_swp : &uvm.page_inactive_obj;
+   pgl = &uvm.page_inactive;
} else if (pg->pg_flags & PQ_ACTIVE) {
pgl = &uvm.page_active;
} else {
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.164
diff -u -p -r1.164 uvm_page.c
--- uvm/uvm_page.c  28 Apr 2022 09:59:28 -  1.164
+++ uvm/uvm_page.c  2 May 2022 16:32:16 -
@@ -185,8 +185,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 */
 
TAILQ_INIT(&uvm.page_active);
-   TAILQ_INIT(&uvm.page_inactive_swp);
-   TAILQ_INIT(&uvm.page_inactive_obj);
+   TAILQ_INIT(&uvm.page_inactive);
mtx_init(&uvm.pageqlock, IPL_VM);
mtx_init(&uvm.fpageqlock, IPL_VM);
uvm_pmr_init();
@@ -994,10 +993,7 @@ uvm_pageclean(struct vm_page *pg)
uvmexp.active--;
}
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
flags_to_clear |= PQ_INACTIVE;
uvmexp.inactive--;
}
@@ -1253,10 +1249,7 @@ uvm_pagewire(struct vm_page *pg)
uvmexp.active--;
}
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
	atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
@@ -1304,10 +1297,7 @@ uvm_pagedeactivate(struct vm_page *pg)
}
if ((pg->pg_flags & PQ_INACTIVE) == 0) {
KASSERT(pg->wire_count == 0);
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_INSERT_TAIL(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_INSERT_TAIL(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_INSERT_TAIL(&uvm.page_inactive, pg, pageq);
	atomic_setbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive++;
pmap_clear_reference(pg);
@@ -1335,10 +1325,7 @@ uvm_pageactivate(struct vm_page *pg)
	MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
	atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.97
diff -u -p -r1.97 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   30 Apr 2022 17:58:43 -  1.97
+++ uvm/uvm_pdaemon.c   2 May 2022 16:32:16 -
@@ -396,13 +396,6 @@ uvmpd_scan_inactive(struct pglist *pglst
int dirtyreacts;
 
/*
-* note: we currently keep swap-backed pages on a 

uvmpd_scan(): Recheck PG_BUSY after locking the page

2022-04-28 Thread Martin Pieuchot
rw_enter(9) can sleep.  When the lock is finally acquired by the
pagedaemon the previous check might no longer be true and the page
could be busy.  In this case we shouldn't touch it.

Diff below rechecks for PG_BUSY after acquiring the lock and also
uses a variable for the lock to reduce the differences with NetBSD.
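
For illustration, here is the same pattern as a standalone userland
sketch; pthread rwlocks stand in for rw_enter(9), and this is not the
kernel code:

#include <pthread.h>
#include <stdbool.h>

struct page {
	pthread_rwlock_t lock;
	bool busy;
};

/*
 * Returns true if the page was claimed; the caller then holds the lock.
 * The unlocked read of p->busy is only a cheap hint: the flag can change
 * before the lock is acquired, so it must be re-checked under the lock.
 */
static bool
try_claim(struct page *p)
{
	if (p->busy)
		return (false);
	if (pthread_rwlock_trywrlock(&p->lock) != 0)
		return (false);
	if (p->busy) {
		pthread_rwlock_unlock(&p->lock);
		return (false);
	}
	return (true);
}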

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.96
diff -u -p -r1.96 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   11 Apr 2022 16:43:49 -  1.96
+++ uvm/uvm_pdaemon.c   28 Apr 2022 10:22:52 -
@@ -879,6 +879,8 @@ uvmpd_scan(void)
int free, inactive_shortage, swap_shortage, pages_freed;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
+   struct vm_anon *anon;
+   struct rwlock *slock;
boolean_t got_it;
 
MUTEX_ASSERT_LOCKED();
@@ -947,20 +949,34 @@ uvmpd_scan(void)
 p != NULL && (inactive_shortage > 0 || swap_shortage > 0);
 p = nextpg) {
nextpg = TAILQ_NEXT(p, pageq);
-
-   /* skip this page if it's busy. */
-   if (p->pg_flags & PG_BUSY)
+   if (p->pg_flags & PG_BUSY) {
continue;
+   }
 
-   if (p->pg_flags & PQ_ANON) {
-   KASSERT(p->uanon != NULL);
-   if (rw_enter(p->uanon->an_lock, RW_WRITE|RW_NOSLEEP))
+   /*
+* lock the page's owner.
+*/
+   if (p->uobject != NULL) {
+   uobj = p->uobject;
+   slock = uobj->vmobjlock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
continue;
+   }
} else {
-   KASSERT(p->uobject != NULL);
-   if (rw_enter(p->uobject->vmobjlock,
-   RW_WRITE|RW_NOSLEEP))
+   anon = p->uanon;
+   KASSERT(p->uanon != NULL);
+   slock = anon->an_lock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
continue;
+   }
+   }
+
+   /*
+* skip this page if it's busy.
+*/
+   if ((p->pg_flags & PG_BUSY) != 0) {
+   rw_exit(slock);
+   continue;
}
 
/*
@@ -997,10 +1013,11 @@ uvmpd_scan(void)
uvmexp.pddeact++;
inactive_shortage--;
}
-   if (p->pg_flags & PQ_ANON)
-   rw_exit(p->uanon->an_lock);
-   else
-   rw_exit(p->uobject->vmobjlock);
+
+   /*
+* we're done with this page.
+*/
+   rw_exit(slock);
}
 }
 



Call uvm_pageactivate() from uvm_pageunwire()

2022-04-26 Thread Martin Pieuchot
I'd like to use a proper interface to add/remove pages on the
active/inactive queues.  This will help with lock assertions and help
improve the existing LRU limitations.

Diff below makes uvm_pageunwire() call uvm_pageactivate() instead of
inserting the page itself.

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_page.c
--- uvm/uvm_page.c  12 Mar 2022 12:34:22 -  1.163
+++ uvm/uvm_page.c  26 Apr 2022 12:13:59 -
@@ -1279,9 +1279,7 @@ uvm_pageunwire(struct vm_page *pg)
 
pg->wire_count--;
if (pg->wire_count == 0) {
-   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-   uvmexp.active++;
-   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvm_pageactivate(pg);
uvmexp.wired--;
}
 }



Decrement uvmexp.swpgonly

2022-04-26 Thread Martin Pieuchot
Small diff to decrement the counter only if the I/O succeeds.  This
prevents a false positive if a check is performed before an error is
returned.
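
As a userland miniature of the ordering issue (a plain C11 atomic
stands in for uvmexp.swpgonly; illustrative only, not the kernel code):

#include <stdatomic.h>

static atomic_int swpgonly;

/*
 * Old ordering: a concurrent reader between the decrement and the
 * rollback observes an undercount even though the I/O failed and the
 * page is in fact still only in swap.
 */
static int
swap_get_old(int io_ok)
{
	atomic_fetch_sub(&swpgonly, 1);
	if (!io_ok) {
		atomic_fetch_add(&swpgonly, 1);	/* too late for observers */
		return (-1);
	}
	return (0);
}

/*
 * New ordering: the counter is only touched once success is known.
 */
static int
swap_get_new(int io_ok)
{
	if (!io_ok)
		return (-1);
	atomic_fetch_sub(&swpgonly, 1);
	return (0);
}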

ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_swap.c
--- uvm/uvm_swap.c  17 Mar 2022 10:15:13 -  1.154
+++ uvm/uvm_swap.c  26 Apr 2022 12:04:52 -
@@ -1572,17 +1572,16 @@ uvm_swap_get(struct vm_page *page, int s
}
 
KERNEL_LOCK();
-   /* this page is (about to be) no longer only in swap. */
-   atomic_dec_int(&uvmexp.swpgonly);
-
	result = uvm_swap_io(&page, swslot, 1, B_READ |
((flags & PGO_SYNCIO) ? 0 : B_ASYNC));
+   KERNEL_UNLOCK();
 
-   if (result != VM_PAGER_OK && result != VM_PAGER_PEND) {
-   /* oops, the read failed so it really is still only in swap. */
-   atomic_inc_int(&uvmexp.swpgonly);
+   if (result == VM_PAGER_OK || result == VM_PAGER_PEND) {
+   /*
+* this page is no longer only in swap.
+*/
+   atomic_dec_int(&uvmexp.swpgonly);
}
-   KERNEL_UNLOCK();
return (result);
 }
 



Re: const openpty et al.

2022-04-19 Thread Matthew Martin
On Thu, Apr 07, 2022 at 06:11:45PM -0500, Matthew Martin wrote:
> const the termp and winp arguments for openpty and related. This matches
> the prototypes for openpty and forkpty in glibc and musl libc.

ping; has an ok from tb@ [1]

1: https://marc.info/?l=openbsd-tech=164986161215132=2

diff --git openpty.3 openpty.3
index e0c82e00c53..080d92d8ced 100644
--- openpty.3
+++ openpty.3
@@ -47,15 +47,15 @@
 .Ft int
 .Fn getptmfd "void"
 .Ft int
-.Fn openpty "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn openpty "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Ft int
-.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Ft int
 .Fn login_tty "int fd"
 .Ft pid_t
-.Fn forkpty "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn forkpty "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Ft pid_t
-.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Sh DESCRIPTION
 The
 .Fn openpty ,
diff --git pty.c pty.c
index c796acb5041..d59f863ff2d 100644
--- pty.c
+++ pty.c
@@ -50,8 +50,8 @@ getptmfd(void)
 }
 
 int
-openpty(int *amaster, int *aslave, char *name, struct termios *termp,
-struct winsize *winp)
+openpty(int *amaster, int *aslave, char *name, const struct termios *termp,
+const struct winsize *winp)
 {
int ptmfd;
 
@@ -67,7 +67,7 @@ openpty(int *amaster, int *aslave, char *name, struct termios *termp,
 
 int
 fdopenpty(int ptmfd, int *amaster, int *aslave, char *name,
-struct termios *termp, struct winsize *winp)
+const struct termios *termp, const struct winsize *winp)
 {
int master, slave;
struct ptmget ptm;
@@ -97,7 +97,8 @@ fdopenpty(int ptmfd, int *amaster, int *aslave, char *name,
 }
 
 pid_t
-forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp)
+forkpty(int *amaster, char *name, const struct termios *termp,
+const struct winsize *winp)
 {
int ptmfd;
pid_t pid;
@@ -113,8 +114,8 @@ forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp)
 }
 
 pid_t
-fdforkpty(int ptmfd, int *amaster, char *name, struct termios *termp,
-struct winsize *winp)
+fdforkpty(int ptmfd, int *amaster, char *name, const struct termios *termp,
+const struct winsize *winp)
 {
int master, slave;
pid_t pid;
diff --git util.h util.h
index d7aa8faed4b..ecae5819e53 100644
--- util.h
+++ util.h
@@ -99,12 +99,14 @@ void	pw_copy(int, int, const struct passwd *, const struct passwd *);
 int	pw_scan(char *, struct passwd *, int *);
 void	pw_error(const char *, int, int);
 int	getptmfd(void);
-int	openpty(int *, int *, char *, struct termios *, struct winsize *);
-int	fdopenpty(int, int *, int *, char *, struct termios *,
-	    struct winsize *);
+int	openpty(int *, int *, char *, const struct termios *,
+	    const struct winsize *);
+int	fdopenpty(int, int *, int *, char *, const struct termios *,
+	    const struct winsize *);
 int	opendisk(const char *, int, char *, size_t, int);
-pid_t	forkpty(int *, char *, struct termios *, struct winsize *);
-pid_t	fdforkpty(int, int *, char *, struct termios *, struct winsize *);
+pid_t	forkpty(int *, char *, const struct termios *, const struct winsize *);
+pid_t	fdforkpty(int, int *, char *, const struct termios *,
+	    const struct winsize *);
 int	getmaxpartitions(void);
 int	getrawpartition(void);
 void	login_fbtab(const char *, uid_t, gid_t);



xmss_hash.c: remove superfluous includes

2022-04-19 Thread Martin Vahlensieck
Hi

Neither openssl/evp.h nor openssl/hmac.h are required.

Best,

Martin

Index: xmss_hash.c
===
RCS file: /cvs/src/usr.bin/ssh/xmss_hash.c,v
retrieving revision 1.2
diff -u -p -r1.2 xmss_hash.c
--- xmss_hash.c 26 Feb 2018 03:56:44 -  1.2
+++ xmss_hash.c 7 Apr 2022 07:04:19 -
@@ -15,8 +15,6 @@ Public domain.
 #include 
 #include 
 #include 
-#include <openssl/evp.h>
-#include <openssl/hmac.h>
 
 int core_hash_SHA2(unsigned char *, const unsigned int, const unsigned char *,
 unsigned int, const unsigned char *, unsigned long long, unsigned int);



ssh-xmss.c: Add missing includes

2022-04-19 Thread Martin Vahlensieck
Hi

malloc(3) and friends require stdlib.h, SIZE_MAX requires stdint.h.

Best,

Martin

Index: ssh-xmss.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-xmss.c,v
retrieving revision 1.4
diff -u -p -r1.4 ssh-xmss.c
--- ssh-xmss.c  19 Oct 2020 22:49:23 -  1.4
+++ ssh-xmss.c  7 Apr 2022 07:14:09 -
@@ -19,8 +19,10 @@
 #include 
 #include 
 
+#include <stdlib.h>
 #include 
 #include 
+#include <stdint.h>
 #include 
 
 #include "log.h"



readconf.c: Avoid a xstrdup

2022-04-19 Thread Martin Vahlensieck
Hi

There is no need to duplicate options->send_env[i] only to free it
again in all cases.  Just use options->send_env[i] directly.

Best,

Martin

Index: readconf.c
===
RCS file: /cvs/src/usr.bin/ssh/readconf.c,v
retrieving revision 1.366
diff -u -p -r1.366 readconf.c
--- readconf.c  8 Feb 2022 08:59:12 -   1.366
+++ readconf.c  19 Apr 2022 09:38:48 -
@@ -740,19 +740,15 @@ static void
 rm_env(Options *options, const char *arg, const char *filename, int linenum)
 {
int i, j, onum_send_env = options->num_send_env;
-   char *cp;
 
/* Remove an environment variable */
for (i = 0; i < options->num_send_env; ) {
-   cp = xstrdup(options->send_env[i]);
-   if (!match_pattern(cp, arg + 1)) {
-   free(cp);
+   if (!match_pattern(options->send_env[i], arg + 1)) {
i++;
continue;
}
debug3("%s line %d: removing environment %s",
-   filename, linenum, cp);
-   free(cp);
+   filename, linenum, options->send_env[i]);
free(options->send_env[i]);
options->send_env[i] = NULL;
for (j = i; j < options->num_send_env - 1; j++) {



Re: refcount btrace

2022-04-11 Thread Martin Pieuchot
On 08/04/22(Fri) 12:16, Alexander Bluhm wrote:
> On Fri, Apr 08, 2022 at 02:39:34AM +, Visa Hankala wrote:
> > On Thu, Apr 07, 2022 at 07:55:11PM +0200, Alexander Bluhm wrote:
> > > On Wed, Mar 23, 2022 at 06:13:27PM +0100, Alexander Bluhm wrote:
> > > > In my opinion tracepoints give insight at minimal cost.  It is worth
> > > > it to have it in GENERIC to make it easy to use.
> > > 
> > > After release I want to revive the btrace of refcounts discussion.
> > > 
> > > As mpi@ mentioned the idea of dt(4) is to have these trace points
> > > in GENERIC kernel.  If you want to hunt a bug, just turn it on.
> > > > Refcounting is a common place for bugs; leaks can be detected easily.
> > > 
> > > The alternative is some defines that you can compile in and access
> > > from ddb.  This is more work and you would have to implement it for
> > > every refcount.
> > > https://marc.info/?l=openbsd-tech=163786435916039=2
> > > 
> > > There is no measurable performance difference.  dt(4) is written
> > > in a way that it is only one additional branch.  At least my goal
> > > is to add trace points to useful places when we identify them.
> > 
> > DT_INDEX_ENTER() still checks the index first, so it has two branches
> > in practice.
> > 
> > I think dt_tracing should be checked first so that it serves as
> > a gateway to the trace code. Under normal operation, the check's
> > outcome is always the same, which is easy even for simple branch
> > predictors.
> 
> Reordering the check is easy.  Now dt_tracing is first.
> 
> > I have a slight suspicion that dt(4) is now becoming a way to add code
> > that would be otherwise unacceptable. Also, how "durable" are trace
> > points perceived? Is an added trace point an achieved advantage that
> > is difficult to remove even when its utility has diminished? There is
> > a similarity to (ad hoc) debug printfs too.
> 
> As I understand dt(4) it is a replacement for debug printfs.  But
> it has advantages.  I can be turnd on selectively from userland.
> It does not spam the console, but can be processed in userland.  It
> is always there, you don't have to recompile.
> 
> Of course you always have the printf or tracepoint at the wrong
> place.  I think people debugging the code should move them to
> the useful places.  Then we may end up with a generally useful tool.
> At least that is my hope.
> 
> There are obvious places to debug.  We have syscall entry and return.
> And I think reference counting is also generally interesting.

I'm happy if this can help debugging real reference counting issues.  Do
you have a script that could be committed to /usr/share/btrace to show
how to track reference counting using these probes?

> Index: dev/dt/dt_prov_static.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dt_prov_static.c,v
> retrieving revision 1.13
> diff -u -p -r1.13 dt_prov_static.c
> --- dev/dt/dt_prov_static.c   17 Mar 2022 14:53:59 -  1.13
> +++ dev/dt/dt_prov_static.c   8 Apr 2022 09:40:29 -
> @@ -87,6 +87,12 @@ DT_STATIC_PROBE1(smr, barrier_exit, "int
>  DT_STATIC_PROBE0(smr, wakeup);
>  DT_STATIC_PROBE2(smr, thread, "uint64_t", "uint64_t");
>  
> +/*
> + * reference counting
> + */
> +DT_STATIC_PROBE0(refcnt, none);
> +DT_STATIC_PROBE3(refcnt, inpcb, "void *", "int", "int");
> +DT_STATIC_PROBE3(refcnt, tdb, "void *", "int", "int");
>  
>  /*
>   * List of all static probes
> @@ -127,15 +133,24 @@ struct dt_probe *const dtps_static[] = {
>   &_DT_STATIC_P(smr, barrier_exit),
>   &_DT_STATIC_P(smr, wakeup),
>   &_DT_STATIC_P(smr, thread),
> + /* refcnt */
> + &_DT_STATIC_P(refcnt, none),
> + &_DT_STATIC_P(refcnt, inpcb),
> + &_DT_STATIC_P(refcnt, tdb),
>  };
>  
> +struct dt_probe *const *dtps_index_refcnt;
> +
>  int
>  dt_prov_static_init(void)
>  {
>   int i;
>  
> - for (i = 0; i < nitems(dtps_static); i++)
> + for (i = 0; i < nitems(dtps_static); i++) {
> + if (dtps_static[i] == &_DT_STATIC_P(refcnt, none))
> + dtps_index_refcnt = &dtps_static[i];
>   dt_dev_register_probe(dtps_static[i]);
> + }
>  
>   return i;
>  }
> Index: dev/dt/dtvar.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dtvar.h,v
> retrieving revision 1.13
> diff -u -p -r1.13 dtvar.h
> --- dev/dt/dtvar.h	27 Feb 2022 10:14:01 -  1.13
> +++ dev/dt/dtvar.h	8 Apr 2022 09:42:19 -
> @@ -313,11 +313,30 @@ extern volatile uint32_t dt_tracing; /* 
>  #define  DT_STATIC_ENTER(func, name, args...) do {			\
>   extern struct dt_probe _DT_STATIC_P(func, name);\
>   struct dt_probe *dtp = &_DT_STATIC_P(func, name);   \
> - struct dt_provider *dtpv = dtp->dtp_prov;   \
>   \
>   if 

Kill selrecord()

2022-04-11 Thread Martin Pieuchot
Now that poll(2) & select(2) use the kqueue backend under the hood we
can start retiring the old machinery. 

The diff below does not touch driver definitions, however it :

- kills selrecord() & doselwakeup()

- make it obvious that `kern.nselcoll' is now always 0 

- Change all poll/select hooks to return 0

- Kill a seltid check in wsdisplaystart() which is now always true

In a later step we could remove the *_poll() requirement from device
drivers and kill seltrue() & selfalse().

ok?

Index: arch/sparc64/dev/vldcp.c
===
RCS file: /cvs/src/sys/arch/sparc64/dev/vldcp.c,v
retrieving revision 1.22
diff -u -p -r1.22 vldcp.c
--- arch/sparc64/dev/vldcp.c24 Oct 2021 17:05:04 -  1.22
+++ arch/sparc64/dev/vldcp.c11 Apr 2022 16:38:32 -
@@ -581,44 +581,7 @@ vldcpioctl(dev_t dev, u_long cmd, caddr_
 int
 vldcppoll(dev_t dev, int events, struct proc *p)
 {
-   struct vldcp_softc *sc;
-   struct ldc_conn *lc;
-   uint64_t head, tail, state;
-   int revents = 0;
-   int s, err;
-
-   sc = vldcp_lookup(dev);
-   if (sc == NULL)
-   return (POLLERR);
-   lc = &sc->sc_lc;
-
-   s = spltty();
-   if (events & (POLLIN | POLLRDNORM)) {
-   err = hv_ldc_rx_get_state(lc->lc_id, &head, &tail, &state);
-
-   if (err == 0 && state == LDC_CHANNEL_UP && head != tail)
-   revents |= events & (POLLIN | POLLRDNORM);
-   }
-   if (events & (POLLOUT | POLLWRNORM)) {
-   err = hv_ldc_tx_get_state(lc->lc_id, &head, &tail, &state);
-
-   if (err == 0 && state == LDC_CHANNEL_UP && head != tail)
-   revents |= events & (POLLOUT | POLLWRNORM);
-   }
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM)) {
-   cbus_intr_setenabled(sc->sc_bustag, sc->sc_rx_ino,
-   INTR_ENABLED);
-   selrecord(p, &sc->sc_rsel);
-   }
-   if (events & (POLLOUT | POLLWRNORM)) {
-   cbus_intr_setenabled(sc->sc_bustag, sc->sc_tx_ino,
-   INTR_ENABLED);
-   selrecord(p, &sc->sc_wsel);
-   }
-   }
-   splx(s);
-   return revents;
+   return 0;
 }
 
 void
Index: dev/audio.c
===
RCS file: /cvs/src/sys/dev/audio.c,v
retrieving revision 1.198
diff -u -p -r1.198 audio.c
--- dev/audio.c 21 Mar 2022 19:22:39 -  1.198
+++ dev/audio.c 11 Apr 2022 16:38:52 -
@@ -2053,17 +2053,7 @@ audio_mixer_read(struct audio_softc *sc,
 int
 audio_mixer_poll(struct audio_softc *sc, int events, struct proc *p)
 {
-   int revents = 0;
-
-   mtx_enter(&audio_lock);
-   if (sc->mix_isopen && sc->mix_pending)
-   revents |= events & (POLLIN | POLLRDNORM);
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM))
-   selrecord(p, &sc->mix_sel);
-   }
-   mtx_leave(&audio_lock);
-   return revents;
+   return 0;
 }
 
 int
@@ -2101,21 +2091,7 @@ audio_mixer_close(struct audio_softc *sc
 int
 audio_poll(struct audio_softc *sc, int events, struct proc *p)
 {
-   int revents = 0;
-
-   mtx_enter(&audio_lock);
-   if ((sc->mode & AUMODE_RECORD) && sc->rec.used > 0)
-   revents |= events & (POLLIN | POLLRDNORM);
-   if ((sc->mode & AUMODE_PLAY) && sc->play.used < sc->play.len)
-   revents |= events & (POLLOUT | POLLWRNORM);
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM))
-   selrecord(p, &sc->rec.sel);
-   if (events & (POLLOUT | POLLWRNORM))
-   selrecord(p, &sc->play.sel);
-   }
-   mtx_leave(&audio_lock);
-   return revents;
+   return 0;
 }
 
 int
Index: dev/hotplug.c
===
RCS file: /cvs/src/sys/dev/hotplug.c,v
retrieving revision 1.21
diff -u -p -r1.21 hotplug.c
--- dev/hotplug.c   25 Dec 2020 12:59:52 -  1.21
+++ dev/hotplug.c   11 Apr 2022 16:39:24 -
@@ -183,16 +183,7 @@ hotplugioctl(dev_t dev, u_long cmd, cadd
 int
 hotplugpoll(dev_t dev, int events, struct proc *p)
 {
-   int revents = 0;
-
-   if (events & (POLLIN | POLLRDNORM)) {
-   if (evqueue_count > 0)
-   revents |= events & (POLLIN | POLLRDNORM);
-   else
-   selrecord(p, &hotplug_sel);
-   }
-
-   return (revents);
+   return (0);
 }
 
 int
Index: dev/midi.c
===
RCS file: /cvs/src/sys/dev/midi.c,v
retrieving revision 1.54
diff -u -p -r1.54 midi.c
--- dev/midi.c  6 Apr 2022 18:59:27 -   1.54
+++ dev/midi.c  11 Apr 2022 16:39:31 -
@@ -334,31 +334,7 @@ done:
 int
 midipoll(dev_t dev, int events, struct proc *p)
 {
-   struct midi_softc *sc;
-   int 

sshd_config(5): Use correct path for system-wide known_hosts

2022-04-11 Thread Martin Vahlensieck
Hi

The path to the system-wide known_hosts file is /etc/ssh/ssh_known_hosts
and not /etc/ssh/known_hosts.  See auth2-hostbased.c line 221-223.

Best,

Martin

Index: sshd_config.5
===
RCS file: /cvs/src/usr.bin/ssh/sshd_config.5,v
retrieving revision 1.340
diff -u -p -r1.340 sshd_config.5
--- sshd_config.5   31 Mar 2022 17:58:44 -  1.340
+++ sshd_config.5   10 Apr 2022 20:35:39 -
@@ -818,7 +818,7 @@ should ignore the user's
 during
 .Cm HostbasedAuthentication
 and use only the system-wide known hosts file
-.Pa /etc/ssh/known_hosts .
+.Pa /etc/ssh/ssh_known_hosts .
 The default is
 .Dq no .
 .It Cm Include



const openpty et al.

2022-04-07 Thread Matthew Martin
const the termp and winp arguments for openpty and related. This matches
the prototypes for openpty and forkpty in glibc and musl libc.
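
With the constified prototypes a caller can pass const-qualified
settings directly.  A minimal sketch (assumes the patch below is
applied; link with -lutil):

#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>
#include <util.h>
#include <err.h>

int
main(void)
{
	static const struct winsize ws = { .ws_row = 24, .ws_col = 80 };
	int mfd, sfd;

	/* NULL termios keeps the defaults; passing &ws would provoke a
	 * qualifier warning against the old non-const prototype */
	if (openpty(&mfd, &sfd, NULL, NULL, &ws) == -1)
		err(1, "openpty");
	close(sfd);
	close(mfd);
	return (0);
}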

diff --git openpty.3 openpty.3
index e0c82e00c53..080d92d8ced 100644
--- openpty.3
+++ openpty.3
@@ -47,15 +47,15 @@
 .Ft int
 .Fn getptmfd "void"
 .Ft int
-.Fn openpty "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn openpty "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Ft int
-.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Ft int
 .Fn login_tty "int fd"
 .Ft pid_t
-.Fn forkpty "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn forkpty "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Ft pid_t
-.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp"
+.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp"
 .Sh DESCRIPTION
 The
 .Fn openpty ,
diff --git pty.c pty.c
index c796acb5041..d59f863ff2d 100644
--- pty.c
+++ pty.c
@@ -50,8 +50,8 @@ getptmfd(void)
 }
 
 int
-openpty(int *amaster, int *aslave, char *name, struct termios *termp,
-struct winsize *winp)
+openpty(int *amaster, int *aslave, char *name, const struct termios *termp,
+const struct winsize *winp)
 {
int ptmfd;
 
@@ -67,7 +67,7 @@ openpty(int *amaster, int *aslave, char *name, struct termios *termp,
 
 int
 fdopenpty(int ptmfd, int *amaster, int *aslave, char *name,
-struct termios *termp, struct winsize *winp)
+const struct termios *termp, const struct winsize *winp)
 {
int master, slave;
struct ptmget ptm;
@@ -97,7 +97,8 @@ fdopenpty(int ptmfd, int *amaster, int *aslave, char *name,
 }
 
 pid_t
-forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp)
+forkpty(int *amaster, char *name, const struct termios *termp,
+const struct winsize *winp)
 {
int ptmfd;
pid_t pid;
@@ -113,8 +114,8 @@ forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp)
 }
 
 pid_t
-fdforkpty(int ptmfd, int *amaster, char *name, struct termios *termp,
-struct winsize *winp)
+fdforkpty(int ptmfd, int *amaster, char *name, const struct termios *termp,
+const struct winsize *winp)
 {
int master, slave;
pid_t pid;
diff --git util.h util.h
index d7aa8faed4b..ecae5819e53 100644
--- util.h
+++ util.h
@@ -99,12 +99,14 @@ void	pw_copy(int, int, const struct passwd *, const struct passwd *);
 int	pw_scan(char *, struct passwd *, int *);
 void	pw_error(const char *, int, int);
 int	getptmfd(void);
-int	openpty(int *, int *, char *, struct termios *, struct winsize *);
-int	fdopenpty(int, int *, int *, char *, struct termios *,
-	    struct winsize *);
+int	openpty(int *, int *, char *, const struct termios *,
+	    const struct winsize *);
+int	fdopenpty(int, int *, int *, char *, const struct termios *,
+	    const struct winsize *);
 int	opendisk(const char *, int, char *, size_t, int);
-pid_t	forkpty(int *, char *, struct termios *, struct winsize *);
-pid_t	fdforkpty(int, int *, char *, struct termios *, struct winsize *);
+pid_t	forkpty(int *, char *, const struct termios *, const struct winsize *);
+pid_t	fdforkpty(int, int *, char *, const struct termios *,
+	    const struct winsize *);
 int	getmaxpartitions(void);
 int	getrawpartition(void);
 void	login_fbtab(const char *, uid_t, gid_t);



rcctl.8: Add missing variable

2022-03-31 Thread Matthew Martin
diff --git rcctl.8 rcctl.8
index 6c3048b834c..93a76a937c1 100644
--- rcctl.8
+++ rcctl.8
@@ -64,6 +64,7 @@ can be one of
 .Cm class ,
 .Cm flags ,
 .Cm logger ,
+.Cm rtable ,
 .Cm status ,
 .Cm timeout
 or



Re: refcount btrace

2022-03-21 Thread Martin Pieuchot
On 20/03/22(Sun) 05:39, Visa Hankala wrote:
> On Sat, Mar 19, 2022 at 12:10:11AM +0100, Alexander Bluhm wrote:
> > On Thu, Mar 17, 2022 at 07:25:27AM +, Visa Hankala wrote:
> > > On Thu, Mar 17, 2022 at 12:42:13AM +0100, Alexander Bluhm wrote:
> > > > I would like to use btrace to debug reference counting.  The idea
> > > > is to add a tracepoint for every type of refcnt we have.  When it
> > > > changes, print the actual object, the current counter and the change
> > > > value.
> > > 
> > > > Do we want that feature?
> > > 
> > > I am against this in its current form. The code would become more
> > > complex, and the trace points can affect timing. There is a risk that
> > > the kernel behaves slightly differently when dt has been compiled in.
> > 
> > On our main architectures dt(4) is in GENERIC.  I see your timing
> > point for uvm structures.
> 
> In my opinion, having dt(4) enabled by default is another reason why
> there should be no carte blanche for adding trace points. Each trace
> point adds a tiny amount of bloat. Few users will use the tracing
> facility.
> 
> Maybe high-rate trace points could be behind a build option...

The whole point of dt(4) is to be able to debug GENERIC kernel.  I doubt
the cost of an additional if () block matters.



rpki-client(8): properly zero terminate pretty printed key

2022-03-17 Thread Martin Vahlensieck
Hi

It seems the pretty printed key is NUL-terminated only if the size
of hex stays the same or increases between calls.  This diff fixes
it, so it is always properly terminated.  While here, also drop
*hex != '\0' from the if inside the loop, as it is checked directly
above in the loop condition, and constify the argument, as it is not
modified.
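
The stale tail is easy to reproduce in a standalone program; the
function below is a copy of the unpatched loop (modulo the constified
argument):

#include <stdio.h>
#include <string.h>

static const char *
pretty_key_id_old(const char *hex)
{
	static char buf[128];	/* bigger than SHA_DIGEST_LENGTH * 3 */
	size_t i;

	for (i = 0; i < sizeof(buf) && *hex != '\0'; i++) {
		if (i % 3 == 2 && *hex != '\0')
			buf[i] = ':';
		else
			buf[i] = *hex++;
	}
	if (i == sizeof(buf))
		memcpy(buf + sizeof(buf) - 4, "...", 4);
	return buf;
}

int
main(void)
{
	printf("%s\n", pretty_key_id_old("00112233"));	/* 00:11:22:33 */
	printf("%s\n", pretty_key_id_old("ff"));	/* ff:11:22:33, stale tail */
	return (0);
}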

Best,

Martin

Index: print.c
===
RCS file: /cvs/src/usr.sbin/rpki-client/print.c,v
retrieving revision 1.5
diff -u -p -r1.5 print.c
--- print.c 10 Feb 2022 17:33:28 -  1.5
+++ print.c 17 Mar 2022 17:46:01 -
@@ -28,19 +28,21 @@
 #include "extern.h"
 
 static const char *
-pretty_key_id(char *hex)
+pretty_key_id(const char *hex)
 {
static char buf[128];   /* bigger than SHA_DIGEST_LENGTH * 3 */
size_t i;
 
for (i = 0; i < sizeof(buf) && *hex != '\0'; i++) {
-   if  (i % 3 == 2 && *hex != '\0')
+   if  (i % 3 == 2)
buf[i] = ':';
else
buf[i] = *hex++;
}
if (i == sizeof(buf))
memcpy(buf + sizeof(buf) - 4, "...", 4);
+   else
+   buf[i] = '\0';
return buf;
 }
 



wg: remove argument names from prototypes

2022-03-17 Thread Martin Vahlensieck
None of the other prototypes have argument names.

Index: if_wg.c
===
RCS file: /home/reposync/cvs/src/sys/net/if_wg.c,v
retrieving revision 1.22
diff -u -p -r1.22 if_wg.c
--- if_wg.c 22 Feb 2022 01:15:02 -  1.22
+++ if_wg.c 15 Mar 2022 21:10:37 -
@@ -325,7 +325,7 @@ void	wg_peer_send_buf(struct wg_peer *, 
 void   wg_send_initiation(void *);
 void   wg_send_response(struct wg_peer *);
 void   wg_send_cookie(struct wg_softc *, struct cookie_macs *, uint32_t,
-   struct wg_endpoint *e);
+   struct wg_endpoint *);
 void   wg_send_keepalive(void *);
 void   wg_peer_clear_secrets(void *);
 void   wg_handshake(struct wg_softc *, struct mbuf *);
Index: wg_cookie.c
===
RCS file: /home/reposync/cvs/src/sys/net/wg_cookie.c,v
retrieving revision 1.3
diff -u -p -r1.3 wg_cookie.c
--- wg_cookie.c 10 Mar 2021 10:21:48 -  1.3
+++ wg_cookie.c 15 Mar 2022 21:09:29 -
@@ -37,7 +37,7 @@ static void   cookie_macs_mac2(struct cook
 static int cookie_timer_expired(struct timespec *, time_t, long);
 static voidcookie_checker_make_cookie(struct cookie_checker *,
uint8_t[COOKIE_COOKIE_SIZE], struct sockaddr *);
-static int ratelimit_init(struct ratelimit *, struct pool *pool);
+static int ratelimit_init(struct ratelimit *, struct pool *);
 static voidratelimit_deinit(struct ratelimit *);
 static voidratelimit_gc(struct ratelimit *, int);
 static int ratelimit_allow(struct ratelimit *, struct sockaddr *);



two small typos

2022-03-17 Thread Martin Vahlensieck
Index: if_wg.c
===
RCS file: /home/reposync/cvs/src/sys/net/if_wg.c,v
retrieving revision 1.22
diff -u -p -r1.22 if_wg.c
--- if_wg.c 22 Feb 2022 01:15:02 -  1.22
+++ if_wg.c 15 Mar 2022 21:10:37 -
@@ -2023,7 +2023,7 @@ wg_input(void *_sc, struct mbuf *m, stru
 
/*
 * Ensure mbuf is contiguous over full length of packet. This is done
-* os we can directly read the handshake values in wg_handshake, and so
+* so we can directly read the handshake values in wg_handshake, and so
 * we can decrypt a transport packet by passing a single buffer to
 * noise_remote_decrypt in wg_decap.
 */
Index: pf.c
===
RCS file: /home/reposync/cvs/src/sys/net/pf.c,v
retrieving revision 1.1125
diff -u -p -r1.1125 pf.c
--- pf.c	5 Mar 2022 10:43:32 -   1.1125
+++ pf.c	10 Mar 2022 15:53:51 -
@@ -1340,7 +1340,7 @@ pf_state_expires(const struct pf_state *
 * this function may be called by the state purge task while
 * the state is being modified. avoid inconsistent reads of
 * state->timeout by having the caller do the read (and any
-* chacks it needs to do on the same variable) and then pass
+* checks it needs to do on the same variable) and then pass
 * their view of the timeout in here for this function to use.
 * the only consequence of using a stale timeout value is
 * that the state won't be a candidate for purging until the



Re: issue with login.conf(5) rtable and su -l user

2022-03-13 Thread Matthew Martin
On Sun, Mar 13, 2022 at 02:30:23PM +0100, Solene Rapenne wrote:
> Hi, I'm playing with the new rtable feature in login.conf(5) but it
> seems one use case doesn't trigger the rtable change.
> 
> I have an user called alice, if I ssh locally from my user to alice
> with ssh alice@localhost, alice has the correct routing table, if I use
> as root "su -l alice", then alice seems using rtable 0.

Ignoring -L which already honors rtable, su has three cases:
  -l (asme=0 asthem=1)
  -m (asme=1 asthem=0)
   (asme=0 asthem=0)

-l should honor rtable; I am not sure about the other two.  I think the
least surprising would be for the neither case to honor rtable and for -m
not to, but I don't have a strong opinion here.  Patch as suggested below.

> if it works, I'm using rtable 1 (openvpn), if not, it's using rtable 0.

id -R will show the rtable directly.


diff --git su.c su.c
index f87e6690835..c2fbbe2724d 100644
--- su.c
+++ su.c
@@ -355,6 +355,8 @@ main(int argc, char **argv)
flags &= ~LOGIN_SETLOGIN;
} else {
flags = LOGIN_SETRESOURCES|LOGIN_SETGROUP|LOGIN_SETUSER;
+   if (!asme)
+   flags |= LOGIN_SETRTABLE;
if (asthem)
flags |= LOGIN_SETENV|LOGIN_SETPRIORITY|LOGIN_SETUMASK;
}



Swap encrypt under memory pressure

2022-03-12 Thread Martin Pieuchot
Try to allocate the buffer before doing the encryption; if the allocation
fails we do not spend time doing the encryption.  This reduces the
pressure when swapping with low memory.

ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.153
diff -u -p -r1.153 uvm_swap.c
--- uvm/uvm_swap.c  22 Feb 2022 01:15:02 -  1.153
+++ uvm/uvm_swap.c  12 Mar 2022 10:30:26 -
@@ -1690,6 +1690,26 @@ uvm_swap_io(struct vm_page **pps, int st
}
}
 
+   /*
+* now allocate a buf for the i/o.
+* [make sure we don't put the pagedaemon to sleep...]
+*/
+   pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT :
+   PR_WAITOK;
+   bp = pool_get(&bufpool, pflag | PR_ZERO);
+
+   /*
+* if we failed to get a swapbuf, return "try again"
+*/
+   if (bp == NULL) {
+   if (bounce) {
+   uvm_pagermapout(bouncekva, npages);
+   uvm_swap_freepages(tpps, npages);
+   }
+   uvm_pagermapout(kva, npages);
+   return (VM_PAGER_AGAIN);
+   }
+
/* encrypt to swap */
if (write && bounce) {
int i, opages;
@@ -1729,35 +1749,6 @@ uvm_swap_io(struct vm_page **pps, int st
  PGO_PDFREECLUST);
 
kva = bouncekva;
-   }
-
-   /*
-* now allocate a buf for the i/o.
-* [make sure we don't put the pagedaemon to sleep...]
-*/
-   pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT :
-   PR_WAITOK;
-   bp = pool_get(&bufpool, pflag | PR_ZERO);
-
-   /*
-* if we failed to get a swapbuf, return "try again"
-*/
-   if (bp == NULL) {
-   if (write && bounce) {
-#ifdef UVM_SWAP_ENCRYPT
-   int i;
-
-   /* swap encrypt needs cleanup */
-   if (encrypt)
-   for (i = 0; i < npages; i++)
-   SWAP_KEY_PUT(sdp, SWD_KEY(sdp,
-   startslot + i));
-#endif
-
-   uvm_pagermapout(kva, npages);
-   uvm_swap_freepages(tpps, npages);
-   }
-   return (VM_PAGER_AGAIN);
}
 
/*



Re: rpki-client: fix wrong conditional

2022-03-10 Thread Martin Vahlensieck
On Thu, Mar 10, 2022 at 06:15:48PM +0100, Theo Buehler wrote:
> On Thu, Mar 10, 2022 at 06:03:14PM +0100, Claudio Jeker wrote:
> > On Thu, Mar 10, 2022 at 05:54:21PM +0100, Theo Buehler wrote:
> > > On Thu, Mar 10, 2022 at 05:51:46PM +0100, Claudio Jeker wrote:
> > > > On Thu, Mar 10, 2022 at 05:33:28PM +0100, Martin Vahlensieck wrote:
> > > > > Hi
> > > > > 
> > > > > This pulls up and adjusts the check if i exceeds the bounds of pfds.
> > > > > Before it was technically wrong, as i > NPFDS means that the last
> > > > > write (i == NPFDS) was already out of bounds.
> > > >  
> > > > I see no reason to pull up the check but the if condition should
> > > > indeed be greater or equal.  One could consider changing this into
> > > > an assert() but I think I'll stick with the errx().
> > > 
> > > Agreed. ok for the diff that just changes the checks to >=
> > 
> > Actually I was wrong, the check needs to happen at the start of the loop,
> > not at the end, else it does not work if the list is exactly the number of
> > elements to fill NPFDS.
> 
> Ah right, we need to bail out when we would actually go past the limit.
> However, I see no reason to fiddle with the timeout before checking.
> So let's move the check to the beginning.

Here you go:

Index: http.c
===
RCS file: /cvs/src/usr.sbin/rpki-client/http.c,v
retrieving revision 1.53
diff -u -p -r1.53 http.c
--- http.c  10 Feb 2022 11:10:40 -  1.53
+++ http.c  10 Mar 2022 17:28:21 -
@@ -1809,6 +1809,9 @@ proc_http(char *bind_addr, int fd)
timeout = INFTIM;
now = getmonotime();
	LIST_FOREACH(conn, &active, entry) {
+   if (i >= NPFDS)
+   errx(1, "too many connections");
+
if (conn->io_time == 0)
conn->io_time = now + HTTP_IO_TIMEOUT;
 
@@ -1828,10 +1831,11 @@ proc_http(char *bind_addr, int fd)
pfds[i].events = conn->events;
	conn->pfd = &pfds[i];
i++;
-   if (i > NPFDS)
-   errx(1, "too many connections");
}
	LIST_FOREACH(conn, &idle, entry) {
+   if (i >= NPFDS)
+   errx(1, "too many connections");
+
if (conn->idle_time <= now)
timeout = 0;
else {
@@ -1844,8 +1848,6 @@ proc_http(char *bind_addr, int fd)
pfds[i].events = POLLIN;
	conn->pfd = &pfds[i];
i++;
-   if (i > NPFDS)
-   errx(1, "too many connections");
}
 
if (poll(pfds, i, timeout) == -1) {



rpki-client: fix wrong conditional

2022-03-10 Thread Martin Vahlensieck
Hi

This pulls up and adjusts the check if i exceeds the bounds of pfds.
Before it was technically wrong, as i > NPFDS means that the last
write (i == NPFDS) was already out of bounds.

Best,

Martin


Index: http.c
===
RCS file: /cvs/src/usr.sbin/rpki-client/http.c,v
retrieving revision 1.53
diff -u -p -r1.53 http.c
--- http.c  10 Feb 2022 11:10:40 -  1.53
+++ http.c  10 Mar 2022 16:28:48 -
@@ -1820,6 +1820,10 @@ proc_http(char *bind_addr, int fd)
if (timeout == INFTIM || diff < timeout)
timeout = diff;
}
+
+   if (i >= NPFDS)
+   errx(1, "too many connections");
+
if (conn->state == STATE_WRITE_DATA)
pfds[i].fd = conn->req->outfd;
else
@@ -1828,8 +1832,6 @@ proc_http(char *bind_addr, int fd)
pfds[i].events = conn->events;
	conn->pfd = &pfds[i];
i++;
-   if (i > NPFDS)
-   errx(1, "too many connections");
}
	LIST_FOREACH(conn, &idle, entry) {
if (conn->idle_time <= now)
@@ -1840,12 +1842,14 @@ proc_http(char *bind_addr, int fd)
if (timeout == INFTIM || diff < timeout)
timeout = diff;
}
+
+   if (i >= NPFDS)
+   errx(1, "too many connections");
+
pfds[i].fd = conn->fd;
pfds[i].events = POLLIN;
	conn->pfd = &pfds[i];
i++;
-   if (i > NPFDS)
-   errx(1, "too many connections");
}
 
if (poll(pfds, i, timeout) == -1) {



[patch] LOGIN_SETENV fails when login.conf is missing

2022-02-26 Thread Matthew Martin
Anton spotted a doas regression failure in t-run-keepenv-path after the
change to doas for LOGIN_SETALL. Since that test runs doas in a chroot
and the setup does not create a login.conf, login_getclass in
login_cap.c will return a login_cap_t with a NULL lc_cap (and errno set
to ENOENT) on L133. setuserenv returns -1 if lc->lc_cap is NULL causing
the "could not set user environment" failure.

As all the login_getcap* functions as well as gsetrl and setuserpath
succeed on a missing login.conf, the below patch changes setuserenv to
return 0 as well. This matches what happens if the class has no setenv
capability right below. The comment is taken from gsetrl.



diff --git gen/login_cap.c gen/login_cap.c
index 67933653349..c142c0a7503 100644
--- gen/login_cap.c
+++ gen/login_cap.c
@@ -700,8 +700,11 @@ setuserpath(login_cap_t *lc, const struct passwd *pwd)
char *path = NULL, *opath = NULL, *op, *np;
int len, error;
 
+   /*
+* If we have no capabilities then set _PATH_DEFPATH.
+*/
if (lc->lc_cap == NULL)
-   goto setit; /* impossible */
+   goto setit;
 
if ((len = cgetustr(lc->lc_cap, "path", )) <= 0)
goto setit;
@@ -753,8 +756,12 @@ setuserenv(login_cap_t *lc, const struct passwd *pwd)
char *beg, *end, *ep, *list, *value;
int len, error;
 
+   /*
+* If we have no capabilities then there is nothing to do and
+* we can just return success.
+*/
if (lc->lc_cap == NULL)
-   return (-1);/* impossible */
+   return (0);
 
if ((len = cgetustr(lc->lc_cap, "setenv", )) <= 0)
return (0);



Re: Add rtable capability to login.conf

2022-02-24 Thread Matthew Martin
On Fri, Feb 18, 2022 at 03:25:51PM -0500, Ted Unangst wrote:
> On 2022-02-06, Ted Unangst wrote:
> > On 2022-02-05, Matthew Martin wrote:
> > > On Sat, Jan 29, 2022 at 06:25:32PM -0600, Matthew Martin wrote:
> > > > On Sat, Jan 29, 2022 at 07:10:00PM -0500, Ted Unangst wrote:
> > > > > I believe it would be better to add setrtable to id pledge.
> > > 
> > > ping
> > > 
> > > Also are there any opinions on adding LOGIN_SETRTABLE to doas?
> > 
> > I think this diff looks fine.
> > 
> > For doas, we can use setall with an extra note in the man page.
> 
> Final auction for oks. I think all the login.conf.d changes are in now.
> 
> Plan is add setrtable to pledge first so people don't get caught, then libc.

ping?

> > Index: doas.1
> > ===
> > RCS file: /home/cvs/src/usr.bin/doas/doas.1,v
> > retrieving revision 1.25
> > diff -u -p -r1.25 doas.1
> > --- doas.1  16 Jan 2021 09:18:41 -  1.25
> > +++ doas.1  6 Feb 2022 18:41:53 -
> > @@ -54,6 +54,8 @@ and
> >  and the
> >  .Xr umask 2
> >  are set to values appropriate for the target user.
> > +Other values may also be set as specified in
> > +.Pa /etc/login.conf .
> >  .Ev DOAS_USER
> >  is set to the name of the user executing
> >  .Nm .
> > Index: doas.c
> > ===
> > RCS file: /home/cvs/src/usr.bin/doas/doas.c,v
> > retrieving revision 1.93
> > diff -u -p -r1.93 doas.c
> > --- doas.c  30 Nov 2021 20:08:15 -  1.93
> > +++ doas.c  6 Feb 2022 18:39:38 -
> > @@ -450,10 +450,7 @@ main(int argc, char **argv)
> > if (targpw == NULL)
> > errx(1, "no passwd entry for target");
> >  
> > -   if (setusercontext(NULL, targpw, target, LOGIN_SETGROUP |
> > -   LOGIN_SETPATH |
> > -   LOGIN_SETPRIORITY | LOGIN_SETRESOURCES | LOGIN_SETUMASK |
> > -   LOGIN_SETUSER) != 0)
> > +   if (setusercontext(NULL, targpw, target, LOGIN_SETALL) == -1)
> > errx(1, "failed to set user context for target");
> >  
> > if (pledge("stdio rpath exec", NULL) == -1)
> > 
> > 



Re: yacc(1): skeleton.c: remove outdated comment

2022-02-23 Thread Martin Vahlensieck
ping, diff reattached
On Thu, Feb 10, 2022 at 04:29:53PM +0100, Martin Vahlensieck wrote:
> Hi
> 
> yysccsid was removed in 1.30 back in 2009.
> 
> Best,
> 
> Martin
> 
> Index: skeleton.c
> ===
> RCS file: /cvs/src/usr.bin/yacc/skeleton.c,v
> retrieving revision 1.40
> diff -u -p -r1.40 skeleton.c
> --- skeleton.c	3 Feb 2021 01:10:10 -   1.40
> +++ skeleton.c	10 Feb 2022 15:29:08 -
> @@ -35,10 +35,6 @@
>  
>  #include "defs.h"
>  
> -/*  The definition of yysccsid in the banner should be replaced with */
> -/*  a #pragma ident directive if the target C compiler supports      */
> -/*  #pragma ident directives.                                        */
> -/*   */
>  /*  If the skeleton is changed, the banner should be changed so that */
>  /*  the altered version can be easily distinguished from the original.   */
>  /*   */
> 

Index: skeleton.c
===
RCS file: /cvs/src/usr.bin/yacc/skeleton.c,v
retrieving revision 1.40
diff -u -p -r1.40 skeleton.c
--- skeleton.c  3 Feb 2021 01:10:10 -   1.40
+++ skeleton.c  10 Feb 2022 15:29:08 -
@@ -35,10 +35,6 @@
 
 #include "defs.h"
 
-/*  The definition of yysccsid in the banner should be replaced with   */
-/*  a #pragma ident directive if the target C compiler supports        */
-/*  #pragma ident directives.  */
-/* */
 /*  If the skeleton is changed, the banner should be changed so that   */
 /*  the altered version can be easily distinguished from the original. */
 /* */



ssh(1): monitor.c: save a xstrdup

2022-02-23 Thread Martin Vahlensieck
Hi

Tiny diff to save an allocation.

Best,

Martin

Index: monitor.c
===
RCS file: /cvs/src/usr.bin/ssh/monitor.c,v
retrieving revision 1.231
diff -u -p -r1.231 monitor.c
--- monitor.c   28 Jan 2022 06:18:42 -  1.231
+++ monitor.c   23 Feb 2022 16:49:27 -
@@ -658,9 +658,8 @@ mm_answer_pwnamallow(struct ssh *ssh, in
 
pwent = getpwnamallow(ssh, username);
 
-   authctxt->user = xstrdup(username);
+   authctxt->user = username;
setproctitle("%s [priv]", pwent ? username : "unknown");
-   free(username);
 
sshbuf_reset(m);
 



Re: if_wg: Missing DPRINTF newline

2022-02-20 Thread Matthew Martin
On Thu, Feb 03, 2022 at 07:53:43AM +, Jason McIntyre wrote:
> On Wed, Feb 02, 2022 at 07:46:39PM -0600, Matthew Martin wrote:
> > Two DPRINTFs in sys/net/if_wg.c are missing a newline.
> > 
> 
> if this is committed, the committer may also want to question the dodgy
> capitals in "Delay Unsupported".
> 
> jmc

ping


diff --git if_wg.c if_wg.c
index 13c48f42c54..a3efd577dbc 100644
--- if_wg.c
+++ if_wg.c
@@ -2156,7 +2156,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa,
}
 
if (m->m_pkthdr.ph_loopcnt++ > M_MAXLOOP) {
-   DPRINTF(sc, "Packet looped");
+   DPRINTF(sc, "Packet looped\n");
ret = ELOOP;
goto error;
}
@@ -2169,7 +2169,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa,
 * another aip_lookup in wg_qstart, or refcnting as mentioned before.
 */
if (m->m_pkthdr.pf.delay > 0) {
-   DPRINTF(sc, "PF Delay Unsupported");
+   DPRINTF(sc, "PF Delay Unsupported\n");
ret = EOPNOTSUPP;
goto error;
}




Re: more MAKEDEV cleanup

2022-02-10 Thread Martin Pieuchot
On 05/04/21(Mon) 09:25, Miod Vallat wrote:
> The following diff attempts to clean up a few loose ends in the current
> MAKEDEV files:
> 
> - remove no-longer applicable device definitions (MSCP and SMD disks,
>   this kind of thing).
> - makes sure all platforms use the same `ramdisk' target for
>   installation media devices, rather than a mix of `ramd' and `ramdisk'.
> - moves as many `ramdisk' devices to MI land (bio, diskmap, random,
>   etc).
> - reduces the number of block devices in `ramdisk' targets to only one
>   per device, since the installer script will invoke MAKEDEV by itself
>   for the devices it needs to use.
> - sort device names in `all' and `ramdisk' MI lists to make maintenance
>   easier. This causes some ordering change in the `all' target in the
>   generated MAKEDEVs.

What happened to this?

> Index: MAKEDEV.common
> ===
> RCS file: /OpenBSD/src/etc/MAKEDEV.common,v
> retrieving revision 1.113
> diff -u -p -r1.113 MAKEDEV.common
> --- MAKEDEV.common	12 Feb 2021 10:26:33 -  1.113
> +++ MAKEDEV.common	5 Apr 2021 09:18:49 -
> @@ -114,7 +114,7 @@ dnl make a 'disktgt' macro that automati
>  dnl disktgt(rd, {-rd-})
>  dnl
>  dnl  target(all,rd,0)
> -dnl  target(ramd,rd,0)
> +dnl  target(ramdisk,rd,0)
>  dnl  disk_q(rd)
>  dnl  __devitem(rd, {-rd*-}, {-rd-})dnl
>  dnl
> @@ -122,62 +122,60 @@ dnl  Note: not all devices are generated
>  dnlits own extra list.
>  dnl
>  divert(1)dnl
> +target(all, acpi)dnl
> +target(all, apm)dnl
> +target(all, bio)dnl
> +target(all, bpf)dnl
> +twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl
> +twrget(all, czs, cua, a, b, c, d)dnl
> +target(all, diskmap)dnl
> +target(all, dt)dnl
>  twrget(all, fdesc, fd)dnl
> -target(all, st, 0, 1)dnl
> -target(all, std)dnl
> -target(all, ra, 0, 1, 2, 3)dnl
> -target(all, rx, 0, 1)dnl
> -target(all, wd, 0, 1, 2, 3)dnl
> -target(all, xd, 0, 1, 2, 3)dnl
> +target(all, fuse)dnl
> +target(all, hotplug)dnl
> +target(all, joy, 0, 1)dnl
> +target(all, kcov)dnl
> +target(all, kstat)dnl
> +target(all, local)dnl
> +target(all, lpt, 0, 1, 2)dnl
> +twrget(all, lpt, lpa, 0, 1, 2)dnl
> +target(all, par, 0)dnl
> +target(all, pci, 0, 1, 2, 3)dnl
>  target(all, pctr)dnl
>  target(all, pctr0)dnl
>  target(all, pf)dnl
> -target(all, apm)dnl
> -target(all, acpi)dnl
> +target(all, pppac)dnl
> +target(all, pppx)dnl
> +target(all, ptm)dnl
> +target(all, pty, 0)dnl
> +target(all, pvbus, 0, 1)dnl
> +target(all, radio, 0)dnl
> +target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> +twrget(all, rnd, random)dnl
> +twrget(all, speak, speaker)dnl
> +target(all, st, 0, 1)dnl
> +target(all, std)dnl
> +target(all, switch, 0, 1, 2, 3)dnl
> +target(all, tap, 0, 1, 2, 3)dnl
>  twrget(all, tth, ttyh, 0, 1)dnl
>  target(all, ttyA, 0, 1)dnl
> -twrget(all, mac_tty0, tty0, 0, 1)dnl
> -twrget(all, tzs, tty, a, b, c, d)dnl
> -twrget(all, czs, cua, a, b, c, d)dnl
>  target(all, ttyc, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> -twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl
> -twrget(all, mmcl, mmclock)dnl
> -target(all, lpt, 0, 1, 2)dnl
> -twrget(all, lpt, lpa, 0, 1, 2)dnl
> -target(all, joy, 0, 1)dnl
> -twrget(all, rnd, random)dnl
> -target(all, uk, 0)dnl
> -twrget(all, vi, video, 0, 1)dnl
> -twrget(all, speak, speaker)dnl
> -target(all, asc, 0)dnl
> -target(all, radio, 0)dnl
> +target(all, tun, 0, 1, 2, 3)dnl
>  target(all, tuner, 0)dnl
> -target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> +twrget(all, tzs, tty, a, b, c, d)dnl
>  target(all, uall)dnl
> -target(all, pci, 0, 1, 2, 3)dnl
> -twrget(all, wsmouse, wscons)dnl
> -target(all, par, 0)dnl
> -target(all, apci, 0)dnl
> -target(all, local)dnl
> -target(all, ptm)dnl
> -target(all, hotplug)dnl
> -target(all, pppx)dnl
> -target(all, pppac)dnl
> -target(all, fuse)dnl
> +target(all, uk, 0)dnl
> +twrget(all, vi, video, 0, 1)dnl
>  target(all, vmm)dnl
> -target(all, pvbus, 0, 1)dnl
> -target(all, bpf)dnl
> -target(all, kcov)dnl
> -target(all, dt)dnl
> -target(all, kstat)dnl
> +target(all, vnd, 0, 1, 2, 3)dnl
> +target(all, vscsi, 0)dnl
> +target(all, wd, 0, 1, 2, 3)dnl
> +twrget(all, wsmouse, wscons)dnl
>  dnl
>  _mkdev(all, {-all-}, {-dnl
>  show_target(all)dnl
>  -})dnl
>  dnl
> -dnl XXX some arches use ramd, others ramdisk - needs to be fixed eventually
> -__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl
> -dnl
>  target(usb, usb, 0, 1, 2, 3, 4, 5, 6, 7)dnl
>  target(usb, uhid, 0, 1, 2, 3, 4, 5, 6, 7)dnl
>  twrget(usb, fido, fido)dnl
> @@ -208,26 +206,26 @@ __devitem(ch, {-ch*-}, SCSI media change
>  _mcdev(ch, ch*, ch, {-major_ch_c-}, 660, operator)dnl
>  __devitem(uk, uk*, Unknown SCSI devices)dnl
>  _mcdev(uk, uk*, uk, {-major_uk_c-}, 640, operator)dnl
> -dnl XXX see ramdisk above
> -__devitem(ramd, ramdisk, Ramdisk kernel devices,nothing)dnl
>  dnl
> -_mkdev(ramd, ramdisk, {-dnl
> -show_target(ramd)dnl
> +__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl
> +_mkdev(ramdisk, ramdisk, {-dnl
> 

yacc(1): skeleton.c: remove outdated comment

2022-02-10 Thread Martin Vahlensieck
Hi

yysccsid was removed in 1.30 back in 2009.

Best,

Martin

Index: skeleton.c
===
RCS file: /cvs/src/usr.bin/yacc/skeleton.c,v
retrieving revision 1.40
diff -u -p -r1.40 skeleton.c
--- skeleton.c  3 Feb 2021 01:10:10 -   1.40
+++ skeleton.c  10 Feb 2022 15:29:08 -
@@ -35,10 +35,6 @@
 
 #include "defs.h"
 
-/*  The definition of yysccsid in the banner should be replaced with   */
-/*  a #pragma ident directive if the target C compiler supports        */
-/*  #pragma ident directives.  */
-/* */
 /*  If the skeleton is changed, the banner should be changed so that   */
 /*  the altered version can be easily distinguished from the original. */
 /* */



Re: Add rtable capability to login.conf

2022-02-05 Thread Matthew Martin
On Sat, Jan 29, 2022 at 06:25:32PM -0600, Matthew Martin wrote:
> On Sat, Jan 29, 2022 at 07:10:00PM -0500, Ted Unangst wrote:
> > I believe it would be better to add setrtable to id pledge.

ping

Also are there any opinions on adding LOGIN_SETRTABLE to doas?


diff --git include/login_cap.h include/login_cap.h
index d9a4c2c349c..1e831b6471a 100644
--- include/login_cap.h
+++ include/login_cap.h
@@ -53,7 +53,8 @@
 #define	LOGIN_SETUMASK	0x0020	/* Set umask */
 #define	LOGIN_SETUSER	0x0040	/* Set user */
 #define	LOGIN_SETENV	0x0080	/* Set environment */
-#define	LOGIN_SETALL	0x00ff	/* Set all. */
+#define	LOGIN_SETRTABLE	0x0100	/* Set rtable */
+#define	LOGIN_SETALL	0x01ff	/* Set all. */
 
 #define	BI_AUTH		"authorize"	/* Accepted authentication */
 #define	BI_REJECT	"reject"	/* Rejected authentication */
diff --git lib/libc/gen/login_cap.c lib/libc/gen/login_cap.c
index 862f33b2065..65848f09ef6 100644
--- lib/libc/gen/login_cap.c
+++ lib/libc/gen/login_cap.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -574,7 +575,7 @@ int
 setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags)
 {
login_cap_t *flc;
-   quad_t p;
+   quad_t p, rtable;
int i;
 
flc = NULL;
@@ -625,6 +626,14 @@ setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags)
umask((mode_t)p);
}
 
+   if (flags & LOGIN_SETRTABLE) {
+   rtable = login_getcapnum(lc, "rtable", 0, 0);
+
+   if (setrtable((int)rtable) == -1) {
+   syslog(LOG_ERR, "%s: setrtable: %m", lc->lc_class);
+   }
+   }
+
if (flags & LOGIN_SETGROUP) {
if (setresgid(pwd->pw_gid, pwd->pw_gid, pwd->pw_gid) == -1) {
syslog(LOG_ERR, "setresgid(%u,%u,%u): %m",
diff --git lib/libc/sys/pledge.2 lib/libc/sys/pledge.2
index 581d274822c..ff3aa59f9e3 100644
--- lib/libc/sys/pledge.2
+++ lib/libc/sys/pledge.2
@@ -514,7 +514,8 @@ process:
 .Xr setlogin 2 ,
 .Xr setrlimit 2 ,
 .Xr getpriority 2 ,
-.Xr setpriority 2
+.Xr setpriority 2 ,
+.Xr setrtable 2
 .It Cm pf
 Allows a subset of
 .Xr ioctl 2
diff --git share/man/man5/login.conf.5 share/man/man5/login.conf.5
index da935fa223e..ffeafc3531c 100644
--- share/man/man5/login.conf.5
+++ share/man/man5/login.conf.5
@@ -276,6 +276,10 @@ Initial priority (nice) level.
 Require home directory to login.
 .\"
 .Pp
+.It rtable Ta number Ta Dv 0 Ta
+Rtable to be set for the class.
+.\"
+.Pp
 .It setenv Ta envlist Ta "" Ta
 A list of environment variables and associated values to be set for the class.
 .\"
diff --git sys/kern/kern_pledge.c sys/kern/kern_pledge.c
index b876b91a18a..26b5e1fab96 100644
--- sys/kern/kern_pledge.c
+++ sys/kern/kern_pledge.c
@@ -300,6 +300,7 @@ const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = {
[SYS_setresgid] = PLEDGE_ID,
[SYS_setgroups] = PLEDGE_ID,
[SYS_setlogin] = PLEDGE_ID,
+   [SYS_setrtable] = PLEDGE_ID,
 
[SYS_unveil] = PLEDGE_UNVEIL,
 



Re: uvm_unmap_kill_entry(): unwire with map lock held

2022-02-04 Thread Martin Pieuchot
On 04/02/22(Fri) 03:39, Klemens Nanni wrote:
> [...] 
> ... with the lock grabbed in uvm_map_teardown() that is, otherwise
> the first call path can lock against itself (regress/misc/posixtestsuite
> is a reproduce for this):
> 
>   vm_map_lock_read_ln+0x38
>   uvm_fault_unwire+0x58
>   uvm_unmap_kill_entry_withlock+0x68
>   uvm_unmap_remove+0x2d4
>   sys_munmap+0x11c
> 
> which is obvious in hindsight.
> 
> So grabbing the lock in uvm_map_teardown() instead avoids that while
> still ensuring a locked map in the path missing a lock.

This should be fine since this function should only be called when the
last reference of a map is dropped.  In other words the locking here is
necessary to satisfy the assertions.

I wonder if the lock shouldn't be taken & released around
uvm_map_teardown(), which would make it easier to see that this is
called after the last refcount decrement.

> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 4 Feb 2022 02:51:00 -
> @@ -2734,6 +2751,7 @@ uvm_map_teardown(struct vm_map *map)
>   KERNEL_ASSERT_UNLOCKED();
>  
>   KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
> + vm_map_lock(map);
>  
>   /* Remove address selectors. */
>   uvm_addr_destroy(map->uaddr_exe);
> 



if_wg: Missing DPRINTF newline

2022-02-02 Thread Matthew Martin
Two DPRINTFs in sys/net/if_wg.c are missing a newline.


diff --git if_wg.c if_wg.c
index 13c48f42c54..a3efd577dbc 100644
--- if_wg.c
+++ if_wg.c
@@ -2156,7 +2156,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa,
}
 
if (m->m_pkthdr.ph_loopcnt++ > M_MAXLOOP) {
-   DPRINTF(sc, "Packet looped");
+   DPRINTF(sc, "Packet looped\n");
ret = ELOOP;
goto error;
}
@@ -2169,7 +2169,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa,
 * another aip_lookup in wg_qstart, or refcnting as mentioned before.
 */
if (m->m_pkthdr.pf.delay > 0) {
-   DPRINTF(sc, "PF Delay Unsupported");
+   DPRINTF(sc, "PF Delay Unsupported\n");
ret = EOPNOTSUPP;
goto error;
}



Re: uvm_unmap_kill_entry(): unwire with map lock held

2022-01-31 Thread Martin Pieuchot
On 31/01/22(Mon) 10:24, Klemens Nanni wrote:
> Running with my uvm assert diff showed that uvm_fault_unwire_locked()
> was called without any locks held.
> 
> This happened when I rebooted my machine:
> 
>   uvm_fault_unwire_locked()
>   uvm_unmap_kill_entry_withlock()
>   uvm_unmap_kill_entry()
>   uvm_map_teardown()
>   uvmspace_free()
> 
> This code does not corrupt anything because
> uvm_unmap_kill_entry_withlock() is grabbing the kernel lock around its
> uvm_fault_unwire_locked() call.
> 
> But regardless of the kernel lock dances in this code path, the uvm map
> ought to be protected by its own lock.  uvm_fault_unwire() does that.
> 
> uvm_fault_unwire_locked()'s comment says the map must at least be read
> locked, which is what all other code paths to that function do.
> 
> This makes my latest assert diff happy in the reboot case (it did not
> always hit that assert).

I'm happy your asserts found a first bug.

I"m afraid calling this function below could result in a grabbing the
lock twice.  Can we be sure this doesn't happen?

uvm_unmap_kill_entry() is called in many different contexts and this is
currently a mess.  I don't know what NetBSD did in this area but it is
worth looking at to see if there isn't a good idea to untangle this.

> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 31 Jan 2022 09:28:04 -
> @@ -2132,7 +2143,7 @@ uvm_unmap_kill_entry_withlock(struct vm_
>   if (VM_MAPENT_ISWIRED(entry)) {
>   KERNEL_LOCK();
>   entry->wired_count = 0;
> - uvm_fault_unwire_locked(map, entry->start, entry->end);
> + uvm_fault_unwire(map, entry->start, entry->end);
>   KERNEL_UNLOCK();
>   }
>  
> 



Re: Add rtable capability to login.conf

2022-01-29 Thread Matthew Martin
On Sat, Jan 29, 2022 at 07:10:00PM -0500, Ted Unangst wrote:
> I believe it would be better to add setrtable to id pledge.

Makes the diff quite a bit shorter too.


diff --git include/login_cap.h include/login_cap.h
index d9a4c2c349c..1e831b6471a 100644
--- include/login_cap.h
+++ include/login_cap.h
@@ -53,7 +53,8 @@
 #defineLOGIN_SETUMASK  0x0020  /* Set umask */
 #defineLOGIN_SETUSER   0x0040  /* Set user */
 #defineLOGIN_SETENV0x0080  /* Set environment */
-#defineLOGIN_SETALL0x00ff  /* Set all. */
+#defineLOGIN_SETRTABLE 0x0100  /* Set rtable */
+#defineLOGIN_SETALL0x01ff  /* Set all. */
 
 #define	BI_AUTH		"authorize"	/* Accepted authentication */
 #define	BI_REJECT	"reject"	/* Rejected authentication */
diff --git lib/libc/gen/login_cap.c lib/libc/gen/login_cap.c
index 862f33b2065..65848f09ef6 100644
--- lib/libc/gen/login_cap.c
+++ lib/libc/gen/login_cap.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -574,7 +575,7 @@ int
 setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags)
 {
login_cap_t *flc;
-   quad_t p;
+   quad_t p, rtable;
int i;
 
flc = NULL;
@@ -625,6 +626,14 @@ setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags)
umask((mode_t)p);
}
 
+   if (flags & LOGIN_SETRTABLE) {
+   rtable = login_getcapnum(lc, "rtable", 0, 0);
+
+   if (setrtable((int)rtable) == -1) {
+   syslog(LOG_ERR, "%s: setrtable: %m", lc->lc_class);
+   }
+   }
+
if (flags & LOGIN_SETGROUP) {
if (setresgid(pwd->pw_gid, pwd->pw_gid, pwd->pw_gid) == -1) {
syslog(LOG_ERR, "setresgid(%u,%u,%u): %m",
diff --git lib/libc/sys/pledge.2 lib/libc/sys/pledge.2
index 581d274822c..ff3aa59f9e3 100644
--- lib/libc/sys/pledge.2
+++ lib/libc/sys/pledge.2
@@ -514,7 +514,8 @@ process:
 .Xr setlogin 2 ,
 .Xr setrlimit 2 ,
 .Xr getpriority 2 ,
-.Xr setpriority 2
+.Xr setpriority 2 ,
+.Xr setrtable 2
 .It Cm pf
 Allows a subset of
 .Xr ioctl 2
diff --git share/man/man5/login.conf.5 share/man/man5/login.conf.5
index da935fa223e..ffeafc3531c 100644
--- share/man/man5/login.conf.5
+++ share/man/man5/login.conf.5
@@ -276,6 +276,10 @@ Initial priority (nice) level.
 Require home directory to login.
 .\"
 .Pp
+.It rtable Ta number Ta Dv 0 Ta
+Rtable to be set for the class.
+.\"
+.Pp
 .It setenv Ta envlist Ta "" Ta
 A list of environment variables and associated values to be set for the class.
 .\"
diff --git sys/kern/kern_pledge.c sys/kern/kern_pledge.c
index 6687bf91f09..ead40b2df6f 100644
--- sys/kern/kern_pledge.c
+++ sys/kern/kern_pledge.c
@@ -300,6 +300,7 @@ const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = {
[SYS_setresgid] = PLEDGE_ID,
[SYS_setgroups] = PLEDGE_ID,
[SYS_setlogin] = PLEDGE_ID,
+   [SYS_setrtable] = PLEDGE_ID,
 
[SYS_unveil] = PLEDGE_UNVEIL,
 

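
With this in place, a class in login.conf(5) could look like (hypothetical
class name):

	# members of the "vpn" login class get routing table 1 at login
	vpn:\
		:rtable=1:\
		:tc=default: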


Add rtable capability to login.conf

2022-01-29 Thread Matthew Martin
It would be nice to have the ability to set a user's rtable upon login.
This would be useful both for road warrior VPN setups (put both the VPN
interface and user in an rdomain other than 0) and to differentiate
users in firewall rules on the gateway or unbound views on a resolver.
The below patch adds an rtable capability to login.conf which sets the
user's default rtable on login. 

Since setusercontext will now call setrtable, that syscall needs to be
covered under some pledge promise. Since the wroute promise already
allows programs to set a different rtable on a per socket basis, I don't
think adding the setrtable syscall materially changes the scope or
attack surface. This impacts dhcpleased, iked, rad, and slaacd which
already use the wroute promise. The wroute promise is added to
calendar, skeyaudit, su, cron, and inetd.

chroot, login, ssh, nsd, and unbound are impacted, but don't show up in
the diff because they don't use pledge. ssh continues to work as
expected: the user's session has the login specified default rtable;
however, the socket connecting the user to the session remains in the
original rtable so the connection doesn't break. nsd and unbound open
their listening sockets prior to privdrop, so those sockets are in the
original rtable (which should be set via rcctl). Notifications and XFRs
for nsd and lookups for unbound happen in the login specified rtable.
While this is somewhat surprising, I think most people would set the
rtable for a daemon via rcctl.

doas is a bit of a special case: it explicitly lists the LOGIN_ flags
rather than using LOGIN_SETALL and removing specific flags. I am
inclined to say doas should also set the default rtable, but it's not
a strong stance. I've left it out of this version of the patch pending
feedback on how others feel.

In addition xenodm would need to add wroute to the pledge in dm.c
StartDisplay and to the first pledge in session.c StartClient.

From a Debian code search the ports which use LOGIN_SETALL appear to be
mail/alpine, security/sudo, and x11/gnome/gdm. None of these ports have
pledge markers in their Makefile.

Finally it may be nicer to drop -a from calendar and skeyaudit. Each
user that desires the service could then create their own cron job.
While it would allow a fair bit of code to be deleted and a tighter
pledge, it would break existing workflows and could be considered as
in a separate thread if desired.

- Matthew Martin



diff --git include/login_cap.h include/login_cap.h
index d9a4c2c349c..1e831b6471a 100644
--- include/login_cap.h
+++ include/login_cap.h
@@ -53,7 +53,8 @@
 #defineLOGIN_SETUMASK  0x0020  /* Set umask */
 #defineLOGIN_SETUSER   0x0040  /* Set user */
 #defineLOGIN_SETENV0x0080  /* Set environment */
-#defineLOGIN_SETALL0x00ff  /* Set all. */
+#defineLOGIN_SETRTABLE 0x0100  /* Set rtable */
+#defineLOGIN_SETALL0x01ff  /* Set all. */
 
 #define	BI_AUTH		"authorize"	/* Accepted authentication */
 #define	BI_REJECT	"reject"	/* Rejected authentication */
diff --git lib/libc/gen/login_cap.c lib/libc/gen/login_cap.c
index 862f33b2065..65848f09ef6 100644
--- lib/libc/gen/login_cap.c
+++ lib/libc/gen/login_cap.c
@@ -52,6 +52,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -574,7 +575,7 @@ int
 setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags)
 {
login_cap_t *flc;
-   quad_t p;
+   quad_t p, rtable;
int i;
 
flc = NULL;
@@ -625,6 +626,14 @@ setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags)
umask((mode_t)p);
}
 
+   if (flags & LOGIN_SETRTABLE) {
+   rtable = login_getcapnum(lc, "rtable", 0, 0);
+
+   if (setrtable((int)rtable) == -1) {
+   syslog(LOG_ERR, "%s: setrtable: %m", lc->lc_class);
+   }
+   }
+
if (flags & LOGIN_SETGROUP) {
if (setresgid(pwd->pw_gid, pwd->pw_gid, pwd->pw_gid) == -1) {
syslog(LOG_ERR, "setresgid(%u,%u,%u): %m",
diff --git share/man/man5/login.conf.5 share/man/man5/login.conf.5
index da935fa223e..ffeafc3531c 100644
--- share/man/man5/login.conf.5
+++ share/man/man5/login.conf.5
@@ -276,6 +276,10 @@ Initial priority (nice) level.
 Require home directory to login.
 .\"
 .Pp
+.It rtable Ta number Ta Dv 0 Ta
+Rtable to be set for the class.
+.\"
+.Pp
 .It setenv Ta envlist Ta "" Ta
 A list of environment variables and associated values to be set for the class.
 .\"
diff --git sys/kern/kern_pledge.c sys/kern/kern_pledge.c
index 6687bf91f09..901d41cefb6 100644
--- sys/kern/kern_pledge.c
+++ sys/kern/kern_pledge.c
@@ -373,6 +373,8 @@ const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = {
[SY

Re: unlock mmap(2) for anonymous mappings

2022-01-24 Thread Martin Pieuchot
On 24/01/22(Mon) 12:06, Klemens Nanni wrote:
> On Sun, Jan 16, 2022 at 09:22:50AM -0300, Martin Pieuchot wrote:
> > IMHO this approach of let's try if it works now and revert if it isn't
> > doesn't help us make progress.  I'd be more confident seeing diffs that
> > assert for the right lock in the functions called by uvm_mapanon() and
> > documentation about which lock is protecting which field of the data
> > structures.
> 
> I picked `vm_map's `size' as first underdocumented member.
> All accesses to it are protected by vm_map_lock(), either through the
> function itself or implicitly by already requiring the calling function
> to lock the map.

Could we use a vm_map_assert_locked() or something similar that reflects
the exclusive or shared (read) lock requirements stated in the comments?
I don't trust comments.  It's too easy to miss a lock in a code path.
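
A minimal sketch of such helpers, assuming the map's rwlock is reachable
as `map->lock' (names illustrative; an intrsafe map has no rwlock to
assert):

	void
	vm_map_assert_wrlock(struct vm_map *map)
	{
		if ((map->flags & VM_MAP_INTRSAFE) == 0)
			rw_assert_wrlock(&map->lock);
	}

	void
	vm_map_assert_anylock(struct vm_map *map)
	{
		if ((map->flags & VM_MAP_INTRSAFE) == 0)
			rw_assert_anylock(&map->lock);
	}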

> So annotate functions using `size' wrt. the required lock.
> 
> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 21 Jan 2022 13:03:05 -
> @@ -805,6 +805,8 @@ uvm_map_sel_limits(vaddr_t *min, vaddr_t
>   * Fills in *start_ptr and *end_ptr to be the first and last entry describing
>   * the space.
>   * If called with prefilled *start_ptr and *end_ptr, they are to be correct.
> + *
> + * map must be at least read-locked.
>   */
>  int
>  uvm_map_isavail(struct vm_map *map, struct uvm_addr_state *uaddr,
> @@ -2206,6 +2208,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
>   * If remove_holes, then remove ET_HOLE entries as well.
>   * If markfree, entry will be properly marked free, otherwise, no replacement
>   * entry will be put in the tree (corrupting the tree).
> + *
> + * map must be locked.
>   */
>  void
>  uvm_unmap_remove(struct vm_map *map, vaddr_t start, vaddr_t end,
> @@ -2976,6 +2980,9 @@ uvm_tree_sanity(struct vm_map *map, char
>   UVM_ASSERT(map, addr == vm_map_max(map), file, line);
>  }
>  
> +/*
> + * map must be at least read-locked.
> + */
>  void
>  uvm_tree_size_chk(struct vm_map *map, char *file, int line)
>  {
> Index: uvm_map.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.h,v
> retrieving revision 1.71
> diff -u -p -r1.71 uvm_map.h
> --- uvm_map.h 15 Dec 2021 12:53:53 -  1.71
> +++ uvm_map.h 21 Jan 2022 12:51:26 -
> @@ -272,7 +272,7 @@ struct vm_map {
>  
>   struct uvm_map_addr addr;   /* [v] Entry tree, by addr */
>  
> - vsize_t size;   /* virtual size */
> + vsize_t size;   /* [v] virtual size */
>   int ref_count;  /* [a] Reference count */
>   int flags;  /* flags */
>   struct mutexflags_lock; /* flags lock */
> 



Re: DDBPROF: move option to amd64,i386 GENERIC

2022-01-18 Thread Martin Pieuchot
On 18/01/22(Tue) 04:38, Klemens Nanni wrote:
> While intended for more architectures, DDBPROF is strictly amd64 and
> i386 only, so the machine-independent sys/conf/GENERIC does not seem
> the right place for it (until all architectures are supported).

This define should die.  There's no need to polish this turd.  Somebody
has to stand up and turn this into a "#if NDT > 0" with the audit that
goes with it.
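
For reference, the suggested idiom (a sketch; "dt.h" is generated by
config(8) and defines NDT):

	#include "dt.h"

	#if NDT > 0
	/* ddb(4)-based profiling hooks, audited to work with dt(4) */
	#endif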

> Index: conf/GENERIC
> ===
> RCS file: /cvs/src/sys/conf/GENERIC,v
> retrieving revision 1.281
> diff -u -p -r1.281 GENERIC
> --- conf/GENERIC  23 Dec 2021 10:04:14 -  1.281
> +++ conf/GENERIC  18 Jan 2022 04:28:29 -
> @@ -4,7 +4,6 @@
>  #GENERIC kernel
>  
>  option   DDB # in-kernel debugger
> -#option  DDBPROF # ddb(4) based profiling
>  #option  DDB_SAFE_CONSOLE # allow break into ddb during boot
>  #makeoptions DEBUG=""# do not compile full symbol table
>  #makeoptions PROF="-pg"  # build profiled kernel
> Index: arch/amd64/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v
> retrieving revision 1.510
> diff -u -p -r1.510 GENERIC
> --- arch/amd64/conf/GENERIC   4 Jan 2022 05:50:43 -   1.510
> +++ arch/amd64/conf/GENERIC   18 Jan 2022 04:28:04 -
> @@ -13,6 +13,8 @@ machine amd64
>  include  "../../../conf/GENERIC"
>  maxusers 80  # estimated number of users
>  
> +#option  DDBPROF # ddb(4) based profiling
> +
>  option   USER_PCICONF# user-space PCI configuration
>  
>  option   APERTURE# in-kernel aperture driver for XFree86
> Index: arch/i386/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/i386/conf/GENERIC,v
> retrieving revision 1.860
> diff -u -p -r1.860 GENERIC
> --- arch/i386/conf/GENERIC2 Jan 2022 23:14:27 -   1.860
> +++ arch/i386/conf/GENERIC18 Jan 2022 04:28:26 -
> @@ -13,6 +13,8 @@ machine i386
>  include  "../../../conf/GENERIC"
>  maxusers 80  # estimated number of users
>  
> +#option  DDBPROF # ddb(4) based profiling
> +
>  option   USER_PCICONF# user-space PCI configuration
>  
>  option   APERTURE# in-kernel aperture driver for XFree86
> 



Re: uvm_swap: introduce uvm_swap_data_lock

2022-01-16 Thread Martin Pieuchot
Nice!

On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> The diff below does two things: it adds a uvm_swap_data_lock mutex and
> trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()

Why is it enough?  Which fields is the lock protecting in these
functions?  Is it `uvmexp.swpages', could that be documented?

What about `nswapdev'?  Why is the rwlock grabbed before reading it in
sys_swapctl()?

What about `swpginuse'?

If the mutex/rwlock are used to protect the global `swap_priority' could
that be also documented?  Once this is documented it should be trivial to
see that some places are missing some locking.  Is it intentional?

> The uvm_swap_data_lock protects all swap data structures, so needs to be
> grabbed a few times, many of them already documented in the comments.
> 
> For review, I suggest comparing to what NetBSD did and also going
> through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> and check that they are properly locked when called, or that there is
> the KERNEL_LOCK() in place when swap data structures are manipulated.

I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
the differences with NetBSD.

> In swapmount() I introduced locking since that's needed to be able to
> assert that the proper locks are held in swaplist_{insert,find,trim}.

Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

> Index: uvm/uvm_swap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
> retrieving revision 1.152
> diff -u -p -r1.152 uvm_swap.c
> --- uvm/uvm_swap.c12 Dec 2021 09:14:59 -  1.152
> +++ uvm/uvm_swap.c30 Dec 2021 15:47:20 -
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -91,6 +92,9 @@
>   *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
>   *system call and prevents the swap priority list from changing
>   *while we are in the middle of a system call (e.g. SWAP_STATS).
> + *  - uvm_swap_data_lock (mutex): this lock protects all swap data
> + *structures including the priority list, the swapdev structures,
> + *and the swapmap arena.
>   *
>   * each swap device has the following info:
>   *  - swap device in use (could be disabled, preventing future use)
> @@ -212,6 +216,7 @@ LIST_HEAD(swap_priority, swappri);
>  struct swap_priority swap_priority;
>  
>  /* locks */
> +struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
>  struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
>  
>  /*
> @@ -442,7 +447,7 @@ uvm_swap_finicrypt_all(void)
>  /*
>   * swaplist_insert: insert swap device "sdp" into the global list
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   * => caller must provide a newly malloc'd swappri structure (we will
>   *   FREE it if we don't need it... this it to prevent malloc blocking
>   *   here while adding swap)
> @@ -452,6 +457,9 @@ swaplist_insert(struct swapdev *sdp, str
>  {
>   struct swappri *spp, *pspp;
>  
> +	rw_assert_wrlock(&swap_syscall_lock);
> +	MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   /*
>* find entry at or after which to insert the new device.
>*/
> @@ -493,7 +501,7 @@ swaplist_insert(struct swapdev *sdp, str
>   * swaplist_find: find and optionally remove a swap device from the
>   *   global list.
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   * => we return the swapdev we found (and removed)
>   */
>  struct swapdev *
> @@ -502,6 +510,9 @@ swaplist_find(struct vnode *vp, boolean_
>   struct swapdev *sdp;
>   struct swappri *spp;
>  
> +	rw_assert_wrlock(&swap_syscall_lock);
> +	MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   /*
>* search the lists for the requested vp
>*/
> @@ -524,13 +535,16 @@ swaplist_find(struct vnode *vp, boolean_
>   * swaplist_trim: scan priority list for empty priority entries and kill
>   *   them.
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   */
>  void
>  swaplist_trim(void)
>  {
>   struct swappri *spp, *nextspp;
>  
> +	rw_assert_wrlock(&swap_syscall_lock);
> +	MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>  	LIST_FOREACH_SAFE(spp, &swap_priority, spi_swappri, nextspp) {
>  		if (!TAILQ_EMPTY(&spp->spi_swapdev))
>   continue;
> @@ -543,7 +557,7 @@ swaplist_trim(void)
>   * swapdrum_add: add a "swapdev"'s blocks into /dev/drum's area.
>   *
>   * => caller must hold swap_syscall_lock
> - * => uvm.swap_data_lock should be unlocked (we may sleep)
> + * => uvm_swap_data_lock should be unlocked 

Re: unlock mmap(2) for anonymous mappings

2022-01-16 Thread Martin Pieuchot
On 14/01/22(Fri) 23:01, Mark Kettenis wrote:
> > Date: Tue, 11 Jan 2022 23:13:20 +
> > From: Klemens Nanni 
> > 
> > On Tue, Jan 11, 2022 at 09:54:44AM -0700, Theo de Raadt wrote:
> > > > Now this is clearly a "slow" path.  I don't think there is any reason
> > > > not to put all the code in that if (uvw_wxabort) block under the
> > > > kernel lock.  For now I think making access to ps_wxcounter atomic is
> > > > just too fine grained.
> > > 
> > > Right.  Lock the whole block.
> > 
> > Thanks everyone, here's the combined diff for that.
> 
> I think mpi@ should be involved in the actual unlocking of mmap(2),
> munmap(2) and mprotect(2).  But the changes to uvm_mmap.c are ok
> kettenis@ and can be committed now.

It isn't clear to me what changed since the last time this has been
tried.  Why is it safe now?  What are the locking assumptions?  

IMHO this approach of let's try if it works now and revert if it isn't
doesn't help us make progress.  I'd be more confident seeing diffs that
assert for the right lock in the functions called by uvm_mapanon() and
documentation about which lock is protecting which field of the data
structures.

NetBSD has done much of this and the code bases do not diverge much so
it can be useful to look there as well.



Re: patch: add a new ktrace facility for replacing some printf-debug

2022-01-07 Thread Martin Pieuchot
On 07/01/22(Fri) 10:54, Sebastien Marie wrote:
> Hi,
> 
> Debugging some code paths is complex: for example, unveil(2) code is
> involved inside VFS, and using DEBUG_UNVEIL means that the kernel is
> spamming printf() for all processes using unveil(2) (a lot of
> processes) instead of just the interested cases I want to follow.
> 
> So I cooked the following diff to add a KTRLOG() facility to be able
> to replace printf()-like debugging with a more process-limited method.

I wish you could debug such issues without having to change any kernel
code; that's why I started btrace(8).

You should already be able to filter probes per thread/process; maybe you
want to replace some debug printfs with static probes.  Alternatively you
could define DDBPROF and use the kprobe provider, which allows you to
inspect the prologue and epilogue of ELF functions.

Maybe btrace(8) is not yet fully functional to debug this particular
problem, but improving it should hopefully give us a tool to debug most
of the kernel issues without having to write a diff and boot a custom
kernel.  At least that's the goal.
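
For example, a bt(5) script along these lines (illustrative; the probe
name and PID are made up) limits the output to a single process:

	syscall:unveil:entry
	/pid == 4242/
	{
		printf("%s (%d) called unveil()\n", comm, pid);
	}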



Re: gprof: Profiling a multi-threaded application

2021-12-10 Thread Martin Pieuchot
On 10/12/21(Fri) 09:56, Yuichiro NAITO wrote:
> Any comments about this topic?

I'm ok with this approach.  I would appreciate it if somebody
else could take it over; I'm too busy with other stuff.

> On 7/12/21 18:09, Yuichiro NAITO wrote:
> > Hi, Martin
> > 
> > On 2021/07/10 16:55, Martin Pieuchot wrote:
> > > Hello Yuichiro, thanks for your work !
> > 
> > Thanks for the response.
> > 
> > > > On 2021/06/16 16:34, Yuichiro NAITO wrote:
> > > > > When I compile a multi-threaded application with '-pg' option, I 
> > > > > always get no
> > > > > results in gmon.out. With bad luck, my application gets SIGSEGV while 
> > > > > running.
> > > > > Because the buffer that holds number of caller/callee count is the 
> > > > > only one
> > > > > in the process and will be broken by multi-threaded access.
> > > > > 
> > > > > I get the idea to solve this problem from NetBSD. NetBSD has 
> > > > > individual buffers
> > > > > for each thread and merges them at the end of profiling.
> > > 
> > > Note that the kernel use a similar approach but doesn't merge the buffer,
> > > instead it generates a file for each CPU.
> > 
> > Yes, so the number of output files is limited by the number of CPUs in
> > the case of kernel profiling.  I think the number of application threads
> > can grow much larger.  I don't want to see dozens of 'gmon.out' files.
> > 
> > > > > NetBSD stores the reference to the individual buffer by 
> > > > > pthread_setspecific(3).
> > > > > I think it causes infinite recursive call if whole libc library 
> > > > > (except
> > > > > mcount.c) is compiled with -pg.
> > > > > 
> > > > > The compiler generates '_mcount' function call at the beginning of 
> > > > > every
> > > > > functions. If '_mcount' calls pthread_getspecific(3) to get the 
> > > > > individual
> > > > > buffer, pthread_getspecific() calls '_mcount' again and causes 
> > > > > infinite
> > > > > recursion.
> > > > > 
> > > > > NetBSD prevents from infinite recursive call by checking a global 
> > > > > variable. But
> > > > > this approach also prevents simultaneously call of '_mcount' on a 
> > > > > multi-threaded
> > > > > application. It makes a little bit wrong results of profiling.
> > > > > 
> > > > > So I added a pointer to the buffer in `struct pthread` that can be 
> > > > > accessible
> > > > > via macro call as same as pthread_self(3). This approach can prevent 
> > > > > of
> > > > > infinite recursive call of '_mcount'.
> > > 
> > > Not calling a libc function for this makes sense.  However I'm not
> > > convinced that accessing `tib_thread' before _rthread_init() has been
> > > called is safe.
> > 
> > Before '_rthread_init' is called, the '__isthreaded' global variable
> > stays 0.  My patch doesn't access tib_thread in this case.
> > After calling `_rthread_init`, `pthread_create()` changes `__isthreaded`
> > to 1.  tib_thread will then be accessed by all newly created threads and
> > the initial one.
> > 
> > I believe tib of the initial thread has been initialized in `_libc_preinit' 
> > function
> > in 'lib/libc/dlfcn/init.c'.
> > 
> > > I'm not sure where is the cleanest way to place the per-thread buffer,
> > > I'd suggest you ask guenther@ about this.
> > 
> > I added guenther@ in CC of this mail.
> > I hope I can get an advise about per-thread buffer.
> > 
> > > Maybe the initialization can be done outside of _mcount()?
> > 
> > Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`.
> > 
> > > > > I obtained merging function from NetBSD that is called in '_mcleanup' 
> > > > > function.
> > > > > Merging function needs to walk through all the individual buffers,
> > > > > I added SLIST_ENTRY member in 'struct gmonparam' to make a list of 
> > > > > the buffers.
> > > > > And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be 
> > > > > used for
> > > > > the kernel.
> > > > > 
> > > > > But I still use pthread_getspecific(3) for that can call destructor 
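
A minimal sketch of the per-thread buffer idea discussed in this thread
(all names are hypothetical; this is not the posted patch):

	/*
	 * _mcount() must not call any instrumented libc function, so fetch
	 * the per-thread gmon buffer straight from the TIB.  tib_gmonparam
	 * is a hypothetical member added for this purpose.
	 */
	void
	_mcount(u_long frompc, u_long selfpc)
	{
		struct gmonparam *p;

		if (__isthreaded)
			p = TIB_GET()->tib_gmonparam;	/* no libc call, no recursion */
		else
			p = &_gmonparam;	/* single-threaded: global buffer */
		if (p == NULL || p->state != GMON_PROF_ON)
			return;
		/* ... record the frompc -> selfpc arc as before ... */
	}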

Re: net write in pcb hash

2021-12-09 Thread Martin Pieuchot
On 08/12/21(Wed) 22:39, Alexander Bluhm wrote:
> On Wed, Dec 08, 2021 at 03:28:34PM -0300, Martin Pieuchot wrote:
> > On 04/12/21(Sat) 01:02, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > As I want a read-only net lock for forwarding, all paths should be
> > > checked for the correct net lock and asserts.  I found code in
> > > in_pcbhashlookup() where reading the PCB table results in a write
> > > to optimize the cache.
> > > 
> > > Properly protecting PCB hashes is out of scope for parallel forwarding.
> > > Can we get away with this hack?  Only update the cache when we are
> > > in the TCP or UDP stack with the write lock.  The access from pf is
> > > read-only.
> > > 
> > > NET_WLOCKED() indicates whether we may change data structures.
> > 
> > I recall that we currently do not want to introduce such idiom: change
> > the behavior of the code depending on which lock is held by the caller.
> > 
> > Can we instead assert that a write-lock is held before modifying the
> > list?
> 
> We could also pass down the kind of lock that is used.  The goal is
> that pf uses the shared net lock.  TCP and UDP will keep the exclusive
> net lock for a while.

Changing the logic of a function based on the type of a lock is not
different from the previous approach.

> Diff gets longer but perhaps a bit clearer what is going on.

I believe we want to split in_pcblookup_listen() into two functions:
one that is read-only and one that modifies the head of the hash chain.
The read-only one asserts for any lock; the one that modifies the hash
calls the former and asserts for a write lock.

Alternatively, we could protect the PCB hash with a mutex.  This has the
advantage of not making the scope of the NET_LOCK() more complex.  In
the end we all know something like that will be done.  I don't know how
other BSD did this but I'm sure this will help getting the remaining
socket layer out of the NET_LOCK().
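
A sketch of that split (simplified signatures; the `_ro' variant and the
chain-reordering step are illustrative, not existing code):

	struct inpcb *
	in_pcblookup_listen_ro(struct inpcbtable *table, struct in_addr laddr,
	    u_int lport, struct mbuf *m, u_int rtable)
	{
		NET_ASSERT_LOCKED();		/* shared or exclusive */
		/* ... hash walk only, no list manipulation ... */
	}

	struct inpcb *
	in_pcblookup_listen(struct inpcbtable *table, struct in_addr laddr,
	    u_int lport, struct mbuf *m, u_int rtable)
	{
		struct inpcb *inp;

		rw_assert_wrlock(&netlock);	/* we may reorder the hash chain */
		inp = in_pcblookup_listen_ro(table, laddr, lport, m, rtable);
		if (inp != NULL) {
			/* move inp to the front of its chain, as today */
		}
		return (inp);
	}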

> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1122
> diff -u -p -r1.1122 pf.c
> --- net/pf.c  7 Jul 2021 18:38:25 -   1.1122
> +++ net/pf.c  8 Dec 2021 21:16:16 -
> @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd)
>   sport = pd->hdr.tcp.th_sport;
>   dport = pd->hdr.tcp.th_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>  		tb = &tcbtable;
>   break;
>   case IPPROTO_UDP:
>   sport = pd->hdr.udp.uh_sport;
>   dport = pd->hdr.udp.uh_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>  		tb = &udbtable;
>   break;
>   default:
> @@ -3348,22 +3346,24 @@ pf_socket_lookup(struct pf_pdesc *pd)
>* Fails when rtable is changed while evaluating the ruleset
>* The socket looked up will not match the one hit in the end.
>*/
> - inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport,
> - pd->rdomain);
> + NET_ASSERT_LOCKED();
> + inp = in_pcbhashlookup_wlocked(tb, saddr->v4, sport, daddr->v4,
> + dport, pd->rdomain, 0);
>   if (inp == NULL) {
> - inp = in_pcblookup_listen(tb, daddr->v4, dport,
> - NULL, pd->rdomain);
> + inp = in_pcblookup_listen_wlocked(tb, daddr->v4, dport,
> + NULL, pd->rdomain, 0);
>   if (inp == NULL)
>   return (-1);
>   }
>   break;
>  #ifdef INET6
>   case AF_INET6:
> -		inp = in_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6,
> -		    dport, pd->rdomain);
> +		NET_ASSERT_LOCKED();
> +		inp = in6_pcbhashlookup_wlocked(tb, &saddr->v6, sport,
> +		    &daddr->v6, dport, pd->rdomain, 0);
> 		if (inp == NULL) {
> -			inp = in6_pcblookup_listen(tb, &daddr->v6, dport,
> -			    NULL, pd->rdomain);
> +			inp = in6_pcblookup_listen_wlocked(tb, &daddr->v6,
> +			    dport, NULL, pd->rdomain, 0);
>   if (inp == NULL)
>   return (-1);
>   }
> Index: netinet/in_pcb.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 in_pcb.c
> 

Re: net write in pcb hash

2021-12-08 Thread Martin Pieuchot
On 04/12/21(Sat) 01:02, Alexander Bluhm wrote:
> Hi,
> 
> As I want a read-only net lock for forwarding, all paths should be
> checked for the correct net lock and asserts.  I found code in
> in_pcbhashlookup() where reading the PCB table results in a write
> to optimize the cache.
> 
> Properly protecting PCB hashes is out of scope for parallel forwarding.
> Can we get away with this hack?  Only update the cache when we are
> in the TCP or UDP stack with the write lock.  The access from pf is
> read-only.
> 
> NET_WLOCKED() indicates whether we may change data structures.

I recall that we currently do not want to introduce such idiom: change
the behavior of the code depending on which lock is held by the caller.

Can we instead assert that a write-lock is held before modifying the
list?

> Also move the assert from pf to in_pcb where the critical section
> is.
> 
> bluhm
> 
> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1122
> diff -u -p -r1.1122 pf.c
> --- net/pf.c  7 Jul 2021 18:38:25 -   1.1122
> +++ net/pf.c  3 Dec 2021 22:20:32 -
> @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd)
>   sport = pd->hdr.tcp.th_sport;
>   dport = pd->hdr.tcp.th_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>  		tb = &tcbtable;
>   break;
>   case IPPROTO_UDP:
>   sport = pd->hdr.udp.uh_sport;
>   dport = pd->hdr.udp.uh_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>  		tb = &udbtable;
>   break;
>   default:
> Index: netinet/in_pcb.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 in_pcb.c
> --- netinet/in_pcb.c  25 Oct 2021 22:20:47 -  1.256
> +++ netinet/in_pcb.c  3 Dec 2021 22:20:32 -
> @@ -1069,6 +1069,8 @@ in_pcbhashlookup(struct inpcbtable *tabl
>   u_int16_t fport = fport_arg, lport = lport_arg;
>   u_int rdomain;
>  
> + NET_ASSERT_LOCKED();
> +
>  	head = in_pcbhash(table, rdomain, &faddr, fport, &laddr, lport);
>   head = in_pcbhash(table, rdomain, , fport, , lport);
>   LIST_FOREACH(inp, head, inp_hash) {
> @@ -1085,7 +1087,7 @@ in_pcbhashlookup(struct inpcbtable *tabl
>* repeated accesses are quicker.  This is analogous to
>* the historic single-entry PCB cache.
>*/
> - if (inp != LIST_FIRST(head)) {
> + if (NET_WLOCKED() && inp != LIST_FIRST(head)) {
>   LIST_REMOVE(inp, inp_hash);
>   LIST_INSERT_HEAD(head, inp, inp_hash);
>   }
> @@ -1119,6 +1121,8 @@ in_pcblookup_listen(struct inpcbtable *t
>   u_int16_t lport = lport_arg;
>   u_int rdomain;
>  
> + NET_ASSERT_LOCKED();
> +
>  	key1 = &laddr;
>  	key2 = &zeroin_addr;
>  #if NPF > 0
> @@ -1185,7 +1189,7 @@ in_pcblookup_listen(struct inpcbtable *t
>* repeated accesses are quicker.  This is analogous to
>* the historic single-entry PCB cache.
>*/
> - if (inp != NULL && inp != LIST_FIRST(head)) {
> + if (NET_WLOCKED() && inp != NULL && inp != LIST_FIRST(head)) {
>   LIST_REMOVE(inp, inp_hash);
>   LIST_INSERT_HEAD(head, inp, inp_hash);
>   }
> Index: sys/systm.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/sys/systm.h,v
> retrieving revision 1.154
> diff -u -p -r1.154 systm.h
> --- sys/systm.h   2 Jun 2021 00:39:25 -   1.154
> +++ sys/systm.h   3 Dec 2021 22:20:32 -
> @@ -344,6 +344,8 @@ extern struct rwlock netlock;
>  #define	NET_RLOCK_IN_IOCTL()	do { rw_enter_read(&netlock); } while (0)
>  #define	NET_RUNLOCK_IN_IOCTL()	do { rw_exit_read(&netlock); } while (0)
>  
> +#define	NET_WLOCKED()	(rw_status(&netlock) == RW_WRITE)
> +
>  #ifdef DIAGNOSTIC
>  
>  #define  NET_ASSERT_UNLOCKED()   
> \
> 



Re: kbind(2): push kernel lock down as needed

2021-12-08 Thread Martin Pieuchot
On 06/12/21(Mon) 14:58, Scott Cheloha wrote:
> On Mon, Dec 06, 2021 at 08:35:15PM +0100, Mark Kettenis wrote:
> > > Date: Sun, 5 Dec 2021 18:01:04 -0600
> > > From: Scott Cheloha 
> > > 
> > > Two things in sys_kbind() need an explicit kernel lock.  First,
> > > sigexit().  Second, uvm_map_extract(), because the following call
> > > chain panics without it:
> > > 
> > > [...]
> > > 
> > > With this committed we can unlock kbind(2).
> > > 
> > > Thoughts?  ok?
> > 
> > To be honest, I don't think this makes sense unless you can make the
> > "normal" code path lock free.  You're replacing a single
> > KERNEL_LOCK/UNLOCK pair with (potentially) a bunch of them.  That may
> > actually make things worse.  So I think we need to make
> > uvm_map_extract() mpsafe first.
> 
> Unlocking uvm_map_extract() would improve things, yes.

Yes, please.  What's missing?



Re: Rework UNIX sockets locking to be fine grained

2021-12-01 Thread Martin Pieuchot
On 24/11/21(Wed) 15:04, Vitaliy Makkoveev wrote:
> [...] 
> Really, this is the simplest way. The shared lock for the pair of
> sockets moves re-lock dances to the connect and disconnect stages which
> should be also protected by locks. And not for the pair. Many SOCK_DGRAM
> sockets could be connected to one socket. This could be done, but this
> is the hell. And there is absolutely NO profit.

It's not clear to me why sharing a lock isn't simpler.  Currently all
unix(4) sockets share a lock and the locking semantic is simpler.

If two sockets are linked couldn't they use the same rwlock?  Did you
consider this approach?  Is it complicated to know which lock to pick?

This is what is done in UVM and that's why the rwlock is allocated outside
of "struct uvm_object" with rw_obj_alloc().  Having a 'rwlock pointer' in
"struct socket" could also help by setting this pointer to  in the
case of UDP/TCP sockets. 

I'm not saying your approach isn't working.  I'm just not convinced it is
the simplest path forward and I wish we could do without refcounting.
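
A sketch of such a helper (hypothetical; taking the two locks in a stable
address order is one way to avoid deadlocks):

	int
	solock_pair(struct socket *so1, struct socket *so2)
	{
		if (so1 == so2)
			return solock(so1);
		if (so1 < so2) {	/* address order as lock order */
			solock(so1);
			solock(so2);
		} else {
			solock(so2);
			solock(so1);
		}
		return SL_LOCKED;
	}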



Re: Please test: UVM fault unlocking (aka vmobjlock)

2021-11-29 Thread Martin Pieuchot
On 24/11/21(Wed) 11:16, Martin Pieuchot wrote:
> Diff below unlock the bottom part of the UVM fault handler.  I'm
> interested in squashing the remaining bugs.  Please test with your usual
> setup & report back.

Thanks to all the testers, here's a new version that includes a bug fix.

Tests on !x86 architectures are much appreciated!

Thanks a lot,
Martin

diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP
index bb842f6d96e..e5334c19eac 100644
--- sys/arch/amd64/conf/GENERIC.MP
+++ sys/arch/amd64/conf/GENERIC.MP
@@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC"
 
 option MULTIPROCESSOR
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP
index 980a572b8fd..ef7ded61501 100644
--- sys/arch/i386/conf/GENERIC.MP
+++ sys/arch/i386/conf/GENERIC.MP
@@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC"
 
 option MULTIPROCESSOR  # Multiple processor support
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
index ce8e2eca141..47b567087e7 100644
--- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
+++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
@@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj)
 #ifdef __linux__
shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 #else
+   rw_enter(obj->base.uao->vmobjlock, RW_WRITE);
obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size,
PGO_ALLPAGES | PGO_FREE);
+   rw_exit(obj->base.uao->vmobjlock);
 #endif
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
diff --git sys/dev/pci/drm/radeon/radeon_ttm.c sys/dev/pci/drm/radeon/radeon_ttm.c
index eb879b5c72c..837a9f94298 100644
--- sys/dev/pci/drm/radeon/radeon_ttm.c
+++ sys/dev/pci/drm/radeon/radeon_ttm.c
@@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t vaddr, vm_page_t *pps,
struct radeon_device *rdev;
int r;
 
+   KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock));
+
bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj;
rdev = bo->dev->dev_private;
	down_read(&rdev->pm.mclk_lock);
diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c
index 20051d95dc1..a5c403ab67d 100644
--- sys/uvm/uvm_aobj.c
+++ sys/uvm/uvm_aobj.c
@@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = {
  * deadlock.
  */
 static LIST_HEAD(aobjlist, uvm_aobj) uao_list = LIST_HEAD_INITIALIZER(uao_list);
-static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE);
+static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
 
 
 /*
@@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx)
  * uao_set_swslot: set the swap slot for a page in an aobj.
  *
  * => setting a slot to zero frees the slot
+ * => object must be locked by caller
  * => we return the old slot number, or -1 if we failed to allocate
  *memory to record the new slot number
  */
@@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int slot)
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
int oldslot;
 
-   KERNEL_ASSERT_LOCKED();
+   KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0);
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
/*
@@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj)
	struct uvm_object *uobj = &aobj->u_obj;
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
+   KASSERT(rw_write_held(uobj->vmobjlock));
uao_dropswap_range(uobj, 0, 0);
+   rw_exit(uobj->vmobjlock);
 
if (UAO_USES_SWHASH(aobj)) {
/*
@@ -671,6 +674,7 @@ struct uvm_object *
 uao_create(vsize_t size, int flags)
 {
static struct uvm_aobj kernel_object_store;
+   static struct rwlock bootstrap_kernel_object_lock;
static int kobj_alloced = 0;
int pages = round_page(size) >> PAGE_SHIFT;
struct uvm_aobj *aobj;
@@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags)
 * Initialise UVM object.
 */
	uvm_obj_init(&aobj->u_obj, &aobj_pager, refs);
+   if (flags & UAO_FLAG_KERNOBJ) {
+   /* Use a temporary static lock for kernel_object. */
+		rw_init(&bootstrap_kernel_object_lock, "kobjlk");
+		uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock);
+   }
 
/*
 * now that aobj is ready, add it to the global list
@@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj)
 * involved in is complete), release any swap resources and free
 * the page itself.
 */
-   uvm_lock_pageq();
-	while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   

Re: Rework UNIX sockets locking to be fine grained

2021-11-24 Thread Martin Pieuchot
On 22/11/21(Mon) 14:42, Vitaliy Makkoveev wrote:
> On Sat, Nov 20, 2021 at 03:12:31AM +0300, Vitaliy Makkoveev wrote:
> > Updated diff. Re-lock dances were simplified in the unix(4) sockets
> > layer.
> > 
> > Reference counters added to unix(4) sockets layer too. This makes 
> > pointer dereference of peer's control block always safe after re-lock.
> > 
> > The `unp_refs' list cleanup is done in unp_detach().  This removes the
> > case where the socket connected to our dying socket could be passed to
> > unp_disconnect(), and the check of its connection state became much
> > easier.
> >
> 
> Another re-lock simplification. We could enforce the lock order between
> the listening socket `head' and the socket `so' linked to it's `so_q0'
> or `so_q' to solock(head) -> solock(so).
> 
> This removes re-lock from accept(2) and the accepting socket couldn't be
> stolen by concurrent accept(2) thread. This removes re-lock from `so_q0'
> and `so_q' cleanup on dying listening socket.
> 
> The previous incarnation of this diff does re-lock in a half of
> doaccept(), soclose(), sofree() and soisconnected() calls. The current
> diff does not re-lock in doaccept() and soclose() and always so re-lock
> in sofree() and soisconnected().
> 
> I guess this is the latest simplification and this diff could be pushed
> forward.

This diff is really interesting.  It shows that the current locking
design needs to be reworked.

I don't think we should expose the locking strategy with a `persocket'
variable then use if/else dances to decide if one of two locks need to
be taken/released.  Instead could we fold the TCP/UDP locking into more
generic functions?  For example connect() could be:

int
soconnect2(struct socket *so1, struct socket *so2)
{
int s, error;

s = solock_pair(so1, so2);
error = (*so1->so_proto->pr_usrreq)(so1, PRU_CONNECT2, NULL,
(struct mbuf *)so2, NULL, curproc);
sounlock_pair(so1, so2, s);
return (error);
}

And solock_pair() would do the right thing(tm) based on the socket type.

Because in the end we want to prepare this layer to use per-socket locks
with TCP/UDP sockets as well.

Could something similar be done for doaccept()?

I'm wary of introducing reference counting.  Once there is reference
counting it tends to be abused.  It's not clear to me why it is added;
it looks like a workaround for lock ordering issues.  Could you talk a
bit about this?  Is there any alternative?

I also don't understand the problem behind:

> + unp_ref(unp2);
> + sounlock(so, SL_LOCKED);
> + solock(so2);
> + solock(so);
> +
> + /* Datagram socket could be reconnected due to re-lock. */
> + if (unp->unp_conn != unp2) {
> + sounlock(so2, SL_LOCKED);
> + unp_rele(unp2);
> + goto again;
> + }
> +
> + unp_rele(unp2);


It seems that doing an unlock/relock dance requires a lot of added
complexity; why is it done this way?

Thanks for dealing with this!

> Index: sys/kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.269
> diff -u -p -r1.269 uipc_socket.c
> --- sys/kern/uipc_socket.c11 Nov 2021 16:35:09 -  1.269
> +++ sys/kern/uipc_socket.c22 Nov 2021 11:36:40 -
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef DDB
>  #include 
> @@ -156,7 +157,9 @@ soalloc(int prflags)
>   so = pool_get(_pool, prflags);
>   if (so == NULL)
>   return (NULL);
> -	rw_init(&so->so_lock, "solock");
> +	rw_init_flags(&so->so_lock, "solock", RWL_DUPOK);
> +	refcnt_init(&so->so_refcnt);
> +
>   return (so);
>  }
>  
> @@ -257,6 +260,8 @@ solisten(struct socket *so, int backlog)
>  void
>  sofree(struct socket *so, int s)
>  {
> + int persocket = solock_persocket(so);
> +
>   soassertlocked(so);
>  
>   if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0) {
> @@ -264,16 +269,53 @@ sofree(struct socket *so, int s)
>   return;
>   }
>   if (so->so_head) {
> + struct socket *head = so->so_head;
> +
>   /*
>* We must not decommission a socket that's on the accept(2)
>* queue.  If we do, then accept(2) may hang after select(2)
>* indicated that the listening socket was ready.
>*/
> - if (!soqremque(so, 0)) {
> +		if (so->so_onq == &head->so_q) {
>   sounlock(so, s);
>   return;
>   }
> +
> + if (persocket) {
> + /*
> +  * Concurrent close of `head' could
> +  * abort `so' due to re-lock.
> +  */
> + soref(so);
> + soref(head);
> +  

Please test: UVM fault unlocking (aka vmobjlock)

2021-11-24 Thread Martin Pieuchot
Diff below unlock the bottom part of the UVM fault handler.  I'm
interested in squashing the remaining bugs.  Please test with your usual
setup & report back.

Thanks,
Martin

diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP
index bb842f6d96e..e5334c19eac 100644
--- sys/arch/amd64/conf/GENERIC.MP
+++ sys/arch/amd64/conf/GENERIC.MP
@@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC"
 
 option MULTIPROCESSOR
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP
index 980a572b8fd..ef7ded61501 100644
--- sys/arch/i386/conf/GENERIC.MP
+++ sys/arch/i386/conf/GENERIC.MP
@@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC"
 
 option MULTIPROCESSOR  # Multiple processor support
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
index ce8e2eca141..47b567087e7 100644
--- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
+++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
@@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj)
 #ifdef __linux__
shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 #else
+   rw_enter(obj->base.uao->vmobjlock, RW_WRITE);
obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size,
PGO_ALLPAGES | PGO_FREE);
+   rw_exit(obj->base.uao->vmobjlock);
 #endif
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
diff --git sys/dev/pci/drm/radeon/radeon_ttm.c sys/dev/pci/drm/radeon/radeon_ttm.c
index eb879b5c72c..837a9f94298 100644
--- sys/dev/pci/drm/radeon/radeon_ttm.c
+++ sys/dev/pci/drm/radeon/radeon_ttm.c
@@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t vaddr, vm_page_t *pps,
struct radeon_device *rdev;
int r;
 
+   KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock));
+
bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj;
rdev = bo->dev->dev_private;
	down_read(&rdev->pm.mclk_lock);
diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c
index 20051d95dc1..127218c4c40 100644
--- sys/uvm/uvm_aobj.c
+++ sys/uvm/uvm_aobj.c
@@ -31,7 +31,7 @@
 /*
  * uvm_aobj.c: anonymous memory uvm_object pager
  *
- * author: Chuck Silvers 
+* author: Chuck Silvers 
  * started: Jan-1998
  *
  * - design mostly from Chuck Cranor
@@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = {
  * deadlock.
  */
 static LIST_HEAD(aobjlist, uvm_aobj) uao_list = LIST_HEAD_INITIALIZER(uao_list);
-static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE);
+static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
 
 
 /*
@@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx)
  * uao_set_swslot: set the swap slot for a page in an aobj.
  *
  * => setting a slot to zero frees the slot
+ * => object must be locked by caller
  * => we return the old slot number, or -1 if we failed to allocate
  *memory to record the new slot number
  */
@@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int slot)
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
int oldslot;
 
-   KERNEL_ASSERT_LOCKED();
+   KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0);
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
/*
@@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj)
	struct uvm_object *uobj = &aobj->u_obj;
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
+   KASSERT(rw_write_held(uobj->vmobjlock));
uao_dropswap_range(uobj, 0, 0);
+   rw_exit(uobj->vmobjlock);
 
if (UAO_USES_SWHASH(aobj)) {
/*
@@ -671,6 +674,7 @@ struct uvm_object *
 uao_create(vsize_t size, int flags)
 {
static struct uvm_aobj kernel_object_store;
+   static struct rwlock bootstrap_kernel_object_lock;
static int kobj_alloced = 0;
int pages = round_page(size) >> PAGE_SHIFT;
struct uvm_aobj *aobj;
@@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags)
 * Initialise UVM object.
 */
	uvm_obj_init(&aobj->u_obj, &aobj_pager, refs);
+   if (flags & UAO_FLAG_KERNOBJ) {
+   /* Use a temporary static lock for kernel_object. */
+		rw_init(&bootstrap_kernel_object_lock, "kobjlk");
+		uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock);
+   }
 
/*
 * now that aobj is ready, add it to the global list
@@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj)
 * involved in is complete), release any swap resources and free
 * the page itself.
 */
-   uvm_lock_pageq();
-	while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   rw_enter(

Re: Retry sleep in poll/select

2021-11-18 Thread Martin Pieuchot
On 17/11/21(Wed) 09:51, Scott Cheloha wrote:
> > On Nov 17, 2021, at 03:22, Martin Pieuchot  wrote:
> > 
> > On 16/11/21(Tue) 13:55, Visa Hankala wrote:
> >> Currently, dopselect() and doppoll() call tsleep_nsec() without retry.
> >> cheloha@ asked if the functions should handle spurious wakeups. I guess
> >> such wakeups are unlikely with the nowake wait channel, but I am not
> >> sure if that is a safe guess.
> > 
> > I'm not sure I understand: are we afraid a thread sleeping on `nowake'
> > can be awakened?  Is that the assumption here?
> 
> Yes, but I don't know how.

Then I'd suggest we start with understanding how this can happen; otherwise
I fear we are adding more complexity for reasons we don't understand.

> kettenis@ said spurious wakeups were
> possible on a similar loop in sigsuspend(2)
> so I mentioned this to visa@ off-list.

I don't understand how this can happen.

> If we added an assert to panic in wakeup(9)
> if the channel is &nowake, would that be
> sufficient?

I guess so.

> Ideally if you sleep on &nowake you should
> never get a zero status from the sleep
> functions.  It should be impossible… if that
> is possible to ensure.
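
Something along these lines could do it (a sketch; the KASSERT would sit
where the wakeup(9) paths converge):

	/* Nothing must ever wake the nowake channel. */
	void
	wakeup_n(const volatile void *ident, int n)
	{
		KASSERT(ident != &nowake);
		/* ... existing wakeup logic ... */
	}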



Re: Retry sleep in poll/select

2021-11-17 Thread Martin Pieuchot
On 16/11/21(Tue) 13:55, Visa Hankala wrote:
> Currently, dopselect() and doppoll() call tsleep_nsec() without retry.
> cheloha@ asked if the functions should handle spurious wakeups. I guess
> such wakeups are unlikely with the nowake wait channel, but I am not
> sure if that is a safe guess.

I'm not sure I understand: are we afraid a thread sleeping on `nowake'
can be awakened?  Is that the assumption here?

> The following diff adds the retrying. The code is a bit arduous, so the
> retry loop is put in a separate function that both poll and select use.

Using a separate function makes sense anyway.

> Index: kern/sys_generic.c
> ===
> RCS file: src/sys/kern/sys_generic.c,v
> retrieving revision 1.141
> diff -u -p -r1.141 sys_generic.c
> --- kern/sys_generic.c16 Nov 2021 13:48:23 -  1.141
> +++ kern/sys_generic.c16 Nov 2021 13:50:08 -
> @@ -90,6 +90,7 @@ int dopselect(struct proc *, int, fd_set
>  int doppoll(struct proc *, struct pollfd *, u_int, struct timespec *,
>  const sigset_t *, register_t *);
>  void doselwakeup(struct selinfo *);
> +int selsleep(struct timespec *);
>  
>  int
>  iovec_copyin(const struct iovec *uiov, struct iovec **iovp, struct iovec *aiov,
> @@ -664,19 +665,7 @@ dopselect(struct proc *p, int nd, fd_set
>* there's nothing to wait for.
>*/
>   if (nevents == 0 && ncollected == 0) {
> - uint64_t nsecs = INFSLP;
> -
> - if (timeout != NULL) {
> - if (!timespecisset(timeout))
> - goto done;
> - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> - }
> -	error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqsel", nsecs);
> - /* select is not restarted after signals... */
> - if (error == ERESTART)
> - error = EINTR;
> - if (error == EWOULDBLOCK)
> - error = 0;
> + error = selsleep(timeout);
>   goto done;
>   }
>  
> @@ -849,6 +838,46 @@ selfalse(dev_t dev, int events, struct p
>  }
>  
>  /*
> + * Sleep until a signal arrives or the optional timeout expires.
> + */
> +int
> +selsleep(struct timespec *timeout)
> +{
> + uint64_t end, now, nsecs;
> + int error;
> +
> + if (timeout != NULL) {
> + if (!timespecisset(timeout))
> + return (0);
> + now = getnsecuptime();
> + end = MIN(now + TIMESPEC_TO_NSEC(timeout), MAXTSLP);
> + if (end < now)
> + end = MAXTSLP;
> + }
> +
> + do {
> + if (timeout != NULL)
> + nsecs = MAX(1, end - now);
> + else
> + nsecs = INFSLP;
> +	error = tsleep_nsec(&nowake, PSOCK | PCATCH, "selslp", nsecs);
> + if (timeout != NULL) {
> + now = getnsecuptime();
> + if (now >= end)
> + break;
> + }
> + } while (error == 0);
> +
> + /* poll/select is not restarted after signals... */
> + if (error == ERESTART)
> + error = EINTR;
> + if (error == EWOULDBLOCK)
> + error = 0;
> +
> + return (error);
> +}
> +
> +/*
>   * Record a select request.
>   */
>  void
> @@ -1158,19 +1187,7 @@ doppoll(struct proc *p, struct pollfd *f
>* there's nothing to wait for.
>*/
>   if (nevents == 0 && ncollected == 0) {
> - uint64_t nsecs = INFSLP;
> -
> - if (timeout != NULL) {
> - if (!timespecisset(timeout))
> - goto done;
> - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> - }
> -
> -	error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqpoll", nsecs);
> - if (error == ERESTART)
> - error = EINTR;
> - if (error == EWOULDBLOCK)
> - error = 0;
> + error = selsleep(timeout);
>   goto done;
>   }
>  
> 



Re: bt.5 document count()

2021-11-16 Thread Martin Pieuchot
On 16/11/21(Tue) 11:07, Claudio Jeker wrote:
> This documents count(). This function only works when used like this
>   @map[key] = count();
> But it is implemented and works. If used differently you get a syntax
> error which is not helpful. This is why I chose to document it like this.
> Another option would be to document the language (so it is clear where it
> is possible to use what). 

ok mpi@

> max(), min() and sum() are other functions that behave like this. Their
> documentation should also be adjusted IMO.
> 
> -- 
> :wq Claudio
> 
> Index: bt.5
> ===
> RCS file: /cvs/src/usr.sbin/btrace/bt.5,v
> retrieving revision 1.13
> diff -u -p -r1.13 bt.5
> --- bt.5  12 Nov 2021 16:57:24 -  1.13
> +++ bt.5  16 Nov 2021 09:50:52 -
> @@ -120,6 +120,11 @@ Functions:
>  .It Fn clear "@map"
>  Delete all (key, value) pairs from
>  .Va @map .
> +.It "@map[key]" = Fn count
> +Increment the value of
> +.Va key
> +from
> +.Va @map .
>  .It Fn delete "@map[key]"
>  Delete the pair indexed by
>  .Va key
> 
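
For the record, typical usage looks like this (illustrative probe):

	tracepoint:sched:enqueue
	{
		@counts[tid] = count();
	}

btrace(8) prints the map when the script exits, yielding a per-thread
enqueue count.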



Re: quotaon(8): small improvements

2021-11-11 Thread Martin Vahlensieck
Friendly ping

On Fri, Oct 29, 2021 at 10:06:44AM +0200, Martin Vahlensieck wrote:
> Hi
> 
> Here are some small changes to quotaon(8).  If you want I can split
> them up, but since they are small I am sending one diff.  Here is
> a list of changes roughly in the order they appear in the diff:
> 
>  - Constify some function arguments
> 
>  - Use __progname instead of separate whoami variable + small KNF
>where the line is touched anyways.
> 
>  - Order letters in getopt(3) call
> 
>  - Convert a fprintf(3) + perror(3) to warn(3). warn(3) is
>already used in this function for similar purposes.
> 
>  - Replace strtok(3) with strsep(3).  Is there a more elegant way than
>    my solution?  (I took the liberty of collecting all the char *
>    variables into one line; see the sketch after this list.)
> 
>  - Set cp to NULL at the end of the while loop scanning the mount
>options.  Otherwise if the last mount option contains a '=' the
>part after it is used as the quota file (of course only if the
>desired userquota/groupquota option isn't found).
> 
>  - Use strncmp(3) instead of memcmp(3). Looking at the kernel code
>it seems that it is zero terminated.  The other code nearby uses
>strcmp(3) already.
> 
>  - Invert an if condition to prevent an empty body.
> 
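
(Sketch of the strtok(3) -> strsep(3) conversion referred to above;
variable names are hypothetical, not the actual patch:)

	char *buf, *opt, *cp = NULL;

	buf = fs->fs_mntops;
	while ((opt = strsep(&buf, ",")) != NULL) {
		if ((cp = strchr(opt, '=')) != NULL)
			*cp++ = '\0';	/* cp now points at the value, if any */
		if (strcmp(opt, "userquota") == 0)
			break;
		cp = NULL;		/* don't reuse a stale '=' value */
	}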
> Happy to hear feedback, also feel free to only commit parts of it.
> 
> Best,
> 
> Martin
> 
> P.S.: If the prototypes are touched anyways, the names of the
> arguments might be removed as well to comply with KNF.  Let me
> know if you want that.
> 
> Index: quotaon.c
> ===
> RCS file: /cvs/src/usr.sbin/quotaon/quotaon.c,v
> retrieving revision 1.27
> diff -u -p -r1.27 quotaon.c
> --- quotaon.c 26 Apr 2018 12:42:51 -  1.27
> +++ quotaon.c 29 Oct 2021 07:46:25 -
> @@ -44,6 +44,8 @@
>  #include 
>  #include 
>  
> +extern char *__progname;
> +
>  char *qfname = QUOTAFILENAME;
>  char *qfextension[] = INITQFNAMES;
>  
> @@ -52,34 +54,31 @@ int   gflag;  /* operate on group quotas *
>  int  uflag;  /* operate on user quotas */
>  int  vflag;  /* verbose */
>  
> -void usage(char *whoami);
> -int  hasquota(struct fstab *fs, int type, char **qfnamep, int force);
> -int  quotaonoff(struct fstab *fs, int offmode, int type, char *qfpathname);
> -int  oneof(char *target, char *list[], int cnt);
> -int  readonly(struct fstab *fs);
> +void usage();
> +int  hasquota(const struct fstab *fs, int type, char **qfnamep, int force);
> +int	quotaonoff(const struct fstab *fs, int offmode, int type, char *qfpathname);
> +int  oneof(const char *target, char *list[], int cnt);
> +int  readonly(const struct fstab *fs);
>  
>  
>  int
>  main(int argc, char *argv[])
>  {
>   struct fstab *fs;
> - char *qfnp, *whoami;
> + char *qfnp;
>   long argnum, done = 0;
>   int i, offmode = 0, errs = 0;
>   extern int optind;
>   int ch;
>  
> - whoami = strrchr(*argv, '/') + 1;
> - if (whoami == (char *)1)
> - whoami = *argv;
> - if (strcmp(whoami, "quotaoff") == 0)
> + if (strcmp(__progname, "quotaoff") == 0)
>   offmode = 1;
> - else if (strcmp(whoami, "quotaon") != 0) {
> + else if (strcmp(__progname, "quotaon") != 0) {
>   fprintf(stderr, "Name must be quotaon or quotaoff not %s\n",
> - whoami);
> + __progname);
>   exit(1);
>   }
> - while ((ch = getopt(argc, argv, "avug")) != -1) {
> + while ((ch = getopt(argc, argv, "aguv")) != -1) {
>   switch (ch) {
>   case 'a':
>   aflag = 1;
> @@ -94,13 +93,13 @@ main(int argc, char *argv[])
>   vflag = 1;
>   break;
>   default:
> - usage(whoami);
> + usage();
>   }
>   }
>   argc -= optind;
>   argv += optind;
>   if (argc <= 0 && !aflag)
> - usage(whoami);
> + usage();
>   if (!gflag && !uflag) {
>   gflag = 1;
>   uflag = 1;
> @@ -142,22 +141,20 @@ main(int argc, char *argv[])
>  }
>  
>  void
> -usage(char *whoami)
> +usage()
>  {
> -
> - fprintf(stderr, "usage: %s [-aguv] filesystem ...\n", whoami);
> + fprintf(stderr, "usage: %s [-aguv] filesystem ...\n", __progname);
>   exit(1);
>  }
>  
>  int
> -quotaonoff(struct fstab *fs, int offmode, int type, char *qfpathname

Re: poll/select: Lazy removal of knotes

2021-11-06 Thread Martin Pieuchot
On 06/11/21(Sat) 15:53, Visa Hankala wrote:
> On Fri, Nov 05, 2021 at 10:04:50AM +0100, Martin Pieuchot wrote:
> > The new poll/select(2) implementation converts 'struct pollfd' and 'fdset' to
> > knotes (kqueue event descriptors) then pass them to the kqueue subsystem.
> > A knote is allocated, with kqueue_register(), for every read, write and
> > except condition watched on a given FD.  That means at most 3 allocations
> > might be necessary per FD.
> > 
> > The diff below reduces the overhead of per-syscall allocation/free of those
> > descriptors by leaving those which didn't trigger on the kqueue across
> > syscall.  Leaving knotes on the kqueue allows kqueue_register() to re-use
> > existing descriptor instead of re-allocating a new one.
> > 
> > With this knotes are now lazily removed.  The mechanism uses a serial
> > number which is incremented for every syscall that indicates if a knote
> > sitting in the kqueue is still valid or should be freed.
> > 
> > Note that performance improvements might not be visible with this diff
> > alone because kqueue_register() still pre-allocate a descriptor then drop
> > it.
> > 
> > visa@ already pointed out that the lazy removal logic could be integrated
> > in kqueue_scan() which would reduce the complexity of those two syscalls.
> > I'm arguing for doing this in a next step in-tree.
> 
> I think it would make more sense to add the removal logic to the scan
> function first, as doing so would keep the code modifications simpler
> and more logical.  It would also avoid the need to go through a
> temporary removal approach.

I totally support your effort and your design; however, I don't have the
time to do another round of testing/debugging.  So please, can you take
care of these cleanups afterward?  If not, please send a full diff and
take over this feature; it's too much effort for me to keep working out
of tree.
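
For readers following along, here is a hedged userland sketch of the
serial-number scheme (made-up names, not the kernel code): every syscall
advances a per-thread serial, each knote keeps the serial of the syscall
that registered it, an older serial marks a leftover, and a wraparound of
the counter forces a full purge, mirroring kqpoll_done() in the diff
below.

	#include <stdio.h>

	static unsigned long kq_serial = ~0UL - 1;	/* start near wraparound */

	/* A knote tagged by an earlier syscall is a leftover. */
	static int
	knote_is_stale(unsigned long kn_serial)
	{
		return (kn_serial < kq_serial);
	}

	/* Mirrors the wraparound handling of kqpoll_done() below. */
	static void
	kqpoll_done_sketch(unsigned int num)
	{
		if (kq_serial + num >= kq_serial)
			kq_serial += num;
		else {
			printf("serial wrapped: purge all knotes\n");
			kq_serial = 1;
		}
	}

	int
	main(void)
	{
		printf("stale? %d\n", knote_is_stale(kq_serial - 1));	/* 1 */
		kqpoll_done_sketch(8);	/* overflows: purge, restart at 1 */
		printf("serial now %lu\n", kq_serial);
		return 0;
	}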

> Index: kern/kern_event.c
> ===
> RCS file: src/sys/kern/kern_event.c,v
> retrieving revision 1.170
> diff -u -p -r1.170 kern_event.c
> --- kern/kern_event.c 6 Nov 2021 05:48:47 -   1.170
> +++ kern/kern_event.c 6 Nov 2021 15:31:04 -
> @@ -73,6 +73,7 @@ void kqueue_terminate(struct proc *p, st
>  void KQREF(struct kqueue *);
>  void KQRELE(struct kqueue *);
>  
> +void kqueue_purge(struct proc *, struct kqueue *);
>  int  kqueue_sleep(struct kqueue *, struct timespec *);
>  
>  int  kqueue_read(struct file *, struct uio *, int);
> @@ -806,6 +807,22 @@ kqpoll_exit(void)
>  }
>  
>  void
> +kqpoll_done(unsigned int num)
> +{
> + struct proc *p = curproc;
> +
> + KASSERT(p->p_kq != NULL);
> +
> + if (p->p_kq_serial + num >= p->p_kq_serial) {
> + p->p_kq_serial += num;
> + } else {
> + /* Clear all knotes after serial wraparound. */
> + kqueue_purge(p, p->p_kq);
> + p->p_kq_serial = 1;
> + }
> +}
> +
> +void
>  kqpoll_dequeue(struct proc *p, int all)
>  {
>   struct knote marker;
> @@ -1383,6 +1400,15 @@ retry:
>  
>   mtx_leave(&kq->kq_lock);
>  
> + /* Drop spurious events. */
> + if (p->p_kq == kq &&
> + p->p_kq_serial > (unsigned long)kn->kn_udata) {
> + filter_detach(kn);
> + knote_drop(kn, p);
> + mtx_enter(&kq->kq_lock);
> + continue;
> + }
> +
>   memset(kevp, 0, sizeof(*kevp));
>   if (filter_process(kn, kevp) == 0) {
>   mtx_enter(&kq->kq_lock);
> Index: kern/sys_generic.c
> ===
> RCS file: src/sys/kern/sys_generic.c,v
> retrieving revision 1.139
> diff -u -p -r1.139 sys_generic.c
> --- kern/sys_generic.c29 Oct 2021 15:52:44 -  1.139
> +++ kern/sys_generic.c6 Nov 2021 15:31:04 -
> @@ -730,8 +730,7 @@ done:
>   if (pibits[0] != (fd_set *)&bits[0])
>   free(pibits[0], M_TEMP, 6 * ni);
>  
> - kqueue_purge(p, p->p_kq);
> - p->p_kq_serial += nd;
> + kqpoll_done(nd);
>  
>   return (error);
>  }
> @@ -1230,8 +1229,7 @@ bad:
>   if (pl != pfds)
>   free(pl, M_TEMP, sz);
>  
> - kqueue_purge(p, p->p_kq);
> - p->p_kq_serial += nfds;
> + kqpoll_done(nfds);
>  
>   return (error);
>  }
> @@ -1251,8 +1249,7 @@ ppollcollect(struct proc *p, struct keve
>   /*
>* Lazily delete spurious events.
>*
> -   

poll/select: Lazy removal of knotes

2021-11-05 Thread Martin Pieuchot
The new poll/select(2) implementation converts 'struct pollfd' and 'fdset'
to knotes (kqueue event descriptors) and then passes them to the kqueue
subsystem.  A knote is allocated, with kqueue_register(), for every read,
write and except condition watched on a given FD.  That means at most 3
allocations might be necessary per FD.

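As an aside, a hedged sketch of that fan-out (hypothetical helper, not the
actual pselregister()/ppollregister() code): one watched condition maps to
one knote filter, so a single pollfd yields up to three registrations.

	#include <sys/types.h>
	#include <sys/event.h>
	#include <poll.h>

	static int
	filters_for(const struct pollfd *pfd, int filters[3])
	{
		int n = 0;

		if (pfd->events & (POLLIN | POLLRDNORM))
			filters[n++] = EVFILT_READ;
		if (pfd->events & (POLLOUT | POLLWRNORM))
			filters[n++] = EVFILT_WRITE;
		if (pfd->events & POLLPRI)
			filters[n++] = EVFILT_EXCEPT;
		return n;	/* at most 3 knotes per FD */
	}
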
The diff below reduces the overhead of the per-syscall allocation/free of
those descriptors by leaving those which didn't trigger on the kqueue
across syscalls.  Leaving knotes on the kqueue allows kqueue_register() to
re-use an existing descriptor instead of allocating a new one.

With this, knotes are now lazily removed.  The mechanism uses a serial
number, incremented on every syscall, that indicates whether a knote
sitting in the kqueue is still valid or should be freed.

Note that performance improvements might not be visible with this diff
alone because kqueue_register() still pre-allocates a descriptor and then
drops it.

visa@ already pointed out that the lazy removal logic could be integrated
into kqueue_scan(), which would reduce the complexity of those two
syscalls.  I'm arguing for doing this as a next step in-tree.

Please test and review :)

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.139
diff -u -p -r1.139 sys_generic.c
--- kern/sys_generic.c  29 Oct 2021 15:52:44 -  1.139
+++ kern/sys_generic.c  5 Nov 2021 08:11:05 -
@@ -598,7 +598,7 @@ sys_pselect(struct proc *p, void *v, reg
 
 int
 dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex,
-struct timespec *timeout, const sigset_t *sigmask, register_t *retval)
+struct timespec *tsp, const sigset_t *sigmask, register_t *retval)
 {
struct kqueue_scan_state scan;
fd_mask bits[6];
@@ -666,10 +666,10 @@ dopselect(struct proc *p, int nd, fd_set
if (nevents == 0 && ncollected == 0) {
uint64_t nsecs = INFSLP;
 
-   if (timeout != NULL) {
-   if (!timespecisset(timeout))
+   if (tsp != NULL) {
+   if (!timespecisset(tsp))
goto done;
-   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
+   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(tsp), MAXTSLP));
}
error = tsleep_nsec(&p->p_kq, PSOCK | PCATCH, "kqsel", nsecs);
/* select is not restarted after signals... */
@@ -682,28 +682,37 @@ dopselect(struct proc *p, int nd, fd_set
 
/* Collect at most `nevents' possibly waiting in kqueue_scan() */
kqueue_scan_setup(&scan, p->p_kq);
-   while (nevents > 0) {
+   while ((nevents - ncollected) > 0) {
struct kevent kev[KQ_NEVENTS];
int i, ready, count;
 
-   /* Maximum number of events per iteration */
-   count = MIN(nitems(kev), nevents);
-   ready = kqueue_scan(&scan, count, kev, timeout, p, &ncollected);
+   /*
+* Maximum number of events per iteration.  Use the whole
+* array to gather as many spurious events as possible.
+*/
+   count = nitems(kev);
+   ready = kqueue_scan(&scan, count, kev, tsp, p, &ncollected);
 #ifdef KTRACE
if (KTRPOINT(p, KTR_STRUCT))
ktrevent(p, kev, ready);
 #endif
-   /* Convert back events that are ready. */
+   /* Convert back events that are ready/delete spurious ones. */
for (i = 0; i < ready && error == 0; i++)
error = pselcollect(p, &kev[i], pobits, &ncollected);
+
/*
-* Stop if there was an error or if we had enough
-* space to collect all events that were ready.
+* Stop if there was an error or if we had enough space
+* to collect all non-spurious events that were ready.
 */
-   if (error || ready < count)
+   if (error || !ready || (ncollected > 0 && ready < count))
break;
 
-   nevents -= ready;
+   /*
+* If we only got spurious events try again repositioning
+* the marker.
+*/
+   if (ncollected == 0 && ((tsp == NULL) || timespecisset(tsp)))
+   scan.kqs_nevent = 0;
}
kqueue_scan_finish();
*retval = ncollected;
@@ -730,7 +739,7 @@ done:
if (pibits[0] != (fd_set *)&bits[0])
free(pibits[0], M_TEMP, 6 * ni);
 
-   kqueue_purge(p, p->p_kq);
+   /* Needed to remove events lazily. */
p->p_kq_serial += nd;
 
return (error);
@@ -759,7 +768,7 @@ pselregister(struct proc *p, fd_set *pib
DPRINTFN(2, "select fd %d mask %d serial %lu\n",
fd, msk, p->p_kq_serial);

Re: UNIX sockets: use vnode(9) lock to protect `v_socket' dereference

2021-11-05 Thread Martin Pieuchot
On 26/10/21(Tue) 14:12, Vitaliy Makkoveev wrote:
> Another step to make UNIX sockets locking fine grained.
> 
> The listening socket holds references from the file descriptor layer and
> from the vnode(9) layer. This means that when we close(2) such a socket
> it may still be referenced by a concurrent thread through the connect(2)
> path.
> 
> When we bind(2) a UNIX socket we link it to a vnode(9) by assigning
> `v_socket'. When we connect(2) a socket to a socket we previously
> bind(2)'ed, we find it via namei(9) and obtain a reference to it through
> `v_socket'. This socket has no extra reference at the file descriptor
> layer and could be closed by a concurrent thread.
> 
> Today the `unp_lock' rwlock(9) protects the whole layer, so the
> dereference of `v_socket' is safe. But with fine-grained locking,
> `v_socket' will no longer be protected by a global lock. When we obtain
> the vnode(9) via namei(9) in the connect(2) or bind(2) paths it is
> already exclusively locked by the vnode(9) lock. But in unp_detach(),
> which is called on the socket being close(2)'d, we assume `unp_lock'
> protects `v_socket'.
> 
> I propose to use the exclusive vnode(9) lock to protect `v_socket'. With
> fine-grained locking, the `v_socket' dereference in the unp_bind() or
> unp_connect() threads will be safe because the unp_detach() thread will
> wait for the vnode(9) lock to be released. The vnode referenced by
> `unp_vnode' has its reference counter bumped, so dereferencing it is
> also safe without `unp_lock' held.

This makes sense to me.  Using the vnode lock here seems the simplest
approach.

> The `i_lock' should be taken before `unp_lock', so unp_detach() should
> release solock(). To prevent connections on this socket, the
> `SO_ACCEPTCONN' bit is cleared in soclose().

This is done to prevent races when solock() is released inside soabort(),
right?  Is it the only such race, or is more care needed?
Will this stay with per-socket locks, or is it only necessary because of
the global `unp_lock'?

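To spell out the pattern being proposed, a hedged userland sketch (a
pthread rwlock stands in for the vnode(9) lock; all names are made up):
teardown clears `v_socket' under the same exclusive lock the lookup side
takes, so a reader sees either a valid socket or NULL, never a freed one.

	#include <pthread.h>
	#include <stddef.h>

	struct vnode_sketch {
		pthread_rwlock_t v_lock;	/* stands in for VOP_LOCK() */
		void *v_socket;			/* the protected field */
	};

	/* connect(2) side: dereference only with the lock held. */
	static void *
	lookup_socket(struct vnode_sketch *vp)
	{
		void *so;

		pthread_rwlock_wrlock(&vp->v_lock);	/* LK_EXCLUSIVE */
		so = vp->v_socket;
		pthread_rwlock_unlock(&vp->v_lock);
		return so;
	}

	/* close(2) side: teardown clears the field under the same lock. */
	static void
	detach_socket(struct vnode_sketch *vp)
	{
		pthread_rwlock_wrlock(&vp->v_lock);
		vp->v_socket = NULL;
		pthread_rwlock_unlock(&vp->v_lock);
	}
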
> Index: sys/kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 uipc_socket.c
> --- sys/kern/uipc_socket.c14 Oct 2021 23:05:10 -  1.265
> +++ sys/kern/uipc_socket.c26 Oct 2021 11:05:59 -
> @@ -315,6 +315,8 @@ soclose(struct socket *so, int flags)
>   /* Revoke async IO early. There is a final revocation in sofree(). */
>   sigio_free(&so->so_sigio);
>   if (so->so_options & SO_ACCEPTCONN) {
> + so->so_options &= ~SO_ACCEPTCONN;
> +
>   while ((so2 = TAILQ_FIRST(&so->so_q0)) != NULL) {
>   (void) soqremque(so2, 0);
>   (void) soabort(so2);
> Index: sys/kern/uipc_usrreq.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uipc_usrreq.c
> --- sys/kern/uipc_usrreq.c21 Oct 2021 22:11:07 -  1.150
> +++ sys/kern/uipc_usrreq.c26 Oct 2021 11:05:59 -
> @@ -474,20 +474,30 @@ void
>  unp_detach(struct unpcb *unp)
>  {
>   struct socket *so = unp->unp_socket;
> - struct vnode *vp = NULL;
> + struct vnode *vp = unp->unp_vnode;
>  
>   rw_assert_wrlock(&unp_lock);
>  
>   LIST_REMOVE(unp, unp_link);
> - if (unp->unp_vnode) {
> +
> + if (vp) {
> + unp->unp_vnode = NULL;
> +
>   /*
> -  * `v_socket' is only read in unp_connect and
> -  * unplock prevents concurrent access.
> +  * Enforce `i_lock' -> `unp_lock' because fifo
> +  * subsystem requires it.
>*/
>  
> - unp->unp_vnode->v_socket = NULL;
> - vp = unp->unp_vnode;
> - unp->unp_vnode = NULL;
> + sounlock(so, SL_LOCKED);
> +
> + VOP_LOCK(vp, LK_EXCLUSIVE);
> + vp->v_socket = NULL;
> +
> + KERNEL_LOCK();
> + vput(vp);
> + KERNEL_UNLOCK();
> +
> + solock(so);
>   }
>  
>   if (unp->unp_conn)
> @@ -500,21 +510,6 @@ unp_detach(struct unpcb *unp)
>   pool_put(&unpcb_pool, unp);
>   if (unp_rights)
>   task_add(systqmp, &unp_gc_task);
> -
> - if (vp != NULL) {
> - /*
> -  * Enforce `i_lock' -> `unplock' because fifo subsystem
> -  * requires it. The socket can't be closed concurrently
> -  * because the file descriptor reference is
> -  * still hold.
> -  */
> -
> - sounlock(so, SL_LOCKED);
> - KERNEL_LOCK();
> - vrele(vp);
> - KERNEL_UNLOCK();
> - solock(so);
> - }
>  }
>  
>  int
> 



Re: UNIX sockets: make `unp_rights', `unp_msgcount' and `unp_file' atomic

2021-11-05 Thread Martin Pieuchot
On 30/10/21(Sat) 21:22, Vitaliy Makkoveev wrote:
> This completely removes the global rwlock(9) from the unp_internalize()
> and unp_externalize() normal paths, leaving it only in the
> unp_externalize() error path. Also, we no longer need to hold both
> fdplock() and `unp_lock' simultaneously in unp_internalize(). As a
> non-obvious benefit, this simplifies future lock dances in the UNIX
> sockets layer.
> 
> It's safe to call fptounp() without `unp_lock' held. We always obtained
> this file descriptor via fd_getfile(9), so we always hold an extra
> reference and the descriptor can't be closed by a concurrent thread.
> Some sockets could be destroyed through the 'PRU_ABORT' path, but they
> have no associated file descriptor and are not reachable from the
> unp_internalize() path.
> 
> Accessing `unp_file' without `unp_lock' held is also safe. Each socket
> can have only one associated file descriptor, and each file descriptor
> only one associated socket. We only assign `unp_file' in the
> unp_internalize() path, where we got the socket via fd_getfile(9). This
> descriptor holds an extra reference and can't be closed concurrently.
> We could overwrite `unp_file', but only with the same address, because
> the associated file descriptor can't change, so the stored address stays
> the same. Hence, while unp_gc() runs concurrently, dereferencing a
> non-NULL `unp_file' is always safe.

Using an atomic operation for `unp_msgcount' is ok with me; one comment
about `unp_rights' below.
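
As a rough illustration of why that part is enough (userland C11 atomics
standing in for atomic_inc_long()/atomic_dec_long(); names are made up):
a counter like `unp_msgcount' that no longer guards any other state only
needs atomic increment and decrement.

	#include <stdatomic.h>

	static atomic_long msgcount;	/* stands in for unp_msgcount */

	static void
	msg_internalize(void)
	{
		atomic_fetch_add(&msgcount, 1);	/* fp sent over a socket */
	}

	static void
	msg_externalize(void)
	{
		atomic_fetch_sub(&msgcount, 1);	/* fp received or discarded */
	}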

> Index: sys/kern/uipc_usrreq.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
> retrieving revision 1.153
> diff -u -p -r1.153 uipc_usrreq.c
> --- sys/kern/uipc_usrreq.c30 Oct 2021 16:35:31 -  1.153
> +++ sys/kern/uipc_usrreq.c30 Oct 2021 18:41:25 -
> @@ -58,6 +58,7 @@
>   * Locks used to protect global data and struct members:
>   *  I   immutable after creation
>   *  U   unp_lock
> + *  a   atomic
>   */
>  struct rwlock unp_lock = RWLOCK_INITIALIZER("unplock");
>  
> @@ -99,7 +100,7 @@ SLIST_HEAD(,unp_deferral)  unp_deferred =
>   SLIST_HEAD_INITIALIZER(unp_deferred);
>  
>  ino_t unp_ino; /* [U] prototype for fake inode numbers */
> -int  unp_rights; /* [U] file descriptors in flight */
> +int  unp_rights; /* [a] file descriptors in flight */
>  int  unp_defer;  /* [U] number of deferred fp to close by the GC task */
>  int  unp_gcing;  /* [U] GC task currently running */
>  
> @@ -927,17 +928,16 @@ restart:
>*/
>   rp = (struct fdpass *)CMSG_DATA(cm);
>  
> - rw_enter_write(&unp_lock);
>   for (i = 0; i < nfds; i++) {
>   struct unpcb *unp;
>  
>   fp = rp->fp;
>   rp++;
>   if ((unp = fptounp(fp)) != NULL)
> - unp->unp_msgcount--;
> - unp_rights--;
> + atomic_dec_long(&unp->unp_msgcount);
>   }
> - rw_exit_write(&unp_lock);
> +
> + atomic_sub_int(&unp_rights, nfds);
>  
>   /*
>* Copy temporary array to message and adjust length, in case of
> @@ -985,13 +985,10 @@ unp_internalize(struct mbuf *control, st
>   return (EINVAL);
>   nfds = (cm->cmsg_len - CMSG_ALIGN(sizeof(*cm))) / sizeof (int);
>  
> - rw_enter_write(&unp_lock);
> - if (unp_rights + nfds > maxfiles / 10) {
> - rw_exit_write(&unp_lock);
> + if (atomic_add_int_nv(&unp_rights, nfds) > maxfiles / 10) {
> + atomic_sub_int(&unp_rights, nfds);

I don't believe this is race free.  If two threads, T1 and T2, call
atomic_add at the same time, both might end up returning EMFILE even
though only the first one actually pushed `unp_rights' over the limit.
This can happen if T1 exceeds the limit and T2 does its atomic_add on
the already-exceeded `unp_rights' before T1 has done its atomic_sub.

I suggest using a mutex to protect `unp_rights' instead to solve this
issue.
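
A hedged sketch of that suggestion (a pthread mutex stands in for
mutex(9); names are made up): because the limit check and the increment
happen under one lock, a thread that fits under the limit can never be
failed by another thread's transient overshoot.

	#include <errno.h>
	#include <pthread.h>

	static pthread_mutex_t rights_mtx = PTHREAD_MUTEX_INITIALIZER;
	static int rights_inflight;		/* stands in for unp_rights */
	static const int rights_limit = 128;	/* stands in for maxfiles / 10 */

	static int
	rights_reserve(int nfds)
	{
		pthread_mutex_lock(&rights_mtx);
		if (rights_inflight + nfds > rights_limit) {
			pthread_mutex_unlock(&rights_mtx);
			return EMFILE;	/* fails only when the limit really is hit */
		}
		rights_inflight += nfds;
		pthread_mutex_unlock(&rights_mtx);
		return 0;
	}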

>   return (EMFILE);
>   }
> - unp_rights += nfds;
> - rw_exit_write(&unp_lock);
>  
>   /* Make sure we have room for the struct file pointers */
>  morespace:
> @@ -1031,7 +1028,6 @@ morespace:
>   ip = ((int *)CMSG_DATA(cm)) + nfds - 1;
>   rp = ((struct fdpass *)CMSG_DATA(cm)) + nfds - 1;
>   fdplock(fdp);
> - rw_enter_write(&unp_lock);
>   for (i = 0; i < nfds; i++) {
>   memcpy(&fd, ip, sizeof fd);
>   ip--;
> @@ -1056,15 +1052,13 @@ morespace:
>   rp->flags = fdp->fd_ofileflags[fd] & UF_PLEDGED;
>   rp--;
>   if ((unp = fptounp(fp)) != NULL) {
> + atomic_inc_long(&unp->unp_msgcount);
>   unp->unp_file = fp;
> - unp->unp_msgcount++;
>   }
>   }
> - rw_exit_write(&unp_lock);
>   fdpunlock(fdp);
>   return (0);
>  fail:
> - rw_exit_write(&unp_lock);
>   fdpunlock(fdp);
>   if (fp != NULL)
>   FRELE(fp, p);
> @@ -1072,17 +1066,13 @@ fail:
>   for ( ; i > 
