Re: uvm_swap: introduce uvm_swap_data_lock
On 16/01/22(Sun) 15:35, Martin Pieuchot wrote: > On 30/12/21(Thu) 23:38, Theo Buehler wrote: > > The diff below does two things: it adds a uvm_swap_data_lock mutex and > > trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad() > > Why is it enough? Which fields is the lock protecting in these > function? Is it `uvmexp.swpages', could that be documented? It is documented in the diff below. > > What about `nswapdev'? Why is the rwlock grabbed before reading it in > sys_swapctl()?i Because it is always modified with the lock, I added some documentation. > What about `swpginuse'? This is still under KERNEL_LOCK(), documented below. > If the mutex/rwlock are used to protect the global `swap_priority' could > that be also documented? Once this is documented it should be trivial to > see that some places are missing some locking. Is it intentional? > > > The uvm_swap_data_lock protects all swap data structures, so needs to be > > grabbed a few times, many of them already documented in the comments. > > > > For review, I suggest comparing to what NetBSD did and also going > > through the consumers (swaplist_insert, swaplist_find, swaplist_trim) > > and check that they are properly locked when called, or that there is > > the KERNEL_LOCK() in place when swap data structures are manipulated. > > I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce > the differences with NetBSD. Done. > > In swapmount() I introduced locking since that's needed to be able to > > assert that the proper locks are held in swaplist_{insert,find,trim}. > > Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down? > What about `uvmexp.nswget' and `uvmexp.swpgonly' in there? This has been done as part of another change. This diff uses an atomic operation to increase `nswget' in case multiple threads fault on a page in swap at the same time. Updated diff below, ok? Index: uvm/uvm_swap.c === RCS file: /cvs/src/sys/uvm/uvm_swap.c,v retrieving revision 1.163 diff -u -p -r1.163 uvm_swap.c --- uvm/uvm_swap.c 6 Aug 2022 13:44:04 - 1.163 +++ uvm/uvm_swap.c 17 Aug 2022 11:46:20 - @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -84,13 +85,16 @@ * the system maintains a global data structure describing all swap * partitions/files. there is a sorted LIST of "swappri" structures * which describe "swapdev"'s at that priority. this LIST is headed - * by the "swap_priority" global var.each "swappri" contains a + * by the "swap_priority" global var.each "swappri" contains a * TAILQ of "swapdev" structures at that priority. * * locking: * - swap_syscall_lock (sleep lock): this lock serializes the swapctl *system call and prevents the swap priority list from changing *while we are in the middle of a system call (e.g. SWAP_STATS). + * - uvm_swap_data_lock (mutex): this lock protects all swap data + *structures including the priority list, the swapdev structures, + *and the swapmap arena. * * each swap device has the following info: * - swap device in use (could be disabled, preventing future use) @@ -106,7 +110,7 @@ * userland controls and configures swap with the swapctl(2) system call. * the sys_swapctl performs the following operations: * [1] SWAP_NSWAP: returns the number of swap devices currently configured - * [2] SWAP_STATS: given a pointer to an array of swapent structures + * [2] SWAP_STATS: given a pointer to an array of swapent structures * (passed in via "arg") of a size passed in via "misc" ... we load * the current swap config into the array. 
* [3] SWAP_ON: given a pathname in arg (could be device or file) and a @@ -208,9 +212,10 @@ struct extent *swapmap;/* controls the /* list of all active swap devices [by priority] */ LIST_HEAD(swap_priority, swappri); -struct swap_priority swap_priority; +struct swap_priority swap_priority;/* [S] */ /* locks */ +struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE); struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk"); struct mutex oommtx = MUTEX_INITIALIZER(IPL_VM); @@ -224,7 +229,7 @@ void swapdrum_add(struct swapdev *, in struct swapdev *swapdrum_getsdp(int); struct swapdev *swaplist_find(struct vnode *, int); -voidswaplist_insert(struct swapdev *, +voidswaplist_insert(struct swapdev *, struct swappri *, int); voidswaplist_trim(void); @@ -472,16 +477,19 @@ uvm_swap_finicrypt_all(void) /* * swaplist_insert: insert swap device "sdp" into the
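The patch above is cut off in this digest. For orientation, the consumer pattern it enables looks roughly like the following sketch (illustrative, not the committed code): `uvmexp.swpages' and `uvmexp.swpgonly' are read under the new mutex instead of the KERNEL_LOCK().

int
uvm_swapisfull(void)
{
	int result;

	mtx_enter(&uvm_swap_data_lock);
	KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
	result = (uvmexp.swpgonly == uvmexp.swpages);
	mtx_leave(&uvm_swap_data_lock);

	return result;
}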
Re: patch: change swblk_t type and use it in blist
On 05/08/22(Fri) 18:10, Sebastien Marie wrote: > Hi, > > When initially ported blist from DragonFlyBSD, we used custom type bsblk_t > and > bsbmp_t instead of the one used by DragonFlyBSD (swblk_t and u_swblk_t). > > The reason was swblk_t is already defined on OpenBSD, and was incompatible > with > blist (int32_t). It is defined, but not used (outside some regress file which > seems to be not affected by type change). > > This diff changes the __swblk_t definition in sys/_types.h to be 'unsigned > long', and switch back blist to use swblk_t (and u_swblk_t, even if it isn't > 'unsigned swblk_t'). > > It makes the diff with DragonFlyBSD more thin. I added a comment with the git > id > used for the initial port. > > I tested it on i386 and amd64 (kernel and userland). > > By changing bitmap type from 'u_long' to 'u_swblk_t' ('u_int64_t'), it makes > the > regress the same on 64 and 32bits archs (and it success on both). > > Comments or OK ? Makes sense to me. I'm not a standard/type lawyer so I don't know if this is fine for userland. So I'm ok with it. > diff /home/semarie/repos/openbsd/src > commit - 73f52ef7130cefbe5a8fe028eedaad0e54be7303 > path + /home/semarie/repos/openbsd/src > blob - e05867429cdd81c434f9ca589c1fb8c6d25957f8 > file + sys/sys/_types.h > --- sys/sys/_types.h > +++ sys/sys/_types.h > @@ -60,7 +60,7 @@ typedef __uint8_t __sa_family_t; /* sockaddr > address f > typedef __int32_t __segsz_t; /* segment size */ > typedef __uint32_t __socklen_t;/* length type for network > syscalls */ > typedef long__suseconds_t; /* microseconds (signed) */ > -typedef __int32_t __swblk_t; /* swap offset */ > +typedef unsigned long __swblk_t; /* swap offset */ > typedef __int64_t __time_t; /* epoch time */ > typedef __int32_t __timer_t; /* POSIX timer identifiers */ > typedef __uint32_t __uid_t;/* user id */ > blob - 102ca95dd45ba6d9cab0f3fcbb033d6043ec1606 > file + sys/sys/blist.h > --- sys/sys/blist.h > +++ sys/sys/blist.h > @@ -1,4 +1,5 @@ > /* $OpenBSD: blist.h,v 1.1 2022/07/29 17:47:12 semarie Exp $ */ > +/* DragonFlyBSD:7b80531f545c7d3c51c1660130c71d01f6bccbe0:/sys/sys/blist.h */ > /* > * Copyright (c) 2003,2004 The DragonFly Project. All rights reserved. > * > @@ -65,15 +66,13 @@ > #include > #endif > > -#define SWBLK_BITS 64 > -typedef u_long bsbmp_t; > -typedef u_long bsblk_t; > +typedef u_int64_tu_swblk_t; > > /* > * note: currently use SWAPBLK_NONE as an absolute value rather then > * a flag bit. > */ > -#define SWAPBLK_NONE ((bsblk_t)-1) > +#define SWAPBLK_NONE ((swblk_t)-1) > > /* > * blmeta and bl_bitmap_t MUST be a power of 2 in size. 
> @@ -81,39 +80,39 @@ typedef u_long bsblk_t; > > typedef struct blmeta { > union { > - bsblk_t bmu_avail; /* space available under us */ > - bsbmp_t bmu_bitmap; /* bitmap if we are a leaf */ > + swblk_t bmu_avail; /* space available under us */ > + u_swblk_t bmu_bitmap; /* bitmap if we are a leaf */ > } u; > - bsblk_t bm_bighint; /* biggest contiguous block hint*/ > + swblk_t bm_bighint; /* biggest contiguous block hint*/ > } blmeta_t; > > typedef struct blist { > - bsblk_t bl_blocks; /* area of coverage */ > + swblk_t bl_blocks; /* area of coverage */ > /* XXX int64_t bl_radix */ > - bsblk_t bl_radix; /* coverage radix */ > - bsblk_t bl_skip;/* starting skip*/ > - bsblk_t bl_free;/* number of free blocks*/ > + swblk_t bl_radix; /* coverage radix */ > + swblk_t bl_skip;/* starting skip*/ > + swblk_t bl_free;/* number of free blocks*/ > blmeta_t*bl_root; /* root of radix tree */ > - bsblk_t bl_rootblks;/* bsblk_t blks allocated for tree */ > + swblk_t bl_rootblks;/* swblk_t blks allocated for tree */ > } *blist_t; > > -#define BLIST_META_RADIX (sizeof(bsbmp_t)*8/2) /* 2 bits per */ > -#define BLIST_BMAP_RADIX (sizeof(bsbmp_t)*8) /* 1 bit per */ > +#define BLIST_META_RADIX (sizeof(u_swblk_t)*8/2) /* 2 bits per */ > +#define BLIST_BMAP_RADIX (sizeof(u_swblk_t)*8) /* 1 bit per */ > > /* > * The radix may exceed the size of a 64 bit signed (or unsigned) int > - * when the maximal number of blocks is allocated. With a 32-bit bsblk_t > + * when the maximal number of blocks is allocated. With a 32-bit swblk_t > * this corresponds to ~1G x PAGE_SIZE = 4096GB. The swap code usually > * divides this by 4, leaving us with a capability of up to four 1TB swap > * devices. > * > - * With a
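The thread is truncated above, but semarie@'s point about the regress now behaving identically on 32- and 64-bit archs follows directly from the new types: the radix constants depend only on sizeof(u_swblk_t), which is 8 everywhere. A small standalone sketch using the definitions from the proposed blist.h:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t u_swblk_t;	/* as in the proposed header */

#define BLIST_META_RADIX	(sizeof(u_swblk_t)*8/2)	/* 2 bits per */
#define BLIST_BMAP_RADIX	(sizeof(u_swblk_t)*8)	/* 1 bit per */

int
main(void)
{
	/* prints 64 and 32 on any platform, regardless of word size */
	printf("bmap radix %zu, meta radix %zu\n",
	    BLIST_BMAP_RADIX, BLIST_META_RADIX);
	return 0;
}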
pf.conf(5): document new anchors limit
Hi

This is a diff to document the new anchors limit in pf.conf(5).
I inserted it as second-to-last item, as the following paragraph talks
about NMBCLUSTERS.

While here: Is the double entry for table-entries intentional?

Best,
Martin

Index: pf.conf.5
===================================================================
RCS file: /cvs/src/share/man/man5/pf.conf.5,v
retrieving revision 1.596
diff -u -p -r1.596 pf.conf.5
--- pf.conf.5	27 May 2022 15:45:02 -0000	1.596
+++ pf.conf.5	21 Jul 2022 17:00:53 -0000
@@ -1287,6 +1287,7 @@ has the following defaults:
 .It tables Ta Dv PFR_KTABLE_HIWAT Ta Pq 1000
 .It table-entries Ta Dv PFR_KENTRY_HIWAT Ta Pq 200000
 .It table-entries Ta Dv PFR_KENTRY_HIWAT_SMALL Ta Pq 100000
+.It anchors Ta Dv PF_ANCHOR_HIWAT Ta Pq 512
 .It frags Ta Dv NMBCLUSTERS Ns /32 Ta Pq platform dependent
 .El
 .Pp
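If the new limit is tunable like its neighbours in that table, the corresponding pf.conf(5) knob would look like this (assumed syntax, mirroring the other "set limit" entries):

# raise the maximum number of anchors above the default
# PF_ANCHOR_HIWAT of 512
set limit anchors 1024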
ypconnect(2): mention correct return value
Hi

While looking at the recent YP changes I noticed that the RETURN VALUES
section of the man page is incorrect. Here is an update (I just copied
the text from socket(2) and adjusted the function name).

Best,
Martin

Index: ypconnect.2
===================================================================
RCS file: /cvs/src/lib/libc/sys/ypconnect.2,v
retrieving revision 1.2
diff -u -p -r1.2 ypconnect.2
--- ypconnect.2	17 Jul 2022 05:48:26 -0000	1.2
+++ ypconnect.2	21 Jul 2022 17:08:57 -0000
@@ -45,7 +45,12 @@ general purpose.
 .Nm
 is only intended for use by internal libc YP functions.
 .Sh RETURN VALUES
-.Rv -std
+If successful,
+.Fn ypconnect
+returns a non-negative integer, the socket file descriptor.
+Otherwise, a value of \-1 is returned and
+.Va errno
+is set to indicate the error.
 .Sh ERRORS
 .Fn ypconnect
 will fail if:
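Since the new wording spells out the return convention, a caller would check it like any other descriptor-returning call. Sketch only: ypconnect(2) is restricted to internal libc YP use, and the prototype below is transcribed from the manual.

#include <sys/socket.h>
#include <err.h>
#include <unistd.h>

int	ypconnect(int type);	/* internal to libc, not a public header */

void
example(void)
{
	int s;

	/* type is a socket type, e.g. SOCK_DGRAM or SOCK_STREAM */
	if ((s = ypconnect(SOCK_DGRAM)) == -1)
		err(1, "ypconnect");	/* errno describes the failure */
	/* ... use the socket ... */
	close(s);
}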
Re: Introduce uvm_pagewait()
On 28/06/22(Tue) 14:13, Martin Pieuchot wrote: > I'd like to abstract the use of PG_WANTED to start unifying & cleaning > the various cases where a code path is waiting for a busy page. Here's > the first step. > > ok? Anyone? > Index: uvm/uvm_amap.c > === > RCS file: /cvs/src/sys/uvm/uvm_amap.c,v > retrieving revision 1.90 > diff -u -p -r1.90 uvm_amap.c > --- uvm/uvm_amap.c30 Aug 2021 16:59:17 - 1.90 > +++ uvm/uvm_amap.c28 Jun 2022 11:53:08 - > @@ -781,9 +781,7 @@ ReStart: >* it and then restart. >*/ > if (pg->pg_flags & PG_BUSY) { > - atomic_setbits_int(>pg_flags, PG_WANTED); > - rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK, > - "cownow", INFSLP); > + uvm_pagewait(pg, amap->am_lock, "cownow"); > goto ReStart; > } > > Index: uvm/uvm_aobj.c > === > RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v > retrieving revision 1.103 > diff -u -p -r1.103 uvm_aobj.c > --- uvm/uvm_aobj.c29 Dec 2021 20:22:06 - 1.103 > +++ uvm/uvm_aobj.c28 Jun 2022 11:53:08 - > @@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj) > while ((pg = RBT_ROOT(uvm_objtree, >memt)) != NULL) { > pmap_page_protect(pg, PROT_NONE); > if (pg->pg_flags & PG_BUSY) { > - atomic_setbits_int(>pg_flags, PG_WANTED); > - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det", > - INFSLP); > + uvm_pagewait(pg, uobj->vmobjlock, "uao_det"); > + rw_enter(uobj->vmobjlock, RW_WRITE); > continue; > } > uao_dropswap(>u_obj, pg->offset >> PAGE_SHIFT); > @@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_ > > /* Make sure page is unbusy, else wait for it. */ > if (pg->pg_flags & PG_BUSY) { > - atomic_setbits_int(>pg_flags, PG_WANTED); > - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh", > - INFSLP); > + uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh"); > + rw_enter(uobj->vmobjlock, RW_WRITE); > curoff -= PAGE_SIZE; > continue; > } > @@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t > > /* page is there, see if we need to wait on it */ > if ((ptmp->pg_flags & PG_BUSY) != 0) { > - atomic_setbits_int(>pg_flags, PG_WANTED); > - rwsleep_nsec(ptmp, uobj->vmobjlock, PVM, > - "uao_get", INFSLP); > + uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get"); > + rw_enter(uobj->vmobjlock, RW_WRITE); > continue; /* goto top of pps while loop */ > } > > Index: uvm/uvm_km.c > === > RCS file: /cvs/src/sys/uvm/uvm_km.c,v > retrieving revision 1.150 > diff -u -p -r1.150 uvm_km.c > --- uvm/uvm_km.c 7 Jun 2022 12:07:45 - 1.150 > +++ uvm/uvm_km.c 28 Jun 2022 11:53:08 - > @@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj, > for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) { > pp = uvm_pagelookup(uobj, curoff); > if (pp && pp->pg_flags & PG_BUSY) { > - atomic_setbits_int(>pg_flags, PG_WANTED); > - rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm", > - INFSLP); > + uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm"); > + rw_enter(uobj->vmobjlock, RW_WRITE); > curoff -= PAGE_SIZE; /* loop back to us */ > continue; > } > Index: uvm/uvm_page.c > === > RCS file: /cvs/src/sys/uvm/uvm_page.c,v > retrieving revision 1.166 > diff -u -p -r1.166 uvm_page.c > --- uvm/uvm_page.c12 May 2022 12:48:36 - 1.166 > +++ uvm/uv
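The quoted diff is cut off above; the full posting appears later in this digest. The helper at its core (taken from that posting) is small:

void
uvm_pagewait(struct vm_page *pg, struct rwlock *lock, const char *wmesg)
{
	KASSERT(rw_lock_held(lock));
	KASSERT((pg->pg_flags & PG_BUSY) != 0);

	atomic_setbits_int(&pg->pg_flags, PG_WANTED);
	rwsleep_nsec(pg, lock, PVM | PNORELOCK, wmesg, INFSLP);
}

Note the PNORELOCK: the object lock is released during the sleep, which is why the callers that keep scanning re-take it with rw_enter() before continuing.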
PATCH: better prime testing for libressl
Hello, I'm suggesting you a diff against master of the implementation of the Baillie-PSW algorithm for primality testing. The code in itself is commented and has information about what is being done and why. The reason for this change is that in this paper from 2018 (https://eprint.iacr.org/2018/749.pdf), researchers pointed out the weaknesses of main algorithms for primality testing where one could generate pseudoprimes to those tests resulting in fooling them and having composite numbers being taken as primes. There is one that has been studied and that has no known pseudoprime and it is called the Baillie-PSW algorithm. It is this one I and Theo Buehler have implemented as it is safer. Me and Theo Buehler have been actively working on it for the past month to finally present it to you. In hope this finds you well, Regards, Martin Grenouilloux. EPITA - LSE diff --git a/lib/libcrypto/Makefile b/lib/libcrypto/Makefile index d6432cdc518..7e3a07c5fba 100644 --- a/lib/libcrypto/Makefile +++ b/lib/libcrypto/Makefile @@ -89,6 +89,7 @@ SRCS+= bn_print.c bn_rand.c bn_shift.c bn_word.c bn_blind.c SRCS+= bn_kron.c bn_sqrt.c bn_gcd.c bn_prime.c bn_err.c bn_sqr.c SRCS+= bn_recp.c bn_mont.c bn_mpi.c bn_exp2.c bn_gf2m.c bn_nist.c SRCS+= bn_depr.c bn_const.c bn_x931p.c +SRCS+= bn_bpsw.c bn_isqrt.c # buffer/ SRCS+= buffer.c buf_err.c buf_str.c diff --git a/lib/libcrypto/bn/bn_bpsw.c b/lib/libcrypto/bn/bn_bpsw.c new file mode 100644 index 000..d198899b2a4 --- /dev/null +++ b/lib/libcrypto/bn/bn_bpsw.c @@ -0,0 +1,401 @@ +/* $OpenBSD$ */ +/* + * Copyright (c) 2022 Martin Grenouilloux + * Copyright (c) 2022 Theo Buehler + * + * Permission to use, copy, modify, and distribute this software for any + * purpose with or without fee is hereby granted, provided that the above + * copyright notice and this permission notice appear in all copies. + * + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. + */ + +#include + +#include "bn_lcl.h" +#include "bn_prime.h" + +/* + * For an odd n compute a / 2 (mod n). If a is even, we can do a plain + * division, otherwise calculate (a + n) / 2. Then reduce (mod n). + */ +static int +bn_division_by_two_mod_n(BIGNUM *r, BIGNUM *a, const BIGNUM *n, BN_CTX *ctx) +{ + if (!BN_is_odd(n)) + return 0; + + if (!BN_mod_ct(r, a, n, ctx)) + return 0; + + if (BN_is_odd(r)) { + if (!BN_add(r, r, n)) + return 0; + } + + if (!BN_rshift1(r, r)) + return 0; + + return 1; +} + +/* + * Given the next binary digit of k and the current Lucas terms U and V, this + * helper computes the next terms in the Lucas sequence defined as follows: + * + * U' = U * V (mod n) + * V' = (V^2 + D * U^2) / 2(mod n) + * + * If digit == 0, bn_lucas_step() returns U' and V'. If digit == 1, it returns + * + * U'' = (U' + V') / 2 (mod n) + * V'' = (V' + D * U') / 2 (mod n) + * + * Compare with FIPS 186-4, Appendix C.3.3, step 6. 
+ */ +static int +bn_lucas_step(BIGNUM *U, BIGNUM *V, int digit, const BIGNUM *D, +const BIGNUM *n, BN_CTX *ctx) +{ + BIGNUM *tmp; + int ret = 0; + + BN_CTX_start(ctx); + + if ((tmp = BN_CTX_get(ctx)) == NULL) + goto done; + + /* Store D * U^2 before computing U'. */ + if (!BN_sqr(tmp, U, ctx)) + goto done; + if (!BN_mul(tmp, D, tmp, ctx)) + goto done; + + /* U' = U * V (mod n). */ + if (!BN_mod_mul(U, U, V, n, ctx)) + goto done; + + /* V' = (V^2 + D * U^2) / 2 (mod n). */ + if (!BN_sqr(V, V, ctx)) + goto done; + if (!BN_add(V, V, tmp)) + goto done; + if (!bn_division_by_two_mod_n(V, V, n, ctx)) + goto done; + + if (digit == 1) { + /* Store D * U' before computing U''. */ + if (!BN_mul(tmp, D, U, ctx)) + goto done; + + /* U'' = (U' + V') / 2 (mod n). */ + if (!BN_add(U, U, V)) + goto done; + if (!bn_division_by_two_mod_n(U, U, n, ctx)) + goto done; + + /* V'' = (V' + D * U') / 2 (mod n). */ + if (!BN_add(V, V, tmp)) + goto done; + if (!bn_division_by_two_mod_n(V, V, n, ctx)) + goto done; + } + + ret = 1; + + done: + BN_CTX_end(ctx); + + return ret; +} + +/* + * Compute the Lucas terms U_k, V_k, see FIPS 186-4, Appendix C.3.3, steps 4-6. + */ +static int +bn_lucas(BIGNUM *U, BIGNUM *V, const BIGNUM *k, const BIGNUM *D, +const BIGNUM *n, BN_CTX *ctx) +{ + int digit, i; + int ret = 0; + + if (!BN_one(U)) + goto done; + if (!BN_one(V)) + goto done; + + /* + * Iterate over the digits of k from MSB to LSB. Start at digit 2 + * since the first digit is dealt with by setting U = 1 and V = 1. + */ + for (i =
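The diff is truncated above. For orientation: Baillie-PSW combines a strong probable-prime test to base 2 with the strong Lucas test that the bn_lucas*() functions implement. The BIGNUM code is what the patch adds; the base-2 half is the familiar Miller-Rabin round, sketched here on plain 64-bit integers rather than BIGNUMs:

#include <stdint.h>

static uint64_t
mulmod(uint64_t a, uint64_t b, uint64_t n)
{
	return (unsigned __int128)a * b % n;
}

static uint64_t
powmod(uint64_t b, uint64_t e, uint64_t n)
{
	uint64_t r = 1;

	for (; e > 0; e >>= 1) {
		if (e & 1)
			r = mulmod(r, b, n);
		b = mulmod(b, b, n);
	}
	return r;
}

/* Returns 1 if odd n > 2 is a strong probable prime to base 2. */
static int
is_strong_prp_base2(uint64_t n)
{
	uint64_t d = n - 1, x;
	int s = 0, i;

	/* write n - 1 = d * 2^s with d odd */
	while ((d & 1) == 0) {
		d >>= 1;
		s++;
	}
	x = powmod(2, d, n);
	if (x == 1 || x == n - 1)
		return 1;
	for (i = 1; i < s; i++) {
		x = mulmod(x, x, n);
		if (x == n - 1)
			return 1;
	}
	return 0;
}

A composite passing both this and the strong Lucas test is not known to exist, which is the property the paper cited above relies on.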
Faster M operation for the swapper to be great again
Diff below uses two tricks to make uvm_pagermapin/out() faster and less
likely to fail in OOM situations. These functions are used to map
buffers when swapping pages in/out and when faulting on mmaped files.
robert@ even measured a 75% improvement when populating pages related
to files that aren't yet in the buffer cache.

The first trick is to use the direct map when available. I'm doing
this for single pages but km_alloc(9) also does that for single
segment... uvm_io() only maps one page at a time for the moment so
this should be enough.

The second trick is to use pmap_kenter_pa() which doesn't fail and is
faster.

With these changes the "freeze" happening on my server when entering
many pages to swap in OOM situation is much shorter and the machine
becomes quickly responsive.

ok?

Index: uvm/uvm_pager.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.81
diff -u -p -r1.81 uvm_pager.c
--- uvm/uvm_pager.c	28 Jun 2022 19:07:40 -0000	1.81
+++ uvm/uvm_pager.c	30 Jun 2022 13:34:46 -0000
@@ -258,6 +258,16 @@ uvm_pagermapin(struct vm_page **pps, int
 	vsize_t size;
 	struct vm_page *pp;
 
+#ifdef __HAVE_PMAP_DIRECT
+	/* use direct mappings for single page */
+	if (npages == 1) {
+		KASSERT(pps[0]);
+		KASSERT(pps[0]->pg_flags & PG_BUSY);
+		kva = pmap_map_direct(pps[0]);
+		return kva;
+	}
+#endif
+
 	prot = PROT_READ;
 	if (flags & UVMPAGER_MAPIN_READ)
 		prot |= PROT_WRITE;
@@ -273,14 +283,7 @@ uvm_pagermapin(struct vm_page **pps, int
 		pp = *pps++;
 		KASSERT(pp);
 		KASSERT(pp->pg_flags & PG_BUSY);
-		/* Allow pmap_enter to fail. */
-		if (pmap_enter(pmap_kernel(), cva, VM_PAGE_TO_PHYS(pp),
-		    prot, PMAP_WIRED | PMAP_CANFAIL | prot) != 0) {
-			pmap_remove(pmap_kernel(), kva, cva);
-			pmap_update(pmap_kernel());
-			uvm_pseg_release(kva);
-			return 0;
-		}
+		pmap_kenter_pa(cva, VM_PAGE_TO_PHYS(pp), prot);
 	}
 	pmap_update(pmap_kernel());
 	return kva;
@@ -294,8 +297,15 @@ uvm_pagermapin(struct vm_page **pps, int
 void
 uvm_pagermapout(vaddr_t kva, int npages)
 {
+#ifdef __HAVE_PMAP_DIRECT
+	/* use direct mappings for single page */
+	if (npages == 1) {
+		pmap_unmap_direct(kva);
+		return;
+	}
+#endif
 
-	pmap_remove(pmap_kernel(), kva, kva + ((vsize_t)npages << PAGE_SHIFT));
+	pmap_kremove(kva, (vsize_t)npages << PAGE_SHIFT);
 	pmap_update(pmap_kernel());
 
 	uvm_pseg_release(kva);
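The call sites this speeds up follow the same map/IO/unmap pattern; roughly (a sketch of the usage, not code from the tree):

	vaddr_t kva;

	if ((kva = uvm_pagermapin(pps, npages, UVMPAGER_MAPIN_WAITOK)) == 0)
		return VM_PAGER_AGAIN;	/* no pager segment available */
	/* ... start the I/O on the buffer now mapped at kva ... */
	uvm_pagermapout(kva, npages);

With the diff, the npages == 1 case never touches the pager segments at all: it resolves to pmap_map_direct()/pmap_unmap_direct() on the physical page.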
Re: Use SMR instead of SRP list in rtsock.c
On 30/06/22(Thu) 11:56, Claudio Jeker wrote: > On Thu, Jun 30, 2022 at 12:34:33PM +0300, Vitaliy Makkoveev wrote: > > On Thu, Jun 30, 2022 at 11:08:48AM +0200, Claudio Jeker wrote: > > > This diff converts the SRP list to a SMR list in rtsock.c > > > SRP is a bit strange with how it works and the SMR code is a bit easier to > > > understand. Since we can sleep in the SMR_TAILQ_FOREACH() we need to grab > > > a refcount on the route pcb so that we can leave the SMR critical section > > > and then enter the SMR critical section at the end of the loop before > > > dropping the refcount again. > > > > > > The diff does not immeditaly explode but I doubt we can exploit > > > parallelism in route_input() so this may fail at some later stage if it is > > > wrong. > > > > > > Comments from the lock critics welcome > > > > We use `so_lock' rwlock(9) to protect route domain sockets. We can't > > convert this SRP list to SMR list because we call solock() within > > foreach loop. We shouldn't use SRP list either, no? Or are we allowed to sleep holding a SRP reference? That's the question that triggered this diff. > because of the so_lock the code uses a refcnt on the route pcb to make > sure that the object is not freed while we sleep. So that is handled by > this diff. > > > We can easily crash kernel by running in parallel some "route monitor" > > commands and "while true; ifconfig vether0 create ; ifconfig vether0 > > destroy; done". > > That does not cause problem on my system. > > > > -- > > > :wq Claudio > > > > > > Index: sys/net/rtsock.c > > > === > > > RCS file: /cvs/src/sys/net/rtsock.c,v > > > retrieving revision 1.334 > > > diff -u -p -r1.334 rtsock.c > > > --- sys/net/rtsock.c 28 Jun 2022 10:01:13 - 1.334 > > > +++ sys/net/rtsock.c 30 Jun 2022 08:02:09 - > > > @@ -71,7 +71,7 @@ > > > #include > > > #include > > > #include > > > -#include > > > +#include > > > > > > #include > > > #include > > > @@ -107,8 +107,6 @@ struct walkarg { > > > }; > > > > > > void route_prinit(void); > > > -void rcb_ref(void *, void *); > > > -void rcb_unref(void *, void *); > > > int route_output(struct mbuf *, struct socket *, struct sockaddr *, > > > struct mbuf *); > > > int route_ctloutput(int, struct socket *, int, int, struct mbuf *); > > > @@ -149,7 +147,7 @@ intrt_setsource(unsigned int, struct > > > struct rtpcb { > > > struct socket *rop_socket;/* [I] */ > > > > > > - SRPL_ENTRY(rtpcb) rop_list; > > > + SMR_TAILQ_ENTRY(rtpcb) rop_list; > > > struct refcnt rop_refcnt; > > > struct timeout rop_timeout; > > > unsigned introp_msgfilter; /* [s] */ > > > @@ -162,8 +160,7 @@ struct rtpcb { > > > #define sotortpcb(so) ((struct rtpcb *)(so)->so_pcb) > > > > > > struct rtptable { > > > - SRPL_HEAD(, rtpcb) rtp_list; > > > - struct srpl_rc rtp_rc; > > > + SMR_TAILQ_HEAD(, rtpcb) rtp_list; > > > struct rwlock rtp_lk; > > > unsigned intrtp_count; > > > }; > > > @@ -185,29 +182,12 @@ struct rtptable rtptable; > > > void > > > route_prinit(void) > > > { > > > - srpl_rc_init(_rc, rcb_ref, rcb_unref, NULL); > > > rw_init(_lk, "rtsock"); > > > - SRPL_INIT(_list); > > > + SMR_TAILQ_INIT(_list); > > > pool_init(_pool, sizeof(struct rtpcb), 0, > > > IPL_SOFTNET, PR_WAITOK, "rtpcb", NULL); > > > } > > > > > > -void > > > -rcb_ref(void *null, void *v) > > > -{ > > > - struct rtpcb *rop = v; > > > - > > > - refcnt_take(>rop_refcnt); > > > -} > > > - > > > -void > > > -rcb_unref(void *null, void *v) > > > -{ > > > - struct rtpcb *rop = v; > > > - > > > - refcnt_rele_wake(>rop_refcnt); > > > -} > > > - > > > int > > > 
route_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf > > > *nam, > > > struct mbuf *control, struct proc *p) > > > @@ -325,8 +305,7 @@ route_attach(struct socket *so, int prot > > > so->so_options |= SO_USELOOPBACK; > > > > > > rw_enter(_lk, RW_WRITE); > > > - SRPL_INSERT_HEAD_LOCKED(_rc, _list, rop, > > > - rop_list); > > > + SMR_TAILQ_INSERT_HEAD_LOCKED(_list, rop, rop_list); > > > rtptable.rtp_count++; > > > rw_exit(_lk); > > > > > > @@ -347,8 +326,7 @@ route_detach(struct socket *so) > > > rw_enter(_lk, RW_WRITE); > > > > > > rtptable.rtp_count--; > > > - SRPL_REMOVE_LOCKED(_rc, _list, rop, rtpcb, > > > - rop_list); > > > + SMR_TAILQ_REMOVE_LOCKED(_list, rop, rop_list); > > > rw_exit(_lk); > > > > > > sounlock(so); > > > @@ -356,6 +334,7 @@ route_detach(struct socket *so) > > > /* wait for all references to drop */ > > > refcnt_finalize(>rop_refcnt, "rtsockrefs"); > > > timeout_del_barrier(>rop_timeout); > > > + smr_barrier(); > > > > > > solock(so); > > > > > > @@ -501,7 +480,6 @@ route_input(struct mbuf *m0, struct sock > > > struct rtpcb *rop; > > >
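The diff is cut off above, but the iteration pattern Claudio describes (take a reference so the pcb survives the sleep in solock(), leave the SMR critical section, re-enter it before dropping the reference) is worth seeing in isolation. A sketch, not the exact code:

struct rtpcb *rop;
struct socket *so;

smr_read_enter();
SMR_TAILQ_FOREACH(rop, &rtptable.rtp_list, rop_list) {
	refcnt_take(&rop->rop_refcnt);
	smr_read_leave();

	so = rop->rop_socket;
	solock(so);		/* may sleep */
	/* ... deliver the routing message to this socket ... */
	sounlock(so);

	smr_read_enter();
	refcnt_rele_wake(&rop->rop_refcnt);
}
smr_read_leave();

The refcnt_finalize()/smr_barrier() pair in route_detach() is what makes it safe to keep walking from `rop' after re-entering the critical section.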
Re: arp llinfo mutex
On 29/06/22(Wed) 19:40, Alexander Bluhm wrote: > Hi, > > To fix the KASSERT(la != NULL) we have to protect the rt_llinfo > with a mutex. The idea is to keep rt_llinfo and RTF_LLINFO consistent. > Also do not put the mutex in the fast path. Losing the RTM_ADD/DELETE race is not a bug. I would not add a printf in these cases. I understand you might want one for debugging purposes but I don't see any value in committing it. Do you agree? Note that some times the code checks for the RTF_LLINFO flags and some time for rt_llinfo != NULL. This is inconsistent and a bit confusing now that we use a mutex to protect those states. Could you document that rt_llinfo is now protected by the mutex (or KERNEL_LOCK())? Anyway this is an improvement ok mpi@ PS: What about ND6? > Index: netinet/if_ether.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/if_ether.c,v > retrieving revision 1.250 > diff -u -p -r1.250 if_ether.c > --- netinet/if_ether.c27 Jun 2022 20:47:10 - 1.250 > +++ netinet/if_ether.c28 Jun 2022 14:00:12 - > @@ -101,6 +101,8 @@ void arpreply(struct ifnet *, struct mbu > unsigned int); > > struct niqueue arpinq = NIQUEUE_INITIALIZER(50, NETISR_ARP); > + > +/* llinfo_arp live time, rt_llinfo and RTF_LLINFO are protected by arp_mtx */ > struct mutex arp_mtx = MUTEX_INITIALIZER(IPL_SOFTNET); > > LIST_HEAD(, llinfo_arp) arp_list; /* [mN] list of all llinfo_arp structures > */ > @@ -155,7 +157,7 @@ void > arp_rtrequest(struct ifnet *ifp, int req, struct rtentry *rt) > { > struct sockaddr *gate = rt->rt_gateway; > - struct llinfo_arp *la = (struct llinfo_arp *)rt->rt_llinfo; > + struct llinfo_arp *la; > time_t uptime; > > NET_ASSERT_LOCKED(); > @@ -171,7 +173,7 @@ arp_rtrequest(struct ifnet *ifp, int req > rt->rt_expire = 0; > break; > } > - if ((rt->rt_flags & RTF_LOCAL) && !la) > + if ((rt->rt_flags & RTF_LOCAL) && rt->rt_llinfo == NULL) > rt->rt_expire = 0; > /* >* Announce a new entry if requested or warn the user > @@ -192,44 +194,54 @@ arp_rtrequest(struct ifnet *ifp, int req > } > satosdl(gate)->sdl_type = ifp->if_type; > satosdl(gate)->sdl_index = ifp->if_index; > - if (la != NULL) > - break; /* This happens on a route change */ > /* >* Case 2: This route may come from cloning, or a manual route >* add with a LL address. 
>	 */
>	la = pool_get(&arp_pool, PR_NOWAIT | PR_ZERO);
> -	rt->rt_llinfo = (caddr_t)la;
>	if (la == NULL) {
>		log(LOG_DEBUG, "%s: pool get failed\n", __func__);
>		break;
>	}
> 
> +	mtx_enter(&arp_mtx);
> +	if (rt->rt_llinfo != NULL) {
> +		/* we lost the race, another thread has entered it */
> +		mtx_leave(&arp_mtx);
> +		printf("%s: llinfo exists\n", __func__);
> +		pool_put(&arp_pool, la);
> +		break;
> +	}
>	mq_init(&la->la_mq, LA_HOLD_QUEUE, IPL_SOFTNET);
> +	rt->rt_llinfo = (caddr_t)la;
>	la->la_rt = rt;
>	rt->rt_flags |= RTF_LLINFO;
> +	LIST_INSERT_HEAD(&arp_list, la, la_list);
>	if ((rt->rt_flags & RTF_LOCAL) == 0)
>		rt->rt_expire = uptime;
> -	mtx_enter(&arp_mtx);
> -	LIST_INSERT_HEAD(&arp_list, la, la_list);
>	mtx_leave(&arp_mtx);
> +
>	break;
> 
>	case RTM_DELETE:
> -	if (la == NULL)
> -		break;
>	mtx_enter(&arp_mtx);
> +	la = (struct llinfo_arp *)rt->rt_llinfo;
> +	if (la == NULL) {
> +		/* we lost the race, another thread has removed it */
> +		mtx_leave(&arp_mtx);
> +		printf("%s: llinfo missing\n", __func__);
> +		break;
> +	}
>	LIST_REMOVE(la, la_list);
> -	mtx_leave(&arp_mtx);
>	rt->rt_llinfo = NULL;
>	rt->rt_flags &= ~RTF_LLINFO;
>	atomic_sub_int(&la_hold_total, mq_purge(&la->la_mq));
> +	mtx_leave(&arp_mtx);
> +
>	pool_put(&arp_pool, la);
>	break;
> 
>	case RTM_INVALIDATE:
> -	if (la == NULL)
> -		break;
>	if (!ISSET(rt->rt_flags, RTF_LOCAL))
>		arpinvalidate(rt);
>	break;
> @@ -363,8 +375,6 @@ arpresolve(struct ifnet *ifp, struct rte
>		goto bad;
>	}
> 
> -	la = (struct llinfo_arp *)rt->rt_llinfo;
> -	KASSERT(la != NULL);
> 
>	/*
>	 * Check the
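The RTM_ADD half of the race handling above, in isolation (distilled from the diff, not a verbatim excerpt): allocate optimistically outside the mutex, then re-check the published pointer under arp_mtx and back out if another thread won.

la = pool_get(&arp_pool, PR_NOWAIT | PR_ZERO);
if (la == NULL)
	break;

mtx_enter(&arp_mtx);
if (rt->rt_llinfo != NULL) {
	/* lost the race: another thread already attached its llinfo */
	mtx_leave(&arp_mtx);
	pool_put(&arp_pool, la);
	break;
}
rt->rt_llinfo = (caddr_t)la;
rt->rt_flags |= RTF_LLINFO;
mtx_leave(&arp_mtx);

Keeping rt_llinfo and RTF_LLINFO in sync under arp_mtx is what allows the KASSERT(la != NULL) in arpresolve() to go away.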
Simplify aiodone daemon
The aiodone daemon accounts for and frees/releases pages they were written to swap. It is only used for asynchronous write. The diff below uses this knowledge to: - Stop suggesting that uvm_swap_get() can be asynchronous. There's an assert for PGO_SYNCIO 3 lines above. - Remove unused support for asynchronous read, including error conditions, from uvm_aio_aiodone_pages(). - Grab the proper lock for each page that has been written to swap. This allows to enable an assert in uvm_page_unbusy(). - Move the uvm_anon_release() call outside of uvm_page_unbusy() and assert for the different anon cases. This will allows us to unify code paths waiting for busy pages. This is adapted/simplified from what is in NetBSD. ok? Index: uvm/uvm_aobj.c === RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v retrieving revision 1.103 diff -u -p -r1.103 uvm_aobj.c --- uvm/uvm_aobj.c 29 Dec 2021 20:22:06 - 1.103 +++ uvm/uvm_aobj.c 29 Jun 2022 11:16:35 - @@ -143,7 +143,6 @@ struct pool uvm_aobj_pool; static struct uao_swhash_elt *uao_find_swhash_elt(struct uvm_aobj *, int, boolean_t); -static int uao_find_swslot(struct uvm_object *, int); static boolean_tuao_flush(struct uvm_object *, voff_t, voff_t, int); static void uao_free(struct uvm_aobj *); @@ -241,7 +240,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob /* * uao_find_swslot: find the swap slot number for an aobj/pageidx */ -inline static int +int uao_find_swslot(struct uvm_object *uobj, int pageidx) { struct uvm_aobj *aobj = (struct uvm_aobj *)uobj; Index: uvm/uvm_aobj.h === RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v retrieving revision 1.17 diff -u -p -r1.17 uvm_aobj.h --- uvm/uvm_aobj.h 21 Oct 2020 09:08:14 - 1.17 +++ uvm/uvm_aobj.h 29 Jun 2022 11:16:35 - @@ -60,6 +60,7 @@ void uao_init(void); int uao_set_swslot(struct uvm_object *, int, int); +int uao_find_swslot (struct uvm_object *, int); int uao_dropswap(struct uvm_object *, int); int uao_swap_off(int, int); int uao_shrink(struct uvm_object *, int); Index: uvm/uvm_page.c === RCS file: /cvs/src/sys/uvm/uvm_page.c,v retrieving revision 1.166 diff -u -p -r1.166 uvm_page.c --- uvm/uvm_page.c 12 May 2022 12:48:36 - 1.166 +++ uvm/uvm_page.c 29 Jun 2022 11:47:55 - @@ -1036,13 +1036,14 @@ uvm_pagefree(struct vm_page *pg) * uvm_page_unbusy: unbusy an array of pages. * * => pages must either all belong to the same object, or all belong to anons. + * => if pages are object-owned, object must be locked. * => if pages are anon-owned, anons must have 0 refcount. + * => caller must make sure that anon-owned pages are not PG_RELEASED. */ void uvm_page_unbusy(struct vm_page **pgs, int npgs) { struct vm_page *pg; - struct uvm_object *uobj; int i; for (i = 0; i < npgs; i++) { @@ -1052,35 +1053,19 @@ uvm_page_unbusy(struct vm_page **pgs, in continue; } -#if notyet - /* - * XXX swap case in uvm_aio_aiodone() is not holding the lock. -* -* This isn't compatible with the PG_RELEASED anon case below. 
-*/ KASSERT(uvm_page_owner_locked_p(pg)); -#endif KASSERT(pg->pg_flags & PG_BUSY); if (pg->pg_flags & PG_WANTED) { wakeup(pg); } if (pg->pg_flags & PG_RELEASED) { - uobj = pg->uobject; - if (uobj != NULL) { - uvm_lock_pageq(); - pmap_page_protect(pg, PROT_NONE); - /* XXX won't happen right now */ - if (pg->pg_flags & PQ_AOBJ) - uao_dropswap(uobj, - pg->offset >> PAGE_SHIFT); - uvm_pagefree(pg); - uvm_unlock_pageq(); - } else { - rw_enter(pg->uanon->an_lock, RW_WRITE); - uvm_anon_release(pg->uanon); - } + KASSERT(pg->uobject != NULL || + (pg->uanon != NULL && pg->uanon->an_ref > 0)); + atomic_clearbits_int(>pg_flags, PG_RELEASED); + uvm_pagefree(pg); } else { + KASSERT((pg->pg_flags & PG_FAKE) == 0); atomic_clearbits_int(>pg_flags, PG_WANTED|PG_BUSY); UVM_PAGE_OWN(pg, NULL);
Re: Unlocking pledge(2)
On 28/06/22(Tue) 18:17, Jeremie Courreges-Anglas wrote: > > Initially I just wandered in syscall_mi.h and found the locking scheme > super weird, even if technically correct. pledge_syscall() better be > safe to call without the kernel lock so I don't understand why we're > sometimes calling it with the kernel lock and sometimes not. > > ps_pledge is 64 bits so it's not possible to unset bits in an atomic > manner on all architectures. Even if we're only removing bits and there > is probably no way to see a completely garbage value, it makes sense to > just protect ps_pledge (and ps_execpledge) in the usual manner so that > we can unlock the syscall. The diff below protects the fields using > ps_mtx even though I initially used a dedicated ps_pledge_mtx. > unveil_destroy() needs to be moved after the critical section. > regress/sys/kern/pledge looks happy with this. The sys/syscall_mi.h > change can be committed in a separate step. > > Input and oks welcome. This looks nice. I doubt there's any existing program where you can really test this. Even firefox and chromium should do things correctly. Maybe you should write a regress test that tries to break the kernel. > Index: arch/amd64/amd64/vmm.c > === > RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v > retrieving revision 1.315 > diff -u -p -r1.315 vmm.c > --- arch/amd64/amd64/vmm.c27 Jun 2022 15:12:14 - 1.315 > +++ arch/amd64/amd64/vmm.c28 Jun 2022 13:54:25 - > @@ -713,7 +713,7 @@ pledge_ioctl_vmm(struct proc *p, long co > case VMM_IOC_CREATE: > case VMM_IOC_INFO: > /* The "parent" process in vmd forks and manages VMs */ > - if (p->p_p->ps_pledge & PLEDGE_PROC) > + if (pledge_get(p->p_p) & PLEDGE_PROC) > return (0); > break; > case VMM_IOC_TERM: > @@ -1312,7 +1312,7 @@ vm_find(uint32_t id, struct vm **res) >* The managing vmm parent process can lookup all >* all VMs and is indicated by PLEDGE_PROC. >*/ > - if (((p->p_p->ps_pledge & > + if (((pledge_get(p->p_p) & > (PLEDGE_VMM | PLEDGE_PROC)) == PLEDGE_VMM) && > (vm->vm_creator_pid != p->p_p->ps_pid)) > return (pledge_fail(p, EPERM, PLEDGE_VMM)); > Index: kern/init_sysent.c > === > RCS file: /home/cvs/src/sys/kern/init_sysent.c,v > retrieving revision 1.238 > diff -u -p -r1.238 init_sysent.c > --- kern/init_sysent.c27 Jun 2022 14:26:05 - 1.238 > +++ kern/init_sysent.c28 Jun 2022 15:18:25 - > @@ -1,10 +1,10 @@ > -/* $OpenBSD: init_sysent.c,v 1.238 2022/06/27 14:26:05 cheloha Exp $ > */ > +/* $OpenBSD$ */ > > /* > * System call switch table. > * > * DO NOT EDIT-- this file is automatically generated. 
> - * created from; OpenBSD: syscalls.master,v 1.224 2022/05/16 07:36:04 > mvs Exp > + * created from; OpenBSD: syscalls.master,v 1.225 2022/06/27 14:26:05 > cheloha Exp > */ > > #include > @@ -248,7 +248,7 @@ const struct sysent sysent[] = { > sys_listen }, /* 106 = listen */ > { 4, s(struct sys_chflagsat_args), 0, > sys_chflagsat },/* 107 = chflagsat */ > - { 2, s(struct sys_pledge_args), 0, > + { 2, s(struct sys_pledge_args), SY_NOLOCK | 0, > sys_pledge }, /* 108 = pledge */ > { 4, s(struct sys_ppoll_args), 0, > sys_ppoll },/* 109 = ppoll */ > Index: kern/kern_event.c > === > RCS file: /home/cvs/src/sys/kern/kern_event.c,v > retrieving revision 1.191 > diff -u -p -r1.191 kern_event.c > --- kern/kern_event.c 27 Jun 2022 13:35:21 - 1.191 > +++ kern/kern_event.c 28 Jun 2022 13:55:18 - > @@ -331,7 +331,7 @@ filt_procattach(struct knote *kn) > int s; > > if ((curproc->p_p->ps_flags & PS_PLEDGE) && > - (curproc->p_p->ps_pledge & PLEDGE_PROC) == 0) > + (pledge_get(curproc->p_p) & PLEDGE_PROC) == 0) > return pledge_fail(curproc, EPERM, PLEDGE_PROC); > > if (kn->kn_id > PID_MAX) > Index: kern/kern_pledge.c > === > RCS file: /home/cvs/src/sys/kern/kern_pledge.c,v > retrieving revision 1.282 > diff -u -p -r1.282 kern_pledge.c > --- kern/kern_pledge.c26 Jun 2022 06:11:49 - 1.282 > +++ kern/kern_pledge.c28 Jun 2022 15:21:46 - > @@ -21,6 +21,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -465,13 +466,26 @@ sys_pledge(struct proc *p, void *v, regi > struct process *pr = p->p_p; > uint64_t
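The diff is truncated above and the pledge_get() accessor it relies on is not shown in what survives here. A plausible sketch of it, given the description (ps_pledge is 64 bits and cannot be read atomically on all architectures, so reads go through ps_mtx), would be:

uint64_t
pledge_get(struct process *pr)
{
	uint64_t pledge;

	mtx_enter(&pr->ps_mtx);
	pledge = pr->ps_pledge;
	mtx_leave(&pr->ps_mtx);

	return pledge;
}

This is hypothetical: the real accessor may differ in detail, but every ps_pledge read in the callers above funnels through it.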
Re: Fix the swapper
On 27/06/22(Mon) 15:44, Martin Pieuchot wrote: > Diff below contain 3 parts that can be committed independently. The 3 > of them are necessary to allow the pagedaemon to make progress in OOM > situation and to satisfy all the allocations waiting for pages in > specific ranges. > > * uvm/uvm_pager.c part reserves a second segment for the page daemon. > This is necessary to ensure the two uvm_pagermapin() calls needed by > uvm_swap_io() succeed in emergency OOM situation. (the 2nd segment is > necessary when encryption or bouncing is required) > > * uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region > for the same reason. Note that a sleeping point is introduced because > the pagedaemon is faster than the asynchronous I/O and in OOM > situation it tends to stay busy building cluster that it then discard > because no memory is available. > > * uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive > list of pages to account for a given memory range. Without this the > daemon could spin infinitely doing nothing because the global limits > are reached. Here's an updated diff with a fix on top: * in uvm/uvm_swap.c make sure uvm_swap_allocpages() is allowed to sleep when coming from uvm_fault(). This makes the faulting process wait instead of dying when there isn't any free pages to do the bouncing. I'd appreciate more reviews and tests ! Index: uvm/uvm_pager.c === RCS file: /cvs/src/sys/uvm/uvm_pager.c,v retrieving revision 1.80 diff -u -p -r1.80 uvm_pager.c --- uvm/uvm_pager.c 28 Jun 2022 12:10:37 - 1.80 +++ uvm/uvm_pager.c 28 Jun 2022 15:25:30 - @@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[] * The number of uvm_pseg instances is dynamic using an array segs. * At most UVM_PSEG_COUNT instances can exist. * - * psegs[0] always exists (so that the pager can always map in pages). - * psegs[0] element 0 is always reserved for the pagedaemon. + * psegs[0/1] always exist (so that the pager can always map in pages). + * psegs[0/1] element 0 are always reserved for the pagedaemon. * * Any other pseg is automatically created when no space is available * and automatically destroyed when it is no longer in use. @@ -93,6 +93,7 @@ uvm_pager_init(void) /* init pager map */ uvm_pseg_init([0]); + uvm_pseg_init([1]); mtx_init(_pseg_lck, IPL_VM); /* init ASYNC I/O queue */ @@ -168,9 +169,10 @@ pager_seg_restart: goto pager_seg_fail; } - /* Keep index 0 reserved for pagedaemon. */ - if (pseg == [0] && curproc != uvm.pagedaemon_proc) - i = 1; + /* Keep indexes 0,1 reserved for pagedaemon. 
*/ + if ((pseg == [0] || pseg == [1]) && + (curproc != uvm.pagedaemon_proc)) + i = 2; else i = 0; @@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr) pseg->use &= ~(1 << id); wakeup(); - if (pseg != [0] && UVM_PSEG_EMPTY(pseg)) { + if ((pseg != [0] && pseg != [1]) && UVM_PSEG_EMPTY(pseg)) { va = pseg->start; pseg->start = 0; } Index: uvm/uvm_pdaemon.c === RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v retrieving revision 1.99 diff -u -p -r1.99 uvm_pdaemon.c --- uvm/uvm_pdaemon.c 12 May 2022 12:49:31 - 1.99 +++ uvm/uvm_pdaemon.c 28 Jun 2022 13:59:49 - @@ -101,8 +101,8 @@ extern void drmbackoff(long); * local prototypes */ -void uvmpd_scan(void); -boolean_t uvmpd_scan_inactive(struct pglist *); +void uvmpd_scan(struct uvm_pmalloc *); +boolean_t uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *); void uvmpd_tune(void); void uvmpd_drop(struct pglist *); @@ -281,7 +281,7 @@ uvm_pageout(void *arg) if (pma != NULL || ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) || ((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) { - uvmpd_scan(); + uvmpd_scan(pma); } /* @@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg) */ boolean_t -uvmpd_scan_inactive(struct pglist *pglst) +uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst) { boolean_t retval = FALSE; /* assume we haven't hit target */ int free, result; struct vm_page *p, *nextpg; struct uvm_object *uobj; - struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp; + struct vm_page *pps[SWCLUSTPAGES], **ppsp; int npages; -
Introduce uvm_pagewait()
I'd like to abstract the use of PG_WANTED to start unifying & cleaning
the various cases where a code path is waiting for a busy page. Here's
the first step.

ok?

Index: uvm/uvm_amap.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.90
diff -u -p -r1.90 uvm_amap.c
--- uvm/uvm_amap.c	30 Aug 2021 16:59:17 -0000	1.90
+++ uvm/uvm_amap.c	28 Jun 2022 11:53:08 -0000
@@ -781,9 +781,7 @@ ReStart:
 		 * it and then restart.
 		 */
 		if (pg->pg_flags & PG_BUSY) {
-			atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-			rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
-			    "cownow", INFSLP);
+			uvm_pagewait(pg, amap->am_lock, "cownow");
 			goto ReStart;
 		}
 
Index: uvm/uvm_aobj.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c	29 Dec 2021 20:22:06 -0000	1.103
+++ uvm/uvm_aobj.c	28 Jun 2022 11:53:08 -0000
@@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
 	while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
 		pmap_page_protect(pg, PROT_NONE);
 		if (pg->pg_flags & PG_BUSY) {
-			atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-			rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
-			    INFSLP);
+			uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
+			rw_enter(uobj->vmobjlock, RW_WRITE);
 			continue;
 		}
 		uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
@@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
 
 		/* Make sure page is unbusy, else wait for it. */
 		if (pg->pg_flags & PG_BUSY) {
-			atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-			rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
-			    INFSLP);
+			uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
+			rw_enter(uobj->vmobjlock, RW_WRITE);
 			curoff -= PAGE_SIZE;
 			continue;
 		}
@@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t
 
 			/* page is there, see if we need to wait on it */
 			if ((ptmp->pg_flags & PG_BUSY) != 0) {
-				atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
-				rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
-				    "uao_get", INFSLP);
+				uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
+				rw_enter(uobj->vmobjlock, RW_WRITE);
 				continue;	/* goto top of pps while loop */
 			}
 
Index: uvm/uvm_km.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_km.c
--- uvm/uvm_km.c	7 Jun 2022 12:07:45 -0000	1.150
+++ uvm/uvm_km.c	28 Jun 2022 11:53:08 -0000
@@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
 	for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
 		pp = uvm_pagelookup(uobj, curoff);
 		if (pp && pp->pg_flags & PG_BUSY) {
-			atomic_setbits_int(&pp->pg_flags, PG_WANTED);
-			rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
-			    INFSLP);
+			uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
+			rw_enter(uobj->vmobjlock, RW_WRITE);
 			curoff -= PAGE_SIZE; /* loop back to us */
 			continue;
 		}
Index: uvm/uvm_page.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c	12 May 2022 12:48:36 -0000	1.166
+++ uvm/uvm_page.c	28 Jun 2022 11:57:42 -0000
@@ -1087,6 +1087,23 @@ uvm_page_unbusy(struct vm_page **pgs, in
 	}
 }
 
+/*
+ * uvm_pagewait: wait for a busy page
+ *
+ * => page must be known PG_BUSY
+ * => object must be locked
+ * => object will be unlocked on return
+ */
+void
+uvm_pagewait(struct vm_page *pg, struct rwlock *lock, const char *wmesg)
+{
+	KASSERT(rw_lock_held(lock));
+	KASSERT((pg->pg_flags & PG_BUSY) != 0);
+
+	atomic_setbits_int(&pg->pg_flags, PG_WANTED);
+	rwsleep_nsec(pg, lock, PVM | PNORELOCK, wmesg, INFSLP);
+}
+
 #if defined(UVM_PAGE_TRKOWN)
 /*
  * uvm_page_own: set or
Re: kernel lock in arp
On 27/06/22(Mon) 19:11, Alexander Bluhm wrote: > On Mon, Jun 27, 2022 at 11:49:23AM +0200, Alexander Bluhm wrote: > > On Sat, May 21, 2022 at 10:50:28PM +0300, Vitaliy Makkoveev wrote: > > > This diff looks good, except the re-check after kernel lock. It???s > > > supposed `rt??? could became inconsistent, right? But what stops to > > > make it inconsistent after first unlocked RTF_LLINFO flag check? > > > > > > I think this re-check should gone. > > > > I have copied the re-check from intenal genua code. I am not sure > > if it is really needed. We know from Hrvoje that the diff with > > re-check is stable. And we know that it crashes without kernel > > lock at all. > > > > I have talked with mpi@ about it. The main problem is that we have > > no write lock when we change RTF_LLINFO. Then rt_llinfo can get > > NULL or inconsistent. > > > > Plan is that I put some lock asserts into route add and delete. > > This helps to find the parts that modify RTF_LLINFO and rt_llinfo > > without exclusive lock. > > > > Maybe we need some kernel lock somewhere else. Or we want to use > > some ARP mutex. We could also add some comment and commit the diff > > that I have. We know that it is faster and stable. Pushing the > > kernel lock down or replacing it with something clever can always > > be done later. > > We need the re-check. I have tested it with a printf. It is > triggered by running arp -d in a loop while forwarding. > > The concurrent threads are these: > > rtrequest_delete(8000246b7428,3,80775048,8000246b7510,0) at > rtrequest_delete+0x67 > rtdeletemsg(fd8834a23550,80775048,0) at rtdeletemsg+0x1ad > rtrequest(b,8000246b7678,3,8000246b7718,0) at rtrequest+0x55c > rt_clone(8000246b7780,8000246b78f8,0) at rt_clone+0x73 > rtalloc_mpath(8000246b78f8,fd8003169ad8,0) at rtalloc_mpath+0x4c > ip_forward(fd80b8cc7e00,8077d048,fd8834a230f0,0) at > ip_forward+0x137 > ip_input_if(8000246b7a28,8000246b7a34,4,0,8077d048) at > ip_input_if+0x353 > ipv4_input(8077d048,fd80b8cc7e00) at ipv4_input+0x39 > ether_input(8077d048,fd80b8cc7e00) at ether_input+0x3ad > if_input_process(8077d048,8000246b7b18) at if_input_process+0x6f > ifiq_process(8077d458) at ifiq_process+0x69 > taskq_thread(80036080) at taskq_thread+0x100 > > rtrequest_delete(8000246c8d08,3,80775048,8000246c8df0,0) at > rtrequest_delete+0x67 > rtdeletemsg(fd8834a230f0,80775048,0) at rtdeletemsg+0x1ad > rtrequest(b,8000246c8f58,3,8000246c8ff8,0) at rtrequest+0x55c > rt_clone(8000246c9060,8000246c90b8,0) at rt_clone+0x73 > rtalloc_mpath(8000246c90b8,fd8002c754d8,0) at rtalloc_mpath+0x4c > in_ouraddr(fd8094771b00,8077d048,8000246c9138) at > in_ouraddr+0x84 > ip_input_if(8000246c91d8,8000246c91e4,4,0,8077d048) at > ip_input_if+0x1cd > ipv4_input(8077d048,fd8094771b00) at ipv4_input+0x39 > ether_input(8077d048,fd8094771b00) at ether_input+0x3ad > if_input_process(8077d048,8000246c92c8) at if_input_process+0x6f > ifiq_process(80781400) at ifiq_process+0x69 > taskq_thread(80036200) at taskq_thread+0x100 > > I have added a comment why kernel lock protects us. I would like > to get this in. It has been tested, reduces the kernel lock and > is faster. A more clever lock can be done later. > > ok? I don't understand how the KERNEL_LOCK() there prevents rtdeletemsg() from running. rtrequest_delete() seems completely broken it assumes it holds an exclusive lock. To "fix" arp the KERNEL_LOCK() should also be taken in RTM_DELETE and RTM_RESOLVE inside arp_rtrequest(). 
Or maybe around ifp->if_rtrequest() But it doesn't mean there isn't another problem in rtdeletemsg()... > Index: net/if_ethersubr.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v > retrieving revision 1.281 > diff -u -p -r1.281 if_ethersubr.c > --- net/if_ethersubr.c26 Jun 2022 21:19:53 - 1.281 > +++ net/if_ethersubr.c27 Jun 2022 16:55:15 - > @@ -221,10 +221,7 @@ ether_resolve(struct ifnet *ifp, struct > > switch (af) { > case AF_INET: > - KERNEL_LOCK(); > - /* XXXSMP there is a MP race in arpresolve() */ > error = arpresolve(ifp, rt, m, dst, eh->ether_dhost); > - KERNEL_UNLOCK(); > if (error) > return (error); > eh->ether_type = htons(ETHERTYPE_IP); > @@ -285,10 +282,7 @@ ether_resolve(struct ifnet *ifp, struct > break; > #endif > case AF_INET: > - KERNEL_LOCK(); > - /* XXXSMP there is a MP race in arpresolve() */ > error = arpresolve(ifp, rt, m, dst, eh->ether_dhost); > -
CoW & neighbor pages
When faulting a page after COW, neighborhood pages are likely to be
entered already. So speed up the fault by doing a narrow fault (do not
try to map in adjacent pages).

This is stolen from NetBSD.

ok?

Index: uvm/uvm_fault.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.129
diff -u -p -r1.129 uvm_fault.c
--- uvm/uvm_fault.c	4 Apr 2022 09:27:05 -0000	1.129
+++ uvm/uvm_fault.c	27 Jun 2022 17:05:26 -0000
@@ -737,6 +737,16 @@ uvm_fault_check(struct uvm_faultinfo *uf
 	}
 
 	/*
+	 * for a case 2B fault waste no time on adjacent pages because
+	 * they are likely already entered.
+	 */
+	if (uobj != NULL && amap != NULL &&
+	    (flt->access_type & PROT_WRITE) != 0) {
+		/* wide fault (!narrow) */
+		flt->narrow = TRUE;
+	}
+
+	/*
 	 * establish range of interest based on advice from mapper
 	 * and then clip to fit map entry. note that we only want
 	 * to do this the first time through the fault. if we
Fix the swapper
Diff below contain 3 parts that can be committed independently. The 3 of them are necessary to allow the pagedaemon to make progress in OOM situation and to satisfy all the allocations waiting for pages in specific ranges. * uvm/uvm_pager.c part reserves a second segment for the page daemon. This is necessary to ensure the two uvm_pagermapin() calls needed by uvm_swap_io() succeed in emergency OOM situation. (the 2nd segment is necessary when encryption or bouncing is required) * uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region for the same reason. Note that a sleeping point is introduced because the pagedaemon is faster than the asynchronous I/O and in OOM situation it tends to stay busy building cluster that it then discard because no memory is available. * uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive list of pages to account for a given memory range. Without this the daemon could spin infinitely doing nothing because the global limits are reached. At lot could be improved, but this at least makes swapping work in OOM situations. Index: uvm/uvm_pager.c === RCS file: /cvs/src/sys/uvm/uvm_pager.c,v retrieving revision 1.78 diff -u -p -r1.78 uvm_pager.c --- uvm/uvm_pager.c 18 Feb 2022 09:04:38 - 1.78 +++ uvm/uvm_pager.c 27 Jun 2022 08:44:41 - @@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[] * The number of uvm_pseg instances is dynamic using an array segs. * At most UVM_PSEG_COUNT instances can exist. * - * psegs[0] always exists (so that the pager can always map in pages). - * psegs[0] element 0 is always reserved for the pagedaemon. + * psegs[0/1] always exist (so that the pager can always map in pages). + * psegs[0/1] element 0 are always reserved for the pagedaemon. * * Any other pseg is automatically created when no space is available * and automatically destroyed when it is no longer in use. @@ -93,6 +93,7 @@ uvm_pager_init(void) /* init pager map */ uvm_pseg_init([0]); + uvm_pseg_init([1]); mtx_init(_pseg_lck, IPL_VM); /* init ASYNC I/O queue */ @@ -168,9 +169,10 @@ pager_seg_restart: goto pager_seg_fail; } - /* Keep index 0 reserved for pagedaemon. */ - if (pseg == [0] && curproc != uvm.pagedaemon_proc) - i = 1; + /* Keep indexes 0,1 reserved for pagedaemon. 
*/ + if ((pseg == [0] || pseg == [1]) && + (curproc != uvm.pagedaemon_proc)) + i = 2; else i = 0; @@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr) pseg->use &= ~(1 << id); wakeup(); - if (pseg != [0] && UVM_PSEG_EMPTY(pseg)) { + if ((pseg != [0] && pseg != [1]) && UVM_PSEG_EMPTY(pseg)) { va = pseg->start; pseg->start = 0; } Index: uvm/uvm_pdaemon.c === RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v retrieving revision 1.99 diff -u -p -r1.99 uvm_pdaemon.c --- uvm/uvm_pdaemon.c 12 May 2022 12:49:31 - 1.99 +++ uvm/uvm_pdaemon.c 27 Jun 2022 13:24:54 - @@ -101,8 +101,8 @@ extern void drmbackoff(long); * local prototypes */ -void uvmpd_scan(void); -boolean_t uvmpd_scan_inactive(struct pglist *); +void uvmpd_scan(struct uvm_pmalloc *); +boolean_t uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *); void uvmpd_tune(void); void uvmpd_drop(struct pglist *); @@ -281,7 +281,7 @@ uvm_pageout(void *arg) if (pma != NULL || ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) || ((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) { - uvmpd_scan(); + uvmpd_scan(pma); } /* @@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg) */ boolean_t -uvmpd_scan_inactive(struct pglist *pglst) +uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst) { boolean_t retval = FALSE; /* assume we haven't hit target */ int free, result; struct vm_page *p, *nextpg; struct uvm_object *uobj; - struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp; + struct vm_page *pps[SWCLUSTPAGES], **ppsp; int npages; - struct vm_page *swpps[MAXBSIZE >> PAGE_SHIFT]; /* XXX: see below */ + struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */ int swnpages, swcpages; /* XXX: see below */ int swslot; struct vm_anon *anon; @@ -404,8 +404,27 @@ uvmpd_scan_inactive(struct pglist *pglst swnpages = swcpages = 0; free = 0; dirtyreacts = 0; + p = NULL; - for (p
pdaemon: reserve memory for swapping
uvm_swap_io() needs to perform up to 4 allocations to write pages to
disk. In OOM situation uvm_swap_allocpages() always fails because the
kernel doesn't reserve enough pages.

Diff below sets `uvmexp.reserve_pagedaemon' to the number of pages
needed to write a cluster of pages to disk. With this my machine does
not deadlock and can push pages to swap in OOM case.

ok?

Index: uvm/uvm_page.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c	12 May 2022 12:48:36 -0000	1.166
+++ uvm/uvm_page.c	26 Jun 2022 08:17:34 -0000
@@ -280,10 +280,13 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 	/*
 	 * init reserve thresholds
-	 * XXXCDC - values may need adjusting
+	 *
+	 * The pagedaemon needs to always be able to write pages to disk,
+	 * Reserve the minimum amount of pages, a cluster, required by
+	 * uvm_swap_allocpages()
 	 */
-	uvmexp.reserve_pagedaemon = 4;
-	uvmexp.reserve_kernel = 8;
+	uvmexp.reserve_pagedaemon = (MAXBSIZE >> PAGE_SHIFT);
+	uvmexp.reserve_kernel = uvmexp.reserve_pagedaemon + 4;
 
 	uvmexp.anonminpct = 10;
 	uvmexp.vnodeminpct = 10;
 	uvmexp.vtextminpct = 5;
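To make the new sizing concrete: on a platform where MAXBSIZE is 64KB and pages are 4KB, the reserve grows from 4 to 16 pages for the pagedaemon (and from 8 to 20 for the kernel). A standalone sketch of the arithmetic, with those platform values assumed:

#include <stdio.h>

#define MAXBSIZE	(64 * 1024)	/* assumed, MD in the kernel */
#define PAGE_SHIFT	12		/* assumed 4KB pages */

int
main(void)
{
	int reserve_pagedaemon = MAXBSIZE >> PAGE_SHIFT;
	int reserve_kernel = reserve_pagedaemon + 4;

	printf("pagedaemon: %d pages, kernel: %d pages\n",
	    reserve_pagedaemon, reserve_kernel);	/* 16 and 20 */
	return 0;
}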
Re: ssh-add(1): fix NULL in fprintf
ping, diff attached

On Mon, May 16, 2022 at 09:21:42PM +0200, Martin Vahlensieck wrote:
> Hi
> 
> What's the status on this? Anything required from my side? I have
> reattached the patch (with the changes Theo suggested).
> 
> Best,
> 
> Martin
> 
> On Mon, May 09, 2022 at 08:39:38PM +0200, Martin Vahlensieck wrote:
> > On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> > > Martin Vahlensieck wrote:
> > > 
> > > > 	if (!qflag) {
> > > > -		fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > > -		    sshkey_type(key), comment);
> > > > +		fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > > > +		    sshkey_type(key), comment ? " (" : "",
> > > > +		    comment ? comment : "", comment ? ")" : "");
> > > 
> > > this is probably better as something like
> > > 
> > > > -		fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > > -		    sshkey_type(key), comment ? comment : "no comment");
> > > 
> > > Which has a minor ambiguity, but probably harms noone.
> > 
> > Index: ssh-add.c
> > ===================================================================
> > RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> > retrieving revision 1.165
> > diff -u -p -r1.165 ssh-add.c
> > --- ssh-add.c	4 Feb 2022 02:49:17 -0000	1.165
> > +++ ssh-add.c	9 May 2022 18:36:54 -0000
> > @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
> >  	}
> >  	if (!qflag) {
> >  		fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -		    sshkey_type(key), comment);
> > +		    sshkey_type(key), comment ? comment : "no comment");
> >  	}
> >  	return 0;
> >  }
> > @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
> >  		    certpath, filename);
> >  		sshkey_free(cert);
> >  		goto out;
> > -	}
> > +	}
> >  
> >  	/* Graft with private bits */
> >  	if ((r = sshkey_to_certified(private)) != 0) {

Index: ssh-add.c
===================================================================
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c	4 Feb 2022 02:49:17 -0000	1.165
+++ ssh-add.c	9 May 2022 18:36:54 -0000
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
 	}
 	if (!qflag) {
 		fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-		    sshkey_type(key), comment);
+		    sshkey_type(key), comment ? comment : "no comment");
 	}
 	return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
 		    certpath, filename);
 		sshkey_free(cert);
 		goto out;
-	}
+	}
 
 	/* Graft with private bits */
 	if ((r = sshkey_to_certified(private)) != 0) {
Re: set RTF_DONE in sysctl_dumpentry for the routing table
On 08/06/22(Wed) 16:13, Claudio Jeker wrote:
> Noticed while hacking in OpenBGPD.  Unlike routing socket messages the
> messages from the sysctl interface have RTF_DONE not set.
> I think it would make sense to set RTF_DONE also in this case since it
> makes reusing code easier.
> 
> All messages sent out via sysctl_dumpentry() have been processed by the
> kernel so setting RTF_DONE kind of makes sense.

I agree, ok mpi@

> -- 
> :wq Claudio
> 
> Index: rtsock.c
> ===================================================================
> RCS file: /cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.328
> diff -u -p -r1.328 rtsock.c
> --- rtsock.c	6 Jun 2022 14:45:41 -0000	1.328
> +++ rtsock.c	8 Jun 2022 14:10:20 -0000
> @@ -1987,7 +1987,7 @@ sysctl_dumpentry(struct rtentry *rt, voi
>  	struct rt_msghdr *rtm = (struct rt_msghdr *)w->w_tmem;
> 
>  	rtm->rtm_pid = curproc->p_p->ps_pid;
> -	rtm->rtm_flags = rt->rt_flags;
> +	rtm->rtm_flags = RTF_DONE | rt->rt_flags;
>  	rtm->rtm_priority = rt->rt_priority & RTP_MASK;
>  	rtm_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx);
>  	/* Do not account the routing table's reference. */
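A sketch of what this buys consumers such as OpenBGPD: one completeness
check now works for messages from both the routing socket and the sysctl
dump (hypothetical helper, not part of the diff):

#include <sys/types.h>
#include <sys/socket.h>
#include <net/route.h>

/*
 * Hypothetical consumer-side helper: with RTF_DONE also set by
 * sysctl_dumpentry(), the same test covers kernel-processed
 * messages from the routing socket and from the sysctl(3) dump.
 */
static int
rtmsg_done(const struct rt_msghdr *rtm)
{
	return (rtm->rtm_flags & RTF_DONE) != 0;
}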
Re: Fix clearing of sleep timeouts
On 06/06/22(Mon) 06:47, David Gwynne wrote: > On Sun, Jun 05, 2022 at 03:57:39PM +, Visa Hankala wrote: > > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote: > > > On 05/06/22(Sun) 05:20, Visa Hankala wrote: > > > > Encountered the following panic: > > > > > > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" > > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373 > > > > Stopped at db_enter+0x10: popq%rbp > > > > TIDPIDUID PRFLAGS PFLAGS CPU COMMAND > > > > 423109 57118 55 0x3 02 link > > > > 330695 30276 55 0x3 03 link > > > > * 46366 85501 55 0x1003 0x40804001 link > > > > 188803 85501 55 0x1003 0x40820000K link > > > > db_enter() at db_enter+0x10 > > > > panic(81f25d2b) at panic+0xbf > > > > __assert(81f9a186,81f372c8,175,81f87c6c) at > > > > __assert+0x25 > > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at > > > > sleep_setup+0x1d8 > > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46 > > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109 > > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2 > > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d > > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2 > > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at > > > > sys_nanosleep+0x12d > > > > syscall(800022d64f60) at syscall+0x374 > > > > > > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously, > > > > soft-interrupt-driven timeouts could be deleted synchronously without > > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type > > > > of the timeout. > > > > > > > > It looks that with small adjustments timeout_del_barrier() can sleep > > > > in sleep_finish(). The management of run queues is not affected because > > > > the timeout clearing happens after it. As timeout_del_barrier() does not > > > > rely on a timeout or signal catching, there should be no risk of > > > > unbounded recursion or unwanted signal side effects within the sleep > > > > machinery. In a way, a sleep with a timeout is higher-level than > > > > one without. > > > > > > I trust you on the analysis. However this looks very fragile to me. > > > > > > The use of timeout_del_barrier() which can sleep using the global sleep > > > queue is worrying me. > > > > I think the queue handling ends in sleep_finish() when SCHED_LOCK() > > is released. The timeout clearing is done outside of it. > > That's ok. > > > The extra sleeping point inside sleep_finish() is subtle. It should not > > matter in typical use. But is it permissible with the API? Also, if > > timeout_del_barrier() sleeps, the thread's priority can change. > > What other options do we have at this point? Spin? Allocate the timeout > dynamically so sleep_finish doesn't have to wait for it and let the > handler clean up? How would you stop the timeout handler waking up the > thread if it's gone back to sleep again for some other reason? > > Sleeping here is the least worst option. I agree. I don't think sleeping is bad here. My concern is about how sleeping is implemented. There's a single API built on top of a single global data structure which now calls itself recursively. I'm not sure how much work it would be to make cond_wait(9) use its own sleep queue... This is something independent from this fix though. > As for timeout_del_barrier, if prio is a worry we can provide an > advanced version of it that lets you pass the prio in. 
> I'd also
> like to change timeout_barrier so it queues the barrier task at the
> head of the pending lists rather than at the tail.

I doubt prio matters.
Re: Fix clearing of sleep timeouts
On 05/06/22(Sun) 05:20, Visa Hankala wrote: > Encountered the following panic: > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: > file "/usr/src/sys/kern/kern_synch.c", line 373 > Stopped at db_enter+0x10: popq%rbp > TIDPIDUID PRFLAGS PFLAGS CPU COMMAND > 423109 57118 55 0x3 02 link > 330695 30276 55 0x3 03 link > * 46366 85501 55 0x1003 0x40804001 link > 188803 85501 55 0x1003 0x40820000K link > db_enter() at db_enter+0x10 > panic(81f25d2b) at panic+0xbf > __assert(81f9a186,81f372c8,175,81f87c6c) at > __assert+0x25 > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at > sleep_setup+0x1d8 > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46 > timeout_barrier(8000228a28b0) at timeout_barrier+0x109 > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2 > sleep_finish(800022d64d90,1) at sleep_finish+0x16d > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2 > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at > sys_nanosleep+0x12d > syscall(800022d64f60) at syscall+0x374 > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously, > soft-interrupt-driven timeouts could be deleted synchronously without > blocking. Now, timeout_del_barrier() can sleep regardless of the type > of the timeout. > > It looks that with small adjustments timeout_del_barrier() can sleep > in sleep_finish(). The management of run queues is not affected because > the timeout clearing happens after it. As timeout_del_barrier() does not > rely on a timeout or signal catching, there should be no risk of > unbounded recursion or unwanted signal side effects within the sleep > machinery. In a way, a sleep with a timeout is higher-level than > one without. I trust you on the analysis. However this looks very fragile to me. The use of timeout_del_barrier() which can sleep using the global sleep queue is worrying me. > Note that endtsleep() can run and set P_TIMEOUT during > timeout_del_barrier() when the thread is blocked in cond_wait(). > To avoid unnecessary atomic read-modify-write operations, the clearing > of P_TIMEOUT could be conditional, but maybe that is an unnecessary > optimization at this point. I agree this optimization seems unnecessary at the moment. > While it should be possible to make the code use timeout_del() instead > of timeout_del_barrier(), the outcome might not be outright better. For > example, sleep_setup() and endtsleep() would have to coordinate so that > a late-running timeout from previous sleep cycle would not disturb the > new cycle. So that's the price for not having to sleep in sleep_finish(), right? > To test the barrier path reliably, I made the code call > timeout_del_barrier() twice in a row. The second call is guaranteed > to sleep. Of course, this is not part of the patch. 
ok mpi@

> Index: kern/kern_synch.c
> ===================================================================
> RCS file: src/sys/kern/kern_synch.c,v
> retrieving revision 1.187
> diff -u -p -r1.187 kern_synch.c
> --- kern/kern_synch.c	13 May 2022 15:32:00 -0000	1.187
> +++ kern/kern_synch.c	5 Jun 2022 05:04:45 -0000
> @@ -370,8 +370,8 @@ sleep_setup(struct sleep_state *sls, con
>  	p->p_slppri = prio & PRIMASK;
>  	TAILQ_INSERT_TAIL(&slpque[LOOKUP(ident)], p, p_runq);
> 
> -	KASSERT((p->p_flag & P_TIMEOUT) == 0);
>  	if (timo) {
> +		KASSERT((p->p_flag & P_TIMEOUT) == 0);
>  		sls->sls_timeout = 1;
>  		timeout_add(&p->p_sleep_to, timo);
>  	}
> @@ -432,13 +432,12 @@ sleep_finish(struct sleep_state *sls, in
> 
>  	if (sls->sls_timeout) {
>  		if (p->p_flag & P_TIMEOUT) {
> -			atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>  			error1 = EWOULDBLOCK;
>  		} else {
> -			/* This must not sleep. */
> +			/* This can sleep. It must not use timeouts. */
>  			timeout_del_barrier(&p->p_sleep_to);
> -			KASSERT((p->p_flag & P_TIMEOUT) == 0);
>  		}
> +		atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>  	}
> 
>  	/* Check if thread was woken up because of a unwind or signal */
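The test Visa describes above, sketched against the patched sleep_finish()
(illustration only, not part of the diff):

	/*
	 * Sketch of the test described above: after the first
	 * timeout_del_barrier() returns, the timeout can no longer be
	 * pending or running, so a second call is guaranteed to take
	 * the barrier path and sleep in cond_wait().
	 */
	if (sls->sls_timeout) {
		if (p->p_flag & P_TIMEOUT) {
			error1 = EWOULDBLOCK;
		} else {
			timeout_del_barrier(&p->p_sleep_to);
			timeout_del_barrier(&p->p_sleep_to);	/* test only */
		}
		atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
	}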
Re: start unlocking kbind(2)
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > +	cookie = SCARG(uap, proc_cookie);
> > +	if (pr->ps_kbind_addr == pc) {
> > +		membar_datadep_consumer();
> > +		if (pr->ps_kbind_cookie != cookie)
> > +			goto sigill;
> > +	} else {
> 
> You must use membar_consumer() here.  membar_datadep_consumer() is
> a barrier between reading a pointer and the pointed-to data.  Only
> alpha requires membar_datadep_consumer() for that; everywhere else
> it is a NOP.
> 
> > +		mtx_enter(&pr->ps_mtx);
> > +		kpc = pr->ps_kbind_addr;
> 
> Do we need the kpc variable?  I would prefer to read the explicit
> pr->ps_kbind_addr in the two places where we use it.
> 
> I think the logic of barriers and mutexes is correct.
> 
> With the suggestions above OK bluhm@

I believe you should go ahead with the current diff.  ok with me.
Moving the field under the scope of another lock can easily be done
afterward.
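In isolation, the ordering described above looks like this kernel-side
sketch (simplified names, not the committed code): the writer publishes
the cookie before the address, so the lock-free reader needs
membar_consumer() between reading the address and reading the cookie.

#include <sys/types.h>
#include <sys/atomic.h>

struct kb {
	volatile u_long	kb_addr;	/* ps_kbind_addr stand-in */
	volatile u_long	kb_cookie;	/* ps_kbind_cookie stand-in */
};

/* writer: runs under the per-process mutex */
void
kb_publish(struct kb *kb, u_long addr, u_long cookie)
{
	kb->kb_cookie = cookie;
	membar_producer();		/* cookie visible before addr */
	kb->kb_addr = addr;
}

/* lock-free reader: fast path of the syscall */
int
kb_check_fast(struct kb *kb, u_long pc, u_long cookie)
{
	if (kb->kb_addr != pc)
		return -1;		/* slow path: take the mutex */
	membar_consumer();		/* read addr before cookie */
	return kb->kb_cookie == cookie;
}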
Re: allow 240/4 in various network daemons
On Sat, May 28, 2022 at 22:46, Seth David Schoen wrote:
> We're also interested in talking about whether there's an appropriate
> path for supporting non-broadcast use of addresses within 127/8, our
> most controversial change.  In Linux and FreeBSD, we're experimenting

IPv6 is now older than IPv4 was when v6 was introduced.  You are beating
a very dead horse.

Best
Martin
Re: ffs_truncate: Missing uvm_vnp_uncache() w/ softdep
On 24/05/22(Tue) 15:24, Mark Kettenis wrote: > > Date: Tue, 24 May 2022 14:28:39 +0200 > > From: Martin Pieuchot > > > > The softdep code path is missing a UVM cache invalidation compared to > > the !softdep one. This is necessary to flush pages of a persisting > > vnode. > > > > Since uvm_vnp_setsize() is also called later in this function for the > > !softdep case move it to not call it twice. > > > > ok? > > I'm not sure this is correct. I'm trying to understand why you're > moving the uvm_uvn_setsize() call. Are you just trying to call it > twice? Or are you trying to avoid calling it at all when we end up in > an error path? > > The way you moved it means we'll still call it twice for "partially > truncated" files with softdeps. At least the way I understand the > code is that the code will fsync the vnode and dropping down in the > "normal" non-softdep code that will call uvm_vnp_setsize() (and > uvn_vnp_uncache()) again. So maybe you should move the > uvm_uvn_setsize() call into the else case? We might want to do that indeed. I'm not sure what are the implications of calling uvm_vnp_setsize/uncache() after VOP_FSYNC(), which might fail. So I'd rather play safe and go with that diff. > > Index: ufs/ffs/ffs_inode.c > > === > > RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v > > retrieving revision 1.81 > > diff -u -p -r1.81 ffs_inode.c > > --- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 - 1.81 > > +++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 - > > @@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le > > if (length > fs->fs_maxfilesize) > > return (EFBIG); > > > > - uvm_vnp_setsize(ovp, length); > > oip->i_ci.ci_lasta = oip->i_ci.ci_clen > > = oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0; > > > > if (DOINGSOFTDEP(ovp)) { > > + uvm_vnp_setsize(ovp, length); > > + (void) uvm_vnp_uncache(ovp); > > if (length > 0 || softdep_slowdown(ovp)) { > > /* > > * If a file is only partially truncated, then > > > >
Please test: rewrite of pdaemon
Diff below brings in & adapt most of the changes from NetBSD's r1.37 of uvm_pdaemon.c. My motivation for doing this is to untangle the inner loop of uvmpd_scan_inactive() which will allow us to split the global `pageqlock' mutex in a next step. The idea behind this change is to get rid of the too-complex uvm_pager* abstraction by checking early if a page is going to be flushed or swapped to disk. The loop is then clearly divided into two cases which makes it more readable. This also opens the door to a better integration between UVM's vnode layer and the buffer cache. The main loop of uvmpd_scan_inactive() can be understood as below: . If a page can be flushed we can call "uvn_flush()" directly and pass the PGO_ALLPAGES flag instead of building a cluster beforehand. Note that, in its current form uvn_flush() is synchronous. . If the page needs to be swapped, mark it as PG_PAGEOUT, build a cluster and once it is full call uvm_swap_put(). Please test this diff, do not hesitate to play with the `vm.swapencrypt.enable' sysctl(2). Index: uvm/uvm_aobj.c === RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v retrieving revision 1.103 diff -u -p -r1.103 uvm_aobj.c --- uvm/uvm_aobj.c 29 Dec 2021 20:22:06 - 1.103 +++ uvm/uvm_aobj.c 24 May 2022 12:31:34 - @@ -143,7 +143,7 @@ struct pool uvm_aobj_pool; static struct uao_swhash_elt *uao_find_swhash_elt(struct uvm_aobj *, int, boolean_t); -static int uao_find_swslot(struct uvm_object *, int); +int uao_find_swslot(struct uvm_object *, int); static boolean_tuao_flush(struct uvm_object *, voff_t, voff_t, int); static void uao_free(struct uvm_aobj *); @@ -241,7 +241,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob /* * uao_find_swslot: find the swap slot number for an aobj/pageidx */ -inline static int +int uao_find_swslot(struct uvm_object *uobj, int pageidx) { struct uvm_aobj *aobj = (struct uvm_aobj *)uobj; Index: uvm/uvm_aobj.h === RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v retrieving revision 1.17 diff -u -p -r1.17 uvm_aobj.h --- uvm/uvm_aobj.h 21 Oct 2020 09:08:14 - 1.17 +++ uvm/uvm_aobj.h 24 May 2022 12:31:34 - @@ -60,6 +60,7 @@ void uao_init(void); int uao_set_swslot(struct uvm_object *, int, int); +int uao_find_swslot (struct uvm_object *, int); int uao_dropswap(struct uvm_object *, int); int uao_swap_off(int, int); int uao_shrink(struct uvm_object *, int); Index: uvm/uvm_map.c === RCS file: /cvs/src/sys/uvm/uvm_map.c,v retrieving revision 1.291 diff -u -p -r1.291 uvm_map.c --- uvm/uvm_map.c 4 May 2022 14:58:26 - 1.291 +++ uvm/uvm_map.c 24 May 2022 12:31:34 - @@ -3215,8 +3215,9 @@ uvm_object_printit(struct uvm_object *uo * uvm_page_printit: actually print the page */ static const char page_flagbits[] = - "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5CLEANCHK\6RELEASED\7FAKE\10RDONLY" - "\11ZERO\12DEV\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ" + "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5PAGEOUT\6RELEASED\7FAKE\10RDONLY" + "\11ZERO\12DEV\13CLEANCHK" + "\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ" "\27ENCRYPT\31PMAP0\32PMAP1\33PMAP2\34PMAP3\35PMAP4\36PMAP5"; void Index: uvm/uvm_page.c === RCS file: /cvs/src/sys/uvm/uvm_page.c,v retrieving revision 1.166 diff -u -p -r1.166 uvm_page.c --- uvm/uvm_page.c 12 May 2022 12:48:36 - 1.166 +++ uvm/uvm_page.c 24 May 2022 12:32:54 - @@ -960,6 +960,7 @@ uvm_pageclean(struct vm_page *pg) { u_int flags_to_clear = 0; + KASSERT((pg->pg_flags & PG_PAGEOUT) == 0); if ((pg->pg_flags & (PG_TABLED|PQ_ACTIVE|PQ_INACTIVE)) && (pg->uobject == NULL || !UVM_OBJ_IS_PMAP(pg->uobject))) MUTEX_ASSERT_LOCKED(); @@ -978,11 +979,14 @@ 
uvm_pageclean(struct vm_page *pg) rw_write_held(pg->uanon->an_lock)); /* -* if the page was an object page (and thus "TABLED"), remove it -* from the object. +* remove page from its object or anon. */ - if (pg->pg_flags & PG_TABLED) + if (pg->pg_flags & PG_TABLED) { uvm_pageremove(pg); + } else if (pg->uanon != NULL) { + pg->uanon->an_page = NULL; + pg->uanon = NULL; + } /* * now remove the page from the queues @@ -996,10 +1000,6 @@ uvm_pageclean(struct vm_page *pg) pg->wire_count = 0; uvmexp.wired--; } - if (pg->uanon) { - pg->uanon->an_page = NULL; - pg->uanon = NULL; - }
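In outline, the restructured inner loop described above reads like the
following sketch; the swapcluster_* helpers are illustrative names, not
part of the diff.

	/*
	 * Sketch of the two-case scan: decide early whether a page
	 * goes through the vnode layer or to swap, instead of hiding
	 * both cases behind the uvm_pager* abstraction.
	 */
	TAILQ_FOREACH_SAFE(p, pglst, pageq, nextpg) {
		if ((p->pg_flags & PQ_SWAPBACKED) == 0) {
			/*
			 * case 1: the page belongs to a vnode; let
			 * uvn_flush() do the clustering internally
			 * (synchronous in its current form).
			 */
			uvn_flush(p->uobject, 0, 0,
			    PGO_CLEANIT | PGO_ALLPAGES);
		} else {
			/*
			 * case 2: swap-backed; mark the page and add
			 * it to a cluster, written out via
			 * uvm_swap_put() once the cluster is full.
			 */
			atomic_setbits_int(&p->pg_flags, PG_PAGEOUT);
			swapcluster_add(&swc, p);
			if (swapcluster_full(&swc))
				swapcluster_flush(&swc);
		}
	}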
ffs_truncate: Missing uvm_vnp_uncache() w/ softdep
The softdep code path is missing a UVM cache invalidation compared to
the !softdep one.  This is necessary to flush pages of a persisting
vnode.

Since uvm_vnp_setsize() is also called later in this function for the
!softdep case, move it so it isn't called twice.

ok?

Index: ufs/ffs/ffs_inode.c
===================================================================
RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.81
diff -u -p -r1.81 ffs_inode.c
--- ufs/ffs/ffs_inode.c	12 Dec 2021 09:14:59 -0000	1.81
+++ ufs/ffs/ffs_inode.c	4 May 2022 15:32:15 -0000
@@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
 	if (length > fs->fs_maxfilesize)
 		return (EFBIG);
 
-	uvm_vnp_setsize(ovp, length);
 	oip->i_ci.ci_lasta = oip->i_ci.ci_clen
 	    = oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
 
 	if (DOINGSOFTDEP(ovp)) {
+		uvm_vnp_setsize(ovp, length);
+		(void) uvm_vnp_uncache(ovp);
 		if (length > 0 || softdep_slowdown(ovp)) {
 			/*
 			 * If a file is only partially truncated, then
Re: Call uvm_vnp_uncache() before VOP_RENAME()
On 17/05/22(Tue) 16:55, Martin Pieuchot wrote:
> nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
> possibly mmap'ed file before calling VOP_RENAME().
> 
> ok?

Anyone?

> Index: nfs/nfs_serv.c
> ===================================================================
> RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 nfs_serv.c
> --- nfs/nfs_serv.c	11 Mar 2021 13:31:35 -0000	1.120
> +++ nfs/nfs_serv.c	4 May 2022 15:29:06 -0000
> @@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
>  		error = -1;
> out:
>  	if (!error) {
> +		if (tvp) {
> +			(void)uvm_vnp_uncache(tvp);
> +		}
>  		error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
>  		    tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
>  	} else {
Re: start unlocking kbind(2)
On 17/05/22(Tue) 10:44, David Gwynne wrote: > this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall > argument checks can be done without the kernel lock. > > care is taken to allow the pc/cookie checks to run without any lock by > being careful with the order of the checks. all modifications to the > pc/cookie state are serialised by the per process mutex. I don't understand why it is safe to do the following check without holding a mutex: if (pr->ps_kbind_addr == pc) ... Is there much differences when always grabbing the per-process mutex? > i dont know enough about uvm to say whether it is safe to unlock the > actual memory updates too, but even if i was confident i would still > prefer to change it as a separate step. I agree. > Index: kern/init_sysent.c > === > RCS file: /cvs/src/sys/kern/init_sysent.c,v > retrieving revision 1.236 > diff -u -p -r1.236 init_sysent.c > --- kern/init_sysent.c1 May 2022 23:00:04 - 1.236 > +++ kern/init_sysent.c17 May 2022 00:36:03 - > @@ -1,4 +1,4 @@ > -/* $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $ */ > +/* $OpenBSD$ */ > > /* > * System call switch table. > @@ -204,7 +204,7 @@ const struct sysent sysent[] = { > sys_utimensat },/* 84 = utimensat */ > { 2, s(struct sys_futimens_args), 0, > sys_futimens }, /* 85 = futimens */ > - { 3, s(struct sys_kbind_args), 0, > + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0, > sys_kbind },/* 86 = kbind */ > { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0, > sys_clock_gettime },/* 87 = clock_gettime */ > Index: kern/syscalls.master > === > RCS file: /cvs/src/sys/kern/syscalls.master,v > retrieving revision 1.223 > diff -u -p -r1.223 syscalls.master > --- kern/syscalls.master 24 Feb 2022 07:41:51 - 1.223 > +++ kern/syscalls.master 17 May 2022 00:36:03 - > @@ -194,7 +194,7 @@ > const struct timespec *times, int flag); } > 85 STD { int sys_futimens(int fd, \ > const struct timespec *times); } > -86 STD { int sys_kbind(const struct __kbind *param, \ > +86 STD NOLOCK { int sys_kbind(const struct __kbind *param, \ > size_t psize, int64_t proc_cookie); } > 87 STD NOLOCK { int sys_clock_gettime(clockid_t clock_id, \ > struct timespec *tp); } > Index: uvm/uvm_mmap.c > === > RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v > retrieving revision 1.169 > diff -u -p -r1.169 uvm_mmap.c > --- uvm/uvm_mmap.c19 Jan 2022 10:43:48 - 1.169 > +++ uvm/uvm_mmap.c17 May 2022 00:36:03 - > @@ -70,6 +70,7 @@ > #include > #include /* for KBIND* */ > #include > +#include > > #include /* for __LDPGSZ */ > > @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis > const char *data; > vaddr_t baseva, last_baseva, endva, pageoffset, kva; > size_t psize, s; > - u_long pc; > + u_long pc, kpc; > int count, i, extra; > + uint64_t cookie; > int error; > > /* >* extract syscall args from uap >*/ > paramp = SCARG(uap, param); > - psize = SCARG(uap, psize); > > /* a NULL paramp disables the syscall for the process */ > if (paramp == NULL) { > + mtx_enter(>ps_mtx); > if (pr->ps_kbind_addr != 0) > - sigexit(p, SIGILL); > + goto leave_sigill; > pr->ps_kbind_addr = BOGO_PC; > + mtx_leave(>ps_mtx); > return 0; > } > > /* security checks */ > + > + /* > + * ps_kbind_addr can only be set to 0 or BOGO_PC by the > + * kernel, not by a call from userland. 
> + */ > pc = PROC_PC(p); > - if (pr->ps_kbind_addr == 0) { > - pr->ps_kbind_addr = pc; > - pr->ps_kbind_cookie = SCARG(uap, proc_cookie); > - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC) > - sigexit(p, SIGILL); > - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie)) > - sigexit(p, SIGILL); > + if (pc == 0 || pc == BOGO_PC) > + goto sigill; > + > + cookie = SCARG(uap, proc_cookie); > + if (pr->ps_kbind_addr == pc) { > + membar_datadep_consumer(); > + if (pr->ps_kbind_cookie != cookie) > + goto sigill; > + } else { > + mtx_enter(>ps_mtx); > + kpc = pr->ps_kbind_addr; > + > + /* > + * If we're the first thread in (kpc is 0), then > + *
Call uvm_vnp_uncache() before VOP_RENAME()
nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
possibly mmap'ed file before calling VOP_RENAME().

ok?

Index: nfs/nfs_serv.c
===================================================================
RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
retrieving revision 1.120
diff -u -p -r1.120 nfs_serv.c
--- nfs/nfs_serv.c	11 Mar 2021 13:31:35 -0000	1.120
+++ nfs/nfs_serv.c	4 May 2022 15:29:06 -0000
@@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
 		error = -1;
out:
 	if (!error) {
+		if (tvp) {
+			(void)uvm_vnp_uncache(tvp);
+		}
 		error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
 		    tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
 	} else {
Re: ssh-add(1): fix NULL in fprintf
Hi What's the status on this? Anthing required from my side? I have reattached the patch (with the changes Theo suggested). Best, Martin On Mon, May 09, 2022 at 08:39:38PM +0200, Martin Vahlensieck wrote: > On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote: > > Martin Vahlensieck wrote: > > > > > if (!qflag) { > > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path, > > > - sshkey_type(key), comment); > > > + fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path, > > > + sshkey_type(key), comment ? " (" : "", > > > + comment ? comment : "", comment ? ")" : ""); > > > > this is probably better as something like > > > > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path, > > > - sshkey_type(key), comment ? comment : "no comment"); > > > > Which has a minor ambiguity, but probably harms noone. > > > > Index: ssh-add.c > === > RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v > retrieving revision 1.165 > diff -u -p -r1.165 ssh-add.c > --- ssh-add.c 4 Feb 2022 02:49:17 - 1.165 > +++ ssh-add.c 9 May 2022 18:36:54 - > @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss > } > if (!qflag) { > fprintf(stderr, "Identity removed: %s %s (%s)\n", path, > - sshkey_type(key), comment); > + sshkey_type(key), comment ? comment : "no comment"); > } > return 0; > } > @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen > certpath, filename); > sshkey_free(cert); > goto out; > - } > + } > > /* Graft with private bits */ > if ((r = sshkey_to_certified(private)) != 0) { Index: ssh-add.c === RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v retrieving revision 1.165 diff -u -p -r1.165 ssh-add.c --- ssh-add.c 4 Feb 2022 02:49:17 - 1.165 +++ ssh-add.c 9 May 2022 18:36:54 - @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss } if (!qflag) { fprintf(stderr, "Identity removed: %s %s (%s)\n", path, - sshkey_type(key), comment); + sshkey_type(key), comment ? comment : "no comment"); } return 0; } @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen certpath, filename); sshkey_free(cert); goto out; - } + } /* Graft with private bits */ if ((r = sshkey_to_certified(private)) != 0) {
Re: Picky, but much more efficient arc4random_uniform!
int main() {
	int results[3] = { 0, 0, 0 };

	for (int i = 0; i < 100000; i++) {
		results[arc4random_uniform_fast_simple(3)]++;
	}
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);
	return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.
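For comparison, the standard bias-free construction, which is what
arc4random_uniform(3) itself does: reject a small range of raw values
instead of reusing bits.

#include <stdint.h>
#include <stdlib.h>

/*
 * Sketch of the rejection approach: values below 2^32 mod
 * upper_bound would make some residues more likely, so throw
 * them away and draw again.
 */
uint32_t
uniform_by_rejection(uint32_t upper_bound)
{
	uint32_t r, min;

	if (upper_bound < 2)
		return 0;
	min = -upper_bound % upper_bound;	/* = 2^32 mod upper_bound */
	do {
		r = arc4random();
	} while (r < min);
	return r % upper_bound;
}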
libcrypto/err_prn.c: skip BIO*
Hi

As far as I can tell, this ends up calling vprintf eventually, so skip
the steps in between.

Best,

Martin

Index: err_prn.c
===================================================================
RCS file: /home/reposync/cvs/src/lib/libcrypto/err/err_prn.c,v
retrieving revision 1.19
diff -u -p -r1.19 err_prn.c
--- err_prn.c	7 Jan 2022 09:02:18 -0000	1.19
+++ err_prn.c	7 Jan 2022 16:13:48 -0000
@@ -92,12 +92,7 @@ ERR_print_errors_cb(int (*cb)(const char
 static int
 print_fp(const char *str, size_t len, void *fp)
 {
-	BIO bio;
-
-	BIO_set(&bio, BIO_s_file());
-	BIO_set_fp(&bio, fp, BIO_NOCLOSE);
-
-	return BIO_printf(&bio, "%s", str);
+	return fprintf(fp, "%s", str);
 }
 
 void
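To exercise the changed path, a small program that pushes an error onto
the queue and prints it (the library/function/reason codes are arbitrary):

#include <openssl/err.h>
#include <stdio.h>

int
main(void)
{
	/* queue one error, then print via the now-direct fprintf path */
	ERR_put_error(ERR_LIB_USER, 0xfff, ERR_R_MALLOC_FAILURE,
	    __FILE__, __LINE__);
	ERR_print_errors_fp(stderr);
	return 0;
}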
apply(1): constify two arguments
Index: apply.c === RCS file: /cvs/src/usr.bin/apply/apply.c,v retrieving revision 1.29 diff -u -p -r1.29 apply.c --- apply.c 1 Apr 2018 17:45:05 - 1.29 +++ apply.c 12 May 2022 21:14:04 - @@ -54,7 +54,7 @@ char *str; size_t sz; void -stradd(char *p) +stradd(const char *p) { size_t n; @@ -73,7 +73,7 @@ stradd(char *p) } void -strset(char *p) +strset(const char *p) { if (str != NULL) str[0] = '\0';
Re: uvm_pagedequeue()
On 10/05/22(Tue) 20:23, Mark Kettenis wrote: > > Date: Tue, 10 May 2022 18:45:21 +0200 > > From: Martin Pieuchot > > > > On 05/05/22(Thu) 14:54, Martin Pieuchot wrote: > > > Diff below introduces a new wrapper to manipulate active/inactive page > > > queues. > > > > > > ok? > > > > Anyone? > > Sorry I started looking at this and got distracted. > > I'm not sure about the changes to uvm_pageactivate(). It doesn't > quite match what NetBSD does, but I guess NetBSD assumes that > uvm_pageactiave() isn't called for a page that is already active? And > that's something we can't guarantee? It does look at what NetBSD did 15 years ago. We're not ready to synchronize with NetBSD -current yet. We're getting there! > The diff is correct though in the sense that it is equivalent to the > code we already have. So if this definitely is the direction you want > to go: > > ok kettenis@ > > > > Index: uvm/uvm_page.c > > > === > > > RCS file: /cvs/src/sys/uvm/uvm_page.c,v > > > retrieving revision 1.165 > > > diff -u -p -r1.165 uvm_page.c > > > --- uvm/uvm_page.c4 May 2022 14:58:26 - 1.165 > > > +++ uvm/uvm_page.c5 May 2022 12:49:13 - > > > @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg) > > > /* > > >* now remove the page from the queues > > >*/ > > > - if (pg->pg_flags & PQ_ACTIVE) { > > > - TAILQ_REMOVE(_active, pg, pageq); > > > - flags_to_clear |= PQ_ACTIVE; > > > - uvmexp.active--; > > > - } > > > - if (pg->pg_flags & PQ_INACTIVE) { > > > - TAILQ_REMOVE(_inactive, pg, pageq); > > > - flags_to_clear |= PQ_INACTIVE; > > > - uvmexp.inactive--; > > > - } > > > + uvm_pagedequeue(pg); > > > > > > /* > > >* if the page was wired, unwire it now. > > > @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg) > > > MUTEX_ASSERT_LOCKED(); > > > > > > if (pg->wire_count == 0) { > > > - if (pg->pg_flags & PQ_ACTIVE) { > > > - TAILQ_REMOVE(_active, pg, pageq); > > > - atomic_clearbits_int(>pg_flags, PQ_ACTIVE); > > > - uvmexp.active--; > > > - } > > > - if (pg->pg_flags & PQ_INACTIVE) { > > > - TAILQ_REMOVE(_inactive, pg, pageq); > > > - atomic_clearbits_int(>pg_flags, PQ_INACTIVE); > > > - uvmexp.inactive--; > > > - } > > > + uvm_pagedequeue(pg); > > > uvmexp.wired++; > > > } > > > pg->wire_count++; > > > @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg) > > > KASSERT(uvm_page_owner_locked_p(pg)); > > > MUTEX_ASSERT_LOCKED(); > > > > > > + uvm_pagedequeue(pg); > > > + if (pg->wire_count == 0) { > > > + TAILQ_INSERT_TAIL(_active, pg, pageq); > > > + atomic_setbits_int(>pg_flags, PQ_ACTIVE); > > > + uvmexp.active++; > > > + > > > + } > > > +} > > > + > > > +/* > > > + * uvm_pagedequeue: remove a page from any paging queue > > > + */ > > > +void > > > +uvm_pagedequeue(struct vm_page *pg) > > > +{ > > > + if (pg->pg_flags & PQ_ACTIVE) { > > > + TAILQ_REMOVE(_active, pg, pageq); > > > + atomic_clearbits_int(>pg_flags, PQ_ACTIVE); > > > + uvmexp.active--; > > > + } > > > if (pg->pg_flags & PQ_INACTIVE) { > > > TAILQ_REMOVE(_inactive, pg, pageq); > > > atomic_clearbits_int(>pg_flags, PQ_INACTIVE); > > > uvmexp.inactive--; > > > } > > > - if (pg->wire_count == 0) { > > > - /* > > > - * if page is already active, remove it from list so we > > > - * can put it at tail. 
if it wasn't active, then mark > > > - * it active and bump active count > > > - */ > > > - if (pg->pg_flags & PQ_ACTIVE) > > > - TAILQ_REMOVE(_active, pg, pageq); > > > - else { > > > - atomic_setbits_int(>pg_flags, PQ_ACTIVE); > > > - uvmexp.active++; > > > - } > > > - > > > - TAILQ_INSERT_TAIL(_active, pg, pageq); > > > - } > > > } > > > - > > > /* > > > * uvm_pagezero: zero fill a page > > > */ > > > Index: uvm/uvm_page.h > > > === > > > RCS file: /cvs/src/sys/uvm/uvm_page.h,v > > > retrieving revision 1.67 > > > diff -u -p -r1.67 uvm_page.h > > > --- uvm/uvm_page.h29 Jan 2022 06:25:33 - 1.67 > > > +++ uvm/uvm_page.h5 May 2022 12:49:13 - > > > @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *); > > > #endif > > > > > > void uvm_pageactivate(struct vm_page *); > > > +void uvm_pagedequeue(struct vm_page *); > > > vaddr_t uvm_pageboot_alloc(vsize_t); > > > void uvm_pagecopy(struct vm_page *, struct vm_page *); > > > void uvm_pagedeactivate(struct vm_page *); > > > > > > >
Re: Mark pw_error __dead in util.h
On Tue, May 03, 2022 at 10:37:36PM -0500, Matthew Martin wrote:
> The function is already marked __dead in passwd.c, so appears to just be
> an oversight.

ping

diff --git util.h util.h
index dd64f478e23..752f8bb9fc5 100644
--- util.h
+++ util.h
@@ -97,7 +97,7 @@ void	pw_edit(int, const char *);
 void	pw_prompt(void);
 void	pw_copy(int, int, const struct passwd *, const struct passwd *);
 int	pw_scan(char *, struct passwd *, int *);
-void	pw_error(const char *, int, int);
+__dead void	pw_error(const char *, int, int);
 int	getptmfd(void);
 int	openpty(int *, int *, char *, const struct termios *,
 	    const struct winsize *);
Re: uvm_pagedequeue()
On 05/05/22(Thu) 14:54, Martin Pieuchot wrote: > Diff below introduces a new wrapper to manipulate active/inactive page > queues. > > ok? Anyone? > Index: uvm/uvm_page.c > === > RCS file: /cvs/src/sys/uvm/uvm_page.c,v > retrieving revision 1.165 > diff -u -p -r1.165 uvm_page.c > --- uvm/uvm_page.c4 May 2022 14:58:26 - 1.165 > +++ uvm/uvm_page.c5 May 2022 12:49:13 - > @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg) > /* >* now remove the page from the queues >*/ > - if (pg->pg_flags & PQ_ACTIVE) { > - TAILQ_REMOVE(_active, pg, pageq); > - flags_to_clear |= PQ_ACTIVE; > - uvmexp.active--; > - } > - if (pg->pg_flags & PQ_INACTIVE) { > - TAILQ_REMOVE(_inactive, pg, pageq); > - flags_to_clear |= PQ_INACTIVE; > - uvmexp.inactive--; > - } > + uvm_pagedequeue(pg); > > /* >* if the page was wired, unwire it now. > @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg) > MUTEX_ASSERT_LOCKED(); > > if (pg->wire_count == 0) { > - if (pg->pg_flags & PQ_ACTIVE) { > - TAILQ_REMOVE(_active, pg, pageq); > - atomic_clearbits_int(>pg_flags, PQ_ACTIVE); > - uvmexp.active--; > - } > - if (pg->pg_flags & PQ_INACTIVE) { > - TAILQ_REMOVE(_inactive, pg, pageq); > - atomic_clearbits_int(>pg_flags, PQ_INACTIVE); > - uvmexp.inactive--; > - } > + uvm_pagedequeue(pg); > uvmexp.wired++; > } > pg->wire_count++; > @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg) > KASSERT(uvm_page_owner_locked_p(pg)); > MUTEX_ASSERT_LOCKED(); > > + uvm_pagedequeue(pg); > + if (pg->wire_count == 0) { > + TAILQ_INSERT_TAIL(_active, pg, pageq); > + atomic_setbits_int(>pg_flags, PQ_ACTIVE); > + uvmexp.active++; > + > + } > +} > + > +/* > + * uvm_pagedequeue: remove a page from any paging queue > + */ > +void > +uvm_pagedequeue(struct vm_page *pg) > +{ > + if (pg->pg_flags & PQ_ACTIVE) { > + TAILQ_REMOVE(_active, pg, pageq); > + atomic_clearbits_int(>pg_flags, PQ_ACTIVE); > + uvmexp.active--; > + } > if (pg->pg_flags & PQ_INACTIVE) { > TAILQ_REMOVE(_inactive, pg, pageq); > atomic_clearbits_int(>pg_flags, PQ_INACTIVE); > uvmexp.inactive--; > } > - if (pg->wire_count == 0) { > - /* > - * if page is already active, remove it from list so we > - * can put it at tail. if it wasn't active, then mark > - * it active and bump active count > - */ > - if (pg->pg_flags & PQ_ACTIVE) > - TAILQ_REMOVE(_active, pg, pageq); > - else { > - atomic_setbits_int(>pg_flags, PQ_ACTIVE); > - uvmexp.active++; > - } > - > - TAILQ_INSERT_TAIL(_active, pg, pageq); > - } > } > - > /* > * uvm_pagezero: zero fill a page > */ > Index: uvm/uvm_page.h > === > RCS file: /cvs/src/sys/uvm/uvm_page.h,v > retrieving revision 1.67 > diff -u -p -r1.67 uvm_page.h > --- uvm/uvm_page.h29 Jan 2022 06:25:33 - 1.67 > +++ uvm/uvm_page.h5 May 2022 12:49:13 - > @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *); > #endif > > void uvm_pageactivate(struct vm_page *); > +void uvm_pagedequeue(struct vm_page *); > vaddr_t uvm_pageboot_alloc(vsize_t); > void uvm_pagecopy(struct vm_page *, struct vm_page *); > void uvm_pagedeactivate(struct vm_page *); >
Re: ssh-add(1): fix NULL in fprintf
On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote: > Martin Vahlensieck wrote: > > > if (!qflag) { > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path, > > - sshkey_type(key), comment); > > + fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path, > > + sshkey_type(key), comment ? " (" : "", > > + comment ? comment : "", comment ? ")" : ""); > > this is probably better as something like > > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path, > > - sshkey_type(key), comment ? comment : "no comment"); > > Which has a minor ambiguity, but probably harms noone. > Index: ssh-add.c === RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v retrieving revision 1.165 diff -u -p -r1.165 ssh-add.c --- ssh-add.c 4 Feb 2022 02:49:17 - 1.165 +++ ssh-add.c 9 May 2022 18:36:54 - @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss } if (!qflag) { fprintf(stderr, "Identity removed: %s %s (%s)\n", path, - sshkey_type(key), comment); + sshkey_type(key), comment ? comment : "no comment"); } return 0; } @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen certpath, filename); sshkey_free(cert); goto out; - } + } /* Graft with private bits */ if ((r = sshkey_to_certified(private)) != 0) {
ssh-add(1): fix NULL in fprintf
Hi When removing an identity from the agent using the private key file, ssh-add first tries to find the public key file. If that fails, it loads the public key from the private key file, but no comment is loaded. This means comment is NULL when it is used inside delete_one to print `Identity removed: ...' Below is a diff which only prints the braces and the comment if it is not NULL. Something similar is done in ssh-keygen.c line 2423-2425. So with the following setup: $ ssh-keygen -t ed25519 -f demo -C demo -N '' $ mv demo.pub demo_pub $ ssh-add demo Identity added: demo (demo) Before: $ ssh-add -d demo Identity removed: demo ED25519 ((null)) $ tail -n 1 /var/log/messages May 9 18:15:53 demo ssh-add: vfprintf %s NULL in "Identity removed: %s %s (%s) " After: $ ssh-add -d demo Identity removed: demo ED25519 Best, Martin P.S.: While here remove a trailing space as well. Index: ssh-add.c === RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v retrieving revision 1.165 diff -u -p -r1.165 ssh-add.c --- ssh-add.c 4 Feb 2022 02:49:17 - 1.165 +++ ssh-add.c 9 May 2022 16:04:14 - @@ -117,8 +117,9 @@ delete_one(int agent_fd, const struct ss return r; } if (!qflag) { - fprintf(stderr, "Identity removed: %s %s (%s)\n", path, - sshkey_type(key), comment); + fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path, + sshkey_type(key), comment ? " (" : "", + comment ? comment : "", comment ? ")" : ""); } return 0; } @@ -392,7 +393,7 @@ add_file(int agent_fd, const char *filen certpath, filename); sshkey_free(cert); goto out; - } + } /* Graft with private bits */ if ((r = sshkey_to_certified(private)) != 0) {
Re: uvm: Consider BUFPAGES_DEFICIT in swap_shortage
On 05/05/22(Thu) 10:56, Bob Beck wrote:
> On Thu, May 05, 2022 at 10:16:23AM -0600, Bob Beck wrote:
> > Ugh. You're digging in the most perilous parts of the pile.
> > 
> > I will go look with you... sigh. (This is not yet an ok for that.)
> > 
> > > On May 5, 2022, at 7:53 AM, Martin Pieuchot wrote:
> > > 
> > > When considering the amount of free pages in the page daemon a small
> > > amount is always kept for the buffer cache... except in one place.
> > > 
> > > The diff below gets rid of this exception.  This is important because
> > > uvmpd_scan() is called conditionally using the following check:
> > > 
> > > 	if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
> > > 		...
> > > 	}
> > > 
> > > So in case of swap shortage we might end up freeing fewer pages than
> > > wanted.
> 
> So a bit of background.
> 
> I am pretty much of the belief that this "low water mark" for pages is
> nonsense now.  I was in the midst of trying to prove that
> to myself and therefore rip down some of the crazy accounting and
> very arbitrary limits in the buffer cache and got distracted.
> 
> Maybe something like this to start? (but failing that I think
> your current diff is probably ok).

Thanks.  I'll commit my diff then to make the current code coherent and
let me progress in my refactoring.  Then we can consider changing this
magic.

> Index: sys/sys/mount.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/mount.h,v
> retrieving revision 1.148
> diff -u -p -u -p -r1.148 mount.h
> --- sys/sys/mount.h	6 Apr 2021 14:17:35 -0000	1.148
> +++ sys/sys/mount.h	5 May 2022 16:50:50 -0000
> @@ -488,10 +488,8 @@ struct bcachestats {
>  #ifdef _KERNEL
>  extern struct bcachestats bcstats;
>  extern long buflowpages, bufhighpages, bufbackpages;
> -#define BUFPAGES_DEFICIT (((buflowpages - bcstats.numbufpages) < 0) ? 0 \
> -    : buflowpages - bcstats.numbufpages)
> -#define BUFPAGES_INACT (((bcstats.numcleanpages - buflowpages) < 0) ? 0 \
> -    : bcstats.numcleanpages - buflowpages)
> +#define BUFPAGES_DEFICIT 0
> +#define BUFPAGES_INACT bcstats.numcleanpages
>  extern int bufcachepercent;
>  extern void bufadjust(int);
>  struct uvm_constraint_range;
> 
> > > ok?
> > > 
> > > Index: uvm/uvm_pdaemon.c
> > > ===================================================================
> > > RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> > > retrieving revision 1.98
> > > diff -u -p -r1.98 uvm_pdaemon.c
> > > --- uvm/uvm_pdaemon.c	4 May 2022 14:58:26 -0000	1.98
> > > +++ uvm/uvm_pdaemon.c	5 May 2022 13:40:28 -0000
> > > @@ -923,12 +923,13 @@ uvmpd_scan(void)
> > >  	 * detect if we're not going to be able to page anything out
> > >  	 * until we free some swap resources from active pages.
> > >  	 */
> > > +	free = uvmexp.free - BUFPAGES_DEFICIT;
> > >  	swap_shortage = 0;
> > > -	if (uvmexp.free < uvmexp.freetarg &&
> > > +	if (free < uvmexp.freetarg &&
> > >  	    uvmexp.swpginuse == uvmexp.swpages &&
> > >  	    !uvm_swapisfull() &&
> > >  	    pages_freed == 0) {
> > > -		swap_shortage = uvmexp.freetarg - uvmexp.free;
> > > +		swap_shortage = uvmexp.freetarg - free;
> > >  	}
> > > 
> > >  	for (p = TAILQ_FIRST(&uvm.page_active);
uvm: Consider BUFPAGES_DEFICIT in swap_shortage
When considering the amount of free pages in the page daemon a small
amount is always kept for the buffer cache... except in one place.

The diff below gets rid of this exception.  This is important because
uvmpd_scan() is called conditionally using the following check:

	if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
		...
	}

So in case of swap shortage we might end up freeing fewer pages than
wanted.

ok?

Index: uvm/uvm_pdaemon.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.98
diff -u -p -r1.98 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c	4 May 2022 14:58:26 -0000	1.98
+++ uvm/uvm_pdaemon.c	5 May 2022 13:40:28 -0000
@@ -923,12 +923,13 @@ uvmpd_scan(void)
 	 * detect if we're not going to be able to page anything out
 	 * until we free some swap resources from active pages.
 	 */
+	free = uvmexp.free - BUFPAGES_DEFICIT;
 	swap_shortage = 0;
-	if (uvmexp.free < uvmexp.freetarg &&
+	if (free < uvmexp.freetarg &&
 	    uvmexp.swpginuse == uvmexp.swpages &&
 	    !uvm_swapisfull() &&
 	    pages_freed == 0) {
-		swap_shortage = uvmexp.freetarg - uvmexp.free;
+		swap_shortage = uvmexp.freetarg - free;
 	}
 
 	for (p = TAILQ_FIRST(&uvm.page_active);
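To make the off-by-deficit effect concrete, a toy calculation with
invented numbers:

#include <stdio.h>

/*
 * Invented numbers to illustrate the problem fixed above: with
 * freetarg = 512, 600 free pages and a buffer-cache deficit of
 * 200 pages, the old code saw no swap shortage even though the
 * caller's check (free - deficit < freetarg) had already fired.
 */
int
main(void)
{
	int freetarg = 512, freepg = 600, deficit = 200;

	printf("old swap_shortage: %d\n",
	    freepg < freetarg ? freetarg - freepg : 0);		/* 0 */
	printf("new swap_shortage: %d\n",
	    freepg - deficit < freetarg ?
	    freetarg - (freepg - deficit) : 0);			/* 112 */
	return 0;
}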
uvm_pagedequeue()
Diff below introduces a new wrapper to manipulate active/inactive page queues. ok? Index: uvm/uvm_page.c === RCS file: /cvs/src/sys/uvm/uvm_page.c,v retrieving revision 1.165 diff -u -p -r1.165 uvm_page.c --- uvm/uvm_page.c 4 May 2022 14:58:26 - 1.165 +++ uvm/uvm_page.c 5 May 2022 12:49:13 - @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg) /* * now remove the page from the queues */ - if (pg->pg_flags & PQ_ACTIVE) { - TAILQ_REMOVE(_active, pg, pageq); - flags_to_clear |= PQ_ACTIVE; - uvmexp.active--; - } - if (pg->pg_flags & PQ_INACTIVE) { - TAILQ_REMOVE(_inactive, pg, pageq); - flags_to_clear |= PQ_INACTIVE; - uvmexp.inactive--; - } + uvm_pagedequeue(pg); /* * if the page was wired, unwire it now. @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg) MUTEX_ASSERT_LOCKED(); if (pg->wire_count == 0) { - if (pg->pg_flags & PQ_ACTIVE) { - TAILQ_REMOVE(_active, pg, pageq); - atomic_clearbits_int(>pg_flags, PQ_ACTIVE); - uvmexp.active--; - } - if (pg->pg_flags & PQ_INACTIVE) { - TAILQ_REMOVE(_inactive, pg, pageq); - atomic_clearbits_int(>pg_flags, PQ_INACTIVE); - uvmexp.inactive--; - } + uvm_pagedequeue(pg); uvmexp.wired++; } pg->wire_count++; @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg) KASSERT(uvm_page_owner_locked_p(pg)); MUTEX_ASSERT_LOCKED(); + uvm_pagedequeue(pg); + if (pg->wire_count == 0) { + TAILQ_INSERT_TAIL(_active, pg, pageq); + atomic_setbits_int(>pg_flags, PQ_ACTIVE); + uvmexp.active++; + + } +} + +/* + * uvm_pagedequeue: remove a page from any paging queue + */ +void +uvm_pagedequeue(struct vm_page *pg) +{ + if (pg->pg_flags & PQ_ACTIVE) { + TAILQ_REMOVE(_active, pg, pageq); + atomic_clearbits_int(>pg_flags, PQ_ACTIVE); + uvmexp.active--; + } if (pg->pg_flags & PQ_INACTIVE) { TAILQ_REMOVE(_inactive, pg, pageq); atomic_clearbits_int(>pg_flags, PQ_INACTIVE); uvmexp.inactive--; } - if (pg->wire_count == 0) { - /* -* if page is already active, remove it from list so we -* can put it at tail. if it wasn't active, then mark -* it active and bump active count -*/ - if (pg->pg_flags & PQ_ACTIVE) - TAILQ_REMOVE(_active, pg, pageq); - else { - atomic_setbits_int(>pg_flags, PQ_ACTIVE); - uvmexp.active++; - } - - TAILQ_INSERT_TAIL(_active, pg, pageq); - } } - /* * uvm_pagezero: zero fill a page */ Index: uvm/uvm_page.h === RCS file: /cvs/src/sys/uvm/uvm_page.h,v retrieving revision 1.67 diff -u -p -r1.67 uvm_page.h --- uvm/uvm_page.h 29 Jan 2022 06:25:33 - 1.67 +++ uvm/uvm_page.h 5 May 2022 12:49:13 - @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *); #endif void uvm_pageactivate(struct vm_page *); +void uvm_pagedequeue(struct vm_page *); vaddr_tuvm_pageboot_alloc(vsize_t); void uvm_pagecopy(struct vm_page *, struct vm_page *); void uvm_pagedeactivate(struct vm_page *);
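One side benefit of the wrapper: a single place to hang a queue-lock
assertion later.  A sketch of the new function with such an assertion
added (the assertion itself is not in the diff):

void
uvm_pagedequeue(struct vm_page *pg)
{
	/* possible follow-up, not in the diff above */
	MUTEX_ASSERT_LOCKED(&uvm.pageqlock);

	if (pg->pg_flags & PQ_ACTIVE) {
		TAILQ_REMOVE(&uvm.page_active, pg, pageq);
		atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
		uvmexp.active--;
	}
	if (pg->pg_flags & PQ_INACTIVE) {
		TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
		atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
		uvmexp.inactive--;
	}
}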
ssh: sshkey.c: reduce code duplication
Hi I noticed that sshkey_unshield_private contains a exact duplicate of the code in private2_check_padding. So by pulling private2_check_padding up, the code can be reused. Or is there a reason for this split? Best, Martin P.S.: This diff also removes two trailing spaces while here. Index: sshkey.c === RCS file: /home/reposync/cvs/src/usr.bin/ssh/sshkey.c,v retrieving revision 1.120 diff -u -p -r1.120 sshkey.c --- sshkey.c6 Jan 2022 22:05:42 - 1.120 +++ sshkey.c4 May 2022 19:12:16 - @@ -2079,14 +2079,38 @@ sshkey_shield_private(struct sshkey *k) return r; } +/* Check deterministic padding after private key */ +static int +private2_check_padding(struct sshbuf *decrypted) +{ + u_char pad; + size_t i; + int r; + + i = 0; + while (sshbuf_len(decrypted)) { + if ((r = sshbuf_get_u8(decrypted, )) != 0) + goto out; + if (pad != (++i & 0xff)) { + r = SSH_ERR_INVALID_FORMAT; + goto out; + } + } + /* success */ + r = 0; + out: + explicit_bzero(, sizeof(pad)); + explicit_bzero(, sizeof(i)); + return r; +} + int sshkey_unshield_private(struct sshkey *k) { struct sshbuf *prvbuf = NULL; - u_char pad, *cp, keyiv[SSH_DIGEST_MAX_LENGTH]; + u_char *cp, keyiv[SSH_DIGEST_MAX_LENGTH]; struct sshcipher_ctx *cctx = NULL; const struct sshcipher *cipher; - size_t i; struct sshkey *kswap = NULL, tmp; int r = SSH_ERR_INTERNAL_ERROR; @@ -2148,16 +2172,9 @@ sshkey_unshield_private(struct sshkey *k /* Parse private key */ if ((r = sshkey_private_deserialize(prvbuf, )) != 0) goto out; - /* Check deterministic padding */ - i = 0; - while (sshbuf_len(prvbuf)) { - if ((r = sshbuf_get_u8(prvbuf, )) != 0) - goto out; - if (pad != (++i & 0xff)) { - r = SSH_ERR_INVALID_FORMAT; - goto out; - } - } + + if ((r = private2_check_padding(prvbuf)) != 0) + goto out; /* Swap the parsed key back into place */ tmp = *kswap; @@ -3966,9 +3983,9 @@ sshkey_private_to_blob2(struct sshkey *p explicit_bzero(salt, sizeof(salt)); if (key != NULL) freezero(key, keylen + ivlen); - if (pubkeyblob != NULL) + if (pubkeyblob != NULL) freezero(pubkeyblob, pubkeylen); - if (b64 != NULL) + if (b64 != NULL) freezero(b64, strlen(b64)); return r; } @@ -4192,31 +4209,6 @@ private2_decrypt(struct sshbuf *decoded, } sshbuf_free(kdf); sshbuf_free(decrypted); - return r; -} - -/* Check deterministic padding after private key */ -static int -private2_check_padding(struct sshbuf *decrypted) -{ - u_char pad; - size_t i; - int r = SSH_ERR_INTERNAL_ERROR; - - i = 0; - while (sshbuf_len(decrypted)) { - if ((r = sshbuf_get_u8(decrypted, )) != 0) - goto out; - if (pad != (++i & 0xff)) { - r = SSH_ERR_INVALID_FORMAT; - goto out; - } - } - /* success */ - r = 0; - out: - explicit_bzero(, sizeof(pad)); - explicit_bzero(, sizeof(i)); return r; }
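For context, a sketch of the producer side of that padding:
sshkey_shield_private() appends the bytes 1, 2, 3, ... until the blob
fills a cipher block (illustrative helper, not the actual code):

/*
 * Illustrative only: produce the deterministic padding that
 * private2_check_padding() verifies, using the sshbuf API.
 */
static int
pad_private_blob(struct sshbuf *buf, u_int blocksize)
{
	u_char i = 0;
	int r;

	while (sshbuf_len(buf) % blocksize != 0) {
		if ((r = sshbuf_put_u8(buf, ++i & 0xff)) != 0)
			return r;
	}
	return 0;
}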
ssh: mux.c: mark argument as const
Index: mux.c === RCS file: /home/reposync/cvs/src/usr.bin/ssh/mux.c,v retrieving revision 1.92 diff -u -p -r1.92 mux.c --- mux.c 11 Jan 2022 01:26:47 - 1.92 +++ mux.c 13 Jan 2022 16:27:14 - @@ -227,7 +227,7 @@ mux_master_control_cleanup_cb(struct ssh /* Check mux client environment variables before passing them to mux master. */ static int -env_permitted(char *env) +env_permitted(const char *env) { int i, ret; char name[1024], *cp;
ssh: channels.c: Fix comment and add a const
Hi channel_new no longer frees remote_name. So update the comment accordingly. As remote_name is not modified, it can be const as well. Best, Martin Index: channels.c === RCS file: /home/reposync/cvs/src/usr.bin/ssh/channels.c,v retrieving revision 1.418 diff -u -p -r1.418 channels.c --- channels.c 4 May 2022 07:31:22 - 1.418 +++ channels.c 4 May 2022 19:02:14 - @@ -349,12 +349,11 @@ channel_register_fds(struct ssh *ssh, Ch } /* - * Allocate a new channel object and set its type and socket. This will cause - * remote_name to be freed. + * Allocate a new channel object and set its type and socket. */ Channel * channel_new(struct ssh *ssh, char *ctype, int type, int rfd, int wfd, int efd, -u_int window, u_int maxpack, int extusage, char *remote_name, int nonblock) +u_int window, u_int maxpack, int extusage, const char *remote_name, int nonblock) { struct ssh_channels *sc = ssh->chanctxt; u_int i, found; Index: channels.h === RCS file: /home/reposync/cvs/src/usr.bin/ssh/channels.h,v retrieving revision 1.142 diff -u -p -r1.142 channels.h --- channels.h 30 Mar 2022 21:10:25 - 1.142 +++ channels.h 6 Apr 2022 20:26:56 - @@ -272,7 +272,7 @@ Channel *channel_by_id(struct ssh *, int Channel*channel_by_remote_id(struct ssh *, u_int); Channel*channel_lookup(struct ssh *, int); Channel *channel_new(struct ssh *, char *, int, int, int, int, - u_int, u_int, int, char *, int); + u_int, u_int, int, const char *, int); voidchannel_set_fds(struct ssh *, int, int, int, int, int, int, int, u_int); voidchannel_free(struct ssh *, Channel *);
Mark pw_error __dead in util.h
The function is already marked __dead in passwd.c, so appears to just be
an oversight.

diff --git util.h util.h
index dd64f478e23..752f8bb9fc5 100644
--- util.h
+++ util.h
@@ -97,7 +97,7 @@ void	pw_edit(int, const char *);
 void	pw_prompt(void);
 void	pw_copy(int, int, const struct passwd *, const struct passwd *);
 int	pw_scan(char *, struct passwd *, int *);
-void	pw_error(const char *, int, int);
+__dead void	pw_error(const char *, int, int);
 int	getptmfd(void);
 int	openpty(int *, int *, char *, const struct termios *,
 	    const struct winsize *);
Merge swap-backed and object-backed inactive lists
Let's simplify the existing logic and use a single list for inactive pages. uvmpd_scan_inactive() already does a lot of check if it finds a page which is swap-backed. This will be improved in a next change. ok? Index: uvm/uvm.h === RCS file: /cvs/src/sys/uvm/uvm.h,v retrieving revision 1.68 diff -u -p -r1.68 uvm.h --- uvm/uvm.h 24 Nov 2020 13:49:09 - 1.68 +++ uvm/uvm.h 2 May 2022 16:32:16 - @@ -53,8 +53,7 @@ struct uvm { /* vm_page queues */ struct pglist page_active; /* [Q] allocated pages, in use */ - struct pglist page_inactive_swp;/* [Q] pages inactive (reclaim/free) */ - struct pglist page_inactive_obj;/* [Q] pages inactive (reclaim/free) */ + struct pglist page_inactive;/* [Q] pages inactive (reclaim/free) */ /* Lock order: pageqlock, then fpageqlock. */ struct mutex pageqlock; /* [] lock for active/inactive page q */ struct mutex fpageqlock;/* [] lock for free page q + pdaemon */ Index: uvm/uvm_map.c === RCS file: /cvs/src/sys/uvm/uvm_map.c,v retrieving revision 1.290 diff -u -p -r1.290 uvm_map.c --- uvm/uvm_map.c 12 Mar 2022 08:11:07 - 1.290 +++ uvm/uvm_map.c 2 May 2022 16:32:16 - @@ -3281,8 +3281,7 @@ uvm_page_printit(struct vm_page *pg, boo (*pr)(" >>> page not found in uvm_pmemrange <<<\n"); pgl = NULL; } else if (pg->pg_flags & PQ_INACTIVE) { - pgl = (pg->pg_flags & PQ_SWAPBACKED) ? - _inactive_swp : _inactive_obj; + pgl = _inactive; } else if (pg->pg_flags & PQ_ACTIVE) { pgl = _active; } else { Index: uvm/uvm_page.c === RCS file: /cvs/src/sys/uvm/uvm_page.c,v retrieving revision 1.164 diff -u -p -r1.164 uvm_page.c --- uvm/uvm_page.c 28 Apr 2022 09:59:28 - 1.164 +++ uvm/uvm_page.c 2 May 2022 16:32:16 - @@ -185,8 +185,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr */ TAILQ_INIT(_active); - TAILQ_INIT(_inactive_swp); - TAILQ_INIT(_inactive_obj); + TAILQ_INIT(_inactive); mtx_init(, IPL_VM); mtx_init(, IPL_VM); uvm_pmr_init(); @@ -994,10 +993,7 @@ uvm_pageclean(struct vm_page *pg) uvmexp.active--; } if (pg->pg_flags & PQ_INACTIVE) { - if (pg->pg_flags & PQ_SWAPBACKED) - TAILQ_REMOVE(_inactive_swp, pg, pageq); - else - TAILQ_REMOVE(_inactive_obj, pg, pageq); + TAILQ_REMOVE(_inactive, pg, pageq); flags_to_clear |= PQ_INACTIVE; uvmexp.inactive--; } @@ -1253,10 +1249,7 @@ uvm_pagewire(struct vm_page *pg) uvmexp.active--; } if (pg->pg_flags & PQ_INACTIVE) { - if (pg->pg_flags & PQ_SWAPBACKED) - TAILQ_REMOVE(_inactive_swp, pg, pageq); - else - TAILQ_REMOVE(_inactive_obj, pg, pageq); + TAILQ_REMOVE(_inactive, pg, pageq); atomic_clearbits_int(>pg_flags, PQ_INACTIVE); uvmexp.inactive--; } @@ -1304,10 +1297,7 @@ uvm_pagedeactivate(struct vm_page *pg) } if ((pg->pg_flags & PQ_INACTIVE) == 0) { KASSERT(pg->wire_count == 0); - if (pg->pg_flags & PQ_SWAPBACKED) - TAILQ_INSERT_TAIL(_inactive_swp, pg, pageq); - else - TAILQ_INSERT_TAIL(_inactive_obj, pg, pageq); + TAILQ_INSERT_TAIL(_inactive, pg, pageq); atomic_setbits_int(>pg_flags, PQ_INACTIVE); uvmexp.inactive++; pmap_clear_reference(pg); @@ -1335,10 +1325,7 @@ uvm_pageactivate(struct vm_page *pg) MUTEX_ASSERT_LOCKED(); if (pg->pg_flags & PQ_INACTIVE) { - if (pg->pg_flags & PQ_SWAPBACKED) - TAILQ_REMOVE(_inactive_swp, pg, pageq); - else - TAILQ_REMOVE(_inactive_obj, pg, pageq); + TAILQ_REMOVE(_inactive, pg, pageq); atomic_clearbits_int(>pg_flags, PQ_INACTIVE); uvmexp.inactive--; } Index: uvm/uvm_pdaemon.c === RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v retrieving revision 1.97 diff -u -p -r1.97 uvm_pdaemon.c --- uvm/uvm_pdaemon.c 30 Apr 2022 17:58:43 - 1.97 +++ uvm/uvm_pdaemon.c 2 May 2022 16:32:16 - @@ -396,13 +396,6 @@ 
uvmpd_scan_inactive(struct pglist *pglst int dirtyreacts; /* -* note: we currently keep swap-backed pages on a
uvmpd_scan(): Recheck PG_BUSY after locking the page
rw_enter(9) can sleep. When the lock is finally acquired by the pagedaemon the previous check might no longer be true and the page could be busy. In this case we shouldn't touch it. Diff below recheck for PG_BUSY after acquiring the lock and also use a variable for the lock to reduce the differences with NetBSD. ok? Index: uvm/uvm_pdaemon.c === RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v retrieving revision 1.96 diff -u -p -r1.96 uvm_pdaemon.c --- uvm/uvm_pdaemon.c 11 Apr 2022 16:43:49 - 1.96 +++ uvm/uvm_pdaemon.c 28 Apr 2022 10:22:52 - @@ -879,6 +879,8 @@ uvmpd_scan(void) int free, inactive_shortage, swap_shortage, pages_freed; struct vm_page *p, *nextpg; struct uvm_object *uobj; + struct vm_anon *anon; + struct rwlock *slock; boolean_t got_it; MUTEX_ASSERT_LOCKED(); @@ -947,20 +949,34 @@ uvmpd_scan(void) p != NULL && (inactive_shortage > 0 || swap_shortage > 0); p = nextpg) { nextpg = TAILQ_NEXT(p, pageq); - - /* skip this page if it's busy. */ - if (p->pg_flags & PG_BUSY) + if (p->pg_flags & PG_BUSY) { continue; + } - if (p->pg_flags & PQ_ANON) { - KASSERT(p->uanon != NULL); - if (rw_enter(p->uanon->an_lock, RW_WRITE|RW_NOSLEEP)) + /* +* lock the page's owner. +*/ + if (p->uobject != NULL) { + uobj = p->uobject; + slock = uobj->vmobjlock; + if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) { continue; + } } else { - KASSERT(p->uobject != NULL); - if (rw_enter(p->uobject->vmobjlock, - RW_WRITE|RW_NOSLEEP)) + anon = p->uanon; + KASSERT(p->uanon != NULL); + slock = anon->an_lock; + if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) { continue; + } + } + + /* +* skip this page if it's busy. +*/ + if ((p->pg_flags & PG_BUSY) != 0) { + rw_exit(slock); + continue; } /* @@ -997,10 +1013,11 @@ uvmpd_scan(void) uvmexp.pddeact++; inactive_shortage--; } - if (p->pg_flags & PQ_ANON) - rw_exit(p->uanon->an_lock); - else - rw_exit(p->uobject->vmobjlock); + + /* +* we're done with this page. +*/ + rw_exit(slock); } }
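The pattern in isolation (illustrative): the unlocked PG_BUSY test is
only an optimization and must be repeated once the lock is held, since
another CPU can set PG_BUSY between the test and the lock acquisition.

	if (p->pg_flags & PG_BUSY)
		continue;			/* cheap early skip */
	if (rw_enter(slock, RW_WRITE | RW_NOSLEEP))
		continue;			/* lock unavailable */
	if (p->pg_flags & PG_BUSY) {		/* recheck under the lock */
		rw_exit(slock);
		continue;
	}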
Call uvm_pageactivate() from uvm_pageunwire()
I'd like to use a proper interface to add/remove pages on the
active/inactive queues.  This will help with lock assertions and help
address the existing LRU limitations.

The diff below makes uvm_pageunwire() call uvm_pageactivate() instead
of inserting the page itself.

ok?

Index: uvm/uvm_page.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_page.c
--- uvm/uvm_page.c	12 Mar 2022 12:34:22 -0000	1.163
+++ uvm/uvm_page.c	26 Apr 2022 12:13:59 -0000
@@ -1279,9 +1279,7 @@ uvm_pageunwire(struct vm_page *pg)
 
 	pg->wire_count--;
 	if (pg->wire_count == 0) {
-		TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-		uvmexp.active++;
-		atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+		uvm_pageactivate(pg);
 		uvmexp.wired--;
 	}
 }
Decrement uvmexp.swpgonly
Small diff to decrement the counter only if the I/O succeeds.  This
prevents a false positive if a check is performed before an error is
returned.

ok?

Index: uvm/uvm_swap.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_swap.c
--- uvm/uvm_swap.c	17 Mar 2022 10:15:13 -0000	1.154
+++ uvm/uvm_swap.c	26 Apr 2022 12:04:52 -0000
@@ -1572,17 +1572,16 @@ uvm_swap_get(struct vm_page *page, int s
 	}
 
 	KERNEL_LOCK();
-	/* this page is (about to be) no longer only in swap. */
-	atomic_dec_int(&uvmexp.swpgonly);
-
 	result = uvm_swap_io(&page, swslot, 1, B_READ |
 	    ((flags & PGO_SYNCIO) ? 0 : B_ASYNC));
+	KERNEL_UNLOCK();
 
-	if (result != VM_PAGER_OK && result != VM_PAGER_PEND) {
-		/* oops, the read failed so it really is still only in swap. */
-		atomic_inc_int(&uvmexp.swpgonly);
+	if (result == VM_PAGER_OK || result == VM_PAGER_PEND) {
+		/*
+		 * this page is no longer only in swap.
+		 */
+		atomic_dec_int(&uvmexp.swpgonly);
 	}
-	KERNEL_UNLOCK();
 
 	return (result);
 }
Re: const openpty et al.
On Thu, Apr 07, 2022 at 06:11:45PM -0500, Matthew Martin wrote: > const the termp and winp arguments for openpty and related. This matches > the prototypes for openpty and forkpty in glibc and musl libc. ping; has an ok from tb@ [1] 1: https://marc.info/?l=openbsd-tech=164986161215132=2 diff --git openpty.3 openpty.3 index e0c82e00c53..080d92d8ced 100644 --- openpty.3 +++ openpty.3 @@ -47,15 +47,15 @@ .Ft int .Fn getptmfd "void" .Ft int -.Fn openpty "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn openpty "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp" .Ft int -.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp" .Ft int .Fn login_tty "int fd" .Ft pid_t -.Fn forkpty "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn forkpty "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp" .Ft pid_t -.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp" .Sh DESCRIPTION The .Fn openpty , diff --git pty.c pty.c index c796acb5041..d59f863ff2d 100644 --- pty.c +++ pty.c @@ -50,8 +50,8 @@ getptmfd(void) } int -openpty(int *amaster, int *aslave, char *name, struct termios *termp, -struct winsize *winp) +openpty(int *amaster, int *aslave, char *name, const struct termios *termp, +const struct winsize *winp) { int ptmfd; @@ -67,7 +67,7 @@ openpty(int *amaster, int *aslave, char *name, struct termios *termp, int fdopenpty(int ptmfd, int *amaster, int *aslave, char *name, -struct termios *termp, struct winsize *winp) +const struct termios *termp, const struct winsize *winp) { int master, slave; struct ptmget ptm; @@ -97,7 +97,8 @@ fdopenpty(int ptmfd, int *amaster, int *aslave, char *name, } pid_t -forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp) +forkpty(int *amaster, char *name, const struct termios *termp, +const struct winsize *winp) { int ptmfd; pid_t pid; @@ -113,8 +114,8 @@ forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp) } pid_t -fdforkpty(int ptmfd, int *amaster, char *name, struct termios *termp, -struct winsize *winp) +fdforkpty(int ptmfd, int *amaster, char *name, const struct termios *termp, +const struct winsize *winp) { int master, slave; pid_t pid; diff --git util.h util.h index d7aa8faed4b..ecae5819e53 100644 --- util.h +++ util.h @@ -99,12 +99,14 @@ voidpw_copy(int, int, const struct passwd *, const struct passwd *); intpw_scan(char *, struct passwd *, int *); void pw_error(const char *, int, int); intgetptmfd(void); -intopenpty(int *, int *, char *, struct termios *, struct winsize *); -intfdopenpty(int, int *, int *, char *, struct termios *, - struct winsize *); +intopenpty(int *, int *, char *, const struct termios *, + const struct winsize *); +intfdopenpty(int, int *, int *, char *, const struct termios *, + const struct winsize *); intopendisk(const char *, int, char *, size_t, int); -pid_t forkpty(int *, char *, struct termios *, struct winsize *); -pid_t fdforkpty(int, int *, char *, struct termios *, struct winsize *); +pid_t forkpty(int *, char *, const struct termios *, const struct winsize *); +pid_t fdforkpty(int, int *, 
char *, const struct termios *, + const struct winsize *); intgetmaxpartitions(void); intgetrawpartition(void); void login_fbtab(const char *, uid_t, gid_t);
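To see the change from the caller's side, here is a minimal program (an illustration, not part of the diff) that passes a const-qualified winsize template to openpty(3); with the old prototypes this needed a cast or a mutable copy. Build with -lutil:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <stdio.h>
#include <err.h>
#include <unistd.h>
#include <util.h>

int
main(void)
{
	struct termios t;
	const struct winsize ws = { .ws_row = 24, .ws_col = 80 };
	int master, slave;

	if (tcgetattr(STDIN_FILENO, &t) == -1)
		err(1, "tcgetattr");
	cfmakeraw(&t);

	/* termp/winp are only read, so const pointers are now accepted */
	if (openpty(&master, &slave, NULL, &t, &ws) == -1)
		err(1, "openpty");
	printf("slave is %s\n", ttyname(slave));
	return 0;
}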
xmss_hash.c: remove superfluous includes
Hi Neither openssl/evp.h nor openssl/hmac.h are required. Best, Martin Index: xmss_hash.c === RCS file: /cvs/src/usr.bin/ssh/xmss_hash.c,v retrieving revision 1.2 diff -u -p -r1.2 xmss_hash.c --- xmss_hash.c 26 Feb 2018 03:56:44 - 1.2 +++ xmss_hash.c 7 Apr 2022 07:04:19 - @@ -15,8 +15,6 @@ Public domain. #include #include #include -#include <openssl/evp.h> -#include <openssl/hmac.h> int core_hash_SHA2(unsigned char *, const unsigned int, const unsigned char *, unsigned int, const unsigned char *, unsigned long long, unsigned int);
ssh-xmss.c: Add missing includes
Hi malloc(3) and friends require stdlib.h, SIZE_MAX requires stdint.h. Best, Martin Index: ssh-xmss.c === RCS file: /cvs/src/usr.bin/ssh/ssh-xmss.c,v retrieving revision 1.4 diff -u -p -r1.4 ssh-xmss.c --- ssh-xmss.c 19 Oct 2020 22:49:23 - 1.4 +++ ssh-xmss.c 7 Apr 2022 07:14:09 - @@ -19,8 +19,10 @@ #include #include +#include <stdlib.h> #include #include +#include <stdint.h> #include #include "log.h"
readconf.c: Avoid a xstrdup
Hi There is no need to duplicate options->send_env[i] only to free it again in all cases. Just use options->send_env[i] directly. Best, Martin Index: readconf.c === RCS file: /cvs/src/usr.bin/ssh/readconf.c,v retrieving revision 1.366 diff -u -p -r1.366 readconf.c --- readconf.c 8 Feb 2022 08:59:12 - 1.366 +++ readconf.c 19 Apr 2022 09:38:48 - @@ -740,19 +740,15 @@ static void rm_env(Options *options, const char *arg, const char *filename, int linenum) { int i, j, onum_send_env = options->num_send_env; - char *cp; /* Remove an environment variable */ for (i = 0; i < options->num_send_env; ) { - cp = xstrdup(options->send_env[i]); - if (!match_pattern(cp, arg + 1)) { - free(cp); + if (!match_pattern(options->send_env[i], arg + 1)) { i++; continue; } debug3("%s line %d: removing environment %s", - filename, linenum, cp); - free(cp); + filename, linenum, options->send_env[i]); free(options->send_env[i]); options->send_env[i] = NULL; for (j = i; j < options->num_send_env - 1; j++) {
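As a standalone sketch of the resulting loop shape (invented names; match() stands in for ssh's match_pattern(), which takes const strings, which is why the copy was never needed):

#include <stdlib.h>
#include <string.h>

static int
match(const char *s, const char *pat)	/* stands in for match_pattern() */
{
	return strcmp(s, pat) == 0;
}

static void
rm_entry(char **v, int *n, const char *pat)
{
	int i, j;

	for (i = 0; i < *n; ) {
		if (!match(v[i], pat)) {
			i++;
			continue;
		}
		free(v[i]);
		for (j = i; j < *n - 1; j++)	/* compact the array */
			v[j] = v[j + 1];
		(*n)--;		/* stay at i: the shifted entry needs a check */
	}
}

int
main(void)
{
	char *v[3] = { strdup("FOO"), strdup("BAR"), strdup("FOO") };
	int n = 3;

	rm_entry(v, &n, "FOO");		/* leaves only "BAR", n == 1 */
	free(v[0]);
	return n == 1 ? 0 : 1;
}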
Re: refcount btrace
On 08/04/22(Fri) 12:16, Alexander Bluhm wrote: > On Fri, Apr 08, 2022 at 02:39:34AM +, Visa Hankala wrote: > > On Thu, Apr 07, 2022 at 07:55:11PM +0200, Alexander Bluhm wrote: > > > On Wed, Mar 23, 2022 at 06:13:27PM +0100, Alexander Bluhm wrote: > > > > In my opinion tracepoints give insight at minimal cost. It is worth > > > > it to have it in GENERIC to make it easy to use. > > > > > > After release I want to revive the btrace of refcounts discussion. > > > > > > As mpi@ mentioned the idea of dt(4) is to have these trace points > > > in GENERIC kernel. If you want to hunt a bug, just turn it on. > > > Refcounting is a common place for bugs, leaks can be detected easily. > > > > > > The alternative is some defines that you can compile in and access > > > from ddb. This is more work and you would have to implement it for > > > every refcount. > > > https://marc.info/?l=openbsd-tech=163786435916039=2 > > > > > > There is no measurable performance difference. dt(4) is written > > > in a way that it is only one additional branch. At least my goal > > > is to add trace points to useful places when we identify them. > > > > DT_INDEX_ENTER() still checks the index first, so it has two branches > > in practice. > > > > I think dt_tracing should be checked first so that it serves as > > a gateway to the trace code. Under normal operation, the check's > > outcome is always the same, which is easy even for simple branch > > predictors. > > Reordering the check is easy. Now dt_tracing is first. > > > I have a slight suspicion that dt(4) is now becoming a way to add code > > that would be otherwise unacceptable. Also, how "durable" are trace > > points perceived? Is an added trace point an achieved advantage that > > is difficult to remove even when its utility has diminished? There is > > a similarity to (ad hoc) debug printfs too. > > As I understand dt(4) it is a replacement for debug printfs. But > it has advantages. It can be turned on selectively from userland. > It does not spam the console, but can be processed in userland. It > is always there, you don't have to recompile. > > Of course you always have the printf or tracepoint at the wrong > place. I think people debugging the code should move them to > the useful places. Then we may end up with a generally useful tool. > At least that is my hope. > > There are obvious places to debug. We have syscall entry and return. > And I think reference counting is also generally interesting. I'm happy if this can help debugging real reference counting issues. Do you have a script that could be committed to /usr/share/btrace to show how to track reference counting using these probes?
> Index: dev/dt/dt_prov_static.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dt_prov_static.c,v > retrieving revision 1.13 > diff -u -p -r1.13 dt_prov_static.c > --- dev/dt/dt_prov_static.c 17 Mar 2022 14:53:59 - 1.13 > +++ dev/dt/dt_prov_static.c 8 Apr 2022 09:40:29 - > @@ -87,6 +87,12 @@ DT_STATIC_PROBE1(smr, barrier_exit, "int > DT_STATIC_PROBE0(smr, wakeup); > DT_STATIC_PROBE2(smr, thread, "uint64_t", "uint64_t"); > > +/* > + * reference counting > + */ > +DT_STATIC_PROBE0(refcnt, none); > +DT_STATIC_PROBE3(refcnt, inpcb, "void *", "int", "int"); > +DT_STATIC_PROBE3(refcnt, tdb, "void *", "int", "int"); > > /* > * List of all static probes > @@ -127,15 +133,24 @@ struct dt_probe *const dtps_static[] = { > &_DT_STATIC_P(smr, barrier_exit), > &_DT_STATIC_P(smr, wakeup), > &_DT_STATIC_P(smr, thread), > + /* refcnt */ > + &_DT_STATIC_P(refcnt, none), > + &_DT_STATIC_P(refcnt, inpcb), > + &_DT_STATIC_P(refcnt, tdb), > }; > > +struct dt_probe *const *dtps_index_refcnt; > + > int > dt_prov_static_init(void) > { > int i; > > - for (i = 0; i < nitems(dtps_static); i++) > + for (i = 0; i < nitems(dtps_static); i++) { > + if (dtps_static[i] == &_DT_STATIC_P(refcnt, none)) > + dtps_index_refcnt = &dtps_static[i]; > dt_dev_register_probe(dtps_static[i]); > + } > > return i; > } > Index: dev/dt/dtvar.h > === > RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dtvar.h,v > retrieving revision 1.13 > diff -u -p -r1.13 dtvar.h > --- dev/dt/dtvar.h27 Feb 2022 10:14:01 - 1.13 > +++ dev/dt/dtvar.h8 Apr 2022 09:42:19 - > @@ -313,11 +313,30 @@ extern volatile uint32_t dt_tracing; /* > #define DT_STATIC_ENTER(func, name, args...) do { > \ > extern struct dt_probe _DT_STATIC_P(func, name);\ > struct dt_probe *dtp = &_DT_STATIC_P(func, name); \ > - struct dt_provider *dtpv = dtp->dtp_prov; \ > \ > if
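To make the branch-prediction point concrete, here is a generic userland sketch (hypothetical names, not the dt(4) code) of the gateway idea: the global enable flag is tested first, so with tracing off every probe site costs a single well-predicted branch:

#include <stdint.h>

#ifndef __predict_false
#define __predict_false(x) __builtin_expect(!!(x), 0)
#endif

volatile uint32_t dt_tracing;		/* is any tracer attached? */

/* hypothetical stand-in for the real probe machinery */
static void
dt_probe_fire(void *obj, int refcnt, int delta)
{
	(void)obj; (void)refcnt; (void)delta;
}

static inline void
refcnt_trace(void *obj, int refcnt, int delta)
{
	/* gateway first: almost always false, cheap for the predictor */
	if (__predict_false(dt_tracing))
		dt_probe_fire(obj, refcnt, delta);
}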
Kill selrecord()
Now that poll(2) & select(2) use the kqueue backend under the hood we can start retiring the old machinery. The diff below does not touch driver definitions, however it: - kills selrecord() & doselwakeup() - makes it obvious that `kern.nselcoll' is now always 0 - changes all poll/select hooks to return 0 - kills a seltid check in wsdisplaystart() which is now always true In a later step we could remove the *_poll() requirement from device drivers and kill seltrue() & selfalse(). ok? Index: arch/sparc64/dev/vldcp.c === RCS file: /cvs/src/sys/arch/sparc64/dev/vldcp.c,v retrieving revision 1.22 diff -u -p -r1.22 vldcp.c --- arch/sparc64/dev/vldcp.c 24 Oct 2021 17:05:04 - 1.22 +++ arch/sparc64/dev/vldcp.c 11 Apr 2022 16:38:32 - @@ -581,44 +581,7 @@ vldcpioctl(dev_t dev, u_long cmd, caddr_ int vldcppoll(dev_t dev, int events, struct proc *p) { - struct vldcp_softc *sc; - struct ldc_conn *lc; - uint64_t head, tail, state; - int revents = 0; - int s, err; - - sc = vldcp_lookup(dev); - if (sc == NULL) - return (POLLERR); - lc = &sc->sc_lc; - - s = spltty(); - if (events & (POLLIN | POLLRDNORM)) { - err = hv_ldc_rx_get_state(lc->lc_id, &head, &tail, &state); - - if (err == 0 && state == LDC_CHANNEL_UP && head != tail) - revents |= events & (POLLIN | POLLRDNORM); - } - if (events & (POLLOUT | POLLWRNORM)) { - err = hv_ldc_tx_get_state(lc->lc_id, &head, &tail, &state); - - if (err == 0 && state == LDC_CHANNEL_UP && head != tail) - revents |= events & (POLLOUT | POLLWRNORM); - } - if (revents == 0) { - if (events & (POLLIN | POLLRDNORM)) { - cbus_intr_setenabled(sc->sc_bustag, sc->sc_rx_ino, - INTR_ENABLED); - selrecord(p, &sc->sc_rsel); - } - if (events & (POLLOUT | POLLWRNORM)) { - cbus_intr_setenabled(sc->sc_bustag, sc->sc_tx_ino, - INTR_ENABLED); - selrecord(p, &sc->sc_wsel); - } - } - splx(s); - return revents; + return 0; } void Index: dev/audio.c === RCS file: /cvs/src/sys/dev/audio.c,v retrieving revision 1.198 diff -u -p -r1.198 audio.c --- dev/audio.c 21 Mar 2022 19:22:39 - 1.198 +++ dev/audio.c 11 Apr 2022 16:38:52 - @@ -2053,17 +2053,7 @@ audio_mixer_read(struct audio_softc *sc, int audio_mixer_poll(struct audio_softc *sc, int events, struct proc *p) { - int revents = 0; - - mtx_enter(&audio_lock); - if (sc->mix_isopen && sc->mix_pending) - revents |= events & (POLLIN | POLLRDNORM); - if (revents == 0) { - if (events & (POLLIN | POLLRDNORM)) - selrecord(p, &sc->mix_sel); - } - mtx_leave(&audio_lock); - return revents; + return 0; } int @@ -2101,21 +2091,7 @@ audio_mixer_close(struct audio_softc *sc int audio_poll(struct audio_softc *sc, int events, struct proc *p) { - int revents = 0; - - mtx_enter(&audio_lock); - if ((sc->mode & AUMODE_RECORD) && sc->rec.used > 0) - revents |= events & (POLLIN | POLLRDNORM); - if ((sc->mode & AUMODE_PLAY) && sc->play.used < sc->play.len) - revents |= events & (POLLOUT | POLLWRNORM); - if (revents == 0) { - if (events & (POLLIN | POLLRDNORM)) - selrecord(p, &sc->rec.sel); - if (events & (POLLOUT | POLLWRNORM)) - selrecord(p, &sc->play.sel); - } - mtx_leave(&audio_lock); - return revents; + return 0; } int Index: dev/hotplug.c === RCS file: /cvs/src/sys/dev/hotplug.c,v retrieving revision 1.21 diff -u -p -r1.21 hotplug.c --- dev/hotplug.c 25 Dec 2020 12:59:52 - 1.21 +++ dev/hotplug.c 11 Apr 2022 16:39:24 - @@ -183,16 +183,7 @@ hotplugioctl(dev_t dev, u_long cmd, cadd int hotplugpoll(dev_t dev, int events, struct proc *p) { - int revents = 0; - - if (events & (POLLIN | POLLRDNORM)) { - if (evqueue_count > 0) - revents |= events & (POLLIN | POLLRDNORM); - else - selrecord(p, &hotplug_sel); - } - - return (revents); + return (0); } int Index: 
dev/midi.c === RCS file: /cvs/src/sys/dev/midi.c,v retrieving revision 1.54 diff -u -p -r1.54 midi.c --- dev/midi.c 6 Apr 2022 18:59:27 - 1.54 +++ dev/midi.c 11 Apr 2022 16:39:31 - @@ -334,31 +334,7 @@ done: int midipoll(dev_t dev, int events, struct proc *p) { - struct midi_softc *sc; - int
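The hooks can return 0 because the real work now happens in each driver's kqueue filter. As a userland illustration of the backend direction (a hypothetical helper, not code from the tree), poll-style readiness for one descriptor maps onto kqueue(2) like this:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <err.h>
#include <unistd.h>

/* wait until fd is readable; timeout_ms >= 0, as with poll(2) */
int
kq_wait_readable(int fd, int timeout_ms)
{
	struct kevent kev;
	struct timespec ts = { timeout_ms / 1000,
	    (timeout_ms % 1000) * 1000000L };
	int kq, n;

	if ((kq = kqueue()) == -1)
		err(1, "kqueue");
	EV_SET(&kev, fd, EVFILT_READ, EV_ADD | EV_ONESHOT, 0, 0, NULL);
	n = kevent(kq, &kev, 1, &kev, 1, &ts);	/* register and wait */
	close(kq);
	return n;	/* 1: readable, 0: timeout, -1: error */
}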
sshd_config(5): Use correct path for system-wide known_hosts
Hi The path to the system-wide known_hosts file is /etc/ssh/ssh_known_hosts and not /etc/ssh/known_hosts. See auth2-hostbased.c line 221-223. Best, Martin Index: sshd_config.5 === RCS file: /cvs/src/usr.bin/ssh/sshd_config.5,v retrieving revision 1.340 diff -u -p -r1.340 sshd_config.5 --- sshd_config.5 31 Mar 2022 17:58:44 - 1.340 +++ sshd_config.5 10 Apr 2022 20:35:39 - @@ -818,7 +818,7 @@ should ignore the user's during .Cm HostbasedAuthentication and use only the system-wide known hosts file -.Pa /etc/ssh/known_hosts . +.Pa /etc/ssh/ssh_known_hosts . The default is .Dq no . .It Cm Include
const openpty et al.
const the termp and winp arguments for openpty and related. This matches the prototypes for openpty and forkpty in glibc and musl libc. diff --git openpty.3 openpty.3 index e0c82e00c53..080d92d8ced 100644 --- openpty.3 +++ openpty.3 @@ -47,15 +47,15 @@ .Ft int .Fn getptmfd "void" .Ft int -.Fn openpty "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn openpty "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp" .Ft int -.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn fdopenpty "int ptmfd" "int *amaster" "int *aslave" "char *name" "const struct termios *termp" "const struct winsize *winp" .Ft int .Fn login_tty "int fd" .Ft pid_t -.Fn forkpty "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn forkpty "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp" .Ft pid_t -.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "struct termios *termp" "struct winsize *winp" +.Fn fdforkpty "int ptmfd" "int *amaster" "char *name" "const struct termios *termp" "const struct winsize *winp" .Sh DESCRIPTION The .Fn openpty , diff --git pty.c pty.c index c796acb5041..d59f863ff2d 100644 --- pty.c +++ pty.c @@ -50,8 +50,8 @@ getptmfd(void) } int -openpty(int *amaster, int *aslave, char *name, struct termios *termp, -struct winsize *winp) +openpty(int *amaster, int *aslave, char *name, const struct termios *termp, +const struct winsize *winp) { int ptmfd; @@ -67,7 +67,7 @@ openpty(int *amaster, int *aslave, char *name, struct termios *termp, int fdopenpty(int ptmfd, int *amaster, int *aslave, char *name, -struct termios *termp, struct winsize *winp) +const struct termios *termp, const struct winsize *winp) { int master, slave; struct ptmget ptm; @@ -97,7 +97,8 @@ fdopenpty(int ptmfd, int *amaster, int *aslave, char *name, } pid_t -forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp) +forkpty(int *amaster, char *name, const struct termios *termp, +const struct winsize *winp) { int ptmfd; pid_t pid; @@ -113,8 +114,8 @@ forkpty(int *amaster, char *name, struct termios *termp, struct winsize *winp) } pid_t -fdforkpty(int ptmfd, int *amaster, char *name, struct termios *termp, -struct winsize *winp) +fdforkpty(int ptmfd, int *amaster, char *name, const struct termios *termp, +const struct winsize *winp) { int master, slave; pid_t pid; diff --git util.h util.h index d7aa8faed4b..ecae5819e53 100644 --- util.h +++ util.h @@ -99,12 +99,14 @@ voidpw_copy(int, int, const struct passwd *, const struct passwd *); intpw_scan(char *, struct passwd *, int *); void pw_error(const char *, int, int); intgetptmfd(void); -intopenpty(int *, int *, char *, struct termios *, struct winsize *); -intfdopenpty(int, int *, int *, char *, struct termios *, - struct winsize *); +intopenpty(int *, int *, char *, const struct termios *, + const struct winsize *); +intfdopenpty(int, int *, int *, char *, const struct termios *, + const struct winsize *); intopendisk(const char *, int, char *, size_t, int); -pid_t forkpty(int *, char *, struct termios *, struct winsize *); -pid_t fdforkpty(int, int *, char *, struct termios *, struct winsize *); +pid_t forkpty(int *, char *, const struct termios *, const struct winsize *); +pid_t fdforkpty(int, int *, char *, const struct termios *, + const struct winsize *); intgetmaxpartitions(void); intgetrawpartition(void); void login_fbtab(const char *, uid_t, 
gid_t);
rcctl.8: Add missing variable
diff --git rcctl.8 rcctl.8 index 6c3048b834c..93a76a937c1 100644 --- rcctl.8 +++ rcctl.8 @@ -64,6 +64,7 @@ can be one of .Cm class , .Cm flags , .Cm logger , +.Cm rtable , .Cm status , .Cm timeout or
Re: refcount btrace
On 20/03/22(Sun) 05:39, Visa Hankala wrote: > On Sat, Mar 19, 2022 at 12:10:11AM +0100, Alexander Bluhm wrote: > > On Thu, Mar 17, 2022 at 07:25:27AM +, Visa Hankala wrote: > > > On Thu, Mar 17, 2022 at 12:42:13AM +0100, Alexander Bluhm wrote: > > > > I would like to use btrace to debug reference counting. The idea > > > > is to add a tracepoint for every type of refcnt we have. When it > > > > changes, print the actual object, the current counter and the change > > > > value. > > > > > > > Do we want that feature? > > > > > > I am against this in its current form. The code would become more > > > complex, and the trace points can affect timing. There is a risk that > > > the kernel behaves slightly differently when dt has been compiled in. > > > > On our main architectures dt(4) is in GENERIC. I see your timing > > point for uvm structures. > > In my opinion, having dt(4) enabled by default is another reason why > there should be no carte blanche for adding trace points. Each trace > point adds a tiny amount of bloat. Few users will use the tracing > facility. > > Maybe high-rate trace points could be behind a build option... The whole point of dt(4) is to be able to debug GENERIC kernel. I doubt the cost of an additional if () block matters.
rpki-client(8): properly zero terminate pretty printed key
Hi It seems the pretty printed key is zero terminated only if the size of hex stays the same or increases between calls. This diff fixes it, so it is always properly terminated. While here, also drop *hex != '\0' from the if inside the loop, as it is checked directly above in the loop condition and constify the argument, as it is not modified. Best, Martin Index: print.c === RCS file: /cvs/src/usr.sbin/rpki-client/print.c,v retrieving revision 1.5 diff -u -p -r1.5 print.c --- print.c 10 Feb 2022 17:33:28 - 1.5 +++ print.c 17 Mar 2022 17:46:01 - @@ -28,19 +28,21 @@ #include "extern.h" static const char * -pretty_key_id(char *hex) +pretty_key_id(const char *hex) { static char buf[128]; /* bigger than SHA_DIGEST_LENGTH * 3 */ size_t i; for (i = 0; i < sizeof(buf) && *hex != '\0'; i++) { - if (i % 3 == 2 && *hex != '\0') + if (i % 3 == 2) buf[i] = ':'; else buf[i] = *hex++; } if (i == sizeof(buf)) memcpy(buf + sizeof(buf) - 4, "...", 4); + else + buf[i] = '\0'; return buf; }
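The failure mode is easiest to see with a standalone harness (compiled outside rpki-client) that keeps the pre-diff logic: the static buffer is reused across calls, so a shorter key inherits the tail of the previous one:

#include <stdio.h>
#include <string.h>

static const char *
pretty_key_id_old(const char *hex)	/* pre-diff logic, for comparison */
{
	static char buf[128];
	size_t i;

	for (i = 0; i < sizeof(buf) && *hex != '\0'; i++) {
		if (i % 3 == 2 && *hex != '\0')
			buf[i] = ':';
		else
			buf[i] = *hex++;
	}
	if (i == sizeof(buf))
		memcpy(buf + sizeof(buf) - 4, "...", 4);
	/* note: no buf[i] = '\0' here; that is what the diff adds */
	return buf;
}

int
main(void)
{
	puts(pretty_key_id_old("DEADBEEF"));	/* "DE:AD:BE:EF" (buf starts zeroed) */
	puts(pretty_key_id_old("AB"));		/* "AB:AD:BE:EF": stale tail */
	return 0;
}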
wg: remove argument names from prototypes
None of the other prototypes have argument names. Index: if_wg.c === RCS file: /home/reposync/cvs/src/sys/net/if_wg.c,v retrieving revision 1.22 diff -u -p -r1.22 if_wg.c --- if_wg.c 22 Feb 2022 01:15:02 - 1.22 +++ if_wg.c 15 Mar 2022 21:10:37 - @@ -325,7 +325,7 @@ voidwg_peer_send_buf(struct wg_peer *, void wg_send_initiation(void *); void wg_send_response(struct wg_peer *); void wg_send_cookie(struct wg_softc *, struct cookie_macs *, uint32_t, - struct wg_endpoint *e); + struct wg_endpoint *); void wg_send_keepalive(void *); void wg_peer_clear_secrets(void *); void wg_handshake(struct wg_softc *, struct mbuf *); Index: wg_cookie.c === RCS file: /home/reposync/cvs/src/sys/net/wg_cookie.c,v retrieving revision 1.3 diff -u -p -r1.3 wg_cookie.c --- wg_cookie.c 10 Mar 2021 10:21:48 - 1.3 +++ wg_cookie.c 15 Mar 2022 21:09:29 - @@ -37,7 +37,7 @@ static void cookie_macs_mac2(struct cook static int cookie_timer_expired(struct timespec *, time_t, long); static voidcookie_checker_make_cookie(struct cookie_checker *, uint8_t[COOKIE_COOKIE_SIZE], struct sockaddr *); -static int ratelimit_init(struct ratelimit *, struct pool *pool); +static int ratelimit_init(struct ratelimit *, struct pool *); static voidratelimit_deinit(struct ratelimit *); static voidratelimit_gc(struct ratelimit *, int); static int ratelimit_allow(struct ratelimit *, struct sockaddr *);
two small typos
Index: if_wg.c === RCS file: /home/reposync/cvs/src/sys/net/if_wg.c,v retrieving revision 1.22 diff -u -p -r1.22 if_wg.c --- if_wg.c 22 Feb 2022 01:15:02 - 1.22 +++ if_wg.c 15 Mar 2022 21:10:37 - @@ -2023,7 +2023,7 @@ wg_input(void *_sc, struct mbuf *m, stru /* * Ensure mbuf is contiguous over full length of packet. This is done -* os we can directly read the handshake values in wg_handshake, and so +* so we can directly read the handshake values in wg_handshake, and so * we can decrypt a transport packet by passing a single buffer to * noise_remote_decrypt in wg_decap. */ Index: pf.c === RCS file: /home/reposync/cvs/src/sys/net/pf.c,v retrieving revision 1.1125 diff -u -p -r1.1125 pf.c --- pf.c5 Mar 2022 10:43:32 - 1.1125 +++ pf.c10 Mar 2022 15:53:51 - @@ -1340,7 +1340,7 @@ pf_state_expires(const struct pf_state * * this function may be called by the state purge task while * the state is being modified. avoid inconsistent reads of * state->timeout by having the caller do the read (and any -* chacks it needs to do on the same variable) and then pass +* checks it needs to do on the same variable) and then pass * their view of the timeout in here for this function to use. * the only consequence of using a stale timeout value is * that the state won't be a candidate for purging until the
Re: issue with login.conf(5) rtable and su -l user
On Sun, Mar 13, 2022 at 02:30:23PM +0100, Solene Rapenne wrote: > Hi, I'm playing with the new rtable feature in login.conf(5) but it > seems one use case doesn't trigger the rtable change. > > I have a user called alice, if I ssh locally from my user to alice > with ssh alice@localhost, alice has the correct routing table, if I use > as root "su -l alice", then alice seems to be using rtable 0. Ignoring -L which already honors rtable, su has three cases: -l (asme=0 asthem=1) -m (asme=1 asthem=0) (asme=0 asthem=0) -l should honor rtable; I am not sure about the other two. I think the least surprising would be for the neither case to honor rtable and for -m to not, but I don't have a strong opinion here. Patch as suggested below. > if it works, I'm using rtable 1 (openvpn), if not, it's using rtable 0. id -R will show the rtable directly. diff --git su.c su.c index f87e6690835..c2fbbe2724d 100644 --- su.c +++ su.c @@ -355,6 +355,8 @@ main(int argc, char **argv) flags &= ~LOGIN_SETLOGIN; } else { flags = LOGIN_SETRESOURCES|LOGIN_SETGROUP|LOGIN_SETUSER; + if (!asme) + flags |= LOGIN_SETRTABLE; if (asthem) flags |= LOGIN_SETENV|LOGIN_SETPRIORITY|LOGIN_SETUMASK; }
Swap encrypt under memory pressure
Try to allocate the buffer before doing the encryption; if the allocation fails we do not spend time doing the encryption. This reduces the pressure when swapping with low memory. ok? Index: uvm/uvm_swap.c === RCS file: /cvs/src/sys/uvm/uvm_swap.c,v retrieving revision 1.153 diff -u -p -r1.153 uvm_swap.c --- uvm/uvm_swap.c 22 Feb 2022 01:15:02 - 1.153 +++ uvm/uvm_swap.c 12 Mar 2022 10:30:26 - @@ -1690,6 +1690,26 @@ uvm_swap_io(struct vm_page **pps, int st } } + /* +* now allocate a buf for the i/o. +* [make sure we don't put the pagedaemon to sleep...] +*/ + pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT : + PR_WAITOK; + bp = pool_get(&bufpool, pflag | PR_ZERO); + + /* +* if we failed to get a swapbuf, return "try again" +*/ + if (bp == NULL) { + if (bounce) { + uvm_pagermapout(bouncekva, npages); + uvm_swap_freepages(tpps, npages); + } + uvm_pagermapout(kva, npages); + return (VM_PAGER_AGAIN); + } + /* encrypt to swap */ if (write && bounce) { int i, opages; @@ -1729,35 +1749,6 @@ uvm_swap_io(struct vm_page **pps, int st PGO_PDFREECLUST); kva = bouncekva; - } - - /* -* now allocate a buf for the i/o. -* [make sure we don't put the pagedaemon to sleep...] -*/ - pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT : - PR_WAITOK; - bp = pool_get(&bufpool, pflag | PR_ZERO); - - /* -* if we failed to get a swapbuf, return "try again" -*/ - if (bp == NULL) { - if (write && bounce) { -#ifdef UVM_SWAP_ENCRYPT - int i; - - /* swap encrypt needs cleanup */ - if (encrypt) - for (i = 0; i < npages; i++) - SWAP_KEY_PUT(sdp, SWD_KEY(sdp, - startslot + i)); -#endif - - uvm_pagermapout(kva, npages); - uvm_swap_freepages(tpps, npages); - } - return (VM_PAGER_AGAIN); } /*
Re: rpki-client: fix wrong conditional
On Thu, Mar 10, 2022 at 06:15:48PM +0100, Theo Buehler wrote: > On Thu, Mar 10, 2022 at 06:03:14PM +0100, Claudio Jeker wrote: > > On Thu, Mar 10, 2022 at 05:54:21PM +0100, Theo Buehler wrote: > > > On Thu, Mar 10, 2022 at 05:51:46PM +0100, Claudio Jeker wrote: > > > > On Thu, Mar 10, 2022 at 05:33:28PM +0100, Martin Vahlensieck wrote: > > > > > Hi > > > > > > > > > > This pulls up and adjusts the check if i exceeds the bounds of pfds. > > > > > Before it was technically wrong, as i > NPFDS means that the last > > > > > write (i == NPFDS) was already out of bounds. > > > > > > > > I see no reason to pull up the check but the if condition should indeed > > > > be > > > > greater or equal. One could consider changing this into an assert() > > > > but I > > > > think I stick with the errx(). > > > > > > Agreed. ok for the diff that just changes the checks to >= > > > > Actually I was wrong, the check needs to happen at the start of the loop, > > not at the end, else it does not work if the list has exactly the number of > > elements to fill NPFDS. > > Ah right, we need to bail out when we would actually go past the limit. > However, I see no reason to fiddle with the timeout before checking. > So let's move the check to the beginning. Here you go: Index: http.c === RCS file: /cvs/src/usr.sbin/rpki-client/http.c,v retrieving revision 1.53 diff -u -p -r1.53 http.c --- http.c 10 Feb 2022 11:10:40 - 1.53 +++ http.c 10 Mar 2022 17:28:21 - @@ -1809,6 +1809,9 @@ proc_http(char *bind_addr, int fd) timeout = INFTIM; now = getmonotime(); LIST_FOREACH(conn, &active, entry) { + if (i >= NPFDS) + errx(1, "too many connections"); + if (conn->io_time == 0) conn->io_time = now + HTTP_IO_TIMEOUT; @@ -1828,10 +1831,11 @@ proc_http(char *bind_addr, int fd) pfds[i].events = conn->events; conn->pfd = &pfds[i]; i++; - if (i > NPFDS) - errx(1, "too many connections"); } LIST_FOREACH(conn, &idle, entry) { + if (i >= NPFDS) + errx(1, "too many connections"); + if (conn->idle_time <= now) timeout = 0; else { @@ -1844,8 +1848,6 @@ proc_http(char *bind_addr, int fd) pfds[i].events = POLLIN; conn->pfd = &pfds[i]; i++; - if (i > NPFDS) - errx(1, "too many connections"); } if (poll(pfds, i, timeout) == -1) {
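As a toy model (hypothetical, standalone) of the final placement: with the old bottom-of-loop i > NPFDS test the write to the last slot has already gone out of bounds by the time the check fires, and a bottom-of-loop i >= NPFDS would falsely reject a list of exactly NPFDS entries; testing i >= NPFDS before the write handles both:

#include <err.h>

#define N 4

int slots[N];

int
fill(int nconn)
{
	int i = 0, c;

	for (c = 0; c < nconn; c++) {
		if (i >= N)			/* bail before the write */
			errx(1, "too many connections");
		slots[i++] = c;
	}
	return i;
}

int
main(void)
{
	return fill(N) == N ? 0 : 1;	/* exactly N entries succeed */
}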
rpki-client: fix wrong conditional
Hi This pulls up and adjusts the check of whether i exceeds the bounds of pfds. Before it was technically wrong, as i > NPFDS means that the last write (i == NPFDS) was already out of bounds. Best, Martin Index: http.c === RCS file: /cvs/src/usr.sbin/rpki-client/http.c,v retrieving revision 1.53 diff -u -p -r1.53 http.c --- http.c 10 Feb 2022 11:10:40 - 1.53 +++ http.c 10 Mar 2022 16:28:48 - @@ -1820,6 +1820,10 @@ proc_http(char *bind_addr, int fd) if (timeout == INFTIM || diff < timeout) timeout = diff; } + + if (i >= NPFDS) + errx(1, "too many connections"); + if (conn->state == STATE_WRITE_DATA) pfds[i].fd = conn->req->outfd; else pfds[i].fd = conn->fd; pfds[i].events = conn->events; conn->pfd = &pfds[i]; i++; - if (i > NPFDS) - errx(1, "too many connections"); } LIST_FOREACH(conn, &idle, entry) { if (conn->idle_time <= now) @@ -1840,12 +1842,14 @@ if (timeout == INFTIM || diff < timeout) timeout = diff; } + + if (i >= NPFDS) + errx(1, "too many connections"); + pfds[i].fd = conn->fd; pfds[i].events = POLLIN; conn->pfd = &pfds[i]; i++; - if (i > NPFDS) - errx(1, "too many connections"); } if (poll(pfds, i, timeout) == -1) {
[patch] LOGIN_SETENV fails when login.conf is missing
Anton spotted a doas regression failure in t-run-keepenv-path after the change to doas for LOGIN_SETALL. Since that test runs doas in a chroot and the setup does not create a login.conf, login_getclass in login_cap.c will return a login_cap_t with a NULL lc_cap (and errno set to ENOENT) on L133. setuserenv returns -1 if lc->lc_cap is NULL causing the "could not set user environment" failure. As all the login_getcap* functions as well as gsetrl and setuserpath succeed on a missing login.conf, the below patch changes setuserenv to return 0 as well. This matches what happens if the class has no setenv capability right below. The comment is taken from gsetrl. diff --git gen/login_cap.c gen/login_cap.c index 67933653349..c142c0a7503 100644 --- gen/login_cap.c +++ gen/login_cap.c @@ -700,8 +700,11 @@ setuserpath(login_cap_t *lc, const struct passwd *pwd) char *path = NULL, *opath = NULL, *op, *np; int len, error; + /* +* If we have no capabilities then set _PATH_DEFPATH. +*/ if (lc->lc_cap == NULL) - goto setit; /* impossible */ + goto setit; if ((len = cgetustr(lc->lc_cap, "path", &path)) <= 0) goto setit; @@ -753,8 +756,12 @@ setuserenv(login_cap_t *lc, const struct passwd *pwd) char *beg, *end, *ep, *list, *value; int len, error; + /* +* If we have no capabilities then there is nothing to do and +* we can just return success. +*/ if (lc->lc_cap == NULL) - return (-1);/* impossible */ + return (0); if ((len = cgetustr(lc->lc_cap, "setenv", &list)) <= 0) return (0);
Re: Add rtable capability to login.conf
On Fri, Feb 18, 2022 at 03:25:51PM -0500, Ted Unangst wrote: > On 2022-02-06, Ted Unangst wrote: > > On 2022-02-05, Matthew Martin wrote: > > > On Sat, Jan 29, 2022 at 06:25:32PM -0600, Matthew Martin wrote: > > > > On Sat, Jan 29, 2022 at 07:10:00PM -0500, Ted Unangst wrote: > > > > > I believe it would be better to add setrtable to id pledge. > > > > > > ping > > > > > > Also are there any opinions on adding LOGIN_SETRTABLE to doas? > > > > I think this diff looks fine. > > > > For doas, we can use setall with an extra note in the man page. > > Final auction for oks. I think all the login.conf.d changes are in now. > > Plan is add setrtable to pledge first so people don't get caught, then libc. ping? > > Index: doas.1 > > === > > RCS file: /home/cvs/src/usr.bin/doas/doas.1,v > > retrieving revision 1.25 > > diff -u -p -r1.25 doas.1 > > --- doas.1 16 Jan 2021 09:18:41 - 1.25 > > +++ doas.1 6 Feb 2022 18:41:53 - > > @@ -54,6 +54,8 @@ and > > and the > > .Xr umask 2 > > are set to values appropriate for the target user. > > +Other values may also be set as specified in > > +.Pa /etc/login.conf . > > .Ev DOAS_USER > > is set to the name of the user executing > > .Nm . > > Index: doas.c > > === > > RCS file: /home/cvs/src/usr.bin/doas/doas.c,v > > retrieving revision 1.93 > > diff -u -p -r1.93 doas.c > > --- doas.c 30 Nov 2021 20:08:15 - 1.93 > > +++ doas.c 6 Feb 2022 18:39:38 - > > @@ -450,10 +450,7 @@ main(int argc, char **argv) > > if (targpw == NULL) > > errx(1, "no passwd entry for target"); > > > > - if (setusercontext(NULL, targpw, target, LOGIN_SETGROUP | > > - LOGIN_SETPATH | > > - LOGIN_SETPRIORITY | LOGIN_SETRESOURCES | LOGIN_SETUMASK | > > - LOGIN_SETUSER) != 0) > > + if (setusercontext(NULL, targpw, target, LOGIN_SETALL) == -1) > > errx(1, "failed to set user context for target"); > > > > if (pledge("stdio rpath exec", NULL) == -1) > > > >
Re: yacc(1): skeleton.c: remove outdated comment
ping, diff reattached On Thu, Feb 10, 2022 at 04:29:53PM +0100, Martin Vahlensieck wrote: > Hi > > yysccsid was removed in 1.30 back in 2009. > > Best, > > Martin > > Index: skeleton.c > === > RCS file: /cvs/src/usr.bin/yacc/skeleton.c,v > retrieving revision 1.40 > diff -u -p -r1.40 skeleton.c > --- skeleton.c3 Feb 2021 01:10:10 - 1.40 > +++ skeleton.c10 Feb 2022 15:29:08 - > @@ -35,10 +35,6 @@ > > #include "defs.h" > > -/* The definition of yysccsid in the banner should be replaced with */ > -/* a #pragma ident directive if the target C compiler supports > */ > -/* #pragma ident directives. > */ > -/* */ > /* If the skeleton is changed, the banner should be changed so that */ > /* the altered version can be easily distinguished from the original. > */ > /* */ > Index: skeleton.c === RCS file: /cvs/src/usr.bin/yacc/skeleton.c,v retrieving revision 1.40 diff -u -p -r1.40 skeleton.c --- skeleton.c 3 Feb 2021 01:10:10 - 1.40 +++ skeleton.c 10 Feb 2022 15:29:08 - @@ -35,10 +35,6 @@ #include "defs.h" -/* The definition of yysccsid in the banner should be replaced with */ -/* a #pragma ident directive if the target C compiler supports */ -/* #pragma ident directives. */ -/* */ /* If the skeleton is changed, the banner should be changed so that */ /* the altered version can be easily distinguished from the original. */ /* */
ssh(1): monitor.c: save a xstrdup
Hi Tiny diff to save an allocation. Best, Martin Index: monitor.c === RCS file: /cvs/src/usr.bin/ssh/monitor.c,v retrieving revision 1.231 diff -u -p -r1.231 monitor.c --- monitor.c 28 Jan 2022 06:18:42 - 1.231 +++ monitor.c 23 Feb 2022 16:49:27 - @@ -658,9 +658,8 @@ mm_answer_pwnamallow(struct ssh *ssh, in pwent = getpwnamallow(ssh, username); - authctxt->user = xstrdup(username); + authctxt->user = username; setproctitle("%s [priv]", pwent ? username : "unknown"); - free(username); sshbuf_reset(m);
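The underlying pattern, as a tiny standalone sketch with made-up names: when the caller already owns the heap string, transfer the pointer instead of duplicating and freeing it:

#include <stdlib.h>
#include <string.h>

struct ctx { char *user; };

static void
set_user(struct ctx *c, char *name)
{
	/* before: c->user = strdup(name); free(name); (extra alloc+copy)
	 * after: transfer ownership; the caller must no longer use name */
	c->user = name;
}

int
main(void)
{
	struct ctx c;

	set_user(&c, strdup("alice"));
	free(c.user);		/* freed exactly once, by the new owner */
	return 0;
}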
Re: if_wg: Missing DPRINTF newline
On Thu, Feb 03, 2022 at 07:53:43AM +, Jason McIntyre wrote: > On Wed, Feb 02, 2022 at 07:46:39PM -0600, Matthew Martin wrote: > > Two DPRINTFs in sys/net/if_wg.c are missing a newline. > > > > if this is committed, the committer may also want to question the dodgy > capitals in "Delay Unsupported". > > jmc ping diff --git if_wg.c if_wg.c index 13c48f42c54..a3efd577dbc 100644 --- if_wg.c +++ if_wg.c @@ -2156,7 +2156,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa, } if (m->m_pkthdr.ph_loopcnt++ > M_MAXLOOP) { - DPRINTF(sc, "Packet looped"); + DPRINTF(sc, "Packet looped\n"); ret = ELOOP; goto error; } @@ -2169,7 +2169,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa, * another aip_lookup in wg_qstart, or refcnting as mentioned before. */ if (m->m_pkthdr.pf.delay > 0) { - DPRINTF(sc, "PF Delay Unsupported"); + DPRINTF(sc, "PF Delay Unsupported\n"); ret = EOPNOTSUPP; goto error; }
Re: more MAKEDEV cleanup
On 05/04/21(Mon) 09:25, Miod Vallat wrote: > The following diff attempts to clean up a few loose ends in the current > MAKEDEV files: > > - remove no-longer applicable device definitions (MSCP and SMD disks, > this kind of thing). > - makes sure all platforms use the same `ramdisk' target for > installation media devices, rather than a mix of `ramd' and `ramdisk'. > - moves as many `ramdisk' devices to MI land (bio, diskmap, random, > etc). > - reduces the number of block devices in `ramdisk' targets to only one > per device, since the installer script will invoke MAKEDEV by itself > for the devices it needs to use. > - sort device names in `all' and `ramdisk' MI lists to make maintainence > easier. This causes some ordering change in the `all' target in the > generated MAKEDEVs. What happened to this? > Index: MAKEDEV.common > === > RCS file: /OpenBSD/src/etc/MAKEDEV.common,v > retrieving revision 1.113 > diff -u -p -r1.113 MAKEDEV.common > --- MAKEDEV.common12 Feb 2021 10:26:33 - 1.113 > +++ MAKEDEV.common5 Apr 2021 09:18:49 - > @@ -114,7 +114,7 @@ dnl make a 'disktgt' macro that automati > dnl disktgt(rd, {-rd-}) > dnl > dnl target(all,rd,0) > -dnl target(ramd,rd,0) > +dnl target(ramdisk,rd,0) > dnl disk_q(rd) > dnl __devitem(rd, {-rd*-}, {-rd-})dnl > dnl > @@ -122,62 +122,60 @@ dnl Note: not all devices are generated > dnlits own extra list. > dnl > divert(1)dnl > +target(all, acpi)dnl > +target(all, apm)dnl > +target(all, bio)dnl > +target(all, bpf)dnl > +twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl > +twrget(all, czs, cua, a, b, c, d)dnl > +target(all, diskmap)dnl > +target(all, dt)dnl > twrget(all, fdesc, fd)dnl > -target(all, st, 0, 1)dnl > -target(all, std)dnl > -target(all, ra, 0, 1, 2, 3)dnl > -target(all, rx, 0, 1)dnl > -target(all, wd, 0, 1, 2, 3)dnl > -target(all, xd, 0, 1, 2, 3)dnl > +target(all, fuse)dnl > +target(all, hotplug)dnl > +target(all, joy, 0, 1)dnl > +target(all, kcov)dnl > +target(all, kstat)dnl > +target(all, local)dnl > +target(all, lpt, 0, 1, 2)dnl > +twrget(all, lpt, lpa, 0, 1, 2)dnl > +target(all, par, 0)dnl > +target(all, pci, 0, 1, 2, 3)dnl > target(all, pctr)dnl > target(all, pctr0)dnl > target(all, pf)dnl > -target(all, apm)dnl > -target(all, acpi)dnl > +target(all, pppac)dnl > +target(all, pppx)dnl > +target(all, ptm)dnl > +target(all, pty, 0)dnl > +target(all, pvbus, 0, 1)dnl > +target(all, radio, 0)dnl > +target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl > +twrget(all, rnd, random)dnl > +twrget(all, speak, speaker)dnl > +target(all, st, 0, 1)dnl > +target(all, std)dnl > +target(all, switch, 0, 1, 2, 3)dnl > +target(all, tap, 0, 1, 2, 3)dnl > twrget(all, tth, ttyh, 0, 1)dnl > target(all, ttyA, 0, 1)dnl > -twrget(all, mac_tty0, tty0, 0, 1)dnl > -twrget(all, tzs, tty, a, b, c, d)dnl > -twrget(all, czs, cua, a, b, c, d)dnl > target(all, ttyc, 0, 1, 2, 3, 4, 5, 6, 7)dnl > -twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl > -twrget(all, mmcl, mmclock)dnl > -target(all, lpt, 0, 1, 2)dnl > -twrget(all, lpt, lpa, 0, 1, 2)dnl > -target(all, joy, 0, 1)dnl > -twrget(all, rnd, random)dnl > -target(all, uk, 0)dnl > -twrget(all, vi, video, 0, 1)dnl > -twrget(all, speak, speaker)dnl > -target(all, asc, 0)dnl > -target(all, radio, 0)dnl > +target(all, tun, 0, 1, 2, 3)dnl > target(all, tuner, 0)dnl > -target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl > +twrget(all, tzs, tty, a, b, c, d)dnl > target(all, uall)dnl > -target(all, pci, 0, 1, 2, 3)dnl > -twrget(all, wsmouse, wscons)dnl > -target(all, par, 0)dnl > -target(all, apci, 0)dnl > 
-target(all, local)dnl > -target(all, ptm)dnl > -target(all, hotplug)dnl > -target(all, pppx)dnl > -target(all, pppac)dnl > -target(all, fuse)dnl > +target(all, uk, 0)dnl > +twrget(all, vi, video, 0, 1)dnl > target(all, vmm)dnl > -target(all, pvbus, 0, 1)dnl > -target(all, bpf)dnl > -target(all, kcov)dnl > -target(all, dt)dnl > -target(all, kstat)dnl > +target(all, vnd, 0, 1, 2, 3)dnl > +target(all, vscsi, 0)dnl > +target(all, wd, 0, 1, 2, 3)dnl > +twrget(all, wsmouse, wscons)dnl > dnl > _mkdev(all, {-all-}, {-dnl > show_target(all)dnl > -})dnl > dnl > -dnl XXX some arches use ramd, others ramdisk - needs to be fixed eventually > -__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl > -dnl > target(usb, usb, 0, 1, 2, 3, 4, 5, 6, 7)dnl > target(usb, uhid, 0, 1, 2, 3, 4, 5, 6, 7)dnl > twrget(usb, fido, fido)dnl > @@ -208,26 +206,26 @@ __devitem(ch, {-ch*-}, SCSI media change > _mcdev(ch, ch*, ch, {-major_ch_c-}, 660, operator)dnl > __devitem(uk, uk*, Unknown SCSI devices)dnl > _mcdev(uk, uk*, uk, {-major_uk_c-}, 640, operator)dnl > -dnl XXX see ramdisk above > -__devitem(ramd, ramdisk, Ramdisk kernel devices,nothing)dnl > dnl > -_mkdev(ramd, ramdisk, {-dnl > -show_target(ramd)dnl > +__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl > +_mkdev(ramdisk, ramdisk, {-dnl >
yacc(1): skeleton.c: remove outdated comment
Hi yysccsid was removed in 1.30 back in 2009. Best, Martin Index: skeleton.c === RCS file: /cvs/src/usr.bin/yacc/skeleton.c,v retrieving revision 1.40 diff -u -p -r1.40 skeleton.c --- skeleton.c 3 Feb 2021 01:10:10 - 1.40 +++ skeleton.c 10 Feb 2022 15:29:08 - @@ -35,10 +35,6 @@ #include "defs.h" -/* The definition of yysccsid in the banner should be replaced with */ -/* a #pragma ident directive if the target C compiler supports */ -/* #pragma ident directives. */ -/* */ /* If the skeleton is changed, the banner should be changed so that */ /* the altered version can be easily distinguished from the original. */ /* */
Re: Add rtable capability to login.conf
On Sat, Jan 29, 2022 at 06:25:32PM -0600, Matthew Martin wrote: > On Sat, Jan 29, 2022 at 07:10:00PM -0500, Ted Unangst wrote: > > I believe it would be better to add setrtable to id pledge. ping Also are there any opinions on adding LOGIN_SETRTABLE to doas? diff --git include/login_cap.h include/login_cap.h index d9a4c2c349c..1e831b6471a 100644 --- include/login_cap.h +++ include/login_cap.h @@ -53,7 +53,8 @@ #defineLOGIN_SETUMASK 0x0020 /* Set umask */ #defineLOGIN_SETUSER 0x0040 /* Set user */ #defineLOGIN_SETENV0x0080 /* Set environment */ -#defineLOGIN_SETALL0x00ff /* Set all. */ +#defineLOGIN_SETRTABLE 0x0100 /* Set rtable */ +#defineLOGIN_SETALL0x01ff /* Set all. */ #defineBI_AUTH "authorize" /* Accepted authentication */ #defineBI_REJECT "reject"/* Rejected authentication */ diff --git lib/libc/gen/login_cap.c lib/libc/gen/login_cap.c index 862f33b2065..65848f09ef6 100644 --- lib/libc/gen/login_cap.c +++ lib/libc/gen/login_cap.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include @@ -574,7 +575,7 @@ int setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags) { login_cap_t *flc; - quad_t p; + quad_t p, rtable; int i; flc = NULL; @@ -625,6 +626,14 @@ setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags) umask((mode_t)p); } + if (flags & LOGIN_SETRTABLE) { + rtable = login_getcapnum(lc, "rtable", 0, 0); + + if (setrtable((int)rtable) == -1) { + syslog(LOG_ERR, "%s: setrtable: %m", lc->lc_class); + } + } + if (flags & LOGIN_SETGROUP) { if (setresgid(pwd->pw_gid, pwd->pw_gid, pwd->pw_gid) == -1) { syslog(LOG_ERR, "setresgid(%u,%u,%u): %m", diff --git lib/libc/sys/pledge.2 lib/libc/sys/pledge.2 index 581d274822c..ff3aa59f9e3 100644 --- lib/libc/sys/pledge.2 +++ lib/libc/sys/pledge.2 @@ -514,7 +514,8 @@ process: .Xr setlogin 2 , .Xr setrlimit 2 , .Xr getpriority 2 , -.Xr setpriority 2 +.Xr setpriority 2 , +.Xr setrtable 2 .It Cm pf Allows a subset of .Xr ioctl 2 diff --git share/man/man5/login.conf.5 share/man/man5/login.conf.5 index da935fa223e..ffeafc3531c 100644 --- share/man/man5/login.conf.5 +++ share/man/man5/login.conf.5 @@ -276,6 +276,10 @@ Initial priority (nice) level. Require home directory to login. .\" .Pp +.It rtable Ta number Ta Dv 0 Ta +Rtable to be set for the class. +.\" +.Pp .It setenv Ta envlist Ta "" Ta A list of environment variables and associated values to be set for the class. .\" diff --git sys/kern/kern_pledge.c sys/kern/kern_pledge.c index b876b91a18a..26b5e1fab96 100644 --- sys/kern/kern_pledge.c +++ sys/kern/kern_pledge.c @@ -300,6 +300,7 @@ const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = { [SYS_setresgid] = PLEDGE_ID, [SYS_setgroups] = PLEDGE_ID, [SYS_setlogin] = PLEDGE_ID, + [SYS_setrtable] = PLEDGE_ID, [SYS_unveil] = PLEDGE_UNVEIL,
Re: uvm_unmap_kill_entry(): unwire with map lock held
On 04/02/22(Fri) 03:39, Klemens Nanni wrote: > [...] > ... with the lock grabbed in uvm_map_teardown() that is, otherwise > the first call path can lock against itself (regress/misc/posixtestsuite > is a reproducer for this): > > vm_map_lock_read_ln+0x38 > uvm_fault_unwire+0x58 > uvm_unmap_kill_entry_withlock+0x68 > uvm_unmap_remove+0x2d4 > sys_munmap+0x11c > > which is obvious in hindsight. > > So grabbing the lock in uvm_map_teardown() instead avoids that while > still ensuring a locked map in the path missing a lock. This should be fine since this function should only be called when the last reference of a map is dropped. In other words the locking here is necessary to satisfy the assertions. I wonder if the lock shouldn't be taken & released around uvm_map_teardown(), which would make it easier to see that this is called after the last refcount decrement. > Index: uvm_map.c > === > RCS file: /cvs/src/sys/uvm/uvm_map.c,v > retrieving revision 1.282 > diff -u -p -r1.282 uvm_map.c > --- uvm_map.c 21 Dec 2021 22:21:32 - 1.282 > +++ uvm_map.c 4 Feb 2022 02:51:00 - > @@ -2734,6 +2751,7 @@ uvm_map_teardown(struct vm_map *map) > KERNEL_ASSERT_UNLOCKED(); > > KASSERT((map->flags & VM_MAP_INTRSAFE) == 0); > + vm_map_lock(map); > > /* Remove address selectors. */ > uvm_addr_destroy(map->uaddr_exe); >
if_wg: Missing DPRINTF newline
Two DPRINTFs in sys/net/if_wg.c are missing a newline. diff --git if_wg.c if_wg.c index 13c48f42c54..a3efd577dbc 100644 --- if_wg.c +++ if_wg.c @@ -2156,7 +2156,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa, } if (m->m_pkthdr.ph_loopcnt++ > M_MAXLOOP) { - DPRINTF(sc, "Packet looped"); + DPRINTF(sc, "Packet looped\n"); ret = ELOOP; goto error; } @@ -2169,7 +2169,7 @@ wg_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa, * another aip_lookup in wg_qstart, or refcnting as mentioned before. */ if (m->m_pkthdr.pf.delay > 0) { - DPRINTF(sc, "PF Delay Unsupported"); + DPRINTF(sc, "PF Delay Unsupported\n"); ret = EOPNOTSUPP; goto error; }
Re: uvm_unmap_kill_entry(): unwire with map lock held
On 31/01/22(Mon) 10:24, Klemens Nanni wrote: > Running with my uvm assert diff showed that uvm_fault_unwire_locked() > was called without any locks held. > > This happened when I rebooted my machine: > > uvm_fault_unwire_locked() > uvm_unmap_kill_entry_withlock() > uvm_unmap_kill_entry() > uvm_map_teardown() > uvmspace_free() > > This code does not corrupt anything because > uvm_unmap_kill_entry_withlock() is grabbing the kernel lock around its > uvm_fault_unwire_locked() call. > > But regardless of the kernel lock dances in this code path, the uvm map > ought to be protected by its own lock. uvm_fault_unwire() does that. > > uvm_fault_unwire_locked()'s comment says the map must at least be read > locked, which is what all other code paths to that function do. > > This makes my latest assert diff happy in the reboot case (it did not > always hit that assert). I'm happy your asserts found a first bug. I'm afraid calling this function below could result in grabbing the lock twice. Can we be sure this doesn't happen? uvm_unmap_kill_entry() is called in many different contexts and this is currently a mess. I don't know what NetBSD did in this area but it is worth looking at to see if there isn't a good idea to untangle this. > Index: uvm_map.c > === > RCS file: /cvs/src/sys/uvm/uvm_map.c,v > retrieving revision 1.282 > diff -u -p -r1.282 uvm_map.c > --- uvm_map.c 21 Dec 2021 22:21:32 - 1.282 > +++ uvm_map.c 31 Jan 2022 09:28:04 - > @@ -2132,7 +2143,7 @@ uvm_unmap_kill_entry_withlock(struct vm_ > if (VM_MAPENT_ISWIRED(entry)) { > KERNEL_LOCK(); > entry->wired_count = 0; > - uvm_fault_unwire_locked(map, entry->start, entry->end); > + uvm_fault_unwire(map, entry->start, entry->end); > KERNEL_UNLOCK(); > } > >
Re: Add rtable capability to login.conf
On Sat, Jan 29, 2022 at 07:10:00PM -0500, Ted Unangst wrote: > I believe it would be better to add setrtable to id pledge. Makes the diff quite a bit shorter too. diff --git include/login_cap.h include/login_cap.h index d9a4c2c349c..1e831b6471a 100644 --- include/login_cap.h +++ include/login_cap.h @@ -53,7 +53,8 @@ #defineLOGIN_SETUMASK 0x0020 /* Set umask */ #defineLOGIN_SETUSER 0x0040 /* Set user */ #defineLOGIN_SETENV0x0080 /* Set environment */ -#defineLOGIN_SETALL0x00ff /* Set all. */ +#defineLOGIN_SETRTABLE 0x0100 /* Set rtable */ +#defineLOGIN_SETALL0x01ff /* Set all. */ #defineBI_AUTH "authorize" /* Accepted authentication */ #defineBI_REJECT "reject"/* Rejected authentication */ diff --git lib/libc/gen/login_cap.c lib/libc/gen/login_cap.c index 862f33b2065..65848f09ef6 100644 --- lib/libc/gen/login_cap.c +++ lib/libc/gen/login_cap.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include @@ -574,7 +575,7 @@ int setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags) { login_cap_t *flc; - quad_t p; + quad_t p, rtable; int i; flc = NULL; @@ -625,6 +626,14 @@ setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags) umask((mode_t)p); } + if (flags & LOGIN_SETRTABLE) { + rtable = login_getcapnum(lc, "rtable", 0, 0); + + if (setrtable((int)rtable) == -1) { + syslog(LOG_ERR, "%s: setrtable: %m", lc->lc_class); + } + } + if (flags & LOGIN_SETGROUP) { if (setresgid(pwd->pw_gid, pwd->pw_gid, pwd->pw_gid) == -1) { syslog(LOG_ERR, "setresgid(%u,%u,%u): %m", diff --git lib/libc/sys/pledge.2 lib/libc/sys/pledge.2 index 581d274822c..ff3aa59f9e3 100644 --- lib/libc/sys/pledge.2 +++ lib/libc/sys/pledge.2 @@ -514,7 +514,8 @@ process: .Xr setlogin 2 , .Xr setrlimit 2 , .Xr getpriority 2 , -.Xr setpriority 2 +.Xr setpriority 2 , +.Xr setrtable 2 .It Cm pf Allows a subset of .Xr ioctl 2 diff --git share/man/man5/login.conf.5 share/man/man5/login.conf.5 index da935fa223e..ffeafc3531c 100644 --- share/man/man5/login.conf.5 +++ share/man/man5/login.conf.5 @@ -276,6 +276,10 @@ Initial priority (nice) level. Require home directory to login. .\" .Pp +.It rtable Ta number Ta Dv 0 Ta +Rtable to be set for the class. +.\" +.Pp .It setenv Ta envlist Ta "" Ta A list of environment variables and associated values to be set for the class. .\" diff --git sys/kern/kern_pledge.c sys/kern/kern_pledge.c index 6687bf91f09..ead40b2df6f 100644 --- sys/kern/kern_pledge.c +++ sys/kern/kern_pledge.c @@ -300,6 +300,7 @@ const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = { [SYS_setresgid] = PLEDGE_ID, [SYS_setgroups] = PLEDGE_ID, [SYS_setlogin] = PLEDGE_ID, + [SYS_setrtable] = PLEDGE_ID, [SYS_unveil] = PLEDGE_UNVEIL,
Add rtable capability to login.conf
It would be nice to have the ability to set a user's rtable upon login. This would be useful both for road warrior VPN setups (put both the VPN interface and user in an rdomain other than 0) and to differentiate users in firewall rules on the gateway or unbound views on a resolver. The below patch adds an rtable capability to login.conf which sets the user's default rtable on login. Since setusercontext will now call setrtable, that syscall needs to be covered under some pledge promise. Since the wroute promise already allows programs to set a different rtable on a per socket basis, I don't think adding the setrtable syscall materially changes the scope or attack surface. This impacts dhcpleased, iked, rad, and slaacd which already use the wroute promise. The wroute promise is added to calendar, skeyaudit, su, cron, and inetd. chroot, login, ssh, nsd, and unbound are impacted, but don't show up in the diff because they don't use pledge. ssh continues to work as expected: the user's session has the login-specified default rtable; however, the socket connecting the user to the session remains in the original rtable so the connection doesn't break. nsd and unbound open their listening sockets prior to privdrop, so those sockets are in the original rtable (which should be set via rcctl). Notifications and XFRs for nsd and lookups for unbound happen in the login-specified rtable. While this is somewhat surprising, I think most people would set the rtable for a daemon via rcctl. doas is a bit of a special case: it explicitly lists the LOGIN_ flags rather than using LOGIN_SETALL and removing specific flags. I am inclined to say doas should also set the default rtable, but it's not a strong stance. I've left it out of this version of the patch pending feedback on how others feel. In addition xenodm would need to add wroute to the pledge in dm.c StartDisplay and to the first pledge in session.c StartClient. From a Debian code search the ports which use LOGIN_SETALL appear to be mail/alpine, security/sudo, and x11/gnome/gdm. None of these ports have pledge markers in their Makefile. Finally it may be nicer to drop -a from calendar and skeyaudit. Each user that desires the service could then create their own cron job. While it would allow a fair bit of code to be deleted and a tighter pledge, it would break existing workflows and could be considered in a separate thread if desired. - Matthew Martin diff --git include/login_cap.h include/login_cap.h index d9a4c2c349c..1e831b6471a 100644 --- include/login_cap.h +++ include/login_cap.h @@ -53,7 +53,8 @@ #define LOGIN_SETUMASK 0x0020 /* Set umask */ #define LOGIN_SETUSER 0x0040 /* Set user */ #define LOGIN_SETENV 0x0080 /* Set environment */ -#define LOGIN_SETALL 0x00ff /* Set all. */ +#define LOGIN_SETRTABLE 0x0100 /* Set rtable */ +#define LOGIN_SETALL 0x01ff /* Set all. 
*/ #defineBI_AUTH "authorize" /* Accepted authentication */ #defineBI_REJECT "reject"/* Rejected authentication */ diff --git lib/libc/gen/login_cap.c lib/libc/gen/login_cap.c index 862f33b2065..65848f09ef6 100644 --- lib/libc/gen/login_cap.c +++ lib/libc/gen/login_cap.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include @@ -574,7 +575,7 @@ int setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags) { login_cap_t *flc; - quad_t p; + quad_t p, rtable; int i; flc = NULL; @@ -625,6 +626,14 @@ setusercontext(login_cap_t *lc, struct passwd *pwd, uid_t uid, u_int flags) umask((mode_t)p); } + if (flags & LOGIN_SETRTABLE) { + rtable = login_getcapnum(lc, "rtable", 0, 0); + + if (setrtable((int)rtable) == -1) { + syslog(LOG_ERR, "%s: setrtable: %m", lc->lc_class); + } + } + if (flags & LOGIN_SETGROUP) { if (setresgid(pwd->pw_gid, pwd->pw_gid, pwd->pw_gid) == -1) { syslog(LOG_ERR, "setresgid(%u,%u,%u): %m", diff --git share/man/man5/login.conf.5 share/man/man5/login.conf.5 index da935fa223e..ffeafc3531c 100644 --- share/man/man5/login.conf.5 +++ share/man/man5/login.conf.5 @@ -276,6 +276,10 @@ Initial priority (nice) level. Require home directory to login. .\" .Pp +.It rtable Ta number Ta Dv 0 Ta +Rtable to be set for the class. +.\" +.Pp .It setenv Ta envlist Ta "" Ta A list of environment variables and associated values to be set for the class. .\" diff --git sys/kern/kern_pledge.c sys/kern/kern_pledge.c index 6687bf91f09..901d41cefb6 100644 --- sys/kern/kern_pledge.c +++ sys/kern/kern_pledge.c @@ -373,6 +373,8 @@ const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = { [SY
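For reference, a hypothetical /etc/login.conf entry using the new capability could look like this (class name and rtable id invented for the example):

vpn:\
	:rtable=1:\
	:tc=default:

Users would then be attached to the class with usermod -L vpn, and cap_mkdb /etc/login.conf re-run if /etc/login.conf.db is in use.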
Re: unlock mmap(2) for anonymous mappings
On 24/01/22(Mon) 12:06, Klemens Nanni wrote: > On Sun, Jan 16, 2022 at 09:22:50AM -0300, Martin Pieuchot wrote: > > IMHO this approach of let's try if it works now and revert if it isn't > > doesn't help us make progress. I'd be more confident seeing diffs that > > assert for the right lock in the functions called by uvm_mapanon() and > > documentation about which lock is protecting which field of the data > > structures. > > I picked `vm_map's `size' as first underdocumented member. > All accesses to it are protected by vm_map_lock(), either through the > function itself or implicitly by already requiring the calling function > to lock the map. Could we use a vm_map_assert_locked() or something similar that reflects the exclusive or shared (read) lock comments? I don't trust comments. It's too easy to miss a lock in a code path. > So annotate functions using `size' wrt. the required lock. > > Index: uvm_map.c > === > RCS file: /cvs/src/sys/uvm/uvm_map.c,v > retrieving revision 1.282 > diff -u -p -r1.282 uvm_map.c > --- uvm_map.c 21 Dec 2021 22:21:32 - 1.282 > +++ uvm_map.c 21 Jan 2022 13:03:05 - > @@ -805,6 +805,8 @@ uvm_map_sel_limits(vaddr_t *min, vaddr_t > * Fills in *start_ptr and *end_ptr to be the first and last entry describing > * the space. > * If called with prefilled *start_ptr and *end_ptr, they are to be correct. > + * > + * map must be at least read-locked. > */ > int > uvm_map_isavail(struct vm_map *map, struct uvm_addr_state *uaddr, > @@ -2206,6 +2208,8 @@ uvm_unmap_kill_entry(struct vm_map *map, > * If remove_holes, then remove ET_HOLE entries as well. > * If markfree, entry will be properly marked free, otherwise, no replacement > * entry will be put in the tree (corrupting the tree). > + * > + * map must be locked. > */ > void > uvm_unmap_remove(struct vm_map *map, vaddr_t start, vaddr_t end, > @@ -2976,6 +2980,9 @@ uvm_tree_sanity(struct vm_map *map, char > UVM_ASSERT(map, addr == vm_map_max(map), file, line); > } > > +/* > + * map must be at least read-locked. > + */ > void > uvm_tree_size_chk(struct vm_map *map, char *file, int line) > { > Index: uvm_map.h > === > RCS file: /cvs/src/sys/uvm/uvm_map.h,v > retrieving revision 1.71 > diff -u -p -r1.71 uvm_map.h > --- uvm_map.h 15 Dec 2021 12:53:53 - 1.71 > +++ uvm_map.h 21 Jan 2022 12:51:26 - > @@ -272,7 +272,7 @@ struct vm_map { > > struct uvm_map_addr addr; /* [v] Entry tree, by addr */ > > - vsize_t size; /* virtual size */ > + vsize_t size; /* [v] virtual size */ > int ref_count; /* [a] Reference count */ > int flags; /* flags */ > struct mutex flags_lock; /* flags lock */ >
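A sketch of what such a vm_map_assert_locked() could look like, assuming the rwlock/mutex pair in uvm_map.h; this is an illustration of the idea, not committed code:

/*
 * sketch: assert the caller holds the map lock, shared or exclusive;
 * intrsafe maps use the mutex, others the rwlock (assumed layout)
 */
void
vm_map_assert_locked(struct vm_map *map)
{
	if (map->flags & VM_MAP_INTRSAFE)
		MUTEX_ASSERT_LOCKED(&map->mtx);
	else
		rw_assert_anylock(&map->lock);
}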
Re: DDBPROF: move option to amd64,i386 GENERIC
On 18/01/22(Tue) 04:38, Klemens Nanni wrote: > While intended for more architectures, DDBPROF is strictly amd64 and > i386 only, so the machine-independent sys/conf/GENERIC does not seem fit > (until all architectures are supported). This define should die. There's no need to polish this turd. Somebody has to stand up and turn this into a "#if NDT > 0" with the audit that goes with it. > Index: conf/GENERIC > === > RCS file: /cvs/src/sys/conf/GENERIC,v > retrieving revision 1.281 > diff -u -p -r1.281 GENERIC > --- conf/GENERIC 23 Dec 2021 10:04:14 - 1.281 > +++ conf/GENERIC 18 Jan 2022 04:28:29 - > @@ -4,7 +4,6 @@ > #GENERIC kernel > > option DDB # in-kernel debugger > -#option DDBPROF # ddb(4) based profiling > #option DDB_SAFE_CONSOLE # allow break into ddb during boot > #makeoptions DEBUG=""# do not compile full symbol table > #makeoptions PROF="-pg" # build profiled kernel > Index: arch/amd64/conf/GENERIC > === > RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v > retrieving revision 1.510 > diff -u -p -r1.510 GENERIC > --- arch/amd64/conf/GENERIC 4 Jan 2022 05:50:43 - 1.510 > +++ arch/amd64/conf/GENERIC 18 Jan 2022 04:28:04 - > @@ -13,6 +13,8 @@ machine amd64 > include "../../../conf/GENERIC" > maxusers 80 # estimated number of users > > +#option DDBPROF # ddb(4) based profiling > + > option USER_PCICONF# user-space PCI configuration > > option APERTURE# in-kernel aperture driver for XFree86 > Index: arch/i386/conf/GENERIC > === > RCS file: /cvs/src/sys/arch/i386/conf/GENERIC,v > retrieving revision 1.860 > diff -u -p -r1.860 GENERIC > --- arch/i386/conf/GENERIC2 Jan 2022 23:14:27 - 1.860 > +++ arch/i386/conf/GENERIC18 Jan 2022 04:28:26 - > @@ -13,6 +13,8 @@ machine i386 > include "../../../conf/GENERIC" > maxusers 80 # estimated number of users > > +#option DDBPROF # ddb(4) based profiling > + > option USER_PCICONF# user-space PCI configuration > > option APERTURE# in-kernel aperture driver for XFree86 >
Re: uvm_swap: introduce uvm_swap_data_lock
Nice! On 30/12/21(Thu) 23:38, Theo Buehler wrote: > The diff below does two things: it adds a uvm_swap_data_lock mutex and > trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad() Why is it enough? Which fields is the lock protecting in these functions? Is it `uvmexp.swpages', could that be documented? What about `nswapdev'? Why is the rwlock grabbed before reading it in sys_swapctl()? What about `swpginuse'? If the mutex/rwlock are used to protect the global `swap_priority' could that be also documented? Once this is documented it should be trivial to see that some places are missing some locking. Is it intentional? > The uvm_swap_data_lock protects all swap data structures, so needs to be > grabbed a few times, many of them already documented in the comments. > > For review, I suggest comparing to what NetBSD did and also going > through the consumers (swaplist_insert, swaplist_find, swaplist_trim) > and check that they are properly locked when called, or that there is > the KERNEL_LOCK() in place when swap data structures are manipulated. I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce the differences with NetBSD. > In swapmount() I introduced locking since that's needed to be able to > assert that the proper locks are held in swaplist_{insert,find,trim}. Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down? What about `uvmexp.nswget' and `uvmexp.swpgonly' in there? > Index: uvm/uvm_swap.c > === > RCS file: /cvs/src/sys/uvm/uvm_swap.c,v > retrieving revision 1.152 > diff -u -p -r1.152 uvm_swap.c > --- uvm/uvm_swap.c12 Dec 2021 09:14:59 - 1.152 > +++ uvm/uvm_swap.c30 Dec 2021 15:47:20 - > @@ -44,6 +44,7 @@ > #include > #include > #include > +#include <sys/mutex.h> > #include > #include > #include > @@ -91,6 +92,9 @@ > * - swap_syscall_lock (sleep lock): this lock serializes the swapctl > *system call and prevents the swap priority list from changing > *while we are in the middle of a system call (e.g. SWAP_STATS). > + * - uvm_swap_data_lock (mutex): this lock protects all swap data > + *structures including the priority list, the swapdev structures, > + *and the swapmap arena. > * > * each swap device has the following info: > * - swap device in use (could be disabled, preventing future use) > @@ -212,6 +216,7 @@ LIST_HEAD(swap_priority, swappri); > struct swap_priority swap_priority; > > /* locks */ > +struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE); > struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk"); > > /* > @@ -442,7 +447,7 @@ uvm_swap_finicrypt_all(void) > /* > * swaplist_insert: insert swap device "sdp" into the global list > * > - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock > + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock > * => caller must provide a newly malloc'd swappri structure (we will > * FREE it if we don't need it... this it to prevent malloc blocking > * here while adding swap) > @@ -452,6 +457,9 @@ swaplist_insert(struct swapdev *sdp, str > { > struct swappri *spp, *pspp; > > + rw_assert_wrlock(&swap_syscall_lock); > + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock); > + > /* >* find entry at or after which to insert the new device. >*/ > @@ -493,7 +501,7 @@ swaplist_insert(struct swapdev *sdp, str > * swaplist_find: find and optionally remove a swap device from the > * global list. 
> * > - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock > + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock > * => we return the swapdev we found (and removed) > */ > struct swapdev * > @@ -502,6 +510,9 @@ swaplist_find(struct vnode *vp, boolean_ > struct swapdev *sdp; > struct swappri *spp; > > + rw_assert_wrlock(&swap_syscall_lock); > + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock); > + > /* >* search the lists for the requested vp >*/ > @@ -524,13 +535,16 @@ swaplist_find(struct vnode *vp, boolean_ > * swaplist_trim: scan priority list for empty priority entries and kill > * them. > * > - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock > + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock > */ > void > swaplist_trim(void) > { > struct swappri *spp, *nextspp; > > + rw_assert_wrlock(&swap_syscall_lock); > + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock); > + > LIST_FOREACH_SAFE(spp, &swap_priority, spi_swappri, nextspp) { > if (!TAILQ_EMPTY(&spp->spi_swapdev)) > continue; > @@ -543,7 +557,7 @@ swaplist_trim(void) > * swapdrum_add: add a "swapdev"'s blocks into /dev/drum's area. > * > * => caller must hold swap_syscall_lock > - * => uvm.swap_data_lock should be unlocked (we may sleep) > + * => uvm_swap_data_lock should be unlocked (we may sleep)
Re: unlock mmap(2) for anonymous mappings
On 14/01/22(Fri) 23:01, Mark Kettenis wrote: > > Date: Tue, 11 Jan 2022 23:13:20 + > > From: Klemens Nanni > > > > On Tue, Jan 11, 2022 at 09:54:44AM -0700, Theo de Raadt wrote: > > > > Now this is clearly a "slow" path. I don't think there is any reason > > > > not to put all the code in that if (uvw_wxabort) block under the > > > > kernel lock. For now I think making access to ps_wxcounter atomic is > > > > just too fine grained. > > > > > > Right. Lock the whole block. > > > > Thanks everyone, here's the combined diff for that. > > I think mpi@ should be involved in the actual unlocking of mmap(2), > munmap(2) and mprotect(2). But the changes to uvm_mmap.c are ok > kettenis@ and can be committed now. It isn't clear to me what changed since the last time this has been tried. Why is it safe now? What are the locking assumptions? IMHO this approach of let's try if it works now and revert if it isn't doesn't help us make progress. I'd be more confident seeing diffs that assert for the right lock in the functions called by uvm_mapanon() and documentation about which lock is protecting which field of the data structures. NetBSD has done much of this and the code bases do not diverge much so it can be useful to look there as well.
Re: patch: add a new ktrace facility for replacing some printf-debug
On 07/01/22(Fri) 10:54, Sebastien Marie wrote: > Hi, > > Debugging some code paths is complex: for example, unveil(2) code is > involved inside VFS, and using DEBUG_UNVEIL means that the kernel is > spamming printf() for all processes using unveil(2) (a lot of > processes) instead of just the interested cases I want to follow. > > So I cooked the following diff to add a KTRLOG() facility to be able > to replace printf()-like debugging with a more process-limited method. I wish you could debug such an issue without having to change any kernel code; that's why I started btrace(8). You should already be able to filter probes per thread/process, maybe you want to replace some debug printfs with static probes. Alternatively you could define DDBPROF and use the kprobe provider which allows you to inspect the prologue and epilogue of ELF functions. Maybe btrace(8) is not yet fully functional to debug this particular problem, but improving it should hopefully give us a tool to debug most of the kernel issues without having to write a diff and boot a custom kernel. At least that's the goal.
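As a rough illustration of the static-probe alternative: a DEBUG_UNVEIL printf can become a tracepoint that btrace(8) then matches per thread or process. The probe name below is made up for the example, and a matching DT_STATIC_PROBE definition in dev/dt/dt_prov_static.c would also be needed:

	#include <sys/tracepoint.h>

	/* before: printf("unveil: pid %d: error %d\n", ...); */
	TRACEPOINT(vfs, unveil_check,	/* hypothetical probe name */
	    p->p_p->ps_pid, error);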
Re: gprof: Profiling a multi-threaded application
On 10/12/21(Fri) 09:56, Yuichiro NAITO wrote: > Any comments about this topic? I'm ok with this approach. I would appreciate it if somebody else could take it over, I'm too busy with other stuff. > On 7/12/21 18:09, Yuichiro NAITO wrote: > > Hi, Martin > > > > On 2021/07/10 16:55, Martin Pieuchot wrote: > > > Hello Yuichiro, thanks for your work ! > > > > Thanks for the response. > > > > > > On 2021/06/16 16:34, Yuichiro NAITO wrote: > > > > > When I compile a multi-threaded application with '-pg' option, I > > > > > always get no > > > > > results in gmon.out. With bad luck, my application gets SIGSEGV while > > > > > running. > > > > > Because the buffer that holds number of caller/callee count is the > > > > > only one > > > > > in the process and will be broken by multi-threaded access. > > > > > > > > > > I get the idea to solve this problem from NetBSD. NetBSD has > > > > > individual buffers > > > > > for each thread and merges them at the end of profiling. > > > > > > Note that the kernel uses a similar approach but doesn't merge the buffer, > > > instead it generates a file for each CPU. > > > > Yes, so the number of output files are limited by the number of CPUs in > > case of > > the kernel profiling. I think number of application threads can be > > increased more > > casually. I don't want to see dozens of 'gmon.out' files. > > > > > > > NetBSD stores the reference to the individual buffer by > > > > > pthread_setspecific(3). > > > > > I think it causes infinite recursive call if whole libc library > > > > > (except > > > > > mcount.c) is compiled with -pg. > > > > > > > > > > The compiler generates '_mcount' function call at the beginning of > > > > > every > > > > > functions. If '_mcount' calls pthread_getspecific(3) to get the > > > > > individual > > > > > buffer, pthread_getspecific() calls '_mcount' again and causes > > > > > infinite > > > > > recursion. > > > > > > > > > > NetBSD prevents from infinite recursive call by checking a global > > > > > variable. But > > > > > this approach also prevents simultaneously call of '_mcount' on a > > > > > multi-threaded > > > > > application. It makes a little bit wrong results of profiling. > > > > > > > > > > So I added a pointer to the buffer in `struct pthread` that can be > > > > > accessible > > > > > via macro call as same as pthread_self(3). This approach can prevent > > > > > of > > > > > infinite recursive call of '_mcount'. > > > > > > Not calling a libc function for this makes sense. However I'm not > > > convinced that accessing `tib_thread' before _rthread_init() has been > > > called is safe. > > > > Before '_rthread_init' is called, '__isthreaded' global variable is kept to > > be 0. > > My patch doesn't access tib_thread in this case. > > After calling `_rthread_init`, `pthread_create()` changes `__isthreaded` to > > 1. > > Tib_thread will be accessed by all threads that are newly created and the > > initial one. > > > > I believe tib of the initial thread has been initialized in `_libc_preinit' > > function > > in 'lib/libc/dlfcn/init.c'. > > > > > I'm not sure where is the cleanest way to place the per-thread buffer, > > > I'd suggest you ask guenther@ about this. > > > > I added guenther@ in CC of this mail. > > I hope I can get advice about the per-thread buffer. > > > > > Maybe the initialization can be done outside of _mcount()? > > > > Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`. 
> > > > > > > I obtained merging function from NetBSD that is called in '_mcleanup' > > > > > function. > > > > > Merging function needs to walk through all the individual buffers, > > > > > I added SLIST_ENTRY member in 'struct gmonparam' to make a list of > > > > > the buffers. > > > > > And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be > > > > > used for > > > > > the kernel. > > > > > > > > > > But I still use pthread_getspecific(3) for that can call destructor &
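To make the TIB idea in this thread concrete, here is a compressed sketch of what _mcount() would do; the gmonparam member in the TIB is hypothetical, and a real diff must also handle buffer allocation and the pre-_rthread_init() case discussed above:

	/* In _mcount(), sketch: */
	struct gmonparam *g;

	if (__isthreaded)
		g = TIB_GET()->tib_gmonparam;	/* hypothetical member */
	else
		g = &_gmonparam;	/* single-threaded: the global buffer */
	if (g == NULL || g->state != GMON_PROF_ON)
		return;

Because TIB_GET() is a plain macro around the thread pointer, no libc function is called, so the infinite-recursion problem NetBSD works around with a global flag does not arise.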
Re: net write in pcb hash
On 08/12/21(Wed) 22:39, Alexander Bluhm wrote: > On Wed, Dec 08, 2021 at 03:28:34PM -0300, Martin Pieuchot wrote: > > On 04/12/21(Sat) 01:02, Alexander Bluhm wrote: > > > Hi, > > > > > > As I want a read-only net lock for forwarding, all paths should be > > > checked for the correct net lock and asserts. I found code in > > > in_pcbhashlookup() where reading the PCB table results in a write > > > to optimize the cache. > > > > > > Properly protecting PCB hashes is out of scope for parallel forwarding. > > > Can we get away with this hack? Only update the cache when we are > > > in TCP or UDP stack with the write lock. The access from pf is > > > read-only. > > > > > > NET_WLOCKED() indicates whether we may change data structures. > > > > I recall that we currently do not want to introduce such idiom: change > > the behavior of the code depending on which lock is held by the caller. > > > > Can we instead assert that a write-lock is held before modifying the > > list? > > We could also pass down the kind of lock that is used. Goal is > that pf uses shared net lock. TCP and UDP will keep the exclusive > net lock for a while. Changing the logic of a function based on the type of a lock is not different from the previous approach. > Diff gets longer but perhaps a bit clearer what is going on. I believe we want to split in_pcblookup_listen() into two functions. One which is read-only and one which modifies the head of the hash chain. The read-only one asserts for any lock and the one that modifies the hash calls the former and asserts for a write lock. Alternatively, we could protect the PCB hash with a mutex. This has the advantage of not making the scope of the NET_LOCK() more complex. In the end we all know something like that will be done. I don't know how the other BSDs did this but I'm sure this will help getting the remaining socket layer out of the NET_LOCK(). > Index: net/pf.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v > retrieving revision 1.1122 > diff -u -p -r1.1122 pf.c > --- net/pf.c 7 Jul 2021 18:38:25 - 1.1122 > +++ net/pf.c 8 Dec 2021 21:16:16 - > @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd) > sport = pd->hdr.tcp.th_sport; > dport = pd->hdr.tcp.th_dport; > PF_ASSERT_LOCKED(); > - NET_ASSERT_LOCKED(); > tb = &tcbtable; > break; > case IPPROTO_UDP: > sport = pd->hdr.udp.uh_sport; > dport = pd->hdr.udp.uh_dport; > PF_ASSERT_LOCKED(); > - NET_ASSERT_LOCKED(); > tb = &udbtable; > break; > default: > @@ -3348,22 +3346,24 @@ pf_socket_lookup(struct pf_pdesc *pd) >* Fails when rtable is changed while evaluating the ruleset >* The socket looked up will not match the one hit in the end. 
>*/ > - inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport, > - pd->rdomain); > + NET_ASSERT_LOCKED(); > + inp = in_pcbhashlookup_wlocked(tb, saddr->v4, sport, daddr->v4, > + dport, pd->rdomain, 0); > if (inp == NULL) { > - inp = in_pcblookup_listen(tb, daddr->v4, dport, > - NULL, pd->rdomain); > + inp = in_pcblookup_listen_wlocked(tb, daddr->v4, dport, > + NULL, pd->rdomain, 0); > if (inp == NULL) > return (-1); > } > break; > #ifdef INET6 > case AF_INET6: > - inp = in6_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6, > - dport, pd->rdomain); > + NET_ASSERT_LOCKED(); > + inp = in6_pcbhashlookup_wlocked(tb, &saddr->v6, sport, > + &daddr->v6, dport, pd->rdomain, 0); > if (inp == NULL) { > - inp = in6_pcblookup_listen(tb, &daddr->v6, dport, > - NULL, pd->rdomain); > + inp = in6_pcblookup_listen_wlocked(tb, &daddr->v6, > + dport, NULL, pd->rdomain, 0); > if (inp == NULL) > return (-1); > } > Index: netinet/in_pcb.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v > retrieving revision 1.256 > diff -u -p -r1.256 in_pcb.c >
Re: net write in pcb hash
On 04/12/21(Sat) 01:02, Alexander Bluhm wrote: > Hi, > > As I want a read-only net lock for forwarding, all paths should be > checked for the correct net lock and asserts. I found code in > in_pcbhashlookup() where reading the PCB table results in a write > to optimize the cache. > > Properly protecting PCB hashes is out of scope for parallel forwarding. > Can we get away with this hack? Only update the cache when we are > in TCP or UDP stack with the write lock. The access from pf is > read-only. > > NET_WLOCKED() indicates whether we may change data structures. I recall that we currently do not want to introduce such idiom: change the behavior of the code depending on which lock is held by the caller. Can we instead assert that a write-lock is held before modifying the list? > Also move the assert from pf to in_pcb where the critical section > is. > > bluhm > > Index: net/pf.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v > retrieving revision 1.1122 > diff -u -p -r1.1122 pf.c > --- net/pf.c 7 Jul 2021 18:38:25 - 1.1122 > +++ net/pf.c 3 Dec 2021 22:20:32 - > @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd) > sport = pd->hdr.tcp.th_sport; > dport = pd->hdr.tcp.th_dport; > PF_ASSERT_LOCKED(); > - NET_ASSERT_LOCKED(); > tb = &tcbtable; > break; > case IPPROTO_UDP: > sport = pd->hdr.udp.uh_sport; > dport = pd->hdr.udp.uh_dport; > PF_ASSERT_LOCKED(); > - NET_ASSERT_LOCKED(); > tb = &udbtable; > break; > default: > Index: netinet/in_pcb.c > === > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v > retrieving revision 1.256 > diff -u -p -r1.256 in_pcb.c > --- netinet/in_pcb.c 25 Oct 2021 22:20:47 - 1.256 > +++ netinet/in_pcb.c 3 Dec 2021 22:20:32 - > @@ -1069,6 +1069,8 @@ in_pcbhashlookup(struct inpcbtabl > u_int16_t fport = fport_arg, lport = lport_arg; > u_int rdomain; > > + NET_ASSERT_LOCKED(); > + > rdomain = rtable_l2(rtable); > head = in_pcbhash(table, rdomain, &faddr, fport, &laddr, lport); > LIST_FOREACH(inp, head, inp_hash) { > @@ -1085,7 +1087,7 @@ in_pcbhashlookup(struct inpcbtabl >* repeated accesses are quicker. This is analogous to >* the historic single-entry PCB cache. >*/ > - if (inp != LIST_FIRST(head)) { > + if (NET_WLOCKED() && inp != LIST_FIRST(head)) { > LIST_REMOVE(inp, inp_hash); > LIST_INSERT_HEAD(head, inp, inp_hash); > } > @@ -1119,6 +1121,8 @@ in_pcblookup_listen(struct inpcbtable *t > u_int16_t lport = lport_arg; > u_int rdomain; > > + NET_ASSERT_LOCKED(); > + > key1 = &laddr; > key2 = &zeroin_addr; > #if NPF > 0 > @@ -1185,7 +1189,7 @@ in_pcblookup_listen(struct inpcbtable *t >* repeated accesses are quicker. This is analogous to >* the historic single-entry PCB cache. >*/ > - if (inp != NULL && inp != LIST_FIRST(head)) { > + if (NET_WLOCKED() && inp != NULL && inp != LIST_FIRST(head)) { > LIST_REMOVE(inp, inp_hash); > LIST_INSERT_HEAD(head, inp, inp_hash); > } > Index: sys/systm.h > === > RCS file: /data/mirror/openbsd/cvs/src/sys/sys/systm.h,v > retrieving revision 1.154 > diff -u -p -r1.154 systm.h > --- sys/systm.h 2 Jun 2021 00:39:25 - 1.154 > +++ sys/systm.h 3 Dec 2021 22:20:32 - > @@ -344,6 +344,8 @@ extern struct rwlock netlock; > #define NET_RLOCK_IN_IOCTL()do { rw_enter_read(&netlock); } while (0) > #define NET_RUNLOCK_IN_IOCTL() do { rw_exit_read(&netlock); } while (0) > > +#define NET_WLOCKED() (rw_status(&netlock) == RW_WRITE) > + > #ifdef DIAGNOSTIC > > #define NET_ASSERT_UNLOCKED() > \ >
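To illustrate the mutex alternative raised in this thread: guard only the hash chains, so the move-to-front stays safe even when callers hold a shared net lock. The inpt_mtx member is an assumption for the sketch, not something in the tree; chain readers would of course have to take the same mutex:

	/* in in_pcbhashlookup(), instead of checking NET_WLOCKED() (sketch): */
	mtx_enter(&table->inpt_mtx);	/* hypothetical per-table mutex */
	if (inp != NULL && inp != LIST_FIRST(head)) {
		LIST_REMOVE(inp, inp_hash);
		LIST_INSERT_HEAD(head, inp, inp_hash);
	}
	mtx_leave(&table->inpt_mtx);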
Re: kbind(2): push kernel lock down as needed
On 06/12/21(Mon) 14:58, Scott Cheloha wrote: > On Mon, Dec 06, 2021 at 08:35:15PM +0100, Mark Kettenis wrote: > > > Date: Sun, 5 Dec 2021 18:01:04 -0600 > > > From: Scott Cheloha > > > > > > Two things in sys_kbind() need an explicit kernel lock. First, > > > sigexit(). Second, uvm_map_extract(), because the following call > > > chain panics without it: > > > > > > [...] > > > > > > With this committed we can unlock kbind(2). > > > > > > Thoughts? ok? > > > > To be honest, I don't think this makes sense unless you can make the > > "normal" code path lock free. You're replacing a single > > KERNEL_LOCK/UNLOCK pair with (potentially) a bunch of them. That may > > actually make things worse. So I think we need to make > > uvm_map_extract() mpsafe first. > > Unlocking uvm_map_extract() would improve things, yes. Yes, please. What's missing?
Re: Rework UNIX sockets locking to be fine grained
On 24/11/21(Wed) 15:04, Vitaliy Makkoveev wrote: > [...] > Really, this is the simplest way. The shared lock for the pair of > sockets moves re-lock dances to the connect and disconnect stages which > should be also protected by locks. And not for the pair. Many SOCK_DGRAM > sockets could be connected to one socket. This could be done, but this > is the hell. And there is absolutely NO profit. It's not clear to me why sharing a lock isn't simpler. Currently all unix(4) sockets share a lock and the locking semantic is simpler. If two sockets are linked couldn't they use the same rwlock? Did you consider this approach? Is it complicated to know which lock to pick? This is what is done in UVM and that's why the rwlock is allocated outside of "struct uvm_object" with rw_obj_alloc(). Having a 'rwlock pointer' in "struct socket" could also help by setting this pointer to &netlock in the case of UDP/TCP sockets. I'm not saying your approach isn't working. I'm just not convinced it is the simplest path forward and I wish we could do without refcounting.
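To illustrate, a minimal solock_pair() could hide the type-specific strategy behind a fixed lock order. This is a sketch under the assumption that only unix(4) sockets have per-socket locks (solock_persocket() appears in the diff quoted below); comparing addresses is just one way to get a stable order:

	void
	solock_pair(struct socket *so1, struct socket *so2)
	{
		if (!solock_persocket(so1)) {
			solock(so1);	/* one shared lock covers both */
			return;
		}
		/* Per-socket rwlocks: take them in a stable (address) order. */
		if (so1 < so2) {
			solock(so1);
			solock(so2);
		} else {
			solock(so2);
			solock(so1);
		}
	}

The same ordering rule would let doaccept() lock `head' and `so' without the re-lock dance.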
Re: Please test: UVM fault unlocking (aka vmobjlock)
On 24/11/21(Wed) 11:16, Martin Pieuchot wrote: > Diff below unlock the bottom part of the UVM fault handler. I'm > interested in squashing the remaining bugs. Please test with your usual > setup & report back. Thanks to all the testers, here's a new version that includes a bug fix. Tests on !x86 architectures are much appreciated! Thanks a lot, Martin diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP index bb842f6d96e..e5334c19eac 100644 --- sys/arch/amd64/conf/GENERIC.MP +++ sys/arch/amd64/conf/GENERIC.MP @@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC" option MULTIPROCESSOR #optionMP_LOCKDEBUG -#optionWITNESS +option WITNESS cpu* at mainbus? diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP index 980a572b8fd..ef7ded61501 100644 --- sys/arch/i386/conf/GENERIC.MP +++ sys/arch/i386/conf/GENERIC.MP @@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC" option MULTIPROCESSOR # Multiple processor support #optionMP_LOCKDEBUG -#optionWITNESS +option WITNESS cpu* at mainbus? diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c sys/dev/pci/drm/i915/gem/i915_gem_shmem.c index ce8e2eca141..47b567087e7 100644 --- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c +++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c @@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj) #ifdef __linux__ shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1); #else + rw_enter(obj->base.uao->vmobjlock, RW_WRITE); obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size, PGO_ALLPAGES | PGO_FREE); + rw_exit(obj->base.uao->vmobjlock); #endif obj->mm.madv = __I915_MADV_PURGED; obj->mm.pages = ERR_PTR(-EFAULT); diff --git sys/dev/pci/drm/radeon/radeon_ttm.c sys/dev/pci/drm/radeon/radeon_ttm.c index eb879b5c72c..837a9f94298 100644 --- sys/dev/pci/drm/radeon/radeon_ttm.c +++ sys/dev/pci/drm/radeon/radeon_ttm.c @@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t vaddr, vm_page_t *pps, struct radeon_device *rdev; int r; + KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock)); + bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj; rdev = bo->dev->dev_private; down_read(>pm.mclk_lock); diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c index 20051d95dc1..a5c403ab67d 100644 --- sys/uvm/uvm_aobj.c +++ sys/uvm/uvm_aobj.c @@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = { * deadlock. */ static LIST_HEAD(aobjlist, uvm_aobj) uao_list = LIST_HEAD_INITIALIZER(uao_list); -static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE); +static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR); /* @@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx) * uao_set_swslot: set the swap slot for a page in an aobj. 
* * => setting a slot to zero frees the slot + * => object must be locked by caller * => we return the old slot number, or -1 if we failed to allocate *memory to record the new slot number */ @@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int slot) struct uvm_aobj *aobj = (struct uvm_aobj *)uobj; int oldslot; - KERNEL_ASSERT_LOCKED(); + KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0); KASSERT(UVM_OBJ_IS_AOBJ(uobj)); /* @@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj) struct uvm_object *uobj = &aobj->u_obj; KASSERT(UVM_OBJ_IS_AOBJ(uobj)); + KASSERT(rw_write_held(uobj->vmobjlock)); uao_dropswap_range(uobj, 0, 0); + rw_exit(uobj->vmobjlock); if (UAO_USES_SWHASH(aobj)) { /* @@ -671,6 +674,7 @@ struct uvm_object * uao_create(vsize_t size, int flags) { static struct uvm_aobj kernel_object_store; + static struct rwlock bootstrap_kernel_object_lock; static int kobj_alloced = 0; int pages = round_page(size) >> PAGE_SHIFT; struct uvm_aobj *aobj; @@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags) * Initialise UVM object. */ uvm_obj_init(&aobj->u_obj, &aobj_pager, refs); + if (flags & UAO_FLAG_KERNOBJ) { + /* Use a temporary static lock for kernel_object. */ + rw_init(&bootstrap_kernel_object_lock, "kobjlk"); + uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock); + } /* * now that aobj is ready, add it to the global list @@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj) * involved in is complete), release any swap resources and free * the page itself. */ - uvm_lock_pageq(); - while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) { + rw_enter(
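For testers reading the diff: the pattern consumers must now follow is to hold the object lock around page lookups and pager operations, as in this sketch (not a line from the diff itself; `off' stands for any page-aligned offset):

	struct vm_page *pg;

	rw_enter(uobj->vmobjlock, RW_WRITE);
	pg = uvm_pagelookup(uobj, off);
	/* ... examine or modify the page; it cannot be freed under us ... */
	rw_exit(uobj->vmobjlock);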
Re: Rework UNIX sockets locking to be fine grained
On 22/11/21(Mon) 14:42, Vitaliy Makkoveev wrote: > On Sat, Nov 20, 2021 at 03:12:31AM +0300, Vitaliy Makkoveev wrote: > > Updated diff. Re-lock dances were simplified in the unix(4) sockets > > layer. > > > > Reference counters added to unix(4) sockets layer too. This makes > > pointer dereference of peer's control block always safe after re-lock. > > > > The `unp_refs' list cleanup done in the unp_detach(). This removes the > > case where the socket connected to our dying socket could be passed to > > unp_disconnect() and the check of it's connection state became much > > easier. > > > > Another re-lock simplification. We could enforce the lock order between > the listening socket `head' and the socket `so' linked to it's `so_q0' > or `so_q' to solock(head) -> solock(so). > > This removes re-lock from accept(2) and the accepting socket couldn't be > stolen by concurrent accept(2) thread. This removes re-lock from `so_q0' > and `so_q' cleanup on dying listening socket. > > The previous incarnation of this diff does re-lock in a half of > doaccept(), soclose(), sofree() and soisconnected() calls. The current > diff does not re-lock in doaccept() and soclose() and always so re-lock > in sofree() and soisconnected(). > > I guess this is the latest simplification and this diff could be pushed > forward. This diff is really interesting. It shows that the current locking design needs to be reworked. I don't think we should expose the locking strategy with a `persocket' variable then use if/else dances to decide if one of two locks need to be taken/released. Instead could we fold the TCP/UDP locking into more generic functions? For example connect() could be: int soconnect2(struct socket *so1, struct socket *so2) { int s, error; s = solock_pair(so1, so2); error = (*so1->so_proto->pr_usrreq)(so1, PRU_CONNECT2, NULL, (struct mbuf *)so2, NULL, curproc); sounlock_pair(so1, so2, s); return (error); } And solock_pair() would do the right thing(tm) based on the socket type. Because in the end we want to prepare this layer to use per-socket locks with TCP/UDP sockets as well. Could something similar be done for doaccept()? I'm afraid about introducing reference counting. Once there is reference counting it tends to be abused. It's not clear to me for which reason it is added. It looks like to work around lock ordering issues, could you talk a bit about this? Is there any alternative? I also don't understand the problem behind: > + unp_ref(unp2); > + sounlock(so, SL_LOCKED); > + solock(so2); > + solock(so); > + > + /* Datagram socket could be reconnected due to re-lock. */ > + if (unp->unp_conn != unp2) { > + sounlock(so2, SL_LOCKED); > + unp_rele(unp2); > + goto again; > + } > + > + unp_rele(unp2); It seems that doing an unlock/relock dance requires a lot of added complexity, why is it done this way? Thanks for dealing with this! 
> Index: sys/kern/uipc_socket.c > === > RCS file: /cvs/src/sys/kern/uipc_socket.c,v > retrieving revision 1.269 > diff -u -p -r1.269 uipc_socket.c > --- sys/kern/uipc_socket.c11 Nov 2021 16:35:09 - 1.269 > +++ sys/kern/uipc_socket.c22 Nov 2021 11:36:40 - > @@ -52,6 +52,7 @@ > #include > #include > #include > +#include <sys/refcnt.h> > > #ifdef DDB > #include > @@ -156,7 +157,9 @@ soalloc(int prflags) > so = pool_get(&socket_pool, prflags); > if (so == NULL) > return (NULL); > - rw_init(&so->so_lock, "solock"); > + rw_init_flags(&so->so_lock, "solock", RWL_DUPOK); > + refcnt_init(&so->so_refcnt); > + > return (so); > } > > @@ -257,6 +260,8 @@ solisten(struct socket *so, int backlog) > void > sofree(struct socket *so, int s) > { > + int persocket = solock_persocket(so); > + > soassertlocked(so); > > if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0) { > @@ -264,16 +269,53 @@ sofree(struct socket *so, int s) > return; > } > if (so->so_head) { > + struct socket *head = so->so_head; > + > /* >* We must not decommission a socket that's on the accept(2) >* queue. If we do, then accept(2) may hang after select(2) >* indicated that the listening socket was ready. >*/ > - if (!soqremque(so, 0)) { > + if (so->so_onq == &head->so_q) { > sounlock(so, s); > return; > } > + > + if (persocket) { > + /* > + * Concurrent close of `head' could > + * abort `so' due to re-lock. > + */ > + soref(so); > + soref(head); > +
Please test: UVM fault unlocking (aka vmobjlock)
Diff below unlock the bottom part of the UVM fault handler. I'm interested in squashing the remaining bugs. Please test with your usual setup & report back. Thanks, Martin diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP index bb842f6d96e..e5334c19eac 100644 --- sys/arch/amd64/conf/GENERIC.MP +++ sys/arch/amd64/conf/GENERIC.MP @@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC" option MULTIPROCESSOR #optionMP_LOCKDEBUG -#optionWITNESS +option WITNESS cpu* at mainbus? diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP index 980a572b8fd..ef7ded61501 100644 --- sys/arch/i386/conf/GENERIC.MP +++ sys/arch/i386/conf/GENERIC.MP @@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC" option MULTIPROCESSOR # Multiple processor support #optionMP_LOCKDEBUG -#optionWITNESS +option WITNESS cpu* at mainbus? diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c sys/dev/pci/drm/i915/gem/i915_gem_shmem.c index ce8e2eca141..47b567087e7 100644 --- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c +++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c @@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj) #ifdef __linux__ shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1); #else + rw_enter(obj->base.uao->vmobjlock, RW_WRITE); obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size, PGO_ALLPAGES | PGO_FREE); + rw_exit(obj->base.uao->vmobjlock); #endif obj->mm.madv = __I915_MADV_PURGED; obj->mm.pages = ERR_PTR(-EFAULT); diff --git sys/dev/pci/drm/radeon/radeon_ttm.c sys/dev/pci/drm/radeon/radeon_ttm.c index eb879b5c72c..837a9f94298 100644 --- sys/dev/pci/drm/radeon/radeon_ttm.c +++ sys/dev/pci/drm/radeon/radeon_ttm.c @@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t vaddr, vm_page_t *pps, struct radeon_device *rdev; int r; + KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock)); + bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj; rdev = bo->dev->dev_private; down_read(>pm.mclk_lock); diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c index 20051d95dc1..127218c4c40 100644 --- sys/uvm/uvm_aobj.c +++ sys/uvm/uvm_aobj.c @@ -31,7 +31,7 @@ /* * uvm_aobj.c: anonymous memory uvm_object pager * - * author: Chuck Silvers +* author: Chuck Silvers * started: Jan-1998 * * - design mostly from Chuck Cranor @@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = { * deadlock. */ static LIST_HEAD(aobjlist, uvm_aobj) uao_list = LIST_HEAD_INITIALIZER(uao_list); -static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE); +static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR); /* @@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx) * uao_set_swslot: set the swap slot for a page in an aobj. 
* * => setting a slot to zero frees the slot + * => object must be locked by caller * => we return the old slot number, or -1 if we failed to allocate *memory to record the new slot number */ @@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int slot) struct uvm_aobj *aobj = (struct uvm_aobj *)uobj; int oldslot; - KERNEL_ASSERT_LOCKED(); + KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0); KASSERT(UVM_OBJ_IS_AOBJ(uobj)); /* @@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj) struct uvm_object *uobj = >u_obj; KASSERT(UVM_OBJ_IS_AOBJ(uobj)); + KASSERT(rw_write_held(uobj->vmobjlock)); uao_dropswap_range(uobj, 0, 0); + rw_exit(uobj->vmobjlock); if (UAO_USES_SWHASH(aobj)) { /* @@ -671,6 +674,7 @@ struct uvm_object * uao_create(vsize_t size, int flags) { static struct uvm_aobj kernel_object_store; + static struct rwlock bootstrap_kernel_object_lock; static int kobj_alloced = 0; int pages = round_page(size) >> PAGE_SHIFT; struct uvm_aobj *aobj; @@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags) * Initialise UVM object. */ uvm_obj_init(>u_obj, _pager, refs); + if (flags & UAO_FLAG_KERNOBJ) { + /* Use a temporary static lock for kernel_object. */ + rw_init(_kernel_object_lock, "kobjlk"); + uvm_obj_setlock(>u_obj, _kernel_object_lock); + } /* * now that aobj is ready, add it to the global list @@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj) * involved in is complete), release any swap resources and free * the page itself. */ - uvm_lock_pageq(); - while((pg = RBT_ROOT(uvm_objtree, >memt)) != NULL) { + rw_enter(
Re: Retry sleep in poll/select
On 17/11/21(Wed) 09:51, Scott Cheloha wrote: > > On Nov 17, 2021, at 03:22, Martin Pieuchot wrote: > > > > On 16/11/21(Tue) 13:55, Visa Hankala wrote: > >> Currently, dopselect() and doppoll() call tsleep_nsec() without retry. > >> cheloha@ asked if the functions should handle spurious wakeups. I guess > >> such wakeups are unlikely with the nowake wait channel, but I am not > >> sure if that is a safe guess. > > > > I'm not sure I understand: are we afraid a thread sleeping on `nowake' > > can be awakened? Is it the assumption here? > > Yes, but I don't know how. Then I'd suggest we start with understanding how this can happen, otherwise I fear we are adding more complexity for reasons we don't understand. > kettenis@ said spurious wakeups were > possible on a similar loop in sigsuspend(2) > so I mentioned this to visa@ off-list. I don't understand how this can happen. > If we added an assert to panic in wakeup(9) > if the channel is &nowake, would that be > sufficient? I guess so. > Ideally if you sleep on &nowake you should > never get a zero status from the sleep > functions. It should be impossible… if that > is possible to ensure.
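For reference, the proposed guard is a one-liner; placing it in wakeup_n() in kern_synch.c is an assumption, but the check itself is exactly what is being discussed:

	/* Sleeps on &nowake must never be woken up: catch offenders. */
	KASSERT(ident != &nowake);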
Re: Retry sleep in poll/select
On 16/11/21(Tue) 13:55, Visa Hankala wrote: > Currently, dopselect() and doppoll() call tsleep_nsec() without retry. > cheloha@ asked if the functions should handle spurious wakeups. I guess > such wakeups are unlikely with the nowake wait channel, but I am not > sure if that is a safe guess. I'm not sure I understand: are we afraid a thread sleeping on `nowake' can be awakened? Is it the assumption here? > The following diff adds the retrying. The code is a bit arduous, so the > retry loop is put in a separate function that both poll and select use. Using a separate function makes sense anyway. > Index: kern/sys_generic.c > === > RCS file: src/sys/kern/sys_generic.c,v > retrieving revision 1.141 > diff -u -p -r1.141 sys_generic.c > --- kern/sys_generic.c16 Nov 2021 13:48:23 - 1.141 > +++ kern/sys_generic.c16 Nov 2021 13:50:08 - > @@ -90,6 +90,7 @@ int dopselect(struct proc *, int, fd_set > int doppoll(struct proc *, struct pollfd *, u_int, struct timespec *, > const sigset_t *, register_t *); > void doselwakeup(struct selinfo *); > +int selsleep(struct timespec *); > > int > iovec_copyin(const struct iovec *uiov, struct iovec **iovp, struct iovec > *aiov, > @@ -664,19 +665,7 @@ dopselect(struct proc *p, int nd, fd_set >* there's nothing to wait for. >*/ > if (nevents == 0 && ncollected == 0) { > - uint64_t nsecs = INFSLP; > - > - if (timeout != NULL) { > - if (!timespecisset(timeout)) > - goto done; > - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP)); > - } > - error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqsel", nsecs); > - /* select is not restarted after signals... */ > - if (error == ERESTART) > - error = EINTR; > - if (error == EWOULDBLOCK) > - error = 0; > + error = selsleep(timeout); > goto done; > } > > @@ -849,6 +838,46 @@ selfalse(dev_t dev, int events, struct p > } > > /* > + * Sleep until a signal arrives or the optional timeout expires. > + */ > +int > +selsleep(struct timespec *timeout) > +{ > + uint64_t end, now, nsecs; > + int error; > + > + if (timeout != NULL) { > + if (!timespecisset(timeout)) > + return (0); > + now = getnsecuptime(); > + end = MIN(now + TIMESPEC_TO_NSEC(timeout), MAXTSLP); > + if (end < now) > + end = MAXTSLP; > + } > + > + do { > + if (timeout != NULL) > + nsecs = MAX(1, end - now); > + else > + nsecs = INFSLP; > + error = tsleep_nsec(&nowake, PSOCK | PCATCH, "selslp", nsecs); > + if (timeout != NULL) { > + now = getnsecuptime(); > + if (now >= end) > + break; > + } > + } while (error == 0); > + > + /* poll/select is not restarted after signals... */ > + if (error == ERESTART) > + error = EINTR; > + if (error == EWOULDBLOCK) > + error = 0; > + > + return (error); > +} > + > +/* > * Record a select request. > */ > void > @@ -1158,19 +1187,7 @@ doppoll(struct proc *p, struct pollfd *f >* there's nothing to wait for. >*/ > if (nevents == 0 && ncollected == 0) { > - uint64_t nsecs = INFSLP; > - > - if (timeout != NULL) { > - if (!timespecisset(timeout)) > - goto done; > - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP)); > - } > - > - error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqpoll", nsecs); > - if (error == ERESTART) > - error = EINTR; > - if (error == EWOULDBLOCK) > - error = 0; > + error = selsleep(timeout); > goto done; > } > >
Re: bt.5 document count()
On 16/11/21(Tue) 11:07, Claudio Jeker wrote: > This documents count(). This function only works when used like this > @map[key] = count(); > But it is implemented and works. If used differently you get a syntax > error which is not helpful. This is why I chose to document it like this. > Another option would be to document the language (so it is clear where it > is possible to use what). ok mpi@ > max(), min() and sum() are other functions that behave like this. Their > documentation should also be adjusted IMO. > > -- > :wq Claudio > > Index: bt.5 > === > RCS file: /cvs/src/usr.sbin/btrace/bt.5,v > retrieving revision 1.13 > diff -u -p -r1.13 bt.5 > --- bt.5 12 Nov 2021 16:57:24 - 1.13 > +++ bt.5 16 Nov 2021 09:50:52 - > @@ -120,6 +120,11 @@ Functions: > .It Fn clear "@map" > Delete all (key, value) pairs from > .Va @map . > +.It "@map[key]" = Fn count > +Increment the value of > +.Va key > +from > +.Va @map . > .It Fn delete "@map[key]" > Delete the pair indexed by > .Va key >
Re: quotaon(8): small improvements
Friendly ping On Fri, Oct 29, 2021 at 10:06:44AM +0200, Martin Vahlensieck wrote: > Hi > > Here are some small changes to quotaon(8). If you want I can split > them up, but since they are small I am sending one diff. Here is > a list of changes roughly in the order they appear in the diff: > > - Constify some function arguments > > - Use __progname instead of separate whoami variable + small KNF >where the line is touched anyways. > > - Order letters in getopt(3) call > > - Convert a fprintf(3) + perror(3) to warn(3). warn(3) is >already used in this function for similar purposes. > > - Replace strtok(3) with strsep(3). Is there a more >elegant way than my solution? (I took the freedom to collect all >the char * variables into one line). > > - Set cp to NULL at the end of the while loop scanning the mount >options. Otherwise if the last mount option contains a '=' the >part after it is used as the quota file (of course only if the >desired userquota/groupquota option isn't found). > > - Use strncmp(3) instead of memcmp(3). Looking at the kernel code >it seems that it is zero terminated. The other code nearby uses >strcmp(3) already. > > - Invert an if condition to prevent an empty body. > > Happy to hear feedback, also feel free to only commit parts of it. > > Best, > > Martin > > P.S.: If the prototypes are touched anyways, the names of the > arguments might be removed as well to comply with KNF. Let me > know if you want that. > > Index: quotaon.c > === > RCS file: /cvs/src/usr.sbin/quotaon/quotaon.c,v > retrieving revision 1.27 > diff -u -p -r1.27 quotaon.c > --- quotaon.c 26 Apr 2018 12:42:51 - 1.27 > +++ quotaon.c 29 Oct 2021 07:46:25 - > @@ -44,6 +44,8 @@ > #include > #include > > +extern char *__progname; > + > char *qfname = QUOTAFILENAME; > char *qfextension[] = INITQFNAMES; > > @@ -52,34 +54,31 @@ int gflag; /* operate on group quotas * > int uflag; /* operate on user quotas */ > int vflag; /* verbose */ > > -void usage(char *whoami); > -int hasquota(struct fstab *fs, int type, char **qfnamep, int force); > -int quotaonoff(struct fstab *fs, int offmode, int type, char *qfpathname); > -int oneof(char *target, char *list[], int cnt); > -int readonly(struct fstab *fs); > +void usage(); > +int hasquota(const struct fstab *fs, int type, char **qfnamep, int force); > +int quotaonoff(const struct fstab *fs, int offmode, int type, char > *qfpathname); > +int oneof(const char *target, char *list[], int cnt); > +int readonly(const struct fstab *fs); > > > int > main(int argc, char *argv[]) > { > struct fstab *fs; > - char *qfnp, *whoami; > + char *qfnp; > long argnum, done = 0; > int i, offmode = 0, errs = 0; > extern int optind; > int ch; > > - whoami = strrchr(*argv, '/') + 1; > - if (whoami == (char *)1) > - whoami = *argv; > - if (strcmp(whoami, "quotaoff") == 0) > + if (strcmp(__progname, "quotaoff") == 0) > offmode = 1; > - else if (strcmp(whoami, "quotaon") != 0) { > + else if (strcmp(__progname, "quotaon") != 0) { > fprintf(stderr, "Name must be quotaon or quotaoff not %s\n", > - whoami); > + __progname); > exit(1); > } > - while ((ch = getopt(argc, argv, "avug")) != -1) { > + while ((ch = getopt(argc, argv, "aguv")) != -1) { > switch (ch) { > case 'a': > aflag = 1; > @@ -94,13 +93,13 @@ main(int argc, char *argv[]) > vflag = 1; > break; > default: > - usage(whoami); > + usage(); > } > } > argc -= optind; > argv += optind; > if (argc <= 0 && !aflag) > - usage(whoami); > + usage(); > if (!gflag && !uflag) { > gflag = 1; > uflag = 1; > @@ -142,22 +141,20 @@ main(int argc, char 
*argv[]) > } > > void > -usage(char *whoami) > +usage() > { > - > - fprintf(stderr, "usage: %s [-aguv] filesystem ...\n", whoami); > + fprintf(stderr, "usage: %s [-aguv] filesystem ...\n", __progname); > exit(1); > } > > int > -quotaonoff(struct fstab *fs, int offmode, int type, char *qfpathname
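Since the strtok(3)-to-strsep(3) part of the diff is cut off above, here is the general shape of such a conversion, including the reset of `cp' that fixes the stale-'=' bug described in the change list. Variable names are illustrative; `option' stands for the "userquota"/"groupquota" string the real code builds:

	char *buf, *p, *opt, *cp = NULL;

	if ((buf = p = strdup(fs->fs_mntops)) == NULL)
		err(1, NULL);
	while ((opt = strsep(&p, ",")) != NULL) {
		if ((cp = strchr(opt, '=')) != NULL)
			*cp++ = '\0';	/* cp: the option's value, if any */
		if (strcmp(opt, option) == 0)
			break;		/* found; cp may name the quota file */
		cp = NULL;	/* forget '=' values of unrelated options */
	}
	/* NB: real code must copy *cp before free(buf). */
	free(buf);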
Re: poll/select: Lazy removal of knotes
On 06/11/21(Sat) 15:53, Visa Hankala wrote: > On Fri, Nov 05, 2021 at 10:04:50AM +0100, Martin Pieuchot wrote: > > New poll/select(2) implementation converts 'struct pollfd' and 'fdset' to > > knotes (kqueue event descriptors) then passes them to the kqueue subsystem. > > A knote is allocated, with kqueue_register(), for every read, write and > > except condition watched on a given FD. That means at most 3 allocations > > might be necessary per FD. > > > > The diff below reduces the overhead of per-syscall allocation/free of those > > descriptors by leaving those which didn't trigger on the kqueue across > > syscall. Leaving knotes on the kqueue allows kqueue_register() to re-use > > an existing descriptor instead of re-allocating a new one. > > > > With this, knotes are now lazily removed. The mechanism uses a serial > > number which is incremented for every syscall that indicates if a knote > > sitting in the kqueue is still valid or should be freed. > > > > Note that performance improvements might not be visible with this diff > > alone because kqueue_register() still pre-allocates a descriptor then drops > > it. > > > > visa@ already pointed out that the lazy removal logic could be integrated > > in kqueue_scan() which would reduce the complexity of those two syscalls. > > I'm arguing for doing this in a next step in-tree. > > I think it would make more sense to add the removal logic to the scan > function first as doing so would keep the code modifications more > logical and simpler. This would also avoid the need to go through > a temporary removal approach. I totally support your effort and your design however I don't have the time to do another round of test/debugging. So please, can you take care of doing these cleanups afterward? If not, please send a full diff and take over this feature, it's too much effort for me to work out of tree. > Index: kern/kern_event.c > === > RCS file: src/sys/kern/kern_event.c,v > retrieving revision 1.170 > diff -u -p -r1.170 kern_event.c > --- kern/kern_event.c 6 Nov 2021 05:48:47 - 1.170 > +++ kern/kern_event.c 6 Nov 2021 15:31:04 - > @@ -73,6 +73,7 @@ voidkqueue_terminate(struct proc *p, st > void KQREF(struct kqueue *); > void KQRELE(struct kqueue *); > > +void kqueue_purge(struct proc *, struct kqueue *); > int kqueue_sleep(struct kqueue *, struct timespec *); > > int kqueue_read(struct file *, struct uio *, int); > @@ -806,6 +807,22 @@ kqpoll_exit(void) > } > > void > +kqpoll_done(unsigned int num) > +{ > + struct proc *p = curproc; > + > + KASSERT(p->p_kq != NULL); > + > + if (p->p_kq_serial + num >= p->p_kq_serial) { > + p->p_kq_serial += num; > + } else { > + /* Clear all knotes after serial wraparound. */ > + kqueue_purge(p, p->p_kq); > + p->p_kq_serial = 1; > + } > +} > + > +void > kqpoll_dequeue(struct proc *p, int all) > { > struct knote marker; > @@ -1383,6 +1400,15 @@ retry: > > mtx_leave(&kq->kq_lock); > + > + /* Drop spurious events. 
*/ > + if (p->p_kq == kq && > + p->p_kq_serial > (unsigned long)kn->kn_udata) { > + filter_detach(kn); > + knote_drop(kn, p); > + mtx_enter(&kq->kq_lock); > + continue; > + } > + > memset(kevp, 0, sizeof(*kevp)); > if (filter_process(kn, kevp) == 0) { > mtx_enter(&kq->kq_lock); > Index: kern/sys_generic.c > === > RCS file: src/sys/kern/sys_generic.c,v > retrieving revision 1.139 > diff -u -p -r1.139 sys_generic.c > --- kern/sys_generic.c29 Oct 2021 15:52:44 - 1.139 > +++ kern/sys_generic.c6 Nov 2021 15:31:04 - > @@ -730,8 +730,7 @@ done: > if (pibits[0] != (fd_set *)&bits[0]) > free(pibits[0], M_TEMP, 6 * ni); > > - kqueue_purge(p, p->p_kq); > - p->p_kq_serial += nd; > + kqpoll_done(nd); > > return (error); > } > @@ -1230,8 +1229,7 @@ bad: > if (pl != pfds) > free(pl, M_TEMP, sz); > > - kqueue_purge(p, p->p_kq); > - p->p_kq_serial += nfds; > + kqpoll_done(nfds); > > return (error); > } > @@ -1251,8 +1249,7 @@ ppollcollect(struct proc *p, struct keve > /* >* Lazily delete spurious events. >* > -
poll/select: Lazy removal of knotes
New poll/select(2) implementation converts 'struct pollfd' and 'fdset' to knotes (kqueue event descriptors) then passes them to the kqueue subsystem. A knote is allocated, with kqueue_register(), for every read, write and except condition watched on a given FD. That means at most 3 allocations might be necessary per FD. The diff below reduces the overhead of per-syscall allocation/free of those descriptors by leaving those which didn't trigger on the kqueue across syscall. Leaving knotes on the kqueue allows kqueue_register() to re-use an existing descriptor instead of re-allocating a new one. With this, knotes are now lazily removed. The mechanism uses a serial number which is incremented for every syscall that indicates if a knote sitting in the kqueue is still valid or should be freed. Note that performance improvements might not be visible with this diff alone because kqueue_register() still pre-allocates a descriptor then drops it. visa@ already pointed out that the lazy removal logic could be integrated in kqueue_scan() which would reduce the complexity of those two syscalls. I'm arguing for doing this in a next step in-tree. Please test and review :) Index: kern/sys_generic.c === RCS file: /cvs/src/sys/kern/sys_generic.c,v retrieving revision 1.139 diff -u -p -r1.139 sys_generic.c --- kern/sys_generic.c 29 Oct 2021 15:52:44 - 1.139 +++ kern/sys_generic.c 5 Nov 2021 08:11:05 - @@ -598,7 +598,7 @@ sys_pselect(struct proc *p, void *v, reg int dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex, -struct timespec *timeout, const sigset_t *sigmask, register_t *retval) +struct timespec *tsp, const sigset_t *sigmask, register_t *retval) { struct kqueue_scan_state scan; fd_mask bits[6]; @@ -666,10 +666,10 @@ dopselect(struct proc *p, int nd, fd_set if (nevents == 0 && ncollected == 0) { uint64_t nsecs = INFSLP; - if (timeout != NULL) { - if (!timespecisset(timeout)) + if (tsp != NULL) { + if (!timespecisset(tsp)) goto done; - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP)); + nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(tsp), MAXTSLP)); } error = tsleep_nsec(&p->p_kq, PSOCK | PCATCH, "kqsel", nsecs); /* select is not restarted after signals... */ @@ -682,28 +682,37 @@ dopselect(struct proc *p, int nd, fd_set /* Collect at most `nevents' possibly waiting in kqueue_scan() */ kqueue_scan_setup(&scan, p->p_kq); - while (nevents > 0) { + while ((nevents - ncollected) > 0) { struct kevent kev[KQ_NEVENTS]; int i, ready, count; - /* Maximum number of events per iteration */ - count = MIN(nitems(kev), nevents); - ready = kqueue_scan(&scan, count, kev, timeout, p, &ncollected); + /* +* Maximum number of events per iteration. Use the whole +* array to gather as many spurious events as possible. +*/ + count = nitems(kev); + ready = kqueue_scan(&scan, count, kev, tsp, p, &ncollected); #ifdef KTRACE if (KTRPOINT(p, KTR_STRUCT)) ktrevent(p, kev, ready); #endif - /* Convert back events that are ready. */ + /* Convert back events that are ready/delete spurious ones. */ for (i = 0; i < ready && error == 0; i++) error = pselcollect(p, &kev[i], pobits, &ncollected); + /* -* Stop if there was an error or if we had enough -* space to collect all events that were ready. +* Stop if there was an error or if we had enough space +* to collect all non-spurious events that were ready. */ - if (error || ready < count) + if (error || !ready || (ncollected > 0 && ready < count)) break; - nevents -= ready; + /* +* If we only got spurious events try again repositioning +* the marker. 
+*/ + if (ncollected == 0 && ((tsp == NULL) || timespecisset(tsp))) + scan.kqs_nevent = 0; } kqueue_scan_finish(&scan); *retval = ncollected; @@ -730,7 +739,7 @@ done: if (pibits[0] != (fd_set *)&bits[0]) free(pibits[0], M_TEMP, 6 * ni); - kqueue_purge(p, p->p_kq); + /* Needed to remove events lazily. */ p->p_kq_serial += nd; return (error); @@ -759,7 +768,7 @@ pselregister(struct proc *p, fd_set *pib DPRINTFN(2, "select fd %d mask %d serial %lu\n", fd, msk, p->p_kq_serial);
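For readers following along, the contract behind the lazy removal: pselregister() stamps each knote with the current serial through the udata field, and kqueue_scan() later treats a knote with an older stamp as spurious and drops it. Roughly (simplified from the diff, not a new API):

	struct kevent kev;

	/* Stamp the knote with this syscall's serial number. */
	EV_SET(&kev, fd, kfilter, EV_ADD|EV_ENABLE|__EV_POLL,
	    0, 0, (void *)(p->p_kq_serial));
	error = kqueue_register(p->p_kq, &kev, p);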
Re: UNIX sockets: use vnode(9) lock to protect `v_socket' dereference
On 26/10/21(Tue) 14:12, Vitaliy Makkoveev wrote: > Another step to make UNIX sockets locking fine grained. > > The listening socket has the references from the file descriptor layer and > from the vnode(9) layer. This means that when we close(2) such a socket it > is still referenced by a concurrent thread through the connect(2) path. > > When we bind(2) a UNIX socket we link it to a vnode(9) by assigning > `v_socket'. When we connect(2) a socket to the socket we previously > bind(2)'ed we find it by namei(9) and obtain its reference through > `v_socket'. This socket has no extra reference in the file descriptor > layer and could be closed by a concurrent thread. > > This time we have the `unp_lock' rwlock(9) which protects the whole layer > and the dereference of `v_socket' is safe. But with fine grained > locking the `v_socket' will not be protected by a global lock. When we > obtain the vnode(9) by namei(9) in the connect(2) or bind(2) paths it is > already exclusively locked by the vnode(9) lock. But in unp_detach() which > is called on the close(2)'ing socket we assume `unp_lock' protects > `v_socket'. > > I propose to use the exclusive vnode(9) lock to protect `v_socket'. With > fine grained locking, the `v_socket' dereference in the unp_bind() or > unp_connect() threads will be safe because the unp_detach() thread will > wait for the vnode(9) lock release. The vnode referenced by `unp_vnode' > has its reference counter bumped so its dereference is also safe without > `unp_lock' held. This makes sense to me. Using the vnode lock here seems the simplest approach. > The `i_lock' should be taken before `unp_lock' and unp_detach() should > release solock(). To prevent connections on this socket the > 'SO_ACCEPTCONN' bit is cleared in soclose(). This is done to prevent races when solock() is released inside soabort(), right? Is it the only one or some more care is needed? Will this stay with per-socket locks or is this only necessary because of the global `unp_lock'? > Index: sys/kern/uipc_socket.c > === > RCS file: /cvs/src/sys/kern/uipc_socket.c,v > retrieving revision 1.265 > diff -u -p -r1.265 uipc_socket.c > --- sys/kern/uipc_socket.c14 Oct 2021 23:05:10 - 1.265 > +++ sys/kern/uipc_socket.c26 Oct 2021 11:05:59 - > @@ -315,6 +315,8 @@ soclose(struct socket *so, int flags) > /* Revoke async IO early. There is a final revocation in sofree(). */ > sigio_free(&so->so_sigio); > if (so->so_options & SO_ACCEPTCONN) { > + so->so_options &= ~SO_ACCEPTCONN; > + > while ((so2 = TAILQ_FIRST(&so->so_q0)) != NULL) { > (void) soqremque(so2, 0); > (void) soabort(so2); > Index: sys/kern/uipc_usrreq.c > === > RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v > retrieving revision 1.150 > diff -u -p -r1.150 uipc_usrreq.c > --- sys/kern/uipc_usrreq.c21 Oct 2021 22:11:07 - 1.150 > +++ sys/kern/uipc_usrreq.c26 Oct 2021 11:05:59 - > @@ -474,20 +474,30 @@ void > unp_detach(struct unpcb *unp) > { > struct socket *so = unp->unp_socket; > - struct vnode *vp = NULL; > + struct vnode *vp = unp->unp_vnode; > > rw_assert_wrlock(&unp_lock); > > LIST_REMOVE(unp, unp_link); > - if (unp->unp_vnode) { > + > + if (vp) { > + unp->unp_vnode = NULL; > + > /* > - * `v_socket' is only read in unp_connect and > - * unplock prevents concurrent access. 
>*/ > > - unp->unp_vnode->v_socket = NULL; > - vp = unp->unp_vnode; > - unp->unp_vnode = NULL; > + sounlock(so, SL_LOCKED); > + > + VOP_LOCK(vp, LK_EXCLUSIVE); > + vp->v_socket = NULL; > + > + KERNEL_LOCK(); > + vput(vp); > + KERNEL_UNLOCK(); > + > + solock(so); > } > > if (unp->unp_conn) > @@ -500,21 +510,6 @@ unp_detach(struct unpcb *unp) > pool_put(&unpcb_pool, unp); > if (unp_rights) > task_add(systqmp, &unp_gc_task); > - > - if (vp != NULL) { > - /* > - * Enforce `i_lock' -> `unplock' because fifo subsystem > - * requires it. The socket can't be closed concurrently > - * because the file descriptor reference is > - * still hold. > - */ > - > - sounlock(so, SL_LOCKED); > - KERNEL_LOCK(); > - vrele(vp); > - KERNEL_UNLOCK(); > - solock(so); > - } > } > > int >
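The other side of this, for clarity: once the diff is in, unp_connect() dereferences `v_socket' while holding the vnode lock obtained through namei(9), so it cannot race the cleanup above. A sketch, with error handling abbreviated and the label name hypothetical:

	/* in unp_connect(), after namei(9) returned a locked vnode: */
	so2 = vp->v_socket;	/* safe: unp_detach() takes the vnode
				 * lock before clearing `v_socket' */
	if (so2 == NULL) {
		error = ECONNREFUSED;
		goto put;	/* unlock and release the vnode */
	}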
Re: UNIX sockets: make `unp_rights', `unp_msgcount' and `unp_file' atomic
On 30/10/21(Sat) 21:22, Vitaliy Makkoveev wrote: > This completely removes the global rwlock(9) from the unp_internalize() and > unp_externalize() normal paths but only leaves it in the unp_externalize() > error path. Also we don't need to simultaneously hold both fdplock() > and `unp_lock' in unp_internalize(). As a non obvious profit this > simplifies the future lock dances in the UNIX sockets layer. > > It's safe to call fptounp() without `unp_lock' held. We always got this > file descriptor by fd_getfile(9) so we always have the extra reference > and this descriptor can't be closed by a concurrent thread. Some sockets > could be destroyed through the 'PRU_ABORT' path but they don't have an > associated file descriptor and they are not accessible in the > unp_internalize() path. > > The `unp_file' access without `unp_lock' held is also safe. Each socket > could have the only associated file descriptor and each file descriptor > could have the only associated socket. We only assign `unp_file' in the > unp_internalize() path where we got the socket by fd_getfile(9). This > descriptor has the extra reference and couldn't be closed concurrently. > We could override `unp_file' but with the same address because the > associated file descriptor can't be changed so the address will be also > the same. So while unp_gc() concurrently runs the dereference of > non-NULL `unp_file' is always safe. Using an atomic operation for `unp_msgcount' is ok with me, one comment about `unp_rights' below. > Index: sys/kern/uipc_usrreq.c > === > RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v > retrieving revision 1.153 > diff -u -p -r1.153 uipc_usrreq.c > --- sys/kern/uipc_usrreq.c30 Oct 2021 16:35:31 - 1.153 > +++ sys/kern/uipc_usrreq.c30 Oct 2021 18:41:25 - > @@ -58,6 +58,7 @@ > * Locks used to protect global data and struct members: > * I immutable after creation > * U unp_lock > + * a atomic > */ > struct rwlock unp_lock = RWLOCK_INITIALIZER("unplock"); > > @@ -99,7 +100,7 @@ SLIST_HEAD(,unp_deferral) unp_deferred = > SLIST_HEAD_INITIALIZER(unp_deferred); > > ino_tunp_ino;/* [U] prototype for fake inode numbers */ > -int unp_rights; /* [U] file descriptors in flight */ > +int unp_rights; /* [a] file descriptors in flight */ > int unp_defer; /* [U] number of deferred fp to close by the GC task */ > int unp_gcing; /* [U] GC task currently running */ > > @@ -927,17 +928,16 @@ restart: >*/ > rp = (struct fdpass *)CMSG_DATA(cm); > > - rw_enter_write(&unp_lock); > for (i = 0; i < nfds; i++) { > struct unpcb *unp; > > fp = rp->fp; > rp++; > if ((unp = fptounp(fp)) != NULL) > - unp->unp_msgcount--; > - unp_rights--; > + atomic_dec_long(&unp->unp_msgcount); > } > - rw_exit_write(&unp_lock); > + > + atomic_sub_int(&unp_rights, nfds); > > /* >* Copy temporary array to message and adjust length, in case of > @@ -985,13 +985,10 @@ unp_internalize(struct mbuf *control, st > return (EINVAL); > nfds = (cm->cmsg_len - CMSG_ALIGN(sizeof(*cm))) / sizeof (int); > > - rw_enter_write(&unp_lock); > - if (unp_rights + nfds > maxfiles / 10) { > - rw_exit_write(&unp_lock); > + if (atomic_add_int_nv(&unp_rights, nfds) > maxfiles / 10) { > + atomic_sub_int(&unp_rights, nfds); I can't believe this is race free. If two threads, T1 and T2, call atomic_add at the same time both might end up returning EMFILE even if only the first one currently does. This could happen if T1 exceeds the limit and T2 does atomic_add on an already-exceeded `unp_rights' before T1 could do atomic_sub. I suggest using a mutex to protect `unp_rights' instead to solve this issue. 
> return (EMFILE); > } > - unp_rights += nfds; > - rw_exit_write(&unp_lock); > > /* Make sure we have room for the struct file pointers */ > morespace: > @@ -1031,7 +1028,6 @@ morespace: > ip = ((int *)CMSG_DATA(cm)) + nfds - 1; > rp = ((struct fdpass *)CMSG_DATA(cm)) + nfds - 1; > fdplock(fdp); > - rw_enter_write(&unp_lock); > for (i = 0; i < nfds; i++) { > memcpy(&fd, ip, sizeof fd); > ip--; > @@ -1056,15 +1052,13 @@ morespace: > rp->flags = fdp->fd_ofileflags[fd] & UF_PLEDGED; > rp--; > if ((unp = fptounp(fp)) != NULL) { > + atomic_inc_long(&unp->unp_msgcount); > unp->unp_file = fp; > - unp->unp_msgcount++; > } > } > - rw_exit_write(&unp_lock); > fdpunlock(fdp); > return (0); > fail: > - rw_exit_write(&unp_lock); > fdpunlock(fdp); > if (fp != NULL) > FRELE(fp, p); > @@ -1072,17 +1066,13 @@ fail: > for ( ; i >
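A sketch of the mutex-based variant suggested above, making the limit check and the increment one critical section; the mutex name and IPL are assumptions:

	struct mutex unp_rights_mtx = MUTEX_INITIALIZER(IPL_NONE);

	/* in unp_internalize(), sketch: */
	mtx_enter(&unp_rights_mtx);
	if (unp_rights + nfds > maxfiles / 10) {
		mtx_leave(&unp_rights_mtx);
		return (EMFILE);
	}
	unp_rights += nfds;
	mtx_leave(&unp_rights_mtx);

With this, a thread can only fail the check when `unp_rights' really is over the limit at that instant, which avoids the spurious EMFILE the two-thread scenario above produces.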