Re: Prevent off-by-one accounting hang in out-of-swap situations

2023-10-26 Thread Martin Pieuchot
On 26/10/23(Thu) 07:06, Miod Vallat wrote:
> > I wonder if the diff below makes a difference.  It's hard to debug and it
> > might be worth adding a counter for bad swap slots.
> 
> It did not help (but your diff is probably correct).

In that case I'd like to put both diffs in, are you ok with that?

> > Index: uvm/uvm_anon.c
> > ===
> > RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
> > retrieving revision 1.56
> > diff -u -p -r1.56 uvm_anon.c
> > --- uvm/uvm_anon.c  2 Sep 2023 08:24:40 -   1.56
> > +++ uvm/uvm_anon.c  22 Oct 2023 21:27:42 -
> > @@ -116,7 +116,7 @@ uvm_anfree_list(struct vm_anon *anon, st
> > uvm_unlock_pageq(); /* free the daemon */
> > }
> > } else {
> > -   if (anon->an_swslot != 0) {
> > +   if (anon->an_swslot != 0 && anon->an_swslot != SWSLOT_BAD) {
> > /* This page is no longer only in swap. */
> > KASSERT(uvmexp.swpgonly > 0);
> > atomic_dec_int(&uvmexp.swpgonly);



Re: Prevent off-by-one accounting hang in out-of-swap situations

2023-10-22 Thread Martin Pieuchot
On 22/10/23(Sun) 20:29, Miod Vallat wrote:
> > On 21/10/23(Sat) 14:28, Miod Vallat wrote:
> > > > Stuart, Miod, I wonder if this also helps for the off-by-one issue you
> > > > are seeing.  It might not.
> > > 
> > > It makes the aforementioned issue disappear on the affected machine.
> > 
> > Thanks a lot for testing!
> 
> Spoke too soon. I have just hit
> 
> panic: kernel diagnostic assertion "uvmexp.swpgonly > 0" failed: file "/usr/src/sys/uvm/uvm_anon.c", line 121
> Stopped at  db_enter+0x8:   add #0x4, r14
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *235984  11904  0 0x14000  0x2000  reaper
> db_enter() at db_enter+0x8
> panic() at panic+0x74
> __assert() at __assert+0x1c
> uvm_anfree_list() at uvm_anfree_list+0x156
> amap_wipeout() at amap_wipeout+0xe6
> uvm_unmap_detach() at uvm_unmap_detach+0x42
> uvm_map_teardown() at uvm_map_teardown+0x104
> uvmspace_free() at uvmspace_free+0x2a
> reaper() at reaper+0x86
> ddb> show uvmexp
> Current UVM status:
>   pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
>   14875 VM pages: 376 active, 2076 inactive, 1 wired, 7418 free (924 zero)
>   min  10% (25) anon, 10% (25) vnode, 5% (12) vtext
>   freemin=495, free-target=660, inactive-target=2809, wired-max=4958
>   faults=73331603, traps=39755714, intrs=33863551, ctxswitch=11641480 fpuswitch=0
>   softint=15742561, syscalls=39755712, kmapent=11
>   fault counts:
> noram=1, noanon=0, noamap=0, pgwait=1629, pgrele=0
> ok relocks(total)=1523991(1524022), anget(retries)=23905247(950233), amapcopy=9049749
> neighbor anon/obj pg=12025732/40041442, gets(lock/unlock)=12859247/574102
> cases: anon=20680175, anoncow=3225049, obj=11467884, prcopy=1391019, przero=36545783
>   daemon and swap counts:
> woke=6868, revs=6246, scans=3525644, obscans=511526, anscans=2930634
> busy=0, freed=1973275, reactivate=83484, deactivate=3941988
> pageouts=94506, pending=94506, nswget=949421
> nswapdev=1
> swpages=4194415, swpginuse=621, swpgonly=0 paging=0
>   kernel pointers:
> objs(kern)=0x8c3ca94c

I wonder if the diff below makes a difference.  It's hard to debug and it
might be worth adding a counter for bad swap slots.
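
If we go that way, a minimal sketch of such a counter could be the
following (the `swpgbad' name, its placement and the parameter name are
assumptions for illustration, not part of the diff below):

	/* sys/uvm/uvmexp.h */
	int swpgbad;	/* [a] number of swap pages marked bad */

	/* in uvm_swap_markbad(), sys/uvm/uvm_swap.c */
	atomic_add_int(&uvmexp.swpgbad, nslots);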

Index: uvm/uvm_anon.c
===
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
retrieving revision 1.56
diff -u -p -r1.56 uvm_anon.c
--- uvm/uvm_anon.c  2 Sep 2023 08:24:40 -   1.56
+++ uvm/uvm_anon.c  22 Oct 2023 21:27:42 -
@@ -116,7 +116,7 @@ uvm_anfree_list(struct vm_anon *anon, st
uvm_unlock_pageq(); /* free the daemon */
}
} else {
-   if (anon->an_swslot != 0) {
+   if (anon->an_swslot != 0 && anon->an_swslot != SWSLOT_BAD) {
/* This page is no longer only in swap. */
KASSERT(uvmexp.swpgonly > 0);
atomic_dec_int(&uvmexp.swpgonly);



Re: Prevent off-by-one accounting hang in out-of-swap situations

2023-10-21 Thread Martin Pieuchot
On 21/10/23(Sat) 14:28, Miod Vallat wrote:
> > Stuart, Miod, I wonder if this also helps for the off-by-one issue you
> > are seeing.  It might not.
> 
> It makes the aforementioned issue disappear on the affected machine.

Thanks a lot for testing!

> > Comments, ok?
> 
> > diff --git sys/uvm/uvm_pdaemon.c sys/uvm/uvm_pdaemon.c
> > index 284211d226c..a26a776df67 100644
> > --- sys/uvm/uvm_pdaemon.c
> > +++ sys/uvm/uvm_pdaemon.c
> 
> > @@ -917,9 +914,7 @@ uvmpd_scan(struct uvm_pmalloc *pma, struct uvm_constraint_range *constraint)
> >  */
> > free = uvmexp.free - BUFPAGES_DEFICIT;
> > swap_shortage = 0;
> > -   if (free < uvmexp.freetarg &&
> > -   uvmexp.swpginuse == uvmexp.swpages &&
> > -   !uvm_swapisfull() &&
> > +   if (free < uvmexp.freetarg && uvm_swapisfilled() && !uvm_swapisfull() &&
> > pages_freed == 0) {
> > swap_shortage = uvmexp.freetarg - free;
> > }
> 
> It's unfortunate that you now invoke two uvm_swapisxxx() routines, which
> will both acquire a mutex. Maybe a third uvm_swapisxxx routine could be
> introduced to compute the swapisfilled && !swapisfull condition at once?

I'm not interested in such micro-optimizations yet.  Not acquiring a mutex
twice is IMHO not worth making the code half-shiny.

However, if someone is interested in going down this direction, I'd
suggest trying to place `uvmexp.freetarg' under the same lock and dealing
with all its occurrences.  This is a possible next step to reduce the scope of
the uvm_lock_pageq() which is currently responsible for most of the
contention on MP in UVM.
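
(For illustration only, the combined check Miod suggests might look
roughly like the sketch below; the name and exact form are assumptions:)

	int
	uvm_swapisfilled_but_not_full(void)
	{
		int result;

		mtx_enter(&uvm_swap_data_lock);
		result = (uvmexp.swpginuse + SWCLUSTPAGES) >= uvmexp.swpages &&
		    uvmexp.swpgonly < (uvmexp.swpages * 99 / 100);
		mtx_leave(&uvm_swap_data_lock);

		return result;
	}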



Re: bt(5), btrace(8): execute END probe and print maps after exit() statement

2023-10-21 Thread Martin Pieuchot
On 18/10/23(Wed) 12:56, Scott Cheloha wrote:
> Hi,
> 
> A bt(5) exit() statement causes the btrace(8) interpreter to exit(3)
> immediately.
> 
> A BPFtrace exit() statement is more nuanced: the END probe is executed
> and the contents of all maps are printed before the interpreter exits.
> 
> This patch adds a halting check after the execution of each bt(5)
> statement.  If a statement causes the program to halt, the halt
> bubbles up to the top-level rule evaluation loop and terminates
> execution.  rules_teardown() then runs, just as if the program had
> received SIGTERM.
> 
> Two edge-like cases:
> 
> 1. You can exit() from BEGIN.  rules_setup() returns non-zero if this
>happens so the main loop knows to halt immediately.
> 
> 2. You can exit() from END.  This is just an early-return: the END probe
>doesn't run again.
> 
> Thoughts?

Makes sense to ease the transition from bpftrace scripts.  Ok with me if
you make sure the regression tests still pass.  Some outputs might
depend on the actual behavior and would need to be updated.

> 
> $ btrace -e '
> BEGIN {
>   @[probe] = "reached";
>   exit();
>   @[probe] = "not reached";
> }
> END {
>   @[probe] = "reached";
>   exit();
>   @[probe] = "not reached";
> }'
> 
> Index: btrace.c
> ===
> RCS file: /cvs/src/usr.sbin/btrace/btrace.c,v
> retrieving revision 1.79
> diff -u -p -r1.79 btrace.c
> --- btrace.c  12 Oct 2023 15:16:44 -  1.79
> +++ btrace.c  18 Oct 2023 17:54:16 -
> @@ -71,10 +71,10 @@ struct dtioc_probe_info   *dtpi_get_by_val
>   * Main loop and rule evaluation.
>   */
>  void  rules_do(int);
> -void  rules_setup(int);
> -void  rules_apply(int, struct dt_evt *);
> +int   rules_setup(int);
> +int   rules_apply(int, struct dt_evt *);
>  void  rules_teardown(int);
> -void  rule_eval(struct bt_rule *, struct dt_evt *);
> +int   rule_eval(struct bt_rule *, struct dt_evt *);
>  void  rule_printmaps(struct bt_rule *);
>  
>  /*
> @@ -84,7 +84,7 @@ uint64_t builtin_nsecs(struct dt_evt *
>  const char   *builtin_kstack(struct dt_evt *);
>  const char   *builtin_arg(struct dt_evt *, enum bt_argtype);
>  struct bt_arg*fn_str(struct bt_arg *, struct dt_evt *, char 
> *);
> -void  stmt_eval(struct bt_stmt *, struct dt_evt *);
> +int   stmt_eval(struct bt_stmt *, struct dt_evt *);
>  void  stmt_bucketize(struct bt_stmt *, struct dt_evt *);
>  void  stmt_clear(struct bt_stmt *);
>  void  stmt_delete(struct bt_stmt *, struct dt_evt *);
> @@ -405,6 +405,7 @@ void
>  rules_do(int fd)
>  {
>   struct sigaction sa;
> + int halt = 0;
>  
>   memset(&sa, 0, sizeof(sa));
>   sigemptyset(&sa.sa_mask);
> @@ -415,9 +416,9 @@ rules_do(int fd)
>   if (sigaction(SIGTERM, &sa, NULL))
>   err(1, "sigaction");
>  
> - rules_setup(fd);
> + halt = rules_setup(fd);
>  
> - while (!quit_pending && g_nprobes > 0) {
> + while (!quit_pending && !halt && g_nprobes > 0) {
>   static struct dt_evt devtbuf[64];
>   ssize_t rlen;
>   size_t i;
> @@ -434,8 +435,11 @@ rules_do(int fd)
>   if ((rlen % sizeof(struct dt_evt)) != 0)
>   err(1, "incorrect read");
>  
> - for (i = 0; i < rlen / sizeof(struct dt_evt); i++)
> - rules_apply(fd, &devtbuf[i]);
> + for (i = 0; i < rlen / sizeof(struct dt_evt); i++) {
> + halt = rules_apply(fd, &devtbuf[i]);
> + if (halt)
> + break;
> + }
>   }
>  
>   rules_teardown(fd);
> @@ -484,7 +488,7 @@ rules_action_scan(struct bt_stmt *bs)
>   return evtflags;
>  }
>  
> -void
> +int
>  rules_setup(int fd)
>  {
>   struct dtioc_probe_info *dtpi;
> @@ -493,7 +497,7 @@ rules_setup(int fd)
>   struct bt_probe *bp;
>   struct bt_stmt *bs;
>   struct bt_arg *ba;
> - int dokstack = 0, on = 1;
> + int dokstack = 0, halt = 0, on = 1;
>   uint64_t evtflags;
>  
>   TAILQ_FOREACH(r, &g_rules, br_next) {
> @@ -553,7 +557,7 @@ rules_setup(int fd)
>   clock_gettime(CLOCK_REALTIME, &bt_devt.dtev_tsp);
>  
>   if (rbegin)
> - rule_eval(rbegin, &bt_devt);
> + halt = rule_eval(rbegin, &bt_devt);
>  
>   /* Enable all probes */
>   TAILQ_FOREACH(r, &g_rules, br_next) {
> @@ -571,9 +575,14 @@ rules_setup(int fd)
>   if (ioctl(fd, DTIOCRECORD, &on))
>   err(1, "DTIOCRECORD");
>   }
> +
> + return halt;
>  }
>  
> -void
> +/*
> + * Returns non-zero if the program should halt.
> + */
> +int
>  rules_apply(int fd, struct dt_evt *dtev)
>  {
>   struct bt_r

Prevent off-by-one accounting hang in out-of-swap situations

2023-10-17 Thread Martin Pieuchot
Diff below makes out-of-swap checks more robust.

When a system is low on swap, two variables are used to adapt the
behavior of the page daemon and fault handler:

`swpginuse'
	indicates how much swap space is currently being used

`swpgonly'
	indicates how much swap space stores the content of pages that are
	no longer living in memory.

The diff below changes the heuristic to detect if the system is currently
out-of-swap.  In my tests it makes the system more stable by preventing
hangs that occur when the system has more than 99% of its swap space
filled.  When this happens, the checks using the variables above to figure
out if we are out-of-swap might never be true because of:

- Races between the fault-handler and the accounting of the page daemon
  due to asynchronous swapping

- The swap partition being never completely allocated (bad pages, 
  off-by-one, rounding error, size of a swap cluster...)

- Possible off-by-one accounting errors in swap space

So I'm adapting uvm_swapisfull() to return true as soon as more than
99% of the swap space is filled with pages which are no longer in memory.
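
(Illustration only, not part of the diff: with swpages=4194415, as in the
ddb output above, the old check needs `swpgonly' to reach exactly 4194415,
which never happens if even a handful of slots are bad or never allocated;
the new check already fires at 4194415 * 99 / 100 = 4152470.)

	int
	swap_is_full(int swpgonly, int swpages)
	{
		return swpgonly >= swpages * 99 / 100;	/* was: swpgonly == swpages */
	}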

I also introduce uvm_swapisfilled() to prevent later failures if there is
less than a cluster of space available (the minimum we try to swap out).
This prevents deadlocking if only a few slots remain.

-   if ((p->pg_flags & PQ_SWAPBACKED) &&
-   uvmexp.swpginuse == uvmexp.swpages) {
+   if ((p->pg_flags & PQ_SWAPBACKED) && uvm_swapisfilled())
uvmpd_dropswap(p);
-   }
 
/*
 * the page we are looking at is dirty.   we must
@@ -917,9 +914,7 @@ uvmpd_scan(struct uvm_pmalloc *pma, struct uvm_constraint_range *constraint)
 */
free = uvmexp.free - BUFPAGES_DEFICIT;
swap_shortage = 0;
-   if (free < uvmexp.freetarg &&
-   uvmexp.swpginuse == uvmexp.swpages &&
-   !uvm_swapisfull() &&
+   if (free < uvmexp.freetarg && uvm_swapisfilled() && !uvm_swapisfull() &&
pages_freed == 0) {
swap_shortage = uvmexp.freetarg - free;
}
diff --git sys/uvm/uvm_swap.c sys/uvm/uvm_swap.c
index 27963259eba..913b2366a7c 100644
--- sys/uvm/uvm_swap.c
+++ sys/uvm/uvm_swap.c
@@ -1516,8 +1516,30 @@ ReTry:   /* XXXMRG */
 }
 
 /*
- * uvm_swapisfull: return true if all of available swap is allocated
- * and in use.
+ * uvm_swapisfilled: return true if the amount of free space in swap is
+ * smaller than the size of a cluster.
+ *
+ * As long as some swap slots are being used by pages currently in memory,
+ * it is possible to reuse them.  Even if the swap space has been completely
+ * filled we do not consider it full.
+ */
+int
+uvm_swapisfilled(void)
+{
+   int result;
+
+   mtx_enter(&uvm_swap_data_lock);
+   KASSERT(uvmexp.swpginuse <= uvmexp.swpages);
+   result = (uvmexp.swpginuse + SWCLUSTPAGES) >= uvmexp.swpages;
+   mtx_leave(&uvm_swap_data_lock);
+
+   return result;
+}
+
+/*
+ * uvm_swapisfull: return true if the amount of pages only in swap
+ * accounts for more than 99% of the total swap space.
+ *
  */
 int
 uvm_swapisfull(void)
@@ -1526,7 +1548,7 @@ uvm_swapisfull(void)
 
mtx_enter(&uvm_swap_data_lock);
KASSERT(uvmexp.swpgonly <= uvmexp.swpages);
-   result = (uvmexp.swpgonly == uvmexp.swpages);
+   result = (uvmexp.swpgonly >= (uvmexp.swpages * 99 / 100));
mtx_leave(&uvm_swap_data_lock);
 
return result;
diff --git sys/uvm/uvm_swap.h sys/uvm/uvm_swap.h
index 9904fe58cd7..f60237405ca 100644
--- sys/uvm/uvm_swap.h
+++ sys/uvm/uvm_swap.h
@@ -42,6 +42,7 @@ int   uvm_swap_put(int, struct vm_page **, int, int);
 intuvm_swap_alloc(int *, boolean_t);
 void   uvm_swap_free(int, int);
 void   uvm_swap_markbad(int, int);
+intuvm_swapisfilled(void);
 intuvm_swapisfull(void);
 void   uvm_swap_freepages(struct vm_page **, int);
 #ifdef HIBERNATE
diff --git sys/uvm/uvmexp.h sys/uvm/uvmexp.h
index de5f5fa367c..144494b73ff 100644
--- sys/uvm/uvmexp.h
+++ sys/uvm/uvmexp.h
@@ -83,7 +83,7 @@ struct uvmexp {
/* swap */
int nswapdev;   /* [S] number of configured swap devices in system */
int swpages;/* [S] number of PAGE_SIZE'ed swap pages */
-   int swpginuse;  /* [K] number of swap pages in use */
+   int swpginuse;  /* [S] number of swap pages in use */
int swpgonly;   /* [a] number of swap pages in use, not also in RAM */
int nswget; /* [a] number of swap pages moved from disk to RAM */
int nanon;  /* XXX number total of anon's in system */



small fix in uvmpd_scan_inactive()

2023-10-17 Thread Martin Pieuchot
Diff below merges two equivalent if blocks.  No functional change, ok?


Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.107
diff -u -p -r1.107 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   16 Oct 2023 11:32:54 -  1.107
+++ uvm/uvm_pdaemon.c   17 Oct 2023 10:28:25 -
@@ -650,6 +650,11 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
p->offset >> PAGE_SHIFT,
swslot + swcpages);
swcpages++;
+   rw_exit(slock);
+
+   /* cluster not full yet? */
+   if (swcpages < swnpages)
+   continue;
}
} else {
/* if p == NULL we must be doing a last swap i/o */
@@ -666,14 +671,6 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 * for object pages, we always do the pageout.
 */
if (swap_backed) {
-   if (p) {/* if we just added a page to cluster */
-   rw_exit(slock);
-
-   /* cluster not full yet? */
-   if (swcpages < swnpages)
-   continue;
-   }
-
/* starting I/O now... set up for it */
npages = swcpages;
ppsp = swpps;



Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-18 Thread Martin Pieuchot
On 17/09/23(Sun) 11:22, Scott Cheloha wrote:
> v2 is attached.

Thanks.

> Clockintrs now have an argument.  If we pass the PCB as argument, we
> can avoid doing a linear search to find the PCB during the interrupt.
> 
> One thing I'm unsure about is whether I need to add a "barrier" flag
> to clockintr_cancel() and/or clockintr_disestablish().  This would
> cause the caller to block until the clockintr callback function has
> finished executing, which would ensure that it was safe to free the
> PCB.

Please do not reinvent the wheel.  Try to imitate what other kernel APIs
do.  Look at task_add(9) and timeout_add(9).  Call the functions add/del()
to match existing APIs, then we can add a clockintr_del_barrier() if needed.
Do not introduce functions before we need them.  I hope we won't need
it.
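
To make the parallel concrete, the naming I have in mind would be
something like the following (prototypes assumed, for illustration only):

	void	clockintr_add(struct clockintr *, uint64_t);
	void	clockintr_del(struct clockintr *);
	void	clockintr_del_barrier(struct clockintr *);	/* only if we ever need it */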

> On Mon, Sep 04, 2023 at 01:39:25PM +0100, Martin Pieuchot wrote:
> > On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> > > On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > [...]
> > > The goal of clockintr is to provide a machine-independent API for
> > > scheduling clock interrupts.  You can use it to implement something
> > > like hardclock() or statclock().  We are already using it to implement
> > > these functions, among others.
> > 
> > After reading all the code and the previous manuals, I understand it as
> > a low-level per-CPU timeout API with nanosecond precision.  Is that it?
> 
> Yes.
> 
> The distinguishing feature is that it is usually wired up to a
> platform backend, so it can deliver the interrupt at the requested
> expiration time with relatively low error.

Why should we ever prefer timeout(9) over it then?  That's unclear to me
and I'd love to have this clearly documented.

What happened to the manual? 
 
> > Apart from running periodically on a given CPU an important need for
> > dt(4) is to get a timestamps for every event.  Currently nanotime(9)
> > is used.  This is global to the system.  I saw that ftrace use different
> > clocks per-CPU which might be something to consider now that we're
> > moving to a per-CPU API.
> > 
> > It's all about cost of the instrumentation.  Note that if we use a
> > different per-CPU timestamp it has to work outside of clockintr because
> > probes can be anywhere.
> > 
> > Regarding clockintr_establish(), I'd rather have it *not* allocate the
> > `clockintr'.  I'd prefer to waste some more space in every PCB.  The reason
> > for this is to keep live allocation in dt(4) to dt_pcb_alloc() only to
> > be able to not go through malloc(9) at some point in the future to not
> > interfere with the rest of the kernel.  Is there any downside to this?
> 
> You're asking me to change from callee-allocated clockintrs to
> caller-allocated clockintrs.  I don't want to do this.

Why not?  What is your plan?  You want to put many clockintrs in the
kernel?  I don't understand where you're going.  Can you explain?  From
my point of view this design decision is only added complexity for no
good reason.

> I am hoping to experiment with using a per-CPU pool for clockintrs
> during the next release cycle.  I think keeping all the clockintrs on
> a single page in memory will have cache benefits when reinserting
> clockintrs into the sorted data structure.

I do not understand this wish for micro optimization.

The API has only one dynamic consumer: dt(4).  I tell you I'd prefer to
allocate the structure outside of your API and you tell me you don't
want to.  You want to use pool.  Really I don't understand...  Please let's
design an API for our needs and not based on best practices from outside.

Or am I completely missing something from the clockintr world conquest
plan?

If you want to play with pools, there is plenty of other stuff you can do :o)
I just don't understand.

> > Can we have a different hook for the interval provider?
> 
> I think I have added this now?
> 
> "profile" uses dt_prov_profile_intr().
> 
> "interval" uses dt_prov_interval_intr().
> 
> Did you mean a different kind of "hook"?

That's it and please remove DT_ENTER().  There's no need for the use of
the macro inside dt(4).  I thought I already mentioned it.

> > Since we need only one clockintr and we don't care about the CPU
> > should we pick a random one?  Could that be implemented by passing
> > a NULL "struct cpu_info *" pointer to clockintr_establish()?  So
> > multiple "interval" probes would run on different CPUs...
> 
> It would be simpler to just stick to running the "interval" provider
> on the primary CPU.

Well the pr

Re: Use counters_read(9) from ddb(4)

2023-09-15 Thread Martin Pieuchot
On 11/09/23(Mon) 21:05, Martin Pieuchot wrote:
> On 06/09/23(Wed) 23:13, Alexander Bluhm wrote:
> > On Wed, Sep 06, 2023 at 12:23:33PM -0500, Scott Cheloha wrote:
> > > On Wed, Sep 06, 2023 at 01:04:19PM +0100, Martin Pieuchot wrote:
> > > > Debugging OOM is hard.  UVM uses per-CPU counters and sadly
> > > > counters_read(9) needs to allocate memory.  This is not acceptable in
> > > > ddb(4).  As a result I cannot see the content of UVM counters in OOM
> > > > situations.
> > > > 
> > > > Diff below introduces a *_static() variant of counters_read(9) that
> > > > takes a secondary buffer to avoid calling malloc(9).  Is it fine?  Do
> > > > you have a better idea?  Should we make it the default or using the
> > > > stack might be a problem?
> > > 
> > > Instead of adding a second interface I think we could get away with
> > > just extending counters_read(9) to take a scratch buffer as an optional
> > > fourth parameter:
> > > 
> > > void
> > > counters_read(struct cpumem *cm, uint64_t *output, unsigned int n,
> > > uint64_t *scratch);
> > > 
> > > "scratch"?  "temp"?  "tmp"?
> > 
> > scratch is fine for me
> 
> Fine with me.

Here's a full diff, works for me(tm), ok?

Index: sys/kern/kern_sysctl.c
===
RCS file: /cvs/src/sys/kern/kern_sysctl.c,v
retrieving revision 1.418
diff -u -p -r1.418 kern_sysctl.c
--- sys/kern/kern_sysctl.c  16 Jul 2023 03:01:31 -  1.418
+++ sys/kern/kern_sysctl.c  15 Sep 2023 13:29:53 -
@@ -519,7 +519,7 @@ kern_sysctl(int *name, u_int namelen, vo
unsigned int i;
 
memset(&mbs, 0, sizeof(mbs));
-   counters_read(mbstat, counters, MBSTAT_COUNT);
+   counters_read(mbstat, counters, MBSTAT_COUNT, NULL);
for (i = 0; i < MBSTAT_TYPES; i++)
mbs.m_mtypes[i] = counters[i];
 
Index: sys/kern/subr_evcount.c
===
RCS file: /cvs/src/sys/kern/subr_evcount.c,v
retrieving revision 1.15
diff -u -p -r1.15 subr_evcount.c
--- sys/kern/subr_evcount.c 5 Dec 2022 08:58:49 -   1.15
+++ sys/kern/subr_evcount.c 15 Sep 2023 14:01:55 -
@@ -101,7 +101,7 @@ evcount_sysctl(int *name, u_int namelen,
 {
int error = 0, s, nintr, i;
struct evcount *ec;
-   u_int64_t count;
+   uint64_t count, scratch;
 
if (newp != NULL)
return (EPERM);
@@ -129,7 +129,7 @@ evcount_sysctl(int *name, u_int namelen,
if (ec == NULL)
return (ENOENT);
if (ec->ec_percpu != NULL) {
-   counters_read(ec->ec_percpu, &count, 1);
+   counters_read(ec->ec_percpu, &count, 1, &scratch);
} else {
s = splhigh();
count = ec->ec_count;
Index: sys/kern/subr_percpu.c
===
RCS file: /cvs/src/sys/kern/subr_percpu.c,v
retrieving revision 1.10
diff -u -p -r1.10 subr_percpu.c
--- sys/kern/subr_percpu.c  3 Oct 2022 14:10:53 -   1.10
+++ sys/kern/subr_percpu.c  15 Sep 2023 14:16:41 -
@@ -159,17 +159,19 @@ counters_free(struct cpumem *cm, unsigne
 }
 
 void
-counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
+counters_read(struct cpumem *cm, uint64_t *output, unsigned int n,
+uint64_t *scratch)
 {
struct cpumem_iter cmi;
-   uint64_t *gen, *counters, *temp;
+   uint64_t *gen, *counters, *temp = scratch;
uint64_t enter, leave;
unsigned int i;
 
for (i = 0; i < n; i++)
output[i] = 0;
 
-   temp = mallocarray(n, sizeof(uint64_t), M_TEMP, M_WAITOK);
+   if (scratch == NULL)
+   temp = mallocarray(n, sizeof(uint64_t), M_TEMP, M_WAITOK);
 
gen = cpumem_first(&cmi, cm);
do {
@@ -202,7 +204,8 @@ counters_read(struct cpumem *cm, uint64_
gen = cpumem_next(&cmi, cm);
} while (gen != NULL);
 
-   free(temp, M_TEMP, n * sizeof(uint64_t));
+   if (scratch == NULL)
+   free(temp, M_TEMP, n * sizeof(uint64_t));
 }
 
 void
@@ -305,7 +308,8 @@ counters_free(struct cpumem *cm, unsigne
 }
 
 void
-counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
+counters_read(struct cpumem *cm, uint64_t *output, unsigned int n,
+uint64_t *scratch)
 {
uint64_t *counters;
unsigned int i;
Index: sys/net/pfkeyv2_convert.c
===
RCS file: /cvs/src/sys/net/pfkeyv2_convert.c,v
retrieving revis

Re: Use counters_read(9) from ddb(4)

2023-09-11 Thread Martin Pieuchot
On 06/09/23(Wed) 23:13, Alexander Bluhm wrote:
> On Wed, Sep 06, 2023 at 12:23:33PM -0500, Scott Cheloha wrote:
> > On Wed, Sep 06, 2023 at 01:04:19PM +0100, Martin Pieuchot wrote:
> > > Debugging OOM is hard.  UVM uses per-CPU counters and sadly
> > > counters_read(9) needs to allocate memory.  This is not acceptable in
> > > ddb(4).  As a result I cannot see the content of UVM counters in OOM
> > > situations.
> > > 
> > > Diff below introduces a *_static() variant of counters_read(9) that
> > > takes a secondary buffer to avoid calling malloc(9).  Is it fine?  Do
> > > you have a better idea?  Should we make it the default or using the
> > > stack might be a problem?
> > 
> > Instead of adding a second interface I think we could get away with
> > just extending counters_read(9) to take a scratch buffer as an optional
> > fourth parameter:
> > 
> > void
> > counters_read(struct cpumem *cm, uint64_t *output, unsigned int n,
> > uint64_t *scratch);
> > 
> > "scratch"?  "temp"?  "tmp"?
> 
> scratch is fine for me

Fine with me.



Use counters_read(9) from ddb(4)

2023-09-06 Thread Martin Pieuchot
Debugging OOM is hard.  UVM uses per-CPU counters and sadly
counters_read(9) needs to allocate memory.  This is not acceptable in
ddb(4).  As a result I cannot see the content of UVM counters in OOM
situations.

Diff below introduces a *_static() variant of counters_read(9) that
takes a secondary buffer to avoid calling malloc(9).  Is it fine?  Do
you have a better idea?  Should we make it the default or using the
stack might be a problem?

Thanks,
Martin

Index: kern/subr_percpu.c
===
RCS file: /cvs/src/sys/kern/subr_percpu.c,v
retrieving revision 1.10
diff -u -p -r1.10 subr_percpu.c
--- kern/subr_percpu.c  3 Oct 2022 14:10:53 -   1.10
+++ kern/subr_percpu.c  6 Sep 2023 11:54:31 -
@@ -161,15 +161,25 @@ counters_free(struct cpumem *cm, unsigne
 void
 counters_read(struct cpumem *cm, uint64_t *output, unsigned int n)
 {
-   struct cpumem_iter cmi;
-   uint64_t *gen, *counters, *temp;
-   uint64_t enter, leave;
+   uint64_t *temp;
unsigned int i;
 
for (i = 0; i < n; i++)
output[i] = 0;
 
temp = mallocarray(n, sizeof(uint64_t), M_TEMP, M_WAITOK);
+   counters_read_static(cm, output, n, temp);
+   free(temp, M_TEMP, n * sizeof(uint64_t));
+}
+
+void
+counters_read_static(struct cpumem *cm, uint64_t *output, unsigned int n,
+uint64_t *temp)
+{
+   struct cpumem_iter cmi;
+   uint64_t *gen, *counters;
+   uint64_t enter, leave;
+   unsigned int i;
 
gen = cpumem_first(&cmi, cm);
do {
@@ -201,8 +211,6 @@ counters_read(struct cpumem *cm, uint64_
 
gen = cpumem_next(&cmi, cm);
} while (gen != NULL);
-
-   free(temp, M_TEMP, n * sizeof(uint64_t));
 }
 
 void
Index: sys/percpu.h
===
RCS file: /cvs/src/sys/sys/percpu.h,v
retrieving revision 1.8
diff -u -p -r1.8 percpu.h
--- sys/percpu.h28 Aug 2018 15:15:02 -  1.8
+++ sys/percpu.h6 Sep 2023 11:52:55 -
@@ -114,6 +114,8 @@ struct cpumem   *counters_alloc(unsigned i
 struct cpumem  *counters_alloc_ncpus(struct cpumem *, unsigned int);
 voidcounters_free(struct cpumem *, unsigned int);
 voidcounters_read(struct cpumem *, uint64_t *, unsigned int);
+voidcounters_read_static(struct cpumem *, uint64_t *,
+unsigned int, uint64_t *);
 voidcounters_zero(struct cpumem *, unsigned int);
 
 static inline uint64_t *
Index: uvm/uvm_meter.c
===
RCS file: /cvs/src/sys/uvm/uvm_meter.c,v
retrieving revision 1.49
diff -u -p -r1.49 uvm_meter.c
--- uvm/uvm_meter.c 18 Aug 2023 09:18:52 -  1.49
+++ uvm/uvm_meter.c 6 Sep 2023 11:53:02 -
@@ -249,11 +249,12 @@ uvm_total(struct vmtotal *totalp)
 void
 uvmexp_read(struct uvmexp *uexp)
 {
-   uint64_t counters[exp_ncounters];
+   uint64_t counters[exp_ncounters], temp[exp_ncounters];
 
memcpy(uexp, &uvmexp, sizeof(*uexp));
 
-   counters_read(uvmexp_counters, counters, exp_ncounters);
+   counters_read_static(uvmexp_counters, counters, exp_ncounters,
+   temp);
 
/* stat counters */
uexp->faults = (int)counters[faults];



Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-04 Thread Martin Pieuchot
On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > [...] 
> > The only behavior that needs to be preserved is the output of dumping
> > stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly needs to
> > be adapted with this change.  You can figure that out by looking at the
> > output of /usr/src/share/btrace/kprofile.bt without and with this diff.
> > 
> > Please generate a FlameGraph to make sure they're still the same.
> 
> dt_prov_profile_intr() runs at the same stack depth as hardclock(), so
> indeed they are still the same.

Lovely.

> > Apart from that I'd prefer if we could skip the mechanical change and
> > go straight to what dt(4) needs.  Otherwise we will have to re-design
> > everything.
> 
> I think a mechanical "move the code from point A to point B" patch is
> useful.  It makes the changes easier to follow when tracing the
> revision history in the future.
> 
> If you insist on skipping it, though, I think I can live without it.

I do insist.  It is really hard for me to follow and work with you
because you're too verbose for my capacity.  If you want to work with
me, please take smaller steps and do not mix so many changes into big diffs.  I
have plenty of possible comments but can't deal with huge chunks.

> > The current code assumes the periodic entry points are external to dt(4).
> > This diff moves them in the middle of dt(4) but keeps the existing flow
> > which makes the code very convoluted.
> > 
> > A starting point to understand the added complexity is to see that the
> > DT_ENTER() macro are no longer needed if we move the entry points inside
> > dt(4).
> 
> I did see that.  It seems really easy to remove the macros in a
> subsequent patch, though.
> 
> Again, if you want to do it all in one patch that's OK.

Yes please.

> > The first periodic timeout is dt_prov_interval_enter().  It could be
> > implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
> > approach is that it uses too much code in the kernel which is a problem
> > when instrumenting the kernel itself.  Every subsystem used by dt(4) is
> > impossible to instrument with btrace(8).
> 
> I think you can avoid this instrumentation problem by using clockintr,
> where the callback functions are run from the hardware interrupt
> context, just like hardclock().

Fair enough.

> > The second periodic timeout is dt_prov_profile_enter().  It is similar
> > to the previous one and has to run on every CPU.
> > 
> > Both are currently bound to tick, but we want per-PCB time resolution.
> > We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
> > timeouts fire.
> 
> My current thought is that each PCB could have its own "dp_period" or
> "dp_interval", a count of nanoseconds.

Indeed.  We can have `dp_nsecs' and use that to determine if
clockintr_advance() needs to be called in dt_ioctl_record_start().
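
Roughly something like this in dt_ioctl_record_start() (sketch, field
names assumed):

	if (dp->dp_nsecs != 0)
		clockintr_advance(dp->dp_clockintr, dp->dp_nsecs);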

> > > - Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
> > >   PCB's implementing provider needs a clock interrupt to do its work, it
> > >   stores the handle in dp_clockintr.  The handle, if any, is established
> > >   during dtpv_alloc() and disestablished during dtpv_dealloc().
> > 
> > Sorry, but as I said I don't understand what is a clockintr.  How does it
> > fit in the kernel and how is it supposed to be used?
> 
> The goal of clockintr is to provide a machine-independent API for
> scheduling clock interrupts.  You can use it to implement something
> like hardclock() or statclock().  We are already using it to implement
> these functions, among others.

After reading all the code and the previous manuals, I understand it as
a low-level per-CPU timeout API with nanosecond precision.  Is that it?

> > Why have it per PCB and not per provider or for the whole driver instead?
> > Per-PCB implies that if I run 3 different profiling on a 32 CPU machines
> > I now have 96 different clockintr.  Is it what we want?
> 
> Yes, I think that sounds fine.  If we run into scaling problems we can
> always just change the underlying data structure to an RB tree or a
> minheap.

Fine.

> > If so, why not simply use timeout(9)?  What's the difference?
> 
> Some code needs to run from a hardware interrupt context.  It's going
> to be easier to profile more of the kernel if we're collecting stack
> traces during a clock interrupt.
> 
> Timeouts run at IPL_SOFTCLOCK.  You can profile process context
> code...  that's it.  Timeouts also only run on a single CPU.  Per-CPU
>

Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-04 Thread Martin Pieuchot
On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > On 23/08/23(Wed) 18:52, Scott Cheloha wrote:
> > > This is the next patch in the clock interrupt reorganization series.
> > 
> > Thanks for your diff.  I'm sorry but it is really hard for me to help
> > review this diff because there is still no man page for this API+subsystem.
> > 
> > Can we start with that please?
> 
> Sure, a first draft of a clockintr_establish.9 manpage is included
> below.

Lovely, I'll answer to that first in this email.  Please commit it, so
we can tweak it in tree.  ok mpi@

> We also have a manpage in the tree, clockintr.9.  It is a bit out of
> date, but it covers the broad strokes of how the driver-facing portion
> of the subsystem works.

Currently reading it, should we get rid of `schedhz' in the manpage and
its remnants in the kernel?

Why isn't the statclock() always randomized?  I see a couple of
clocks/archs that do not use CL_RNDSTAT...   Is there any technical
reason apart from testing?

I don't understand what you mean by "Until we understand scheduler
lock contention better".  What does lock contention have to do with
firing hardclock and statclock at the same moment?

> Index: share/man/man9/clockintr_establish.9
> ===
> RCS file: share/man/man9/clockintr_establish.9
> diff -N share/man/man9/clockintr_establish.9
> --- /dev/null 1 Jan 1970 00:00:00 -
> +++ share/man/man9/clockintr_establish.9  26 Aug 2023 01:44:37 -
> @@ -0,0 +1,239 @@
> +.\" $OpenBSD$
> +.\"
> +.\" Copyright (c) 2020-2023 Scott Cheloha 
> +.\"
> +.\" Permission to use, copy, modify, and distribute this software for any
> +.\" purpose with or without fee is hereby granted, provided that the above
> +.\" copyright notice and this permission notice appear in all copies.
> +.\"
> +.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> +.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> +.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> +.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> +.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> +.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> +.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> +.\"
> +.Dd $Mdocdate$
> +.Dt CLOCKINTR_ESTABLISH 9
> +.Os
> +.Sh NAME
> +.Nm clockintr_advance ,
> +.Nm clockintr_cancel ,
> +.Nm clockintr_disestablish ,
> +.Nm clockintr_establish ,
> +.Nm clockintr_expiration ,
> +.Nm clockintr_nsecuptime ,
> +.Nm clockintr_schedule ,
> +.Nm clockintr_stagger
> +.Nd schedule a function for execution from clock interrupt context
> +.Sh SYNOPSIS
> +.In sys/clockintr.h
> +.Ft uint64_t
> +.Fo clockintr_advance
> +.Fa "struct clockintr *cl"
> +.Fa "uint64_t interval"

Can we do s/interval/nsecs/?

> +.Fc
> +.Ft void
> +.Fo clockintr_cancel
> +.Fa "struct clockintr *cl"
> +.Fc
> +.Ft void
> +.Fo clockintr_disestablish
> +.Fa "struct clockintr *cl"
> +.Fc
> +.Ft struct clockintr *
> +.Fo clockintr_establish
> +.Fa "struct clockintr_queue *queue"
> +.Fa "void (*func)(struct clockintr *, void *)"
> +.Fc
> +.Ft uint64_t
> +.Fo clockintr_expiration
> +.Fa "const struct clockintr *cl"
> +.Fc
> +.Ft uint64_t
> +.Fo clockintr_nsecuptime
> +.Fa "const struct clockintr *cl"
> +.Fc
> +.Ft void
> +.Fo clockintr_schedule
> +.Fa "struct clockintr *cl"
> +.Fa "uint64_t abs"
> +.Fc
> +.Ft void
> +.Fo clockintr_stagger
> +.Fa "struct clockintr *cl"
> +.Fa "uint64_t interval"
> +.Fa "u_int numerator"
> +.Fa "u_int denominator"
> +.Fc
> +.Sh DESCRIPTION
> +The clock interrupt subsystem schedules functions for asynchronous execution
> +in the future from a hardware interrupt context.

This is the same description as timeout_set(9) apart from "hardware
interrupt context".  Why should I use this one and not the other one?
I'm confused.  I dislike choices.

> +.Pp
> +The
> +.Fn clockintr_establish
> +function allocates a new clock interrupt object
> +.Po
> +a
> +.Dq clockintr
> +.Pc
> +and binds it to the given clock interrupt
> +.Fa queue .
> +When the clockintr is executed,
> +the callback function
> +.Fa func
> +will be called from a hardware interrupt context on the CPU in control of th

Re: anon & pmap_page_protect

2023-09-01 Thread Martin Pieuchot
On 12/08/23(Sat) 10:43, Martin Pieuchot wrote:
> Since UVM has been imported, we zap mappings associated to anon pages
> before deactivating or freeing them.  Sadly, with the introduction of
> locking for amaps & anons, I added new code paths that do not respect
> this behavior.
> The diff below restores it by moving the call to pmap_page_protect()
> inside uvm_anon_release().  With it the 3 code paths using the function
> are now coherent with the rest of UVM.
> 
> I remember a discussion we had questioning the need for zapping such
> mappings.  I'm interested in hearing more arguments for or against this
> change. However, right now, I'm more concerned about coherency, so I'd
> like to commit the change below before we try something different.

Unless there's an objection, I'll commit this tomorrow in order to
unblock my upcoming UVM changes.

> Index: uvm/uvm_anon.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
> retrieving revision 1.55
> diff -u -p -u -7 -r1.55 uvm_anon.c
> --- uvm/uvm_anon.c11 Apr 2023 00:45:09 -  1.55
> +++ uvm/uvm_anon.c15 May 2023 13:55:28 -
> @@ -251,14 +251,15 @@ uvm_anon_release(struct vm_anon *anon)
>   KASSERT((pg->pg_flags & PG_RELEASED) != 0);
>   KASSERT((pg->pg_flags & PG_BUSY) != 0);
>   KASSERT(pg->uobject == NULL);
>   KASSERT(pg->uanon == anon);
>   KASSERT(anon->an_ref == 0);
>  
>   uvm_lock_pageq();
> + pmap_page_protect(pg, PROT_NONE);
>   uvm_pagefree(pg);
>   uvm_unlock_pageq();
>   KASSERT(anon->an_page == NULL);
>   lock = anon->an_lock;
>   uvm_anfree(anon);
>   rw_exit(lock);
>   /* Note: extra reference is held for PG_RELEASED case. */
> Index: uvm/uvm_fault.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
> retrieving revision 1.133
> diff -u -p -u -7 -r1.133 uvm_fault.c
> --- uvm/uvm_fault.c   4 Nov 2022 09:36:44 -   1.133
> +++ uvm/uvm_fault.c   15 May 2023 13:55:28 -
> @@ -392,15 +392,14 @@ uvmfault_anonget(struct uvm_faultinfo *u
>  
>   /*
>* if we were RELEASED during I/O, then our anon is
>* no longer part of an amap.   we need to free the
>* anon and try again.
>*/
>   if (pg->pg_flags & PG_RELEASED) {
> - pmap_page_protect(pg, PROT_NONE);
>   KASSERT(anon->an_ref == 0);
>   /*
>* Released while we had unlocked amap.
>*/
>   if (locked)
>   uvmfault_unlockall(ufi, NULL, NULL);
>   uvm_anon_release(anon); /* frees page for us */
> 



Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-08-24 Thread Martin Pieuchot
On 23/08/23(Wed) 18:52, Scott Cheloha wrote:
> This is the next patch in the clock interrupt reorganization series.

Thanks for your diff.  I'm sorry but it is really hard for me to help
review this diff because there is still no man page for this API+subsystem.

Can we start with that please?

> This patch moves the entry points for the interval and profile dt(4)
> providers from the hardclock(9) to a dedicated clock interrupt
> callback, dt_prov_profile_intr(), in dev/dt/dt_prov_profile.c.
> 
> - To preserve current behavior, (1) both provider entrypoints have
>   been moved into a single callback function, (2) the interrupt runs at
>   the same frequency as the hardclock, and (3) the interrupt is
>   staggered to co-occur with the hardclock on a given CPU.

The only behavior that needs to be preserved is the output of dumping
stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly needs to
be adapted with this change.  You can figure that out by looking at the
output of /usr/src/share/btrace/kprofile.bt without and with this diff.

Please generate a FlameGraph to make sure they're still the same.

Apart from that I'd prefer if we could skip the mechanical change and
go straight to what dt(4) needs.  Otherwise we will have to re-design
everything.   If you don't want to do this work, then leave it and tell
me what you need and what is your plan so I can help you and do it
myself.

dt(4) needs a way to schedule two different kind of periodic timeouts
with the higher precision possible.  It is currently plugged to hardclock
because there is nothing better.

The current code assumes the periodic entry points are external to dt(4).
This diff moves them in the middle of dt(4) but keeps the existing flow
which makes the code very convoluted. 
A starting point to understand the added complexity is to see that the
DT_ENTER() macro are no longer needed if we move the entry points inside
dt(4).

The first periodic timeout is dt_prov_interval_enter().  It could be
implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
approach is that it uses too much code in the kernel which is a problem
when instrumenting the kernel itself.  Every subsystem used by dt(4) is
impossible to instrument with btrace(8).
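
For clarity, that timeout(9)-based variant would be something like the
following sketch (the `dp_to' and `dp_nsecs' fields and the callback name
are assumptions):

	/* set up once with timeout_set(&dp->dp_to, dt_prov_interval_tick, dp) */
	void
	dt_prov_interval_tick(void *arg)
	{
		struct dt_pcb *dp = arg;

		/* record one event for this PCB here, then rearm */
		timeout_add_nsec(&dp->dp_to, dp->dp_nsecs);
	}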

The second periodic timeout is dt_prov_profile_enter().  It is similar
to the previous one and has to run on every CPU.

Both are currently bound to tick, but we want per-PCB time resolution.
We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
timeouts fire.

> - Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
>   PCB's implementing provider needs a clock interrupt to do its work, it
>   stores the handle in dp_clockintr.  The handle, if any, is established
>   during dtpv_alloc() and disestablished during dtpv_dealloc().

Sorry, but as I said I don't understand what is a clockintr.  How does it
fit in the kernel and how is it supposed to be used?

Why have it per PCB and not per provider or for the whole driver instead?
Per-PCB implies that if I run 3 different profiling on a 32 CPU machines
I now have 96 different clockintr.  Is it what we want?  If so, why not
simply use timeout(9)?  What's the difference?

>   One alternative is to start running the clock interrupts when they
>   are allocated in dtpv_alloc() and stop them when they are freed in
>   dtpv_dealloc().  This is wasteful, though: the PCBs are not recording
>   yet, so the interrupts won't perform any useful work until the
>   controlling process enables recording.
> 
>   An additional pair of provider hooks, e.g. "dtpv_record_start" and
>   "dtpv_record_stop", might resolve this.

Another alternative would be to have a single hardclock-like handler for
dt(4).  Sorry, I don't understand the big picture behind clockintr, so I
can't help.  

> - We haven't needed to destroy clock interrupts yet, so the
>   clockintr_disestablish() function used in this patch is new.

So maybe that's fishy.  Are they supposed to be destroyed?  What is the
goal of this API?



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Martin Pieuchot
On 12/08/23(Sat) 11:48, Visa Hankala wrote:
> On Sat, Aug 12, 2023 at 01:29:10PM +0200, Martin Pieuchot wrote:
> > On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> > > On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > > > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > > > removed from the scheduler before smr_flush() is called.  So there's no
> > > > need for the SMR thread to peg itself to such CPUs.  This currently
> > > > isn't a problem because we use per-CPU runqueues but it doesn't work
> > > > with a global one.  So the diff below skip halted CPUs.  It should also
> > > > speed up rebooting/halting on machine with a huge number of CPUs.
> > > 
> > > Because SPCF_HALTED does not (?) imply that the CPU has stopped
> > > processing interrupts, this skipping is not safe as is. Interrupt
> > > handlers might access SMR-protected data.
> > 
> > Interesting.  This is worse than I expected.  It seems we completely
> > forgot about suspend/resume and rebooting when we started pinning
> > interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
> > was enough to ensure no more code would be executed on secondary CPUs,
> > no?  Wouldn't it be better to remap interrupts to the primary CPU in
> > those cases?  Is it easily doable? 
> 
> I think device interrupt stopping already happens through
> config_suspend_all().

Indeed.  I'm a bit puzzled about the order of operations though.  In the
case of reboot/halt, the existing order of operations is:

sched_stop_secondary_cpus()      <--- remove secondary CPUs from the scheduler

vfs_shutdown()
if_downall()
uvm_swap_finicrypt_all()         <--- happens on a single CPU but with
                                      interrupts possibly on secondary CPUs

smr_flush()                      <--- tells the SMR thread to execute itself
                                      on all CPUs even if they are out of
                                      the scheduler

config_suspend_all()             <--- stop interrupts from firing

x86_broadcast_ipi(X86_IPI_HALT)  <--- stop secondary CPUs (on x86)


So do we want to keep the existing requirement of being able to execute
a thread on a CPU that has been removed from the scheduler?  That is
what smr_flush() currently needs.  I find it surprising but I can add
that as a requirement for the upcoming scheduler.  I don't know if other
options are possible or even attractive.



Re: smr_grace_wait(): Skip halted CPUs

2023-08-12 Thread Martin Pieuchot
On 12/08/23(Sat) 10:57, Visa Hankala wrote:
> On Fri, Aug 11, 2023 at 09:52:15PM +0200, Martin Pieuchot wrote:
> > When stopping a machine, with "halt -p" for example, secondary CPUs are
> > removed from the scheduler before smr_flush() is called.  So there's no
> > need for the SMR thread to peg itself to such CPUs.  This currently
> > isn't a problem because we use per-CPU runqueues but it doesn't work
> > with a global one.  So the diff below skips halted CPUs.  It should also
> > speed up rebooting/halting on machines with a huge number of CPUs.
> 
> Because SPCF_HALTED does not (?) imply that the CPU has stopped
> processing interrupts, this skipping is not safe as is. Interrupt
> handlers might access SMR-protected data.

Interesting.  This is worse than I expected.  It seems we completely
forgot about suspend/resume and rebooting when we started pinning
interrupts on secondary CPUs, no?  Previously sched_stop_secondary_cpus()
was enough to ensure no more code would be executed on secondary CPUs,
no?  Wouldn't it be better to remap interrupts to the primary CPU in
those cases?  Is it easily doable? 

> One possible solution is to spin. When smr_grace_wait() sees
> SPCF_HALTED, it should probably call cpu_unidle(ci) and spin until
> condition READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp becomes true.
> However, for this to work, sched_idle() needs to invoke smr_idle().
> Here is a potential problem since the cpu_idle_{enter,cycle,leave}()
> logic is not consistent between architectures.

We're trying to move away from per-CPU runqueues.  That's how I found
this issue.  I don't see how this possible solution could work with a
global runqueue.  Do you?
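
For reference, a minimal sketch of the spin-based alternative described
above (it also assumes sched_idle() grows an smr_idle() call, as you
note) would be something like:

	if (ci->ci_schedstate.spc_schedflags & SPCF_HALTED) {
		cpu_unidle(ci);
		while (READ_ONCE(ci->ci_schedstate.spc_smrgp) != smrgp)
			CPU_BUSY_CYCLE();
		continue;
	}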

> I think the intent in sched_idle() was that cpu_idle_enter() should
> block interrupts so that sched_idle() could check without races if
> the CPU can sleep. Now, on some architectures cpu_idle_enter() is
> a no-op. These architectures have to check the idle state in their
> cpu_idle_cycle() function before pausing the CPU.
> 
> To avoid touching architecture-specific code, cpu_is_idle() could
> be redefined to
> 
> ((ci)->ci_schedstate.spc_whichqs == 0 &&
>  (ci)->ci_schedstate.spc_smrgp == READ_ONCE(smr_grace_period))
> 
> Then the loop conditions
> 
>   while (!cpu_is_idle(curcpu())) {
> 
> and
> 
>   while (spc->spc_whichqs == 0) {
> 
> in sched_idle() would have to be changed to
> 
>   while (spc->spc_whichqs != 0) {
> 
> and
> 
>   while (cpu_is_idle(ci)) {
> 
> 
> :(
> 
> > Index: kern/kern_smr.c
> > ===
> > RCS file: /cvs/src/sys/kern/kern_smr.c,v
> > retrieving revision 1.16
> > diff -u -p -r1.16 kern_smr.c
> > --- kern/kern_smr.c 14 Aug 2022 01:58:27 -  1.16
> > +++ kern/kern_smr.c 11 Aug 2023 19:43:54 -
> > @@ -158,6 +158,8 @@ smr_grace_wait(void)
> > CPU_INFO_FOREACH(cii, ci) {
> > if (!CPU_IS_RUNNING(ci))
> > continue;
> > +   if (ci->ci_schedstate.spc_schedflags & SPCF_HALTED)
> > +   continue;
> > if (READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp)
> > continue;
> > sched_peg_curproc(ci);
> > 



anon & pmap_page_protect

2023-08-12 Thread Martin Pieuchot
Since UVM has been imported, we zap mappings associated to anon pages
before deactivating or freeing them.  Sadly, with the introduction of
locking for amaps & anons, I added new code paths that do not respect
this behavior.
The diff below restores it by moving the call to pmap_page_protect()
inside uvm_anon_release().  With it the 3 code paths using the function
are now coherent with the rest of UVM.

I remember a discussion we had questioning the need for zapping such
mappings.  I'm interested in hearing more arguments for or against this
change. However, right now, I'm more concerned about coherency, so I'd
like to commit the change below before we try something different.

ok?

Index: uvm/uvm_anon.c
===
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
retrieving revision 1.55
diff -u -p -u -7 -r1.55 uvm_anon.c
--- uvm/uvm_anon.c  11 Apr 2023 00:45:09 -  1.55
+++ uvm/uvm_anon.c  15 May 2023 13:55:28 -
@@ -251,14 +251,15 @@ uvm_anon_release(struct vm_anon *anon)
KASSERT((pg->pg_flags & PG_RELEASED) != 0);
KASSERT((pg->pg_flags & PG_BUSY) != 0);
KASSERT(pg->uobject == NULL);
KASSERT(pg->uanon == anon);
KASSERT(anon->an_ref == 0);
 
uvm_lock_pageq();
+   pmap_page_protect(pg, PROT_NONE);
uvm_pagefree(pg);
uvm_unlock_pageq();
KASSERT(anon->an_page == NULL);
lock = anon->an_lock;
uvm_anfree(anon);
rw_exit(lock);
/* Note: extra reference is held for PG_RELEASED case. */
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.133
diff -u -p -u -7 -r1.133 uvm_fault.c
--- uvm/uvm_fault.c 4 Nov 2022 09:36:44 -   1.133
+++ uvm/uvm_fault.c 15 May 2023 13:55:28 -
@@ -392,15 +392,14 @@ uvmfault_anonget(struct uvm_faultinfo *u
 
/*
 * if we were RELEASED during I/O, then our anon is
 * no longer part of an amap.   we need to free the
 * anon and try again.
 */
if (pg->pg_flags & PG_RELEASED) {
-   pmap_page_protect(pg, PROT_NONE);
KASSERT(anon->an_ref == 0);
/*
 * Released while we had unlocked amap.
 */
if (locked)
uvmfault_unlockall(ufi, NULL, NULL);
uvm_anon_release(anon); /* frees page for us */



smr_grace_wait(): Skip halted CPUs

2023-08-11 Thread Martin Pieuchot
When stopping a machine, with "halt -p" for example, secondary CPUs are
removed from the scheduler before smr_flush() is called.  So there's no
need for the SMR thread to peg itself to such CPUs.  This currently
isn't a problem because we use per-CPU runqueues but it doesn't work
with a global one.  So the diff below skips halted CPUs.  It should also
speed up rebooting/halting on machines with a huge number of CPUs.

ok?

Index: kern/kern_smr.c
===
RCS file: /cvs/src/sys/kern/kern_smr.c,v
retrieving revision 1.16
diff -u -p -r1.16 kern_smr.c
--- kern/kern_smr.c 14 Aug 2022 01:58:27 -  1.16
+++ kern/kern_smr.c 11 Aug 2023 19:43:54 -
@@ -158,6 +158,8 @@ smr_grace_wait(void)
CPU_INFO_FOREACH(cii, ci) {
if (!CPU_IS_RUNNING(ci))
continue;
+   if (ci->ci_schedstate.spc_schedflags & SPCF_HALTED)
+   continue;
if (READ_ONCE(ci->ci_schedstate.spc_smrgp) == smrgp)
continue;
sched_peg_curproc(ci);



Re: uvm_pagelookup(): moar sanity checks

2023-08-11 Thread Martin Pieuchot
On 11/08/23(Fri) 20:41, Mark Kettenis wrote:
> > Date: Fri, 11 Aug 2023 20:12:19 +0200
> > From: Martin Pieuchot 
> > 
> > Here's a simple diff to add some more sanity checks in uvm_pagelookup().
> > 
> > Nothing fancy, it helps document the flags and reduces the difference
> > with NetBSD.  This is part of my on-going work on UVM.
> > 
> > ok?
> 
> NetBSD really has that extra blank line after the return?

No.  It's my mistake.
 
> > Index: uvm/uvm_page.c
> > ===
> > RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> > retrieving revision 1.172
> > diff -u -p -r1.172 uvm_page.c
> > --- uvm/uvm_page.c  13 May 2023 09:24:59 -  1.172
> > +++ uvm/uvm_page.c  11 Aug 2023 17:55:43 -
> > @@ -1219,10 +1219,16 @@ struct vm_page *
> >  uvm_pagelookup(struct uvm_object *obj, voff_t off)
> >  {
> > /* XXX if stack is too much, handroll */
> > -   struct vm_page pg;
> > +   struct vm_page p, *pg;
> > +
> > +   p.offset = off;
> > +   pg = RBT_FIND(uvm_objtree, &obj->memt, &p);
> > +
> > +   KASSERT(pg == NULL || obj->uo_npages != 0);
> > +   KASSERT(pg == NULL || (pg->pg_flags & PG_RELEASED) == 0 ||
> > +   (pg->pg_flags & PG_BUSY) != 0);
> > +   return (pg);
> >  
> > -   pg.offset = off;
> > -   return RBT_FIND(uvm_objtree, &obj->memt, &pg);
> >  }
> >  
> >  /*
> > 
> > 



uvm_pagelookup(): moar sanity checks

2023-08-11 Thread Martin Pieuchot
Here's a simple diff to add some more sanity checks in uvm_pagelookup().

Nothing fancy, it helps document the flags and reduces the difference
with NetBSD.  This is part of my on-going work on UVM.

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.172
diff -u -p -r1.172 uvm_page.c
--- uvm/uvm_page.c  13 May 2023 09:24:59 -  1.172
+++ uvm/uvm_page.c  11 Aug 2023 17:55:43 -
@@ -1219,10 +1219,16 @@ struct vm_page *
 uvm_pagelookup(struct uvm_object *obj, voff_t off)
 {
/* XXX if stack is too much, handroll */
-   struct vm_page pg;
+   struct vm_page p, *pg;
+
+   p.offset = off;
+   pg = RBT_FIND(uvm_objtree, &obj->memt, &p);
+
+   KASSERT(pg == NULL || obj->uo_npages != 0);
+   KASSERT(pg == NULL || (pg->pg_flags & PG_RELEASED) == 0 ||
+   (pg->pg_flags & PG_BUSY) != 0);
+   return (pg);
 
-   pg.offset = off;
-   return RBT_FIND(uvm_objtree, &obj->memt, &pg);
 }
 
 /*



Re: hardclock(9), roundrobin: make roundrobin() an independent clock interrupt

2023-08-10 Thread Martin Pieuchot
On 10/08/23(Thu) 12:18, Scott Cheloha wrote:
> On Thu, Aug 10, 2023 at 01:05:27PM +0200, Martin Pieuchot wrote:
> [...] 
> > Can we get rid of `hardclock_period' and use a variable set to 100ms?
> > This should be tested on alpha which has a hz of 1024 but I'd argue this
> > is an improvement.
> 
> Sure, that's cleaner.  The updated patch below adds a new
> "roundrobin_period" variable initialized during clockintr_init().

I'd rather see this variable initialized in sched_bsd.c to 100ms without
depending on `hz'.  Is it possible?  My point is to untangle this completely
from `hz'.
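
i.e. something along these lines in sched_bsd.c (sketch):

	uint32_t roundrobin_period = 100000000;	/* [I] 100ms in nsecs, independent of hz */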

> Index: kern/sched_bsd.c
> ===
> RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.79
> diff -u -p -r1.79 sched_bsd.c
> --- kern/sched_bsd.c  5 Aug 2023 20:07:55 -   1.79
> +++ kern/sched_bsd.c  10 Aug 2023 17:15:53 -
> @@ -54,9 +54,8 @@
>  #include 
>  #endif
>  
> -
> +uint32_t roundrobin_period;  /* [I] roundrobin period (ns) */
>  int  lbolt;  /* once a second sleep address */
> -int  rrticks_init;   /* # of hardclock ticks per roundrobin() */
>  
>  #ifdef MULTIPROCESSOR
>  struct __mp_lock sched_lock;
> @@ -69,21 +68,23 @@ uint32_t  decay_aftersleep(uint32_t, uin
>   * Force switch among equal priority processes every 100ms.
>   */
>  void
> -roundrobin(struct cpu_info *ci)
> +roundrobin(struct clockintr *cl, void *cf)
>  {
> + struct cpu_info *ci = curcpu();
>   struct schedstate_percpu *spc = &ci->ci_schedstate;
> + uint64_t count;
>  
> - spc->spc_rrticks = rrticks_init;
> + count = clockintr_advance(cl, roundrobin_period);
>  
>   if (ci->ci_curproc != NULL) {
> - if (spc->spc_schedflags & SPCF_SEENRR) {
> + if (spc->spc_schedflags & SPCF_SEENRR || count >= 2) {
>   /*
>* The process has already been through a roundrobin
>* without switching and may be hogging the CPU.
>* Indicate that the process should yield.
>*/
>   atomic_setbits_int(&spc->spc_schedflags,
> - SPCF_SHOULDYIELD);
> + SPCF_SEENRR | SPCF_SHOULDYIELD);
>   } else {
>   atomic_setbits_int(&spc->spc_schedflags,
>   SPCF_SEENRR);
> @@ -695,8 +696,6 @@ scheduler_start(void)
>* its job.
>*/
>   timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);
> -
> - rrticks_init = hz / 10;
>   schedcpu(&schedcpu_to);
>  
>  #ifndef SMALL_KERNEL
> Index: kern/kern_sched.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> retrieving revision 1.84
> diff -u -p -r1.84 kern_sched.c
> --- kern/kern_sched.c 5 Aug 2023 20:07:55 -   1.84
> +++ kern/kern_sched.c 10 Aug 2023 17:15:53 -
> @@ -102,6 +102,12 @@ sched_init_cpu(struct cpu_info *ci)
>   if (spc->spc_profclock == NULL)
>   panic("%s: clockintr_establish profclock", __func__);
>   }
> + if (spc->spc_roundrobin == NULL) {
> + spc->spc_roundrobin = clockintr_establish(&ci->ci_queue,
> + roundrobin);
> + if (spc->spc_roundrobin == NULL)
> + panic("%s: clockintr_establish roundrobin", __func__);
> + }
>  
>   kthread_create_deferred(sched_kthreads_create, ci);
>  
> Index: kern/kern_clockintr.c
> ===
> RCS file: /cvs/src/sys/kern/kern_clockintr.c,v
> retrieving revision 1.30
> diff -u -p -r1.30 kern_clockintr.c
> --- kern/kern_clockintr.c 5 Aug 2023 20:07:55 -   1.30
> +++ kern/kern_clockintr.c 10 Aug 2023 17:15:53 -
> @@ -69,6 +69,7 @@ clockintr_init(u_int flags)
>  
>   KASSERT(hz > 0 && hz <= 10);
>   hardclock_period = 10 / hz;
> + roundrobin_period = hardclock_period * 10;
>  
>   KASSERT(stathz >= 1 && stathz <= 10);
>  
> @@ -204,6 +205,11 @@ clockintr_cpu_init(const struct intrcloc
>   clockintr_stagger(spc->spc_profclock, profclock_period,
>   multiplier, MAXCPUS);
>   }
> + if (spc->spc_roundrobin->cl_expiration == 0) {
> + clockintr_stagger(spc->spc_roundrobin, hardclock_period,
> + multiplier, MAXCPUS);
> + }
> 

Re: hardclock(9), roundrobin: make roundrobin() an independent clock interrupt

2023-08-10 Thread Martin Pieuchot
On 05/08/23(Sat) 17:17, Scott Cheloha wrote:
> This is the next piece of the clock interrupt reorganization patch
> series.

The round robin logic is here to make sure a process doesn't hog a CPU.
The period to tell a process it should yield doesn't have to be tied
to the hardclock period.  We want to be sure a process doesn't run more
than 100ms at a time.

Is the priority of this new clock interrupt the same as the hardclock's?

I don't understand what clockintr_advance() is doing.  Maybe you could
write a manual page for it?  I'm afraid we could wait 200ms now?  Or what
does a `count' of 2 mean?

Same question for clockintr_stagger().

Can we get rid of `hardclock_period' and use a variable set to 100ms?
This should be tested on alpha which has a hz of 1024 but I'd argue this
is an improvement.

> This patch removes the roundrobin() call from hardclock() and makes
> roundrobin() an independent clock interrupt.
> 
> - Revise roundrobin() to make it a valid clock interrupt callback.
>   It remains periodic.  It still runs at one tenth of the hardclock
>   frequency.
> 
> - Account for multiple expirations in roundrobin().  If two or more
>   intervals have elapsed we set SPCF_SHOULDYIELD immediately.
> 
>   This preserves existing behavior: hardclock() is called multiple
>   times during clockintr_hardclock() if clock interrupts are blocked
>   for long enough.
> 
> - Each schedstate_percpu has its own roundrobin() handle, spc_roundrobin.
>   spc_roundrobin is established during sched_init_cpu(), staggered during
>   the first clockintr_cpu_init() call, and advanced during 
> clockintr_cpu_init().
>   Expirations during suspend/resume are discarded.
> 
> - spc_rrticks and rrticks_init are now useless.  Delete them.
> 
> ok?
> 
> Also, yes, I see the growing pile of scheduler-controlled clock
> interrupt handles.  My current plan is to move the setup code at the
> end of clockintr_cpu_init() to a different routine, maybe something
> like "sched_start_cpu()".  On the primary CPU you'd call it immediately
> after cpu_initclocks().  On secondary CPUs you'd call it at the end of
> cpu_hatch() just before cpu_switchto().
> 
> In any case, we will need to find a home for that code someplace.  It
> can't stay in clockintr_cpu_init() forever.
> 
> Index: kern/sched_bsd.c
> ===
> RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> retrieving revision 1.79
> diff -u -p -r1.79 sched_bsd.c
> --- kern/sched_bsd.c  5 Aug 2023 20:07:55 -   1.79
> +++ kern/sched_bsd.c  5 Aug 2023 22:15:25 -
> @@ -56,7 +56,6 @@
>  
>  
>  int  lbolt;  /* once a second sleep address */
> -int  rrticks_init;   /* # of hardclock ticks per roundrobin() */
>  
>  #ifdef MULTIPROCESSOR
>  struct __mp_lock sched_lock;
> @@ -69,21 +68,23 @@ uint32_t  decay_aftersleep(uint32_t, uin
>   * Force switch among equal priority processes every 100ms.
>   */
>  void
> -roundrobin(struct cpu_info *ci)
> +roundrobin(struct clockintr *cl, void *cf)
>  {
> + struct cpu_info *ci = curcpu();
>   struct schedstate_percpu *spc = &ci->ci_schedstate;
> + uint64_t count;
>  
> - spc->spc_rrticks = rrticks_init;
> + count = clockintr_advance(cl, hardclock_period * 10);
>  
>   if (ci->ci_curproc != NULL) {
> - if (spc->spc_schedflags & SPCF_SEENRR) {
> + if (spc->spc_schedflags & SPCF_SEENRR || count >= 2) {
>   /*
>* The process has already been through a roundrobin
>* without switching and may be hogging the CPU.
>* Indicate that the process should yield.
>*/
>   atomic_setbits_int(&spc->spc_schedflags,
> - SPCF_SHOULDYIELD);
> + SPCF_SEENRR | SPCF_SHOULDYIELD);
>   } else {
>   atomic_setbits_int(&spc->spc_schedflags,
>   SPCF_SEENRR);
> @@ -695,8 +696,6 @@ scheduler_start(void)
>* its job.
>*/
>   timeout_set(&schedcpu_to, schedcpu, &schedcpu_to);
> -
> - rrticks_init = hz / 10;
>   schedcpu(&schedcpu_to);
>  
>  #ifndef SMALL_KERNEL
> Index: kern/kern_sched.c
> ===
> RCS file: /cvs/src/sys/kern/kern_sched.c,v
> retrieving revision 1.84
> diff -u -p -r1.84 kern_sched.c
> --- kern/kern_sched.c 5 Aug 2023 20:07:55 -   1.84
> +++ kern/kern_sched.c 5 Aug 2023 22:15:25 -
> @@ -102,6 +102,12 @@ sched_init_cpu(struct cpu_info *ci)
>   if (spc->spc_profclock == NULL)
>   panic("%s: clockintr_establish profclock", __func__);
>   }
> + if (spc->spc_roundrobin == NULL) {
> + spc->spc_roundrobin = clockintr_establish(&ci->ci_queue,
> + roundrobin);
> + if (spc->spc_roundrobin == NULL)
> + panic("%s: clockintr_es

Re: [v2]: uvm_meter, schedcpu: make uvm_meter() an independent timeout

2023-08-03 Thread Martin Pieuchot
On 02/08/23(Wed) 18:27, Claudio Jeker wrote:
> On Wed, Aug 02, 2023 at 10:15:20AM -0500, Scott Cheloha wrote:
> > Now that the proc0 wakeup(9) is gone we can retry the other part of
> > the uvm_meter() patch.
> > 
> > uvm_meter() is meant to run every 5 seconds, but for historical
> > reasons it is called from schedcpu() and it is scheduled against the
> > UTC clock.  schedcpu() and uvm_meter() have different periods, so
> > uvm_meter() ought to be a separate timeout.  uvm_meter() is started
> > alongside schedcpu() so the two will still run in sync.
> > 
> > v1: https://marc.info/?l=openbsd-tech&m=168710929409153&w=2
> > 
> > ok?
> 
> I would prefer if uvm_meter is killed and the load calculation moved to the
> scheduler.

Me too.

> > Index: sys/uvm/uvm_meter.c
> > ===
> > RCS file: /cvs/src/sys/uvm/uvm_meter.c,v
> > retrieving revision 1.46
> > diff -u -p -r1.46 uvm_meter.c
> > --- sys/uvm/uvm_meter.c 2 Aug 2023 13:54:45 -   1.46
> > +++ sys/uvm/uvm_meter.c 2 Aug 2023 15:13:49 -
> > @@ -85,10 +85,12 @@ void uvmexp_read(struct uvmexp *);
> >   * uvm_meter: calculate load average
> >   */
> >  void
> > -uvm_meter(void)
> > +uvm_meter(void *unused)
> >  {
> > -   if ((gettime() % 5) == 0)
> > -   uvm_loadav(&averunnable);
> > +   static struct timeout to = TIMEOUT_INITIALIZER(uvm_meter, NULL);
> > +
> > +   timeout_add_sec(&to, 5);
> > +   uvm_loadav(&averunnable);
> >  }
> >  
> >  /*
> > Index: sys/uvm/uvm_extern.h
> > ===
> > RCS file: /cvs/src/sys/uvm/uvm_extern.h,v
> > retrieving revision 1.170
> > diff -u -p -r1.170 uvm_extern.h
> > --- sys/uvm/uvm_extern.h21 Jun 2023 21:16:21 -  1.170
> > +++ sys/uvm/uvm_extern.h2 Aug 2023 15:13:49 -
> > @@ -414,7 +414,7 @@ voiduvmspace_free(struct vmspace *);
> >  struct vmspace *uvmspace_share(struct process *);
> >  intuvm_share(vm_map_t, vaddr_t, vm_prot_t,
> > vm_map_t, vaddr_t, vsize_t);
> > -void   uvm_meter(void);
> > +void   uvm_meter(void *);
> >  intuvm_sysctl(int *, u_int, void *, size_t *, 
> > void *, size_t, struct proc *);
> >  struct vm_page *uvm_pagealloc(struct uvm_object *,
> > Index: sys/kern/sched_bsd.c
> > ===
> > RCS file: /cvs/src/sys/kern/sched_bsd.c,v
> > retrieving revision 1.78
> > diff -u -p -r1.78 sched_bsd.c
> > --- sys/kern/sched_bsd.c25 Jul 2023 18:16:19 -  1.78
> > +++ sys/kern/sched_bsd.c2 Aug 2023 15:13:50 -
> > @@ -235,7 +235,6 @@ schedcpu(void *arg)
> > }
> > SCHED_UNLOCK(s);
> > }
> > -   uvm_meter();
> > wakeup(&lbolt);
> > timeout_add_sec(to, 1);
> >  }
> > @@ -688,6 +687,7 @@ scheduler_start(void)
> >  
> > rrticks_init = hz / 10;
> > schedcpu(&schedcpu_to);
> > +   uvm_meter(NULL);
> >  
> >  #ifndef SMALL_KERNEL
> > if (perfpolicy == PERFPOL_AUTO)
> > Index: share/man/man9/uvm_init.9
> > ===
> > RCS file: /cvs/src/share/man/man9/uvm_init.9,v
> > retrieving revision 1.7
> > diff -u -p -r1.7 uvm_init.9
> > --- share/man/man9/uvm_init.9   21 Jun 2023 21:16:21 -  1.7
> > +++ share/man/man9/uvm_init.9   2 Aug 2023 15:13:50 -
> > @@ -168,7 +168,7 @@ argument is ignored.
> >  .Ft void
> >  .Fn uvm_kernacc "caddr_t addr" "size_t len" "int rw"
> >  .Ft void
> > -.Fn uvm_meter
> > +.Fn uvm_meter "void *arg"
> >  .Ft int
> >  .Fn uvm_sysctl "int *name" "u_int namelen" "void *oldp" "size_t *oldlenp" 
> > "void *newp " "size_t newlen" "struct proc *p"
> >  .Ft int
> > @@ -212,7 +212,7 @@ access, in the kernel address space.
> >  .Pp
> >  The
> >  .Fn uvm_meter
> > -function calculates the load average and wakes up the swapper if necessary.
> > +timeout updates system load averages every five seconds.
> >  .Pp
> >  The
> >  .Fn uvm_sysctl
> 
> -- 
> :wq Claudio
> 



Re: uvm_loadav: don't recompute schedstate_percpu.spc_nrun

2023-08-03 Thread Martin Pieuchot
On 02/08/23(Wed) 14:22, Claudio Jeker wrote:
> On Mon, Jul 31, 2023 at 10:21:11AM -0500, Scott Cheloha wrote:
> > On Fri, Jul 28, 2023 at 07:36:41PM -0500, Scott Cheloha wrote:
> > > claudio@ notes that uvm_loadav() pointlessly walks the allproc list to
> > > recompute schedstate_percpu.spc_nrun for each CPU.
> > > 
> > > We can just use the value instead of recomputing it.
> > 
> > Whoops, off-by-one.  The current load averaging code includes the
> > running thread in the nrun count if it is *not* the idle thread.
> 
> Yes, with this the loadavg seems to be consistent and following the number
> of running processes. The code seems to behave like before (with all its
> quirks).
> 
> OK claudio@, this is a good first step. Now I think this code should later
> be moved into kern_sched.c or sched_bsd.c and removed from uvm. Not sure why
> the load calculation is part of memory management...
> 
> On top of this I wonder about the per-CPU load calculation. In my opinion
> it is wrong to skip the calculation if the CPU is idle. Because of this
> there is no decay for idle CPUs and that feels wrong to me.
> Do we have a userland utility that reports spc_ldavg?

I don't understand why the SCHED_LOCK() is needed.  Since I'm really
against adding new uses for it, could you comment on that?

> > Index: uvm_meter.c
> > ===
> > RCS file: /cvs/src/sys/uvm/uvm_meter.c,v
> > retrieving revision 1.44
> > diff -u -p -r1.44 uvm_meter.c
> > --- uvm_meter.c 21 Jun 2023 21:16:21 -  1.44
> > +++ uvm_meter.c 31 Jul 2023 15:20:37 -
> > @@ -102,43 +102,29 @@ uvm_loadav(struct loadavg *avg)
> >  {
> > CPU_INFO_ITERATOR cii;
> > struct cpu_info *ci;
> > -   int i, nrun;
> > -   struct proc *p;
> > -   int nrun_cpu[MAXCPUS];
> > +   struct schedstate_percpu *spc;
> > +   u_int i, nrun = 0, nrun_cpu;
> > +   int s;
> >  
> > -   nrun = 0;
> > -   memset(nrun_cpu, 0, sizeof(nrun_cpu));
> >  
> > -   LIST_FOREACH(p, &allproc, p_list) {
> > -   switch (p->p_stat) {
> > -   case SSTOP:
> > -   case SSLEEP:
> > -   break;
> > -   case SRUN:
> > -   case SONPROC:
> > -   if (p == p->p_cpu->ci_schedstate.spc_idleproc)
> > -   continue;
> > -   /* FALLTHROUGH */
> > -   case SIDL:
> > -   nrun++;
> > -   if (p->p_cpu)
> > -   nrun_cpu[CPU_INFO_UNIT(p->p_cpu)]++;
> > -   }
> > +   SCHED_LOCK(s);
> > +   CPU_INFO_FOREACH(cii, ci) {
> > +   spc = &ci->ci_schedstate;
> > +   nrun_cpu = spc->spc_nrun;
> > +   if (ci->ci_curproc != spc->spc_idleproc)
> > +   nrun_cpu++;
> > +   if (nrun_cpu == 0)
> > +   continue;
> > +   spc->spc_ldavg = (cexp[0] * spc->spc_ldavg +
> > +   nrun_cpu * FSCALE *
> > +   (FSCALE - cexp[0])) >> FSHIFT;
> > +   nrun += nrun_cpu;
> > }
> > +   SCHED_UNLOCK(s);
> >  
> > for (i = 0; i < 3; i++) {
> > avg->ldavg[i] = (cexp[i] * avg->ldavg[i] +
> > nrun * FSCALE * (FSCALE - cexp[i])) >> FSHIFT;
> > -   }
> > -
> > -   CPU_INFO_FOREACH(cii, ci) {
> > -   struct schedstate_percpu *spc = &ci->ci_schedstate;
> > -
> > -   if (nrun_cpu[CPU_INFO_UNIT(ci)] == 0)
> > -   continue;
> > -   spc->spc_ldavg = (cexp[0] * spc->spc_ldavg +
> > -   nrun_cpu[CPU_INFO_UNIT(ci)] * FSCALE *
> > -   (FSCALE - cexp[0])) >> FSHIFT;
> > }
> >  }
> >  
> 
> -- 
> :wq Claudio
> 



Re: Add exit status to route.8

2023-08-02 Thread Matthew Martin
On Wed, Aug 02, 2023 at 06:36:26PM -0400, A Tammy wrote:
> Not a huge fan of this complicated representation.
> > +.Ar command
> > +was invoked but failed with this exit status;
> > +see its manual page for more information.
> > +.It 126
> > +.Ar command
> > +was found but could not be invoked, or it was invoked but failed
> > +with exit status 126.
> > +.It 127
> > +.Ar command
> > +could not be found, or it was invoked but failed with exit status 127.
> > +.El
> A lot of repetition of ' but failed with exit status
> 1/126/127' maybe condense them into something like: route exits with the
> exit status of the invoked command on successful execution or with the
> following special exit codes - then the 1/126/127 thing.
> >  .Sh EXAMPLES
> >  Show the current IPv4 routing tables,
> >  without attempting to print hostnames symbolically:

I agree; was just trying to match the existing docs. How is the below?

diff --git route.8 route.8
index 887446c1420..5a7b0355520 100644
--- route.8
+++ route.8
@@ -281,7 +281,8 @@ and/or a gateway.
 .Op Fl T Ar rtable
 .Tg
 .Cm exec
-.Op Ar command ...
+.Ar command
+.Op Ar arg ...
 .Xc
 Execute a command, forcing the process and its children to use the
 routing table and appropriate routing domain as specified with the
@@ -514,6 +515,35 @@ host and network name database
 .It Pa /etc/mygate
 default gateway address
 .El
+.Sh EXIT STATUS
+For commands other than
+.Cm exec ,
+the
+.Nm
+utility exits 0 on success, and >0 if an error occurs.
+.Pp
+For the
+.Cm exec
+command the
+.Nm
+utility exits with the exit status of
+.Ar command
+if it could be invoked.
+Otherwise the
+.Nm
+utility exits with one of the following values:
+.Bl -tag -width Ds
+.It 1
+An invalid command line option was passed to
+.Nm
+or setting the routing table failed.
+.It 126
+.Ar command
+was found but could not be invoked.
+.It 127
+.Ar command
+could not be found.
+.El
 .Sh EXAMPLES
 Show the current IPv4 routing tables,
 without attempting to print hostnames symbolically:



Add exit status to route.8

2023-08-02 Thread Matthew Martin
A user in IRC asked about route exec's exit status, which seems
a reasonable thing to document.

The text is a combination of .Ex -std and env(1). Also route exec
requires a command, so fix the .Op markup.


diff --git route.8 route.8
index 887446c1420..ee5bd15fa1a 100644
--- route.8
+++ route.8
@@ -281,7 +281,8 @@ and/or a gateway.
 .Op Fl T Ar rtable
 .Tg
 .Cm exec
-.Op Ar command ...
+.Ar command
+.Op Ar arg ...
 .Xc
 Execute a command, forcing the process and its children to use the
 routing table and appropriate routing domain as specified with the
@@ -514,6 +515,44 @@ host and network name database
 .It Pa /etc/mygate
 default gateway address
 .El
+.Sh EXIT STATUS
+For commands other than
+.Cm exec ,
+the
+.Nm
+utility exits 0 on success, and >0 if an error occurs.
+.Pp
+For the
+.Cm exec
+command the
+.Nm
+utility exits with one of the following values:
+.Bl -tag -width Ds
+.It 0
+.Nm
+completed successfully and
+.Ar command
+was invoked and completed successfully too.
+.It 1
+An invalid command line option was passed to
+.Nm
+or setting the routing table failed and
+.Ar command
+was not invoked, or
+.Ar command
+was invoked but failed with exit status 1.
+.It 2\(en125, 128\(en255
+.Ar command
+was invoked but failed with this exit status;
+see its manual page for more information.
+.It 126
+.Ar command
+was found but could not be invoked, or it was invoked but failed
+with exit status 126.
+.It 127
+.Ar command
+could not be found, or it was invoked but failed with exit status 127.
+.El
 .Sh EXAMPLES
 Show the current IPv4 routing tables,
 without attempting to print hostnames symbolically:



Re: Make USB WiFi drivers more robust

2023-07-24 Thread Martin Pieuchot
On 24/07/23(Mon) 12:07, Mark Kettenis wrote:
> Hi All,
> 
> I recently committed a change to the xhci(4) driver that fixed an
> issue with suspending a machine while it has USB devices plugged in.
> Unfortunately this diff had some unintended side effects.  After
> looking at the way the USB stack works, I've come to the conclusion
> that it is best to try and fix the drivers to self-protect against
> events coming in while the device is being detached.  Some drivers
> already do this, some drivers only do this partially.  The diff below
> makes sure that all of the USB WiFi drivers do this in a consistent
> way by checking that we're in the processes of detaching the devices
> at the following points:

We spent quite some time in past years trying to get rid of the
usbd_is_dying() mechanism.  I'm quite sad to see such a diff.  I've no
idea what the underlying issue is or if an alternative is possible.

The idea is that the USB stack should already take care of this, not
every driver, because we've seen that it's hard for drivers to do it
correctly.
 
> 1. The driver's ioctl function.
> 
> 2. The driver's USB transfer completion callbacks.

Those are called by usb_transfer_complete().  This corresponds to
xhci_xfer_done() and xhci_event_port_change().  Do those functions
need to be called during suspend?

It is not clear to me what your issue is.  I wish we could find a fix
inside xhci(4) or the USB stack.

Thanks,
Martin



Re: patch: atfork unlock

2023-05-21 Thread Martin Pieuchot
On 07/12/22(Wed) 22:17, Joel Knight wrote:
> Hi. As first mentioned on misc[1], I've identified a deadlock in libc
> when a process forks, the children are multi-threaded, and they set
> one or more atfork callbacks. The diff below causes ATFORK_UNLOCK() to
> release the lock even when the process isn't multi-threaded. This
> avoids the deadlock. With this patch applied, the test case I have for
> this issue succeeds and there are no new failures during a full 'make
> regress'.
> 
> Threading is outside my area of expertise so I've no idea if the fix
> proposed here is appropriate. I'm happy to take or test feedback.
> 
> The diff is below and a clean copy is here:
> https://www.packetmischief.ca/files/patches/atfork_on_fork.diff.

This sounds legit to me.  Does anyone with some time want to take a look?

> .joel
> 
> 
> [1] https://marc.info/?l=openbsd-misc&m=166926508819790&w=2
> 
> 
> 
> Index: lib/libc/include/thread_private.h
> ===
> RCS file: /data/cvs-mirror/OpenBSD/src/lib/libc/include/thread_private.h,v
> retrieving revision 1.36
> diff -p -u -r1.36 thread_private.h
> --- lib/libc/include/thread_private.h 6 Jan 2021 19:54:17 - 1.36
> +++ lib/libc/include/thread_private.h 8 Dec 2022 04:28:45 -
> @@ -228,7 +228,7 @@ __END_HIDDEN_DECLS
>   } while (0)
>  #define _ATFORK_UNLOCK() \
>   do { \
> - if (__isthreaded) \
> + if (_thread_cb.tc_atfork_unlock != NULL) \
>   _thread_cb.tc_atfork_unlock(); \
>   } while (0)
> 
> Index: regress/lib/libpthread/pthread_atfork_on_fork/Makefile
> ===
> RCS file: regress/lib/libpthread/pthread_atfork_on_fork/Makefile
> diff -N regress/lib/libpthread/pthread_atfork_on_fork/Makefile
> --- /dev/null 1 Jan 1970 00:00:00 -
> +++ regress/lib/libpthread/pthread_atfork_on_fork/Makefile 7 Dec 2022 04:38:39 -
> @@ -0,0 +1,9 @@
> +# $OpenBSD$
> +
> +PROG= pthread_atfork_on_fork
> +
> +REGRESS_TARGETS= timeout
> +timeout:
> + timeout 10s ./${PROG}
> +
> +.include 
> Index: regress/lib/libpthread/pthread_atfork_on_fork/pthread_atfork_on_fork.c
> ===
> RCS file: regress/lib/libpthread/pthread_atfork_on_fork/pthread_atfork_on_fork.c
> diff -N regress/lib/libpthread/pthread_atfork_on_fork/pthread_atfork_on_fork.c
> --- /dev/null 1 Jan 1970 00:00:00 -
> +++ regress/lib/libpthread/pthread_atfork_on_fork/pthread_atfork_on_fork.c 7 Dec 2022 04:59:10 -
> @@ -0,0 +1,94 @@
> +/* $OpenBSD$ */
> +
> +/*
> + * Copyright (c) 2022 Joel Knight 
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +/*
> + * This test exercises atfork lock/unlock through multiple generations of
> + * forked child processes where each child also becomes multi-threaded.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define NUMBER_OF_GENERATIONS 4
> +#define NUMBER_OF_TEST_THREADS 2
> +
> +#define SAY(...) do { \
> +fprintf(stderr, "pid %5i ", getpid()); \
> +fprintf(stderr, __VA_ARGS__); \
> +} while (0)
> +
> +void
> +prepare(void)
> +{
> +/* Do nothing */
> +}
> +
> +void *
> +thread(void *arg)
> +{
> +return (NULL);
> +}
> +
> +int
> +test(int fork_level)
> +{
> +pid_t   proc_pid;
> +size_t  thread_index;
> +pthread_t   threads[NUMBER_OF_TEST_THREADS];
> +
> +proc_pid = fork();
> +fork_level = fork_level - 1;
> +
> +if (proc_pid == 0) {
> +SAY("generation %i\n", fork_level);
> +pthread_atfork(prepare, NULL, NULL);
> +
> +for (thread_index = 0; thread_index < NUMBER_OF_TEST_THREADS; thread_index++) {
> +pthread_create(&threads[thread_index], NULL, thread, NULL);
> +}
> +
> +for (thread_index = 0; thread_index < NUMBER_OF_TEST_THREADS; thread_index++) {
> +pthread_join(threads[thread_index], NULL);
> +}
> +
> +if (fork_level > 0) {
> +test(fork_level);
> +}
> +
> +SAY("exiting\n");
> +exit(0);
> +}
> +else {
> +SAY("parent waiting on child %i\n", proc_pid);
> +waitpid(proc_pid, 0, 0);
> +}
> +
> +return (0);
> +}
> +
> +int
> +main(int argc, char *argv[])
> 

cwm: don't draw colors with alpha

2023-02-25 Thread Martin Wijk
Hello,

When using a compositor within cwm, some programs (like chrome and firefox)
get a percentage of transparency applied to their border, while others
(like xterm) don't.
This diff fixes that by explicitly setting the alpha channel value for each
color to 0x.

(If someone wants transparency on their window borders, this boldly assumes
they want it uniform across all applications...  which could be achieved with
a fancier compositor than xcompmgr).

Worthwhile?


diff --git app/cwm/conf.c app/cwm/conf.c
index 6459aa18f..ab6c161bd 100644
--- app/cwm/conf.c
+++ app/cwm/conf.c
@@ -505,6 +505,9 @@ conf_screen(struct screen_ctx *sc)
}
}
 
+   for (i = 0; i < CWM_COLOR_NITEMS; i++)
+   sc->xftcolor[i].color.alpha = 0x;
+
conf_grab_kbd(sc->rootwin);
 }
 

--
mw



www: Move horizontal rule and update year

2023-02-14 Thread Martin Vahlensieck
Hi

Going back a few versions, it seems this hr was used to separate the
past from the future.  So put it back in the right place.  While here
also correct the year for future events, or should that be replaced by
"None currently scheduled"?

Best,

Martin

diff --git a/events.html b/events.html
index a10a3e50a..1d3f0fd65 100644
--- a/events.html
+++ b/events.html
@@ -40,7 +40,9 @@ like-minded people.
 
 Future events:
 
-2022
+2023
+
+
 
 Past events:
 
@@ -94,8 +96,6 @@ Sep 15-18, 2022, Vienna, Austria
 
 
 
-
-
 
 
 



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-29 Thread Martin Pieuchot
On 28/11/22(Mon) 15:04, Mark Kettenis wrote:
> > Date: Wed, 23 Nov 2022 17:33:26 +0100
> > From: Martin Pieuchot 
> > 
> > On 23/11/22(Wed) 16:34, Mark Kettenis wrote:
> > > > Date: Wed, 23 Nov 2022 10:52:32 +0100
> > > > From: Martin Pieuchot 
> > > > 
> > > > On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > > > > > Date: Tue, 22 Nov 2022 17:47:44 +
> > > > > > From: Miod Vallat 
> > > > > > 
> > > > > > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine 
> > > > > > > that
> > > > > > > triggered the original "vref used where vget required" problem?
> > > > > > 
> > > > > > On a similar machine it panics after a few hours with:
> > > > > > 
> > > > > > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > > > > > 
> > > > > > The trace (transcribed by hand) is
> > > > > > uvn_flush+0x820
> > > > > > uvm_vnp_terminate+0x79
> > > > > > vclean+0xdc
> > > > > > vgonel+0x70
> > > > > > getnewvnode+0x240
> > > > > > ffs_vget+0xcc
> > > > > > ffs_inode_alloc+0x13c
> > > > > > ufs_makeinode+0x94
> > > > > > ufs_create+0x58
> > > > > > VOP_CREATE+0x48
> > > > > > vn_open+0x188
> > > > > > doopenat+0x1b4
> > > > > 
> > > > > Ah right, there is another path where we end up with a refcount of
> > > > > zero.  Should be fixable, but I need to think about this for a bit.
> > > > 
> > > > Not sure to understand what you mean with refcount of 0.  Could you
> > > > elaborate?
> > > 
> > > Sorry, I was thinking ahead a bit.  I'm pretty much convinced that the
> > > issue we're dealing with is a race between a vnode being
> > > recycled/cleaned and the pagedaemon paging out pages associated with
> > > that same vnode.
> > > 
> > > The crashes we've seen before were all in the pagedaemon path where we
> > > end up calling into the VFS layer with a vnode that has v_usecount ==
> > > 0.  My "fix" avoids that, but hits the issue that when we are in the
> > > codepath that is recycling/cleaning the vnode, we can't use vget() to
> > > get a reference to the vnode since it checks that the vnode isn't in
> > > the process of being cleaned.
> > > 
> > > But if we avoid that issue (by for example) skipping the vget() call
> > > if the UVM_VNODE_DYING flag is set, we run into the same scenario
> > > where we call into the VFS layer with v_usecount == 0.  Now that may
> > > not actually be a problem, but I need to investigate this a bit more.
> > 
> > When the UVM_VNODE_DYING flag is set the caller always owns a valid
> > reference to the vnode: either because it is in the process of cleaning
> > it via uvm_vnp_terminate() or because uvn_detach() has been called,
> > which means the reference to the vnode hasn't been dropped yet.  So I
> > believe `v_usecount' for such a vnode is positive.
> 
> I don't think so.  The vnode that can be recycled is sitting on the
> freelist with v_usecount == 0.  When getnewvnode() decides to recycle
> a vnode it takes it off the freelist and calls vgonel(), which in turn
> calls vclean(), which only increases v_usecount if it is non-zero.  So
> at the point where uvm_vnp_terminate() gets called v_usecount
> definitely is 0.
> 
> That said, the vnode is no longer on the freelist at that point and
> since UVM_VNODE_DYING is set, uvm_vnp_uncache() will return
> immediately without calling vref() to get another reference.  So that
> is fine.
> 
> > > Or maybe calling into the VFS layer with a vnode that has v_usecount
> > > == 0 is perfectly fine and we should do the vget() dance I propose in
> > > uvm_vnp_uncache() instead of in uvn_put().
> > 
> > I'm not following.  uvm_vnp_uncache() is always called with a valid
> > vnode, no?
> 
> Not sure what you mean with a "valid vnode"; uvm_vnp_uncache() checks
> the UVM_VNODE_VALID flag at the start, which suggests that it can be
> called in cases where that flag is not set.  But it will unlock and
> return immediately in that case, so it should be safe.
> 
> Anyway, I think I have convinced myself that in the case where the
> pagedaemon ends up calling uvn_

Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-23 Thread Martin Pieuchot
On 23/11/22(Wed) 16:34, Mark Kettenis wrote:
> > Date: Wed, 23 Nov 2022 10:52:32 +0100
> > From: Martin Pieuchot 
> > 
> > On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > > > Date: Tue, 22 Nov 2022 17:47:44 +
> > > > From: Miod Vallat 
> > > > 
> > > > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
> > > > > triggered the original "vref used where vget required" problem?
> > > > 
> > > > On a similar machine it panics after a few hours with:
> > > > 
> > > > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > > > 
> > > > The trace (transcribed by hand) is
> > > > uvn_flush+0x820
> > > > uvm_vnp_terminate+0x79
> > > > vclean+0xdc
> > > > vgonel+0x70
> > > > getnewvnode+0x240
> > > > ffs_vget+0xcc
> > > > ffs_inode_alloc+0x13c
> > > > ufs_makeinode+0x94
> > > > ufs_create+0x58
> > > > VOP_CREATE+0x48
> > > > vn_open+0x188
> > > > doopenat+0x1b4
> > > 
> > > Ah right, there is another path where we end up with a refcount of
> > > zero.  Should be fixable, but I need to think about this for a bit.
> > 
> > Not sure to understand what you mean with refcount of 0.  Could you
> > elaborate?
> 
> Sorry, I was thinking ahead a bit.  I'm pretty much convinced that the
> issue we're dealing with is a race between a vnode being
> recycled/cleaned and the pagedaemon paging out pages associated with
> that same vnode.
> 
> The crashes we've seen before were all in the pagedaemon path where we
> end up calling into the VFS layer with a vnode that has v_usecount ==
> 0.  My "fix" avoids that, but hits the issue that when we are in the
> codepath that is recycling/cleaning the vnode, we can't use vget() to
> get a reference to the vnode since it checks that the vnode isn't in
> the process of being cleaned.
> 
> But if we avoid that issue (by for example) skipping the vget() call
> if the UVM_VNODE_DYING flag is set, we run into the same scenario
> where we call into the VFS layer with v_usecount == 0.  Now that may
> not actually be a problem, but I need to investigate this a bit more.

When the UVM_VNODE_DYING flag is set the caller always owns a valid
reference to the vnode: either because it is in the process of cleaning
it via uvm_vnp_terminate() or because uvn_detach() has been called,
which means the reference to the vnode hasn't been dropped yet.  So I
believe `v_usecount' for such a vnode is positive.

> Or maybe calling into the VFS layer with a vnode that has v_usecount
> == 0 is perfectly fine and we should do the vget() dance I propose in
> uvm_vnp_uncache() instead of in uvn_put().

I'm not following.  uvm_vnp_uncache() is always called with a valid
vnode, no?



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-23 Thread Martin Pieuchot
On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > Date: Tue, 22 Nov 2022 17:47:44 +
> > From: Miod Vallat 
> > 
> > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
> > > triggered the original "vref used where vget required" problem?
> > 
> > On a similar machine it panics after a few hours with:
> > 
> > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > 
> > The trace (transcribed by hand) is
> > uvn_flush+0x820
> > uvm_vnp_terminate+0x79
> > vclean+0xdc
> > vgonel+0x70
> > getnewvnode+0x240
> > ffs_vget+0xcc
> > ffs_inode_alloc+0x13c
> > ufs_makeinode+0x94
> > ufs_create+0x58
> > VOP_CREATE+0x48
> > vn_open+0x188
> > doopenat+0x1b4
> 
> Ah right, there is another path where we end up with a refcount of
> zero.  Should be fixable, but I need to think about this for a bit.

I'm not sure I understand what you mean by a refcount of 0.  Could you
elaborate?

My understanding of the panic reported is that the proposed diff creates
a complicated relationship between the vnode and the UVM vnode layer.
The above problem occurs because VXLOCK is set on a vnode when it is
being recycled *before* calling uvm_vnp_terminate().  Now that uvn_io()
calls vget(9), it will fail because VXLOCK is set, which is what we want
during vclean(9).



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-22 Thread Martin Pieuchot
On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > From: Mark Kettenis 
> > 
> > > From: Jeremie Courreges-Anglas 
> > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > > 
> > > On Tue, Nov 15 2022, Martin Pieuchot  wrote:
> > > > UVM vnode objects include a reference count to keep track of the number
> > > > of processes that have the corresponding pages mapped in their VM space.
> > > >
> > > > When the last process referencing a given library or executable dies,
> > > > the reaper will munmap this object on its behalf.  When this happens it
> > > > doesn't free the associated pages to speed-up possible re-use of the
> > > > file.  Instead the pages are placed on the inactive list but stay ready
> > > > to be pmap_enter()'d without requiring I/O as soon as a newly process
> > > > needs to access them.
> > > >
> > > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > > doesn't work well with swapping [0].  For some reason when the page 
> > > > daemon
> > > > wants to free pages on the inactive list it tries to flush the pages to
> > > > disk and panic(9) because it needs a valid reference to the vnode to do
> > > > so.
> > > >
> > > > This indicates that the mechanism described above, which seems to work
> > > > fine for RO mappings, is currently buggy in more complex situations.
> > > > Flushing the pages when the last reference of the UVM object is dropped
> > > > also doesn't seem to be enough as bluhm@ reported [1].
> > > >
> > > > The diff below, which has already be committed and reverted, gets rid of
> > > > the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
> > > > the arm64 caching bug has been found and fixed.
> > > >
> > > > Getting rid of this logic means more I/O will be generated and pages
> > > > might have a faster reuse cycle.  I'm aware this might introduce a small
> > > > slowdown,
> > > 
> > > Numbers for my usual make -j4 in libc,
> > > on an Unmatched riscv64 box, now:
> > >16m32.65s real21m36.79s user30m53.45s system
> > >16m32.37s real21m33.40s user31m17.98s system
> > >16m32.63s real21m35.74s user31m12.01s system
> > >16m32.13s real21m36.12s user31m06.92s system
> > > After:
> > >19m14.15s real21m09.39s user36m51.33s system
> > >19m19.11s real21m02.61s user36m58.46s system
> > >19m21.77s real21m09.23s user37m03.85s system
> > >19m09.39s real21m08.96s user36m36.00s system
> > > 
> > > 4 cores amd64 VM, before (-current plus an other diff):
> > >1m54.31s real 2m47.36s user 4m24.70s system
> > >1m52.64s real 2m45.68s user 4m23.46s system
> > >1m53.47s real 2m43.59s user 4m27.60s system
> > > After:
> > >2m34.12s real 2m51.15s user 6m20.91s system
> > >2m34.30s real 2m48.48s user 6m23.34s system
> > >2m37.07s real 2m49.60s user 6m31.53s system
> > > 
> > > > however I believe we should work towards loading files from the
> > > > buffer cache to save I/O cycles instead of having another layer of 
> > > > cache.
> > > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > > simple and well understood is the first step in this direction.
> > > >
> > > > I'd appreciate if the diff below could be tested on many architectures,
> > > > include the offending rpi4.
> > > 
> > > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > > far as I can see this looks stable on my machine, but what I really care
> > > about is the riscv64 bulk build cluster (I'm going to start another
> > > bulk build soon).
> > > 
> > > > Comments?  Oks?
> > > 
> > > The performance drop in my microbenchmark kinda worries me but it's only
> > > a microbenchmark...
> > 
> > I wouldn't call this a microbenchmark.  I fear this is typical for
> > builds of anything on clang architectures.  And I expect it to be
> > worse on single-processor machine where *every* time we exec

Get rid of UVM_VNODE_CANPERSIST

2022-11-15 Thread Martin Pieuchot
UVM vnode objects include a reference count to keep track of the number
of processes that have the corresponding pages mapped in their VM space.

When the last process referencing a given library or executable dies,
the reaper will munmap this object on its behalf.  When this happens it
doesn't free the associated pages to speed-up possible re-use of the
file.  Instead the pages are placed on the inactive list but stay ready
to be pmap_enter()'d without requiring I/O as soon as a new process
needs to access them.

The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
doesn't work well with swapping [0].  For some reason when the page daemon
wants to free pages on the inactive list it tries to flush the pages to
disk and calls panic(9) because it needs a valid reference to the vnode to do
so.

This indicates that the mechanism described above, which seems to work
fine for RO mappings, is currently buggy in more complex situations.
Flushing the pages when the last reference of the UVM object is dropped
also doesn't seem to be enough as bluhm@ reported [1].

The diff below, which has already been committed and reverted, gets rid of
the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
the arm64 caching bug has been found and fixed.

Getting rid of this logic means more I/O will be generated and pages
might have a faster reuse cycle.  I'm aware this might introduce a small
slowdown; however, I believe we should work towards loading files from the
buffer cache to save I/O cycles instead of having another layer of cache.
Such work isn't trivial and making sure the vnode <-> UVM relation is
simple and well understood is the first step in this direction.

I'd appreciate it if the diff below could be tested on many architectures,
including the offending rpi4.

Comments?  Oks?

[0] https://marc.info/?l=openbsd-bugs&m=164846737707559&w=2 
[1] https://marc.info/?l=openbsd-bugs&m=166843373415030&w=2

Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.130
diff -u -p -r1.130 uvm_vnode.c
--- uvm/uvm_vnode.c 20 Oct 2022 13:31:52 -  1.130
+++ uvm/uvm_vnode.c 15 Nov 2022 13:28:28 -
@@ -161,11 +161,8 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 * add it to the writeable list, and then return.
 */
if (uvn->u_flags & UVM_VNODE_VALID) {   /* already active? */
+   KASSERT(uvn->u_obj.uo_refs > 0);
 
-   /* regain vref if we were persisting */
-   if (uvn->u_obj.uo_refs == 0) {
-   vref(vp);
-   }
uvn->u_obj.uo_refs++;   /* bump uvn ref! */
 
/* check for new writeable uvn */
@@ -235,14 +232,14 @@ uvn_attach(struct vnode *vp, vm_prot_t a
KASSERT(uvn->u_obj.uo_refs == 0);
uvn->u_obj.uo_refs++;
oldflags = uvn->u_flags;
-   uvn->u_flags = UVM_VNODE_VALID|UVM_VNODE_CANPERSIST;
+   uvn->u_flags = UVM_VNODE_VALID;
uvn->u_nio = 0;
uvn->u_size = used_vnode_size;
 
/*
 * add a reference to the vnode.   this reference will stay as long
 * as there is a valid mapping of the vnode.   dropped when the
-* reference count goes to zero [and we either free or persist].
+* reference count goes to zero.
 */
vref(vp);
 
@@ -323,16 +320,6 @@ uvn_detach(struct uvm_object *uobj)
 */
vp->v_flag &= ~VTEXT;
 
-   /*
-* we just dropped the last reference to the uvn.   see if we can
-* let it "stick around".
-*/
-   if (uvn->u_flags & UVM_VNODE_CANPERSIST) {
-   /* won't block */
-   uvn_flush(uobj, 0, 0, PGO_DEACTIVATE|PGO_ALLPAGES);
-   goto out;
-   }
-
/* its a goner! */
uvn->u_flags |= UVM_VNODE_DYING;
 
@@ -382,7 +369,6 @@ uvn_detach(struct uvm_object *uobj)
/* wake up any sleepers */
if (oldflags & UVM_VNODE_WANTED)
wakeup(uvn);
-out:
rw_exit(uobj->vmobjlock);
 
/* drop our reference to the vnode. */
@@ -498,8 +484,8 @@ uvm_vnp_terminate(struct vnode *vp)
}
 
/*
-* done.   now we free the uvn if its reference count is zero
-* (true if we are zapping a persisting uvn).   however, if we are
+* done.   now we free the uvn if its reference count is zero.
+* however, if we are
 * terminating a uvn with active mappings we let it live ... future
 * calls down to the vnode layer will fail.
 */
@@ -507,14 +493,14 @@ uvm_vnp_terminate(struct vnode *vp)
if (uvn->u_obj.uo_refs) {
/*
 * uvn must live on it is dead-vnode state until all references
-* are gone.   restore flags.clear CANPERSIST state.
+* are gone.   restore flags.
 */
uvn->u_flags &= ~(UVM_VNODE_DYING|U

btrace: string comparison in filters

2022-11-11 Thread Martin Pieuchot
Diff below adds support for the following common idiom:

syscall:open:entry
/comm == "ksh"/
{
...
}

String comparison is tricky as it can be combined with any other
expression in filters, like:

syscall:mmap:entry
/comm == "cc" && pid != 4589/
{
...
}

I don't have the energy to change the parser, so I went for the easy
solution of treating any "stupid" string comparison as 'true', albeit
printing a warning.  I'd love it if somebody with some yacc knowledge
could come up with a better solution.

ok?

Index: usr.sbin/btrace/bt_parse.y
===
RCS file: /cvs/src/usr.sbin/btrace/bt_parse.y,v
retrieving revision 1.46
diff -u -p -r1.46 bt_parse.y
--- usr.sbin/btrace/bt_parse.y  28 Apr 2022 21:04:24 -  1.46
+++ usr.sbin/btrace/bt_parse.y  11 Nov 2022 14:34:37 -
@@ -218,6 +218,7 @@ variable: lvar  { $$ = bl_find($1); }
 factor : '(' expr ')'  { $$ = $2; }
| NUMBER{ $$ = ba_new($1, B_AT_LONG); }
| BUILTIN   { $$ = ba_new(NULL, $1); }
+   | CSTRING   { $$ = ba_new($1, B_AT_STR); }
| staticv
| variable
| mentry
Index: usr.sbin/btrace/btrace.c
===
RCS file: /cvs/src/usr.sbin/btrace/btrace.c,v
retrieving revision 1.64
diff -u -p -r1.64 btrace.c
--- usr.sbin/btrace/btrace.c11 Nov 2022 10:51:39 -  1.64
+++ usr.sbin/btrace/btrace.c11 Nov 2022 14:44:15 -
@@ -434,14 +434,23 @@ rules_setup(int fd)
struct bt_rule *r, *rbegin = NULL;
struct bt_probe *bp;
struct bt_stmt *bs;
+   struct bt_arg *ba;
int dokstack = 0, on = 1;
uint64_t evtflags;
 
TAILQ_FOREACH(r, &g_rules, br_next) {
evtflags = 0;
-   SLIST_FOREACH(bs, &r->br_action, bs_next) {
-   struct bt_arg *ba;
 
+   if (r->br_filter != NULL &&
+   r->br_filter->bf_condition != NULL)  {
+
+   bs = r->br_filter->bf_condition;
+   ba = SLIST_FIRST(&bs->bs_args);
+
+   evtflags |= ba2dtflags(ba);
+   }
+
+   SLIST_FOREACH(bs, &r->br_action, bs_next) {
SLIST_FOREACH(ba, &bs->bs_args, ba_next)
evtflags |= ba2dtflags(ba);
 
@@ -1175,6 +1184,36 @@ baexpr2long(struct bt_arg *ba, struct dt
lhs = ba->ba_value;
rhs = SLIST_NEXT(lhs, ba_next);
 
+   /*
+* String comparison also use '==' and '!='.
+*/
+   if (lhs->ba_type == B_AT_STR ||
+   (rhs != NULL && rhs->ba_type == B_AT_STR)) {
+   char lstr[STRLEN], rstr[STRLEN];
+
+   strlcpy(lstr, ba2str(lhs, dtev), sizeof(lstr));
+   strlcpy(rstr, ba2str(rhs, dtev), sizeof(rstr));
+
+   result = strncmp(lstr, rstr, STRLEN) == 0;
+
+   switch (ba->ba_type) {
+   case B_AT_OP_EQ:
+   break;
+   case B_AT_OP_NE:
+   result = !result;
+   break;
+   default:
+   warnx("operation '%d' unsupported on strings",
+   ba->ba_type);
+   result = 1;
+   }
+
+   debug("ba=%p eval '(%s %s %s) = %d'\n", ba, lstr, ba_name(ba),
+  rstr, result);
+
+   goto out;
+   }
+
lval = ba2long(lhs, dtev);
if (rhs == NULL) {
rval = 0;
@@ -1233,9 +1272,10 @@ baexpr2long(struct bt_arg *ba, struct dt
xabort("unsupported operation %d", ba->ba_type);
}
 
-   debug("ba=%p eval '%ld %s %ld = %d'\n", ba, lval, ba_name(ba),
+   debug("ba=%p eval '(%ld %s %ld) = %d'\n", ba, lval, ba_name(ba),
   rval, result);
 
+out:
--recursions;
 
return result;
@@ -1245,10 +1285,15 @@ const char *
 ba_name(struct bt_arg *ba)
 {
switch (ba->ba_type) {
+   case B_AT_STR:
+   return (const char *)ba->ba_value;
+   case B_AT_LONG:
+   return ba2str(ba, NULL);
case B_AT_NIL:
return "0";
case B_AT_VAR:
case B_AT_MAP:
+   case B_AT_HIST:
break;
case B_AT_BI_PID:
return "pid";
@@ -1326,7 +1371,8 @@ ba_name(struct bt_arg *ba)
xabort("unsupported type %d", ba->ba_type);
}
 
-   assert(ba->ba_type == B_AT_VAR || ba->ba_type == B_AT_MAP);
+   assert(ba->ba_type == B_AT_VAR || ba->ba_type == B_AT_MAP ||
+   ba->ba_type == B_AT_HIST);
 
static char buf[64];
size_t sz;
@@ -1516,9 +1562,13 @@ ba2str(struct bt_arg *ba, struct dt_evt 
 int
 ba2dtflags(struct bt_arg *ba)
 {
+   static long recursions;
struct bt_arg *bval;
int flags = 0;
 
+   if (++recursions >= __MAXOPER

Re: push kernel lock inside ifioctl_get()

2022-11-08 Thread Martin Pieuchot
On 08/11/22(Tue) 15:28, Klemens Nanni wrote:
> After this mechanical move, I can unlock the individual SIOCG* in there.

I'd suggest grabbing the KERNEL_LOCK() after NET_LOCK_SHARED().
Otherwise you might spin for the first one then release it when going
to sleep.
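
Something like the following for one of the SIOCG* cases, purely as a
sketch of the intended lock order (untested, adapted from your diff
below):

	case SIOCGIFGMEMB:
		NET_LOCK_SHARED();
		KERNEL_LOCK();
		error = if_getgroupmembers(data);
		KERNEL_UNLOCK();
		NET_UNLOCK_SHARED();
		return (error);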

> OK?
> 
> Index: if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.667
> diff -u -p -r1.667 if.c
> --- if.c  8 Nov 2022 15:20:24 -   1.667
> +++ if.c  8 Nov 2022 15:26:07 -
> @@ -2426,33 +2426,43 @@ ifioctl_get(u_long cmd, caddr_t data)
>   size_t bytesdone;
>   const char *label;
>  
> - KERNEL_LOCK();
> -
>   switch(cmd) {
>   case SIOCGIFCONF:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = ifconf(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCIFGCLONERS:
> + KERNEL_LOCK();
>   error = if_clone_list((struct if_clonereq *)data);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGMEMB:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgroupmembers(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGATTR:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgroupattribs(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGLIST:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgrouplist(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   }
> +
> + KERNEL_LOCK();
>  
>   ifp = if_unit(ifr->ifr_name);
>   if (ifp == NULL) {
> 



Mark sched_yield(2) as NOLOCK

2022-11-08 Thread Martin Pieuchot
Now that mmap/munmap/mprotect(2) are no longer creating contention it is
possible to see that sched_yield(2) is one of the syscalls waiting for
the KERNEL_LOCK() to be released.  However this is no longer necessary.

Traversing `ps_threads' requires either the KERNEL_LOCK() or the
SCHED_LOCK() and we are holding both in this case.  So let's drop the
requirement for the KERNEL_LOCK().
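
For reference, a reduced sketch of the traversal in question, written
from memory rather than copied from the tree (so field and helper names
might not match exactly); the point is that the `ps_threads' walk happens
entirely under SCHED_LOCK():

	int
	sys_sched_yield(struct proc *p, void *v, register_t *retval)
	{
		struct proc *q;
		uint8_t newprio;
		int s;

		SCHED_LOCK(s);
		/*
		 * If one of the threads of a multi-threaded process called
		 * sched_yield(2), drop its priority to ensure its siblings
		 * can make some progress.
		 */
		newprio = p->p_usrpri;
		TAILQ_FOREACH(q, &p->p_p->ps_threads, p_thr_link)
			newprio = max(newprio, q->p_runpri);
		setrunqueue(p->p_cpu, p, newprio);
		p->p_ru.ru_nvcsw++;
		mi_switch();
		SCHED_UNLOCK(s);

		return (0);
	}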

ok?

Index: kern/syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.235
diff -u -p -r1.235 syscalls.master
--- kern/syscalls.master8 Nov 2022 11:05:57 -   1.235
+++ kern/syscalls.master8 Nov 2022 13:09:10 -
@@ -531,7 +531,7 @@
 #else
 297UNIMPL
 #endif
-298STD { int sys_sched_yield(void); }
+298STD NOLOCK  { int sys_sched_yield(void); }
 299STD NOLOCK  { pid_t sys_getthrid(void); }
 300OBSOL   t32___thrsleep
 301STD NOLOCK  { int sys___thrwakeup(const volatile void *ident, \



Re: xenstore.c: return error number

2022-11-08 Thread Martin Pieuchot
On 01/11/22(Tue) 15:26, Masato Asou wrote:
> Hi,
> 
> Return error number instead of call panic().

Makes sense to me.  Do you know how this error can occur?  Is it a logic
error or are we trusting values produced by a third party?

> comment, ok?
> --
> ASOU Masato
> 
> diff --git a/sys/dev/pv/xenstore.c b/sys/dev/pv/xenstore.c
> index 1e4f15d30eb..dc89ba0fa6d 100644
> --- a/sys/dev/pv/xenstore.c
> +++ b/sys/dev/pv/xenstore.c
> @@ -118,6 +118,7 @@ struct xs_msg {
>   struct xs_msghdr xsm_hdr;
>   uint32_t xsm_read;
>   uint32_t xsm_dlen;
> + int  xsm_error;
>   uint8_t *xsm_data;
>   TAILQ_ENTRY(xs_msg)  xsm_link;
>  };
> @@ -566,9 +567,7 @@ xs_intr(void *arg)
>   }
>  
>   if (xsm->xsm_hdr.xmh_len > xsm->xsm_dlen)
> - panic("message too large: %d vs %d for type %d, rid %u",
> - xsm->xsm_hdr.xmh_len, xsm->xsm_dlen, xsm->xsm_hdr.xmh_type,
> - xsm->xsm_hdr.xmh_rid);
> + xsm->xsm_error = EMSGSIZE;
>  
>   len = MIN(xsm->xsm_hdr.xmh_len - xsm->xsm_read, avail);
>   if (len) {
> @@ -800,7 +799,9 @@ xs_cmd(struct xs_transaction *xst, int cmd, const char 
> *path,
>   error = xs_geterror(xsm);
>   DPRINTF("%s: xenstore request %d \"%s\" error %s\n",
>   xs->xs_sc->sc_dev.dv_xname, cmd, path, xsm->xsm_data);
> - } else if (mode == READ) {
> + } else if (xsm->xsm_error != 0)
> + error = xsm->xsm_error;
> + else if (mode == READ) {
>   KASSERT(iov && iov_cnt);
>   error = xs_parse(xst, xsm, iov, iov_cnt);
>   }
> 



Re: Please test: unlock mprotect/mmap/munmap

2022-11-08 Thread Martin Pieuchot
On 08/11/22(Tue) 11:12, Mark Kettenis wrote:
> > Date: Tue, 8 Nov 2022 10:32:14 +0100
> > From: Christian Weisgerber 
> > 
> > Martin Pieuchot:
> > 
> > > These 3 syscalls should now be ready to run w/o KERNEL_LOCK().  This
> > > will reduce contention a lot.  I'd be happy to hear from test reports
> > > on many architectures and possible workloads.
> > 
> > This survived a full amd64 package build.
> 
> \8/
> 
> I think that means it should be comitted.

I agree.  This has been tested on i386, riscv64, m88k, arm64, amd64 (of
course) and sparc64.  I'm pretty confident.



Re: push kernel lock down in ifioctl()

2022-11-07 Thread Martin Pieuchot
On 07/11/22(Mon) 15:16, Klemens Nanni wrote:
> Not all interface ioctls need the kernel lock, but they all grab it.
> 
> Here's a mechanical diff splitting the single lock/unlock around
> ifioctl() into individual lock/unlock dances inside ifioctl().
> 
> From there we can unlock individual ioctls piece by piece.
> 
> Survives regress on sparc64 and didn't blow up on my amd64 notebook yet.
> 
> Feedback? Objection? OK?

Makes sense.  Your diff is missing the kern/sys_socket.c chunk.

This stuff is hairy.  I'd suggest moving very very carefully.  For
example, I wouldn't bother releasing the KERNEL_LOCK() before the
if_put().  Yes, what you're suggesting is correct.  Or at least should
be...
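
In other words, I would keep the tail of ifioctl() as simple as the
following sketch (illustration only, not a replacement diff):

	if (((oif_flags ^ ifp->if_flags) & IFF_UP) != 0)
		getmicrotime(&ifp->if_lastchange);

	if_put(ifp);
	KERNEL_UNLOCK();

	return (error);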

> Index: net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.665
> diff -u -p -r1.665 if.c
> --- net/if.c  8 Sep 2022 10:22:06 -   1.665
> +++ net/if.c  7 Nov 2022 15:13:01 -
> @@ -1942,19 +1942,25 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCIFCREATE:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   error = if_clone_create(ifr->ifr_name, 0);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCIFDESTROY:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   error = if_clone_destroy(ifr->ifr_name);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCSIFGATTR:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   NET_LOCK();
>   error = if_setgroupattribs(data);
>   NET_UNLOCK();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFCONF:
>   case SIOCIFGCLONERS:
> @@ -1973,12 +1979,19 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCGIFRDOMAIN:
>   case SIOCGIFGROUP:
>   case SIOCGIFLLPRIO:
> - return (ifioctl_get(cmd, data));
> + KERNEL_LOCK();
> + error = ifioctl_get(cmd, data);
> + KERNEL_UNLOCK();
> + return (error);
>   }
>  
> + KERNEL_LOCK();
> +
>   ifp = if_unit(ifr->ifr_name);
> - if (ifp == NULL)
> + if (ifp == NULL) {
> + KERNEL_UNLOCK();
>   return (ENXIO);
> + }
>   oif_flags = ifp->if_flags;
>   oif_xflags = ifp->if_xflags;
>  
> @@ -2396,6 +2409,8 @@ forceup:
>  
>   if (((oif_flags ^ ifp->if_flags) & IFF_UP) != 0)
>   getmicrotime(&ifp->if_lastchange);
> +
> + KERNEL_UNLOCK();
>  
>   if_put(ifp);
>  
> 



Please test: unlock mprotect/mmap/munmap

2022-11-06 Thread Martin Pieuchot
These 3 syscalls should now be ready to run w/o KERNEL_LOCK().  This
will reduce contention a lot.  I'd be happy to receive test reports
from many architectures and a variety of workloads.

Do not forget to run "make syscalls" before building the kernel.
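
For anyone wondering how the annotation takes effect: the NOLOCK flag ends
up as SY_NOLOCK in the generated sysent entry, and the MI syscall path only
takes the kernel lock when that flag is absent.  Roughly (paraphrased from
memory, not the verbatim contents of sys/syscall_mi.h):

	int lock = !(callp->sy_flags & SY_NOLOCK);

	if (lock)
		KERNEL_LOCK();
	error = (*callp->sy_call)(p, argp, retval);
	if (lock)
		KERNEL_UNLOCK();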

Index: syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.234
diff -u -p -r1.234 syscalls.master
--- syscalls.master 25 Oct 2022 16:10:31 -  1.234
+++ syscalls.master 6 Nov 2022 10:50:45 -
@@ -126,7 +126,7 @@
struct sigaction *osa); }
 47 STD NOLOCK  { gid_t sys_getgid(void); }
 48 STD NOLOCK  { int sys_sigprocmask(int how, sigset_t mask); }
-49 STD { void *sys_mmap(void *addr, size_t len, int prot, \
+49 STD NOLOCK  { void *sys_mmap(void *addr, size_t len, int prot, \
int flags, int fd, off_t pos); }
 50 STD { int sys_setlogin(const char *namebuf); }
 #ifdef ACCOUNTING
@@ -171,8 +171,8 @@
const struct kevent *changelist, int nchanges, \
struct kevent *eventlist, int nevents, \
const struct timespec *timeout); }
-73 STD { int sys_munmap(void *addr, size_t len); }
-74 STD { int sys_mprotect(void *addr, size_t len, \
+73 STD NOLOCK  { int sys_munmap(void *addr, size_t len); }
+74 STD NOLOCK  { int sys_mprotect(void *addr, size_t len, \
int prot); }
 75 STD { int sys_madvise(void *addr, size_t len, \
int behav); }



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-30 Thread Martin Pieuchot
On 30/10/22(Sun) 12:45, Klemens Nanni wrote:
> On Sun, Oct 30, 2022 at 12:40:02PM +, Klemens Nanni wrote:
> > regress on i386/GENERIC.MP+WITNESS with this diff shows
> 
> Another one;  This machine has three read-only NFS mounts, but none of
> them are used during builds or regress.

It's the same.  See archives of bugs@ for discussion about this lock
order reversal and a potential fix from visa@.

> 
> This one is most certainly from the NFS regress tests themselves:
> 127.0.0.1:/mnt/regress-nfs-server  3548  2088  1284  62%  /mnt/regress-nfs-client
> 
> witness: lock order reversal:
>  1st 0xd6381eb8 vmmaplk (&map->lock)
>  2nd 0xf5c98d24 nfsnode (&np->n_lock)
> lock order data w2 -> w1 missing
> lock order "&map->lock"(rwlock) -> "&np->n_lock"(rrwlock) first seen at:
> #0  rw_enter+0x57
> #1  rrw_enter+0x3d
> #2  nfs_lock+0x27
> #3  VOP_LOCK+0x50
> #4  vn_lock+0x91
> #5  vn_rdwr+0x64
> #6  vndstrategy+0x2bd
> #7  physio+0x18f
> #8  vndwrite+0x1a
> #9  spec_write+0x74
> #10 VOP_WRITE+0x3f
> #11 vn_write+0xde
> #12 dofilewritev+0xbb
> #13 sys_pwrite+0x55
> #14 syscall+0x2ec
> #15 Xsyscall_untramp+0xa9
> 



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-30 Thread Martin Pieuchot
On 30/10/22(Sun) 12:40, Klemens Nanni wrote:
> On Fri, Oct 28, 2022 at 11:08:55AM +0200, Martin Pieuchot wrote:
> > On 20/10/22(Thu) 16:17, Martin Pieuchot wrote:
> > > On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> > > > Diff below adds a minimalist set of assertions to ensure proper locks
> > > > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > > > mmap(2) for anons and munmap(2).
> > > > 
> > > > Please test it with WITNESS enabled and report back.
> > > 
> > > New version of the diff that includes a lock/unlock dance  in 
> > > uvm_map_teardown().  While grabbing this lock should not be strictly
> > > necessary because no other reference to the map should exist when the
> > > reaper is holding it, it helps make progress with asserts.  Grabbing
> > > the lock is easy and it can also save us a lot of time if there is any
> > > reference counting bugs (like we've discovered w/ vnode and swapping).
> > 
> > Here's an updated version that adds a lock/unlock dance in
> > uvm_map_deallocate() to satisfy the assert in uvm_unmap_remove().
> > Thanks to tb@ from pointing this out.
> > 
> > I received many positive feedback and test reports, I'm now asking for
> > oks.
> 
> regress on i386/GENERIC.MP+WITNESS with this diff shows

This isn't related to this diff.



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-28 Thread Martin Pieuchot
On 20/10/22(Thu) 16:17, Martin Pieuchot wrote:
> On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> > Diff below adds a minimalist set of assertions to ensure proper locks
> > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > mmap(2) for anons and munmap(2).
> > 
> > Please test it with WITNESS enabled and report back.
> 
> New version of the diff that includes a lock/unlock dance  in 
> uvm_map_teardown().  While grabbing this lock should not be strictly
> necessary because no other reference to the map should exist when the
> reaper is holding it, it helps make progress with asserts.  Grabbing
> the lock is easy and it can also save us a lot of time if there is any
> reference counting bugs (like we've discovered w/ vnode and swapping).

Here's an updated version that adds a lock/unlock dance in
uvm_map_deallocate() to satisfy the assert in uvm_unmap_remove().
Thanks to tb@ for pointing this out.

I received a lot of positive feedback and test reports, so I'm now asking
for oks.


Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  28 Oct 2022 08:41:30 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 28 Oct 2022 08:41:30 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.301
diff -u -p -r1.301 uvm_map.c
--- uvm/uvm_map.c   24 Oct 2022 15:11:56 -  1.301
+++ uvm/uvm_map.c   28 Oct 2022 08:46:28 -
@@ -491,6 +491,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +572,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1457,6 +1461,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1584,6 +1590,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1704,6 +1712,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1868,6 +1878,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1996,10 +2008,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return 0;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrl

Re: vmd: remove the user quota tracking

2022-10-27 Thread Matthew Martin
On Wed, Oct 12, 2022 at 09:20:06AM -0400, Dave Voutila wrote:
> 
> 1 week bump for the below. If you use this feature or currently hacking
> on it, speak up by end of week. I'm sharpening my axes.

Are the axes sharp?

> > diff refs/heads/master refs/heads/vmd-user
> > commit - bfe2092d87b190d9f89c4a6f2728a539b7f88233
> > commit + e84ff2c7628a811e00044a447ad906d6e24beac0
> > blob - 374d7de6629e072065b5c0232536c23c1e5bbbe0
> > blob + a192223cf118e2a8764b24f965a15acbf8ae506f
> > --- usr.sbin/vmd/config.c
> > +++ usr.sbin/vmd/config.c
> > @@ -98,12 +98,6 @@ config_init(struct vmd *env)
> > return (-1);
> > TAILQ_INIT(env->vmd_switches);
> > }
> > -   if (what & CONFIG_USERS) {
> > -   if ((env->vmd_users = calloc(1,
> > -   sizeof(*env->vmd_users))) == NULL)
> > -   return (-1);
> > -   TAILQ_INIT(env->vmd_users);
> > -   }
> >
> > return (0);
> >  }
> > @@ -238,13 +232,6 @@ config_setvm(struct privsep *ps, struct vmd_vm *vm, ui
> > return (EALREADY);
> > }
> >
> > -   /* increase the user reference counter and check user limits */
> > -   if (vm->vm_user != NULL && user_get(vm->vm_user->usr_id.uid) != NULL) {
> > -   user_inc(vcp, vm->vm_user, 1);
> > -   if (user_checklimit(vm->vm_user, vcp) == -1)
> > -   return (EPERM);
> > -   }
> > -
> > /*
> >  * Rate-limit the VM so that it cannot restart in a loop:
> >  * if the VM restarts after less than VM_START_RATE_SEC seconds,
> > blob - 2f3ac1a76f2c3e458919eca85c238a668c10422a
> > blob + 755cbedb6a18502a87724502ec86e9e426961701
> > --- usr.sbin/vmd/vmd.c
> > +++ usr.sbin/vmd/vmd.c
> > @@ -1188,9 +1188,6 @@ vm_stop(struct vmd_vm *vm, int keeptty, const char *ca
> > vm->vm_state &= ~(VM_STATE_RECEIVED | VM_STATE_RUNNING
> > | VM_STATE_SHUTDOWN);
> >
> > -   user_inc(&vm->vm_params.vmc_params, vm->vm_user, 0);
> > -   user_put(vm->vm_user);
> > -
> > if (vm->vm_iev.ibuf.fd != -1) {
> > event_del(&vm->vm_iev.ev);
> > close(vm->vm_iev.ibuf.fd);
> > @@ -1243,7 +1240,6 @@ vm_remove(struct vmd_vm *vm, const char *caller)
> >
> > TAILQ_REMOVE(env->vmd_vms, vm, vm_entry);
> >
> > -   user_put(vm->vm_user);
> > vm_stop(vm, 0, caller);
> > free(vm);
> >  }
> > @@ -1286,7 +1282,6 @@ vm_register(struct privsep *ps, struct vmop_create_par
> > struct vmd_vm   *vm = NULL, *vm_parent = NULL;
> > struct vm_create_params *vcp = &vmc->vmc_params;
> > struct vmop_owner   *vmo = NULL;
> > -   struct vmd_user *usr = NULL;
> > uint32_t nid, rng;
> > unsigned int i, j;
> > struct vmd_switch   *sw;
> > @@ -1362,13 +1357,6 @@ vm_register(struct privsep *ps, struct 
> > vmop_create_par
> > }
> > }
> >
> > -   /* track active users */
> > -   if (uid != 0 && env->vmd_users != NULL &&
> > -   (usr = user_get(uid)) == NULL) {
> > -   log_warnx("could not add user");
> > -   goto fail;
> > -   }
> > -
> > if ((vm = calloc(1, sizeof(*vm))) == NULL)
> > goto fail;
> >
> > @@ -1379,7 +1367,6 @@ vm_register(struct privsep *ps, struct vmop_create_par
> > vm->vm_tty = -1;
> > vm->vm_receive_fd = -1;
> > vm->vm_state &= ~VM_STATE_PAUSED;
> > -   vm->vm_user = usr;
> >
> > for (i = 0; i < VMM_MAX_DISKS_PER_VM; i++)
> > for (j = 0; j < VM_MAX_BASE_PER_DISK; j++)
> > @@ -1903,104 +1890,6 @@ struct vmd_user *
> > return (NULL);
> >  }
> >
> > -struct vmd_user *
> > -user_get(uid_t uid)
> > -{
> > -   struct vmd_user *usr;
> > -
> > -   if (uid == 0)
> > -   return (NULL);
> > -
> > -   /* first try to find an existing user */
> > -   TAILQ_FOREACH(usr, env->vmd_users, usr_entry) {
> > -   if (usr->usr_id.uid == uid)
> > -   goto done;
> > -   }
> > -
> > -   if ((usr = calloc(1, sizeof(*usr))) == NULL) {
> > -   log_warn("could not allocate user");
> > -   return (NULL);
> > -   }
> > -
> > -   usr->usr_id.uid = uid;
> > -   usr->usr_id.gid = -1;
> > -   TAILQ_INSERT_TAIL(env->vmd_users, usr, usr_entry);
> > -
> > - done:
> > -   DPRINTF("%s: uid %d #%d +",
> > -   __func__, usr->usr_id.uid, usr->usr_refcnt + 1);
> > -   usr->usr_refcnt++;
> > -
> > -   return (usr);
> > -}
> > -
> > -void
> > -user_put(struct vmd_user *usr)
> > -{
> > -   if (usr == NULL)
> > -   return;
> > -
> > -   DPRINTF("%s: uid %d #%d -",
> > -   __func__, usr->usr_id.uid, usr->usr_refcnt - 1);
> > -
> > -   if (--usr->usr_refcnt > 0)
> > -   return;
> > -
> > -   TAILQ_REMOVE(env->vmd_users, usr, usr_entry);
> > -   free(usr);
> > -}
> > -
> > -void
> > -user_inc(struct vm_create_params *vcp, struct vmd_user *usr, int inc)
> > -{
> > -   char mem[FMT_SCALED_STRSIZE];
> > -
> > -   if (usr == NULL)
> > -   return;
> > -
> > -   /* increment or decrement counters */
> > -   inc = inc ? 1 : 

Re: Towards unlocking mmap(2) & munmap(2)

2022-10-20 Thread Martin Pieuchot
On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> Diff below adds a minimalist set of assertions to ensure proper locks
> are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> mmap(2) for anons and munmap(2).
> 
> Please test it with WITNESS enabled and report back.

New version of the diff that includes a lock/unlock dance  in 
uvm_map_teardown().  While grabbing this lock should not be strictly
necessary because no other reference to the map should exist when the
reaper is holding it, it helps make progress with asserts.  Grabbing
the lock is easy and it can also save us a lot of time if there is any
reference counting bugs (like we've discovered w/ vnode and swapping).

Please test and report back.

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  20 Oct 2022 14:09:30 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 20 Oct 2022 14:09:30 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.298
diff -u -p -r1.298 uvm_map.c
--- uvm/uvm_map.c   16 Oct 2022 16:16:37 -  1.298
+++ uvm/uvm_map.c   20 Oct 2022 14:09:31 -
@@ -491,6 +491,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +572,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1457,6 +1461,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1584,6 +1590,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1704,6 +1712,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1868,6 +1878,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1996,10 +2008,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return 0;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected entry. */
entry = uvm_map_entrybyaddr(&map->addr, start);
@@ -2526,6 +2535,8 @@ uvm_map_teardown(struct vm_map *map)
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
 
+   vm_map_lock(map);
+
/* Remove address selectors. */
uvm_addr_destro

Re: vmd: remove the user quota tracking

2022-10-05 Thread Matthew Martin
On Wed, Oct 05, 2022 at 05:03:16PM -0400, Dave Voutila wrote:
> Matthew Martin recently presented a patch on tech@ [1] fixing some missed
> scaling from when I converted vmd(8) to use bytes instead of megabytes
> everywhere. I finally found time to wade through the code it touches and
> am proposing we simply "tedu" the incomplete feature.
> 
> Does anyone use this? (And if so, how?)
> 
> I don't see much value in this framework and it only adds additional
> state to track. Users can be confined by limits associated in
> login.conf(5) for the most part. There are more interesting things to
> work on, so unless anyone speaks up I'll look for an OK to remove it.
> 
> -dv
> 
> [1] https://marc.info/?l=openbsd-tech&m=166346196317673&w=2

For what it's worth this works for me (I can use double-p's packer
builder with the diff). Thanks

> diff refs/heads/master refs/heads/vmd-user
> commit - bfe2092d87b190d9f89c4a6f2728a539b7f88233
> commit + e84ff2c7628a811e00044a447ad906d6e24beac0
> blob - 374d7de6629e072065b5c0232536c23c1e5bbbe0
> blob + a192223cf118e2a8764b24f965a15acbf8ae506f
> --- usr.sbin/vmd/config.c
> +++ usr.sbin/vmd/config.c
> @@ -98,12 +98,6 @@ config_init(struct vmd *env)
>   return (-1);
>   TAILQ_INIT(env->vmd_switches);
>   }
> - if (what & CONFIG_USERS) {
> - if ((env->vmd_users = calloc(1,
> - sizeof(*env->vmd_users))) == NULL)
> - return (-1);
> - TAILQ_INIT(env->vmd_users);
> - }
> 
>   return (0);
>  }
> @@ -238,13 +232,6 @@ config_setvm(struct privsep *ps, struct vmd_vm *vm, ui
>   return (EALREADY);
>   }
> 
> - /* increase the user reference counter and check user limits */
> - if (vm->vm_user != NULL && user_get(vm->vm_user->usr_id.uid) != NULL) {
> - user_inc(vcp, vm->vm_user, 1);
> - if (user_checklimit(vm->vm_user, vcp) == -1)
> - return (EPERM);
> - }
> -
>   /*
>* Rate-limit the VM so that it cannot restart in a loop:
>* if the VM restarts after less than VM_START_RATE_SEC seconds,
> blob - 2f3ac1a76f2c3e458919eca85c238a668c10422a
> blob + 755cbedb6a18502a87724502ec86e9e426961701
> --- usr.sbin/vmd/vmd.c
> +++ usr.sbin/vmd/vmd.c
> @@ -1188,9 +1188,6 @@ vm_stop(struct vmd_vm *vm, int keeptty, const char *ca
>   vm->vm_state &= ~(VM_STATE_RECEIVED | VM_STATE_RUNNING
>   | VM_STATE_SHUTDOWN);
> 
> - user_inc(&vm->vm_params.vmc_params, vm->vm_user, 0);
> - user_put(vm->vm_user);
> -
>   if (vm->vm_iev.ibuf.fd != -1) {
>   event_del(&vm->vm_iev.ev);
>   close(vm->vm_iev.ibuf.fd);
> @@ -1243,7 +1240,6 @@ vm_remove(struct vmd_vm *vm, const char *caller)
> 
>   TAILQ_REMOVE(env->vmd_vms, vm, vm_entry);
> 
> - user_put(vm->vm_user);
>   vm_stop(vm, 0, caller);
>   free(vm);
>  }
> @@ -1286,7 +1282,6 @@ vm_register(struct privsep *ps, struct vmop_create_par
>   struct vmd_vm   *vm = NULL, *vm_parent = NULL;
>   struct vm_create_params *vcp = &vmc->vmc_params;
>   struct vmop_owner   *vmo = NULL;
> - struct vmd_user *usr = NULL;
>   uint32_t nid, rng;
>   unsigned int i, j;
>   struct vmd_switch   *sw;
> @@ -1362,13 +1357,6 @@ vm_register(struct privsep *ps, struct vmop_create_par
>   }
>   }
> 
> - /* track active users */
> - if (uid != 0 && env->vmd_users != NULL &&
> - (usr = user_get(uid)) == NULL) {
> - log_warnx("could not add user");
> - goto fail;
> - }
> -
>   if ((vm = calloc(1, sizeof(*vm))) == NULL)
>   goto fail;
> 
> @@ -1379,7 +1367,6 @@ vm_register(struct privsep *ps, struct vmop_create_par
>   vm->vm_tty = -1;
>   vm->vm_receive_fd = -1;
>   vm->vm_state &= ~VM_STATE_PAUSED;
> - vm->vm_user = usr;
> 
>   for (i = 0; i < VMM_MAX_DISKS_PER_VM; i++)
>   for (j = 0; j < VM_MAX_BASE_PER_DISK; j++)
> @@ -1903,104 +1890,6 @@ struct vmd_user *
>   return (NULL);
>  }
> 
> -struct vmd_user *
> -user_get(uid_t uid)
> -{
> - struct vmd_user *usr;
> -
> - if (uid == 0)
> - return (NULL);
> -
> - /* first try to find an existing user */
> - TAILQ_FOREACH(usr, env->vmd_users, usr_entry) {
> - if (usr->usr_id.uid == uid)
> - goto done;

Re: [patch] Fix vmd for user VMs

2022-10-03 Thread Matthew Martin
On Sat, Sep 24, 2022 at 08:32:55AM -0400, Dave Voutila wrote:
> 
> Matthew Martin  writes:
> 
> > When vmd/vmctl switched to handling memory in bytes, seems a few places
> > for user VMs were missed. Additionally the first hunk removes the quota
> > hit if the VM will not be created.
> >
> 
> Thanks, I'll take a deeper look this week. I don't use the user quota
> pieces, so I'll need to read through some of this to confirm. If you
> don't hear from me by end of week (October) you're welcome to nudge me.

October nudge

> > diff --git config.c config.c
> > index 374d7de6629..425c901f36a 100644
> > --- config.c
> > +++ config.c
> > @@ -241,8 +241,10 @@ config_setvm(struct privsep *ps, struct vmd_vm *vm, 
> > uint32_t peerid, uid_t uid)
> > /* increase the user reference counter and check user limits */
> > if (vm->vm_user != NULL && user_get(vm->vm_user->usr_id.uid) != NULL) {
> > user_inc(vcp, vm->vm_user, 1);
> > -   if (user_checklimit(vm->vm_user, vcp) == -1)
> > +   if (user_checklimit(vm->vm_user, vcp) == -1) {
> > +   user_inc(vcp, vm->vm_user, 0);
> > return (EPERM);
> > +   }
> > }
> >
> > /*
> > diff --git vmd.c vmd.c
> > index 2f3ac1a76f2..a7687d6ce93 100644
> > --- vmd.c
> > +++ vmd.c
> > @@ -1966,7 +1966,7 @@ user_inc(struct vm_create_params *vcp, struct 
> > vmd_user *usr, int inc)
> > usr->usr_maxifs += vcp->vcp_nnics * inc;
> >
> > if (log_getverbose() > 1) {
> > -   (void)fmt_scaled(usr->usr_maxmem * 1024 * 1024, mem);
> > +   (void)fmt_scaled(usr->usr_maxmem, mem);
> > log_debug("%s: %c uid %d ref %d cpu %llu mem %s ifs %llu",
> > __func__, inc == 1 ? '+' : '-',
> > usr->usr_id.uid, usr->usr_refcnt,
> > diff --git vmd.h vmd.h
> > index 9010ad6eb9f..8be7db3d059 100644
> > --- vmd.h
> > +++ vmd.h
> > @@ -67,7 +67,7 @@
> >
> >  /* default user instance limits */
> >  #define VM_DEFAULT_USER_MAXCPU 4
> > -#define VM_DEFAULT_USER_MAXMEM 2048
> > +#define VM_DEFAULT_USER_MAXMEM 2L * 1024 * 1024 * 1024 /* 2 GiB */
> >  #define VM_DEFAULT_USER_MAXIFS 8
> >
> >  /* vmd -> vmctl error codes */



[patch] Fix vmd for user VMs

2022-09-17 Thread Matthew Martin
When vmd/vmctl switched to handling memory in bytes, it seems a few places
for user VMs were missed. Additionally the first hunk removes the quota
hit if the VM will not be created.


diff --git config.c config.c
index 374d7de6629..425c901f36a 100644
--- config.c
+++ config.c
@@ -241,8 +241,10 @@ config_setvm(struct privsep *ps, struct vmd_vm *vm, 
uint32_t peerid, uid_t uid)
/* increase the user reference counter and check user limits */
if (vm->vm_user != NULL && user_get(vm->vm_user->usr_id.uid) != NULL) {
user_inc(vcp, vm->vm_user, 1);
-   if (user_checklimit(vm->vm_user, vcp) == -1)
+   if (user_checklimit(vm->vm_user, vcp) == -1) {
+   user_inc(vcp, vm->vm_user, 0);
return (EPERM);
+   }
}
 
/*
diff --git vmd.c vmd.c
index 2f3ac1a76f2..a7687d6ce93 100644
--- vmd.c
+++ vmd.c
@@ -1966,7 +1966,7 @@ user_inc(struct vm_create_params *vcp, struct vmd_user 
*usr, int inc)
usr->usr_maxifs += vcp->vcp_nnics * inc;
 
if (log_getverbose() > 1) {
-   (void)fmt_scaled(usr->usr_maxmem * 1024 * 1024, mem);
+   (void)fmt_scaled(usr->usr_maxmem, mem);
log_debug("%s: %c uid %d ref %d cpu %llu mem %s ifs %llu",
__func__, inc == 1 ? '+' : '-',
usr->usr_id.uid, usr->usr_refcnt,
diff --git vmd.h vmd.h
index 9010ad6eb9f..8be7db3d059 100644
--- vmd.h
+++ vmd.h
@@ -67,7 +67,7 @@
 
 /* default user instance limits */
 #define VM_DEFAULT_USER_MAXCPU 4
-#define VM_DEFAULT_USER_MAXMEM 2048
+#define VM_DEFAULT_USER_MAXMEM 2L * 1024 * 1024 * 1024 /* 2 GiB */
 #define VM_DEFAULT_USER_MAXIFS 8
 
 /* vmd -> vmctl error codes */



Re: Towards unlocking mmap(2) & munmap(2)

2022-09-14 Thread Martin Pieuchot
On 14/09/22(Wed) 15:47, Klemens Nanni wrote:
> On 14.09.22 18:55, Mike Larkin wrote:
> > On Sun, Sep 11, 2022 at 12:26:31PM +0200, Martin Pieuchot wrote:
> > > Diff below adds a minimalist set of assertions to ensure proper locks
> > > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > > mmap(2) for anons and munmap(2).
> > > 
> > > Please test it with WITNESS enabled and report back.
> > > 
> > 
> > Do you want this tested in conjunction with the aiodoned diff or by itself?
> 
> This diff looks like a subset of the previous uvm lock assertion diff
> that came out of the previous "unlock mmap(2) for anonymous mappings"
> thread[0].
> 
> https://marc.info/?l=openbsd-tech&m=164423248318212&w=2
> 
> It didn't land eventually, I **think** syzcaller was a blocker which we
> only realised once it was committed and picked up by syzcaller.
> 
> Now it's been some time and more UVM changes landed, but the majority
> (if not all) lock assertions and comments from the above linked diff
> should still hold true.
> 
> mpi, I can dust off and resend that diff, If you want.
> Nothing for release, but perhaps it helps testing your current efforts.

Please hold on, this diff is known to trigger a KASSERT() with witness.
I'll send an updated version soon.

Thank you for disregarding this diff for the moment.



Towards unlocking mmap(2) & munmap(2)

2022-09-11 Thread Martin Pieuchot
Diff below adds a minimalist set of assertions to ensure proper locks
are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
mmap(2) for anons and munmap(2).

Please test it with WITNESS enabled and report back.
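
The new assert helpers themselves aren't visible in the excerpt below.  As a
rough sketch of what they boil down to (my reading, not necessarily the exact
implementation in the diff), they pick the rwlock or mutex assertion depending
on whether the map is interrupt safe:

/*
 * Sketch only: possible shape of the assertion helpers.  Assumes
 * process maps are protected by the map's rwlock and intr-safe
 * kernel maps by its mutex, as elsewhere in uvm_map.c.
 */
void
vm_map_assert_anylock(struct vm_map *map)
{
	if ((map->flags & VM_MAP_INTRSAFE) == 0)
		rw_assert_anylock(&map->lock);
	else
		MUTEX_ASSERT_LOCKED(&map->mtx);
}

void
vm_map_assert_wrlock(struct vm_map *map)
{
	if ((map->flags & VM_MAP_INTRSAFE) == 0)
		rw_assert_wrlock(&map->lock);
	else
		MUTEX_ASSERT_LOCKED(&map->mtx);
}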

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  11 Sep 2022 09:08:10 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 11 Sep 2022 08:57:35 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.294
diff -u -p -r1.294 uvm_map.c
--- uvm/uvm_map.c   15 Aug 2022 15:53:45 -  1.294
+++ uvm/uvm_map.c   11 Sep 2022 09:37:44 -
@@ -162,6 +162,8 @@ int  uvm_map_inentry_recheck(u_long, v
 struct p_inentry *);
 boolean_t   uvm_map_inentry_fix(struct proc *, struct p_inentry *,
 vaddr_t, int (*)(vm_map_entry_t), u_long);
+boolean_t   uvm_map_is_stack_remappable(struct vm_map *,
+vaddr_t, vsize_t);
 /*
  * Tree management functions.
  */
@@ -491,6 +493,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +574,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1446,6 +1452,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1573,6 +1581,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1692,6 +1702,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1843,6 +1855,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1971,10 +1985,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected entry. */
entry = uvm_map_entrybyaddr(&map->addr, start);
@@ -4027,6 +4038,8 @@ uvm_map_checkprot(struct vm_map *map, va
 {
struct vm_map_entry *entry;
 
+   vm_map_assert_anylock(map);
+
if (start < map->min_offset || end > map->max_offset || start > end)
return FALSE;
if (start == end)
@@ -4886,6 +4899,7 @@ uvm_map_freelist_update(struct vm_map *m
 vaddr_t b_start, vaddr_t b_end, vaddr_t s_start, vaddr_t s_end, int flags)
 {
KDASSERT(b_end >= b_star

uvm_vnode locking & documentation

2022-09-10 Thread Martin Pieuchot
Previous fix from gnezdo@ pointed out that `u_flags' accesses should be
serialized by `vmobjlock'.  Diff below documents this and fixes the
remaining places where the lock isn't yet taken.  One exception still
remains, the first loop of uvm_vnp_sync().  This cannot be fixed right
now due to possible deadlocks but that's not a reason for not documenting
& fixing the rest of this file.

This has been tested on amd64 and arm64.

Comments?  Oks?

Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.128
diff -u -p -r1.128 uvm_vnode.c
--- uvm/uvm_vnode.c 10 Sep 2022 16:14:36 -  1.128
+++ uvm/uvm_vnode.c 10 Sep 2022 18:23:57 -
@@ -68,11 +68,8 @@
  * we keep a simpleq of vnodes that are currently being sync'd.
  */
 
-LIST_HEAD(uvn_list_struct, uvm_vnode);
-struct uvn_list_struct uvn_wlist;  /* writeable uvns */
-
-SIMPLEQ_HEAD(uvn_sq_struct, uvm_vnode);
-struct uvn_sq_struct uvn_sync_q;   /* sync'ing uvns */
+LIST_HEAD(, uvm_vnode) uvn_wlist;  /* [K] writeable uvns */
+SIMPLEQ_HEAD(, uvm_vnode)  uvn_sync_q; /* [S] sync'ing uvns */
 struct rwlock uvn_sync_lock;   /* locks sync operation */
 
 extern int rebooting;
@@ -144,41 +141,40 @@ uvn_attach(struct vnode *vp, vm_prot_t a
struct partinfo pi;
u_quad_t used_vnode_size = 0;
 
-   /* first get a lock on the uvn. */
-   while (uvn->u_flags & UVM_VNODE_BLOCKED) {
-   uvn->u_flags |= UVM_VNODE_WANTED;
-   tsleep_nsec(uvn, PVM, "uvn_attach", INFSLP);
-   }
-
/* if we're mapping a BLK device, make sure it is a disk. */
if (vp->v_type == VBLK && bdevsw[major(vp->v_rdev)].d_type != D_DISK) {
return NULL;
}
 
+   /* first get a lock on the uvn. */
+   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
+   while (uvn->u_flags & UVM_VNODE_BLOCKED) {
+   uvn->u_flags |= UVM_VNODE_WANTED;
+   rwsleep_nsec(uvn, uvn->u_obj.vmobjlock, PVM, "uvn_attach",
+   INFSLP);
+   }
+
/*
 * now uvn must not be in a blocked state.
 * first check to see if it is already active, in which case
 * we can bump the reference count, check to see if we need to
 * add it to the writeable list, and then return.
 */
-   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
if (uvn->u_flags & UVM_VNODE_VALID) {   /* already active? */
KASSERT(uvn->u_obj.uo_refs > 0);
 
uvn->u_obj.uo_refs++;   /* bump uvn ref! */
-   rw_exit(uvn->u_obj.vmobjlock);
 
/* check for new writeable uvn */
if ((accessprot & PROT_WRITE) != 0 &&
(uvn->u_flags & UVM_VNODE_WRITEABLE) == 0) {
-   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
-   /* we are now on wlist! */
uvn->u_flags |= UVM_VNODE_WRITEABLE;
+   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
}
+   rw_exit(uvn->u_obj.vmobjlock);
 
return (&uvn->u_obj);
}
-   rw_exit(uvn->u_obj.vmobjlock);
 
/*
 * need to call VOP_GETATTR() to get the attributes, but that could
@@ -189,6 +185,7 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 * it.
 */
uvn->u_flags = UVM_VNODE_ALOCK;
+   rw_exit(uvn->u_obj.vmobjlock);
 
if (vp->v_type == VBLK) {
/*
@@ -213,9 +210,11 @@ uvn_attach(struct vnode *vp, vm_prot_t a
}
 
if (result != 0) {
+   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
if (uvn->u_flags & UVM_VNODE_WANTED)
wakeup(uvn);
uvn->u_flags = 0;
+   rw_exit(uvn->u_obj.vmobjlock);
return NULL;
}
 
@@ -236,18 +235,19 @@ uvn_attach(struct vnode *vp, vm_prot_t a
uvn->u_nio = 0;
uvn->u_size = used_vnode_size;
 
-   /* if write access, we need to add it to the wlist */
-   if (accessprot & PROT_WRITE) {
-   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
-   uvn->u_flags |= UVM_VNODE_WRITEABLE;/* we are on wlist! */
-   }
-
/*
 * add a reference to the vnode.   this reference will stay as long
 * as there is a valid mapping of the vnode.   dropped when the
 * reference count goes to zero.
 */
vref(vp);
+
+   /* if write access, we need to add it to the wlist */
+   if (accessprot & PROT_WRITE) {
+   uvn->u_flags |= UVM_VNODE_WRITEABLE;
+   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
+   }
+
if (oldflags & UVM_VNODE_WANTED)
wakeup(uvn);
 
@@ -273,6 +273,7 @@ uvn_reference(struct uvm_object *uobj)
struct uvm_vnode *uvn = (struct uvm_vnode *) uobj;
 #endif
 
+

Re: Unmap page in uvm_anon_release()

2022-09-10 Thread Martin Pieuchot
On 10/09/22(Sat) 15:12, Mark Kettenis wrote:
> > Date: Sat, 10 Sep 2022 14:18:02 +0200
> > From: Martin Pieuchot 
> > 
> > Diff below fixes a bug exposed when swapping on arm64.  When an anon is
> > released make sure the all the pmap references to the related page are
> > removed.
> 
> I'm a little bit puzzled by this.  So these pages are still mapped
> even though there are no references to the anon anymore?

I don't know.  I just realised that all the code paths leading to
uvm_pagefree() get rid of the pmap references by calling page_protect()
except a couple of them in the aiodone daemon and the clustering code in
the pager.

This can't hurt and makes the existing code coherent.  Maybe it just
hides the bug, I don't know.



Unmap page in uvm_anon_release()

2022-09-10 Thread Martin Pieuchot
Diff below fixes a bug exposed when swapping on arm64.  When an anon is
released, make sure all the pmap references to the related page are
removed.

We could move the pmap_page_protect(pg, PROT_NONE) inside uvm_pagefree()
to avoid future issues, but that's for a later refactoring.
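
For illustration, the later refactoring hinted at above could fold the call
into the free path itself, roughly like this (sketch only, not part of this
diff):

void
uvm_pagefree(struct vm_page *pg)
{
	/* drop any remaining pmap-level mappings of the page first */
	pmap_page_protect(pg, PROT_NONE);

	/*
	 * ... existing uvm_pagefree() body: detach the page from its
	 * object or anon, fix up the page queues and counters, and put
	 * it back on the free list ...
	 */
}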

With this diff I can no longer reproduce the SIGBUS issue on the
rockpro64 and swapping is stable as long as I/O from sdmmc(4) works.

This should be good enough to commit the diff that got reverted, but I'll
wait to be sure there's no regression.

ok?

Index: uvm/uvm_anon.c
===
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
retrieving revision 1.54
diff -u -p -r1.54 uvm_anon.c
--- uvm/uvm_anon.c  26 Mar 2021 13:40:05 -  1.54
+++ uvm/uvm_anon.c  10 Sep 2022 12:10:34 -
@@ -255,6 +255,7 @@ uvm_anon_release(struct vm_anon *anon)
KASSERT(anon->an_ref == 0);
 
uvm_lock_pageq();
+   pmap_page_protect(pg, PROT_NONE);
uvm_pagefree(pg);
uvm_unlock_pageq();
KASSERT(anon->an_page == NULL);
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 10 Sep 2022 12:10:34 -
@@ -396,7 +396,6 @@ uvmfault_anonget(struct uvm_faultinfo *u
 * anon and try again.
 */
if (pg->pg_flags & PG_RELEASED) {
-   pmap_page_protect(pg, PROT_NONE);
KASSERT(anon->an_ref == 0);
/*
 * Released while we had unlocked amap.



Re: ps(1): add -d (descendancy) option to display parent/child process relationships

2022-09-01 Thread Martin Schröder
On Thu, 1 Sep 2022 at 05:38, Job Snijders wrote:
> Some ps(1) implementations have an '-d' ('descendancy') option. Through
> ASCII art parent/child process relationships are grouped and displayed.
>
> Thoughts?

gnu ps has

-d Select all processes except session leaders.

and

   f  ASCII art process hierarchy (forest).

   --forest
  ASCII art process tree.

Best
Martin



Re: ps(1): add -d (descendancy) option to display parent/child process relationships

2022-09-01 Thread Martin Pieuchot
On 01/09/22(Thu) 03:37, Job Snijders wrote:
> Dear all,
> 
> Some ps(1) implementations have an '-d' ('descendancy') option. Through
> ASCII art parent/child process relationships are grouped and displayed.
> Here is an example:
> 
> $ ps ad -O ppid,user
>   PID  PPID USER TT  STATTIME COMMAND
> 18180 12529 job  pb  I+p  0:00.01 `-- -sh (sh)
> 26689 56460 job  p3  Ip   0:00.01   `-- -ksh (ksh)
>  5153 26689 job  p3  I+p  0:40.18 `-- mutt
> 62046 25272 job  p4  Sp   0:00.25   `-- -ksh (ksh)
> 61156 62046 job  p4  R+/0 0:00.00 `-- ps -ad -O ppid
> 26816  2565 job  p5  Ip   0:00.01   `-- -ksh (ksh)
> 79431 26816 root p5  Ip   0:00.16 `-- /bin/ksh
> 43915 79431 _rpki-cl p5  S+pU 0:06.97   `-- rpki-client
> 70511 43915 _rpki-cl p5  I+pU 0:01.26 |-- rpki-client: parser 
> (rpki-client)
> 96992 43915 _rpki-cl p5  I+pU 0:00.00 |-- rpki-client: rsync 
> (rpki-client)
> 49160 43915 _rpki-cl p5  S+p  0:01.52 |-- rpki-client: http 
> (rpki-client)
> 99329 43915 _rpki-cl p5  S+p  0:03.20 `-- rpki-client: rrdp 
> (rpki-client)
> 
> The functionality is similar to pstree(1) in the ports collection.
> 
> The below changeset borrows heavily from the following two
> implementations:
> 
> 
> https://github.com/freebsd/freebsd-src/commit/044fce530f89a819827d351de364d208a30e9645.patch
> 
> https://github.com/NetBSD/src/commit/b82f6d00d93d880d3976c4f1e88c33d88a8054ad.patch
> 
> Thoughts?

I'd love to have such feature in base.

> Index: extern.h
> ===
> RCS file: /cvs/src/bin/ps/extern.h,v
> retrieving revision 1.23
> diff -u -p -r1.23 extern.h
> --- extern.h  5 Jan 2022 04:10:36 -   1.23
> +++ extern.h  1 Sep 2022 03:31:36 -
> @@ -44,44 +44,44 @@ extern VAR var[];
>  extern VARENT *vhead;
>  
>  __BEGIN_DECLS
> -void  command(const struct kinfo_proc *, VARENT *);
> -void  cputime(const struct kinfo_proc *, VARENT *);
> +void  command(const struct pinfo *, VARENT *);
> +void  cputime(const struct pinfo *, VARENT *);
>  int   donlist(void);
> -void  elapsed(const struct kinfo_proc *, VARENT *);
> +void  elapsed(const struct pinfo *, VARENT *);
>  doublegetpcpu(const struct kinfo_proc *);
> -doublegetpmem(const struct kinfo_proc *);
> -void  gname(const struct kinfo_proc *, VARENT *);
> -void  supgid(const struct kinfo_proc *, VARENT *);
> -void  supgrp(const struct kinfo_proc *, VARENT *);
> -void  logname(const struct kinfo_proc *, VARENT *);
> -void  longtname(const struct kinfo_proc *, VARENT *);
> -void  lstarted(const struct kinfo_proc *, VARENT *);
> -void  maxrss(const struct kinfo_proc *, VARENT *);
> +doublegetpmem(const struct pinfo *);
> +void  gname(const struct pinfo *, VARENT *);
> +void  supgid(const struct pinfo *, VARENT *);
> +void  supgrp(const struct pinfo *, VARENT *);
> +void  logname(const struct pinfo *, VARENT *);
> +void  longtname(const struct pinfo *, VARENT *);
> +void  lstarted(const struct pinfo *, VARENT *);
> +void  maxrss(const struct pinfo *, VARENT *);
>  void  nlisterr(struct nlist *);
> -void  p_rssize(const struct kinfo_proc *, VARENT *);
> -void  pagein(const struct kinfo_proc *, VARENT *);
> +void  p_rssize(const struct pinfo *, VARENT *);
> +void  pagein(const struct pinfo *, VARENT *);
>  void  parsefmt(char *);
> -void  pcpu(const struct kinfo_proc *, VARENT *);
> -void  pmem(const struct kinfo_proc *, VARENT *);
> -void  pri(const struct kinfo_proc *, VARENT *);
> +void  pcpu(const struct pinfo *, VARENT *);
> +void  pmem(const struct pinfo *, VARENT *);
> +void  pri(const struct pinfo *, VARENT *);
>  void  printheader(void);
> -void  pvar(const struct kinfo_proc *kp, VARENT *);
> -void  pnice(const struct kinfo_proc *kp, VARENT *);
> -void  rgname(const struct kinfo_proc *, VARENT *);
> -void  rssize(const struct kinfo_proc *, VARENT *);
> -void  runame(const struct kinfo_proc *, VARENT *);
> +void  pvar(const struct pinfo *, VARENT *);
> +void  pnice(const struct pinfo *, VARENT *);
> +void  rgname(const struct pinfo *, VARENT *);
> +void  rssize(const struct pinfo *, VARENT *);
> +void  runame(const struct pinfo *, VARENT *);
>  void  showkey(void);
> -void  started(const struct kinfo_proc *, VARENT *);
> -void  printstate(const struct kinfo_proc *, VARENT *);
> -void  printpledge(const struct kinfo_proc *, VARENT *);
> -void  tdev(const struct kinfo_proc *, VARENT *);
> -void  tname(const struct kinfo_proc *, VARENT *);
> -void  tsize(const struct kinfo_proc *, VARENT *);
> -void  dsize(const struct kinfo_proc *, VARENT *);
> -void  ssize(const struct kinfo_proc *, VARENT *);
> -void  ucomm(const struct kinfo_proc *, VARENT *);
> -void  curwd(const struct kinfo_proc *, VARENT *);
> -void  euname(const struct kinfo_proc *, VARENT *);
> -void  vsize(const struct kinfo_proc *, VARENT 

Re: pdaemon locking tweak

2022-08-30 Thread Martin Pieuchot
On 30/08/22(Tue) 15:28, Jonathan Gray wrote:
> On Mon, Aug 29, 2022 at 01:46:20PM +0200, Martin Pieuchot wrote:
> > Diff below refactors the pdaemon's locking by introducing a new *trylock()
> > function for a given page.  This is shamelessly stolen from NetBSD.
> > 
> > This is part of my ongoing effort to untangle the locks used by the page
> > daemon.
> > 
> > ok?
> 
> if (pmap_is_referenced(p)) {
>   uvm_pageactivate(p);
> 
> is no longer under held slock.  Which I believe is intended,
> just not obvious looking at the diff.
> 
> The page queue is already locked on entry to uvmpd_scan_inactive()

Thanks for spotting this.  Indeed the locking required for
uvm_pageactivate() is different in my local tree.  For now
let's keep the existing order of operations.

Updated diff below.

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   30 Aug 2022 08:30:58 -  1.103
+++ uvm/uvm_pdaemon.c   30 Aug 2022 08:39:19 -
@@ -101,6 +101,7 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
+struct rwlock  *uvmpd_trylockowner(struct vm_page *);
 void   uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
@@ -367,6 +368,34 @@ uvm_aiodone_daemon(void *arg)
}
 }
 
+/*
+ * uvmpd_trylockowner: trylock the page's owner.
+ *
+ * => return the locked rwlock on success.  otherwise, return NULL.
+ */
+struct rwlock *
+uvmpd_trylockowner(struct vm_page *pg)
+{
+
+   struct uvm_object *uobj = pg->uobject;
+   struct rwlock *slock;
+
+   if (uobj != NULL) {
+   slock = uobj->vmobjlock;
+   } else {
+   struct vm_anon *anon = pg->uanon;
+
+   KASSERT(anon != NULL);
+   slock = anon->an_lock;
+   }
+
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
+   return NULL;
+   }
+
+   return slock;
+}
+
 
 /*
  * uvmpd_dropswap: free any swap allocated to this page.
@@ -474,51 +503,43 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 
anon = p->uanon;
uobj = p->uobject;
-   if (p->pg_flags & PQ_ANON) {
+
+   /*
+* first we attempt to lock the object that this page
+* belongs to.  if our attempt fails we skip on to
+* the next page (no harm done).  it is important to
+* "try" locking the object as we are locking in the
+* wrong order (pageq -> object) and we don't want to
+* deadlock.
+*/
+   slock = uvmpd_trylockowner(p);
+   if (slock == NULL) {
+   continue;
+   }
+
+   /*
+* move referenced pages back to active queue
+* and skip to next page.
+*/
+   if (pmap_is_referenced(p)) {
+   uvm_pageactivate(p);
+   rw_exit(slock);
+   uvmexp.pdreact++;
+   continue;
+   }
+
+   if (p->pg_flags & PG_BUSY) {
+   rw_exit(slock);
+   uvmexp.pdbusy++;
+   continue;
+   }
+
+   /* does the page belong to an object? */
+   if (uobj != NULL) {
+   uvmexp.pdobscan++;
+   } else {
KASSERT(anon != NULL);
-   slock = anon->an_lock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   /* lock failed, skip this page */
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to next page.
-*/
-   if (pmap_is_referenced(p)) {
-   uvm_pageactivate(p);
-   rw_exit(slock);
-   uvmexp.pdreact++;
-   continue;
-   }
-   if (p->pg_flags & PG_BUSY) {
- 

uvmpd_dropswap()

2022-08-29 Thread Martin Pieuchot
Small refactoring to introduce uvmpd_dropswap().  This will make an
upcoming rewrite of the pdaemon smaller & easier to review :o)

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   22 Aug 2022 12:03:32 -  1.102
+++ uvm/uvm_pdaemon.c   29 Aug 2022 11:55:52 -
@@ -105,6 +105,7 @@ voiduvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
+void   uvmpd_dropswap(struct vm_page *);
 
 /*
  * uvm_wait: wait (sleep) for the page daemon to free some pages
@@ -367,6 +368,23 @@ uvm_aiodone_daemon(void *arg)
 }
 
 
+/*
+ * uvmpd_dropswap: free any swap allocated to this page.
+ *
+ * => called with owner locked.
+ */
+void
+uvmpd_dropswap(struct vm_page *pg)
+{
+   struct vm_anon *anon = pg->uanon;
+
+   if ((pg->pg_flags & PQ_ANON) && anon->an_swslot) {
+   uvm_swap_free(anon->an_swslot, 1);
+   anon->an_swslot = 0;
+   } else if (pg->pg_flags & PQ_AOBJ) {
+   uao_dropswap(pg->uobject, pg->offset >> PAGE_SHIFT);
+   }
+}
 
 /*
  * uvmpd_scan_inactive: scan an inactive list for pages to clean or free.
@@ -566,16 +584,7 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
KASSERT(uvmexp.swpginuse <= uvmexp.swpages);
if ((p->pg_flags & PQ_SWAPBACKED) &&
uvmexp.swpginuse == uvmexp.swpages) {
-
-   if ((p->pg_flags & PQ_ANON) &&
-   p->uanon->an_swslot) {
-   uvm_swap_free(p->uanon->an_swslot, 1);
-   p->uanon->an_swslot = 0;
-   }
-   if (p->pg_flags & PQ_AOBJ) {
-   uao_dropswap(p->uobject,
-p->offset >> PAGE_SHIFT);
-   }
+   uvmpd_dropswap(p);
}
 
/*
@@ -599,16 +608,7 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
if (swap_backed) {
/* free old swap slot (if any) */
-   if (anon) {
-   if (anon->an_swslot) {
-   uvm_swap_free(anon->an_swslot,
-   1);
-   anon->an_swslot = 0;
-   }
-   } else {
-   uao_dropswap(uobj,
-p->offset >> PAGE_SHIFT);
-   }
+   uvmpd_dropswap(p);
 
/* start new cluster (if necessary) */
if (swslot == 0) {



pdaemon locking tweak

2022-08-29 Thread Martin Pieuchot
Diff below refactors the pdaemon's locking by introducing a new *trylock()
function for a given page.  This is shamelessly stolen from NetBSD.

This is part of my ongoing effort to untangle the locks used by the page
daemon.

ok?

Index: uvm//uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_pdaemon.c
--- uvm//uvm_pdaemon.c  22 Aug 2022 12:03:32 -  1.102
+++ uvm//uvm_pdaemon.c  29 Aug 2022 11:36:59 -
@@ -101,6 +101,7 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
+struct rwlock  *uvmpd_trylockowner(struct vm_page *);
 void   uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
@@ -367,6 +368,34 @@ uvm_aiodone_daemon(void *arg)
 }
 
 
+/*
+ * uvmpd_trylockowner: trylock the page's owner.
+ *
+ * => return the locked rwlock on success.  otherwise, return NULL.
+ */
+struct rwlock *
+uvmpd_trylockowner(struct vm_page *pg)
+{
+
+   struct uvm_object *uobj = pg->uobject;
+   struct rwlock *slock;
+
+   if (uobj != NULL) {
+   slock = uobj->vmobjlock;
+   } else {
+   struct vm_anon *anon = pg->uanon;
+
+   KASSERT(anon != NULL);
+   slock = anon->an_lock;
+   }
+
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
+   return NULL;
+   }
+
+   return slock;
+}
+
 
 /*
  * uvmpd_scan_inactive: scan an inactive list for pages to clean or free.
@@ -454,53 +483,44 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
uvmexp.pdscans++;
nextpg = TAILQ_NEXT(p, pageq);
 
+   /*
+* move referenced pages back to active queue
+* and skip to next page.
+*/
+   if (pmap_is_referenced(p)) {
+   uvm_pageactivate(p);
+   uvmexp.pdreact++;
+   continue;
+   }
+
anon = p->uanon;
uobj = p->uobject;
-   if (p->pg_flags & PQ_ANON) {
+
+   /*
+* first we attempt to lock the object that this page
+* belongs to.  if our attempt fails we skip on to
+* the next page (no harm done).  it is important to
+* "try" locking the object as we are locking in the
+* wrong order (pageq -> object) and we don't want to
+* deadlock.
+*/
+   slock = uvmpd_trylockowner(p);
+   if (slock == NULL) {
+   continue;
+   }
+
+   if (p->pg_flags & PG_BUSY) {
+   rw_exit(slock);
+   uvmexp.pdbusy++;
+   continue;
+   }
+
+   /* does the page belong to an object? */
+   if (uobj != NULL) {
+   uvmexp.pdobscan++;
+   } else {
KASSERT(anon != NULL);
-   slock = anon->an_lock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   /* lock failed, skip this page */
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to next page.
-*/
-   if (pmap_is_referenced(p)) {
-   uvm_pageactivate(p);
-   rw_exit(slock);
-   uvmexp.pdreact++;
-   continue;
-   }
-   if (p->pg_flags & PG_BUSY) {
-   rw_exit(slock);
-   uvmexp.pdbusy++;
-   continue;
-   }
uvmexp.pdanscan++;
-   } else {
-   KASSERT(uobj != NULL);
-   slock = uobj->vmobjlock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to

Simplify locking code in pdaemon

2022-08-18 Thread Martin Pieuchot
Use a "slock" variable as done in multiple places to simplify the code.
The locking stays the same.  This is just a first step to simplify this
mess.

Also get rid of the return value of the function; it is never checked.

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.101
diff -u -p -r1.101 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   28 Jun 2022 19:31:30 -  1.101
+++ uvm/uvm_pdaemon.c   18 Aug 2022 10:44:52 -
@@ -102,7 +102,7 @@ extern void drmbackoff(long);
  */
 
 void   uvmpd_scan(struct uvm_pmalloc *);
-boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
+void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -377,17 +377,16 @@ uvm_aiodone_daemon(void *arg)
  * => we handle the building of swap-backed clusters
  * => we return TRUE if we are exiting because we met our target
  */
-
-boolean_t
+void
 uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
-   boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
+   struct rwlock *slock;
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -402,7 +401,6 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
swslot = 0;
swnpages = swcpages = 0;
-   free = 0;
dirtyreacts = 0;
p = NULL;
 
@@ -431,18 +429,14 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
uobj = NULL;
anon = NULL;
-
if (p) {
/*
-* update our copy of "free" and see if we've met
-* our target
+* see if we've met our target
 */
free = uvmexp.free - BUFPAGES_DEFICIT;
if (((pma == NULL || (pma->pm_flags & UVM_PMA_FREED)) &&
(free + uvmexp.paging >= uvmexp.freetarg << 2)) ||
dirtyreacts == UVMPD_NUMDIRTYREACTS) {
-   retval = TRUE;
-
if (swslot == 0) {
/* exit now if no swap-i/o pending */
break;
@@ -450,9 +444,9 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 
/* set p to null to signal final swap i/o */
p = NULL;
+   nextpg = NULL;
}
}
-
if (p) {/* if (we have a new page to consider) */
/*
 * we are below target and have a new page to consider.
@@ -460,11 +454,12 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
uvmexp.pdscans++;
nextpg = TAILQ_NEXT(p, pageq);
 
+   anon = p->uanon;
+   uobj = p->uobject;
if (p->pg_flags & PQ_ANON) {
-   anon = p->uanon;
KASSERT(anon != NULL);
-   if (rw_enter(anon->an_lock,
-   RW_WRITE|RW_NOSLEEP)) {
+   slock = anon->an_lock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
/* lock failed, skip this page */
continue;
}
@@ -474,23 +469,20 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
if (pmap_is_referenced(p)) {
uvm_pageactivate(p);
-   rw_exit(anon->an_lock);
+   rw_exit(slock);
uvmexp.pdreact++;
continue;
}
if (p->pg_flags & PG_BUSY) {
-   rw_exit(anon->an_lock);
+   rw_exit(slock);
uvmexp.pdbusy++;
-   /* someone else owns page, skip it */
continue;
}
uvmexp.pdanscan++;
} else {
-   uobj = p->uobject;
  

Fix a race in uvm_pseg_release()

2022-08-18 Thread Martin Pieuchot
The lock must be grabbed before iterating on the global array, ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.88
diff -u -p -r1.88 uvm_pager.c
--- uvm/uvm_pager.c 15 Aug 2022 03:21:04 -  1.88
+++ uvm/uvm_pager.c 18 Aug 2022 10:31:16 -
@@ -209,6 +209,7 @@ uvm_pseg_release(vaddr_t segaddr)
struct uvm_pseg *pseg;
vaddr_t va = 0;
 
+   mtx_enter(&uvm_pseg_lck);
for (pseg = &psegs[0]; pseg != &psegs[PSEG_NUMSEGS]; pseg++) {
if (pseg->start <= segaddr &&
segaddr < pseg->start + MAX_PAGER_SEGS * MAXBSIZE)
@@ -222,7 +223,6 @@ uvm_pseg_release(vaddr_t segaddr)
/* test for no remainder */
KDASSERT(segaddr == pseg->start + id * MAXBSIZE);
 
-   mtx_enter(&uvm_pseg_lck);
 
KASSERT(UVM_PSEG_INUSE(pseg, id));
 



Re: uvm_swap: introduce uvm_swap_data_lock

2022-08-17 Thread Martin Pieuchot
On 16/01/22(Sun) 15:35, Martin Pieuchot wrote:
> On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> > The diff below does two things: it adds a uvm_swap_data_lock mutex and
> > trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()
> 
> Why is it enough?  Which fields is the lock protecting in these
> function?  Is it `uvmexp.swpages', could that be documented?  

It is documented in the diff below.

> 
> What about `nswapdev'?  Why is the rwlock grabbed before reading it in
> sys_swapctl()?

Because it is always modified with the lock, I added some documentation.

> What about `swpginuse'?

This is still under KERNEL_LOCK(), documented below.

> If the mutex/rwlock are used to protect the global `swap_priority' could
> that be also documented?  Once this is documented it should be trivial to
> see that some places are missing some locking.  Is it intentional?
> 
> > The uvm_swap_data_lock protects all swap data structures, so needs to be
> > grabbed a few times, many of them already documented in the comments.
> > 
> > For review, I suggest comparing to what NetBSD did and also going
> > through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> > and check that they are properly locked when called, or that there is
> > the KERNEL_LOCK() in place when swap data structures are manipulated.
> 
> I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
> the differences with NetBSD.

Done.

> > In swapmount() I introduced locking since that's needed to be able to
> > assert that the proper locks are held in swaplist_{insert,find,trim}.
> 
> Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
> What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

This has been done as part of another change.  This diff uses an atomic
operation to increase `nswget' in case multiple threads fault on a page
in swap at the same time.
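
That hunk isn't visible in the excerpt below; the change is simply of this
shape (sketch, using the existing atomic_inc_int(9) interface):

	/* before: plain increment, racy once several threads can fault
	 * on swapped-out pages concurrently */
	uvmexp.nswget++;

	/* after: */
	atomic_inc_int(&uvmexp.nswget);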

Updated diff below, ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_swap.c
--- uvm/uvm_swap.c  6 Aug 2022 13:44:04 -   1.163
+++ uvm/uvm_swap.c  17 Aug 2022 11:46:20 -
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -84,13 +85,16 @@
  * the system maintains a global data structure describing all swap
  * partitions/files.   there is a sorted LIST of "swappri" structures
  * which describe "swapdev"'s at that priority.   this LIST is headed
- * by the "swap_priority" global var.each "swappri" contains a 
+ * by the "swap_priority" global var.each "swappri" contains a
  * TAILQ of "swapdev" structures at that priority.
  *
  * locking:
  *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
  *system call and prevents the swap priority list from changing
  *while we are in the middle of a system call (e.g. SWAP_STATS).
+ *  - uvm_swap_data_lock (mutex): this lock protects all swap data
+ *structures including the priority list, the swapdev structures,
+ *and the swapmap arena.
  *
  * each swap device has the following info:
  *  - swap device in use (could be disabled, preventing future use)
@@ -106,7 +110,7 @@
  * userland controls and configures swap with the swapctl(2) system call.
  * the sys_swapctl performs the following operations:
  *  [1] SWAP_NSWAP: returns the number of swap devices currently configured
- *  [2] SWAP_STATS: given a pointer to an array of swapent structures 
+ *  [2] SWAP_STATS: given a pointer to an array of swapent structures
  * (passed in via "arg") of a size passed in via "misc" ... we load
  * the current swap config into the array.
  *  [3] SWAP_ON: given a pathname in arg (could be device or file) and a
@@ -208,9 +212,10 @@ struct extent *swapmap;/* controls the
 
 /* list of all active swap devices [by priority] */
 LIST_HEAD(swap_priority, swappri);
-struct swap_priority swap_priority;
+struct swap_priority swap_priority;/* [S] */
 
 /* locks */
+struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
 struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
 
 struct mutex oommtx = MUTEX_INITIALIZER(IPL_VM);
@@ -224,7 +229,7 @@ void swapdrum_add(struct swapdev *, in
 struct swapdev *swapdrum_getsdp(int);
 
 struct swapdev *swaplist_find(struct vnode *, int);
-voidswaplist_insert(struct swapdev *, 
+voidswaplist_insert(struct swapdev *,
 struct swappri *, int);
 voidswaplist_trim(void);
 
@@ -472,16 +477,19 @@ uvm_swap_finicrypt_all(void)
 /*
  * swaplist_in

Re: patch: change swblk_t type and use it in blist

2022-08-05 Thread Martin Pieuchot
On 05/08/22(Fri) 18:10, Sebastien Marie wrote:
> Hi,
> 
> When initially ported blist from DragonFlyBSD, we used custom type bsblk_t 
> and 
> bsbmp_t instead of the one used by DragonFlyBSD (swblk_t and u_swblk_t).
> 
> The reason was swblk_t is already defined on OpenBSD, and was incompatible 
> with 
> blist (int32_t). It is defined, but not used (outside some regress file which 
> seems to be not affected by type change).
> 
> This diff changes the __swblk_t definition in sys/_types.h to be 'unsigned 
> long', and switch back blist to use swblk_t (and u_swblk_t, even if it isn't 
> 'unsigned swblk_t').
> 
> It makes the diff with DragonFlyBSD more thin. I added a comment with the git 
> id 
> used for the initial port.
> 
> I tested it on i386 and amd64 (kernel and userland).
> 
> By changing bitmap type from 'u_long' to 'u_swblk_t' ('u_int64_t'), it makes 
> the 
> regress the same on 64 and 32bits archs (and it success on both).
> 
> Comments or OK ?

Makes sense to me.  I'm not a standard/type lawyer so I don't know if
this is fine for userland.  So I'm ok with it.

> diff /home/semarie/repos/openbsd/src
> commit - 73f52ef7130cefbe5a8fe028eedaad0e54be7303
> path + /home/semarie/repos/openbsd/src
> blob - e05867429cdd81c434f9ca589c1fb8c6d25957f8
> file + sys/sys/_types.h
> --- sys/sys/_types.h
> +++ sys/sys/_types.h
> @@ -60,7 +60,7 @@ typedef __uint8_t   __sa_family_t;  /* sockaddr 
> address f
>  typedef  __int32_t   __segsz_t;  /* segment size */
>  typedef  __uint32_t  __socklen_t;/* length type for network 
> syscalls */
>  typedef  long__suseconds_t;  /* microseconds (signed) */
> -typedef  __int32_t   __swblk_t;  /* swap offset */
> +typedef  unsigned long   __swblk_t;  /* swap offset */
>  typedef  __int64_t   __time_t;   /* epoch time */
>  typedef  __int32_t   __timer_t;  /* POSIX timer identifiers */
>  typedef  __uint32_t  __uid_t;/* user id */
> blob - 102ca95dd45ba6d9cab0f3fcbb033d6043ec1606
> file + sys/sys/blist.h
> --- sys/sys/blist.h
> +++ sys/sys/blist.h
> @@ -1,4 +1,5 @@
>  /* $OpenBSD: blist.h,v 1.1 2022/07/29 17:47:12 semarie Exp $ */
> +/* DragonFlyBSD:7b80531f545c7d3c51c1660130c71d01f6bccbe0:/sys/sys/blist.h */
>  /*
>   * Copyright (c) 2003,2004 The DragonFly Project.  All rights reserved.
>   * 
> @@ -65,15 +66,13 @@
>  #include 
>  #endif
>  
> -#define  SWBLK_BITS 64
> -typedef u_long bsbmp_t;
> -typedef u_long bsblk_t;
> +typedef u_int64_tu_swblk_t;
>  
>  /*
>   * note: currently use SWAPBLK_NONE as an absolute value rather then
>   * a flag bit.
>   */
> -#define SWAPBLK_NONE ((bsblk_t)-1)
> +#define SWAPBLK_NONE ((swblk_t)-1)
>  
>  /*
>   * blmeta and bl_bitmap_t MUST be a power of 2 in size.
> @@ -81,39 +80,39 @@ typedef u_long bsblk_t;
>  
>  typedef struct blmeta {
>   union {
> - bsblk_t bmu_avail;  /* space available under us */
> - bsbmp_t bmu_bitmap; /* bitmap if we are a leaf  */
> + swblk_t bmu_avail;  /* space available under us */
> + u_swblk_t   bmu_bitmap; /* bitmap if we are a leaf  */
>   } u;
> - bsblk_t bm_bighint; /* biggest contiguous block hint*/
> + swblk_t bm_bighint; /* biggest contiguous block hint*/
>  } blmeta_t;
>  
>  typedef struct blist {
> - bsblk_t bl_blocks;  /* area of coverage */
> + swblk_t bl_blocks;  /* area of coverage */
>   /* XXX int64_t bl_radix */
> - bsblk_t bl_radix;   /* coverage radix   */
> - bsblk_t bl_skip;/* starting skip*/
> - bsblk_t bl_free;/* number of free blocks*/
> + swblk_t bl_radix;   /* coverage radix   */
> + swblk_t bl_skip;/* starting skip*/
> + swblk_t bl_free;/* number of free blocks*/
>   blmeta_t*bl_root;   /* root of radix tree   */
> - bsblk_t bl_rootblks;/* bsblk_t blks allocated for tree */
> + swblk_t bl_rootblks;/* swblk_t blks allocated for tree */
>  } *blist_t;
>  
> -#define BLIST_META_RADIX (sizeof(bsbmp_t)*8/2)   /* 2 bits per */
> -#define BLIST_BMAP_RADIX (sizeof(bsbmp_t)*8) /* 1 bit per */
> +#define BLIST_META_RADIX (sizeof(u_swblk_t)*8/2) /* 2 bits per */
> +#define BLIST_BMAP_RADIX (sizeof(u_swblk_t)*8)   /* 1 bit per */
>  
>  /*
>   * The radix may exceed the size of a 64 bit signed (or unsigned) int
> - * when the maximal number of blocks is allocated.  With a 32-bit bsblk_t
> + * when the maximal number of blocks is allocated.  With a 32-bit swblk_t
>   * this corresponds to ~1G x PAGE_SIZE = 4096GB.  The swap code usually
>   * divides this by 4, leaving us with a capability of up to four 1TB swap
>   * devices.
>   *
> - * With a 64-bi

pf.conf(5): document new anchors limit

2022-07-21 Thread Martin Vahlensieck
Hi

This is a diff to document the new anchors limit in pf.conf(5).  I
inserted it as the second-to-last item, as the following paragraph talks
about NMBCLUSTERS.  While here: Is the double entry for table-entries
intentional?

Best,

Martin

Index: pf.conf.5
===
RCS file: /cvs/src/share/man/man5/pf.conf.5,v
retrieving revision 1.596
diff -u -p -r1.596 pf.conf.5
--- pf.conf.5   27 May 2022 15:45:02 -  1.596
+++ pf.conf.5   21 Jul 2022 17:00:53 -
@@ -1287,6 +1287,7 @@ has the following defaults:
 .It tables Ta Dv PFR_KTABLE_HIWAT Ta Pq 1000
 .It table-entries Ta Dv PFR_KENTRY_HIWAT Ta Pq 20
 .It table-entries Ta Dv PFR_KENTRY_HIWAT_SMALL Ta Pq 10
+.It anchors Ta Dv PF_ANCHOR_HIWAT Ta Pq 512
 .It frags Ta Dv NMBCLUSTERS Ns /32 Ta Pq platform dependent
 .El
 .Pp



ypconnect(2): mention correct return value

2022-07-21 Thread Martin Vahlensieck
Hi

While looking at the recent YP changes I noticed that the RETURN
VALUES section of the man page is incorrect.  Here is an update (I
just copied the text from socket(2) and adjusted the function name).

Best,

Martin

Index: ypconnect.2
===
RCS file: /cvs/src/lib/libc/sys/ypconnect.2,v
retrieving revision 1.2
diff -u -p -r1.2 ypconnect.2
--- ypconnect.2 17 Jul 2022 05:48:26 -  1.2
+++ ypconnect.2 21 Jul 2022 17:08:57 -
@@ -45,7 +45,12 @@ general purpose.
 .Nm
 is only intended for use by internal libc YP functions.
 .Sh RETURN VALUES
-.Rv -std
+If successful,
+.Fn ypconnect
+returns a non-negative integer, the socket file descriptor.
+Otherwise, a value of \-1 is returned and
+.Va errno
+is set to indicate the error.
 .Sh ERRORS
 .Fn ypconnect
 will fail if:



Re: Introduce uvm_pagewait()

2022-07-11 Thread Martin Pieuchot
On 28/06/22(Tue) 14:13, Martin Pieuchot wrote:
> I'd like to abstract the use of PG_WANTED to start unifying & cleaning
> the various cases where a code path is waiting for a busy page.  Here's
> the first step.
> 
> ok?

Anyone?

> Index: uvm/uvm_amap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 uvm_amap.c
> --- uvm/uvm_amap.c30 Aug 2021 16:59:17 -  1.90
> +++ uvm/uvm_amap.c28 Jun 2022 11:53:08 -
> @@ -781,9 +781,7 @@ ReStart:
>* it and then restart.
>*/
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
> - "cownow", INFSLP);
> + uvm_pagewait(pg, amap->am_lock, "cownow");
>   goto ReStart;
>   }
>  
> Index: uvm/uvm_aobj.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
> retrieving revision 1.103
> diff -u -p -r1.103 uvm_aobj.c
> --- uvm/uvm_aobj.c29 Dec 2021 20:22:06 -  1.103
> +++ uvm/uvm_aobj.c28 Jun 2022 11:53:08 -
> @@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
>   while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
>   pmap_page_protect(pg, PROT_NONE);
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;
>   }
>   uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
> @@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
>  
>   /* Make sure page is unbusy, else wait for it. */
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE;
>   continue;
>   }
> @@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
>  
>   /* page is there, see if we need to wait on it */
>   if ((ptmp->pg_flags & PG_BUSY) != 0) {
> - atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
> - rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
> - "uao_get", INFSLP);
> + uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;   /* goto top of pps while loop */
>   }
>  
> Index: uvm/uvm_km.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_km.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uvm_km.c
> --- uvm/uvm_km.c  7 Jun 2022 12:07:45 -   1.150
> +++ uvm/uvm_km.c  28 Jun 2022 11:53:08 -
> @@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
>   for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
>   pp = uvm_pagelookup(uobj, curoff);
>   if (pp && pp->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pp->pg_flags, PG_WANTED);
> - rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
> - INFSLP);
> + uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE; /* loop back to us */
>   continue;
>   }
> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.166
> diff -u -p -r1.166 uvm_page.c
> --- 

PATCH: better prime testing for libressl

2022-07-08 Thread Martin Grenouilloux
Hello,

I'm proposing a diff against master that implements the Baillie-PSW
algorithm for primality testing.
The code itself is commented and explains what is being done and why.

The reason for this change is that in a 2018 paper
(https://eprint.iacr.org/2018/749.pdf), researchers pointed out weaknesses
in the main primality-testing algorithms: one can generate pseudoprimes
that fool those tests, so composite numbers end up being accepted as primes.

There is one test that has been studied and has no known pseudoprime: the
Baillie-PSW algorithm.
It is this one that Theo Buehler and I have implemented, as it is safer.

Theo Buehler and I have been actively working on it for the past month and
can finally present it to you.
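
For readers unfamiliar with the test, here is a rough sketch of its overall
structure.  The helper names are hypothetical placeholders; the actual diff
below builds the two sub-tests directly on top of the BN_* primitives.

#include <openssl/bn.h>

/*
 * Sketch only: miller_rabin_base_2() and strong_lucas_selfridge() stand
 * in for the two sub-tests implemented in bn_bpsw.c below.
 */
int
bn_is_bpsw_prime_sketch(const BIGNUM *n, BN_CTX *ctx)
{
	if (BN_is_word(n, 2))
		return 1;
	if (BN_cmp(n, BN_value_one()) <= 0 || !BN_is_odd(n))
		return 0;

	/* 1. Strong pseudoprime (Miller-Rabin) test to base 2. */
	if (!miller_rabin_base_2(n, ctx))
		return 0;

	/*
	 * 2. Strong Lucas test with Selfridge's parameters: the first D in
	 *    5, -7, 9, -11, ... with Jacobi(D/n) == -1, P = 1 and
	 *    Q = (1 - D) / 4.  n must not be a perfect square, otherwise
	 *    no such D exists.
	 */
	if (!strong_lucas_selfridge(n, ctx))
		return 0;

	/* No composite number is known to pass both tests. */
	return 1;
}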

I hope this finds you well,

Regards,

Martin Grenouilloux.
EPITA - LSE
diff --git a/lib/libcrypto/Makefile b/lib/libcrypto/Makefile
index d6432cdc518..7e3a07c5fba 100644
--- a/lib/libcrypto/Makefile
+++ b/lib/libcrypto/Makefile
@@ -89,6 +89,7 @@ SRCS+= bn_print.c bn_rand.c bn_shift.c bn_word.c bn_blind.c
 SRCS+= bn_kron.c bn_sqrt.c bn_gcd.c bn_prime.c bn_err.c bn_sqr.c
 SRCS+= bn_recp.c bn_mont.c bn_mpi.c bn_exp2.c bn_gf2m.c bn_nist.c
 SRCS+= bn_depr.c bn_const.c bn_x931p.c
+SRCS+= bn_bpsw.c bn_isqrt.c
 
 # buffer/
 SRCS+= buffer.c buf_err.c buf_str.c
diff --git a/lib/libcrypto/bn/bn_bpsw.c b/lib/libcrypto/bn/bn_bpsw.c
new file mode 100644
index 000..d198899b2a4
--- /dev/null
+++ b/lib/libcrypto/bn/bn_bpsw.c
@@ -0,0 +1,401 @@
+/*	$OpenBSD$ */
+/*
+ * Copyright (c) 2022 Martin Grenouilloux 
+ * Copyright (c) 2022 Theo Buehler 
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#include 
+
+#include "bn_lcl.h"
+#include "bn_prime.h"
+
+/*
+ * For an odd n compute a / 2 (mod n). If a is even, we can do a plain
+ * division, otherwise calculate (a + n) / 2. Then reduce (mod n).
+ */
+static int
+bn_division_by_two_mod_n(BIGNUM *r, BIGNUM *a, const BIGNUM *n, BN_CTX *ctx)
+{
+	if (!BN_is_odd(n))
+		return 0;
+
+	if (!BN_mod_ct(r, a, n, ctx))
+		return 0;
+
+	if (BN_is_odd(r)) {
+		if (!BN_add(r, r, n))
+			return 0;
+	}
+
+	if (!BN_rshift1(r, r))
+		return 0;
+
+	return 1;
+}
+
+/*
+ * Given the next binary digit of k and the current Lucas terms U and V, this
+ * helper computes the next terms in the Lucas sequence defined as follows:
+ *
+ *   U' = U * V  (mod n)
+ *   V' = (V^2 + D * U^2) / 2(mod n)
+ *
+ * If digit == 0, bn_lucas_step() returns U' and V'. If digit == 1, it returns
+ *
+ *   U'' = (U' + V') / 2 (mod n)
+ *   V'' = (V' + D * U') / 2 (mod n)
+ *
+ * Compare with FIPS 186-4, Appendix C.3.3, step 6.
+ */
+static int
+bn_lucas_step(BIGNUM *U, BIGNUM *V, int digit, const BIGNUM *D,
+const BIGNUM *n, BN_CTX *ctx)
+{
+	BIGNUM *tmp;
+	int ret = 0;
+
+	BN_CTX_start(ctx);
+
+	if ((tmp = BN_CTX_get(ctx)) == NULL)
+		goto done;
+
+	/* Store D * U^2 before computing U'. */
+	if (!BN_sqr(tmp, U, ctx))
+		goto done;
+	if (!BN_mul(tmp, D, tmp, ctx))
+		goto done;
+
+	/* U' = U * V (mod n). */
+	if (!BN_mod_mul(U, U, V, n, ctx))
+		goto done;
+
+	/* V' = (V^2 + D * U^2) / 2 (mod n). */
+	if (!BN_sqr(V, V, ctx))
+		goto done;
+	if (!BN_add(V, V, tmp))
+		goto done;
+	if (!bn_division_by_two_mod_n(V, V, n, ctx))
+		goto done;
+
+	if (digit == 1) {
+		/* Store D * U' before computing U''. */
+		if (!BN_mul(tmp, D, U, ctx))
+			goto done;
+
+		/* U'' = (U' + V') / 2 (mod n). */
+		if (!BN_add(U, U, V))
+			goto done;
+		if (!bn_division_by_two_mod_n(U, U, n, ctx))
+			goto done;
+
+		/* V'' = (V' + D * U') / 2 (mod n). */
+		if (!BN_add(V, V, tmp))
+			goto done;
+		if (!bn_division_by_two_mod_n(V, V, n, ctx))
+			goto done;
+	}
+
+	ret = 1;
+
+ done:
+	BN_CTX_end(ctx);
+
+	return ret;
+}
+
+/*
+ * Compute the Lucas terms U_k, V_k, see FIPS 186-4, Appendix C.3.3, steps 4-6.
+ */
+static int
+bn_lucas(BIGNUM *U, BIGNUM *V, const BIGNUM *k, const BIGNUM *D,
+const BIGNUM *n, BN_CTX *ctx)
+{
+	int digit, i;
+	int ret = 0;
+
+	if (!BN_one(U))
+		goto done;
+	if (!BN_one(V))
+		goto done;
+
+	/*
+	 * Iterate over 

Faster M operation for the swapper to be great again

2022-06-30 Thread Martin Pieuchot
Diff below uses two tricks to make uvm_pagermapin/out() faster and less
likely to fail in OOM situations.

These functions are used to map buffers when swapping pages in/out and
when faulting on mmaped files.  robert@ even measured a 75% improvement
when populating pages related to files that aren't yet in the buffer
cache.

The first trick is to use the direct map when available.  I'm doing this
for single pages, but km_alloc(9) also does that for a single segment...
uvm_io() only maps one page at a time for the moment, so this should be
enough.

The second trick is to use pmap_kenter_pa(), which doesn't fail and is
faster.

With these changes the "freeze" happening on my server when entering many
pages to swap in an OOM situation is much shorter and the machine becomes
responsive again quickly.

ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.81
diff -u -p -r1.81 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 19:07:40 -  1.81
+++ uvm/uvm_pager.c 30 Jun 2022 13:34:46 -
@@ -258,6 +258,16 @@ uvm_pagermapin(struct vm_page **pps, int
vsize_t size;
struct vm_page *pp;
 
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   KASSERT(pps[0]);
+   KASSERT(pps[0]->pg_flags & PG_BUSY);
+   kva = pmap_map_direct(pps[0]);
+   return kva;
+   }
+#endif
+
prot = PROT_READ;
if (flags & UVMPAGER_MAPIN_READ)
prot |= PROT_WRITE;
@@ -273,14 +283,7 @@ uvm_pagermapin(struct vm_page **pps, int
pp = *pps++;
KASSERT(pp);
KASSERT(pp->pg_flags & PG_BUSY);
-   /* Allow pmap_enter to fail. */
-   if (pmap_enter(pmap_kernel(), cva, VM_PAGE_TO_PHYS(pp),
-   prot, PMAP_WIRED | PMAP_CANFAIL | prot) != 0) {
-   pmap_remove(pmap_kernel(), kva, cva);
-   pmap_update(pmap_kernel());
-   uvm_pseg_release(kva);
-   return 0;
-   }
+   pmap_kenter_pa(cva, VM_PAGE_TO_PHYS(pp), prot);
}
pmap_update(pmap_kernel());
return kva;
@@ -294,8 +297,15 @@ uvm_pagermapin(struct vm_page **pps, int
 void
 uvm_pagermapout(vaddr_t kva, int npages)
 {
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   pmap_unmap_direct(kva);
+   return;
+   }
+#endif
 
-   pmap_remove(pmap_kernel(), kva, kva + ((vsize_t)npages << PAGE_SHIFT));
+   pmap_kremove(kva, (vsize_t)npages << PAGE_SHIFT);
pmap_update(pmap_kernel());
uvm_pseg_release(kva);
 



Re: Use SMR instead of SRP list in rtsock.c

2022-06-30 Thread Martin Pieuchot
On 30/06/22(Thu) 11:56, Claudio Jeker wrote:
> On Thu, Jun 30, 2022 at 12:34:33PM +0300, Vitaliy Makkoveev wrote:
> > On Thu, Jun 30, 2022 at 11:08:48AM +0200, Claudio Jeker wrote:
> > > This diff converts the SRP list to a SMR list in rtsock.c
> > > SRP is a bit strange with how it works and the SMR code is a bit easier to
> > > understand. Since we can sleep in the SMR_TAILQ_FOREACH() we need to grab
> > > a refcount on the route pcb so that we can leave the SMR critical section
> > > and then enter the SMR critical section at the end of the loop before
> > > dropping the refcount again.
> > > 
> > > The diff does not immeditaly explode but I doubt we can exploit
> > > parallelism in route_input() so this may fail at some later stage if it is
> > > wrong.
> > > 
> > > Comments from the lock critics welcome
> > 
> > We use `so_lock' rwlock(9) to protect route domain sockets. We can't
> > convert this SRP list to SMR list because we call solock() within
> > foreach loop.

We shouldn't use an SRP list either, no?  Or are we allowed to sleep
holding an SRP reference?  That's the question that triggered this diff.

> because of the so_lock the code uses a refcnt on the route pcb to make
> sure that the object is not freed while we sleep. So that is handled by
> this diff.
>  
> > We can easily crash kernel by running in parallel some "route monitor"
> > commands and "while true; ifconfig vether0 create ; ifconfig vether0
> > destroy; done".
> 
> That does not cause problem on my system.
>  
> > > -- 
> > > :wq Claudio
> > > 
> > > Index: sys/net/rtsock.c
> > > ===
> > > RCS file: /cvs/src/sys/net/rtsock.c,v
> > > retrieving revision 1.334
> > > diff -u -p -r1.334 rtsock.c
> > > --- sys/net/rtsock.c  28 Jun 2022 10:01:13 -  1.334
> > > +++ sys/net/rtsock.c  30 Jun 2022 08:02:09 -
> > > @@ -71,7 +71,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -107,8 +107,6 @@ struct walkarg {
> > >  };
> > >  
> > >  void route_prinit(void);
> > > -void rcb_ref(void *, void *);
> > > -void rcb_unref(void *, void *);
> > >  int  route_output(struct mbuf *, struct socket *, struct sockaddr *,
> > >   struct mbuf *);
> > >  int  route_ctloutput(int, struct socket *, int, int, struct mbuf *);
> > > @@ -149,7 +147,7 @@ intrt_setsource(unsigned int, struct 
> > >  struct rtpcb {
> > >   struct socket   *rop_socket;/* [I] */
> > >  
> > > - SRPL_ENTRY(rtpcb)   rop_list;
> > > + SMR_TAILQ_ENTRY(rtpcb)  rop_list;
> > >   struct refcnt   rop_refcnt;
> > >   struct timeout  rop_timeout;
> > >   unsigned introp_msgfilter;  /* [s] */
> > > @@ -162,8 +160,7 @@ struct rtpcb {
> > >  #define  sotortpcb(so)   ((struct rtpcb *)(so)->so_pcb)
> > >  
> > >  struct rtptable {
> > > - SRPL_HEAD(, rtpcb)  rtp_list;
> > > - struct srpl_rc  rtp_rc;
> > > + SMR_TAILQ_HEAD(, rtpcb) rtp_list;
> > >   struct rwlock   rtp_lk;
> > >   unsigned intrtp_count;
> > >  };
> > > @@ -185,29 +182,12 @@ struct rtptable rtptable;
> > >  void
> > >  route_prinit(void)
> > >  {
> > > - srpl_rc_init(&rtptable.rtp_rc, rcb_ref, rcb_unref, NULL);
> > >   rw_init(&rtptable.rtp_lk, "rtsock");
> > > - SRPL_INIT(&rtptable.rtp_list);
> > > + SMR_TAILQ_INIT(&rtptable.rtp_list);
> > >   pool_init(&rtpcb_pool, sizeof(struct rtpcb), 0,
> > >   IPL_SOFTNET, PR_WAITOK, "rtpcb", NULL);
> > >  }
> > >  
> > > -void
> > > -rcb_ref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_take(&rop->rop_refcnt);
> > > -}
> > > -
> > > -void
> > > -rcb_unref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_rele_wake(&rop->rop_refcnt);
> > > -}
> > > -
> > >  int
> > >  route_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf 
> > > *nam,
> > >  struct mbuf *control, struct proc *p)
> > > @@ -325,8 +305,7 @@ route_attach(struct socket *so, int prot
> > >   so->so_options |= SO_USELOOPBACK;
> > >  
> > >   rw_enter(&rtptable.rtp_lk, RW_WRITE);
> > > - SRPL_INSERT_HEAD_LOCKED(&rtptable.rtp_rc, &rtptable.rtp_list, rop,
> > > - rop_list);
> > > + SMR_TAILQ_INSERT_HEAD_LOCKED(&rtptable.rtp_list, rop, rop_list);
> > >   rtptable.rtp_count++;
> > >   rw_exit(&rtptable.rtp_lk);
> > >  
> > > @@ -347,8 +326,7 @@ route_detach(struct socket *so)
> > >   rw_enter(&rtptable.rtp_lk, RW_WRITE);
> > >  
> > >   rtptable.rtp_count--;
> > > - SRPL_REMOVE_LOCKED(&rtptable.rtp_rc, &rtptable.rtp_list, rop, rtpcb,
> > > - rop_list);
> > > + SMR_TAILQ_REMOVE_LOCKED(&rtptable.rtp_list, rop, rop_list);
> > >   rw_exit(&rtptable.rtp_lk);
> > >  
> > >   sounlock(so);
> > > @@ -356,6 +334,7 @@ route_detach(struct socket *so)
> > >   /* wait for all references to drop */
> > >   refcnt_finalize(&rop->rop_refcnt, "rtsockrefs");
> > 

Re: arp llinfo mutex

2022-06-29 Thread Martin Pieuchot
On 29/06/22(Wed) 19:40, Alexander Bluhm wrote:
> Hi,
> 
> To fix the KASSERT(la != NULL) we have to protect the rt_llinfo
> with a mutex.  The idea is to keep rt_llinfo and RTF_LLINFO consistent.
> Also do not put the mutex in the fast path.

Losing the RTM_ADD/DELETE race is not a bug.  I would not add a printf
in these cases.  I understand you might want one for debugging purposes
but I don't see any value in committing it.  Do you agree?

Note that sometimes the code checks for the RTF_LLINFO flag and sometimes
for rt_llinfo != NULL.  This is inconsistent and a bit confusing
now that we use a mutex to protect those states.

Could you document that rt_llinfo is now protected by the mutex (or
KERNEL_LOCK())?

Anyway this is an improvement ok mpi@

PS: What about ND6?

> Index: netinet/if_ether.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/if_ether.c,v
> retrieving revision 1.250
> diff -u -p -r1.250 if_ether.c
> --- netinet/if_ether.c27 Jun 2022 20:47:10 -  1.250
> +++ netinet/if_ether.c28 Jun 2022 14:00:12 -
> @@ -101,6 +101,8 @@ void arpreply(struct ifnet *, struct mbu
>  unsigned int);
>  
>  struct niqueue arpinq = NIQUEUE_INITIALIZER(50, NETISR_ARP);
> +
> +/* llinfo_arp live time, rt_llinfo and RTF_LLINFO are protected by arp_mtx */
>  struct mutex arp_mtx = MUTEX_INITIALIZER(IPL_SOFTNET);
>  
>  LIST_HEAD(, llinfo_arp) arp_list; /* [mN] list of all llinfo_arp structures 
> */
> @@ -155,7 +157,7 @@ void
>  arp_rtrequest(struct ifnet *ifp, int req, struct rtentry *rt)
>  {
>   struct sockaddr *gate = rt->rt_gateway;
> - struct llinfo_arp *la = (struct llinfo_arp *)rt->rt_llinfo;
> + struct llinfo_arp *la;
>   time_t uptime;
>  
>   NET_ASSERT_LOCKED();
> @@ -171,7 +173,7 @@ arp_rtrequest(struct ifnet *ifp, int req
>   rt->rt_expire = 0;
>   break;
>   }
> - if ((rt->rt_flags & RTF_LOCAL) && !la)
> + if ((rt->rt_flags & RTF_LOCAL) && rt->rt_llinfo == NULL)
>   rt->rt_expire = 0;
>   /*
>* Announce a new entry if requested or warn the user
> @@ -192,44 +194,54 @@ arp_rtrequest(struct ifnet *ifp, int req
>   }
>   satosdl(gate)->sdl_type = ifp->if_type;
>   satosdl(gate)->sdl_index = ifp->if_index;
> - if (la != NULL)
> - break; /* This happens on a route change */
>   /*
>* Case 2:  This route may come from cloning, or a manual route
>* add with a LL address.
>*/
>   la = pool_get(&arp_pool, PR_NOWAIT | PR_ZERO);
> - rt->rt_llinfo = (caddr_t)la;
>   if (la == NULL) {
>   log(LOG_DEBUG, "%s: pool get failed\n", __func__);
>   break;
>   }
>  
> + mtx_enter(&arp_mtx);
> + if (rt->rt_llinfo != NULL) {
> + /* we lost the race, another thread has entered it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo exists\n", __func__);
> + pool_put(&arp_pool, la);
> + break;
> + }
>   mq_init(&la->la_mq, LA_HOLD_QUEUE, IPL_SOFTNET);
> + rt->rt_llinfo = (caddr_t)la;
>   la->la_rt = rt;
>   rt->rt_flags |= RTF_LLINFO;
> + LIST_INSERT_HEAD(&arp_list, la, la_list);
>   if ((rt->rt_flags & RTF_LOCAL) == 0)
>   rt->rt_expire = uptime;
> - mtx_enter(&arp_mtx);
> - LIST_INSERT_HEAD(&arp_list, la, la_list);
>   mtx_leave(&arp_mtx);
> +
>   break;
>  
>   case RTM_DELETE:
> - if (la == NULL)
> - break;
>   mtx_enter(&arp_mtx);
> + la = (struct llinfo_arp *)rt->rt_llinfo;
> + if (la == NULL) {
> + /* we lost the race, another thread has removed it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo missing\n", __func__);
> + break;
> + }
>   LIST_REMOVE(la, la_list);
> - mtx_leave(&arp_mtx);
>   rt->rt_llinfo = NULL;
>   rt->rt_flags &= ~RTF_LLINFO;
>   atomic_sub_int(&la_hold_total, mq_purge(&la->la_mq));
> + mtx_leave(&arp_mtx);
> +
>   pool_put(&arp_pool, la);
>   break;
>  
>   case RTM_INVALIDATE:
> - if (la == NULL)
> - break;
>   if (!ISSET(rt->rt_flags, RTF_LOCAL))
>   arpinvalidate(rt);
>   break;
> @@ -363,8 +375,6 @@ arpresolve(struct ifnet *ifp, struct rte
>   goto bad;
>   }
>  
> - la = (struct llinfo_arp *)rt->rt_llinfo;
> 

Simplify aiodone daemon

2022-06-29 Thread Martin Pieuchot
The aiodone daemon accounts for and frees/releases pages after they have
been written to swap.  It is only used for asynchronous writes.  The diff
below uses this knowledge to:

- Stop suggesting that uvm_swap_get() can be asynchronous.  There's an
  assert for PGO_SYNCIO 3 lines above.

- Remove unused support for asynchronous read, including error
  conditions, from uvm_aio_aiodone_pages().

- Grab the proper lock for each page that has been written to swap.
  This allows to enable an assert in uvm_page_unbusy().

- Move the uvm_anon_release() call outside of uvm_page_unbusy() and
  assert for the different anon cases.  This will allow us to unify
  code paths waiting for busy pages.

This is adapted/simplified from what is in NetBSD.

ok?

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  29 Jun 2022 11:16:35 -
@@ -143,7 +143,6 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +240,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  29 Jun 2022 11:16:35 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  29 Jun 2022 11:47:55 -
@@ -1036,13 +1036,14 @@ uvm_pagefree(struct vm_page *pg)
  * uvm_page_unbusy: unbusy an array of pages.
  *
  * => pages must either all belong to the same object, or all belong to anons.
+ * => if pages are object-owned, object must be locked.
  * => if pages are anon-owned, anons must have 0 refcount.
+ * => caller must make sure that anon-owned pages are not PG_RELEASED.
  */
 void
 uvm_page_unbusy(struct vm_page **pgs, int npgs)
 {
struct vm_page *pg;
-   struct uvm_object *uobj;
int i;
 
for (i = 0; i < npgs; i++) {
@@ -1052,35 +1053,19 @@ uvm_page_unbusy(struct vm_page **pgs, in
continue;
}
 
-#if notyet
-   /*
- * XXX swap case in uvm_aio_aiodone() is not holding the lock.
-*
-* This isn't compatible with the PG_RELEASED anon case below.
-*/
KASSERT(uvm_page_owner_locked_p(pg));
-#endif
KASSERT(pg->pg_flags & PG_BUSY);
 
if (pg->pg_flags & PG_WANTED) {
wakeup(pg);
}
if (pg->pg_flags & PG_RELEASED) {
-   uobj = pg->uobject;
-   if (uobj != NULL) {
-   uvm_lock_pageq();
-   pmap_page_protect(pg, PROT_NONE);
-   /* XXX won't happen right now */
-   if (pg->pg_flags & PQ_AOBJ)
-   uao_dropswap(uobj,
-   pg->offset >> PAGE_SHIFT);
-   uvm_pagefree(pg);
-   uvm_unlock_pageq();
-   } else {
-   rw_enter(pg->uanon->an_lock, RW_WRITE);
-   uvm_anon_release(pg->uanon);
-   }
+   KASSERT(pg->uobject != NULL ||
+   (pg->uanon != NULL && pg->uanon->an_ref > 0));
+   atomic_clearbits_int(&pg->pg_flags, PG_RELEASED);
+   uvm_pagefree(pg);
} else {
+   KASSERT((pg->pg_flags & PG_FAKE) == 0);
atomic_clearbits_int(&pg->pg_flags, PG_WANTED|PG_BUSY);
UVM_PAGE_OWN(pg, NULL);
   

Re: Unlocking pledge(2)

2022-06-28 Thread Martin Pieuchot
On 28/06/22(Tue) 18:17, Jeremie Courreges-Anglas wrote:
> 
> Initially I just wandered in syscall_mi.h and found the locking scheme
> super weird, even if technically correct.  pledge_syscall() better be
> safe to call without the kernel lock so I don't understand why we're
> sometimes calling it with the kernel lock and sometimes not.
> 
> ps_pledge is 64 bits so it's not possible to unset bits in an atomic
> manner on all architectures.  Even if we're only removing bits and there
> is probably no way to see a completely garbage value, it makes sense to
> just protect ps_pledge (and ps_execpledge) in the usual manner so that
> we can unlock the syscall.  The diff below protects the fields using
> ps_mtx even though I initially used a dedicated ps_pledge_mtx.
> unveil_destroy() needs to be moved after the critical section.
> regress/sys/kern/pledge looks happy with this.  The sys/syscall_mi.h
> change can be committed in a separate step.
> 
> Input and oks welcome.

This looks nice.  I doubt there's any existing program where you can
really test this.  Even firefox and chromium should do things
correctly.

Maybe you should write a regress test that tries to break the kernel.
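
Something along these lines perhaps (a sketch only, not an existing regress
test; the promise strings and iteration counts are arbitrary): a couple of
threads keep re-pledging while the main thread keeps entering the kernel, to
exercise concurrent updates and reads of ps_pledge.

#include <err.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

static void *
reduce(void *arg)
{
	const char *promises = arg;
	int i;

	for (i = 0; i < 100000; i++) {
		/*
		 * A race may legitimately fail with EPERM once the other
		 * thread has dropped a promise we still request.
		 */
		if (pledge(promises, NULL) == -1 && errno != EPERM)
			err(1, "pledge %s", promises);
	}
	return NULL;
}

int
main(void)
{
	pthread_t t1, t2;
	int i;

	if (pthread_create(&t1, NULL, reduce, (void *)"stdio rpath") ||
	    pthread_create(&t2, NULL, reduce, (void *)"stdio"))
		errx(1, "pthread_create");
	for (i = 0; i < 100000; i++)
		(void)getpid();		/* every syscall hits pledge_syscall() */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}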

> Index: arch/amd64/amd64/vmm.c
> ===
> RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.315
> diff -u -p -r1.315 vmm.c
> --- arch/amd64/amd64/vmm.c27 Jun 2022 15:12:14 -  1.315
> +++ arch/amd64/amd64/vmm.c28 Jun 2022 13:54:25 -
> @@ -713,7 +713,7 @@ pledge_ioctl_vmm(struct proc *p, long co
>   case VMM_IOC_CREATE:
>   case VMM_IOC_INFO:
>   /* The "parent" process in vmd forks and manages VMs */
> - if (p->p_p->ps_pledge & PLEDGE_PROC)
> + if (pledge_get(p->p_p) & PLEDGE_PROC)
>   return (0);
>   break;
>   case VMM_IOC_TERM:
> @@ -1312,7 +1312,7 @@ vm_find(uint32_t id, struct vm **res)
>* The managing vmm parent process can lookup all
>* all VMs and is indicated by PLEDGE_PROC.
>*/
> - if (((p->p_p->ps_pledge &
> + if (((pledge_get(p->p_p) &
>   (PLEDGE_VMM | PLEDGE_PROC)) == PLEDGE_VMM) &&
>   (vm->vm_creator_pid != p->p_p->ps_pid))
>   return (pledge_fail(p, EPERM, PLEDGE_VMM));
> Index: kern/init_sysent.c
> ===
> RCS file: /home/cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.238
> diff -u -p -r1.238 init_sysent.c
> --- kern/init_sysent.c27 Jun 2022 14:26:05 -  1.238
> +++ kern/init_sysent.c28 Jun 2022 15:18:25 -
> @@ -1,10 +1,10 @@
> -/*   $OpenBSD: init_sysent.c,v 1.238 2022/06/27 14:26:05 cheloha Exp $   
> */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
>   *
>   * DO NOT EDIT-- this file is automatically generated.
> - * created from; OpenBSD: syscalls.master,v 1.224 2022/05/16 07:36:04 
> mvs Exp 
> + * created from; OpenBSD: syscalls.master,v 1.225 2022/06/27 14:26:05 
> cheloha Exp 
>   */
>  
>  #include 
> @@ -248,7 +248,7 @@ const struct sysent sysent[] = {
>   sys_listen },   /* 106 = listen */
>   { 4, s(struct sys_chflagsat_args), 0,
>   sys_chflagsat },/* 107 = chflagsat */
> - { 2, s(struct sys_pledge_args), 0,
> + { 2, s(struct sys_pledge_args), SY_NOLOCK | 0,
>   sys_pledge },   /* 108 = pledge */
>   { 4, s(struct sys_ppoll_args), 0,
>   sys_ppoll },/* 109 = ppoll */
> Index: kern/kern_event.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_event.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 kern_event.c
> --- kern/kern_event.c 27 Jun 2022 13:35:21 -  1.191
> +++ kern/kern_event.c 28 Jun 2022 13:55:18 -
> @@ -331,7 +331,7 @@ filt_procattach(struct knote *kn)
>   int s;
>  
>   if ((curproc->p_p->ps_flags & PS_PLEDGE) &&
> - (curproc->p_p->ps_pledge & PLEDGE_PROC) == 0)
> + (pledge_get(curproc->p_p) & PLEDGE_PROC) == 0)
>   return pledge_fail(curproc, EPERM, PLEDGE_PROC);
>  
>   if (kn->kn_id > PID_MAX)
> Index: kern/kern_pledge.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_pledge.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 kern_pledge.c
> --- kern/kern_pledge.c26 Jun 2022 06:11:49 -  1.282
> +++ kern/kern_pledge.c28 Jun 2022 15:21:46 -
> @@ -21,6 +21,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -465,13 +466,26 @@ sys_pledge(struct proc *p, void *v, regi
>   struct process *pr = p->p_p;
>   uint64_t p

Re: Fix the swapper

2022-06-28 Thread Martin Pieuchot
On 27/06/22(Mon) 15:44, Martin Pieuchot wrote:
> Diff below contain 3 parts that can be committed independently.  The 3
> of them are necessary to allow the pagedaemon to make progress in OOM
> situation and to satisfy all the allocations waiting for pages in
> specific ranges.
> 
> * uvm/uvm_pager.c part reserves a second segment for the page daemon.
>   This is necessary to ensure the two uvm_pagermapin() calls needed by
>   uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
>   necessary when encryption or bouncing is required)
> 
> * uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
>   for the same reason.  Note that a sleeping point is introduced because
>   the pagedaemon is faster than the asynchronous I/O and in OOM
>   situation it tends to stay busy building cluster that it then discard
>   because no memory is available.
> 
> * uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
>   list of pages to account for a given memory range.  Without this the
>   daemon could spin infinitely doing nothing because the global limits
>   are reached.

Here's an updated diff with a fix on top:

 * in uvm/uvm_swap.c make sure uvm_swap_allocpages() is allowed to sleep
   when coming from uvm_fault().  This makes the faulting process wait
   instead of dying when there aren't any free pages to do the bouncing.

I'd appreciate more reviews and tests !

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.80
diff -u -p -r1.80 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 12:10:37 -  1.80
+++ uvm/uvm_pager.c 28 Jun 2022 15:25:30 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup(&psegs);
 
-   if (pseg != &psegs[0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != &psegs[0] && pseg != &psegs[1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   28 Jun 2022 13:59:49 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_pa

Introduce uvm_pagewait()

2022-06-28 Thread Martin Pieuchot
I'd like to abstract the use of PG_WANTED to start unifying & cleaning
the various cases where a code path is waiting for a busy page.  Here's
the first step.

ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.90
diff -u -p -r1.90 uvm_amap.c
--- uvm/uvm_amap.c  30 Aug 2021 16:59:17 -  1.90
+++ uvm/uvm_amap.c  28 Jun 2022 11:53:08 -
@@ -781,9 +781,7 @@ ReStart:
 * it and then restart.
 */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
-   "cownow", INFSLP);
+   uvm_pagewait(pg, amap->am_lock, "cownow");
goto ReStart;
}
 
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  28 Jun 2022 11:53:08 -
@@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
pmap_page_protect(pg, PROT_NONE);
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;
}
uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
@@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
 
/* Make sure page is unbusy, else wait for it. */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE;
continue;
}
@@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
 
/* page is there, see if we need to wait on it */
if ((ptmp->pg_flags & PG_BUSY) != 0) {
-   atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
-   rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
-   "uao_get", INFSLP);
+   uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;   /* goto top of pps while loop */
}
 
Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_km.c
--- uvm/uvm_km.c7 Jun 2022 12:07:45 -   1.150
+++ uvm/uvm_km.c28 Jun 2022 11:53:08 -
@@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pp->pg_flags, PG_WANTED);
-   rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
-   INFSLP);
+   uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE; /* loop back to us */
continue;
}
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  28 Jun 2022 11:57:42 -
@@ -1087,6 +1087,23 @@ uvm_page_unbusy(struct vm_page **pgs, in
}
 }
 
+/*
+ * uvm_pagewait: wait for a busy page
+ *
+ * => page must be known PG_BUSY
+ * => object must be locked
+ * => object will be unlocked on return
+ */
+void
+uvm_pagewait(struct vm_page *pg, struct rwlock *lock, const char *wmesg)
+{
+   KASSERT(rw_lock_held(lock));
+   KASSERT((pg->pg_flags & PG_BUSY) != 0);
+
+   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
+   rwsleep_nsec(pg, lock, PVM | PNORELOCK, wmesg, INFSLP);
+}
+
 #if defined(UVM_PAGE_TRKOW

Re: kernel lock in arp

2022-06-27 Thread Martin Pieuchot
On 27/06/22(Mon) 19:11, Alexander Bluhm wrote:
> On Mon, Jun 27, 2022 at 11:49:23AM +0200, Alexander Bluhm wrote:
> > On Sat, May 21, 2022 at 10:50:28PM +0300, Vitaliy Makkoveev wrote:
> > > This diff looks good, except the re-check after kernel lock. It???s
> > > supposed `rt??? could became inconsistent, right? But what stops to
> > > make it inconsistent after first unlocked RTF_LLINFO flag check?
> > >
> > > I think this re-check should gone.
> >
> > I have copied the re-check from intenal genua code.  I am not sure
> > if it is really needed.  We know from Hrvoje that the diff with
> > re-check is stable.  And we know that it crashes without kernel
> > lock at all.
> >
> > I have talked with mpi@ about it.  The main problem is that we have
> > no write lock when we change RTF_LLINFO.  Then rt_llinfo can get
> > NULL or inconsistent.
> >
> > Plan is that I put some lock asserts into route add and delete.
> > This helps to find the parts that modify RTF_LLINFO and rt_llinfo
> > without exclusive lock.
> >
> > Maybe we need some kernel lock somewhere else.  Or we want to use
> > some ARP mutex.  We could also add some comment and commit the diff
> > that I have.  We know that it is faster and stable.  Pushing the
> > kernel lock down or replacing it with something clever can always
> > be done later.
> 
> We need the re-check.  I have tested it with a printf.  It is
> triggered by running arp -d in a loop while forwarding.
> 
> The concurrent threads are these:
> 
> rtrequest_delete(8000246b7428,3,80775048,8000246b7510,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a23550,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246b7678,3,8000246b7718,0) at rtrequest+0x55c
> rt_clone(8000246b7780,8000246b78f8,0) at rt_clone+0x73
> rtalloc_mpath(8000246b78f8,fd8003169ad8,0) at rtalloc_mpath+0x4c
> ip_forward(fd80b8cc7e00,8077d048,fd8834a230f0,0) at 
> ip_forward+0x137
> ip_input_if(8000246b7a28,8000246b7a34,4,0,8077d048) at 
> ip_input_if+0x353
> ipv4_input(8077d048,fd80b8cc7e00) at ipv4_input+0x39
> ether_input(8077d048,fd80b8cc7e00) at ether_input+0x3ad
> if_input_process(8077d048,8000246b7b18) at if_input_process+0x6f
> ifiq_process(8077d458) at ifiq_process+0x69
> taskq_thread(80036080) at taskq_thread+0x100
> 
> rtrequest_delete(8000246c8d08,3,80775048,8000246c8df0,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a230f0,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246c8f58,3,8000246c8ff8,0) at rtrequest+0x55c
> rt_clone(8000246c9060,8000246c90b8,0) at rt_clone+0x73
> rtalloc_mpath(8000246c90b8,fd8002c754d8,0) at rtalloc_mpath+0x4c
> in_ouraddr(fd8094771b00,8077d048,8000246c9138) at 
> in_ouraddr+0x84
> ip_input_if(8000246c91d8,8000246c91e4,4,0,8077d048) at 
> ip_input_if+0x1cd
> ipv4_input(8077d048,fd8094771b00) at ipv4_input+0x39
> ether_input(8077d048,fd8094771b00) at ether_input+0x3ad
> if_input_process(8077d048,8000246c92c8) at if_input_process+0x6f
> ifiq_process(80781400) at ifiq_process+0x69
> taskq_thread(80036200) at taskq_thread+0x100
> 
> I have added a comment why kernel lock protects us.  I would like
> to get this in.  It has been tested, reduces the kernel lock and
> is faster.  A more clever lock can be done later.
> 
> ok?

I don't understand how the KERNEL_LOCK() there prevents rtdeletemsg()
from running.  rtrequest_delete() seems completely broken: it assumes it
holds an exclusive lock.

To "fix" arp the KERNEL_LOCK() should also be taken in RTM_DELETE and
RTM_RESOLVE inside arp_rtrequest().  Or maybe around ifp->if_rtrequest()
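
A minimal sketch of that second option (illustrative only; the hook is
invoked from the RTM_ADD/RTM_DELETE paths in sys/net/route.c):

/*
 * Serialize the if_rtrequest() hook with the kernel lock so that
 * arp_rtrequest() runs under the same lock arpresolve() used to take.
 */
static inline void
rt_if_rtrequest(struct ifnet *ifp, int req, struct rtentry *rt)
{
	KERNEL_LOCK();
	ifp->if_rtrequest(ifp, req, rt);
	KERNEL_UNLOCK();
}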

But it doesn't mean there isn't another problem in rtdeletemsg()...

> Index: net/if_ethersubr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> retrieving revision 1.281
> diff -u -p -r1.281 if_ethersubr.c
> --- net/if_ethersubr.c26 Jun 2022 21:19:53 -  1.281
> +++ net/if_ethersubr.c27 Jun 2022 16:55:15 -
> @@ -221,10 +221,7 @@ ether_resolve(struct ifnet *ifp, struct 
>  
>   switch (af) {
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> - KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IP);
> @@ -285,10 +282,7 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #endif
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> -

CoW & neighbor pages

2022-06-27 Thread Martin Pieuchot
When faulting a page after a COW, neighboring pages are likely to already
be entered.  So speed up the fault by doing a narrow fault (do not try
to map in adjacent pages).

This is stolen from NetBSD.

ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.129
diff -u -p -r1.129 uvm_fault.c
--- uvm/uvm_fault.c 4 Apr 2022 09:27:05 -   1.129
+++ uvm/uvm_fault.c 27 Jun 2022 17:05:26 -
@@ -737,6 +737,16 @@ uvm_fault_check(struct uvm_faultinfo *uf
}
 
/*
+* for a case 2B fault waste no time on adjacent pages because
+* they are likely already entered.
+*/
+   if (uobj != NULL && amap != NULL &&
+   (flt->access_type & PROT_WRITE) != 0) {
+   /* wide fault (!narrow) */
+   flt->narrow = TRUE;
+   }
+
+   /*
 * establish range of interest based on advice from mapper
 * and then clip to fit map entry.   note that we only want
 * to do this the first time through the fault.   if we



Fix the swapper

2022-06-27 Thread Martin Pieuchot
The diff below contains 3 parts that can be committed independently.  All 3
of them are necessary to allow the pagedaemon to make progress in OOM
situations and to satisfy all the allocations waiting for pages in
specific ranges.

* uvm/uvm_pager.c part reserves a second segment for the page daemon.
  This is necessary to ensure the two uvm_pagermapin() calls needed by
  uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
  necessary when encryption or bouncing is required)

* uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
  for the same reason.  Note that a sleeping point is introduced because
  the pagedaemon is faster than the asynchronous I/O and in OOM
  situation it tends to stay busy building cluster that it then discard
  because no memory is available.

* uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
  list of pages to account for a given memory range.  Without this the
  daemon could spin infinitely doing nothing because the global limits
  are reached.

A lot could be improved, but this at least makes swapping work in OOM
situations.

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.78
diff -u -p -r1.78 uvm_pager.c
--- uvm/uvm_pager.c 18 Feb 2022 09:04:38 -  1.78
+++ uvm/uvm_pager.c 27 Jun 2022 08:44:41 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup(&psegs);
 
-   if (pseg != &psegs[0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != &psegs[0] && pseg != &psegs[1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   27 Jun 2022 13:24:54 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-   struct vm_page *swpps[MAXBSIZE >> PAGE_SHIFT];  /* XXX: see below */
+   struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -404,8 +404,27 @@ uvmpd_scan_inactive(struct pglist *pglst
swnpages = swcpages = 0;
free = 0;
  

pdaemon: reserve memory for swapping

2022-06-26 Thread Martin Pieuchot
uvm_swap_io() needs to perform up to 4 allocations to write pages to
disk.  In OOM situations uvm_swap_allocpages() always fails because the
kernel doesn't reserve enough pages.

The diff below sets `uvmexp.reserve_pagedaemon' to the number of pages
needed to write a cluster of pages to disk.  With this my machine does not
deadlock and can push pages to swap in the OOM case.

ok?
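
For example, assuming 4 KB pages and the usual 64 KB MAXBSIZE, this reserves
MAXBSIZE >> PAGE_SHIFT = 16 pages for the pagedaemon, enough for the swap
cluster handled by uvm_swap_allocpages(), and the kernel reserve becomes
16 + 4 = 20 pages.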

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  26 Jun 2022 08:17:34 -
@@ -280,10 +280,13 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 
/*
 * init reserve thresholds
-* XXXCDC - values may need adjusting
+*
+* The pagedaemon needs to always be able to write pages to disk,
+* Reserve the minimum amount of pages, a cluster, required by
+* uvm_swap_allocpages()
 */
-   uvmexp.reserve_pagedaemon = 4;
-   uvmexp.reserve_kernel = 8;
+   uvmexp.reserve_pagedaemon = (MAXBSIZE >> PAGE_SHIFT);
+   uvmexp.reserve_kernel = uvmexp.reserve_pagedaemon + 4;
uvmexp.anonminpct = 10;
uvmexp.vnodeminpct = 10;
uvmexp.vtextminpct = 5;



Re: ssh-add(1): fix NULL in fprintf

2022-06-16 Thread Martin Vahlensieck
ping, diff attached

On Mon, May 16, 2022 at 09:21:42PM +0200, Martin Vahlensieck wrote:
> Hi
> 
> What's the status on this?  Anthing required from my side?  I have
> reattached the patch (with the changes Theo suggested).
> 
> Best,
> 
> Martin
> 
> On Mon, May 09, 2022 at 08:39:38PM +0200, Martin Vahlensieck wrote:
> > On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> > > Martin Vahlensieck  wrote:
> > > 
> > > > if (!qflag) {
> > > > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > > -   sshkey_type(key), comment);
> > > > +   fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > > > +   sshkey_type(key), comment ? " (" : "",
> > > > +   comment ? comment : "", comment ? ")" : "");
> > > 
> > > this is probably better as something like
> > > 
> > > > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > > -   sshkey_type(key), comment ? comment : "no comment");
> > > 
> > > Which has a minor ambiguity, but probably harms noone.
> > > 
> > 
> > Index: ssh-add.c
> > ===
> > RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> > retrieving revision 1.165
> > diff -u -p -r1.165 ssh-add.c
> > --- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
> > +++ ssh-add.c   9 May 2022 18:36:54 -
> > @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
> > }
> > if (!qflag) {
> > fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -   sshkey_type(key), comment);
> > +   sshkey_type(key), comment ? comment : "no comment");
> > }
> > return 0;
> >  }
> > @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
> > certpath, filename);
> > sshkey_free(cert);
> > goto out;
> > -   } 
> > +   }
> >  
> > /* Graft with private bits */
> > if ((r = sshkey_to_certified(private)) != 0) {
> 
> Index: ssh-add.c
> ===
> RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 ssh-add.c
> --- ssh-add.c 4 Feb 2022 02:49:17 -   1.165
> +++ ssh-add.c 9 May 2022 18:36:54 -
> @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
>   }
>   if (!qflag) {
>   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> - sshkey_type(key), comment);
> + sshkey_type(key), comment ? comment : "no comment");
>   }
>   return 0;
>  }
> @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
>   certpath, filename);
>   sshkey_free(cert);
>   goto out;
> - } 
> + }
>  
>   /* Graft with private bits */
>   if ((r = sshkey_to_certified(private)) != 0) {
> 

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 18:36:54 -
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
}
if (!qflag) {
fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   sshkey_type(key), comment ? comment : "no comment");
}
return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



Re: set RTF_DONE in sysctl_dumpentry for the routing table

2022-06-08 Thread Martin Pieuchot
On 08/06/22(Wed) 16:13, Claudio Jeker wrote:
> Notice while hacking in OpenBGPD. Unlike routing socket messages the
> messages from the sysctl interface have RTF_DONE not set.
> I think it would make sense to set RTF_DONE also in this case since it
> makes reusing code easier.
> 
> All messages sent out via sysctl_dumpentry() have been processed by the
> kernel so setting RTF_DONE kind of makes sense.

I agree, ok mpi@

> -- 
> :wq Claudio
> 
> Index: rtsock.c
> ===
> RCS file: /cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.328
> diff -u -p -r1.328 rtsock.c
> --- rtsock.c  6 Jun 2022 14:45:41 -   1.328
> +++ rtsock.c  8 Jun 2022 14:10:20 -
> @@ -1987,7 +1987,7 @@ sysctl_dumpentry(struct rtentry *rt, voi
>   struct rt_msghdr *rtm = (struct rt_msghdr *)w->w_tmem;
>  
>   rtm->rtm_pid = curproc->p_p->ps_pid;
> - rtm->rtm_flags = rt->rt_flags;
> + rtm->rtm_flags = RTF_DONE | rt->rt_flags;
>   rtm->rtm_priority = rt->rt_priority & RTP_MASK;
>   rtm_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx);
>   /* Do not account the routing table's reference. */
> 



Re: Fix clearing of sleep timeouts

2022-06-06 Thread Martin Pieuchot
On 06/06/22(Mon) 06:47, David Gwynne wrote:
> On Sun, Jun 05, 2022 at 03:57:39PM +, Visa Hankala wrote:
> > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote:
> > > On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> > > > Encountered the following panic:
> > > > 
> > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" 
> > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373
> > > > Stopped at  db_enter+0x10:  popq%rbp
> > > > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > > >  423109  57118 55 0x3  02  link
> > > >  330695  30276 55 0x3  03  link
> > > > * 46366  85501 55  0x1003  0x40804001  link
> > > >  188803  85501 55  0x1003  0x40820000K link
> > > > db_enter() at db_enter+0x10
> > > > panic(81f25d2b) at panic+0xbf
> > > > __assert(81f9a186,81f372c8,175,81f87c6c) at 
> > > > __assert+0x25
> > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> > > > sleep_setup+0x1d8
> > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> > > > sys_nanosleep+0x12d
> > > > syscall(800022d64f60) at syscall+0x374
> > > > 
> > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> > > > soft-interrupt-driven timeouts could be deleted synchronously without
> > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type
> > > > of the timeout.
> > > > 
> > > > It looks that with small adjustments timeout_del_barrier() can sleep
> > > > in sleep_finish(). The management of run queues is not affected because
> > > > the timeout clearing happens after it. As timeout_del_barrier() does not
> > > > rely on a timeout or signal catching, there should be no risk of
> > > > unbounded recursion or unwanted signal side effects within the sleep
> > > > machinery. In a way, a sleep with a timeout is higher-level than
> > > > one without.
> > > 
> > > I trust you on the analysis.  However this looks very fragile to me.
> > > 
> > > The use of timeout_del_barrier() which can sleep using the global sleep
> > > queue is worrying me.  
> > 
> > I think the queue handling ends in sleep_finish() when SCHED_LOCK()
> > is released. The timeout clearing is done outside of it.
> 
> That's ok.
> 
> > The extra sleeping point inside sleep_finish() is subtle. It should not
> > matter in typical use. But is it permissible with the API? Also, if
> > timeout_del_barrier() sleeps, the thread's priority can change.
> 
> What other options do we have at this point? Spin? Allocate the timeout
> dynamically so sleep_finish doesn't have to wait for it and let the
> handler clean up? How would you stop the timeout handler waking up the
> thread if it's gone back to sleep again for some other reason?
> 
> Sleeping here is the least worst option.

I agree.  I don't think sleeping is bad here.  My concern is about how
sleeping is implemented.  There's a single API built on top of a single
global data structure which now calls itself recursively.  

I'm not sure how much work it would be to make cond_wait(9) use its own
sleep queue...  This is something independent from this fix though.

> As for timeout_del_barrier, if prio is a worry we can provide an
> advanced version of it that lets you pass the prio in. I'd also
> like to change timeout_barrier so it queues the barrier task at the
> head of the pending lists rather than at the tail.

I doubt prio matters.



Re: Fix clearing of sleep timeouts

2022-06-05 Thread Martin Pieuchot
On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> Encountered the following panic:
> 
> panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: 
> file "/usr/src/sys/kern/kern_synch.c", line 373
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>  423109  57118 55 0x3  02  link
>  330695  30276 55 0x3  03  link
> * 46366  85501 55  0x1003  0x40804001  link
>  188803  85501 55  0x1003  0x40820000K link
> db_enter() at db_enter+0x10
> panic(81f25d2b) at panic+0xbf
> __assert(81f9a186,81f372c8,175,81f87c6c) at 
> __assert+0x25
> sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> sleep_setup+0x1d8
> cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> sys_nanosleep+0x12d
> syscall(800022d64f60) at syscall+0x374
> 
> The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> soft-interrupt-driven timeouts could be deleted synchronously without
> blocking. Now, timeout_del_barrier() can sleep regardless of the type
> of the timeout.
> 
> It looks that with small adjustments timeout_del_barrier() can sleep
> in sleep_finish(). The management of run queues is not affected because
> the timeout clearing happens after it. As timeout_del_barrier() does not
> rely on a timeout or signal catching, there should be no risk of
> unbounded recursion or unwanted signal side effects within the sleep
> machinery. In a way, a sleep with a timeout is higher-level than
> one without.

I trust you on the analysis.  However this looks very fragile to me.

The use of timeout_del_barrier() which can sleep using the global sleep
queue is worrying me.  

> Note that endtsleep() can run and set P_TIMEOUT during
> timeout_del_barrier() when the thread is blocked in cond_wait().
> To avoid unnecessary atomic read-modify-write operations, the clearing
> of P_TIMEOUT could be conditional, but maybe that is an unnecessary
> optimization at this point.

I agree this optimization seems unnecessary at the moment.

> While it should be possible to make the code use timeout_del() instead
> of timeout_del_barrier(), the outcome might not be outright better. For
> example, sleep_setup() and endtsleep() would have to coordinate so that
> a late-running timeout from previous sleep cycle would not disturb the
> new cycle.

So that's the price for not having to sleep in sleep_finish(), right?

> To test the barrier path reliably, I made the code call
> timeout_del_barrier() twice in a row. The second call is guaranteed
> to sleep. Of course, this is not part of the patch.

ok mpi@

> Index: kern/kern_synch.c
> ===
> RCS file: src/sys/kern/kern_synch.c,v
> retrieving revision 1.187
> diff -u -p -r1.187 kern_synch.c
> --- kern/kern_synch.c 13 May 2022 15:32:00 -  1.187
> +++ kern/kern_synch.c 5 Jun 2022 05:04:45 -
> @@ -370,8 +370,8 @@ sleep_setup(struct sleep_state *sls, con
>   p->p_slppri = prio & PRIMASK;
>   TAILQ_INSERT_TAIL(&slpque[LOOKUP(ident)], p, p_runq);
>  
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   if (timo) {
> + KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   sls->sls_timeout = 1;
>   timeout_add(&p->p_sleep_to, timo);
>   }
> @@ -432,13 +432,12 @@ sleep_finish(struct sleep_state *sls, in
>  
>   if (sls->sls_timeout) {
>   if (p->p_flag & P_TIMEOUT) {
> - atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   error1 = EWOULDBLOCK;
>   } else {
> - /* This must not sleep. */
> + /* This can sleep. It must not use timeouts. */
>   timeout_del_barrier(&p->p_sleep_to);
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   }
> + atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   }
>  
>   /* Check if thread was woken up because of a unwind or signal */
> 
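
For reference, a sketch of the test hack described above (explicitly not
part of the patch): with the diff applied, the sleeping barrier path in
sleep_finish() can be forced by simply calling timeout_del_barrier()
twice in a row, e.g.:

		} else {
			/* This can sleep.  It must not use timeouts. */
			timeout_del_barrier(&p->p_sleep_to);
			/* test only: the second call is guaranteed to sleep */
			timeout_del_barrier(&p->p_sleep_to);
		}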



Re: start unlocking kbind(2)

2022-05-31 Thread Martin Pieuchot
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > +   cookie = SCARG(uap, proc_cookie);
> > +   if (pr->ps_kbind_addr == pc) {
> > +   membar_datadep_consumer();
> > +   if (pr->ps_kbind_cookie != cookie)
> > +   goto sigill;
> > +   } else {
> 
> You must use membar_consumer() here.  membar_datadep_consumer() is
> a barrier between reading pointer and pointed data.  Only alpha
> requires membar_datadep_consumer() for that, everywhere else it is
> a NOP.
> 
> > +   mtx_enter(&pr->ps_mtx);
> > +   kpc = pr->ps_kbind_addr;
> 
> Do we need kpc variable?  I would prefer to read explicit
> pr->ps_kbind_addr in the two places where we use it.
> 
> I think the logic of barriers and mutexes is correct.
> 
> with the suggestions above OK bluhm@

I believe you should go ahead with the current diff.  ok with me.  Moving
the field under the scope of another lock can be easily done afterward.
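
A minimal sketch of the read-side ordering being discussed (an
illustration only, not the committed code; it assumes the store side
publishes ps_kbind_cookie before ps_kbind_addr, e.g. under ps_mtx or
with membar_producer()):

	/* lock-free fast path, relies on the publish order of the fields */
	if (pr->ps_kbind_addr == pc) {
		/*
		 * ps_kbind_addr and ps_kbind_cookie are two independent
		 * loads, so a plain read barrier is needed here;
		 * membar_datadep_consumer() only orders a pointer load
		 * against loads through that pointer and is a NOP on
		 * everything but alpha.
		 */
		membar_consumer();
		if (pr->ps_kbind_cookie != cookie)
			goto sigill;
	}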



Re: allow 240/4 in various network daemons

2022-05-28 Thread Martin Schröder
On Sat, 28 May 2022 at 22:46, Seth David Schoen wrote:
> We're also interested in talking about whether there's an appropriate
> path for supporting non-broadcast use of addresses within 127/8, our
> most controversial change.  In Linux and FreeBSD, we're experimenting

IPv6 is now older than IPv4 was when v6 was introduced.
You are beating a very dead horse.

Best
Martin



Re: ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
On 24/05/22(Tue) 15:24, Mark Kettenis wrote:
> > Date: Tue, 24 May 2022 14:28:39 +0200
> > From: Martin Pieuchot 
> > 
> > The softdep code path is missing a UVM cache invalidation compared to
> > the !softdep one.  This is necessary to flush pages of a persisting
> > vnode.
> > 
> > Since uvm_vnp_setsize() is also called later in this function for the
> > !softdep case, move it so it is not called twice.
> > 
> > ok?
> 
> I'm not sure this is correct.  I'm trying to understand why you're
> moving the uvm_vnp_setsize() call.  Are you just trying to avoid calling
> it twice?  Or are you trying to avoid calling it at all when we end up in
> an error path?
>
> The way you moved it means we'll still call it twice for "partially
> truncated" files with softdeps.  At least the way I understand the
> code is that the code will fsync the vnode and drop down in the
> "normal" non-softdep code that will call uvm_vnp_setsize() (and
> uvm_vnp_uncache()) again.  So maybe you should move the
> uvm_vnp_setsize() call into the else case?

We might want to do that indeed.  I'm not sure what the implications are
of calling uvm_vnp_setsize/uncache() after VOP_FSYNC(), which might fail.
So I'd rather play it safe and go with that diff.

> > Index: ufs/ffs/ffs_inode.c
> > ===
> > RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
> > retrieving revision 1.81
> > diff -u -p -r1.81 ffs_inode.c
> > --- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
> > +++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
> > @@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
> > if (length > fs->fs_maxfilesize)
> > return (EFBIG);
> >  
> > -   uvm_vnp_setsize(ovp, length);
> > oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
> > = oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
> >  
> > if (DOINGSOFTDEP(ovp)) {
> > +   uvm_vnp_setsize(ovp, length);
> > +   (void) uvm_vnp_uncache(ovp);
> > if (length > 0 || softdep_slowdown(ovp)) {
> > /*
> >  * If a file is only partially truncated, then
> > 
> > 



Please test: rewrite of pdaemon

2022-05-24 Thread Martin Pieuchot
Diff below brings in & adapts most of the changes from NetBSD's r1.37 of
uvm_pdaemon.c.  My motivation for doing this is to untangle the inner
loop of uvmpd_scan_inactive(), which will allow us to split the global
`pageqlock' mutex as a next step.

The idea behind this change is to get rid of the too-complex uvm_pager*
abstraction by checking early if a page is going to be flushed or
swapped to disk.  The loop is then clearly divided into two cases, which
makes it more readable.

This also opens the door to a better integration between UVM's vnode
layer and the buffer cache.

The main loop of uvmpd_scan_inactive() can be understood as below:

. If a page can be flushed we can call "uvn_flush()" directly and pass the
  PGO_ALLPAGES flag instead of building a cluster beforehand.  Note that,
  in its current form uvn_flush() is synchronous.

. If the page needs to be swapped, mark it as PG_PAGEOUT, build a cluster
  and once it is full call uvm_swap_put(). 

Please test this diff; do not hesitate to play with the
`vm.swapencrypt.enable' sysctl(2).
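
As a pseudo-code sketch of the two cases above (the swapcluster_*()
helpers are invented names for illustration, this is not the actual
diff):

	/* simplified control flow of the reworked inactive-list scan */
	for (each inactive page pg) {			/* pseudo-code */
		if (pg belongs to a vnode and is to be flushed) {
			/* synchronous in the current form of uvn_flush() */
			uvn_flush(pg->uobject, 0, 0, PGO_CLEANIT|PGO_ALLPAGES);
		} else {
			/* swap-backed: defer the I/O until a cluster fills */
			atomic_setbits_int(&pg->pg_flags, PG_PAGEOUT);
			swapcluster_add(&swc, pg);	/* invented helper */
			if (swapcluster_full(&swc))	/* invented helper */
				uvm_swap_put(swslot, swpps, npages, 0);
		}
	}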

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  24 May 2022 12:31:34 -
@@ -143,7 +143,7 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
+int uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +241,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  24 May 2022 12:31:34 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.291
diff -u -p -r1.291 uvm_map.c
--- uvm/uvm_map.c   4 May 2022 14:58:26 -   1.291
+++ uvm/uvm_map.c   24 May 2022 12:31:34 -
@@ -3215,8 +3215,9 @@ uvm_object_printit(struct uvm_object *uo
  * uvm_page_printit: actually print the page
  */
 static const char page_flagbits[] =
-   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5CLEANCHK\6RELEASED\7FAKE\10RDONLY"
-   "\11ZERO\12DEV\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
+   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5PAGEOUT\6RELEASED\7FAKE\10RDONLY"
+   "\11ZERO\12DEV\13CLEANCHK"
+   "\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
"\27ENCRYPT\31PMAP0\32PMAP1\33PMAP2\34PMAP3\35PMAP4\36PMAP5";
 
 void
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  24 May 2022 12:32:54 -
@@ -960,6 +960,7 @@ uvm_pageclean(struct vm_page *pg)
 {
u_int flags_to_clear = 0;
 
+   KASSERT((pg->pg_flags & PG_PAGEOUT) == 0);
if ((pg->pg_flags & (PG_TABLED|PQ_ACTIVE|PQ_INACTIVE)) &&
(pg->uobject == NULL || !UVM_OBJ_IS_PMAP(pg->uobject)))
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
@@ -978,11 +979,14 @@ uvm_pageclean(struct vm_page *pg)
rw_write_held(pg->uanon->an_lock));
 
/*
-* if the page was an object page (and thus "TABLED"), remove it
-* from the object.
+* remove page from its object or anon.
 */
-   if (pg->pg_flags & PG_TABLED)
+   if (pg->pg_flags & PG_TABLED) {
uvm_pageremove(pg);
+   } else if (pg->uanon != NULL) {
+   pg->uanon->an_page = NULL;
+   pg->uanon = NULL;
+   }
 
/*
 * now remove the page from the queues
@@ -996,10 +1000,6 @@ uvm_pageclean(struct vm_page *pg)
pg->wire_count = 0;
uvmexp.wired--;
}
-   if (pg->uanon) {
-   pg->uanon->an_page = NULL;
-   pg->uanon = NULL;
-

ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
The softdep code path is missing a UVM cache invalidation compared to
the !softdep one.  This is necessary to flush pages of a persisting
vnode.

Since uvm_vnp_setsize() is also called later in this function for the
!softdep case, move it so it is not called twice.

ok?

Index: ufs/ffs/ffs_inode.c
===
RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.81
diff -u -p -r1.81 ffs_inode.c
--- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
+++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
@@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
if (length > fs->fs_maxfilesize)
return (EFBIG);
 
-   uvm_vnp_setsize(ovp, length);
oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
= oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
 
if (DOINGSOFTDEP(ovp)) {
+   uvm_vnp_setsize(ovp, length);
+   (void) uvm_vnp_uncache(ovp);
if (length > 0 || softdep_slowdown(ovp)) {
/*
 * If a file is only partially truncated, then



Re: Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-24 Thread Martin Pieuchot
On 17/05/22(Tue) 16:55, Martin Pieuchot wrote:
> nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
> possibly mmap'ed file before calling VOP_RENAME().
> 
> ok?

Anyone?

> Index: nfs/nfs_serv.c
> ===
> RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 nfs_serv.c
> --- nfs/nfs_serv.c11 Mar 2021 13:31:35 -  1.120
> +++ nfs/nfs_serv.c4 May 2022 15:29:06 -
> @@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
>   error = -1;
>  out:
>   if (!error) {
> + if (tvp) {
> + (void)uvm_vnp_uncache(tvp);
> + }
>   error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
>  tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
>   } else {
> 



Re: start unlocking kbind(2)

2022-05-17 Thread Martin Pieuchot
On 17/05/22(Tue) 10:44, David Gwynne wrote:
> this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> argument checks can be done without the kernel lock.
> 
> care is taken to allow the pc/cookie checks to run without any lock by
> being careful with the order of the checks. all modifications to the
> pc/cookie state are serialised by the per process mutex.

I don't understand why it is safe to do the following check without
holding a mutex:

if (pr->ps_kbind_addr == pc)
...

Is there much difference when always grabbing the per-process mutex?
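
For comparison, a rough sketch of the always-locked variant being asked
about here (illustration only; it leaves out the NULL-paramp and
BOGO_PC handling):

	mtx_enter(&pr->ps_mtx);
	if (pr->ps_kbind_addr == 0) {
		/* first caller wins and records its pc/cookie */
		pr->ps_kbind_addr = pc;
		pr->ps_kbind_cookie = cookie;
	} else if (pr->ps_kbind_addr != pc ||
	    pr->ps_kbind_cookie != cookie) {
		mtx_leave(&pr->ps_mtx);
		goto sigill;
	}
	mtx_leave(&pr->ps_mtx);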

> i dont know enough about uvm to say whether it is safe to unlock the
> actual memory updates too, but even if i was confident i would still
> prefer to change it as a separate step.

I agree.

> Index: kern/init_sysent.c
> ===
> RCS file: /cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.236
> diff -u -p -r1.236 init_sysent.c
> --- kern/init_sysent.c1 May 2022 23:00:04 -   1.236
> +++ kern/init_sysent.c17 May 2022 00:36:03 -
> @@ -1,4 +1,4 @@
> -/*   $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
> @@ -204,7 +204,7 @@ const struct sysent sysent[] = {
>   sys_utimensat },/* 84 = utimensat */
>   { 2, s(struct sys_futimens_args), 0,
>   sys_futimens }, /* 85 = futimens */
> - { 3, s(struct sys_kbind_args), 0,
> + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
>   sys_kbind },/* 86 = kbind */
>   { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
>   sys_clock_gettime },/* 87 = clock_gettime */
> Index: kern/syscalls.master
> ===
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.223
> diff -u -p -r1.223 syscalls.master
> --- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
> +++ kern/syscalls.master  17 May 2022 00:36:03 -
> @@ -194,7 +194,7 @@
>   const struct timespec *times, int flag); }
>  85   STD { int sys_futimens(int fd, \
>   const struct timespec *times); }
> -86   STD { int sys_kbind(const struct __kbind *param, \
> +86   STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
>   size_t psize, int64_t proc_cookie); }
>  87   STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
>   struct timespec *tp); }
> Index: uvm/uvm_mmap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> retrieving revision 1.169
> diff -u -p -r1.169 uvm_mmap.c
> --- uvm/uvm_mmap.c19 Jan 2022 10:43:48 -  1.169
> +++ uvm/uvm_mmap.c17 May 2022 00:36:03 -
> @@ -70,6 +70,7 @@
>  #include 
>  #include   /* for KBIND* */
>  #include 
> +#include 
>  
>  #include /* for __LDPGSZ */
>  
> @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
>   const char *data;
>   vaddr_t baseva, last_baseva, endva, pageoffset, kva;
>   size_t psize, s;
> - u_long pc;
> + u_long pc, kpc;
>   int count, i, extra;
> + uint64_t cookie;
>   int error;
>  
>   /*
>* extract syscall args from uap
>*/
>   paramp = SCARG(uap, param);
> - psize = SCARG(uap, psize);
>  
>   /* a NULL paramp disables the syscall for the process */
>   if (paramp == NULL) {
> + mtx_enter(&pr->ps_mtx);
>   if (pr->ps_kbind_addr != 0)
> - sigexit(p, SIGILL);
> + goto leave_sigill;
>   pr->ps_kbind_addr = BOGO_PC;
> + mtx_leave(&pr->ps_mtx);
>   return 0;
>   }
>  
>   /* security checks */
> +
> + /*
> +  * ps_kbind_addr can only be set to 0 or BOGO_PC by the
> +  * kernel, not by a call from userland.
> +  */
>   pc = PROC_PC(p);
> - if (pr->ps_kbind_addr == 0) {
> - pr->ps_kbind_addr = pc;
> - pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
> - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
> - sigexit(p, SIGILL);
> - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
> - sigexit(p, SIGILL);
> + if (pc == 0 || pc == BOGO_PC)
> + goto sigill;
> +
> + cookie = SCARG(uap, proc_cookie);
> + if (pr->ps_kbind_addr == pc) {
> + membar_datadep_consumer();
> + if (pr->ps_kbind_cookie != cookie)
> + goto sigill;
> + } else {
> + mtx_enter(&pr->ps_mtx);
> + kpc = pr->ps_kbind_addr;
> +
> + /*
> +  * If we're the first thread in (kpc is 0), then
> +  

Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-17 Thread Martin Pieuchot
nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
possibly mmap'ed file before calling VOP_RENAME().

ok?

Index: nfs/nfs_serv.c
===
RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
retrieving revision 1.120
diff -u -p -r1.120 nfs_serv.c
--- nfs/nfs_serv.c  11 Mar 2021 13:31:35 -  1.120
+++ nfs/nfs_serv.c  4 May 2022 15:29:06 -
@@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
error = -1;
 out:
if (!error) {
+   if (tvp) {
+   (void)uvm_vnp_uncache(tvp);
+   }
error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
   tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
} else {



Re: ssh-add(1): fix NULL in fprintf

2022-05-16 Thread Martin Vahlensieck
Hi

What's the status on this?  Anything required from my side?  I have
reattached the patch (with the changes Theo suggested).

Best,

Martin

On Mon, May 09, 2022 at 08:39:38PM +0200, Martin Vahlensieck wrote:
> On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> > Martin Vahlensieck  wrote:
> > 
> > >   if (!qflag) {
> > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > - sshkey_type(key), comment);
> > > + fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > > + sshkey_type(key), comment ? " (" : "",
> > > + comment ? comment : "", comment ? ")" : "");
> > 
> > this is probably better as something like
> > 
> > > - fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > > - sshkey_type(key), comment ? comment : "no comment");
> > 
> > Which has a minor ambiguity, but probably harms no one.
> > 
> 
> Index: ssh-add.c
> ===
> RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 ssh-add.c
> --- ssh-add.c 4 Feb 2022 02:49:17 -   1.165
> +++ ssh-add.c 9 May 2022 18:36:54 -
> @@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
>   }
>   if (!qflag) {
>   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> - sshkey_type(key), comment);
> + sshkey_type(key), comment ? comment : "no comment");
>   }
>   return 0;
>  }
> @@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
>   certpath, filename);
>   sshkey_free(cert);
>   goto out;
> - } 
> + }
>  
>   /* Graft with private bits */
>   if ((r = sshkey_to_certified(private)) != 0) {

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 18:36:54 -
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
}
if (!qflag) {
fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   sshkey_type(key), comment ? comment : "no comment");
}
return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



Re: Picky, but much more efficient arc4random_uniform!

2022-05-14 Thread Matthew Martin
#include <stdio.h>
#include <stdlib.h>	/* arc4random() on OpenBSD */

/* arc4random_uniform_fast_simple() is the function proposed earlier
 * in this thread. */

int
main(void) {
	int results[3] = { 0, 0, 0 };
	for (int i = 0; i < 100000; i++) {
		results[arc4random_uniform_fast_simple(3)]++;
	}
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);

	return 0;
}

% ./a.out
0: 24809
1: 50011
2: 25180

You can't reuse bits because they'll be biased.
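
One hypothetical way such a split can arise (an assumption about the
mechanism, not necessarily what the proposed
arc4random_uniform_fast_simple() does): for upper_bound 3 the rejected
value is 3, i.e. binary 11.  If the stale low bit is kept and only the
high bit is redrawn, the retry can only yield 1 or 3, so 1 ends up with
probability 1/4 + 1/4 = 1/2 while 0 and 2 keep 1/4 each -- exactly the
~25/50/25 split above.  A self-contained demonstration:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>	/* arc4random() on OpenBSD */

/* Deliberately biased: on rejection, reuse the stale low bit. */
static uint32_t
biased_uniform3(void)
{
	uint32_t v = arc4random() & 3;

	while (v == 3)
		v = (arc4random() & 2) | (v & 1);
	return v;
}

int
main(void)
{
	int results[3] = { 0, 0, 0 };

	for (int i = 0; i < 100000; i++)
		results[biased_uniform3()]++;
	for (int i = 0; i < 3; i++)
		printf("%d: %d\n", i, results[i]);
	return 0;
}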



libcrypto/err_prn.c: skip BIO*

2022-05-12 Thread Martin Vahlensieck
Hi

As far as I can tell, this ends up calling vprintf eventually, so
skip the steps in between.

Best,

Martin

Index: err_prn.c
===
RCS file: /home/reposync/cvs/src/lib/libcrypto/err/err_prn.c,v
retrieving revision 1.19
diff -u -p -r1.19 err_prn.c
--- err_prn.c   7 Jan 2022 09:02:18 -   1.19
+++ err_prn.c   7 Jan 2022 16:13:48 -
@@ -92,12 +92,7 @@ ERR_print_errors_cb(int (*cb)(const char
 static int
 print_fp(const char *str, size_t len, void *fp)
 {
-   BIO bio;
-
-   BIO_set(&bio, BIO_s_file());
-   BIO_set_fp(&bio, fp, BIO_NOCLOSE);
-
-   return BIO_printf(&bio, "%s", str);
+   return fprintf(fp, "%s", str);
 }
 
 void



apply(1): constify two arguments

2022-05-12 Thread Martin Vahlensieck
Index: apply.c
===
RCS file: /cvs/src/usr.bin/apply/apply.c,v
retrieving revision 1.29
diff -u -p -r1.29 apply.c
--- apply.c 1 Apr 2018 17:45:05 -   1.29
+++ apply.c 12 May 2022 21:14:04 -
@@ -54,7 +54,7 @@ char  *str;
 size_t  sz;
 
 void
-stradd(char *p)
+stradd(const char *p)
 {
size_t n;
 
@@ -73,7 +73,7 @@ stradd(char *p)
 }
 
 void
-strset(char *p)
+strset(const char *p)
 {
if (str != NULL)
str[0] = '\0';



Re: uvm_pagedequeue()

2022-05-12 Thread Martin Pieuchot
On 10/05/22(Tue) 20:23, Mark Kettenis wrote:
> > Date: Tue, 10 May 2022 18:45:21 +0200
> > From: Martin Pieuchot 
> > 
> > On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> > > Diff below introduces a new wrapper to manipulate active/inactive page
> > > queues. 
> > > 
> > > ok?
> > 
> > Anyone?
> 
> Sorry I started looking at this and got distracted.
> 
> I'm not sure about the changes to uvm_pageactivate().  It doesn't
> quite match what NetBSD does, but I guess NetBSD assumes that
> uvm_pageactiave() isn't called for a page that is already active?  And
> that's something we can't guarantee?

It does match what NetBSD did 15 years ago.  We're not ready to synchronize
with NetBSD -current yet. 

We're getting there!

> The diff is correct though in the sense that it is equivalent to the
> code we already have.  So if this definitely is the direction you want
> to go:
> 
> ok kettenis@
> 
> > > Index: uvm/uvm_page.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> > > retrieving revision 1.165
> > > diff -u -p -r1.165 uvm_page.c
> > > --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> > > +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> > > @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
> > >   /*
> > >* now remove the page from the queues
> > >*/
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > - flags_to_clear |= PQ_ACTIVE;
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> > > - flags_to_clear |= PQ_INACTIVE;
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >  
> > >   /*
> > >* if the page was wired, unwire it now.
> > > @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
> > >   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
> > >  
> > >   if (pg->wire_count == 0) {
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > - atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> > > - atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >   uvmexp.wired++;
> > >   }
> > >   pg->wire_count++;
> > > @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
> > >   KASSERT(uvm_page_owner_locked_p(pg));
> > >   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
> > >  
> > > + uvm_pagedequeue(pg);
> > > + if (pg->wire_count == 0) {
> > > + TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> > > + atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > + uvmexp.active++;
> > > +
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * uvm_pagedequeue: remove a page from any paging queue
> > > + */
> > > +void
> > > +uvm_pagedequeue(struct vm_page *pg)
> > > +{
> > > + if (pg->pg_flags & PQ_ACTIVE) {
> > > + TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> > > + atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> > > + uvmexp.active--;
> > > + }
> > >   if (pg->pg_flags & PQ_INACTIVE) {
> > >   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> > >   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
> > >   uvmexp.inactive--;
> > >   }
> > > - if (pg->wire_count == 0) {
> > > - /*
> > > -  * if page is already active, remove it from list so we
> > > -  * can put it at tail.  if it wasn't active, then mark
> > > -  * it active and bump active count
> > > -  */
> > > - if (pg->pg_flags & PQ_ACTIVE)
> > > - TAILQ_REMOVE(&uvm.page_active, pg, pag

Re: Mark pw_error __dead in util.h

2022-05-10 Thread Matthew Martin
On Tue, May 03, 2022 at 10:37:36PM -0500, Matthew Martin wrote:
> The function is already marked __dead in passwd.c, so it appears to just be
> an oversight.

ping

diff --git util.h util.h
index dd64f478e23..752f8bb9fc5 100644
--- util.h
+++ util.h
@@ -97,7 +97,7 @@ void  pw_edit(int, const char *);
 void   pw_prompt(void);
 void   pw_copy(int, int, const struct passwd *, const struct passwd *);
 intpw_scan(char *, struct passwd *, int *);
-void   pw_error(const char *, int, int);
+__dead voidpw_error(const char *, int, int);
 intgetptmfd(void);
 intopenpty(int *, int *, char *, const struct termios *,
const struct winsize *);



Re: uvm_pagedequeue()

2022-05-10 Thread Martin Pieuchot
On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> Diff below introduces a new wrapper to manipulate active/inactive page
> queues. 
> 
> ok?

Anyone?

> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 uvm_page.c
> --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
>   /*
>* now remove the page from the queues
>*/
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> - flags_to_clear |= PQ_ACTIVE;
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> - flags_to_clear |= PQ_INACTIVE;
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>  
>   /*
>* if the page was wired, unwire it now.
> @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
>   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
>  
>   if (pg->wire_count == 0) {
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> - atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
> - atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>   uvmexp.wired++;
>   }
>   pg->wire_count++;
> @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
>   KASSERT(uvm_page_owner_locked_p(pg));
>   MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
>  
> + uvm_pagedequeue(pg);
> + if (pg->wire_count == 0) {
> + TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> + atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> + uvmexp.active++;
> +
> + }
> +}
> +
> +/*
> + * uvm_pagedequeue: remove a page from any paging queue
> + */
> +void
> +uvm_pagedequeue(struct vm_page *pg)
> +{
> + if (pg->pg_flags & PQ_ACTIVE) {
> + TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> + atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
> + uvmexp.active--;
> + }
>   if (pg->pg_flags & PQ_INACTIVE) {
>   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
>   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
>   uvmexp.inactive--;
>   }
> - if (pg->wire_count == 0) {
> - /*
> -  * if page is already active, remove it from list so we
> -  * can put it at tail.  if it wasn't active, then mark
> -  * it active and bump active count
> -  */
> - if (pg->pg_flags & PQ_ACTIVE)
> - TAILQ_REMOVE(&uvm.page_active, pg, pageq);
> - else {
> - atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
> - uvmexp.active++;
> - }
> -
> - TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
> - }
>  }
> -
>  /*
>   * uvm_pagezero: zero fill a page
>   */
> Index: uvm/uvm_page.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> retrieving revision 1.67
> diff -u -p -r1.67 uvm_page.h
> --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
>  #endif
>  
>  void uvm_pageactivate(struct vm_page *);
> +void uvm_pagedequeue(struct vm_page *);
>  vaddr_t  uvm_pageboot_alloc(vsize_t);
>  void uvm_pagecopy(struct vm_page *, struct vm_page *);
>  void uvm_pagedeactivate(struct vm_page *);
> 



Re: ssh-add(1): fix NULL in fprintf

2022-05-09 Thread Martin Vahlensieck
On Mon, May 09, 2022 at 10:42:29AM -0600, Theo de Raadt wrote:
> Martin Vahlensieck  wrote:
> 
> > if (!qflag) {
> > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -   sshkey_type(key), comment);
> > +   fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
> > +   sshkey_type(key), comment ? " (" : "",
> > +   comment ? comment : "", comment ? ")" : "");
> 
> this is probably better as something like
> 
> > -   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
> > -   sshkey_type(key), comment ? comment : "no comment");
> 
> Which has a minor ambiguity, but probably harms no one.
> 

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 18:36:54 -
@@ -118,7 +118,7 @@ delete_one(int agent_fd, const struct ss
}
if (!qflag) {
fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   sshkey_type(key), comment ? comment : "no comment");
}
return 0;
 }
@@ -392,7 +392,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {



ssh-add(1): fix NULL in fprintf

2022-05-09 Thread Martin Vahlensieck
Hi

When removing an identity from the agent using the private key file,
ssh-add first tries to find the public key file.  If that fails,
it loads the public key from the private key file, but no comment
is loaded.  This means comment is NULL when it is used inside
delete_one to print `Identity removed: ...'

Below is a diff which only prints the parentheses and the comment if the
comment is not NULL.  Something similar is done in ssh-keygen.c lines
2423-2425.

So with the following setup:
$ ssh-keygen -t ed25519 -f demo -C demo -N ''
$ mv demo.pub demo_pub
$ ssh-add demo
Identity added: demo (demo)
Before:
$ ssh-add -d demo
Identity removed: demo ED25519 ((null))
$ tail -n 1 /var/log/messages
May  9 18:15:53 demo ssh-add: vfprintf %s NULL in "Identity removed: %s %s 
(%s) "
After:
$ ssh-add -d demo
Identity removed: demo ED25519

Best,

Martin

P.S.: While here remove a trailing space as well.

Index: ssh-add.c
===
RCS file: /cvs/src/usr.bin/ssh/ssh-add.c,v
retrieving revision 1.165
diff -u -p -r1.165 ssh-add.c
--- ssh-add.c   4 Feb 2022 02:49:17 -   1.165
+++ ssh-add.c   9 May 2022 16:04:14 -
@@ -117,8 +117,9 @@ delete_one(int agent_fd, const struct ss
return r;
}
if (!qflag) {
-   fprintf(stderr, "Identity removed: %s %s (%s)\n", path,
-   sshkey_type(key), comment);
+   fprintf(stderr, "Identity removed: %s %s%s%s%s\n", path,
+   sshkey_type(key), comment ? " (" : "",
+   comment ? comment : "", comment ? ")" : "");
}
return 0;
 }
@@ -392,7 +393,7 @@ add_file(int agent_fd, const char *filen
certpath, filename);
sshkey_free(cert);
goto out;
-   } 
+   }
 
/* Graft with private bits */
if ((r = sshkey_to_certified(private)) != 0) {


