Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-23 Thread Martin Pieuchot
On 23/11/22(Wed) 16:34, Mark Kettenis wrote:
> > Date: Wed, 23 Nov 2022 10:52:32 +0100
> > From: Martin Pieuchot 
> > 
> > On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > > > Date: Tue, 22 Nov 2022 17:47:44 +
> > > > From: Miod Vallat 
> > > > 
> > > > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
> > > > > triggered the original "vref used where vget required" problem?
> > > > 
> > > > On a similar machine it panics after a few hours with:
> > > > 
> > > > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > > > 
> > > > The trace (transcribed by hand) is
> > > > uvn_flush+0x820
> > > > uvm_vnp_terminate+0x79
> > > > vclean+0xdc
> > > > vgonel+0x70
> > > > getnewvnode+0x240
> > > > ffs_vget+0xcc
> > > > ffs_inode_alloc+0x13c
> > > > ufs_makeinode+0x94
> > > > ufs_create+0x58
> > > > VOP_CREATE+0x48
> > > > vn_open+0x188
> > > > doopenat+0x1b4
> > > 
> > > Ah right, there is another path where we end up with a refcount of
> > > zero.  Should be fixable, but I need to think about this for a bit.
> > 
> > I'm not sure I understand what you mean by a refcount of 0.  Could you
> > elaborate?
> 
> Sorry, I was thinking ahead a bit.  I'm pretty much convinced that the
> issue we're dealing with is a race between a vnode being
> recycled/cleaned and the pagedaemon paging out pages associated with
> that same vnode.
> 
> The crashes we've seen before were all in the pagedaemon path where we
> end up calling into the VFS layer with a vnode that has v_usecount ==
> 0.  My "fix" avoids that, but hits the issue that when we are in the
> codepath that is recycling/cleaning the vnode, we can't use vget() to
> get a reference to the vnode since it checks that the vnode isn't in
> the process of being cleaned.
> 
> But if we avoid that issue (by, for example, skipping the vget() call
> if the UVM_VNODE_DYING flag is set), we run into the same scenario
> where we call into the VFS layer with v_usecount == 0.  Now that may
> not actually be a problem, but I need to investigate this a bit more.

When the UVM_VNODE_DYING flag is set the caller always owns a valid
reference to the vnode: either it is in the process of cleaning it via
uvm_vnp_terminate(), or uvn_detach() has been called, which means the
reference to the vnode hasn't been dropped yet.  So I believe
`v_usecount' for such a vnode is positive.
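
(Purely as an illustration of that invariant, not something from any posted
diff: if uvn_io() were to special-case a dying uvn instead of calling
vget(), it could assert the reference it relies on.)

	/* sketch only: a DYING uvn implies its caller still holds a vnode ref */
	if (uvn->u_flags & UVM_VNODE_DYING) {
		/* uvm_vnp_terminate() or uvn_detach() hasn't dropped it yet */
		KASSERT(vp->v_usecount > 0);
	}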

> Or maybe calling into the VFS layer with a vnode that has v_usecount
> == 0 is perfectly fine and we should do the vget() dance I propose in
> uvm_vnp_uncache() instead of in uvn_put().

I'm not following.  uvm_vnp_uncache() is always called with a valid
vnode, no?



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-23 Thread Martin Pieuchot
On 22/11/22(Tue) 23:40, Mark Kettenis wrote:
> > Date: Tue, 22 Nov 2022 17:47:44 +
> > From: Miod Vallat 
> > 
> > > Here is a diff.  Maybe bluhm@ can try this on the macppc machine that
> > > triggered the original "vref used where vget required" problem?
> > 
> > On a similar machine it panics after a few hours with:
> > 
> > panic: uvn_flush: PGO_SYNCIO return 'try again' error (impossible)
> > 
> > The trace (transcribed by hand) is
> > uvn_flush+0x820
> > uvm_vnp_terminate+0x79
> > vclean+0xdc
> > vgonel+0x70
> > getnewvnode+0x240
> > ffs_vget+0xcc
> > ffs_inode_alloc+0x13c
> > ufs_makeinode+0x94
> > ufs_create+0x58
> > VOP_CREATE+0x48
> > vn_open+0x188
> > doopenat+0x1b4
> 
> Ah right, there is another path where we end up with a refcount of
> zero.  Should be fixable, but I need to think about this for a bit.

I'm not sure I understand what you mean by a refcount of 0.  Could you
elaborate?

My understanding of the reported panic is that the proposed diff creates
a complicated relationship between the vnode and the UVM vnode layer.
The above problem occurs because VXLOCK is set on a vnode being recycled
*before* uvm_vnp_terminate() is called.  Now that uvn_io() calls vget(9),
the call fails because VXLOCK is set, which is what we want during
vclean(9).
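
(For context, a simplified sketch of the check in vget(9) that makes the
call fail here; the real code in kern/vfs_subr.c also handles waiting and
other flags, so treat this as illustrative only.)

	/* simplified: vget(9) refuses vnodes being cleaned out */
	if (vp->v_flag & VXLOCK) {
		/* vclean(9) is running; don't hand out a new reference */
		return (ENOENT);
	}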



Re: Get rid of UVM_VNODE_CANPERSIST

2022-11-22 Thread Martin Pieuchot
On 18/11/22(Fri) 21:33, Mark Kettenis wrote:
> > Date: Thu, 17 Nov 2022 20:23:37 +0100
> > From: Mark Kettenis 
> > 
> > > From: Jeremie Courreges-Anglas 
> > > Date: Thu, 17 Nov 2022 18:00:21 +0100
> > > 
> > > On Tue, Nov 15 2022, Martin Pieuchot  wrote:
> > > > UVM vnode objects include a reference count to keep track of the number
> > > > of processes that have the corresponding pages mapped in their VM space.
> > > >
> > > > When the last process referencing a given library or executable dies,
> > > > the reaper will munmap this object on its behalf.  When this happens it
> > > > doesn't free the associated pages to speed-up possible re-use of the
> > > > file.  Instead the pages are placed on the inactive list but stay ready
> > > > to be pmap_enter()'d without requiring I/O as soon as a new process
> > > > needs to access them.
> > > >
> > > > The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
> > > > doesn't work well with swapping [0].  For some reason when the page 
> > > > daemon
> > > > wants to free pages on the inactive list it tries to flush the pages to
> > > > disk and calls panic(9) because it needs a valid reference to the vnode to do
> > > > so.
> > > >
> > > > This indicates that the mechanism described above, which seems to work
> > > > fine for RO mappings, is currently buggy in more complex situations.
> > > > Flushing the pages when the last reference of the UVM object is dropped
> > > > also doesn't seem to be enough as bluhm@ reported [1].
> > > >
> > > > The diff below, which has already been committed and reverted, gets rid of
> > > > the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
> > > > the arm64 caching bug has been found and fixed.
> > > >
> > > > Getting rid of this logic means more I/O will be generated and pages
> > > > might have a faster reuse cycle.  I'm aware this might introduce a small
> > > > slowdown,
> > > 
> > > Numbers for my usual make -j4 in libc,
> > > on an Unmatched riscv64 box, now:
> > >16m32.65s real21m36.79s user30m53.45s system
> > >16m32.37s real21m33.40s user31m17.98s system
> > >16m32.63s real21m35.74s user31m12.01s system
> > >16m32.13s real21m36.12s user31m06.92s system
> > > After:
> > >19m14.15s real21m09.39s user36m51.33s system
> > >19m19.11s real21m02.61s user36m58.46s system
> > >19m21.77s real21m09.23s user37m03.85s system
> > >19m09.39s real21m08.96s user36m36.00s system
> > > 
> > > 4 cores amd64 VM, before (-current plus an other diff):
> > >1m54.31s real 2m47.36s user 4m24.70s system
> > >1m52.64s real 2m45.68s user 4m23.46s system
> > >1m53.47s real 2m43.59s user 4m27.60s system
> > > After:
> > >2m34.12s real 2m51.15s user 6m20.91s system
> > >2m34.30s real 2m48.48s user 6m23.34s system
> > >2m37.07s real 2m49.60s user 6m31.53s system
> > > 
> > > > however I believe we should work towards loading files from the
> > > > buffer cache to save I/O cycles instead of having another layer of 
> > > > cache.
> > > > Such work isn't trivial and making sure the vnode <-> UVM relation is
> > > > simple and well understood is the first step in this direction.
> > > >
> > > > I'd appreciate it if the diff below could be tested on many architectures,
> > > > including the offending rpi4.
> > > 
> > > Mike has already tested a make build on a riscv64 Unmatched.  I have
> > > also run regress in sys, lib/libc and lib/libpthread on that arch.  As
> > > far as I can see this looks stable on my machine, but what I really care
> > > about is the riscv64 bulk build cluster (I'm going to start another
> > > bulk build soon).
> > > 
> > > > Comments?  Oks?
> > > 
> > > The performance drop in my microbenchmark kinda worries me but it's only
> > > a microbenchmark...
> > 
> > I wouldn't call this a microbenchmark.  I fear this is typical for
> > builds of anything on clang architectures.  And I expect it to be
> > worse on single-processor machines where *every* time we execute clang
> > or lld all the pages are thrown 

Get rid of UVM_VNODE_CANPERSIST

2022-11-15 Thread Martin Pieuchot
UVM vnode objects include a reference count to keep track of the number
of processes that have the corresponding pages mapped in their VM space.

When the last process referencing a given library or executable dies,
the reaper will munmap this object on its behalf.  When this happens it
doesn't free the associated pages to speed-up possible re-use of the
file.  Instead the pages are placed on the inactive list but stay ready
to be pmap_enter()'d without requiring I/O as soon as a new process
needs to access them.

The mechanism to keep pages populated, known as UVM_VNODE_CANPERSIST,
doesn't work well with swapping [0].  For some reason when the page daemon
wants to free pages on the inactive list it tries to flush the pages to
disk and calls panic(9) because it needs a valid reference to the vnode to do
so.

This indicates that the mechanism described above, which seems to work
fine for RO mappings, is currently buggy in more complex situations.
Flushing the pages when the last reference of the UVM object is dropped
also doesn't seem to be enough as bluhm@ reported [1].

The diff below, which has already been committed and reverted, gets rid of
the UVM_VNODE_CANPERSIST logic.  I'd like to commit it again now that
the arm64 caching bug has been found and fixed.

Getting rid of this logic means more I/O will be generated and pages
might have a faster reuse cycle.  I'm aware this might introduce a small
slowdown, however I believe we should work towards loading files from the
buffer cache to save I/O cycles instead of having another layer of cache.
Such work isn't trivial and making sure the vnode <-> UVM relation is
simple and well understood is the first step in this direction.

I'd appreciate it if the diff below could be tested on many architectures,
including the offending rpi4.

Comments?  Oks?

[0] https://marc.info/?l=openbsd-bugs&m=164846737707559&w=2
[1] https://marc.info/?l=openbsd-bugs&m=166843373415030&w=2

Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.130
diff -u -p -r1.130 uvm_vnode.c
--- uvm/uvm_vnode.c 20 Oct 2022 13:31:52 -  1.130
+++ uvm/uvm_vnode.c 15 Nov 2022 13:28:28 -
@@ -161,11 +161,8 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 * add it to the writeable list, and then return.
 */
if (uvn->u_flags & UVM_VNODE_VALID) {   /* already active? */
+   KASSERT(uvn->u_obj.uo_refs > 0);
 
-   /* regain vref if we were persisting */
-   if (uvn->u_obj.uo_refs == 0) {
-   vref(vp);
-   }
uvn->u_obj.uo_refs++;   /* bump uvn ref! */
 
/* check for new writeable uvn */
@@ -235,14 +232,14 @@ uvn_attach(struct vnode *vp, vm_prot_t a
KASSERT(uvn->u_obj.uo_refs == 0);
uvn->u_obj.uo_refs++;
oldflags = uvn->u_flags;
-   uvn->u_flags = UVM_VNODE_VALID|UVM_VNODE_CANPERSIST;
+   uvn->u_flags = UVM_VNODE_VALID;
uvn->u_nio = 0;
uvn->u_size = used_vnode_size;
 
/*
 * add a reference to the vnode.   this reference will stay as long
 * as there is a valid mapping of the vnode.   dropped when the
-* reference count goes to zero [and we either free or persist].
+* reference count goes to zero.
 */
vref(vp);
 
@@ -323,16 +320,6 @@ uvn_detach(struct uvm_object *uobj)
 */
vp->v_flag &= ~VTEXT;
 
-   /*
-* we just dropped the last reference to the uvn.   see if we can
-* let it "stick around".
-*/
-   if (uvn->u_flags & UVM_VNODE_CANPERSIST) {
-   /* won't block */
-   uvn_flush(uobj, 0, 0, PGO_DEACTIVATE|PGO_ALLPAGES);
-   goto out;
-   }
-
/* its a goner! */
uvn->u_flags |= UVM_VNODE_DYING;
 
@@ -382,7 +369,6 @@ uvn_detach(struct uvm_object *uobj)
/* wake up any sleepers */
if (oldflags & UVM_VNODE_WANTED)
wakeup(uvn);
-out:
rw_exit(uobj->vmobjlock);
 
/* drop our reference to the vnode. */
@@ -498,8 +484,8 @@ uvm_vnp_terminate(struct vnode *vp)
}
 
/*
-* done.   now we free the uvn if its reference count is zero
-* (true if we are zapping a persisting uvn).   however, if we are
+* done.   now we free the uvn if its reference count is zero.
+* however, if we are
 * terminating a uvn with active mappings we let it live ... future
 * calls down to the vnode layer will fail.
 */
@@ -507,14 +493,14 @@ uvm_vnp_terminate(struct vnode *vp)
if (uvn->u_obj.uo_refs) {
/*
 * uvn must live on it is dead-vnode state until all references
-* are gone.   restore flags.clear CANPERSIST state.
+* are gone.   restore flags.
 */
uvn->u_flags &= 

btrace: string comparison in filters

2022-11-11 Thread Martin Pieuchot
Diff below adds support for the following common idiom:

syscall:open:entry
/comm == "ksh"/
{
...
}

String comparison is tricky as it can be combined with any other
expression in filters, like:

syscall:mmap:entry
/comm == "cc" && pid != 4589/
{
...
}

I don't have the energy to change the parser, so I went for the easy
solution of treating any "stupid" string comparison as 'true', albeit
printing a warning.  I'd love it if somebody with some yacc knowledge
could come up with a better solution.

ok?

Index: usr.sbin/btrace/bt_parse.y
===
RCS file: /cvs/src/usr.sbin/btrace/bt_parse.y,v
retrieving revision 1.46
diff -u -p -r1.46 bt_parse.y
--- usr.sbin/btrace/bt_parse.y  28 Apr 2022 21:04:24 -  1.46
+++ usr.sbin/btrace/bt_parse.y  11 Nov 2022 14:34:37 -
@@ -218,6 +218,7 @@ variable: lvar  { $$ = bl_find($1); }
 factor : '(' expr ')'  { $$ = $2; }
| NUMBER{ $$ = ba_new($1, B_AT_LONG); }
| BUILTIN   { $$ = ba_new(NULL, $1); }
+   | CSTRING   { $$ = ba_new($1, B_AT_STR); }
| staticv
| variable
| mentry
Index: usr.sbin/btrace/btrace.c
===
RCS file: /cvs/src/usr.sbin/btrace/btrace.c,v
retrieving revision 1.64
diff -u -p -r1.64 btrace.c
--- usr.sbin/btrace/btrace.c11 Nov 2022 10:51:39 -  1.64
+++ usr.sbin/btrace/btrace.c11 Nov 2022 14:44:15 -
@@ -434,14 +434,23 @@ rules_setup(int fd)
struct bt_rule *r, *rbegin = NULL;
struct bt_probe *bp;
struct bt_stmt *bs;
+   struct bt_arg *ba;
int dokstack = 0, on = 1;
uint64_t evtflags;
 
TAILQ_FOREACH(r, &g_rules, br_next) {
evtflags = 0;
-   SLIST_FOREACH(bs, &r->br_action, bs_next) {
-   struct bt_arg *ba;
 
+   if (r->br_filter != NULL &&
+   r->br_filter->bf_condition != NULL)  {
+
+   bs = r->br_filter->bf_condition;
+   ba = SLIST_FIRST(&bs->bs_args);
+
+   evtflags |= ba2dtflags(ba);
+   }
+
+   SLIST_FOREACH(bs, &r->br_action, bs_next) {
SLIST_FOREACH(ba, &bs->bs_args, ba_next)
evtflags |= ba2dtflags(ba);
 
@@ -1175,6 +1184,36 @@ baexpr2long(struct bt_arg *ba, struct dt
lhs = ba->ba_value;
rhs = SLIST_NEXT(lhs, ba_next);
 
+   /*
+* String comparisons also use '==' and '!='.
+*/
+   if (lhs->ba_type == B_AT_STR ||
+   (rhs != NULL && rhs->ba_type == B_AT_STR)) {
+   char lstr[STRLEN], rstr[STRLEN];
+
+   strlcpy(lstr, ba2str(lhs, dtev), sizeof(lstr));
+   strlcpy(rstr, ba2str(rhs, dtev), sizeof(rstr));
+
+   result = strncmp(lstr, rstr, STRLEN) == 0;
+
+   switch (ba->ba_type) {
+   case B_AT_OP_EQ:
+   break;
+   case B_AT_OP_NE:
+   result = !result;
+   break;
+   default:
+   warnx("operation '%d' unsupported on strings",
+   ba->ba_type);
+   result = 1;
+   }
+
+   debug("ba=%p eval '(%s %s %s) = %d'\n", ba, lstr, ba_name(ba),
+  rstr, result);
+
+   goto out;
+   }
+
lval = ba2long(lhs, dtev);
if (rhs == NULL) {
rval = 0;
@@ -1233,9 +1272,10 @@ baexpr2long(struct bt_arg *ba, struct dt
xabort("unsupported operation %d", ba->ba_type);
}
 
-   debug("ba=%p eval '%ld %s %ld = %d'\n", ba, lval, ba_name(ba),
+   debug("ba=%p eval '(%ld %s %ld) = %d'\n", ba, lval, ba_name(ba),
   rval, result);
 
+out:
--recursions;
 
return result;
@@ -1245,10 +1285,15 @@ const char *
 ba_name(struct bt_arg *ba)
 {
switch (ba->ba_type) {
+   case B_AT_STR:
+   return (const char *)ba->ba_value;
+   case B_AT_LONG:
+   return ba2str(ba, NULL);
case B_AT_NIL:
return "0";
case B_AT_VAR:
case B_AT_MAP:
+   case B_AT_HIST:
break;
case B_AT_BI_PID:
return "pid";
@@ -1326,7 +1371,8 @@ ba_name(struct bt_arg *ba)
xabort("unsupported type %d", ba->ba_type);
}
 
-   assert(ba->ba_type == B_AT_VAR || ba->ba_type == B_AT_MAP);
+   assert(ba->ba_type == B_AT_VAR || ba->ba_type == B_AT_MAP ||
+   ba->ba_type == B_AT_HIST);
 
static char buf[64];
size_t sz;
@@ -1516,9 +1562,13 @@ ba2str(struct bt_arg *ba, struct dt_evt 
 int
 ba2dtflags(struct bt_arg *ba)
 {
+   static long recursions;
struct bt_arg *bval;
int flags = 0;
 
+   if (++recursions >= __MAXOPERANDS)
+ 

Re: push kernel lock inside ifioctl_get()

2022-11-08 Thread Martin Pieuchot
On 08/11/22(Tue) 15:28, Klemens Nanni wrote:
> After this mechanical move, I can unlock the individual SIOCG* in there.

I'd suggest grabbing the KERNEL_LOCK() after NET_LOCK_SHARED().
Otherwise you might spin for the first one and then have to release it
when going to sleep on the second.
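
Concretely, for the SIOCGIFCONF case that would look something like this
(sketch only, reusing the names from the quoted diff):

	case SIOCGIFCONF:
		NET_LOCK_SHARED();
		KERNEL_LOCK();
		error = ifconf(data);
		KERNEL_UNLOCK();
		NET_UNLOCK_SHARED();
		return (error);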

> OK?
> 
> Index: if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.667
> diff -u -p -r1.667 if.c
> --- if.c  8 Nov 2022 15:20:24 -   1.667
> +++ if.c  8 Nov 2022 15:26:07 -
> @@ -2426,33 +2426,43 @@ ifioctl_get(u_long cmd, caddr_t data)
>   size_t bytesdone;
>   const char *label;
>  
> - KERNEL_LOCK();
> -
>   switch(cmd) {
>   case SIOCGIFCONF:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = ifconf(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCIFGCLONERS:
> + KERNEL_LOCK();
>   error = if_clone_list((struct if_clonereq *)data);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGMEMB:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgroupmembers(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGATTR:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgroupattribs(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFGLIST:
> + KERNEL_LOCK();
>   NET_LOCK_SHARED();
>   error = if_getgrouplist(data);
>   NET_UNLOCK_SHARED();
> + KERNEL_UNLOCK();
>   return (error);
>   }
> +
> + KERNEL_LOCK();
>  
>   ifp = if_unit(ifr->ifr_name);
>   if (ifp == NULL) {
> 



Mark sched_yield(2) as NOLOCK

2022-11-08 Thread Martin Pieuchot
Now that mmap/munmap/mprotect(2) are no longer creating contention it is
possible to see that sched_yield(2) is one of the syscalls waiting for
the KERNEL_LOCK() to be released.  However the lock is not needed here.

Traversing `ps_threads' requires either the KERNEL_LOCK() or the
SCHED_LOCK() and we are holding both in this case.  So let's drop the
requirement for the KERNEL_LOCK().
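
For reference, the traversal in question looks roughly like this (a
simplified sketch of sys_sched_yield(); the point is that the whole walk of
`ps_threads' happens under SCHED_LOCK()):

	int
	sys_sched_yield(struct proc *p, void *v, register_t *retval)
	{
		struct proc *q;
		uint8_t newprio;
		int s;

		SCHED_LOCK(s);
		/* compute a priority based on all sibling threads */
		newprio = p->p_usrpri;
		TAILQ_FOREACH(q, &p->p_p->ps_threads, p_thr_link)
			newprio = max(newprio, q->p_runpri);
		setrunqueue(p->p_cpu, p, newprio);
		p->p_ru.ru_nvcsw++;
		mi_switch();
		SCHED_UNLOCK(s);

		return (0);
	}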

ok?

Index: kern/syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.235
diff -u -p -r1.235 syscalls.master
--- kern/syscalls.master8 Nov 2022 11:05:57 -   1.235
+++ kern/syscalls.master8 Nov 2022 13:09:10 -
@@ -531,7 +531,7 @@
 #else
 297UNIMPL
 #endif
-298STD { int sys_sched_yield(void); }
+298STD NOLOCK  { int sys_sched_yield(void); }
 299STD NOLOCK  { pid_t sys_getthrid(void); }
 300OBSOL   t32___thrsleep
 301STD NOLOCK  { int sys___thrwakeup(const volatile void *ident, \



Re: xenstore.c: return error number

2022-11-08 Thread Martin Pieuchot
On 01/11/22(Tue) 15:26, Masato Asou wrote:
> Hi,
> 
> Return error number instead of call panic().

Makes sense to me.  Do you know how this error can occur?  Is it a logic
error or are we trusting values produced by a third party?

> comment, ok?
> --
> ASOU Masato
> 
> diff --git a/sys/dev/pv/xenstore.c b/sys/dev/pv/xenstore.c
> index 1e4f15d30eb..dc89ba0fa6d 100644
> --- a/sys/dev/pv/xenstore.c
> +++ b/sys/dev/pv/xenstore.c
> @@ -118,6 +118,7 @@ struct xs_msg {
>   struct xs_msghdr xsm_hdr;
>   uint32_t xsm_read;
>   uint32_t xsm_dlen;
> + int  xsm_error;
>   uint8_t *xsm_data;
>   TAILQ_ENTRY(xs_msg)  xsm_link;
>  };
> @@ -566,9 +567,7 @@ xs_intr(void *arg)
>   }
>  
>   if (xsm->xsm_hdr.xmh_len > xsm->xsm_dlen)
> - panic("message too large: %d vs %d for type %d, rid %u",
> - xsm->xsm_hdr.xmh_len, xsm->xsm_dlen, xsm->xsm_hdr.xmh_type,
> - xsm->xsm_hdr.xmh_rid);
> + xsm->xsm_error = EMSGSIZE;
>  
>   len = MIN(xsm->xsm_hdr.xmh_len - xsm->xsm_read, avail);
>   if (len) {
> @@ -800,7 +799,9 @@ xs_cmd(struct xs_transaction *xst, int cmd, const char 
> *path,
>   error = xs_geterror(xsm);
>   DPRINTF("%s: xenstore request %d \"%s\" error %s\n",
>   xs->xs_sc->sc_dev.dv_xname, cmd, path, xsm->xsm_data);
> - } else if (mode == READ) {
> + } else if (xsm->xsm_error != 0)
> + error = xsm->xsm_error;
> + else if (mode == READ) {
>   KASSERT(iov && iov_cnt);
>   error = xs_parse(xst, xsm, iov, iov_cnt);
>   }
> 



Re: Please test: unlock mprotect/mmap/munmap

2022-11-08 Thread Martin Pieuchot
On 08/11/22(Tue) 11:12, Mark Kettenis wrote:
> > Date: Tue, 8 Nov 2022 10:32:14 +0100
> > From: Christian Weisgerber 
> > 
> > Martin Pieuchot:
> > 
> > > These 3 syscalls should now be ready to run w/o KERNEL_LOCK().  This
> > > will reduce contention a lot.  I'd be happy to hear from test reports
> > > on many architectures and possible workloads.
> > 
> > This survived a full amd64 package build.
> 
> \8/
> 
> I think that means it should be comitted.

I agree.  This has been tested on i386, riscv64, m88k, arm64, amd64 (of
course) and sparc64.  I'm pretty confident.



Re: push kernel lock down in ifioctl()

2022-11-07 Thread Martin Pieuchot
On 07/11/22(Mon) 15:16, Klemens Nanni wrote:
> Not all interface ioctls need the kernel lock, but they all grab it.
> 
> Here's a mechanical diff splitting the single lock/unlock around
> ifioctl() into individual lock/unlock dances inside ifioctl().
> 
> From there we can unlock individual ioctls piece by piece.
> 
> Survives regress on sparc64 and didn't blow up on my amd64 notebook yet.
> 
> Feedback? Objection? OK?

Makes sense.  Your diff is missing the kern/sys_socket.c chunk.

This stuff is hairy.  I'd suggest moving very, very carefully.  For
example, I wouldn't bother releasing the KERNEL_LOCK() before the
if_put().  Yes, what you're suggesting is correct.  Or at least it
should be...
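
In other words I'd keep the tail of ifioctl() looking something like this
(a sketch with the names from your diff, unlocking only after if_put()):

	if (((oif_flags ^ ifp->if_flags) & IFF_UP) != 0)
		getmicrotime(&ifp->if_lastchange);

	if_put(ifp);
	KERNEL_UNLOCK();

	return (error);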

> Index: net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.665
> diff -u -p -r1.665 if.c
> --- net/if.c  8 Sep 2022 10:22:06 -   1.665
> +++ net/if.c  7 Nov 2022 15:13:01 -
> @@ -1942,19 +1942,25 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCIFCREATE:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   error = if_clone_create(ifr->ifr_name, 0);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCIFDESTROY:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   error = if_clone_destroy(ifr->ifr_name);
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCSIFGATTR:
>   if ((error = suser(p)) != 0)
>   return (error);
> + KERNEL_LOCK();
>   NET_LOCK();
>   error = if_setgroupattribs(data);
>   NET_UNLOCK();
> + KERNEL_UNLOCK();
>   return (error);
>   case SIOCGIFCONF:
>   case SIOCIFGCLONERS:
> @@ -1973,12 +1979,19 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCGIFRDOMAIN:
>   case SIOCGIFGROUP:
>   case SIOCGIFLLPRIO:
> - return (ifioctl_get(cmd, data));
> + KERNEL_LOCK();
> + error = ifioctl_get(cmd, data);
> + KERNEL_UNLOCK();
> + return (error);
>   }
>  
> + KERNEL_LOCK();
> +
>   ifp = if_unit(ifr->ifr_name);
> - if (ifp == NULL)
> + if (ifp == NULL) {
> + KERNEL_UNLOCK();
>   return (ENXIO);
> + }
>   oif_flags = ifp->if_flags;
>   oif_xflags = ifp->if_xflags;
>  
> @@ -2396,6 +2409,8 @@ forceup:
>  
>   if (((oif_flags ^ ifp->if_flags) & IFF_UP) != 0)
>   getmicrotime(&ifp->if_lastchange);
> +
> + KERNEL_UNLOCK();
>  
>   if_put(ifp);
>  
> 



Please test: unlock mprotect/mmap/munmap

2022-11-06 Thread Martin Pieuchot
These 3 syscalls should now be ready to run w/o KERNEL_LOCK().  This
will reduce contention a lot.  I'd be happy to hear from test reports
on many architectures and possible workloads.

Do not forget to run "make syscalls" before building the kernel.

Index: syscalls.master
===
RCS file: /cvs/src/sys/kern/syscalls.master,v
retrieving revision 1.234
diff -u -p -r1.234 syscalls.master
--- syscalls.master 25 Oct 2022 16:10:31 -  1.234
+++ syscalls.master 6 Nov 2022 10:50:45 -
@@ -126,7 +126,7 @@
struct sigaction *osa); }
 47 STD NOLOCK  { gid_t sys_getgid(void); }
 48 STD NOLOCK  { int sys_sigprocmask(int how, sigset_t mask); }
-49 STD { void *sys_mmap(void *addr, size_t len, int prot, \
+49 STD NOLOCK  { void *sys_mmap(void *addr, size_t len, int prot, \
int flags, int fd, off_t pos); }
 50 STD { int sys_setlogin(const char *namebuf); }
 #ifdef ACCOUNTING
@@ -171,8 +171,8 @@
const struct kevent *changelist, int nchanges, \
struct kevent *eventlist, int nevents, \
const struct timespec *timeout); }
-73 STD { int sys_munmap(void *addr, size_t len); }
-74 STD { int sys_mprotect(void *addr, size_t len, \
+73 STD NOLOCK  { int sys_munmap(void *addr, size_t len); }
+74 STD NOLOCK  { int sys_mprotect(void *addr, size_t len, \
int prot); }
 75 STD { int sys_madvise(void *addr, size_t len, \
int behav); }



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-30 Thread Martin Pieuchot
On 30/10/22(Sun) 12:45, Klemens Nanni wrote:
> On Sun, Oct 30, 2022 at 12:40:02PM +, Klemens Nanni wrote:
> > regress on i386/GENERIC.MP+WITNESS with this diff shows
> 
> Another one;  This machine has three read-only NFS mounts, but none of
> them are used during builds or regress.

It's the same.  See archives of bugs@ for discussion about this lock
order reversal and a potential fix from visa@.

> 
> This one is most certainly from the NFS regress tests themselves:
> 127.0.0.1:/mnt/regress-nfs-server  3548  2088  1284   
>  62%/mnt/regress-nfs-client
> 
> witness: lock order reversal:
>  1st 0xd6381eb8 vmmaplk (>lock)
>  2nd 0xf5c98d24 nfsnode (>n_lock)
>  1st 0xd6381eb8 vmmaplk (&map->lock)
>  2nd 0xf5c98d24 nfsnode (&np->n_lock)
> lock order data w2 -> w1 missing
> lock order "&map->lock"(rwlock) -> "&np->n_lock"(rrwlock) first seen at:
> #2  nfs_lock+0x27
> #3  VOP_LOCK+0x50
> #4  vn_lock+0x91
> #5  vn_rdwr+0x64
> #6  vndstrategy+0x2bd
> #7  physio+0x18f
> #8  vndwrite+0x1a
> #9  spec_write+0x74
> #10 VOP_WRITE+0x3f
> #11 vn_write+0xde
> #12 dofilewritev+0xbb
> #13 sys_pwrite+0x55
> #14 syscall+0x2ec
> #15 Xsyscall_untramp+0xa9
> 



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-30 Thread Martin Pieuchot
On 30/10/22(Sun) 12:40, Klemens Nanni wrote:
> On Fri, Oct 28, 2022 at 11:08:55AM +0200, Martin Pieuchot wrote:
> > On 20/10/22(Thu) 16:17, Martin Pieuchot wrote:
> > > On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> > > > Diff below adds a minimalist set of assertions to ensure proper locks
> > > > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > > > mmap(2) for anons and munmap(2).
> > > > 
> > > > Please test it with WITNESS enabled and report back.
> > > 
> > > New version of the diff that includes a lock/unlock dance in
> > > uvm_map_teardown().  While grabbing this lock should not be strictly
> > > necessary because no other reference to the map should exist when the
> > > reaper is holding it, it helps make progress with asserts.  Grabbing
> > > the lock is easy and it can also save us a lot of time if there are any
> > > reference counting bugs (like the ones we've discovered w/ vnode and swapping).
> > 
> > Here's an updated version that adds a lock/unlock dance in
> > uvm_map_deallocate() to satisfy the assert in uvm_unmap_remove().
> > Thanks to tb@ for pointing this out.
> > 
> > I received a lot of positive feedback and test reports, I'm now asking for
> > oks.
> 
> regress on i386/GENERIC.MP+WITNESS with this diff shows

This isn't related to this diff.



Re: Towards unlocking mmap(2) & munmap(2)

2022-10-28 Thread Martin Pieuchot
On 20/10/22(Thu) 16:17, Martin Pieuchot wrote:
> On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> > Diff below adds a minimalist set of assertions to ensure proper locks
> > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > mmap(2) for anons and munmap(2).
> > 
> > Please test it with WITNESS enabled and report back.
> 
> New version of the diff that includes a lock/unlock dance in
> uvm_map_teardown().  While grabbing this lock should not be strictly
> necessary because no other reference to the map should exist when the
> reaper is holding it, it helps make progress with asserts.  Grabbing
> the lock is easy and it can also save us a lot of time if there are any
> reference counting bugs (like the ones we've discovered w/ vnode and swapping).

Here's an updated version that adds a lock/unlock dance in
uvm_map_deallocate() to satisfy the assert in uvm_unmap_remove().
Thanks to tb@ for pointing this out.

I received a lot of positive feedback and test reports, so I'm now asking for
oks.
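
The vm_map_assert_*() helpers aren't visible in the chunks below; they are
assumed to be thin wrappers around the rwlock assertions on `map->lock'
(sketch only, with interrupt-safe maps checked via their mutex instead):

	void
	vm_map_assert_anylock(struct vm_map *map)
	{
		if ((map->flags & VM_MAP_INTRSAFE) == 0)
			rw_assert_anylock(&map->lock);
		else
			MUTEX_ASSERT_LOCKED(&map->mtx);
	}

	void
	vm_map_assert_wrlock(struct vm_map *map)
	{
		if ((map->flags & VM_MAP_INTRSAFE) == 0)
			rw_assert_wrlock(&map->lock);
		else
			MUTEX_ASSERT_LOCKED(&map->mtx);
	}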


Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  28 Oct 2022 08:41:30 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 28 Oct 2022 08:41:30 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.301
diff -u -p -r1.301 uvm_map.c
--- uvm/uvm_map.c   24 Oct 2022 15:11:56 -  1.301
+++ uvm/uvm_map.c   28 Oct 2022 08:46:28 -
@@ -491,6 +491,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +572,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1457,6 +1461,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1584,6 +1590,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1704,6 +1712,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1868,6 +1878,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1996,10 +2008,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return 0;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected

Re: Towards unlocking mmap(2) & munmap(2)

2022-10-20 Thread Martin Pieuchot
On 11/09/22(Sun) 12:26, Martin Pieuchot wrote:
> Diff below adds a minimalist set of assertions to ensure proper locks
> are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> mmap(2) for anons and munmap(2).
> 
> Please test it with WITNESS enabled and report back.

New version of the diff that includes a lock/unlock dance in
uvm_map_teardown().  While grabbing this lock should not be strictly
necessary because no other reference to the map should exist when the
reaper is holding it, it helps make progress with asserts.  Grabbing
the lock is easy and it can also save us a lot of time if there are any
reference counting bugs (like the ones we've discovered w/ vnode and swapping).

Please test and report back.

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  20 Oct 2022 14:09:30 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 20 Oct 2022 14:09:30 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.298
diff -u -p -r1.298 uvm_map.c
--- uvm/uvm_map.c   16 Oct 2022 16:16:37 -  1.298
+++ uvm/uvm_map.c   20 Oct 2022 14:09:31 -
@@ -491,6 +491,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +572,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1457,6 +1461,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1584,6 +1590,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1704,6 +1712,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1868,6 +1878,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1996,10 +2008,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return 0;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected entry. */
entry = uvm_map_entrybyaddr(&map->addr, start);
@@ -2526,6 +2535,8 @@ uvm_map_teardown(struct vm_map *map)
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
 
+   vm_map_lock(map);
+
/* Remove address selectors. */
uvm_addr_destroy(map->uaddr_e

Re: Towards unlocking mmap(2) & munmap(2)

2022-09-14 Thread Martin Pieuchot
On 14/09/22(Wed) 15:47, Klemens Nanni wrote:
> On 14.09.22 18:55, Mike Larkin wrote:
> > On Sun, Sep 11, 2022 at 12:26:31PM +0200, Martin Pieuchot wrote:
> > > Diff below adds a minimalist set of assertions to ensure proper locks
> > > are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
> > > mmap(2) for anons and munmap(2).
> > > 
> > > Please test it with WITNESS enabled and report back.
> > > 
> > 
> > Do you want this tested in conjunction with the aiodoned diff or by itself?
> 
> This diff looks like a subset of the previous uvm lock assertion diff
> that came out of the previous "unlock mmap(2) for anonymous mappings"
> thread[0].
> 
> https://marc.info/?l=openbsd-tech&m=164423248318212&w=2
> 
> It didn't land in the end; I **think** syzkaller was a blocker, which we
> only realised once it was committed and picked up by syzkaller.
> 
> Now it's been some time and more UVM changes landed, but the majority
> (if not all) of the lock assertions and comments from the above linked diff
> should still hold true.
> 
> mpi, I can dust off and resend that diff, if you want.
> Nothing for release, but perhaps it helps testing your current efforts.

Please hold on, this diff is known to trigger a KASSERT() with witness.
I'll send an updated version soon.

Thank you for disregarding this diff for the moment.



Towards unlocking mmap(2) & munmap(2)

2022-09-11 Thread Martin Pieuchot
Diff below adds a minimalist set of assertions to ensure proper locks
are held in uvm_mapanon() and uvm_unmap_remove() which are the guts of
mmap(2) for anons and munmap(2).

Please test it with WITNESS enabled and report back.

Index: uvm/uvm_addr.c
===
RCS file: /cvs/src/sys/uvm/uvm_addr.c,v
retrieving revision 1.31
diff -u -p -r1.31 uvm_addr.c
--- uvm/uvm_addr.c  21 Feb 2022 10:26:20 -  1.31
+++ uvm/uvm_addr.c  11 Sep 2022 09:08:10 -
@@ -416,6 +416,8 @@ uvm_addr_invoke(struct vm_map *map, stru
!(hint >= uaddr->uaddr_minaddr && hint < uaddr->uaddr_maxaddr))
return ENOMEM;
 
+   vm_map_assert_anylock(map);
+
error = (*uaddr->uaddr_functions->uaddr_select)(map, uaddr,
entry_out, addr_out, sz, align, offset, prot, hint);
 
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 11 Sep 2022 08:57:35 -
@@ -1626,6 +1626,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
struct vm_page *pg;
 
KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
+   vm_map_assert_anylock(map);
 
/*
 * we assume that the area we are unwiring has actually been wired
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.294
diff -u -p -r1.294 uvm_map.c
--- uvm/uvm_map.c   15 Aug 2022 15:53:45 -  1.294
+++ uvm/uvm_map.c   11 Sep 2022 09:37:44 -
@@ -162,6 +162,8 @@ int  uvm_map_inentry_recheck(u_long, v
 struct p_inentry *);
 boolean_t   uvm_map_inentry_fix(struct proc *, struct p_inentry *,
 vaddr_t, int (*)(vm_map_entry_t), u_long);
+boolean_t   uvm_map_is_stack_remappable(struct vm_map *,
+vaddr_t, vsize_t);
 /*
  * Tree management functions.
  */
@@ -491,6 +493,8 @@ uvmspace_dused(struct vm_map *map, vaddr
vaddr_t stack_begin, stack_end; /* Position of stack. */
 
KASSERT(map->flags & VM_MAP_ISVMSPACE);
+   vm_map_assert_anylock(map);
+
vm = (struct vmspace *)map;
stack_begin = MIN((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
stack_end = MAX((vaddr_t)vm->vm_maxsaddr, (vaddr_t)vm->vm_minsaddr);
@@ -570,6 +574,8 @@ uvm_map_isavail(struct vm_map *map, stru
if (addr + sz < addr)
return 0;
 
+   vm_map_assert_anylock(map);
+
/*
 * Kernel memory above uvm_maxkaddr is considered unavailable.
 */
@@ -1446,6 +1452,8 @@ uvm_map_mkentry(struct vm_map *map, stru
entry->guard = 0;
entry->fspace = 0;
 
+   vm_map_assert_wrlock(map);
+
/* Reset free space in first. */
free = uvm_map_uaddr_e(map, first);
uvm_mapent_free_remove(map, free, first);
@@ -1573,6 +1581,8 @@ boolean_t
 uvm_map_lookup_entry(struct vm_map *map, vaddr_t address,
 struct vm_map_entry **entry)
 {
+   vm_map_assert_anylock(map);
+
*entry = uvm_map_entrybyaddr(&map->addr, address);
return *entry != NULL && !UVM_ET_ISHOLE(*entry) &&
(*entry)->start <= address && (*entry)->end > address;
@@ -1692,6 +1702,8 @@ uvm_map_is_stack_remappable(struct vm_ma
vaddr_t end = addr + sz;
struct vm_map_entry *first, *iter, *prev = NULL;
 
+   vm_map_assert_anylock(map);
+
if (!uvm_map_lookup_entry(map, addr, &first)) {
printf("map stack 0x%lx-0x%lx of map %p failed: no mapping\n",
addr, end, map);
@@ -1843,6 +1855,8 @@ uvm_mapent_mkfree(struct vm_map *map, st
vaddr_t  addr;  /* Start of freed range. */
vaddr_t  end;   /* End of freed range. */
 
+   UVM_MAP_REQ_WRITE(map);
+
prev = *prev_ptr;
if (prev == entry)
*prev_ptr = prev = NULL;
@@ -1971,10 +1985,7 @@ uvm_unmap_remove(struct vm_map *map, vad
if (start >= end)
return;
 
-   if ((map->flags & VM_MAP_INTRSAFE) == 0)
-   splassert(IPL_NONE);
-   else
-   splassert(IPL_VM);
+   vm_map_assert_wrlock(map);
 
/* Find first affected entry. */
entry = uvm_map_entrybyaddr(&map->addr, start);
@@ -4027,6 +4038,8 @@ uvm_map_checkprot(struct vm_map *map, va
 {
struct vm_map_entry *entry;
 
+   vm_map_assert_anylock(map);
+
if (start < map->min_offset || end > map->max_offset || start > end)
return FALSE;
if (start == end)
@@ -4886,6 +4899,7 @@ uvm_map_freelist_update(struct vm_map *m
 vaddr_t b_start, vaddr_t b_end, vaddr_t s_start, vaddr_t s_end, int flags)
 {
KDASSERT(b_end >= b_start && s_end >= 

uvm_vnode locking & documentation

2022-09-10 Thread Martin Pieuchot
A previous fix from gnezdo@ pointed out that `u_flags' accesses should be
serialized by `vmobjlock'.  The diff below documents this and fixes the
remaining places where the lock isn't yet taken.  One exception still
remains, the first loop of uvm_vnp_sync().  This cannot be fixed right
now due to possible deadlocks, but that's not a reason for not documenting
& fixing the rest of this file.
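
The rule being documented is simply that every read or write of `u_flags'
happens with the object lock held, e.g. (sketch):

	rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
	uvn->u_flags |= UVM_VNODE_WRITEABLE;
	rw_exit(uvn->u_obj.vmobjlock);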

This has been tested on amd64 and arm64.

Comments?  Oks?

Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.128
diff -u -p -r1.128 uvm_vnode.c
--- uvm/uvm_vnode.c 10 Sep 2022 16:14:36 -  1.128
+++ uvm/uvm_vnode.c 10 Sep 2022 18:23:57 -
@@ -68,11 +68,8 @@
  * we keep a simpleq of vnodes that are currently being sync'd.
  */
 
-LIST_HEAD(uvn_list_struct, uvm_vnode);
-struct uvn_list_struct uvn_wlist;  /* writeable uvns */
-
-SIMPLEQ_HEAD(uvn_sq_struct, uvm_vnode);
-struct uvn_sq_struct uvn_sync_q;   /* sync'ing uvns */
+LIST_HEAD(, uvm_vnode) uvn_wlist;  /* [K] writeable uvns */
+SIMPLEQ_HEAD(, uvm_vnode)  uvn_sync_q; /* [S] sync'ing uvns */
 struct rwlock uvn_sync_lock;   /* locks sync operation */
 
 extern int rebooting;
@@ -144,41 +141,40 @@ uvn_attach(struct vnode *vp, vm_prot_t a
struct partinfo pi;
u_quad_t used_vnode_size = 0;
 
-   /* first get a lock on the uvn. */
-   while (uvn->u_flags & UVM_VNODE_BLOCKED) {
-   uvn->u_flags |= UVM_VNODE_WANTED;
-   tsleep_nsec(uvn, PVM, "uvn_attach", INFSLP);
-   }
-
/* if we're mapping a BLK device, make sure it is a disk. */
if (vp->v_type == VBLK && bdevsw[major(vp->v_rdev)].d_type != D_DISK) {
return NULL;
}
 
+   /* first get a lock on the uvn. */
+   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
+   while (uvn->u_flags & UVM_VNODE_BLOCKED) {
+   uvn->u_flags |= UVM_VNODE_WANTED;
+   rwsleep_nsec(uvn, uvn->u_obj.vmobjlock, PVM, "uvn_attach",
+   INFSLP);
+   }
+
/*
 * now uvn must not be in a blocked state.
 * first check to see if it is already active, in which case
 * we can bump the reference count, check to see if we need to
 * add it to the writeable list, and then return.
 */
-   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
if (uvn->u_flags & UVM_VNODE_VALID) {   /* already active? */
KASSERT(uvn->u_obj.uo_refs > 0);
 
uvn->u_obj.uo_refs++;   /* bump uvn ref! */
-   rw_exit(uvn->u_obj.vmobjlock);
 
/* check for new writeable uvn */
if ((accessprot & PROT_WRITE) != 0 &&
(uvn->u_flags & UVM_VNODE_WRITEABLE) == 0) {
-   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
-   /* we are now on wlist! */
uvn->u_flags |= UVM_VNODE_WRITEABLE;
+   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
}
+   rw_exit(uvn->u_obj.vmobjlock);
 
return (&uvn->u_obj);
}
-   rw_exit(uvn->u_obj.vmobjlock);
 
/*
 * need to call VOP_GETATTR() to get the attributes, but that could
@@ -189,6 +185,7 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 * it.
 */
uvn->u_flags = UVM_VNODE_ALOCK;
+   rw_exit(uvn->u_obj.vmobjlock);
 
if (vp->v_type == VBLK) {
/*
@@ -213,9 +210,11 @@ uvn_attach(struct vnode *vp, vm_prot_t a
}
 
if (result != 0) {
+   rw_enter(uvn->u_obj.vmobjlock, RW_WRITE);
if (uvn->u_flags & UVM_VNODE_WANTED)
wakeup(uvn);
uvn->u_flags = 0;
+   rw_exit(uvn->u_obj.vmobjlock);
return NULL;
}
 
@@ -236,18 +235,19 @@ uvn_attach(struct vnode *vp, vm_prot_t a
uvn->u_nio = 0;
uvn->u_size = used_vnode_size;
 
-   /* if write access, we need to add it to the wlist */
-   if (accessprot & PROT_WRITE) {
-   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
-   uvn->u_flags |= UVM_VNODE_WRITEABLE;/* we are on wlist! */
-   }
-
/*
 * add a reference to the vnode.   this reference will stay as long
 * as there is a valid mapping of the vnode.   dropped when the
 * reference count goes to zero.
 */
vref(vp);
+
+   /* if write access, we need to add it to the wlist */
+   if (accessprot & PROT_WRITE) {
+   uvn->u_flags |= UVM_VNODE_WRITEABLE;
+   LIST_INSERT_HEAD(&uvn_wlist, uvn, u_wlist);
+   }
+
if (oldflags & UVM_VNODE_WANTED)
wakeup(uvn);
 
@@ -273,6 +273,7 @@ uvn_reference(struct uvm_object *uobj)
struct uvm_vnode *uvn = (struct uvm_vnode *) uobj;
 #endif
 
+   

Re: Unmap page in uvm_anon_release()

2022-09-10 Thread Martin Pieuchot
On 10/09/22(Sat) 15:12, Mark Kettenis wrote:
> > Date: Sat, 10 Sep 2022 14:18:02 +0200
> > From: Martin Pieuchot 
> > 
> > Diff below fixes a bug exposed when swapping on arm64.  When an anon is
> > released make sure all the pmap references to the related page are
> > removed.
> 
> I'm a little bit puzzled by this.  So these pages are still mapped
> even though there are no references to the anon anymore?

I don't know.  I just realised that all the code paths leading to
uvm_pagefree() get rid of the pmap references by calling
pmap_page_protect(), except a couple of them in the aiodone daemon and
the clustering code in the pager.

This can't hurt and it makes the existing code coherent.  Maybe it just
hides the bug, I don't know.



Unmap page in uvm_anon_release()

2022-09-10 Thread Martin Pieuchot
Diff below fixes a bug exposed when swapping on arm64.  When an anon is
released make sure all the pmap references to the related page are
removed.

We could move the pmap_page_protect(pg, PROT_NONE) inside uvm_pagefree()
to avoid future issues, but that's for a later refactoring.
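
Such a refactoring could look roughly like this (sketch only, not part of
the diff below):

	void
	uvm_pagefree(struct vm_page *pg)
	{
		/* drop all pmap references before the page is freed */
		pmap_page_protect(pg, PROT_NONE);
		/* ... existing uvm_pagefree() body ... */
	}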

With this diff I can no longer reproduce the SIGBUS issue on the
rockpro64 and swapping is stable as long as I/O from sdmmc(4) work.

This should be good enough to commit the diff that got reverted, but I'll
wait to be sure there's no regression.

ok?

Index: uvm/uvm_anon.c
===
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
retrieving revision 1.54
diff -u -p -r1.54 uvm_anon.c
--- uvm/uvm_anon.c  26 Mar 2021 13:40:05 -  1.54
+++ uvm/uvm_anon.c  10 Sep 2022 12:10:34 -
@@ -255,6 +255,7 @@ uvm_anon_release(struct vm_anon *anon)
KASSERT(anon->an_ref == 0);
 
uvm_lock_pageq();
+   pmap_page_protect(pg, PROT_NONE);
uvm_pagefree(pg);
uvm_unlock_pageq();
KASSERT(anon->an_page == NULL);
Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.132
diff -u -p -r1.132 uvm_fault.c
--- uvm/uvm_fault.c 31 Aug 2022 01:27:04 -  1.132
+++ uvm/uvm_fault.c 10 Sep 2022 12:10:34 -
@@ -396,7 +396,6 @@ uvmfault_anonget(struct uvm_faultinfo *u
 * anon and try again.
 */
if (pg->pg_flags & PG_RELEASED) {
-   pmap_page_protect(pg, PROT_NONE);
KASSERT(anon->an_ref == 0);
/*
 * Released while we had unlocked amap.



Re: ps(1): add -d (descendancy) option to display parent/child process relationships

2022-09-01 Thread Martin Pieuchot
On 01/09/22(Thu) 03:37, Job Snijders wrote:
> Dear all,
> 
> Some ps(1) implementations have an '-d' ('descendancy') option. Through
> ASCII art parent/child process relationships are grouped and displayed.
> Here is an example:
> 
> $ ps ad -O ppid,user
>   PID  PPID USER TT  STATTIME COMMAND
> 18180 12529 job  pb  I+p  0:00.01 `-- -sh (sh)
> 26689 56460 job  p3  Ip   0:00.01   `-- -ksh (ksh)
>  5153 26689 job  p3  I+p  0:40.18 `-- mutt
> 62046 25272 job  p4  Sp   0:00.25   `-- -ksh (ksh)
> 61156 62046 job  p4  R+/0 0:00.00 `-- ps -ad -O ppid
> 26816  2565 job  p5  Ip   0:00.01   `-- -ksh (ksh)
> 79431 26816 root p5  Ip   0:00.16 `-- /bin/ksh
> 43915 79431 _rpki-cl p5  S+pU 0:06.97   `-- rpki-client
> 70511 43915 _rpki-cl p5  I+pU 0:01.26 |-- rpki-client: parser 
> (rpki-client)
> 96992 43915 _rpki-cl p5  I+pU 0:00.00 |-- rpki-client: rsync 
> (rpki-client)
> 49160 43915 _rpki-cl p5  S+p  0:01.52 |-- rpki-client: http 
> (rpki-client)
> 99329 43915 _rpki-cl p5  S+p  0:03.20 `-- rpki-client: rrdp 
> (rpki-client)
> 
> The functionality is similar to pstree(1) in the ports collection.
> 
> The below changeset borrows heavily from the following two
> implementations:
> 
> 
> https://github.com/freebsd/freebsd-src/commit/044fce530f89a819827d351de364d208a30e9645.patch
> 
> https://github.com/NetBSD/src/commit/b82f6d00d93d880d3976c4f1e88c33d88a8054ad.patch
> 
> Thoughts?

I'd love to have such a feature in base.

> Index: extern.h
> ===
> RCS file: /cvs/src/bin/ps/extern.h,v
> retrieving revision 1.23
> diff -u -p -r1.23 extern.h
> --- extern.h  5 Jan 2022 04:10:36 -   1.23
> +++ extern.h  1 Sep 2022 03:31:36 -
> @@ -44,44 +44,44 @@ extern VAR var[];
>  extern VARENT *vhead;
>  
>  __BEGIN_DECLS
> -void  command(const struct kinfo_proc *, VARENT *);
> -void  cputime(const struct kinfo_proc *, VARENT *);
> +void  command(const struct pinfo *, VARENT *);
> +void  cputime(const struct pinfo *, VARENT *);
>  int   donlist(void);
> -void  elapsed(const struct kinfo_proc *, VARENT *);
> +void  elapsed(const struct pinfo *, VARENT *);
>  double  getpcpu(const struct kinfo_proc *);
> -double  getpmem(const struct kinfo_proc *);
> -void  gname(const struct kinfo_proc *, VARENT *);
> -void  supgid(const struct kinfo_proc *, VARENT *);
> -void  supgrp(const struct kinfo_proc *, VARENT *);
> -void  logname(const struct kinfo_proc *, VARENT *);
> -void  longtname(const struct kinfo_proc *, VARENT *);
> -void  lstarted(const struct kinfo_proc *, VARENT *);
> -void  maxrss(const struct kinfo_proc *, VARENT *);
> +double  getpmem(const struct pinfo *);
> +void  gname(const struct pinfo *, VARENT *);
> +void  supgid(const struct pinfo *, VARENT *);
> +void  supgrp(const struct pinfo *, VARENT *);
> +void  logname(const struct pinfo *, VARENT *);
> +void  longtname(const struct pinfo *, VARENT *);
> +void  lstarted(const struct pinfo *, VARENT *);
> +void  maxrss(const struct pinfo *, VARENT *);
>  void  nlisterr(struct nlist *);
> -void  p_rssize(const struct kinfo_proc *, VARENT *);
> -void  pagein(const struct kinfo_proc *, VARENT *);
> +void  p_rssize(const struct pinfo *, VARENT *);
> +void  pagein(const struct pinfo *, VARENT *);
>  void  parsefmt(char *);
> -void  pcpu(const struct kinfo_proc *, VARENT *);
> -void  pmem(const struct kinfo_proc *, VARENT *);
> -void  pri(const struct kinfo_proc *, VARENT *);
> +void  pcpu(const struct pinfo *, VARENT *);
> +void  pmem(const struct pinfo *, VARENT *);
> +void  pri(const struct pinfo *, VARENT *);
>  void  printheader(void);
> -void  pvar(const struct kinfo_proc *kp, VARENT *);
> -void  pnice(const struct kinfo_proc *kp, VARENT *);
> -void  rgname(const struct kinfo_proc *, VARENT *);
> -void  rssize(const struct kinfo_proc *, VARENT *);
> -void  runame(const struct kinfo_proc *, VARENT *);
> +void  pvar(const struct pinfo *, VARENT *);
> +void  pnice(const struct pinfo *, VARENT *);
> +void  rgname(const struct pinfo *, VARENT *);
> +void  rssize(const struct pinfo *, VARENT *);
> +void  runame(const struct pinfo *, VARENT *);
>  void  showkey(void);
> -void  started(const struct kinfo_proc *, VARENT *);
> -void  printstate(const struct kinfo_proc *, VARENT *);
> -void  printpledge(const struct kinfo_proc *, VARENT *);
> -void  tdev(const struct kinfo_proc *, VARENT *);
> -void  tname(const struct kinfo_proc *, VARENT *);
> -void  tsize(const struct kinfo_proc *, VARENT *);
> -void  dsize(const struct kinfo_proc *, VARENT *);
> -void  ssize(const struct kinfo_proc *, VARENT *);
> -void  ucomm(const struct kinfo_proc *, VARENT *);
> -void  curwd(const struct kinfo_proc *, VARENT *);
> -void  euname(const struct kinfo_proc *, VARENT *);
> -void  vsize(const struct kinfo_proc *, VARENT 

Re: pdaemon locking tweak

2022-08-30 Thread Martin Pieuchot
On 30/08/22(Tue) 15:28, Jonathan Gray wrote:
> On Mon, Aug 29, 2022 at 01:46:20PM +0200, Martin Pieuchot wrote:
> > Diff below refactors the pdaemon's locking by introducing a new *trylock()
> > function for a given page.  This is shamelessly stolen from NetBSD.
> > 
> > This is part of my ongoing effort to untangle the locks used by the page
> > daemon.
> > 
> > ok?
> 
> if (pmap_is_referenced(p)) {
>   uvm_pageactivate(p);
> 
> is no longer under the held slock.  Which I believe is intended,
> just not obvious looking at the diff.
> 
> The page queue is already locked on entry to uvmpd_scan_inactive()

Thanks for spotting this.  Indeed the locking required for
uvm_pageactivate() is different in my local tree.  For now
let's keep the existing order of operations.

Updated diff below.

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   30 Aug 2022 08:30:58 -  1.103
+++ uvm/uvm_pdaemon.c   30 Aug 2022 08:39:19 -
@@ -101,6 +101,7 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
+struct rwlock  *uvmpd_trylockowner(struct vm_page *);
 void   uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
@@ -367,6 +368,34 @@ uvm_aiodone_daemon(void *arg)
}
 }
 
+/*
+ * uvmpd_trylockowner: trylock the page's owner.
+ *
+ * => return the locked rwlock on success.  otherwise, return NULL.
+ */
+struct rwlock *
+uvmpd_trylockowner(struct vm_page *pg)
+{
+
+   struct uvm_object *uobj = pg->uobject;
+   struct rwlock *slock;
+
+   if (uobj != NULL) {
+   slock = uobj->vmobjlock;
+   } else {
+   struct vm_anon *anon = pg->uanon;
+
+   KASSERT(anon != NULL);
+   slock = anon->an_lock;
+   }
+
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
+   return NULL;
+   }
+
+   return slock;
+}
+
 
 /*
  * uvmpd_dropswap: free any swap allocated to this page.
@@ -474,51 +503,43 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 
anon = p->uanon;
uobj = p->uobject;
-   if (p->pg_flags & PQ_ANON) {
+
+   /*
+* first we attempt to lock the object that this page
+* belongs to.  if our attempt fails we skip on to
+* the next page (no harm done).  it is important to
+* "try" locking the object as we are locking in the
+* wrong order (pageq -> object) and we don't want to
+* deadlock.
+*/
+   slock = uvmpd_trylockowner(p);
+   if (slock == NULL) {
+   continue;
+   }
+
+   /*
+* move referenced pages back to active queue
+* and skip to next page.
+*/
+   if (pmap_is_referenced(p)) {
+   uvm_pageactivate(p);
+   rw_exit(slock);
+   uvmexp.pdreact++;
+   continue;
+   }
+
+   if (p->pg_flags & PG_BUSY) {
+   rw_exit(slock);
+   uvmexp.pdbusy++;
+   continue;
+   }
+
+   /* does the page belong to an object? */
+   if (uobj != NULL) {
+   uvmexp.pdobscan++;
+   } else {
KASSERT(anon != NULL);
-   slock = anon->an_lock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   /* lock failed, skip this page */
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to next page.
-*/
-   if (pmap_is_referenced(p)) {
-   uvm_pageactivate(p);
-   rw_exit(slock);
-   uvmexp.pdreact++;
-   continue;
-   }
-   if (p->pg_flags & PG_BUSY) {
- 

uvmpd_dropswap()

2022-08-29 Thread Martin Pieuchot
Small refactoring to introduce uvmpd_dropswap().  This will make an
upcoming rewrite of the pdaemon smaller & easier to review :o)

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   22 Aug 2022 12:03:32 -  1.102
+++ uvm/uvm_pdaemon.c   29 Aug 2022 11:55:52 -
@@ -105,6 +105,7 @@ void uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
+void   uvmpd_dropswap(struct vm_page *);
 
 /*
  * uvm_wait: wait (sleep) for the page daemon to free some pages
@@ -367,6 +368,23 @@ uvm_aiodone_daemon(void *arg)
 }
 
 
+/*
+ * uvmpd_dropswap: free any swap allocated to this page.
+ *
+ * => called with owner locked.
+ */
+void
+uvmpd_dropswap(struct vm_page *pg)
+{
+   struct vm_anon *anon = pg->uanon;
+
+   if ((pg->pg_flags & PQ_ANON) && anon->an_swslot) {
+   uvm_swap_free(anon->an_swslot, 1);
+   anon->an_swslot = 0;
+   } else if (pg->pg_flags & PQ_AOBJ) {
+   uao_dropswap(pg->uobject, pg->offset >> PAGE_SHIFT);
+   }
+}
 
 /*
  * uvmpd_scan_inactive: scan an inactive list for pages to clean or free.
@@ -566,16 +584,7 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
KASSERT(uvmexp.swpginuse <= uvmexp.swpages);
if ((p->pg_flags & PQ_SWAPBACKED) &&
uvmexp.swpginuse == uvmexp.swpages) {
-
-   if ((p->pg_flags & PQ_ANON) &&
-   p->uanon->an_swslot) {
-   uvm_swap_free(p->uanon->an_swslot, 1);
-   p->uanon->an_swslot = 0;
-   }
-   if (p->pg_flags & PQ_AOBJ) {
-   uao_dropswap(p->uobject,
-p->offset >> PAGE_SHIFT);
-   }
+   uvmpd_dropswap(p);
}
 
/*
@@ -599,16 +608,7 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
if (swap_backed) {
/* free old swap slot (if any) */
-   if (anon) {
-   if (anon->an_swslot) {
-   uvm_swap_free(anon->an_swslot,
-   1);
-   anon->an_swslot = 0;
-   }
-   } else {
-   uao_dropswap(uobj,
-p->offset >> PAGE_SHIFT);
-   }
+   uvmpd_dropswap(p);
 
/* start new cluster (if necessary) */
if (swslot == 0) {



pdaemon locking tweak

2022-08-29 Thread Martin Pieuchot
Diff below refactors the pdaemon's locking by introducing a new *trylock()
function for a given page.  This is shamelessly stolen from NetBSD.

This is part of my ongoing effort to untangle the locks used by the page
daemon.

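For reference, a caller is expected to use it like this (a minimal
sketch mirroring the hunk below, nothing beyond what the diff does):

	slock = uvmpd_trylockowner(p);
	if (slock == NULL) {
		/* owner lock is contended, skip this page */
		continue;
	}
	/* ... examine, reactivate or clean the page ... */
	rw_exit(slock);
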
ok?

Index: uvm//uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.102
diff -u -p -r1.102 uvm_pdaemon.c
--- uvm//uvm_pdaemon.c  22 Aug 2022 12:03:32 -  1.102
+++ uvm//uvm_pdaemon.c  29 Aug 2022 11:36:59 -
@@ -101,6 +101,7 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
+struct rwlock  *uvmpd_trylockowner(struct vm_page *);
 void   uvmpd_scan(struct uvm_pmalloc *);
 void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
@@ -367,6 +368,34 @@ uvm_aiodone_daemon(void *arg)
 }
 
 
+/*
+ * uvmpd_trylockowner: trylock the page's owner.
+ *
+ * => return the locked rwlock on success.  otherwise, return NULL.
+ */
+struct rwlock *
+uvmpd_trylockowner(struct vm_page *pg)
+{
+
+   struct uvm_object *uobj = pg->uobject;
+   struct rwlock *slock;
+
+   if (uobj != NULL) {
+   slock = uobj->vmobjlock;
+   } else {
+   struct vm_anon *anon = pg->uanon;
+
+   KASSERT(anon != NULL);
+   slock = anon->an_lock;
+   }
+
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
+   return NULL;
+   }
+
+   return slock;
+}
+
 
 /*
  * uvmpd_scan_inactive: scan an inactive list for pages to clean or free.
@@ -454,53 +483,44 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
uvmexp.pdscans++;
nextpg = TAILQ_NEXT(p, pageq);
 
+   /*
+* move referenced pages back to active queue
+* and skip to next page.
+*/
+   if (pmap_is_referenced(p)) {
+   uvm_pageactivate(p);
+   uvmexp.pdreact++;
+   continue;
+   }
+
anon = p->uanon;
uobj = p->uobject;
-   if (p->pg_flags & PQ_ANON) {
+
+   /*
+* first we attempt to lock the object that this page
+* belongs to.  if our attempt fails we skip on to
+* the next page (no harm done).  it is important to
+* "try" locking the object as we are locking in the
+* wrong order (pageq -> object) and we don't want to
+* deadlock.
+*/
+   slock = uvmpd_trylockowner(p);
+   if (slock == NULL) {
+   continue;
+   }
+
+   if (p->pg_flags & PG_BUSY) {
+   rw_exit(slock);
+   uvmexp.pdbusy++;
+   continue;
+   }
+
+   /* does the page belong to an object? */
+   if (uobj != NULL) {
+   uvmexp.pdobscan++;
+   } else {
KASSERT(anon != NULL);
-   slock = anon->an_lock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   /* lock failed, skip this page */
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip to next page.
-*/
-   if (pmap_is_referenced(p)) {
-   uvm_pageactivate(p);
-   rw_exit(slock);
-   uvmexp.pdreact++;
-   continue;
-   }
-   if (p->pg_flags & PG_BUSY) {
-   rw_exit(slock);
-   uvmexp.pdbusy++;
-   continue;
-   }
uvmexp.pdanscan++;
-   } else {
-   KASSERT(uobj != NULL);
-   slock = uobj->vmobjlock;
-   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
-   continue;
-   }
-   /*
-* move referenced pages back to active queue
-* and skip 

Simplify locking code in pdaemon

2022-08-18 Thread Martin Pieuchot
Use a "slock" variable as done in multiple places to simplify the code.
The locking stays the same.  This is just a first step to simplify this
mess.

Also get rid of the return value of the function, it is never checked.

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.101
diff -u -p -r1.101 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   28 Jun 2022 19:31:30 -  1.101
+++ uvm/uvm_pdaemon.c   18 Aug 2022 10:44:52 -
@@ -102,7 +102,7 @@ extern void drmbackoff(long);
  */
 
 void   uvmpd_scan(struct uvm_pmalloc *);
-boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
+void   uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -377,17 +377,16 @@ uvm_aiodone_daemon(void *arg)
  * => we handle the building of swap-backed clusters
  * => we return TRUE if we are exiting because we met our target
  */
-
-boolean_t
+void
 uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
-   boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
+   struct rwlock *slock;
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -402,7 +401,6 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
swslot = 0;
swnpages = swcpages = 0;
-   free = 0;
dirtyreacts = 0;
p = NULL;
 
@@ -431,18 +429,14 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
uobj = NULL;
anon = NULL;
-
if (p) {
/*
-* update our copy of "free" and see if we've met
-* our target
+* see if we've met our target
 */
free = uvmexp.free - BUFPAGES_DEFICIT;
if (((pma == NULL || (pma->pm_flags & UVM_PMA_FREED)) &&
(free + uvmexp.paging >= uvmexp.freetarg << 2)) ||
dirtyreacts == UVMPD_NUMDIRTYREACTS) {
-   retval = TRUE;
-
if (swslot == 0) {
/* exit now if no swap-i/o pending */
break;
@@ -450,9 +444,9 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 
/* set p to null to signal final swap i/o */
p = NULL;
+   nextpg = NULL;
}
}
-
if (p) {/* if (we have a new page to consider) */
/*
 * we are below target and have a new page to consider.
@@ -460,11 +454,12 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
uvmexp.pdscans++;
nextpg = TAILQ_NEXT(p, pageq);
 
+   anon = p->uanon;
+   uobj = p->uobject;
if (p->pg_flags & PQ_ANON) {
-   anon = p->uanon;
KASSERT(anon != NULL);
-   if (rw_enter(anon->an_lock,
-   RW_WRITE|RW_NOSLEEP)) {
+   slock = anon->an_lock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
/* lock failed, skip this page */
continue;
}
@@ -474,23 +469,20 @@ uvmpd_scan_inactive(struct uvm_pmalloc *
 */
if (pmap_is_referenced(p)) {
uvm_pageactivate(p);
-   rw_exit(anon->an_lock);
+   rw_exit(slock);
uvmexp.pdreact++;
continue;
}
if (p->pg_flags & PG_BUSY) {
-   rw_exit(anon->an_lock);
+   rw_exit(slock);
uvmexp.pdbusy++;
-   /* someone else owns page, skip it */
continue;
}
uvmexp.pdanscan++;
} else {
-   uobj = p->uobject;
  

Fix a race in uvm_pseg_release()

2022-08-18 Thread Martin Pieuchot
The lock must be grabbed before iterating on the global array, ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.88
diff -u -p -r1.88 uvm_pager.c
--- uvm/uvm_pager.c 15 Aug 2022 03:21:04 -  1.88
+++ uvm/uvm_pager.c 18 Aug 2022 10:31:16 -
@@ -209,6 +209,7 @@ uvm_pseg_release(vaddr_t segaddr)
struct uvm_pseg *pseg;
vaddr_t va = 0;
 
+   mtx_enter(&uvm_pseg_lck);
for (pseg = &psegs[0]; pseg != &psegs[PSEG_NUMSEGS]; pseg++) {
if (pseg->start <= segaddr &&
segaddr < pseg->start + MAX_PAGER_SEGS * MAXBSIZE)
@@ -222,7 +223,6 @@ uvm_pseg_release(vaddr_t segaddr)
/* test for no remainder */
KDASSERT(segaddr == pseg->start + id * MAXBSIZE);
 
-   mtx_enter(&uvm_pseg_lck);
 
KASSERT(UVM_PSEG_INUSE(pseg, id));
 



Re: uvm_swap: introduce uvm_swap_data_lock

2022-08-17 Thread Martin Pieuchot
On 16/01/22(Sun) 15:35, Martin Pieuchot wrote:
> On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> > The diff below does two things: it adds a uvm_swap_data_lock mutex and
> > trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()
> 
> Why is it enough?  Which fields is the lock protecting in these
> function?  Is it `uvmexp.swpages', could that be documented?  

It is documented in the diff below.

> 
> What about `nswapdev'?  Why is the rwlock grabbed before reading it in
> sys_swapctl()?i

Because it is always modified with the lock, I added some documentation.

> What about `swpginuse'?

This is still under KERNEL_LOCK(), documented below.

> If the mutex/rwlock are used to protect the global `swap_priority' could
> that be also documented?  Once this is documented it should be trivial to
> see that some places are missing some locking.  Is it intentional?
> 
> > The uvm_swap_data_lock protects all swap data structures, so needs to be
> > grabbed a few times, many of them already documented in the comments.
> > 
> > For review, I suggest comparing to what NetBSD did and also going
> > through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> > and check that they are properly locked when called, or that there is
> > the KERNEL_LOCK() in place when swap data structures are manipulated.
> 
> I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
> the differences with NetBSD.

Done.

> > In swapmount() I introduced locking since that's needed to be able to
> > assert that the proper locks are held in swaplist_{insert,find,trim}.
> 
> Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
> What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

This has been done as part of another change.  This diff uses an atomic
operation to increase `nswget' in case multiple threads fault on a page
in swap at the same time.

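For the record, the `nswget' bump mentioned above boils down to a
one-liner of the following form (sketch; the uvm_swap_get() hunk is not
part of the diff shown here):

	atomic_inc_int(&uvmexp.nswget);
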
Updated diff below, ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_swap.c
--- uvm/uvm_swap.c  6 Aug 2022 13:44:04 -   1.163
+++ uvm/uvm_swap.c  17 Aug 2022 11:46:20 -
@@ -45,6 +45,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -84,13 +85,16 @@
  * the system maintains a global data structure describing all swap
  * partitions/files.   there is a sorted LIST of "swappri" structures
  * which describe "swapdev"'s at that priority.   this LIST is headed
- * by the "swap_priority" global var.each "swappri" contains a 
+ * by the "swap_priority" global var.each "swappri" contains a
  * TAILQ of "swapdev" structures at that priority.
  *
  * locking:
  *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
  *system call and prevents the swap priority list from changing
  *while we are in the middle of a system call (e.g. SWAP_STATS).
+ *  - uvm_swap_data_lock (mutex): this lock protects all swap data
+ *structures including the priority list, the swapdev structures,
+ *and the swapmap arena.
  *
  * each swap device has the following info:
  *  - swap device in use (could be disabled, preventing future use)
@@ -106,7 +110,7 @@
  * userland controls and configures swap with the swapctl(2) system call.
  * the sys_swapctl performs the following operations:
  *  [1] SWAP_NSWAP: returns the number of swap devices currently configured
- *  [2] SWAP_STATS: given a pointer to an array of swapent structures 
+ *  [2] SWAP_STATS: given a pointer to an array of swapent structures
  * (passed in via "arg") of a size passed in via "misc" ... we load
  * the current swap config into the array.
  *  [3] SWAP_ON: given a pathname in arg (could be device or file) and a
@@ -208,9 +212,10 @@ struct extent *swapmap;/* controls the
 
 /* list of all active swap devices [by priority] */
 LIST_HEAD(swap_priority, swappri);
-struct swap_priority swap_priority;
+struct swap_priority swap_priority;/* [S] */
 
 /* locks */
+struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
 struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
 
 struct mutex oommtx = MUTEX_INITIALIZER(IPL_VM);
@@ -224,7 +229,7 @@ void swapdrum_add(struct swapdev *, in
 struct swapdev *swapdrum_getsdp(int);
 
 struct swapdev *swaplist_find(struct vnode *, int);
-voidswaplist_insert(struct swapdev *, 
+voidswaplist_insert(struct swapdev *,
 struct swappri *, int);
 voidswaplist_trim(void);
 
@@ -472,16 +477,19 @@ uvm_swap_finicrypt_all(void)
 /*
  * swaplist_insert: insert swap device "sdp" into the 

Re: patch: change swblk_t type and use it in blist

2022-08-05 Thread Martin Pieuchot
On 05/08/22(Fri) 18:10, Sebastien Marie wrote:
> Hi,
> 
> When initially ported blist from DragonFlyBSD, we used custom type bsblk_t 
> and 
> bsbmp_t instead of the one used by DragonFlyBSD (swblk_t and u_swblk_t).
> 
> The reason was swblk_t is already defined on OpenBSD, and was incompatible 
> with 
> blist (int32_t). It is defined, but not used (outside some regress file which 
> seems to be not affected by type change).
> 
> This diff changes the __swblk_t definition in sys/_types.h to be 'unsigned 
> long', and switch back blist to use swblk_t (and u_swblk_t, even if it isn't 
> 'unsigned swblk_t').
> 
> It makes the diff with DragonFlyBSD more thin. I added a comment with the git 
> id 
> used for the initial port.
> 
> I tested it on i386 and amd64 (kernel and userland).
> 
> By changing bitmap type from 'u_long' to 'u_swblk_t' ('u_int64_t'), it makes 
> the 
> regress the same on 64 and 32bits archs (and it success on both).
> 
> Comments or OK ?

Makes sense to me.  I'm not a standard/type lawyer so I don't know if
this is fine for userland.  So I'm ok with it.

> diff /home/semarie/repos/openbsd/src
> commit - 73f52ef7130cefbe5a8fe028eedaad0e54be7303
> path + /home/semarie/repos/openbsd/src
> blob - e05867429cdd81c434f9ca589c1fb8c6d25957f8
> file + sys/sys/_types.h
> --- sys/sys/_types.h
> +++ sys/sys/_types.h
> @@ -60,7 +60,7 @@ typedef __uint8_t   __sa_family_t;  /* sockaddr 
> address f
>  typedef  __int32_t   __segsz_t;  /* segment size */
>  typedef  __uint32_t  __socklen_t;/* length type for network 
> syscalls */
>  typedef  long__suseconds_t;  /* microseconds (signed) */
> -typedef  __int32_t   __swblk_t;  /* swap offset */
> +typedef  unsigned long   __swblk_t;  /* swap offset */
>  typedef  __int64_t   __time_t;   /* epoch time */
>  typedef  __int32_t   __timer_t;  /* POSIX timer identifiers */
>  typedef  __uint32_t  __uid_t;/* user id */
> blob - 102ca95dd45ba6d9cab0f3fcbb033d6043ec1606
> file + sys/sys/blist.h
> --- sys/sys/blist.h
> +++ sys/sys/blist.h
> @@ -1,4 +1,5 @@
>  /* $OpenBSD: blist.h,v 1.1 2022/07/29 17:47:12 semarie Exp $ */
> +/* DragonFlyBSD:7b80531f545c7d3c51c1660130c71d01f6bccbe0:/sys/sys/blist.h */
>  /*
>   * Copyright (c) 2003,2004 The DragonFly Project.  All rights reserved.
>   * 
> @@ -65,15 +66,13 @@
>  #include 
>  #endif
>  
> -#define  SWBLK_BITS 64
> -typedef u_long bsbmp_t;
> -typedef u_long bsblk_t;
> +typedef u_int64_tu_swblk_t;
>  
>  /*
>   * note: currently use SWAPBLK_NONE as an absolute value rather then
>   * a flag bit.
>   */
> -#define SWAPBLK_NONE ((bsblk_t)-1)
> +#define SWAPBLK_NONE ((swblk_t)-1)
>  
>  /*
>   * blmeta and bl_bitmap_t MUST be a power of 2 in size.
> @@ -81,39 +80,39 @@ typedef u_long bsblk_t;
>  
>  typedef struct blmeta {
>   union {
> - bsblk_t bmu_avail;  /* space available under us */
> - bsbmp_t bmu_bitmap; /* bitmap if we are a leaf  */
> + swblk_t bmu_avail;  /* space available under us */
> + u_swblk_t   bmu_bitmap; /* bitmap if we are a leaf  */
>   } u;
> - bsblk_t bm_bighint; /* biggest contiguous block hint*/
> + swblk_t bm_bighint; /* biggest contiguous block hint*/
>  } blmeta_t;
>  
>  typedef struct blist {
> - bsblk_t bl_blocks;  /* area of coverage */
> + swblk_t bl_blocks;  /* area of coverage */
>   /* XXX int64_t bl_radix */
> - bsblk_t bl_radix;   /* coverage radix   */
> - bsblk_t bl_skip;/* starting skip*/
> - bsblk_t bl_free;/* number of free blocks*/
> + swblk_t bl_radix;   /* coverage radix   */
> + swblk_t bl_skip;/* starting skip*/
> + swblk_t bl_free;/* number of free blocks*/
>   blmeta_t*bl_root;   /* root of radix tree   */
> - bsblk_t bl_rootblks;/* bsblk_t blks allocated for tree */
> + swblk_t bl_rootblks;/* swblk_t blks allocated for tree */
>  } *blist_t;
>  
> -#define BLIST_META_RADIX (sizeof(bsbmp_t)*8/2)   /* 2 bits per */
> -#define BLIST_BMAP_RADIX (sizeof(bsbmp_t)*8) /* 1 bit per */
> +#define BLIST_META_RADIX (sizeof(u_swblk_t)*8/2) /* 2 bits per */
> +#define BLIST_BMAP_RADIX (sizeof(u_swblk_t)*8)   /* 1 bit per */
>  
>  /*
>   * The radix may exceed the size of a 64 bit signed (or unsigned) int
> - * when the maximal number of blocks is allocated.  With a 32-bit bsblk_t
> + * when the maximal number of blocks is allocated.  With a 32-bit swblk_t
>   * this corresponds to ~1G x PAGE_SIZE = 4096GB.  The swap code usually
>   * divides this by 4, leaving us with a capability of up to four 1TB swap
>   * devices.
>   *
> - * With a 

Re: Introduce uvm_pagewait()

2022-07-11 Thread Martin Pieuchot
On 28/06/22(Tue) 14:13, Martin Pieuchot wrote:
> I'd like to abstract the use of PG_WANTED to start unifying & cleaning
> the various cases where a code path is waiting for a busy page.  Here's
> the first step.
> 
> ok?

Anyone?

> Index: uvm/uvm_amap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
> retrieving revision 1.90
> diff -u -p -r1.90 uvm_amap.c
> --- uvm/uvm_amap.c30 Aug 2021 16:59:17 -  1.90
> +++ uvm/uvm_amap.c28 Jun 2022 11:53:08 -
> @@ -781,9 +781,7 @@ ReStart:
>* it and then restart.
>*/
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
> - "cownow", INFSLP);
> + uvm_pagewait(pg, amap->am_lock, "cownow");
>   goto ReStart;
>   }
>  
> Index: uvm/uvm_aobj.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
> retrieving revision 1.103
> diff -u -p -r1.103 uvm_aobj.c
> --- uvm/uvm_aobj.c29 Dec 2021 20:22:06 -  1.103
> +++ uvm/uvm_aobj.c28 Jun 2022 11:53:08 -
> @@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
>   while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
>   pmap_page_protect(pg, PROT_NONE);
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;
>   }
>   uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
> @@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
>  
>   /* Make sure page is unbusy, else wait for it. */
>   if (pg->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pg->pg_flags, PG_WANTED);
> - rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
> - INFSLP);
> + uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE;
>   continue;
>   }
> @@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
>  
>   /* page is there, see if we need to wait on it */
>   if ((ptmp->pg_flags & PG_BUSY) != 0) {
> - atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
> - rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
> - "uao_get", INFSLP);
> + uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   continue;   /* goto top of pps while loop */
>   }
>  
> Index: uvm/uvm_km.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_km.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uvm_km.c
> --- uvm/uvm_km.c  7 Jun 2022 12:07:45 -   1.150
> +++ uvm/uvm_km.c  28 Jun 2022 11:53:08 -
> @@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
>   for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
>   pp = uvm_pagelookup(uobj, curoff);
>   if (pp && pp->pg_flags & PG_BUSY) {
> - atomic_setbits_int(&pp->pg_flags, PG_WANTED);
> - rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
> - INFSLP);
> + uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
> + rw_enter(uobj->vmobjlock, RW_WRITE);
>   curoff -= PAGE_SIZE; /* loop back to us */
>   continue;
>   }
> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.166
> diff -u -p -r1.166 uvm_page.c
> --- uvm/uvm_page.c12 May 2022 12:48:36 -  1.166
> +++ uvm/uv

Faster M operation for the swapper to be great again

2022-06-30 Thread Martin Pieuchot
Diff below uses two tricks to make uvm_pagermapin/out() faster and less
likely to fail in OOM situations.

These functions are used to map buffers when swapping pages in/out and
when faulting on mmaped files.  robert@ even measured a 75% improvement
when populating pages related to files that aren't yet in the buffer
cache.

The first trick is to use the direct map when available.  I'm doing this
for single pages but km_alloc(9) also does that for single segment...
uvm_io() only maps one page at a time for the moment so this should be
enough.

The second trick is to use pmap_kenter_pa() which doesn't fail and is
faster.

With these changes the "freeze" happening on my server when entering many
pages to swap in an OOM situation is much shorter and the machine quickly
becomes responsive again.

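In other words, for a single page the map/unmap pair now reduces to the
following (a simplified sketch of the diff below):

	/* map: use the direct map, no KVA allocation, cannot fail */
	kva = pmap_map_direct(pg);
	...
	/* unmap */
	pmap_unmap_direct(kva);
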
ok?

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.81
diff -u -p -r1.81 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 19:07:40 -  1.81
+++ uvm/uvm_pager.c 30 Jun 2022 13:34:46 -
@@ -258,6 +258,16 @@ uvm_pagermapin(struct vm_page **pps, int
vsize_t size;
struct vm_page *pp;
 
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   KASSERT(pps[0]);
+   KASSERT(pps[0]->pg_flags & PG_BUSY);
+   kva = pmap_map_direct(pps[0]);
+   return kva;
+   }
+#endif
+
prot = PROT_READ;
if (flags & UVMPAGER_MAPIN_READ)
prot |= PROT_WRITE;
@@ -273,14 +283,7 @@ uvm_pagermapin(struct vm_page **pps, int
pp = *pps++;
KASSERT(pp);
KASSERT(pp->pg_flags & PG_BUSY);
-   /* Allow pmap_enter to fail. */
-   if (pmap_enter(pmap_kernel(), cva, VM_PAGE_TO_PHYS(pp),
-   prot, PMAP_WIRED | PMAP_CANFAIL | prot) != 0) {
-   pmap_remove(pmap_kernel(), kva, cva);
-   pmap_update(pmap_kernel());
-   uvm_pseg_release(kva);
-   return 0;
-   }
+   pmap_kenter_pa(cva, VM_PAGE_TO_PHYS(pp), prot);
}
pmap_update(pmap_kernel());
return kva;
@@ -294,8 +297,15 @@ uvm_pagermapin(struct vm_page **pps, int
 void
 uvm_pagermapout(vaddr_t kva, int npages)
 {
+#ifdef __HAVE_PMAP_DIRECT
+   /* use direct mappings for single page */
+   if (npages == 1) {
+   pmap_unmap_direct(kva);
+   return;
+   }
+#endif
 
-   pmap_remove(pmap_kernel(), kva, kva + ((vsize_t)npages << PAGE_SHIFT));
+   pmap_kremove(kva, (vsize_t)npages << PAGE_SHIFT);
pmap_update(pmap_kernel());
uvm_pseg_release(kva);
 



Re: Use SMR instead of SRP list in rtsock.c

2022-06-30 Thread Martin Pieuchot
On 30/06/22(Thu) 11:56, Claudio Jeker wrote:
> On Thu, Jun 30, 2022 at 12:34:33PM +0300, Vitaliy Makkoveev wrote:
> > On Thu, Jun 30, 2022 at 11:08:48AM +0200, Claudio Jeker wrote:
> > > This diff converts the SRP list to a SMR list in rtsock.c
> > > SRP is a bit strange with how it works and the SMR code is a bit easier to
> > > understand. Since we can sleep in the SMR_TAILQ_FOREACH() we need to grab
> > > a refcount on the route pcb so that we can leave the SMR critical section
> > > and then enter the SMR critical section at the end of the loop before
> > > dropping the refcount again.
> > > 
> > > The diff does not immeditaly explode but I doubt we can exploit
> > > parallelism in route_input() so this may fail at some later stage if it is
> > > wrong.
> > > 
> > > Comments from the lock critics welcome
> > 
> > We use `so_lock' rwlock(9) to protect route domain sockets. We can't
> > convert this SRP list to SMR list because we call solock() within
> > foreach loop.

We shouldn't use SRP list either, no?  Or are we allowed to sleep
holding a SRP reference?  That's the question that triggered this diff.

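As I understand the description, the loop ends up looking roughly like
this (sketch only, the route_input() hunk is truncated below):

	smr_read_enter();
	SMR_TAILQ_FOREACH(rop, &rtptable.rtp_list, rop_list) {
		refcnt_take(&rop->rop_refcnt);
		smr_read_leave();

		/* ... solock(), deliver the message, possibly sleep ... */

		smr_read_enter();
		refcnt_rele_wake(&rop->rop_refcnt);
	}
	smr_read_leave();
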
> because of the so_lock the code uses a refcnt on the route pcb to make
> sure that the object is not freed while we sleep. So that is handled by
> this diff.
>  
> > We can easily crash kernel by running in parallel some "route monitor"
> > commands and "while true; ifconfig vether0 create ; ifconfig vether0
> > destroy; done".
> 
> That does not cause problem on my system.
>  
> > > -- 
> > > :wq Claudio
> > > 
> > > Index: sys/net/rtsock.c
> > > ===
> > > RCS file: /cvs/src/sys/net/rtsock.c,v
> > > retrieving revision 1.334
> > > diff -u -p -r1.334 rtsock.c
> > > --- sys/net/rtsock.c  28 Jun 2022 10:01:13 -  1.334
> > > +++ sys/net/rtsock.c  30 Jun 2022 08:02:09 -
> > > @@ -71,7 +71,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > -#include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -107,8 +107,6 @@ struct walkarg {
> > >  };
> > >  
> > >  void route_prinit(void);
> > > -void rcb_ref(void *, void *);
> > > -void rcb_unref(void *, void *);
> > >  int  route_output(struct mbuf *, struct socket *, struct sockaddr *,
> > >   struct mbuf *);
> > >  int  route_ctloutput(int, struct socket *, int, int, struct mbuf *);
> > > @@ -149,7 +147,7 @@ intrt_setsource(unsigned int, struct 
> > >  struct rtpcb {
> > >   struct socket   *rop_socket;/* [I] */
> > >  
> > > - SRPL_ENTRY(rtpcb)   rop_list;
> > > + SMR_TAILQ_ENTRY(rtpcb)  rop_list;
> > >   struct refcnt   rop_refcnt;
> > >   struct timeout  rop_timeout;
> > >   unsigned introp_msgfilter;  /* [s] */
> > > @@ -162,8 +160,7 @@ struct rtpcb {
> > >  #define  sotortpcb(so)   ((struct rtpcb *)(so)->so_pcb)
> > >  
> > >  struct rtptable {
> > > - SRPL_HEAD(, rtpcb)  rtp_list;
> > > - struct srpl_rc  rtp_rc;
> > > + SMR_TAILQ_HEAD(, rtpcb) rtp_list;
> > >   struct rwlock   rtp_lk;
> > >   unsigned intrtp_count;
> > >  };
> > > @@ -185,29 +182,12 @@ struct rtptable rtptable;
> > >  void
> > >  route_prinit(void)
> > >  {
> > > - srpl_rc_init(_rc, rcb_ref, rcb_unref, NULL);
> > >   rw_init(_lk, "rtsock");
> > > - SRPL_INIT(_list);
> > > + SMR_TAILQ_INIT(_list);
> > >   pool_init(_pool, sizeof(struct rtpcb), 0,
> > >   IPL_SOFTNET, PR_WAITOK, "rtpcb", NULL);
> > >  }
> > >  
> > > -void
> > > -rcb_ref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_take(>rop_refcnt);
> > > -}
> > > -
> > > -void
> > > -rcb_unref(void *null, void *v)
> > > -{
> > > - struct rtpcb *rop = v;
> > > -
> > > - refcnt_rele_wake(>rop_refcnt);
> > > -}
> > > -
> > >  int
> > >  route_usrreq(struct socket *so, int req, struct mbuf *m, struct mbuf 
> > > *nam,
> > >  struct mbuf *control, struct proc *p)
> > > @@ -325,8 +305,7 @@ route_attach(struct socket *so, int prot
> > >   so->so_options |= SO_USELOOPBACK;
> > >  
> > >   rw_enter(_lk, RW_WRITE);
> > > - SRPL_INSERT_HEAD_LOCKED(_rc, _list, rop,
> > > - rop_list);
> > > + SMR_TAILQ_INSERT_HEAD_LOCKED(_list, rop, rop_list);
> > >   rtptable.rtp_count++;
> > >   rw_exit(_lk);
> > >  
> > > @@ -347,8 +326,7 @@ route_detach(struct socket *so)
> > >   rw_enter(_lk, RW_WRITE);
> > >  
> > >   rtptable.rtp_count--;
> > > - SRPL_REMOVE_LOCKED(_rc, _list, rop, rtpcb,
> > > - rop_list);
> > > + SMR_TAILQ_REMOVE_LOCKED(_list, rop, rop_list);
> > >   rw_exit(_lk);
> > >  
> > >   sounlock(so);
> > > @@ -356,6 +334,7 @@ route_detach(struct socket *so)
> > >   /* wait for all references to drop */
> > >   refcnt_finalize(>rop_refcnt, "rtsockrefs");
> > >   timeout_del_barrier(>rop_timeout);
> > > + smr_barrier();
> > >  
> > >   solock(so);
> > >  
> > > @@ -501,7 +480,6 @@ route_input(struct mbuf *m0, struct sock
> > >   struct rtpcb *rop;
> > >   

Re: arp llinfo mutex

2022-06-29 Thread Martin Pieuchot
On 29/06/22(Wed) 19:40, Alexander Bluhm wrote:
> Hi,
> 
> To fix the KASSERT(la != NULL) we have to protect the rt_llinfo
> with a mutex.  The idea is to keep rt_llinfo and RTF_LLINFO consistent.
> Also do not put the mutex in the fast path.

Losing the RTM_ADD/DELETE race is not a bug.  I would not add a printf
in these cases.  I understand you might want one for debugging purposes
but I don't see any value in committing it.  Do you agree?

Note that sometimes the code checks for the RTF_LLINFO flag and
sometimes for rt_llinfo != NULL.  This is inconsistent and a bit confusing
now that we use a mutex to protect those states.

Could you document that rt_llinfo is now protected by the mutex (or
KERNEL_LOCK())?

Anyway this is an improvement ok mpi@

PS: What about ND6?

> Index: netinet/if_ether.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/if_ether.c,v
> retrieving revision 1.250
> diff -u -p -r1.250 if_ether.c
> --- netinet/if_ether.c27 Jun 2022 20:47:10 -  1.250
> +++ netinet/if_ether.c28 Jun 2022 14:00:12 -
> @@ -101,6 +101,8 @@ void arpreply(struct ifnet *, struct mbu
>  unsigned int);
>  
>  struct niqueue arpinq = NIQUEUE_INITIALIZER(50, NETISR_ARP);
> +
> +/* llinfo_arp live time, rt_llinfo and RTF_LLINFO are protected by arp_mtx */
>  struct mutex arp_mtx = MUTEX_INITIALIZER(IPL_SOFTNET);
>  
>  LIST_HEAD(, llinfo_arp) arp_list; /* [mN] list of all llinfo_arp structures 
> */
> @@ -155,7 +157,7 @@ void
>  arp_rtrequest(struct ifnet *ifp, int req, struct rtentry *rt)
>  {
>   struct sockaddr *gate = rt->rt_gateway;
> - struct llinfo_arp *la = (struct llinfo_arp *)rt->rt_llinfo;
> + struct llinfo_arp *la;
>   time_t uptime;
>  
>   NET_ASSERT_LOCKED();
> @@ -171,7 +173,7 @@ arp_rtrequest(struct ifnet *ifp, int req
>   rt->rt_expire = 0;
>   break;
>   }
> - if ((rt->rt_flags & RTF_LOCAL) && !la)
> + if ((rt->rt_flags & RTF_LOCAL) && rt->rt_llinfo == NULL)
>   rt->rt_expire = 0;
>   /*
>* Announce a new entry if requested or warn the user
> @@ -192,44 +194,54 @@ arp_rtrequest(struct ifnet *ifp, int req
>   }
>   satosdl(gate)->sdl_type = ifp->if_type;
>   satosdl(gate)->sdl_index = ifp->if_index;
> - if (la != NULL)
> - break; /* This happens on a route change */
>   /*
>* Case 2:  This route may come from cloning, or a manual route
>* add with a LL address.
>*/
>   la = pool_get(&arp_pool, PR_NOWAIT | PR_ZERO);
> - rt->rt_llinfo = (caddr_t)la;
>   if (la == NULL) {
>   log(LOG_DEBUG, "%s: pool get failed\n", __func__);
>   break;
>   }
>  
> + mtx_enter(&arp_mtx);
> + if (rt->rt_llinfo != NULL) {
> + /* we lost the race, another thread has entered it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo exists\n", __func__);
> + pool_put(&arp_pool, la);
> + break;
> + }
>   mq_init(&la->la_mq, LA_HOLD_QUEUE, IPL_SOFTNET);
> + rt->rt_llinfo = (caddr_t)la;
>   la->la_rt = rt;
>   rt->rt_flags |= RTF_LLINFO;
> + LIST_INSERT_HEAD(&arp_list, la, la_list);
>   if ((rt->rt_flags & RTF_LOCAL) == 0)
>   rt->rt_expire = uptime;
> - mtx_enter(&arp_mtx);
> - LIST_INSERT_HEAD(&arp_list, la, la_list);
>   mtx_leave(&arp_mtx);
> +
>   break;
>  
>   case RTM_DELETE:
> - if (la == NULL)
> - break;
>   mtx_enter(&arp_mtx);
> + la = (struct llinfo_arp *)rt->rt_llinfo;
> + if (la == NULL) {
> + /* we lost the race, another thread has removed it */
> + mtx_leave(&arp_mtx);
> + printf("%s: llinfo missing\n", __func__);
> + break;
> + }
>   LIST_REMOVE(la, la_list);
> - mtx_leave(&arp_mtx);
>   rt->rt_llinfo = NULL;
>   rt->rt_flags &= ~RTF_LLINFO;
>   atomic_sub_int(&la_hold_total, mq_purge(&la->la_mq));
> + mtx_leave(&arp_mtx);
> +
>   pool_put(&arp_pool, la);
>   break;
>  
>   case RTM_INVALIDATE:
> - if (la == NULL)
> - break;
>   if (!ISSET(rt->rt_flags, RTF_LOCAL))
>   arpinvalidate(rt);
>   break;
> @@ -363,8 +375,6 @@ arpresolve(struct ifnet *ifp, struct rte
>   goto bad;
>   }
>  
> - la = (struct llinfo_arp *)rt->rt_llinfo;
> - KASSERT(la != NULL);
>  
>   /*
>* Check the 

Simplify aiodone daemon

2022-06-29 Thread Martin Pieuchot
The aiodone daemon accounts for and frees/releases pages that were
written to swap.  It is only used for asynchronous writes.  The diff
below uses this knowledge to:

- Stop suggesting that uvm_swap_get() can be asynchronous.  There's an
  assert for PGO_SYNCIO 3 lines above.

- Remove unused support for asynchronous read, including error
  conditions, from uvm_aio_aiodone_pages().

- Grab the proper lock for each page that has been written to swap
  (see the sketch below).  This allows us to enable an assert in
  uvm_page_unbusy().

- Move the uvm_anon_release() call outside of uvm_page_unbusy() and
  assert for the different anon cases.  This will allow us to unify
  code paths waiting for busy pages.

This is adapted/simplified from what is in NetBSD.

ok?

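For the third bullet, the shape of the change in uvm_aio_aiodone_pages()
is roughly the following (a sketch assuming swap-backed pages are always
anon- or aobj-owned; the uvm_pager.c hunk itself is not visible below):

	if (pg->uobject != NULL)
		slock = pg->uobject->vmobjlock;
	else
		slock = pg->uanon->an_lock;
	rw_enter(slock, RW_WRITE);
	uvm_page_unbusy(&pg, 1);
	rw_exit(slock);
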
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  29 Jun 2022 11:16:35 -
@@ -143,7 +143,6 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +240,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  29 Jun 2022 11:16:35 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  29 Jun 2022 11:47:55 -
@@ -1036,13 +1036,14 @@ uvm_pagefree(struct vm_page *pg)
  * uvm_page_unbusy: unbusy an array of pages.
  *
  * => pages must either all belong to the same object, or all belong to anons.
+ * => if pages are object-owned, object must be locked.
  * => if pages are anon-owned, anons must have 0 refcount.
+ * => caller must make sure that anon-owned pages are not PG_RELEASED.
  */
 void
 uvm_page_unbusy(struct vm_page **pgs, int npgs)
 {
struct vm_page *pg;
-   struct uvm_object *uobj;
int i;
 
for (i = 0; i < npgs; i++) {
@@ -1052,35 +1053,19 @@ uvm_page_unbusy(struct vm_page **pgs, in
continue;
}
 
-#if notyet
-   /*
- * XXX swap case in uvm_aio_aiodone() is not holding the lock.
-*
-* This isn't compatible with the PG_RELEASED anon case below.
-*/
KASSERT(uvm_page_owner_locked_p(pg));
-#endif
KASSERT(pg->pg_flags & PG_BUSY);
 
if (pg->pg_flags & PG_WANTED) {
wakeup(pg);
}
if (pg->pg_flags & PG_RELEASED) {
-   uobj = pg->uobject;
-   if (uobj != NULL) {
-   uvm_lock_pageq();
-   pmap_page_protect(pg, PROT_NONE);
-   /* XXX won't happen right now */
-   if (pg->pg_flags & PQ_AOBJ)
-   uao_dropswap(uobj,
-   pg->offset >> PAGE_SHIFT);
-   uvm_pagefree(pg);
-   uvm_unlock_pageq();
-   } else {
-   rw_enter(pg->uanon->an_lock, RW_WRITE);
-   uvm_anon_release(pg->uanon);
-   }
+   KASSERT(pg->uobject != NULL ||
+   (pg->uanon != NULL && pg->uanon->an_ref > 0));
+   atomic_clearbits_int(&pg->pg_flags, PG_RELEASED);
+   uvm_pagefree(pg);
} else {
+   KASSERT((pg->pg_flags & PG_FAKE) == 0);
atomic_clearbits_int(&pg->pg_flags, PG_WANTED|PG_BUSY);
UVM_PAGE_OWN(pg, NULL);
   

Re: Unlocking pledge(2)

2022-06-28 Thread Martin Pieuchot
On 28/06/22(Tue) 18:17, Jeremie Courreges-Anglas wrote:
> 
> Initially I just wandered in syscall_mi.h and found the locking scheme
> super weird, even if technically correct.  pledge_syscall() better be
> safe to call without the kernel lock so I don't understand why we're
> sometimes calling it with the kernel lock and sometimes not.
> 
> ps_pledge is 64 bits so it's not possible to unset bits in an atomic
> manner on all architectures.  Even if we're only removing bits and there
> is probably no way to see a completely garbage value, it makes sense to
> just protect ps_pledge (and ps_execpledge) in the usual manner so that
> we can unlock the syscall.  The diff below protects the fields using
> ps_mtx even though I initially used a dedicated ps_pledge_mtx.
> unveil_destroy() needs to be moved after the critical section.
> regress/sys/kern/pledge looks happy with this.  The sys/syscall_mi.h
> change can be committed in a separate step.
> 
> Input and oks welcome.

This looks nice.  I doubt there's any existing program where you can
really test this.  Even firefox and chromium should do things
correctly.

Maybe you should write a regress test that tries to break the kernel.

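Something along these lines would already exercise the race (an
untested sketch, not an existing regress test):

	#include <err.h>
	#include <pthread.h>
	#include <unistd.h>

	static void *
	hammer(void *arg)
	{
		for (;;)
			(void)getpid();		/* any pledged syscall */
		return NULL;
	}

	int
	main(void)
	{
		pthread_t t;
		int i;

		if (pthread_create(&t, NULL, hammer, NULL) != 0)
			err(1, "pthread_create");
		for (i = 0; i < 100000; i++)
			if (pledge("stdio rpath inet unix dns", NULL) == -1)
				err(1, "pledge");
		return 0;
	}
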
> Index: arch/amd64/amd64/vmm.c
> ===
> RCS file: /home/cvs/src/sys/arch/amd64/amd64/vmm.c,v
> retrieving revision 1.315
> diff -u -p -r1.315 vmm.c
> --- arch/amd64/amd64/vmm.c27 Jun 2022 15:12:14 -  1.315
> +++ arch/amd64/amd64/vmm.c28 Jun 2022 13:54:25 -
> @@ -713,7 +713,7 @@ pledge_ioctl_vmm(struct proc *p, long co
>   case VMM_IOC_CREATE:
>   case VMM_IOC_INFO:
>   /* The "parent" process in vmd forks and manages VMs */
> - if (p->p_p->ps_pledge & PLEDGE_PROC)
> + if (pledge_get(p->p_p) & PLEDGE_PROC)
>   return (0);
>   break;
>   case VMM_IOC_TERM:
> @@ -1312,7 +1312,7 @@ vm_find(uint32_t id, struct vm **res)
>* The managing vmm parent process can lookup all
>* all VMs and is indicated by PLEDGE_PROC.
>*/
> - if (((p->p_p->ps_pledge &
> + if (((pledge_get(p->p_p) &
>   (PLEDGE_VMM | PLEDGE_PROC)) == PLEDGE_VMM) &&
>   (vm->vm_creator_pid != p->p_p->ps_pid))
>   return (pledge_fail(p, EPERM, PLEDGE_VMM));
> Index: kern/init_sysent.c
> ===
> RCS file: /home/cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.238
> diff -u -p -r1.238 init_sysent.c
> --- kern/init_sysent.c27 Jun 2022 14:26:05 -  1.238
> +++ kern/init_sysent.c28 Jun 2022 15:18:25 -
> @@ -1,10 +1,10 @@
> -/*   $OpenBSD: init_sysent.c,v 1.238 2022/06/27 14:26:05 cheloha Exp $   
> */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
>   *
>   * DO NOT EDIT-- this file is automatically generated.
> - * created from; OpenBSD: syscalls.master,v 1.224 2022/05/16 07:36:04 
> mvs Exp 
> + * created from; OpenBSD: syscalls.master,v 1.225 2022/06/27 14:26:05 
> cheloha Exp 
>   */
>  
>  #include 
> @@ -248,7 +248,7 @@ const struct sysent sysent[] = {
>   sys_listen },   /* 106 = listen */
>   { 4, s(struct sys_chflagsat_args), 0,
>   sys_chflagsat },/* 107 = chflagsat */
> - { 2, s(struct sys_pledge_args), 0,
> + { 2, s(struct sys_pledge_args), SY_NOLOCK | 0,
>   sys_pledge },   /* 108 = pledge */
>   { 4, s(struct sys_ppoll_args), 0,
>   sys_ppoll },/* 109 = ppoll */
> Index: kern/kern_event.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_event.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 kern_event.c
> --- kern/kern_event.c 27 Jun 2022 13:35:21 -  1.191
> +++ kern/kern_event.c 28 Jun 2022 13:55:18 -
> @@ -331,7 +331,7 @@ filt_procattach(struct knote *kn)
>   int s;
>  
>   if ((curproc->p_p->ps_flags & PS_PLEDGE) &&
> - (curproc->p_p->ps_pledge & PLEDGE_PROC) == 0)
> + (pledge_get(curproc->p_p) & PLEDGE_PROC) == 0)
>   return pledge_fail(curproc, EPERM, PLEDGE_PROC);
>  
>   if (kn->kn_id > PID_MAX)
> Index: kern/kern_pledge.c
> ===
> RCS file: /home/cvs/src/sys/kern/kern_pledge.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 kern_pledge.c
> --- kern/kern_pledge.c26 Jun 2022 06:11:49 -  1.282
> +++ kern/kern_pledge.c28 Jun 2022 15:21:46 -
> @@ -21,6 +21,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -465,13 +466,26 @@ sys_pledge(struct proc *p, void *v, regi
>   struct process *pr = p->p_p;
>   uint64_t 

Re: Fix the swapper

2022-06-28 Thread Martin Pieuchot
On 27/06/22(Mon) 15:44, Martin Pieuchot wrote:
> Diff below contain 3 parts that can be committed independently.  The 3
> of them are necessary to allow the pagedaemon to make progress in OOM
> situation and to satisfy all the allocations waiting for pages in
> specific ranges.
> 
> * uvm/uvm_pager.c part reserves a second segment for the page daemon.
>   This is necessary to ensure the two uvm_pagermapin() calls needed by
>   uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
>   necessary when encryption or bouncing is required)
> 
> * uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
>   for the same reason.  Note that a sleeping point is introduced because
>   the pagedaemon is faster than the asynchronous I/O and in OOM
>   situation it tends to stay busy building cluster that it then discard
>   because no memory is available.
> 
> * uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
>   list of pages to account for a given memory range.  Without this the
>   daemon could spin infinitely doing nothing because the global limits
>   are reached.

Here's an updated diff with a fix on top:

 * in uvm/uvm_swap.c make sure uvm_swap_allocpages() is allowed to sleep
   when coming from uvm_fault().  This makes the faulting process wait
   instead of dying when there aren't any free pages to do the bouncing.

I'd appreciate more reviews and tests!

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.80
diff -u -p -r1.80 uvm_pager.c
--- uvm/uvm_pager.c 28 Jun 2022 12:10:37 -  1.80
+++ uvm/uvm_pager.c 28 Jun 2022 15:25:30 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup();
 
-   if (pseg != [0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != [0] && pseg != [1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   28 Jun 2022 13:59:49 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-

Introduce uvm_pagewait()

2022-06-28 Thread Martin Pieuchot
I'd like to abstract the use of PG_WANTED to start unifying & cleaning
the various cases where a code path is waiting for a busy page.  Here's
the first step.

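Callers then follow the pattern below (sketch, matching the converted
call sites in the diff):

	if (pg->pg_flags & PG_BUSY) {
		/* drops the object lock while sleeping */
		uvm_pagewait(pg, uobj->vmobjlock, "pgwait");
		rw_enter(uobj->vmobjlock, RW_WRITE);
		continue;
	}
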
ok?

Index: uvm/uvm_amap.c
===
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
retrieving revision 1.90
diff -u -p -r1.90 uvm_amap.c
--- uvm/uvm_amap.c  30 Aug 2021 16:59:17 -  1.90
+++ uvm/uvm_amap.c  28 Jun 2022 11:53:08 -
@@ -781,9 +781,7 @@ ReStart:
 * it and then restart.
 */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, amap->am_lock, PVM | PNORELOCK,
-   "cownow", INFSLP);
+   uvm_pagewait(pg, amap->am_lock, "cownow");
goto ReStart;
}
 
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  28 Jun 2022 11:53:08 -
@@ -835,9 +835,8 @@ uao_detach(struct uvm_object *uobj)
while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
pmap_page_protect(pg, PROT_NONE);
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uao_det",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uao_det");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;
}
uao_dropswap(&aobj->u_obj, pg->offset >> PAGE_SHIFT);
@@ -909,9 +908,8 @@ uao_flush(struct uvm_object *uobj, voff_
 
/* Make sure page is unbusy, else wait for it. */
if (pg->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
-   rwsleep_nsec(pg, uobj->vmobjlock, PVM, "uaoflsh",
-   INFSLP);
+   uvm_pagewait(pg, uobj->vmobjlock, "uaoflsh");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE;
continue;
}
@@ -1147,9 +1145,8 @@ uao_get(struct uvm_object *uobj, voff_t 
 
/* page is there, see if we need to wait on it */
if ((ptmp->pg_flags & PG_BUSY) != 0) {
-   atomic_setbits_int(&ptmp->pg_flags, PG_WANTED);
-   rwsleep_nsec(ptmp, uobj->vmobjlock, PVM,
-   "uao_get", INFSLP);
+   uvm_pagewait(ptmp, uobj->vmobjlock, "uao_get");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
continue;   /* goto top of pps while loop */
}
 
Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.150
diff -u -p -r1.150 uvm_km.c
--- uvm/uvm_km.c7 Jun 2022 12:07:45 -   1.150
+++ uvm/uvm_km.c28 Jun 2022 11:53:08 -
@@ -255,9 +255,8 @@ uvm_km_pgremove(struct uvm_object *uobj,
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
-   atomic_setbits_int(&pp->pg_flags, PG_WANTED);
-   rwsleep_nsec(pp, uobj->vmobjlock, PVM, "km_pgrm",
-   INFSLP);
+   uvm_pagewait(pp, uobj->vmobjlock, "km_pgrm");
+   rw_enter(uobj->vmobjlock, RW_WRITE);
curoff -= PAGE_SIZE; /* loop back to us */
continue;
}
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  28 Jun 2022 11:57:42 -
@@ -1087,6 +1087,23 @@ uvm_page_unbusy(struct vm_page **pgs, in
}
 }
 
+/*
+ * uvm_pagewait: wait for a busy page
+ *
+ * => page must be known PG_BUSY
+ * => object must be locked
+ * => object will be unlocked on return
+ */
+void
+uvm_pagewait(struct vm_page *pg, struct rwlock *lock, const char *wmesg)
+{
+   KASSERT(rw_lock_held(lock));
+   KASSERT((pg->pg_flags & PG_BUSY) != 0);
+
+   atomic_setbits_int(&pg->pg_flags, PG_WANTED);
+   rwsleep_nsec(pg, lock, PVM | PNORELOCK, wmesg, INFSLP);
+}
+
 #if defined(UVM_PAGE_TRKOWN)
 /*
  * uvm_page_own: set or 

Re: kernel lock in arp

2022-06-27 Thread Martin Pieuchot
On 27/06/22(Mon) 19:11, Alexander Bluhm wrote:
> On Mon, Jun 27, 2022 at 11:49:23AM +0200, Alexander Bluhm wrote:
> > On Sat, May 21, 2022 at 10:50:28PM +0300, Vitaliy Makkoveev wrote:
> > > This diff looks good, except the re-check after kernel lock. It???s
> > > supposed `rt??? could became inconsistent, right? But what stops to
> > > make it inconsistent after first unlocked RTF_LLINFO flag check?
> > >
> > > I think this re-check should gone.
> >
> > I have copied the re-check from intenal genua code.  I am not sure
> > if it is really needed.  We know from Hrvoje that the diff with
> > re-check is stable.  And we know that it crashes without kernel
> > lock at all.
> >
> > I have talked with mpi@ about it.  The main problem is that we have
> > no write lock when we change RTF_LLINFO.  Then rt_llinfo can get
> > NULL or inconsistent.
> >
> > Plan is that I put some lock asserts into route add and delete.
> > This helps to find the parts that modify RTF_LLINFO and rt_llinfo
> > without exclusive lock.
> >
> > Maybe we need some kernel lock somewhere else.  Or we want to use
> > some ARP mutex.  We could also add some comment and commit the diff
> > that I have.  We know that it is faster and stable.  Pushing the
> > kernel lock down or replacing it with something clever can always
> > be done later.
> 
> We need the re-check.  I have tested it with a printf.  It is
> triggered by running arp -d in a loop while forwarding.
> 
> The concurrent threads are these:
> 
> rtrequest_delete(8000246b7428,3,80775048,8000246b7510,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a23550,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246b7678,3,8000246b7718,0) at rtrequest+0x55c
> rt_clone(8000246b7780,8000246b78f8,0) at rt_clone+0x73
> rtalloc_mpath(8000246b78f8,fd8003169ad8,0) at rtalloc_mpath+0x4c
> ip_forward(fd80b8cc7e00,8077d048,fd8834a230f0,0) at 
> ip_forward+0x137
> ip_input_if(8000246b7a28,8000246b7a34,4,0,8077d048) at 
> ip_input_if+0x353
> ipv4_input(8077d048,fd80b8cc7e00) at ipv4_input+0x39
> ether_input(8077d048,fd80b8cc7e00) at ether_input+0x3ad
> if_input_process(8077d048,8000246b7b18) at if_input_process+0x6f
> ifiq_process(8077d458) at ifiq_process+0x69
> taskq_thread(80036080) at taskq_thread+0x100
> 
> rtrequest_delete(8000246c8d08,3,80775048,8000246c8df0,0) at 
> rtrequest_delete+0x67
> rtdeletemsg(fd8834a230f0,80775048,0) at rtdeletemsg+0x1ad
> rtrequest(b,8000246c8f58,3,8000246c8ff8,0) at rtrequest+0x55c
> rt_clone(8000246c9060,8000246c90b8,0) at rt_clone+0x73
> rtalloc_mpath(8000246c90b8,fd8002c754d8,0) at rtalloc_mpath+0x4c
> in_ouraddr(fd8094771b00,8077d048,8000246c9138) at 
> in_ouraddr+0x84
> ip_input_if(8000246c91d8,8000246c91e4,4,0,8077d048) at 
> ip_input_if+0x1cd
> ipv4_input(8077d048,fd8094771b00) at ipv4_input+0x39
> ether_input(8077d048,fd8094771b00) at ether_input+0x3ad
> if_input_process(8077d048,8000246c92c8) at if_input_process+0x6f
> ifiq_process(80781400) at ifiq_process+0x69
> taskq_thread(80036200) at taskq_thread+0x100
> 
> I have added a comment why kernel lock protects us.  I would like
> to get this in.  It has been tested, reduces the kernel lock and
> is faster.  A more clever lock can be done later.
> 
> ok?

I don't understand how the KERNEL_LOCK() there prevents rtdeletemsg()
from running.  rtrequest_delete() seems completely broken: it assumes it
holds an exclusive lock.

To "fix" arp the KERNEL_LOCK() should also be taken in RTM_DELETE and
RTM_RESOLVE inside arp_rtrequest().  Or maybe around ifp->if_rtrequest()

But it doesn't mean there isn't another problem in rtdeletemsg()...

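To be explicit, the second option would look roughly like this at the
ifp->if_rtrequest() call sites (untested sketch):

	KERNEL_LOCK();
	ifp->if_rtrequest(ifp, req, rt);
	KERNEL_UNLOCK();
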
> Index: net/if_ethersubr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> retrieving revision 1.281
> diff -u -p -r1.281 if_ethersubr.c
> --- net/if_ethersubr.c26 Jun 2022 21:19:53 -  1.281
> +++ net/if_ethersubr.c27 Jun 2022 16:55:15 -
> @@ -221,10 +221,7 @@ ether_resolve(struct ifnet *ifp, struct 
>  
>   switch (af) {
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> - KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IP);
> @@ -285,10 +282,7 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #endif
>   case AF_INET:
> - KERNEL_LOCK();
> - /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> -

CoW & neighbor pages

2022-06-27 Thread Martin Pieuchot
When faulting a page after COW, neighboring pages are likely to already
be entered.  So speed up the fault by doing a narrow fault (do not try
to map in adjacent pages).

This is stolen from NetBSD.

ok?

Index: uvm/uvm_fault.c
===
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
retrieving revision 1.129
diff -u -p -r1.129 uvm_fault.c
--- uvm/uvm_fault.c 4 Apr 2022 09:27:05 -   1.129
+++ uvm/uvm_fault.c 27 Jun 2022 17:05:26 -
@@ -737,6 +737,16 @@ uvm_fault_check(struct uvm_faultinfo *uf
}
 
/*
+* for a case 2B fault waste no time on adjacent pages because
+* they are likely already entered.
+*/
+   if (uobj != NULL && amap != NULL &&
+   (flt->access_type & PROT_WRITE) != 0) {
+   /* wide fault (!narrow) */
+   flt->narrow = TRUE;
+   }
+
+   /*
 * establish range of interest based on advice from mapper
 * and then clip to fit map entry.   note that we only want
 * to do this the first time through the fault.   if we



Fix the swapper

2022-06-27 Thread Martin Pieuchot
Diff below contain 3 parts that can be committed independently.  The 3
of them are necessary to allow the pagedaemon to make progress in OOM
situation and to satisfy all the allocations waiting for pages in
specific ranges.

* uvm/uvm_pager.c part reserves a second segment for the page daemon.
  This is necessary to ensure the two uvm_pagermapin() calls needed by
  uvm_swap_io() succeed in emergency OOM situation.  (the 2nd segment is
  necessary when encryption or bouncing is required)

* uvm/uvm_swap.c part pre-allocates 16 pages in the DMA-reachable region
  for the same reason.  Note that a sleeping point is introduced because
  the pagedaemon is faster than the asynchronous I/O and in OOM
  situations it tends to stay busy building clusters that it then discards
  because no memory is available.

* uvm/uvm_pdaemon.c part changes the inner-loop scanning the inactive 
  list of pages to account for a given memory range.  Without this the
  daemon could spin infinitely doing nothing because the global limits
  are reached.

A lot could be improved, but this at least makes swapping work in OOM
situations.

Index: uvm/uvm_pager.c
===
RCS file: /cvs/src/sys/uvm/uvm_pager.c,v
retrieving revision 1.78
diff -u -p -r1.78 uvm_pager.c
--- uvm/uvm_pager.c 18 Feb 2022 09:04:38 -  1.78
+++ uvm/uvm_pager.c 27 Jun 2022 08:44:41 -
@@ -58,8 +58,8 @@ const struct uvm_pagerops *uvmpagerops[]
  * The number of uvm_pseg instances is dynamic using an array segs.
  * At most UVM_PSEG_COUNT instances can exist.
  *
- * psegs[0] always exists (so that the pager can always map in pages).
- * psegs[0] element 0 is always reserved for the pagedaemon.
+ * psegs[0/1] always exist (so that the pager can always map in pages).
+ * psegs[0/1] element 0 are always reserved for the pagedaemon.
  *
  * Any other pseg is automatically created when no space is available
  * and automatically destroyed when it is no longer in use.
@@ -93,6 +93,7 @@ uvm_pager_init(void)
 
/* init pager map */
uvm_pseg_init(&psegs[0]);
+   uvm_pseg_init(&psegs[1]);
mtx_init(&uvm_pseg_lck, IPL_VM);
 
/* init ASYNC I/O queue */
@@ -168,9 +169,10 @@ pager_seg_restart:
goto pager_seg_fail;
}
 
-   /* Keep index 0 reserved for pagedaemon. */
-   if (pseg == &psegs[0] && curproc != uvm.pagedaemon_proc)
-   i = 1;
+   /* Keep indexes 0,1 reserved for pagedaemon. */
+   if ((pseg == &psegs[0] || pseg == &psegs[1]) &&
+   (curproc != uvm.pagedaemon_proc))
+   i = 2;
else
i = 0;
 
@@ -229,7 +231,7 @@ uvm_pseg_release(vaddr_t segaddr)
pseg->use &= ~(1 << id);
wakeup(&psegs);
 
-   if (pseg != &psegs[0] && UVM_PSEG_EMPTY(pseg)) {
+   if ((pseg != &psegs[0] && pseg != &psegs[1]) && UVM_PSEG_EMPTY(pseg)) {
va = pseg->start;
pseg->start = 0;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   12 May 2022 12:49:31 -  1.99
+++ uvm/uvm_pdaemon.c   27 Jun 2022 13:24:54 -
@@ -101,8 +101,8 @@ extern void drmbackoff(long);
  * local prototypes
  */
 
-void   uvmpd_scan(void);
-boolean_t  uvmpd_scan_inactive(struct pglist *);
+void   uvmpd_scan(struct uvm_pmalloc *);
+boolean_t  uvmpd_scan_inactive(struct uvm_pmalloc *, struct pglist *);
 void   uvmpd_tune(void);
 void   uvmpd_drop(struct pglist *);
 
@@ -281,7 +281,7 @@ uvm_pageout(void *arg)
if (pma != NULL ||
((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) ||
((uvmexp.inactive + BUFPAGES_INACT) < uvmexp.inactarg)) {
-   uvmpd_scan();
+   uvmpd_scan(pma);
}
 
/*
@@ -379,15 +379,15 @@ uvm_aiodone_daemon(void *arg)
  */
 
 boolean_t
-uvmpd_scan_inactive(struct pglist *pglst)
+uvmpd_scan_inactive(struct uvm_pmalloc *pma, struct pglist *pglst)
 {
boolean_t retval = FALSE;   /* assume we haven't hit target */
int free, result;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
-   struct vm_page *pps[MAXBSIZE >> PAGE_SHIFT], **ppsp;
+   struct vm_page *pps[SWCLUSTPAGES], **ppsp;
int npages;
-   struct vm_page *swpps[MAXBSIZE >> PAGE_SHIFT];  /* XXX: see below */
+   struct vm_page *swpps[SWCLUSTPAGES];/* XXX: see below */
int swnpages, swcpages; /* XXX: see below */
int swslot;
struct vm_anon *anon;
@@ -404,8 +404,27 @@ uvmpd_scan_inactive(struct pglist *pglst
swnpages = swcpages = 0;
free = 0;
dirtyreacts = 0;
+   p = NULL;
 
-   for (p 

pdaemon: reserve memory for swapping

2022-06-26 Thread Martin Pieuchot
uvm_swap_io() needs to perform up to 4 allocations to write pages to
disk.  In OOM situations uvm_swap_allocpages() always fails because the
kernel doesn't reserve enough pages.

Diff below sets `uvmexp.reserve_pagedaemon' to the number of pages needed
to write a cluster of pages to disk.  With this my machine does not
deadlock and can push pages to swap in the OOM case.
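
For scale: assuming MAXBSIZE is 64KB and the page size is 4KB (as on
amd64), MAXBSIZE >> PAGE_SHIFT is 16, so `uvmexp.reserve_pagedaemon'
goes from 4 to 16 pages and `uvmexp.reserve_kernel' from 8 to 20.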

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  26 Jun 2022 08:17:34 -
@@ -280,10 +280,13 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 
/*
 * init reserve thresholds
-* XXXCDC - values may need adjusting
+*
+* The pagedaemon needs to always be able to write pages to disk,
+* Reserve the minimum amount of pages, a cluster, required by
+* uvm_swap_allocpages()
 */
-   uvmexp.reserve_pagedaemon = 4;
-   uvmexp.reserve_kernel = 8;
+   uvmexp.reserve_pagedaemon = (MAXBSIZE >> PAGE_SHIFT);
+   uvmexp.reserve_kernel = uvmexp.reserve_pagedaemon + 4;
uvmexp.anonminpct = 10;
uvmexp.vnodeminpct = 10;
uvmexp.vtextminpct = 5;



Re: set RTF_DONE in sysctl_dumpentry for the routing table

2022-06-08 Thread Martin Pieuchot
On 08/06/22(Wed) 16:13, Claudio Jeker wrote:
> Notice while hacking in OpenBGPD. Unlike routing socket messages the
> messages from the sysctl interface have RTF_DONE not set.
> I think it would make sense to set RTF_DONE also in this case since it
> makes reusing code easier.
> 
> All messages sent out via sysctl_dumpentry() have been processed by the
> kernel so setting RTF_DONE kind of makes sense.

I agree, ok mpi@

> -- 
> :wq Claudio
> 
> Index: rtsock.c
> ===
> RCS file: /cvs/src/sys/net/rtsock.c,v
> retrieving revision 1.328
> diff -u -p -r1.328 rtsock.c
> --- rtsock.c  6 Jun 2022 14:45:41 -   1.328
> +++ rtsock.c  8 Jun 2022 14:10:20 -
> @@ -1987,7 +1987,7 @@ sysctl_dumpentry(struct rtentry *rt, voi
>   struct rt_msghdr *rtm = (struct rt_msghdr *)w->w_tmem;
>  
>   rtm->rtm_pid = curproc->p_p->ps_pid;
> - rtm->rtm_flags = rt->rt_flags;
> + rtm->rtm_flags = RTF_DONE | rt->rt_flags;
>   rtm->rtm_priority = rt->rt_priority & RTP_MASK;
>   rtm_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx);
>   /* Do not account the routing table's reference. */
> 



Re: Fix clearing of sleep timeouts

2022-06-06 Thread Martin Pieuchot
On 06/06/22(Mon) 06:47, David Gwynne wrote:
> On Sun, Jun 05, 2022 at 03:57:39PM +, Visa Hankala wrote:
> > On Sun, Jun 05, 2022 at 12:27:32PM +0200, Martin Pieuchot wrote:
> > > On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> > > > Encountered the following panic:
> > > > 
> > > > panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" 
> > > > failed: file "/usr/src/sys/kern/kern_synch.c", line 373
> > > > Stopped at  db_enter+0x10:  popq%rbp
> > > > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > > >  423109  57118 55 0x3  02  link
> > > >  330695  30276 55 0x3  03  link
> > > > * 46366  85501 55  0x1003  0x40804001  link
> > > >  188803  85501 55  0x1003  0x40820000K link
> > > > db_enter() at db_enter+0x10
> > > > panic(81f25d2b) at panic+0xbf
> > > > __assert(81f9a186,81f372c8,175,81f87c6c) at 
> > > > __assert+0x25
> > > > sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> > > > sleep_setup+0x1d8
> > > > cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> > > > timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> > > > timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> > > > sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> > > > tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> > > > sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> > > > sys_nanosleep+0x12d
> > > > syscall(800022d64f60) at syscall+0x374
> > > > 
> > > > The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> > > > soft-interrupt-driven timeouts could be deleted synchronously without
> > > > blocking. Now, timeout_del_barrier() can sleep regardless of the type
> > > > of the timeout.
> > > > 
> > > > It looks that with small adjustments timeout_del_barrier() can sleep
> > > > in sleep_finish(). The management of run queues is not affected because
> > > > the timeout clearing happens after it. As timeout_del_barrier() does not
> > > > rely on a timeout or signal catching, there should be no risk of
> > > > unbounded recursion or unwanted signal side effects within the sleep
> > > > machinery. In a way, a sleep with a timeout is higher-level than
> > > > one without.
> > > 
> > > I trust you on the analysis.  However this looks very fragile to me.
> > > 
> > > The use of timeout_del_barrier() which can sleep using the global sleep
> > > queue is worrying me.  
> > 
> > I think the queue handling ends in sleep_finish() when SCHED_LOCK()
> > is released. The timeout clearing is done outside of it.
> 
> That's ok.
> 
> > The extra sleeping point inside sleep_finish() is subtle. It should not
> > matter in typical use. But is it permissible with the API? Also, if
> > timeout_del_barrier() sleeps, the thread's priority can change.
> 
> What other options do we have at this point? Spin? Allocate the timeout
> dynamically so sleep_finish doesn't have to wait for it and let the
> handler clean up? How would you stop the timeout handler waking up the
> thread if it's gone back to sleep again for some other reason?
> 
> Sleeping here is the least worst option.

I agree.  I don't think sleeping is bad here.  My concern is about how
sleeping is implemented.  There's a single API built on top of a single
global data structure which now calls itself recursively.  

I'm not sure how much work it would be to make cond_wait(9) use its own
sleep queue...  This is something independent from this fix though.

> As for timeout_del_barrier, if prio is a worry we can provide an
> advanced version of it that lets you pass the prio in. I'd also
> like to change timeout_barrier so it queues the barrier task at the
> head of the pending lists rather than at the tail.

I doubt prio matter.



Re: Fix clearing of sleep timeouts

2022-06-05 Thread Martin Pieuchot
On 05/06/22(Sun) 05:20, Visa Hankala wrote:
> Encountered the following panic:
> 
> panic: kernel diagnostic assertion "(p->p_flag & P_TIMEOUT) == 0" failed: 
> file "/usr/src/sys/kern/kern_synch.c", line 373
> Stopped at  db_enter+0x10:  popq%rbp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
>  423109  57118 55 0x3  02  link
>  330695  30276 55 0x3  03  link
> * 46366  85501 55  0x1003  0x40804001  link
>  188803  85501 55  0x1003  0x40820000K link
> db_enter() at db_enter+0x10
> panic(81f25d2b) at panic+0xbf
> __assert(81f9a186,81f372c8,175,81f87c6c) at 
> __assert+0x25
> sleep_setup(800022d64bf8,800022d64c98,20,81f66ac6,0) at 
> sleep_setup+0x1d8
> cond_wait(800022d64c98,81f66ac6) at cond_wait+0x46
> timeout_barrier(8000228a28b0) at timeout_barrier+0x109
> timeout_del_barrier(8000228a28b0) at timeout_del_barrier+0xa2
> sleep_finish(800022d64d90,1) at sleep_finish+0x16d
> tsleep(823a5130,120,81f0b730,2) at tsleep+0xb2
> sys_nanosleep(8000228a27f0,800022d64ea0,800022d64ef0) at 
> sys_nanosleep+0x12d
> syscall(800022d64f60) at syscall+0x374
> 
> The panic is a regression of sys/kern/kern_timeout.c r1.84. Previously,
> soft-interrupt-driven timeouts could be deleted synchronously without
> blocking. Now, timeout_del_barrier() can sleep regardless of the type
> of the timeout.
> 
> It looks that with small adjustments timeout_del_barrier() can sleep
> in sleep_finish(). The management of run queues is not affected because
> the timeout clearing happens after it. As timeout_del_barrier() does not
> rely on a timeout or signal catching, there should be no risk of
> unbounded recursion or unwanted signal side effects within the sleep
> machinery. In a way, a sleep with a timeout is higher-level than
> one without.

I trust you on the analysis.  However this looks very fragile to me.

The use of timeout_del_barrier() which can sleep using the global sleep
queue is worrying me.  

> Note that endtsleep() can run and set P_TIMEOUT during
> timeout_del_barrier() when the thread is blocked in cond_wait().
> To avoid unnecessary atomic read-modify-write operations, the clearing
> of P_TIMEOUT could be conditional, but maybe that is an unnecessary
> optimization at this point.

I agree this optimization seems unnecessary at the moment.

> While it should be possible to make the code use timeout_del() instead
> of timeout_del_barrier(), the outcome might not be outright better. For
> example, sleep_setup() and endtsleep() would have to coordinate so that
> a late-running timeout from previous sleep cycle would not disturb the
> new cycle.

So that's the price for not having to sleep in sleep_finish(), right?

> To test the barrier path reliably, I made the code call
> timeout_del_barrier() twice in a row. The second call is guaranteed
> to sleep. Of course, this is not part of the patch.

ok mpi@

> Index: kern/kern_synch.c
> ===
> RCS file: src/sys/kern/kern_synch.c,v
> retrieving revision 1.187
> diff -u -p -r1.187 kern_synch.c
> --- kern/kern_synch.c 13 May 2022 15:32:00 -  1.187
> +++ kern/kern_synch.c 5 Jun 2022 05:04:45 -
> @@ -370,8 +370,8 @@ sleep_setup(struct sleep_state *sls, con
>   p->p_slppri = prio & PRIMASK;
>   TAILQ_INSERT_TAIL(&slpque[LOOKUP(ident)], p, p_runq);
>  
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   if (timo) {
> + KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   sls->sls_timeout = 1;
>   timeout_add(&p->p_sleep_to, timo);
>   }
> @@ -432,13 +432,12 @@ sleep_finish(struct sleep_state *sls, in
>  
>   if (sls->sls_timeout) {
>   if (p->p_flag & P_TIMEOUT) {
> - atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   error1 = EWOULDBLOCK;
>   } else {
> - /* This must not sleep. */
> + /* This can sleep. It must not use timeouts. */
>   timeout_del_barrier(&p->p_sleep_to);
> - KASSERT((p->p_flag & P_TIMEOUT) == 0);
>   }
> + atomic_clearbits_int(&p->p_flag, P_TIMEOUT);
>   }
>  
>   /* Check if thread was woken up because of a unwind or signal */
> 



Re: start unlocking kbind(2)

2022-05-31 Thread Martin Pieuchot
On 18/05/22(Wed) 15:53, Alexander Bluhm wrote:
> On Tue, May 17, 2022 at 10:44:54AM +1000, David Gwynne wrote:
> > +   cookie = SCARG(uap, proc_cookie);
> > +   if (pr->ps_kbind_addr == pc) {
> > +   membar_datadep_consumer();
> > +   if (pr->ps_kbind_cookie != cookie)
> > +   goto sigill;
> > +   } else {
> 
> You must use membar_consumer() here.  membar_datadep_consumer() is
> a barrier between reading a pointer and the data it points to.  Only alpha
> requires membar_datadep_consumer() for that, everywhere else it is
> a NOP.
> 
> > +   mtx_enter(&pr->ps_mtx);
> > +   kpc = pr->ps_kbind_addr;
> 
> Do we need the kpc variable?  I would prefer to read the explicit
> pr->ps_kbind_addr in the two places where we use it.
> 
> I think the logic of barriers and mutexes is correct.
> 
> with the suggestions above OK bluhm@

I believe you should go ahead with the current diff.  ok with me.  Moving
the field under the scope of another lock can be easily done afterward.
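
For reference, the pairing bluhm@ describes looks roughly like this
(only a sketch using the fields from the diff, not the final code):

	/* writer side: publish the cookie before the address */
	pr->ps_kbind_cookie = cookie;
	membar_producer();
	pr->ps_kbind_addr = pc;

	/* reader side: check the address first, then the cookie */
	if (pr->ps_kbind_addr == pc) {
		membar_consumer();	/* order the two loads */
		if (pr->ps_kbind_cookie != cookie)
			sigexit(p, SIGILL);
	}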



Re: ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
On 24/05/22(Tue) 15:24, Mark Kettenis wrote:
> > Date: Tue, 24 May 2022 14:28:39 +0200
> > From: Martin Pieuchot 
> > 
> > The softdep code path is missing a UVM cache invalidation compared to
> > the !softdep one.  This is necessary to flush pages of a persisting
> > vnode.
> > 
> > Since uvm_vnp_setsize() is also called later in this function for the
> > !softdep case move it to not call it twice.
> > 
> > ok?
> 
> I'm not sure this is correct.  I'm trying to understand why you're
> moving the uvm_uvn_setsize() call.  Are you just trying to call it
> twice?  Or are you trying to avoid calling it at all when we end up in
> an error path?
>
> The way you moved it means we'll still call it twice for "partially
> truncated" files with softdeps.  At least the way I understand the
> code is that the code will fsync the vnode and dropping down in the
> "normal" non-softdep code that will call uvm_vnp_setsize() (and
> uvn_vnp_uncache()) again.  So maybe you should move the
> uvm_uvn_setsize() call into the else case?

We might want to do that indeed.  I'm not sure what the implications are
of calling uvm_vnp_setsize/uncache() after VOP_FSYNC(), which might fail.
So I'd rather play it safe and go with that diff.

> > Index: ufs/ffs/ffs_inode.c
> > ===
> > RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
> > retrieving revision 1.81
> > diff -u -p -r1.81 ffs_inode.c
> > --- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
> > +++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
> > @@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
> > if (length > fs->fs_maxfilesize)
> > return (EFBIG);
> >  
> > -   uvm_vnp_setsize(ovp, length);
> > oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
> > = oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
> >  
> > if (DOINGSOFTDEP(ovp)) {
> > +   uvm_vnp_setsize(ovp, length);
> > +   (void) uvm_vnp_uncache(ovp);
> > if (length > 0 || softdep_slowdown(ovp)) {
> > /*
> >  * If a file is only partially truncated, then
> > 
> > 



Please test: rewrite of pdaemon

2022-05-24 Thread Martin Pieuchot
Diff below brings in & adapts most of the changes from NetBSD's r1.37 of
uvm_pdaemon.c.  My motivation for doing this is to untangle the inner
loop of uvmpd_scan_inactive() which will allow us to split the global
`pageqlock' mutex in a next step.

The idea behind this change is to get rid of the too-complex uvm_pager*
abstraction by checking early if a page is going to be flushed or
swapped to disk.  The loop is then clearly divided into two cases which
makes it more readable. 

This also opens the door to a better integration between UVM's vnode
layer and the buffer cache.

The main loop of uvmpd_scan_inactive() can be understood as below (a
rough sketch follows the two cases):

. If a page can be flushed we can call "uvn_flush()" directly and pass the
  PGO_ALLPAGES flag instead of building a cluster beforehand.  Note that,
  in its current form uvn_flush() is synchronous.

. If the page needs to be swapped, mark it as PG_PAGEOUT, build a cluster
  and once it is full call uvm_swap_put(). 
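
In pseudocode, the reworked loop looks roughly like this (a sketch of
the structure only, it does not match the diff line for line):

	for each page p on the inactive list:
		lock p's owner (uobject or anon), skip p if it is busy
		if p is clean:
			reclaim it (or reactivate it if it was referenced)
		else if p is object-backed:
			uvn_flush(uobj, start, end, PGO_ALLPAGES)	/* sync */
		else:						/* swap-backed */
			set PG_PAGEOUT on p and add it to the cluster
			if the cluster is full:
				uvm_swap_put(swslot, swpps, swcpages, flags)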

Please test this diff, do not hesitate to play with the `vm.swapencrypt.enable'
sysctl(2).

Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.103
diff -u -p -r1.103 uvm_aobj.c
--- uvm/uvm_aobj.c  29 Dec 2021 20:22:06 -  1.103
+++ uvm/uvm_aobj.c  24 May 2022 12:31:34 -
@@ -143,7 +143,7 @@ struct pool uvm_aobj_pool;
 
 static struct uao_swhash_elt   *uao_find_swhash_elt(struct uvm_aobj *, int,
 boolean_t);
-static int  uao_find_swslot(struct uvm_object *, int);
+int uao_find_swslot(struct uvm_object *, int);
 static boolean_tuao_flush(struct uvm_object *, voff_t,
 voff_t, int);
 static void uao_free(struct uvm_aobj *);
@@ -241,7 +241,7 @@ uao_find_swhash_elt(struct uvm_aobj *aob
 /*
  * uao_find_swslot: find the swap slot number for an aobj/pageidx
  */
-inline static int
+int
 uao_find_swslot(struct uvm_object *uobj, int pageidx)
 {
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
Index: uvm/uvm_aobj.h
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.h,v
retrieving revision 1.17
diff -u -p -r1.17 uvm_aobj.h
--- uvm/uvm_aobj.h  21 Oct 2020 09:08:14 -  1.17
+++ uvm/uvm_aobj.h  24 May 2022 12:31:34 -
@@ -60,6 +60,7 @@
 
 void uao_init(void);
 int uao_set_swslot(struct uvm_object *, int, int);
+int uao_find_swslot (struct uvm_object *, int);
 int uao_dropswap(struct uvm_object *, int);
 int uao_swap_off(int, int);
 int uao_shrink(struct uvm_object *, int);
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.291
diff -u -p -r1.291 uvm_map.c
--- uvm/uvm_map.c   4 May 2022 14:58:26 -   1.291
+++ uvm/uvm_map.c   24 May 2022 12:31:34 -
@@ -3215,8 +3215,9 @@ uvm_object_printit(struct uvm_object *uo
  * uvm_page_printit: actually print the page
  */
 static const char page_flagbits[] =
-   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5CLEANCHK\6RELEASED\7FAKE\10RDONLY"
-   "\11ZERO\12DEV\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
+   "\20\1BUSY\2WANTED\3TABLED\4CLEAN\5PAGEOUT\6RELEASED\7FAKE\10RDONLY"
+   "\11ZERO\12DEV\13CLEANCHK"
+   "\15PAGER1\21FREE\22INACTIVE\23ACTIVE\25ANON\26AOBJ"
"\27ENCRYPT\31PMAP0\32PMAP1\33PMAP2\34PMAP3\35PMAP4\36PMAP5";
 
 void
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.166
diff -u -p -r1.166 uvm_page.c
--- uvm/uvm_page.c  12 May 2022 12:48:36 -  1.166
+++ uvm/uvm_page.c  24 May 2022 12:32:54 -
@@ -960,6 +960,7 @@ uvm_pageclean(struct vm_page *pg)
 {
u_int flags_to_clear = 0;
 
+   KASSERT((pg->pg_flags & PG_PAGEOUT) == 0);
if ((pg->pg_flags & (PG_TABLED|PQ_ACTIVE|PQ_INACTIVE)) &&
(pg->uobject == NULL || !UVM_OBJ_IS_PMAP(pg->uobject)))
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
@@ -978,11 +979,14 @@ uvm_pageclean(struct vm_page *pg)
rw_write_held(pg->uanon->an_lock));
 
/*
-* if the page was an object page (and thus "TABLED"), remove it
-* from the object.
+* remove page from its object or anon.
 */
-   if (pg->pg_flags & PG_TABLED)
+   if (pg->pg_flags & PG_TABLED) {
uvm_pageremove(pg);
+   } else if (pg->uanon != NULL) {
+   pg->uanon->an_page = NULL;
+   pg->uanon = NULL;
+   }
 
/*
 * now remove the page from the queues
@@ -996,10 +1000,6 @@ uvm_pageclean(struct vm_page *pg)
pg->wire_count = 0;
uvmexp.wired--;
}
-   if (pg->uanon) {
-   pg->uanon->an_page = NULL;
-   pg->uanon = NULL;
-   }
 
   

ffs_truncate: Missing uvm_vnp_uncache() w/ softdep

2022-05-24 Thread Martin Pieuchot
The softdep code path is missing a UVM cache invalidation compared to
the !softdep one.  This is necessary to flush pages of a persisting
vnode.

Since uvm_vnp_setsize() is also called later in this function for the
!softdep case, move it so it is not called twice.

ok?

Index: ufs/ffs/ffs_inode.c
===
RCS file: /cvs/src/sys/ufs/ffs/ffs_inode.c,v
retrieving revision 1.81
diff -u -p -r1.81 ffs_inode.c
--- ufs/ffs/ffs_inode.c 12 Dec 2021 09:14:59 -  1.81
+++ ufs/ffs/ffs_inode.c 4 May 2022 15:32:15 -
@@ -172,11 +172,12 @@ ffs_truncate(struct inode *oip, off_t le
if (length > fs->fs_maxfilesize)
return (EFBIG);
 
-   uvm_vnp_setsize(ovp, length);
oip->i_ci.ci_lasta = oip->i_ci.ci_clen 
= oip->i_ci.ci_cstart = oip->i_ci.ci_lastw = 0;
 
if (DOINGSOFTDEP(ovp)) {
+   uvm_vnp_setsize(ovp, length);
+   (void) uvm_vnp_uncache(ovp);
if (length > 0 || softdep_slowdown(ovp)) {
/*
 * If a file is only partially truncated, then



Re: Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-24 Thread Martin Pieuchot
On 17/05/22(Tue) 16:55, Martin Pieuchot wrote:
> nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
> possibly mmap'ed file before calling VOP_RENAME().
> 
> ok?

Anyone?

> Index: nfs/nfs_serv.c
> ===
> RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
> retrieving revision 1.120
> diff -u -p -r1.120 nfs_serv.c
> --- nfs/nfs_serv.c11 Mar 2021 13:31:35 -  1.120
> +++ nfs/nfs_serv.c4 May 2022 15:29:06 -
> @@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
>   error = -1;
>  out:
>   if (!error) {
> + if (tvp) {
> + (void)uvm_vnp_uncache(tvp);
> + }
>   error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, _cnd,
>  tond.ni_dvp, tond.ni_vp, _cnd);
>   } else {
> 



Re: start unlocking kbind(2)

2022-05-17 Thread Martin Pieuchot
On 17/05/22(Tue) 10:44, David Gwynne wrote:
> this narrows the scope of the KERNEL_LOCK in kbind(2) so the syscall
> argument checks can be done without the kernel lock.
> 
> care is taken to allow the pc/cookie checks to run without any lock by
> being careful with the order of the checks. all modifications to the
> pc/cookie state are serialised by the per process mutex.

I don't understand why it is safe to do the following check without
holding a mutex:

if (pr->ps_kbind_addr == pc)
...

Is there much difference when always grabbing the per-process mutex?

> i dont know enough about uvm to say whether it is safe to unlock the
> actual memory updates too, but even if i was confident i would still
> prefer to change it as a separate step.

I agree.

> Index: kern/init_sysent.c
> ===
> RCS file: /cvs/src/sys/kern/init_sysent.c,v
> retrieving revision 1.236
> diff -u -p -r1.236 init_sysent.c
> --- kern/init_sysent.c1 May 2022 23:00:04 -   1.236
> +++ kern/init_sysent.c17 May 2022 00:36:03 -
> @@ -1,4 +1,4 @@
> -/*   $OpenBSD: init_sysent.c,v 1.236 2022/05/01 23:00:04 tedu Exp $  */
> +/*   $OpenBSD$   */
>  
>  /*
>   * System call switch table.
> @@ -204,7 +204,7 @@ const struct sysent sysent[] = {
>   sys_utimensat },/* 84 = utimensat */
>   { 2, s(struct sys_futimens_args), 0,
>   sys_futimens }, /* 85 = futimens */
> - { 3, s(struct sys_kbind_args), 0,
> + { 3, s(struct sys_kbind_args), SY_NOLOCK | 0,
>   sys_kbind },/* 86 = kbind */
>   { 2, s(struct sys_clock_gettime_args), SY_NOLOCK | 0,
>   sys_clock_gettime },/* 87 = clock_gettime */
> Index: kern/syscalls.master
> ===
> RCS file: /cvs/src/sys/kern/syscalls.master,v
> retrieving revision 1.223
> diff -u -p -r1.223 syscalls.master
> --- kern/syscalls.master  24 Feb 2022 07:41:51 -  1.223
> +++ kern/syscalls.master  17 May 2022 00:36:03 -
> @@ -194,7 +194,7 @@
>   const struct timespec *times, int flag); }
>  85   STD { int sys_futimens(int fd, \
>   const struct timespec *times); }
> -86   STD { int sys_kbind(const struct __kbind *param, \
> +86   STD NOLOCK  { int sys_kbind(const struct __kbind *param, \
>   size_t psize, int64_t proc_cookie); }
>  87   STD NOLOCK  { int sys_clock_gettime(clockid_t clock_id, \
>   struct timespec *tp); }
> Index: uvm/uvm_mmap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_mmap.c,v
> retrieving revision 1.169
> diff -u -p -r1.169 uvm_mmap.c
> --- uvm/uvm_mmap.c19 Jan 2022 10:43:48 -  1.169
> +++ uvm/uvm_mmap.c17 May 2022 00:36:03 -
> @@ -70,6 +70,7 @@
>  #include 
>  #include   /* for KBIND* */
>  #include 
> +#include 
>  
>  #include /* for __LDPGSZ */
>  
> @@ -1125,33 +1126,64 @@ sys_kbind(struct proc *p, void *v, regis
>   const char *data;
>   vaddr_t baseva, last_baseva, endva, pageoffset, kva;
>   size_t psize, s;
> - u_long pc;
> + u_long pc, kpc;
>   int count, i, extra;
> + uint64_t cookie;
>   int error;
>  
>   /*
>* extract syscall args from uap
>*/
>   paramp = SCARG(uap, param);
> - psize = SCARG(uap, psize);
>  
>   /* a NULL paramp disables the syscall for the process */
>   if (paramp == NULL) {
> + mtx_enter(&pr->ps_mtx);
>   if (pr->ps_kbind_addr != 0)
> - sigexit(p, SIGILL);
> + goto leave_sigill;
>   pr->ps_kbind_addr = BOGO_PC;
> + mtx_leave(&pr->ps_mtx);
>   return 0;
>   }
>  
>   /* security checks */
> +
> + /*
> +  * ps_kbind_addr can only be set to 0 or BOGO_PC by the
> +  * kernel, not by a call from userland.
> +  */
>   pc = PROC_PC(p);
> - if (pr->ps_kbind_addr == 0) {
> - pr->ps_kbind_addr = pc;
> - pr->ps_kbind_cookie = SCARG(uap, proc_cookie);
> - } else if (pc != pr->ps_kbind_addr || pc == BOGO_PC)
> - sigexit(p, SIGILL);
> - else if (pr->ps_kbind_cookie != SCARG(uap, proc_cookie))
> - sigexit(p, SIGILL);
> + if (pc == 0 || pc == BOGO_PC)
> + goto sigill;
> +
> + cookie = SCARG(uap, proc_cookie);
> + if (pr->ps_kbind_addr == pc) {
> + membar_datadep_consumer();
> + if (pr->ps_kbind_cookie != cookie)
> + goto sigill;
> + } else {
> + mtx_enter(&pr->ps_mtx);
> + kpc = pr->ps_kbind_addr;
> +
> + /*
> +  * If we're the first thread in (kpc is 0), then
> +  * 

Call uvm_vnp_uncache() before VOP_RENAME()

2022-05-17 Thread Martin Pieuchot
nfsrv_rename() should behave like dorenameat() and tell UVM to "flush" a
possibly mmap'ed file before calling VOP_RENAME().

ok?

Index: nfs/nfs_serv.c
===
RCS file: /cvs/src/sys/nfs/nfs_serv.c,v
retrieving revision 1.120
diff -u -p -r1.120 nfs_serv.c
--- nfs/nfs_serv.c  11 Mar 2021 13:31:35 -  1.120
+++ nfs/nfs_serv.c  4 May 2022 15:29:06 -
@@ -1488,6 +1488,9 @@ nfsrv_rename(struct nfsrv_descript *nfsd
error = -1;
 out:
if (!error) {
+   if (tvp) {
+   (void)uvm_vnp_uncache(tvp);
+   }
error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd,
   tond.ni_dvp, tond.ni_vp, &tond.ni_cnd);
} else {



Re: uvm_pagedequeue()

2022-05-12 Thread Martin Pieuchot
On 10/05/22(Tue) 20:23, Mark Kettenis wrote:
> > Date: Tue, 10 May 2022 18:45:21 +0200
> > From: Martin Pieuchot 
> > 
> > On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> > > Diff below introduces a new wrapper to manipulate active/inactive page
> > > queues. 
> > > 
> > > ok?
> > 
> > Anyone?
> 
> Sorry I started looking at this and got distracted.
> 
> I'm not sure about the changes to uvm_pageactivate().  It doesn't
> quite match what NetBSD does, but I guess NetBSD assumes that
> uvm_pageactiave() isn't called for a page that is already active?  And
> that's something we can't guarantee?

It is based on what NetBSD did 15 years ago.  We're not ready to synchronize
with NetBSD -current yet. 

We're getting there!

> The diff is correct though in the sense that it is equivalent to the
> code we already have.  So if this definitely is the direction you want
> to go:
> 
> ok kettenis@
> 
> > > Index: uvm/uvm_page.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> > > retrieving revision 1.165
> > > diff -u -p -r1.165 uvm_page.c
> > > --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> > > +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> > > @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
> > >   /*
> > >* now remove the page from the queues
> > >*/
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(_active, pg, pageq);
> > > - flags_to_clear |= PQ_ACTIVE;
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(_inactive, pg, pageq);
> > > - flags_to_clear |= PQ_INACTIVE;
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >  
> > >   /*
> > >* if the page was wired, unwire it now.
> > > @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
> > >   MUTEX_ASSERT_LOCKED();
> > >  
> > >   if (pg->wire_count == 0) {
> > > - if (pg->pg_flags & PQ_ACTIVE) {
> > > - TAILQ_REMOVE(_active, pg, pageq);
> > > - atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> > > - uvmexp.active--;
> > > - }
> > > - if (pg->pg_flags & PQ_INACTIVE) {
> > > - TAILQ_REMOVE(_inactive, pg, pageq);
> > > - atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
> > > - uvmexp.inactive--;
> > > - }
> > > + uvm_pagedequeue(pg);
> > >   uvmexp.wired++;
> > >   }
> > >   pg->wire_count++;
> > > @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
> > >   KASSERT(uvm_page_owner_locked_p(pg));
> > >   MUTEX_ASSERT_LOCKED();
> > >  
> > > + uvm_pagedequeue(pg);
> > > + if (pg->wire_count == 0) {
> > > + TAILQ_INSERT_TAIL(_active, pg, pageq);
> > > + atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> > > + uvmexp.active++;
> > > +
> > > + }
> > > +}
> > > +
> > > +/*
> > > + * uvm_pagedequeue: remove a page from any paging queue
> > > + */
> > > +void
> > > +uvm_pagedequeue(struct vm_page *pg)
> > > +{
> > > + if (pg->pg_flags & PQ_ACTIVE) {
> > > + TAILQ_REMOVE(_active, pg, pageq);
> > > + atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> > > + uvmexp.active--;
> > > + }
> > >   if (pg->pg_flags & PQ_INACTIVE) {
> > >   TAILQ_REMOVE(_inactive, pg, pageq);
> > >   atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
> > >   uvmexp.inactive--;
> > >   }
> > > - if (pg->wire_count == 0) {
> > > - /*
> > > -  * if page is already active, remove it from list so we
> > > -  * can put it at tail.  if it wasn't active, then mark
> > > -  * it active and bump active count
> > > -  */
> > > - if (pg->pg_flags & PQ_ACTIVE)
> > > - TAILQ_REMOVE(_active, pg, pageq);
> > > - else {
> > > - atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> > > - uvmexp.active++;
> > > - }
> > > -
> > > - TAILQ_INSERT_TAIL(_active, pg, pageq);
> > > - }
> > >  }
> > > -
> > >  /*
> > >   * uvm_pagezero: zero fill a page
> > >   */
> > > Index: uvm/uvm_page.h
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> > > retrieving revision 1.67
> > > diff -u -p -r1.67 uvm_page.h
> > > --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> > > +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> > > @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
> > >  #endif
> > >  
> > >  void uvm_pageactivate(struct vm_page *);
> > > +void uvm_pagedequeue(struct vm_page *);
> > >  vaddr_t  uvm_pageboot_alloc(vsize_t);
> > >  void uvm_pagecopy(struct vm_page *, struct vm_page *);
> > >  void uvm_pagedeactivate(struct vm_page *);
> > > 
> > 
> > 



Re: uvm_pagedequeue()

2022-05-10 Thread Martin Pieuchot
On 05/05/22(Thu) 14:54, Martin Pieuchot wrote:
> Diff below introduces a new wrapper to manipulate active/inactive page
> queues. 
> 
> ok?

Anyone?

> Index: uvm/uvm_page.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> retrieving revision 1.165
> diff -u -p -r1.165 uvm_page.c
> --- uvm/uvm_page.c4 May 2022 14:58:26 -   1.165
> +++ uvm/uvm_page.c5 May 2022 12:49:13 -
> @@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
>   /*
>* now remove the page from the queues
>*/
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(_active, pg, pageq);
> - flags_to_clear |= PQ_ACTIVE;
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(_inactive, pg, pageq);
> - flags_to_clear |= PQ_INACTIVE;
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>  
>   /*
>* if the page was wired, unwire it now.
> @@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
>   MUTEX_ASSERT_LOCKED();
>  
>   if (pg->wire_count == 0) {
> - if (pg->pg_flags & PQ_ACTIVE) {
> - TAILQ_REMOVE(_active, pg, pageq);
> - atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> - uvmexp.active--;
> - }
> - if (pg->pg_flags & PQ_INACTIVE) {
> - TAILQ_REMOVE(_inactive, pg, pageq);
> - atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
> - uvmexp.inactive--;
> - }
> + uvm_pagedequeue(pg);
>   uvmexp.wired++;
>   }
>   pg->wire_count++;
> @@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
>   KASSERT(uvm_page_owner_locked_p(pg));
>   MUTEX_ASSERT_LOCKED();
>  
> + uvm_pagedequeue(pg);
> + if (pg->wire_count == 0) {
> + TAILQ_INSERT_TAIL(_active, pg, pageq);
> + atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> + uvmexp.active++;
> +
> + }
> +}
> +
> +/*
> + * uvm_pagedequeue: remove a page from any paging queue
> + */
> +void
> +uvm_pagedequeue(struct vm_page *pg)
> +{
> + if (pg->pg_flags & PQ_ACTIVE) {
> + TAILQ_REMOVE(_active, pg, pageq);
> + atomic_clearbits_int(>pg_flags, PQ_ACTIVE);
> + uvmexp.active--;
> + }
>   if (pg->pg_flags & PQ_INACTIVE) {
>   TAILQ_REMOVE(_inactive, pg, pageq);
>   atomic_clearbits_int(>pg_flags, PQ_INACTIVE);
>   uvmexp.inactive--;
>   }
> - if (pg->wire_count == 0) {
> - /*
> -  * if page is already active, remove it from list so we
> -  * can put it at tail.  if it wasn't active, then mark
> -  * it active and bump active count
> -  */
> - if (pg->pg_flags & PQ_ACTIVE)
> - TAILQ_REMOVE(_active, pg, pageq);
> - else {
> - atomic_setbits_int(>pg_flags, PQ_ACTIVE);
> - uvmexp.active++;
> - }
> -
> - TAILQ_INSERT_TAIL(_active, pg, pageq);
> - }
>  }
> -
>  /*
>   * uvm_pagezero: zero fill a page
>   */
> Index: uvm/uvm_page.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_page.h,v
> retrieving revision 1.67
> diff -u -p -r1.67 uvm_page.h
> --- uvm/uvm_page.h29 Jan 2022 06:25:33 -  1.67
> +++ uvm/uvm_page.h5 May 2022 12:49:13 -
> @@ -224,6 +224,7 @@ boolean_t uvm_page_physget(paddr_t *);
>  #endif
>  
>  void uvm_pageactivate(struct vm_page *);
> +void uvm_pagedequeue(struct vm_page *);
>  vaddr_t  uvm_pageboot_alloc(vsize_t);
>  void uvm_pagecopy(struct vm_page *, struct vm_page *);
>  void uvm_pagedeactivate(struct vm_page *);
> 



Re: uvm: Consider BUFPAGES_DEFICIT in swap_shortage

2022-05-09 Thread Martin Pieuchot
On 05/05/22(Thu) 10:56, Bob Beck wrote:
> On Thu, May 05, 2022 at 10:16:23AM -0600, Bob Beck wrote:
> > Ugh. You're digging in the most perilous parts of the pile.
> > 
> > I will go look with you... sigh. (This is not yet an ok for that.)
> > 
> > > On May 5, 2022, at 7:53 AM, Martin Pieuchot  wrote:
> > > 
> > > When considering the amount of free pages in the page daemon a small
> > > amount is always kept for the buffer cache... except in one place.
> > > 
> > > The diff below gets rid of this exception.  This is important because
> > > uvmpd_scan() is called conditionally using the following check:
> > > 
> > >  if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
> > >...
> > >  }
> > > 
> > > So in case of swap shortage we might end up freeing fewer pages than
> > > wanted.
> 
> So a bit of background.
> 
> I am pretty much of the belief that this "low water mark" for pages is 
> nonsense now.  I was in the midst of trying to prove that
> to myself and therefore rip down some of the crazy accounting and
> very arbitrary limits in the buffer cache and got distracted.
> 
> Maybe something like this to start? (but failing that I think
> your current diff is probably ok).

Thanks.  I'll commit my diff then to make the current code coherent and
let me progress with my refactoring.  Then we can consider changing this
magic.

> Index: sys/sys/mount.h
> ===
> RCS file: /cvs/src/sys/sys/mount.h,v
> retrieving revision 1.148
> diff -u -p -u -p -r1.148 mount.h
> --- sys/sys/mount.h   6 Apr 2021 14:17:35 -   1.148
> +++ sys/sys/mount.h   5 May 2022 16:50:50 -
> @@ -488,10 +488,8 @@ struct bcachestats {
>  #ifdef _KERNEL
>  extern struct bcachestats bcstats;
>  extern long buflowpages, bufhighpages, bufbackpages;
> -#define BUFPAGES_DEFICIT (((buflowpages - bcstats.numbufpages) < 0) ? 0 \
> -: buflowpages - bcstats.numbufpages)
> -#define BUFPAGES_INACT (((bcstats.numcleanpages - buflowpages) < 0) ? 0 \
> -: bcstats.numcleanpages - buflowpages)
> +#define BUFPAGES_DEFICIT 0
> +#define BUFPAGES_INACT bcstats.numcleanpages
>  extern int bufcachepercent;
>  extern void bufadjust(int);
>  struct uvm_constraint_range;
> 
> 
> > > 
> > > ok?
> > > 
> > > Index: uvm/uvm_pdaemon.c
> > > ===
> > > RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> > > retrieving revision 1.98
> > > diff -u -p -r1.98 uvm_pdaemon.c
> > > --- uvm/uvm_pdaemon.c 4 May 2022 14:58:26 -   1.98
> > > +++ uvm/uvm_pdaemon.c 5 May 2022 13:40:28 -
> > > @@ -923,12 +923,13 @@ uvmpd_scan(void)
> > >* detect if we're not going to be able to page anything out
> > >* until we free some swap resources from active pages.
> > >*/
> > > + free = uvmexp.free - BUFPAGES_DEFICIT;
> > >   swap_shortage = 0;
> > > - if (uvmexp.free < uvmexp.freetarg &&
> > > + if (free < uvmexp.freetarg &&
> > >   uvmexp.swpginuse == uvmexp.swpages &&
> > >   !uvm_swapisfull() &&
> > >   pages_freed == 0) {
> > > - swap_shortage = uvmexp.freetarg - uvmexp.free;
> > > + swap_shortage = uvmexp.freetarg - free;
> > >   }
> > > 
> > >   for (p = TAILQ_FIRST(_active);
> > > 
> > 



uvm: Consider BUFPAGES_DEFICIT in swap_shortage

2022-05-05 Thread Martin Pieuchot
When considering the amount of free pages in the page daemon a small
amount is always kept for the buffer cache... except in one place.

The diff below gets rid of this exception.  This is important because
uvmpd_scan() is called conditionally using the following check:
  
  if ((uvmexp.free - BUFPAGES_DEFICIT) < uvmexp.freetarg) {
...
  }

So in case of swap shortage we might end up freeing fewer pages than
wanted.
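
For illustration, assume uvmexp.freetarg is 512, uvmexp.free is 600 and
BUFPAGES_DEFICIT is 100: the daemon is woken up because 600 - 100 < 512,
yet the current code computes no swap_shortage since 600 < 512 is false.
With the diff below, free is 500 and swap_shortage becomes 12 pages.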

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.98
diff -u -p -r1.98 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   4 May 2022 14:58:26 -   1.98
+++ uvm/uvm_pdaemon.c   5 May 2022 13:40:28 -
@@ -923,12 +923,13 @@ uvmpd_scan(void)
 * detect if we're not going to be able to page anything out
 * until we free some swap resources from active pages.
 */
+   free = uvmexp.free - BUFPAGES_DEFICIT;
swap_shortage = 0;
-   if (uvmexp.free < uvmexp.freetarg &&
+   if (free < uvmexp.freetarg &&
uvmexp.swpginuse == uvmexp.swpages &&
!uvm_swapisfull() &&
pages_freed == 0) {
-   swap_shortage = uvmexp.freetarg - uvmexp.free;
+   swap_shortage = uvmexp.freetarg - free;
}
 
for (p = TAILQ_FIRST(_active);



uvm_pagedequeue()

2022-05-05 Thread Martin Pieuchot
Diff below introduces a new wrapper to manipulate active/inactive page
queues. 

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.165
diff -u -p -r1.165 uvm_page.c
--- uvm/uvm_page.c  4 May 2022 14:58:26 -   1.165
+++ uvm/uvm_page.c  5 May 2022 12:49:13 -
@@ -987,16 +987,7 @@ uvm_pageclean(struct vm_page *pg)
/*
 * now remove the page from the queues
 */
-   if (pg->pg_flags & PQ_ACTIVE) {
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   flags_to_clear |= PQ_ACTIVE;
-   uvmexp.active--;
-   }
-   if (pg->pg_flags & PQ_INACTIVE) {
-   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
-   flags_to_clear |= PQ_INACTIVE;
-   uvmexp.inactive--;
-   }
+   uvm_pagedequeue(pg);
 
/*
 * if the page was wired, unwire it now.
@@ -1243,16 +1234,7 @@ uvm_pagewire(struct vm_page *pg)
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
if (pg->wire_count == 0) {
-   if (pg->pg_flags & PQ_ACTIVE) {
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
-   uvmexp.active--;
-   }
-   if (pg->pg_flags & PQ_INACTIVE) {
-   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
-   atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
-   uvmexp.inactive--;
-   }
+   uvm_pagedequeue(pg);
uvmexp.wired++;
}
pg->wire_count++;
@@ -1324,28 +1306,32 @@ uvm_pageactivate(struct vm_page *pg)
KASSERT(uvm_page_owner_locked_p(pg));
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
 
+   uvm_pagedequeue(pg);
+   if (pg->wire_count == 0) {
+   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
+   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvmexp.active++;
+
+   }
+}
+
+/*
+ * uvm_pagedequeue: remove a page from any paging queue
+ */
+void
+uvm_pagedequeue(struct vm_page *pg)
+{
+   if (pg->pg_flags & PQ_ACTIVE) {
+   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
+   atomic_clearbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvmexp.active--;
+   }
if (pg->pg_flags & PQ_INACTIVE) {
TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
-   if (pg->wire_count == 0) {
-   /*
-* if page is already active, remove it from list so we
-* can put it at tail.  if it wasn't active, then mark
-* it active and bump active count
-*/
-   if (pg->pg_flags & PQ_ACTIVE)
-   TAILQ_REMOVE(&uvm.page_active, pg, pageq);
-   else {
-   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
-   uvmexp.active++;
-   }
-
-   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-   }
 }
-
 /*
  * uvm_pagezero: zero fill a page
  */
Index: uvm/uvm_page.h
===
RCS file: /cvs/src/sys/uvm/uvm_page.h,v
retrieving revision 1.67
diff -u -p -r1.67 uvm_page.h
--- uvm/uvm_page.h  29 Jan 2022 06:25:33 -  1.67
+++ uvm/uvm_page.h  5 May 2022 12:49:13 -
@@ -224,6 +224,7 @@ boolean_t   uvm_page_physget(paddr_t *);
 #endif
 
 void   uvm_pageactivate(struct vm_page *);
+void   uvm_pagedequeue(struct vm_page *);
 vaddr_tuvm_pageboot_alloc(vsize_t);
 void   uvm_pagecopy(struct vm_page *, struct vm_page *);
 void   uvm_pagedeactivate(struct vm_page *);



Merge swap-backed and object-backed inactive lists

2022-05-02 Thread Martin Pieuchot
Let's simplify the existing logic and use a single list for inactive
pages.  uvmpd_scan_inactive() already does a lot of checks if it finds
a page that is swap-backed.  This will be improved in a later change.
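
With a single list the scan can still tell the two cases apart by
looking at the page itself, e.g. with (pg->pg_flags & PQ_SWAPBACKED),
which covers both anon and aobj pages.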

ok?

Index: uvm/uvm.h
===
RCS file: /cvs/src/sys/uvm/uvm.h,v
retrieving revision 1.68
diff -u -p -r1.68 uvm.h
--- uvm/uvm.h   24 Nov 2020 13:49:09 -  1.68
+++ uvm/uvm.h   2 May 2022 16:32:16 -
@@ -53,8 +53,7 @@ struct uvm {
 
/* vm_page queues */
struct pglist page_active;  /* [Q] allocated pages, in use */
-   struct pglist page_inactive_swp;/* [Q] pages inactive (reclaim/free) */
-   struct pglist page_inactive_obj;/* [Q] pages inactive (reclaim/free) */
+   struct pglist page_inactive;/* [Q] pages inactive (reclaim/free) */
/* Lock order: pageqlock, then fpageqlock. */
struct mutex pageqlock; /* [] lock for active/inactive page q */
struct mutex fpageqlock;/* [] lock for free page q  + pdaemon */
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.290
diff -u -p -r1.290 uvm_map.c
--- uvm/uvm_map.c   12 Mar 2022 08:11:07 -  1.290
+++ uvm/uvm_map.c   2 May 2022 16:32:16 -
@@ -3281,8 +3281,7 @@ uvm_page_printit(struct vm_page *pg, boo
(*pr)("  >>> page not found in uvm_pmemrange <<<\n");
pgl = NULL;
} else if (pg->pg_flags & PQ_INACTIVE) {
-   pgl = (pg->pg_flags & PQ_SWAPBACKED) ?
-   &uvm.page_inactive_swp : &uvm.page_inactive_obj;
+   pgl = &uvm.page_inactive;
} else if (pg->pg_flags & PQ_ACTIVE) {
pgl = &uvm.page_active;
} else {
Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.164
diff -u -p -r1.164 uvm_page.c
--- uvm/uvm_page.c  28 Apr 2022 09:59:28 -  1.164
+++ uvm/uvm_page.c  2 May 2022 16:32:16 -
@@ -185,8 +185,7 @@ uvm_page_init(vaddr_t *kvm_startp, vaddr
 */
 
TAILQ_INIT(&uvm.page_active);
-   TAILQ_INIT(&uvm.page_inactive_swp);
-   TAILQ_INIT(&uvm.page_inactive_obj);
+   TAILQ_INIT(&uvm.page_inactive);
mtx_init(&uvm.pageqlock, IPL_VM);
mtx_init(&uvm.fpageqlock, IPL_VM);
uvm_pmr_init();
@@ -994,10 +993,7 @@ uvm_pageclean(struct vm_page *pg)
uvmexp.active--;
}
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
flags_to_clear |= PQ_INACTIVE;
uvmexp.inactive--;
}
@@ -1253,10 +1249,7 @@ uvm_pagewire(struct vm_page *pg)
uvmexp.active--;
}
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
@@ -1304,10 +1297,7 @@ uvm_pagedeactivate(struct vm_page *pg)
}
if ((pg->pg_flags & PQ_INACTIVE) == 0) {
KASSERT(pg->wire_count == 0);
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_INSERT_TAIL(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_INSERT_TAIL(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_INSERT_TAIL(&uvm.page_inactive, pg, pageq);
atomic_setbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive++;
pmap_clear_reference(pg);
@@ -1335,10 +1325,7 @@ uvm_pageactivate(struct vm_page *pg)
MUTEX_ASSERT_LOCKED();
 
if (pg->pg_flags & PQ_INACTIVE) {
-   if (pg->pg_flags & PQ_SWAPBACKED)
-   TAILQ_REMOVE(&uvm.page_inactive_swp, pg, pageq);
-   else
-   TAILQ_REMOVE(&uvm.page_inactive_obj, pg, pageq);
+   TAILQ_REMOVE(&uvm.page_inactive, pg, pageq);
atomic_clearbits_int(&pg->pg_flags, PQ_INACTIVE);
uvmexp.inactive--;
}
Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.97
diff -u -p -r1.97 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   30 Apr 2022 17:58:43 -  1.97
+++ uvm/uvm_pdaemon.c   2 May 2022 16:32:16 -
@@ -396,13 +396,6 @@ uvmpd_scan_inactive(struct pglist *pglst
int dirtyreacts;
 
/*
-* note: we currently keep swap-backed pages on a 

uvmpd_scan(): Recheck PG_BUSY after locking the page

2022-04-28 Thread Martin Pieuchot
rw_enter(9) can sleep.  When the lock is finally acquired by the
pagedaemon the previous check might no longer be true and the page
could be busy.  In this case we shouldn't touch it.

Diff below rechecks PG_BUSY after acquiring the lock and also
uses a variable for the lock to reduce the differences with NetBSD.

ok?

Index: uvm/uvm_pdaemon.c
===
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
retrieving revision 1.96
diff -u -p -r1.96 uvm_pdaemon.c
--- uvm/uvm_pdaemon.c   11 Apr 2022 16:43:49 -  1.96
+++ uvm/uvm_pdaemon.c   28 Apr 2022 10:22:52 -
@@ -879,6 +879,8 @@ uvmpd_scan(void)
int free, inactive_shortage, swap_shortage, pages_freed;
struct vm_page *p, *nextpg;
struct uvm_object *uobj;
+   struct vm_anon *anon;
+   struct rwlock *slock;
boolean_t got_it;
 
MUTEX_ASSERT_LOCKED(&uvm.pageqlock);
@@ -947,20 +949,34 @@ uvmpd_scan(void)
 p != NULL && (inactive_shortage > 0 || swap_shortage > 0);
 p = nextpg) {
nextpg = TAILQ_NEXT(p, pageq);
-
-   /* skip this page if it's busy. */
-   if (p->pg_flags & PG_BUSY)
+   if (p->pg_flags & PG_BUSY) {
continue;
+   }
 
-   if (p->pg_flags & PQ_ANON) {
-   KASSERT(p->uanon != NULL);
-   if (rw_enter(p->uanon->an_lock, RW_WRITE|RW_NOSLEEP))
+   /*
+* lock the page's owner.
+*/
+   if (p->uobject != NULL) {
+   uobj = p->uobject;
+   slock = uobj->vmobjlock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
continue;
+   }
} else {
-   KASSERT(p->uobject != NULL);
-   if (rw_enter(p->uobject->vmobjlock,
-   RW_WRITE|RW_NOSLEEP))
+   anon = p->uanon;
+   KASSERT(p->uanon != NULL);
+   slock = anon->an_lock;
+   if (rw_enter(slock, RW_WRITE|RW_NOSLEEP)) {
continue;
+   }
+   }
+
+   /*
+* skip this page if it's busy.
+*/
+   if ((p->pg_flags & PG_BUSY) != 0) {
+   rw_exit(slock);
+   continue;
}
 
/*
@@ -997,10 +1013,11 @@ uvmpd_scan(void)
uvmexp.pddeact++;
inactive_shortage--;
}
-   if (p->pg_flags & PQ_ANON)
-   rw_exit(p->uanon->an_lock);
-   else
-   rw_exit(p->uobject->vmobjlock);
+
+   /*
+* we're done with this page.
+*/
+   rw_exit(slock);
}
 }
 



Call uvm_pageactivate() from uvm_pageunwire()

2022-04-26 Thread Martin Pieuchot
I'd like to use a proper interface to add/remove pages on the
active/inactive queues.  This will help with lock assertions and help
improve the existing LRU limitations.

Diff below makes uvm_pageunwire() call uvm_pageactivate() instead of
inserting the page itself.

ok?

Index: uvm/uvm_page.c
===
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
retrieving revision 1.163
diff -u -p -r1.163 uvm_page.c
--- uvm/uvm_page.c  12 Mar 2022 12:34:22 -  1.163
+++ uvm/uvm_page.c  26 Apr 2022 12:13:59 -
@@ -1279,9 +1279,7 @@ uvm_pageunwire(struct vm_page *pg)
 
pg->wire_count--;
if (pg->wire_count == 0) {
-   TAILQ_INSERT_TAIL(&uvm.page_active, pg, pageq);
-   uvmexp.active++;
-   atomic_setbits_int(&pg->pg_flags, PQ_ACTIVE);
+   uvm_pageactivate(pg);
uvmexp.wired--;
}
 }



Decrement uvmexp.swpgonly

2022-04-26 Thread Martin Pieuchot
Small diff to decrement the counter only if the I/O succeeds.  This
prevents a false positive if a check is performed before an error is
returned.
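
For example, uvm_swapisfull() compares `uvmexp.swpgonly' against
`uvmexp.swpages'; a caller running that check between the early
decrement and the error return would see one swap-only page too few
even though the page is, in fact, still only in swap.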

ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.154
diff -u -p -r1.154 uvm_swap.c
--- uvm/uvm_swap.c  17 Mar 2022 10:15:13 -  1.154
+++ uvm/uvm_swap.c  26 Apr 2022 12:04:52 -
@@ -1572,17 +1572,16 @@ uvm_swap_get(struct vm_page *page, int s
}
 
KERNEL_LOCK();
-   /* this page is (about to be) no longer only in swap. */
-   atomic_dec_int(&uvmexp.swpgonly);
-
result = uvm_swap_io(&page, swslot, 1, B_READ |
((flags & PGO_SYNCIO) ? 0 : B_ASYNC));
+   KERNEL_UNLOCK();
 
-   if (result != VM_PAGER_OK && result != VM_PAGER_PEND) {
-   /* oops, the read failed so it really is still only in swap. */
-   atomic_inc_int(&uvmexp.swpgonly);
+   if (result == VM_PAGER_OK || result == VM_PAGER_PEND) {
+   /*
+* this page is no longer only in swap.
+*/
+   atomic_dec_int(&uvmexp.swpgonly);
}
-   KERNEL_UNLOCK();
return (result);
 }
 



Re: refcount btrace

2022-04-11 Thread Martin Pieuchot
On 08/04/22(Fri) 12:16, Alexander Bluhm wrote:
> On Fri, Apr 08, 2022 at 02:39:34AM +, Visa Hankala wrote:
> > On Thu, Apr 07, 2022 at 07:55:11PM +0200, Alexander Bluhm wrote:
> > > On Wed, Mar 23, 2022 at 06:13:27PM +0100, Alexander Bluhm wrote:
> > > > In my opinion tracepoints give insight at minimal cost.  It is worth
> > > > it to have it in GENERIC to make it easy to use.
> > > 
> > > After release I want to revive the btrace of refcounts discussion.
> > > 
> > > As mpi@ mentioned the idea of dt(4) is to have these trace points
> > > in GENERIC kernel.  If you want to hunt a bug, just turn it on.
> > > Refounting is a common place for bugs, leaks can be detected easily.
> > > 
> > > The alternative are some defines that you can compile in and access
> > > from ddb.  This is more work and you would have to implement it for
> > > every recount.
> > > https://marc.info/?l=openbsd-tech=163786435916039=2
> > > 
> > > There is no measuarable performance difference.  dt(4) is written
> > > in a way that is is only one additional branch.  At least my goal
> > > is to add trace points to useful places when we identify them.
> > 
> > DT_INDEX_ENTER() still checks the index first, so it has two branches
> > in practice.
> > 
> > I think dt_tracing should be checked first so that it serves as
> > a gateway to the trace code. Under normal operation, the check's
> > outcome is always the same, which is easy even for simple branch
> > predictors.
> 
> Reordering the check is easy.  Now dt_tracing is first.
> 
> > I have a slight suspicion that dt(4) is now becoming a way to add code
> > that would be otherwise unacceptable. Also, how "durable" are trace
> > points perceived? Is an added trace point an achieved advantage that
> > is difficult to remove even when its utility has diminished? There is
> > a similarity to (ad hoc) debug printfs too.
> 
> As I understand dt(4) it is a replacement for debug printfs.  But
> it has advantages.  It can be turned on selectively from userland.
> It does not spam the console, but can be processed in userland.  It
> is always there, you don't have to recompile.
> 
> Of course you always have the printf or tracepoint at the wrong
> place.  I think people debugging the code should move them to
> the useful places.  Then we may end up with a generally useful tool.
> At least that is my hope.
> 
> There are obvious places to debug.  We have syscall entry and return.
> And I think reference counting is also generally interesting.

I'm happy if this can help debugging real reference counting issues.  Do
you have a script that could be committed to /usr/share/btrace to show
how to track reference counting using these probes?

> Index: dev/dt/dt_prov_static.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dt_prov_static.c,v
> retrieving revision 1.13
> diff -u -p -r1.13 dt_prov_static.c
> --- dev/dt/dt_prov_static.c   17 Mar 2022 14:53:59 -  1.13
> +++ dev/dt/dt_prov_static.c   8 Apr 2022 09:40:29 -
> @@ -87,6 +87,12 @@ DT_STATIC_PROBE1(smr, barrier_exit, "int
>  DT_STATIC_PROBE0(smr, wakeup);
>  DT_STATIC_PROBE2(smr, thread, "uint64_t", "uint64_t");
>  
> +/*
> + * reference counting
> + */
> +DT_STATIC_PROBE0(refcnt, none);
> +DT_STATIC_PROBE3(refcnt, inpcb, "void *", "int", "int");
> +DT_STATIC_PROBE3(refcnt, tdb, "void *", "int", "int");
>  
>  /*
>   * List of all static probes
> @@ -127,15 +133,24 @@ struct dt_probe *const dtps_static[] = {
>   &_DT_STATIC_P(smr, barrier_exit),
>   &_DT_STATIC_P(smr, wakeup),
>   &_DT_STATIC_P(smr, thread),
> + /* refcnt */
> + &_DT_STATIC_P(refcnt, none),
> + &_DT_STATIC_P(refcnt, inpcb),
> + &_DT_STATIC_P(refcnt, tdb),
>  };
>  
> +struct dt_probe *const *dtps_index_refcnt;
> +
>  int
>  dt_prov_static_init(void)
>  {
>   int i;
>  
> - for (i = 0; i < nitems(dtps_static); i++)
> + for (i = 0; i < nitems(dtps_static); i++) {
> + if (dtps_static[i] == &_DT_STATIC_P(refcnt, none))
> + dtps_index_refcnt = &dtps_static[i];
>   dt_dev_register_probe(dtps_static[i]);
> + }
>  
>   return i;
>  }
> Index: dev/dt/dtvar.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/dev/dt/dtvar.h,v
> retrieving revision 1.13
> diff -u -p -r1.13 dtvar.h
> --- dev/dt/dtvar.h27 Feb 2022 10:14:01 -  1.13
> +++ dev/dt/dtvar.h8 Apr 2022 09:42:19 -
> @@ -313,11 +313,30 @@ extern volatile uint32_tdt_tracing; /* 
>  #define  DT_STATIC_ENTER(func, name, args...) do {   
> \
>   extern struct dt_probe _DT_STATIC_P(func, name);\
>   struct dt_probe *dtp = &_DT_STATIC_P(func, name);   \
> - struct dt_provider *dtpv = dtp->dtp_prov;   \
>   \
>   if 

Kill selrecord()

2022-04-11 Thread Martin Pieuchot
Now that poll(2) & select(2) use the kqueue backend under the hood we
can start retiring the old machinery. 

The diff below does not touch driver definitions, however it:

- kills selrecord() & doselwakeup()

- makes it obvious that `kern.nselcoll' is now always 0

- changes all poll/select hooks to return 0

- kills a seltid check in wsdisplaystart() which is now always true

In a later step we could remove the *_poll() requirement from device
drivers and kill seltrue() & selfalse().
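
For drivers that want real events rather than seltrue()-style stubs, the
replacement for selrecord()/selwakeup() is a klist plus a kqueue filter.
A minimal sketch with made-up names (my_softc, sc_klist, sc_ready and
my_lookup() are illustrative, not taken from any driver):

	void
	filt_mydetach(struct knote *kn)
	{
		struct my_softc *sc = kn->kn_hook;

		klist_remove(&sc->sc_klist, kn);
	}

	int
	filt_myread(struct knote *kn, long hint)
	{
		struct my_softc *sc = kn->kn_hook;

		kn->kn_data = sc->sc_ready;	/* whatever "readable" means */
		return (kn->kn_data > 0);
	}

	const struct filterops myread_filtops = {
		.f_flags	= FILTEROP_ISFD,
		.f_attach	= NULL,
		.f_detach	= filt_mydetach,
		.f_event	= filt_myread,
	};

	int
	mykqfilter(dev_t dev, struct knote *kn)
	{
		struct my_softc *sc = my_lookup(dev);

		switch (kn->kn_filter) {
		case EVFILT_READ:
			kn->kn_fop = &myread_filtops;
			break;
		default:
			return (EINVAL);
		}
		kn->kn_hook = sc;
		klist_insert(&sc->sc_klist, kn);
		return (0);
	}

The code that used to call selwakeup() then does KNOTE(&sc->sc_klist, 0)
instead.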

ok?

Index: arch/sparc64/dev/vldcp.c
===
RCS file: /cvs/src/sys/arch/sparc64/dev/vldcp.c,v
retrieving revision 1.22
diff -u -p -r1.22 vldcp.c
--- arch/sparc64/dev/vldcp.c24 Oct 2021 17:05:04 -  1.22
+++ arch/sparc64/dev/vldcp.c11 Apr 2022 16:38:32 -
@@ -581,44 +581,7 @@ vldcpioctl(dev_t dev, u_long cmd, caddr_
 int
 vldcppoll(dev_t dev, int events, struct proc *p)
 {
-   struct vldcp_softc *sc;
-   struct ldc_conn *lc;
-   uint64_t head, tail, state;
-   int revents = 0;
-   int s, err;
-
-   sc = vldcp_lookup(dev);
-   if (sc == NULL)
-   return (POLLERR);
-   lc = &sc->sc_lc;
-
-   s = spltty();
-   if (events & (POLLIN | POLLRDNORM)) {
-   err = hv_ldc_rx_get_state(lc->lc_id, &head, &tail, &state);
-
-   if (err == 0 && state == LDC_CHANNEL_UP && head != tail)
-   revents |= events & (POLLIN | POLLRDNORM);
-   }
-   if (events & (POLLOUT | POLLWRNORM)) {
-   err = hv_ldc_tx_get_state(lc->lc_id, &head, &tail, &state);
-
-   if (err == 0 && state == LDC_CHANNEL_UP && head != tail)
-   revents |= events & (POLLOUT | POLLWRNORM);
-   }
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM)) {
-   cbus_intr_setenabled(sc->sc_bustag, sc->sc_rx_ino,
-   INTR_ENABLED);
-   selrecord(p, &sc->sc_rsel);
-   }
-   if (events & (POLLOUT | POLLWRNORM)) {
-   cbus_intr_setenabled(sc->sc_bustag, sc->sc_tx_ino,
-   INTR_ENABLED);
-   selrecord(p, &sc->sc_wsel);
-   }
-   }
-   splx(s);
-   return revents;
+   return 0;
 }
 
 void
Index: dev/audio.c
===
RCS file: /cvs/src/sys/dev/audio.c,v
retrieving revision 1.198
diff -u -p -r1.198 audio.c
--- dev/audio.c 21 Mar 2022 19:22:39 -  1.198
+++ dev/audio.c 11 Apr 2022 16:38:52 -
@@ -2053,17 +2053,7 @@ audio_mixer_read(struct audio_softc *sc,
 int
 audio_mixer_poll(struct audio_softc *sc, int events, struct proc *p)
 {
-   int revents = 0;
-
-   mtx_enter(&audio_lock);
-   if (sc->mix_isopen && sc->mix_pending)
-   revents |= events & (POLLIN | POLLRDNORM);
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM))
-   selrecord(p, &sc->mix_sel);
-   }
-   mtx_leave(&audio_lock);
-   return revents;
+   return 0;
 }
 
 int
@@ -2101,21 +2091,7 @@ audio_mixer_close(struct audio_softc *sc
 int
 audio_poll(struct audio_softc *sc, int events, struct proc *p)
 {
-   int revents = 0;
-
-   mtx_enter(&audio_lock);
-   if ((sc->mode & AUMODE_RECORD) && sc->rec.used > 0)
-   revents |= events & (POLLIN | POLLRDNORM);
-   if ((sc->mode & AUMODE_PLAY) && sc->play.used < sc->play.len)
-   revents |= events & (POLLOUT | POLLWRNORM);
-   if (revents == 0) {
-   if (events & (POLLIN | POLLRDNORM))
-   selrecord(p, &sc->rec.sel);
-   if (events & (POLLOUT | POLLWRNORM))
-   selrecord(p, &sc->play.sel);
-   }
-   mtx_leave(&audio_lock);
-   return revents;
+   return 0;
 }
 
 int
Index: dev/hotplug.c
===
RCS file: /cvs/src/sys/dev/hotplug.c,v
retrieving revision 1.21
diff -u -p -r1.21 hotplug.c
--- dev/hotplug.c   25 Dec 2020 12:59:52 -  1.21
+++ dev/hotplug.c   11 Apr 2022 16:39:24 -
@@ -183,16 +183,7 @@ hotplugioctl(dev_t dev, u_long cmd, cadd
 int
 hotplugpoll(dev_t dev, int events, struct proc *p)
 {
-   int revents = 0;
-
-   if (events & (POLLIN | POLLRDNORM)) {
-   if (evqueue_count > 0)
-   revents |= events & (POLLIN | POLLRDNORM);
-   else
-   selrecord(p, &hotplug_sel);
-   }
-
-   return (revents);
+   return (0);
 }
 
 int
Index: dev/midi.c
===
RCS file: /cvs/src/sys/dev/midi.c,v
retrieving revision 1.54
diff -u -p -r1.54 midi.c
--- dev/midi.c  6 Apr 2022 18:59:27 -   1.54
+++ dev/midi.c  11 Apr 2022 16:39:31 -
@@ -334,31 +334,7 @@ done:
 int
 midipoll(dev_t dev, int events, struct proc *p)
 {
-   struct midi_softc *sc;
-   int 

Re: refcount btrace

2022-03-21 Thread Martin Pieuchot
On 20/03/22(Sun) 05:39, Visa Hankala wrote:
> On Sat, Mar 19, 2022 at 12:10:11AM +0100, Alexander Bluhm wrote:
> > On Thu, Mar 17, 2022 at 07:25:27AM +, Visa Hankala wrote:
> > > On Thu, Mar 17, 2022 at 12:42:13AM +0100, Alexander Bluhm wrote:
> > > > I would like to use btrace to debug refernce counting.  The idea
> > > > is to a a tracepoint for every type of refcnt we have.  When it
> > > > changes, print the actual object, the current counter and the change
> > > > value.
> > > 
> > > > Do we want that feature?
> > > 
> > > I am against this in its current form. The code would become more
> > > complex, and the trace points can affect timing. There is a risk that
> > > the kernel behaves slightly differently when dt has been compiled in.
> > 
> > On our main architectures dt(4) is in GENERIC.  I see your timing
> > point for uvm structures.
> 
> In my opinion, having dt(4) enabled by default is another reason why
> there should be no carte blanche for adding trace points. Each trace
> point adds a tiny amount of bloat. Few users will use the tracing
> facility.
> 
> Maybe high-rate trace points could be behind a build option...

The whole point of dt(4) is to be able to debug a GENERIC kernel.  I doubt
the cost of an additional if () block matters.
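
Roughly what I have in mind, as an untested sketch: `dt_tracing' and the
dt_probe_refcnt() hook below are assumed names, the point is only that the
cost without a tracer attached is a single predicted branch.

static inline void
refcnt_trace(void *obj, u_int oldcnt, int diff)
{
	extern volatile uint32_t dt_tracing;	/* assumption, not real glue */

	if (__predict_false(dt_tracing))	/* the additional if () block */
		dt_probe_refcnt(obj, oldcnt, diff);	/* hypothetical hook */
}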



Swap encrypt under memory pressure

2022-03-12 Thread Martin Pieuchot
Try to allocate the buffer before doing the encryption; if it fails we
do not spend time doing the encryption.  This reduces the pressure when
swapping with low memory.

ok?

Index: uvm/uvm_swap.c
===
RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
retrieving revision 1.153
diff -u -p -r1.153 uvm_swap.c
--- uvm/uvm_swap.c  22 Feb 2022 01:15:02 -  1.153
+++ uvm/uvm_swap.c  12 Mar 2022 10:30:26 -
@@ -1690,6 +1690,26 @@ uvm_swap_io(struct vm_page **pps, int st
}
}
 
+   /*
+* now allocate a buf for the i/o.
+* [make sure we don't put the pagedaemon to sleep...]
+*/
+   pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT :
+   PR_WAITOK;
+   bp = pool_get(&bufpool, pflag | PR_ZERO);
+
+   /*
+* if we failed to get a swapbuf, return "try again"
+*/
+   if (bp == NULL) {
+   if (bounce) {
+   uvm_pagermapout(bouncekva, npages);
+   uvm_swap_freepages(tpps, npages);
+   }
+   uvm_pagermapout(kva, npages);
+   return (VM_PAGER_AGAIN);
+   }
+
/* encrypt to swap */
if (write && bounce) {
int i, opages;
@@ -1729,35 +1749,6 @@ uvm_swap_io(struct vm_page **pps, int st
  PGO_PDFREECLUST);
 
kva = bouncekva;
-   }
-
-   /*
-* now allocate a buf for the i/o.
-* [make sure we don't put the pagedaemon to sleep...]
-*/
-   pflag = (async || curproc == uvm.pagedaemon_proc) ? PR_NOWAIT :
-   PR_WAITOK;
-   bp = pool_get(&bufpool, pflag | PR_ZERO);
-
-   /*
-* if we failed to get a swapbuf, return "try again"
-*/
-   if (bp == NULL) {
-   if (write && bounce) {
-#ifdef UVM_SWAP_ENCRYPT
-   int i;
-
-   /* swap encrypt needs cleanup */
-   if (encrypt)
-   for (i = 0; i < npages; i++)
-   SWAP_KEY_PUT(sdp, SWD_KEY(sdp,
-   startslot + i));
-#endif
-
-   uvm_pagermapout(kva, npages);
-   uvm_swap_freepages(tpps, npages);
-   }
-   return (VM_PAGER_AGAIN);
}
 
/*



Re: more MAKEDEV cleanup

2022-02-10 Thread Martin Pieuchot
On 05/04/21(Mon) 09:25, Miod Vallat wrote:
> The following diff attempts to clean up a few loose ends in the current
> MAKEDEV files:
> 
> - remove no-longer applicable device definitions (MSCP and SMD disks,
>   this kind of thing).
> - makes sure all platforms use the same `ramdisk' target for
>   installation media devices, rather than a mix of `ramd' and `ramdisk'.
> - moves as many `ramdisk' devices to MI land (bio, diskmap, random,
>   etc).
> - reduces the number of block devices in `ramdisk' targets to only one
>   per device, since the installer script will invoke MAKEDEV by itself
>   for the devices it needs to use.
> - sort device names in `all' and `ramdisk' MI lists to make maintenance
>   easier. This causes some ordering change in the `all' target in the
>   generated MAKEDEVs.

What happened to this?

> Index: MAKEDEV.common
> ===
> RCS file: /OpenBSD/src/etc/MAKEDEV.common,v
> retrieving revision 1.113
> diff -u -p -r1.113 MAKEDEV.common
> --- MAKEDEV.common12 Feb 2021 10:26:33 -  1.113
> +++ MAKEDEV.common5 Apr 2021 09:18:49 -
> @@ -114,7 +114,7 @@ dnl make a 'disktgt' macro that automati
>  dnl disktgt(rd, {-rd-})
>  dnl
>  dnl  target(all,rd,0)
> -dnl  target(ramd,rd,0)
> +dnl  target(ramdisk,rd,0)
>  dnl  disk_q(rd)
>  dnl  __devitem(rd, {-rd*-}, {-rd-})dnl
>  dnl
> @@ -122,62 +122,60 @@ dnl  Note: not all devices are generated
>  dnlits own extra list.
>  dnl
>  divert(1)dnl
> +target(all, acpi)dnl
> +target(all, apm)dnl
> +target(all, bio)dnl
> +target(all, bpf)dnl
> +twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl
> +twrget(all, czs, cua, a, b, c, d)dnl
> +target(all, diskmap)dnl
> +target(all, dt)dnl
>  twrget(all, fdesc, fd)dnl
> -target(all, st, 0, 1)dnl
> -target(all, std)dnl
> -target(all, ra, 0, 1, 2, 3)dnl
> -target(all, rx, 0, 1)dnl
> -target(all, wd, 0, 1, 2, 3)dnl
> -target(all, xd, 0, 1, 2, 3)dnl
> +target(all, fuse)dnl
> +target(all, hotplug)dnl
> +target(all, joy, 0, 1)dnl
> +target(all, kcov)dnl
> +target(all, kstat)dnl
> +target(all, local)dnl
> +target(all, lpt, 0, 1, 2)dnl
> +twrget(all, lpt, lpa, 0, 1, 2)dnl
> +target(all, par, 0)dnl
> +target(all, pci, 0, 1, 2, 3)dnl
>  target(all, pctr)dnl
>  target(all, pctr0)dnl
>  target(all, pf)dnl
> -target(all, apm)dnl
> -target(all, acpi)dnl
> +target(all, pppac)dnl
> +target(all, pppx)dnl
> +target(all, ptm)dnl
> +target(all, pty, 0)dnl
> +target(all, pvbus, 0, 1)dnl
> +target(all, radio, 0)dnl
> +target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> +twrget(all, rnd, random)dnl
> +twrget(all, speak, speaker)dnl
> +target(all, st, 0, 1)dnl
> +target(all, std)dnl
> +target(all, switch, 0, 1, 2, 3)dnl
> +target(all, tap, 0, 1, 2, 3)dnl
>  twrget(all, tth, ttyh, 0, 1)dnl
>  target(all, ttyA, 0, 1)dnl
> -twrget(all, mac_tty0, tty0, 0, 1)dnl
> -twrget(all, tzs, tty, a, b, c, d)dnl
> -twrget(all, czs, cua, a, b, c, d)dnl
>  target(all, ttyc, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> -twrget(all, com, tty0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b)dnl
> -twrget(all, mmcl, mmclock)dnl
> -target(all, lpt, 0, 1, 2)dnl
> -twrget(all, lpt, lpa, 0, 1, 2)dnl
> -target(all, joy, 0, 1)dnl
> -twrget(all, rnd, random)dnl
> -target(all, uk, 0)dnl
> -twrget(all, vi, video, 0, 1)dnl
> -twrget(all, speak, speaker)dnl
> -target(all, asc, 0)dnl
> -target(all, radio, 0)dnl
> +target(all, tun, 0, 1, 2, 3)dnl
>  target(all, tuner, 0)dnl
> -target(all, rmidi, 0, 1, 2, 3, 4, 5, 6, 7)dnl
> +twrget(all, tzs, tty, a, b, c, d)dnl
>  target(all, uall)dnl
> -target(all, pci, 0, 1, 2, 3)dnl
> -twrget(all, wsmouse, wscons)dnl
> -target(all, par, 0)dnl
> -target(all, apci, 0)dnl
> -target(all, local)dnl
> -target(all, ptm)dnl
> -target(all, hotplug)dnl
> -target(all, pppx)dnl
> -target(all, pppac)dnl
> -target(all, fuse)dnl
> +target(all, uk, 0)dnl
> +twrget(all, vi, video, 0, 1)dnl
>  target(all, vmm)dnl
> -target(all, pvbus, 0, 1)dnl
> -target(all, bpf)dnl
> -target(all, kcov)dnl
> -target(all, dt)dnl
> -target(all, kstat)dnl
> +target(all, vnd, 0, 1, 2, 3)dnl
> +target(all, vscsi, 0)dnl
> +target(all, wd, 0, 1, 2, 3)dnl
> +twrget(all, wsmouse, wscons)dnl
>  dnl
>  _mkdev(all, {-all-}, {-dnl
>  show_target(all)dnl
>  -})dnl
>  dnl
> -dnl XXX some arches use ramd, others ramdisk - needs to be fixed eventually
> -__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl
> -dnl
>  target(usb, usb, 0, 1, 2, 3, 4, 5, 6, 7)dnl
>  target(usb, uhid, 0, 1, 2, 3, 4, 5, 6, 7)dnl
>  twrget(usb, fido, fido)dnl
> @@ -208,26 +206,26 @@ __devitem(ch, {-ch*-}, SCSI media change
>  _mcdev(ch, ch*, ch, {-major_ch_c-}, 660, operator)dnl
>  __devitem(uk, uk*, Unknown SCSI devices)dnl
>  _mcdev(uk, uk*, uk, {-major_uk_c-}, 640, operator)dnl
> -dnl XXX see ramdisk above
> -__devitem(ramd, ramdisk, Ramdisk kernel devices,nothing)dnl
>  dnl
> -_mkdev(ramd, ramdisk, {-dnl
> -show_target(ramd)dnl
> +__devitem(ramdisk, ramdisk, Ramdisk kernel devices,nothing)dnl
> +_mkdev(ramdisk, ramdisk, {-dnl
> 

Re: uvm_unmap_kill_entry(): unwire with map lock held

2022-02-04 Thread Martin Pieuchot
On 04/02/22(Fri) 03:39, Klemens Nanni wrote:
> [...] 
> ... with the lock grabbed in uvm_map_teardown() that is, otherwise
> the first call path can lock against itself (regress/misc/posixtestsuite
> is a reproducer for this):
> 
>   vm_map_lock_read_ln+0x38
>   uvm_fault_unwire+0x58
>   uvm_unmap_kill_entry_withlock+0x68
>   uvm_unmap_remove+0x2d4
>   sys_munmap+0x11c
> 
> which is obvious in hindsight.
> 
> So grabbing the lock in uvm_map_teardown() instead avoids that while
> still ensuring a locked map in the path missing a lock.

This should be fine since this function should only be called when the
last reference of a map is dropped.  In other words the locking here is
necessary to satisfy the assertions.

I wonder if the lock shouldn't be taken & released around uvm_map_teardown(),
which would make it easier to see that this is called after the last
refcount decrement.
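
Something in this direction is what I mean (sketch only, refcount handling
shown naively and the deferred teardown ignored; the point is where the
lock/unlock would sit):

void
uvmspace_free(struct vmspace *vm)
{
	if (--vm->vm_refcnt > 0)
		return;

	/* Last reference gone: lock the map only for the teardown. */
	vm_map_lock(&vm->vm_map);
	uvm_map_teardown(&vm->vm_map);
	vm_map_unlock(&vm->vm_map);
	pool_put(&uvm_vmspace_pool, vm);
}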

> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 4 Feb 2022 02:51:00 -
> @@ -2734,6 +2751,7 @@ uvm_map_teardown(struct vm_map *map)
>   KERNEL_ASSERT_UNLOCKED();
>  
>   KASSERT((map->flags & VM_MAP_INTRSAFE) == 0);
> + vm_map_lock(map);
>  
>   /* Remove address selectors. */
>   uvm_addr_destroy(map->uaddr_exe);
> 



Re: uvm_unmap_kill_entry(): unwire with map lock held

2022-01-31 Thread Martin Pieuchot
On 31/01/22(Mon) 10:24, Klemens Nanni wrote:
> Running with my uvm assert diff showed that uvm_fault_unwire_locked()
> was called without any locks held.
> 
> This happened when I rebooted my machine:
> 
>   uvm_fault_unwire_locked()
>   uvm_unmap_kill_entry_withlock()
>   uvm_unmap_kill_entry()
>   uvm_map_teardown()
>   uvmspace_free()
> 
> This code does not corrupt anything because
> uvm_unmap_kill_entry_withlock() is grabbing the kernel lock around its
> uvm_fault_unwire_locked() call.
> 
> But regardless of the kernel lock dances in this code path, the uvm map
> ought to be protected by its own lock.  uvm_fault_unwire() does that.
> 
> uvm_fault_unwire_locked()'s comment says the map must at least be read
> locked, which is what all other code paths to that function do.
> 
> This makes my latest assert diff happy in the reboot case (it did not
> always hit that assert).

I'm happy your asserts found a first bug.

I'm afraid calling this function below could result in grabbing the
lock twice.  Can we be sure this doesn't happen?

uvm_unmap_kill_entry() is called in many different contexts and this is
currently a mess.  I don't know what NetBSD did in this area but it is
worth looking at and see if there isn't a good idea to untangle this.

> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 31 Jan 2022 09:28:04 -
> @@ -2132,7 +2143,7 @@ uvm_unmap_kill_entry_withlock(struct vm_
>   if (VM_MAPENT_ISWIRED(entry)) {
>   KERNEL_LOCK();
>   entry->wired_count = 0;
> - uvm_fault_unwire_locked(map, entry->start, entry->end);
> + uvm_fault_unwire(map, entry->start, entry->end);
>   KERNEL_UNLOCK();
>   }
>  
> 



Re: unlock mmap(2) for anonymous mappings

2022-01-24 Thread Martin Pieuchot
On 24/01/22(Mon) 12:06, Klemens Nanni wrote:
> On Sun, Jan 16, 2022 at 09:22:50AM -0300, Martin Pieuchot wrote:
> > IMHO this approach of let's try if it works now and revert if it isn't
> > doesn't help us make progress.  I'd be more confident seeing diffs that
> > assert for the right lock in the functions called by uvm_mapanon() and
> > documentation about which lock is protecting which field of the data
> > structures.
> 
> I picked `vm_map's `size' as first underdocumented member.
> All accesses to it are protected by vm_map_lock(), either through the
> function itself or implicitly by already requiring the calling function
> to lock the map.

Could we use a vm_map_assert_locked() or something similar that reflects
the exclusive or shared (read) lock comments?  I don't trust comments.
It's too easy to miss a lock in a code path.
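
For example (sketch only, assuming the map's rwlock is the `lock' member
and that intrsafe maps are skipped as elsewhere):

void
vm_map_assert_anylock(struct vm_map *map)
{
	if ((map->flags & VM_MAP_INTRSAFE) == 0)
		KASSERT(rw_lock_held(&map->lock));
}

void
vm_map_assert_wrlock(struct vm_map *map)
{
	if ((map->flags & VM_MAP_INTRSAFE) == 0)
		KASSERT(rw_write_held(&map->lock));
}

uvm_map_isavail() & friends could then call the assert matching their
locking comment instead of only documenting it.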

> So annotate functions using `size' wrt. the required lock.
> 
> Index: uvm_map.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.c,v
> retrieving revision 1.282
> diff -u -p -r1.282 uvm_map.c
> --- uvm_map.c 21 Dec 2021 22:21:32 -  1.282
> +++ uvm_map.c 21 Jan 2022 13:03:05 -
> @@ -805,6 +805,8 @@ uvm_map_sel_limits(vaddr_t *min, vaddr_t
>   * Fills in *start_ptr and *end_ptr to be the first and last entry describing
>   * the space.
>   * If called with prefilled *start_ptr and *end_ptr, they are to be correct.
> + *
> + * map must be at least read-locked.
>   */
>  int
>  uvm_map_isavail(struct vm_map *map, struct uvm_addr_state *uaddr,
> @@ -2206,6 +2208,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
>   * If remove_holes, then remove ET_HOLE entries as well.
>   * If markfree, entry will be properly marked free, otherwise, no replacement
>   * entry will be put in the tree (corrupting the tree).
> + *
> + * map must be locked.
>   */
>  void
>  uvm_unmap_remove(struct vm_map *map, vaddr_t start, vaddr_t end,
> @@ -2976,6 +2980,9 @@ uvm_tree_sanity(struct vm_map *map, char
>   UVM_ASSERT(map, addr == vm_map_max(map), file, line);
>  }
>  
> +/*
> + * map must be at least read-locked.
> + */
>  void
>  uvm_tree_size_chk(struct vm_map *map, char *file, int line)
>  {
> Index: uvm_map.h
> ===
> RCS file: /cvs/src/sys/uvm/uvm_map.h,v
> retrieving revision 1.71
> diff -u -p -r1.71 uvm_map.h
> --- uvm_map.h 15 Dec 2021 12:53:53 -  1.71
> +++ uvm_map.h 21 Jan 2022 12:51:26 -
> @@ -272,7 +272,7 @@ struct vm_map {
>  
>   struct uvm_map_addr addr;   /* [v] Entry tree, by addr */
>  
> - vsize_t size;   /* virtual size */
> + vsize_t size;   /* [v] virtual size */
>   int ref_count;  /* [a] Reference count */
>   int flags;  /* flags */
>   struct mutexflags_lock; /* flags lock */
> 



Re: DDBPROF: move option to amd64,i386 GENERIC

2022-01-18 Thread Martin Pieuchot
On 18/01/22(Tue) 04:38, Klemens Nanni wrote:
> While intended for more architectures, DDBPROF is strictly amd64 and
> i386 only, so the machine-independent sys/conf/GENERIC does not seem fit
> (until all architectures are supported).

This define should die.  There's no need to polish this turd.  Somebody
has to stand up and turn this into a "#if NDT > 0" with the audit that
goes with it.
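
The shape would be the usual config(8) count pattern, sketched below; the
hook names are made up, only the #if NDT > 0 guard is the point:

#include "dt.h"			/* generated by config(8), defines NDT */

#if NDT > 0
#include <dev/dt/dtvar.h>
#endif

void
db_prof_hook(void)		/* illustrative name only */
{
#if NDT > 0
	dt_prof_tick();		/* hypothetical dt(4) entry point */
#endif
}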

> Index: conf/GENERIC
> ===
> RCS file: /cvs/src/sys/conf/GENERIC,v
> retrieving revision 1.281
> diff -u -p -r1.281 GENERIC
> --- conf/GENERIC  23 Dec 2021 10:04:14 -  1.281
> +++ conf/GENERIC  18 Jan 2022 04:28:29 -
> @@ -4,7 +4,6 @@
>  #GENERIC kernel
>  
>  option   DDB # in-kernel debugger
> -#option  DDBPROF # ddb(4) based profiling
>  #option  DDB_SAFE_CONSOLE # allow break into ddb during boot
>  #makeoptions DEBUG=""# do not compile full symbol table
>  #makeoptions PROF="-pg"  # build profiled kernel
> Index: arch/amd64/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/amd64/conf/GENERIC,v
> retrieving revision 1.510
> diff -u -p -r1.510 GENERIC
> --- arch/amd64/conf/GENERIC   4 Jan 2022 05:50:43 -   1.510
> +++ arch/amd64/conf/GENERIC   18 Jan 2022 04:28:04 -
> @@ -13,6 +13,8 @@ machine amd64
>  include  "../../../conf/GENERIC"
>  maxusers 80  # estimated number of users
>  
> +#option  DDBPROF # ddb(4) based profiling
> +
>  option   USER_PCICONF# user-space PCI configuration
>  
>  option   APERTURE# in-kernel aperture driver for XFree86
> Index: arch/i386/conf/GENERIC
> ===
> RCS file: /cvs/src/sys/arch/i386/conf/GENERIC,v
> retrieving revision 1.860
> diff -u -p -r1.860 GENERIC
> --- arch/i386/conf/GENERIC2 Jan 2022 23:14:27 -   1.860
> +++ arch/i386/conf/GENERIC18 Jan 2022 04:28:26 -
> @@ -13,6 +13,8 @@ machine i386
>  include  "../../../conf/GENERIC"
>  maxusers 80  # estimated number of users
>  
> +#option  DDBPROF # ddb(4) based profiling
> +
>  option   USER_PCICONF# user-space PCI configuration
>  
>  option   APERTURE# in-kernel aperture driver for XFree86
> 



Re: uvm_swap: introduce uvm_swap_data_lock

2022-01-16 Thread Martin Pieuchot
Nice!

On 30/12/21(Thu) 23:38, Theo Buehler wrote:
> The diff below does two things: it adds a uvm_swap_data_lock mutex and
> trades it for the KERNEL_LOCK in uvm_swapisfull() and uvm_swap_markbad()

Why is it enough?  Which fields does the lock protect in these
functions?  Is it `uvmexp.swpages', and could that be documented?

What about `nswapdev'?  Why is the rwlock grabbed before reading it in
sys_swapctl()?

What about `swpginuse'?

If the mutex/rwlock are used to protect the global `swap_priority' could
that be also documented?  Once this is documented it should be trivial to
see that some places are missing some locking.  Is it intentional?

> The uvm_swap_data_lock protects all swap data structures, so needs to be
> grabbed a few times, many of them already documented in the comments.
> 
> For review, I suggest comparing to what NetBSD did and also going
> through the consumers (swaplist_insert, swaplist_find, swaplist_trim)
> and check that they are properly locked when called, or that there is
> the KERNEL_LOCK() in place when swap data structures are manipulated.

I'd suggest using the KASSERT(rw_write_held()) idiom to further reduce
the differences with NetBSD.
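
Concretely, something like this at the top of swaplist_insert(),
swaplist_find() and swaplist_trim() (sketch of the idiom; same checks your
diff already adds, with the rwlock one expressed as a KASSERT):

	KASSERT(rw_write_held(&swap_syscall_lock));
	MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);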

> In swapmount() I introduced locking since that's needed to be able to
> assert that the proper locks are held in swaplist_{insert,find,trim}.

Could the KERNEL_LOCK() in uvm_swap_get() be pushed a bit further down?
What about `uvmexp.nswget' and `uvmexp.swpgonly' in there?

> Index: uvm/uvm_swap.c
> ===
> RCS file: /cvs/src/sys/uvm/uvm_swap.c,v
> retrieving revision 1.152
> diff -u -p -r1.152 uvm_swap.c
> --- uvm/uvm_swap.c12 Dec 2021 09:14:59 -  1.152
> +++ uvm/uvm_swap.c30 Dec 2021 15:47:20 -
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include <sys/mutex.h>
>  #include 
>  #include 
>  #include 
> @@ -91,6 +92,9 @@
>   *  - swap_syscall_lock (sleep lock): this lock serializes the swapctl
>   *system call and prevents the swap priority list from changing
>   *while we are in the middle of a system call (e.g. SWAP_STATS).
> + *  - uvm_swap_data_lock (mutex): this lock protects all swap data
> + *structures including the priority list, the swapdev structures,
> + *and the swapmap arena.
>   *
>   * each swap device has the following info:
>   *  - swap device in use (could be disabled, preventing future use)
> @@ -212,6 +216,7 @@ LIST_HEAD(swap_priority, swappri);
>  struct swap_priority swap_priority;
>  
>  /* locks */
> +struct mutex uvm_swap_data_lock = MUTEX_INITIALIZER(IPL_NONE);
>  struct rwlock swap_syscall_lock = RWLOCK_INITIALIZER("swplk");
>  
>  /*
> @@ -442,7 +447,7 @@ uvm_swap_finicrypt_all(void)
>  /*
>   * swaplist_insert: insert swap device "sdp" into the global list
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   * => caller must provide a newly malloc'd swappri structure (we will
>   *   FREE it if we don't need it... this it to prevent malloc blocking
>   *   here while adding swap)
> @@ -452,6 +457,9 @@ swaplist_insert(struct swapdev *sdp, str
>  {
>   struct swappri *spp, *pspp;
>  
> + rw_assert_wrlock(&swap_syscall_lock);
> + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   /*
>* find entry at or after which to insert the new device.
>*/
> @@ -493,7 +501,7 @@ swaplist_insert(struct swapdev *sdp, str
>   * swaplist_find: find and optionally remove a swap device from the
>   *   global list.
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   * => we return the swapdev we found (and removed)
>   */
>  struct swapdev *
> @@ -502,6 +510,9 @@ swaplist_find(struct vnode *vp, boolean_
>   struct swapdev *sdp;
>   struct swappri *spp;
>  
> + rw_assert_wrlock(&swap_syscall_lock);
> + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   /*
>* search the lists for the requested vp
>*/
> @@ -524,13 +535,16 @@ swaplist_find(struct vnode *vp, boolean_
>   * swaplist_trim: scan priority list for empty priority entries and kill
>   *   them.
>   *
> - * => caller must hold both swap_syscall_lock and uvm.swap_data_lock
> + * => caller must hold both swap_syscall_lock and uvm_swap_data_lock
>   */
>  void
>  swaplist_trim(void)
>  {
>   struct swappri *spp, *nextspp;
>  
> + rw_assert_wrlock(&swap_syscall_lock);
> + MUTEX_ASSERT_LOCKED(&uvm_swap_data_lock);
> +
>   LIST_FOREACH_SAFE(spp, &swap_priority, spi_swappri, nextspp) {
>   if (!TAILQ_EMPTY(&spp->spi_swapdev))
>   continue;
> @@ -543,7 +557,7 @@ swaplist_trim(void)
>   * swapdrum_add: add a "swapdev"'s blocks into /dev/drum's area.
>   *
>   * => caller must hold swap_syscall_lock
> - * => uvm.swap_data_lock should be unlocked (we may sleep)
> + * => uvm_swap_data_lock should be unlocked 

Re: unlock mmap(2) for anonymous mappings

2022-01-16 Thread Martin Pieuchot
On 14/01/22(Fri) 23:01, Mark Kettenis wrote:
> > Date: Tue, 11 Jan 2022 23:13:20 +
> > From: Klemens Nanni 
> > 
> > On Tue, Jan 11, 2022 at 09:54:44AM -0700, Theo de Raadt wrote:
> > > > Now this is clearly a "slow" path.  I don't think there is any reason
> > > > not to put all the code in that if (uvw_wxabort) block under the
> > > > kernel lock.  For now I think making access to ps_wxcounter atomic is
> > > > just too fine grained.
> > > 
> > > Right.  Lock the whole block.
> > 
> > Thanks everyone, here's the combined diff for that.
> 
> I think mpi@ should be involved in the actual unlocking of mmap(2),
> munmap(2) and mprotect(2).  But the changes to uvm_mmap.c are ok
> kettenis@ and can be committed now.

It isn't clear to me what changed since the last time this has been
tried.  Why is it safe now?  What are the locking assumptions?  

IMHO this approach of let's try if it works now and revert if it isn't
doesn't help us make progress.  I'd be more confident seeing diffs that
assert for the right lock in the functions called by uvm_mapanon() and
documentation about which lock is protecting which field of the data
structures.

NetBSD has done much of this and the code bases do not diverge much so
it can be useful to look there as well.



Re: patch: add a new ktrace facility for replacing some printf-debug

2022-01-07 Thread Martin Pieuchot
On 07/01/22(Fri) 10:54, Sebastien Marie wrote:
> Hi,
> 
> Debugging some code paths is complex: for example, unveil(2) code is
> involved inside VFS, and using DEBUG_UNVEIL means that the kernel is
> spamming printf() for all processes using unveil(2) (a lot of
> processes) instead of just the interested cases I want to follow.
> 
> So I cooked the following diff to add a KTRLOG() facility to be able
> to replace printf()-like debugging with a more process-limited method.

I wish you could debug such issues without having to change any kernel
code; that's why I started btrace(8).

You should already be able to filter probes per thread/process; maybe you
want to replace some debug printfs with static probes.  Alternatively you
could define DDBPROF and use the kprobe provider, which allows you to
inspect the prologue and epilogue of ELF functions.

Maybe btrace(8) is not yet fully functional to debug this particular
problem, but improving it should hopefully give us a tool to debug most
of the kernel issues without having to write a diff and boot a custom
kernel.  At least that's the goal.



Re: gprof: Profiling a multi-threaded application

2021-12-10 Thread Martin Pieuchot
On 10/12/21(Fri) 09:56, Yuichiro NAITO wrote:
> Any comments about this topic?

I'm ok with this approach.  I would appreciate it if somebody
else could take it over; I'm too busy with other stuff.

> On 7/12/21 18:09, Yuichiro NAITO wrote:
> > Hi, Martin
> > 
> > On 2021/07/10 16:55, Martin Pieuchot wrote:
> > > Hello Yuichiro, thanks for your work !
> > 
> > Thanks for the response.
> > 
> > > > On 2021/06/16 16:34, Yuichiro NAITO wrote:
> > > > > When I compile a multi-threaded application with '-pg' option, I 
> > > > > always get no
> > > > > results in gmon.out. With bad luck, my application gets SIGSEGV while 
> > > > > running.
> > > > > Because the buffer that holds number of caller/callee count is the 
> > > > > only one
> > > > > in the process and will be broken by multi-threaded access.
> > > > > 
> > > > > I get the idea to solve this problem from NetBSD. NetBSD has 
> > > > > individual buffers
> > > > > for each thread and merges them at the end of profiling.
> > > 
> > > Note that the kernel use a similar approach but doesn't merge the buffer,
> > > instead it generates a file for each CPU.
> > 
> > Yes, so the number of output files is limited by the number of CPUs in 
> > case of
> > the kernel profiling. I think number of application threads can be 
> > increased more
> > casually. I don't want to see dozens of 'gmon.out' files.
> > 
> > > > > NetBSD stores the reference to the individual buffer by 
> > > > > pthread_setspecific(3).
> > > > > I think it causes infinite recursive call if whole libc library 
> > > > > (except
> > > > > mcount.c) is compiled with -pg.
> > > > > 
> > > > > The compiler generates a '_mcount' function call at the beginning of 
> > > > > every
> > > > > function. If '_mcount' calls pthread_getspecific(3) to get the 
> > > > > individual
> > > > > buffer, pthread_getspecific() calls '_mcount' again and causes 
> > > > > infinite
> > > > > recursion.
> > > > > 
> > > > > NetBSD prevents from infinite recursive call by checking a global 
> > > > > variable. But
> > > > > this approach also prevents simultaneously call of '_mcount' on a 
> > > > > multi-threaded
> > > > > application. It makes a little bit wrong results of profiling.
> > > > > 
> > > > > So I added a pointer to the buffer in `struct pthread` that can be 
> > > > > accessible
> > > > > via macro call as same as pthread_self(3). This approach can prevent 
> > > > > of
> > > > > infinite recursive call of '_mcount'.
> > > 
> > > Not calling a libc function for this makes sense.  However I'm not
> > > convinced that accessing `tib_thread' before _rthread_init() has been
> > > called is safe.
> > 
> > Before '_rthread_init’ is called, '__isthreaded' global variable is kept to 
> > be 0.
> > My patch doesn't access tib_thread in this case.
> > After calling `_rthread_init`, `pthread_create()` changes `__isthreaded` to 
> > 1.
> > Tib_thread will be accessed by all threads that are newly created and the 
> > initial one.
> > 
> > I believe tib of the initial thread has been initialized in `_libc_preinit' 
> > function
> > in 'lib/libc/dlfcn/init.c'.
> > 
> > > I'm not sure where is the cleanest way to place the per-thread buffer,
> > > I'd suggest you ask guenther@ about this.
> > 
> > I added guenther@ in CC of this mail.
> > I hope I can get an advise about per-thread buffer.
> > 
> > > Maybe the initialization can be done outside of _mcount()?
> > 
> > Yes, I think tib is initialized in `pthread_create()` and `_libc_preinit()`.
> > 
> > > > > I obtained merging function from NetBSD that is called in '_mcleanup' 
> > > > > function.
> > > > > Merging function needs to walk through all the individual buffers,
> > > > > I added SLIST_ENTRY member in 'struct gmonparam' to make a list of 
> > > > > the buffers.
> > > > > And also added '#ifndef _KERNEL' for the SLIST_ENTRY member not to be 
> > > > > used for
> > > > > the kernel.
> > > > > 
> > > > > But I still use pthread_getspecific(3) for that can call destructor 
&

Re: net write in pcb hash

2021-12-09 Thread Martin Pieuchot
On 08/12/21(Wed) 22:39, Alexander Bluhm wrote:
> On Wed, Dec 08, 2021 at 03:28:34PM -0300, Martin Pieuchot wrote:
> > On 04/12/21(Sat) 01:02, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > As I want a read-only net lock for forwarding, all paths should be
> > > checked for the correct net lock and asserts.  I found code in
> > > in_pcbhashlookup() where reading the PCB table results in a write
> > > to optimize the cache.
> > > 
> > > Properly protecting PCB hashes is out of scope for parallel forwarding.
> > > Can we get away with this hack?  Only update the cache when we are
> > > in the TCP or UDP stack with the write lock.  The access from pf is
> > > read-only.
> > > 
> > > NET_WLOCKED() indicates whether we may change data structures.
> > 
> > I recall that we currently do not want to introduce such idiom: change
> > the behavior of the code depending on which lock is held by the caller.
> > 
> > Can we instead assert that a write-lock is held before modifying the
> > list?
> 
> We could also pass down the kind of lock that is used.  Goal is
> that pf uses shared net lock.  TCP and UDP will keep the exclusive
> net lock for a while.

Changing the logic of a function based on the type of a lock is not
different from the previous approach.

> Diff gets longer but perhaps a bit clearer what is going on.

I believe we want to split in_pcblookup_listen() into two functions.
One which is read-only and one which modifies the head of the hash
chain.
The read-only one asserts for any lock; the one that modifies the hash
calls the former and asserts for a write lock.
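
Rough sketch of the split (names, the hypothetical NET_ASSERT_WLOCKED()
and the abbreviated argument lists are illustrative only):

struct inpcb *
in_pcblookup_listen_locked(struct inpcbtable *table, struct in_addr laddr,
    u_int lport, u_int rtable, struct inpcbhead **headp)
{
	struct inpcbhead *head;
	struct inpcb *inp;

	NET_ASSERT_LOCKED();	/* shared or exclusive, both are fine */
	head = in_pcbhash(table, rtable_l2(rtable), &zeroin_addr, 0,
	    &laddr, lport);
	LIST_FOREACH(inp, head, inp_hash) {
		/*
		 * Same matching as the current code (wildcard laddr,
		 * rdomain, pf divert) -- elided in this sketch.
		 */
	}
	if (headp != NULL)
		*headp = head;
	return (inp);
}

struct inpcb *
in_pcblookup_listen(struct inpcbtable *table, struct in_addr laddr,
    u_int lport, u_int rtable)
{
	struct inpcbhead *head;
	struct inpcb *inp;

	NET_ASSERT_WLOCKED();	/* hypothetical exclusive assert */
	inp = in_pcblookup_listen_locked(table, laddr, lport, rtable, &head);
	/* the self-organizing chain trick needs the exclusive lock */
	if (inp != NULL && inp != LIST_FIRST(head)) {
		LIST_REMOVE(inp, inp_hash);
		LIST_INSERT_HEAD(head, inp, inp_hash);
	}
	return (inp);
}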

Alternatively, we could protect the PCB hash with a mutex.  This has the
advantage of not making the scope of the NET_LOCK() more complex.  In
the end we all know something like that will be done.  I don't know how
other BSD did this but I'm sure this will help getting the remaining
socket layer out of the NET_LOCK().
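
For the mutex variant I picture something along these lines (the field
name and its placement in "struct inpcbtable" are assumptions):

struct inpcbtable {
	/* ... existing members ... */
	struct mutex inpt_mtx;		/* protects the hash chains */
};

static inline void
in_pcbtable_lock(struct inpcbtable *table)
{
	mtx_enter(&table->inpt_mtx);
}

static inline void
in_pcbtable_unlock(struct inpcbtable *table)
{
	mtx_leave(&table->inpt_mtx);
}

The hash walk and the head-of-chain reordering would then both run under
this mutex, independently of whether the caller holds the net lock shared
or exclusive.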

> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1122
> diff -u -p -r1.1122 pf.c
> --- net/pf.c  7 Jul 2021 18:38:25 -   1.1122
> +++ net/pf.c  8 Dec 2021 21:16:16 -
> @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd)
>   sport = pd->hdr.tcp.th_sport;
>   dport = pd->hdr.tcp.th_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &tcbtable;
>   break;
>   case IPPROTO_UDP:
>   sport = pd->hdr.udp.uh_sport;
>   dport = pd->hdr.udp.uh_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &udbtable;
>   break;
>   default:
> @@ -3348,22 +3346,24 @@ pf_socket_lookup(struct pf_pdesc *pd)
>* Fails when rtable is changed while evaluating the ruleset
>* The socket looked up will not match the one hit in the end.
>*/
> - inp = in_pcbhashlookup(tb, saddr->v4, sport, daddr->v4, dport,
> - pd->rdomain);
> + NET_ASSERT_LOCKED();
> + inp = in_pcbhashlookup_wlocked(tb, saddr->v4, sport, daddr->v4,
> + dport, pd->rdomain, 0);
>   if (inp == NULL) {
> - inp = in_pcblookup_listen(tb, daddr->v4, dport,
> - NULL, pd->rdomain);
> + inp = in_pcblookup_listen_wlocked(tb, daddr->v4, dport,
> + NULL, pd->rdomain, 0);
>   if (inp == NULL)
>   return (-1);
>   }
>   break;
>  #ifdef INET6
>   case AF_INET6:
> - inp = in6_pcbhashlookup(tb, &saddr->v6, sport, &daddr->v6,
> - dport, pd->rdomain);
> + NET_ASSERT_LOCKED();
> + inp = in6_pcbhashlookup_wlocked(tb, &saddr->v6, sport,
> + &daddr->v6, dport, pd->rdomain, 0);
>   if (inp == NULL) {
> - inp = in6_pcblookup_listen(tb, &daddr->v6, dport,
> - NULL, pd->rdomain);
> + inp = in6_pcblookup_listen_wlocked(tb, &daddr->v6,
> + dport, NULL, pd->rdomain, 0);
>   if (inp == NULL)
>   return (-1);
>   }
> Index: netinet/in_pcb.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 in_pcb.c
> 

Re: net write in pcb hash

2021-12-08 Thread Martin Pieuchot
On 04/12/21(Sat) 01:02, Alexander Bluhm wrote:
> Hi,
> 
> As I want a read-only net lock for forwarding, all paths should be
> checked for the correct net lock and asserts.  I found code in
> in_pcbhashlookup() where reading the PCB table results in a write
> to optimize the cache.
> 
> Properly protecting PCB hashes is out of scope for parallel forwarding.
> Can we get away with this hack?  Only update the cache when we are
> in the TCP or UDP stack with the write lock.  The access from pf is
> read-only.
> 
> NET_WLOCKED() indicates whether we may change data structures.

I recall that we currently do not want to introduce such idiom: change
the behavior of the code depending on which lock is held by the caller.

Can we instead assert that a write-lock is held before modifying the
list?

> Also move the assert from pf to in_pcb where the critical section
> is.
> 
> bluhm
> 
> Index: net/pf.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pf.c,v
> retrieving revision 1.1122
> diff -u -p -r1.1122 pf.c
> --- net/pf.c  7 Jul 2021 18:38:25 -   1.1122
> +++ net/pf.c  3 Dec 2021 22:20:32 -
> @@ -3317,14 +3317,12 @@ pf_socket_lookup(struct pf_pdesc *pd)
>   sport = pd->hdr.tcp.th_sport;
>   dport = pd->hdr.tcp.th_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &tcbtable;
>   break;
>   case IPPROTO_UDP:
>   sport = pd->hdr.udp.uh_sport;
>   dport = pd->hdr.udp.uh_dport;
>   PF_ASSERT_LOCKED();
> - NET_ASSERT_LOCKED();
>   tb = &udbtable;
>   break;
>   default:
> Index: netinet/in_pcb.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/in_pcb.c,v
> retrieving revision 1.256
> diff -u -p -r1.256 in_pcb.c
> --- netinet/in_pcb.c  25 Oct 2021 22:20:47 -  1.256
> +++ netinet/in_pcb.c  3 Dec 2021 22:20:32 -
> @@ -1069,6 +1069,8 @@ in_pcbhashlookup(struct inpcbtable *tabl
>   u_int16_t fport = fport_arg, lport = lport_arg;
>   u_int rdomain;
>  
> + NET_ASSERT_LOCKED();
> +
>   rdomain = rtable_l2(rtable);
>   head = in_pcbhash(table, rdomain, &faddr, fport, &laddr, lport);
>   LIST_FOREACH(inp, head, inp_hash) {
> @@ -1085,7 +1087,7 @@ in_pcbhashlookup(struct inpcbtable *tabl
>* repeated accesses are quicker.  This is analogous to
>* the historic single-entry PCB cache.
>*/
> - if (inp != LIST_FIRST(head)) {
> + if (NET_WLOCKED() && inp != LIST_FIRST(head)) {
>   LIST_REMOVE(inp, inp_hash);
>   LIST_INSERT_HEAD(head, inp, inp_hash);
>   }
> @@ -1119,6 +1121,8 @@ in_pcblookup_listen(struct inpcbtable *t
>   u_int16_t lport = lport_arg;
>   u_int rdomain;
>  
> + NET_ASSERT_LOCKED();
> +
>   key1 = &laddr;
>   key2 = &zeroin_addr;
>  #if NPF > 0
> @@ -1185,7 +1189,7 @@ in_pcblookup_listen(struct inpcbtable *t
>* repeated accesses are quicker.  This is analogous to
>* the historic single-entry PCB cache.
>*/
> - if (inp != NULL && inp != LIST_FIRST(head)) {
> + if (NET_WLOCKED() && inp != NULL && inp != LIST_FIRST(head)) {
>   LIST_REMOVE(inp, inp_hash);
>   LIST_INSERT_HEAD(head, inp, inp_hash);
>   }
> Index: sys/systm.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/sys/systm.h,v
> retrieving revision 1.154
> diff -u -p -r1.154 systm.h
> --- sys/systm.h   2 Jun 2021 00:39:25 -   1.154
> +++ sys/systm.h   3 Dec 2021 22:20:32 -
> @@ -344,6 +344,8 @@ extern struct rwlock netlock;
>  #define  NET_RLOCK_IN_IOCTL()do { rw_enter_read(&netlock); } while 
> (0)
>  #define  NET_RUNLOCK_IN_IOCTL()  do { rw_exit_read(&netlock); } while (0)
>  
> +#define  NET_WLOCKED()   (rw_status(&netlock) == RW_WRITE)
> +
>  #ifdef DIAGNOSTIC
>  
>  #define  NET_ASSERT_UNLOCKED()   
> \
> 



Re: kbind(2): push kernel lock down as needed

2021-12-08 Thread Martin Pieuchot
On 06/12/21(Mon) 14:58, Scott Cheloha wrote:
> On Mon, Dec 06, 2021 at 08:35:15PM +0100, Mark Kettenis wrote:
> > > Date: Sun, 5 Dec 2021 18:01:04 -0600
> > > From: Scott Cheloha 
> > > 
> > > Two things in sys_kbind() need an explicit kernel lock.  First,
> > > sigexit().  Second, uvm_map_extract(), because the following call
> > > chain panics without it:
> > > 
> > > [...]
> > > 
> > > With this committed we can unlock kbind(2).
> > > 
> > > Thoughts?  ok?
> > 
> > To be honest, I don't think this makes sense unless you can make the
> > "normal" code path lock free.  You're replacing a single
> > KERNEL_LOCK/UNLOCK pair with (potentially) a bunch of them.  That may
> > actually make things worse.  So I think we need to make
> > uvm_map_extract() mpsafe first.
> 
> Unlocking uvm_map_extract() would improve things, yes.

Yes, please.  What's missing?



Re: Rework UNIX sockets locking to be fine grained

2021-12-01 Thread Martin Pieuchot
On 24/11/21(Wed) 15:04, Vitaliy Makkoveev wrote:
> [...] 
> Really, this is the simplest way. The shared lock for the pair of
> sockets moves re-lock dances to the connect and disconnect stages which
> should be also protected by locks. And not for the pair. Many SOCK_DGRAM
> sockets could be connected to one socket. This could be done, but this
> is the hell. And there is absolutely NO profit.

It's not clear to me why sharing a lock isn't simpler.  Currently all
unix(4) sockets share a lock and the locking semantic is simpler.

If two sockets are linked couldn't they use the same rwlock?  Did you
consider this approach?  Is it complicated to know which lock to pick?

This is what is done in UVM and that's why the rwlock is allocated outside
of "struct uvm_object" with rw_obj_alloc().  Having a 'rwlock pointer' in
"struct socket" could also help by setting this pointer to  in the
case of UDP/TCP sockets. 
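
To be concrete, I am thinking of something in this direction (sketch only:
the `so_lock' pointer and the helper do not exist, and how the old lock of
`so2' gets disposed of is left out):

struct socket {
	/* ... existing members ... */
	struct rwlock *so_lock;		/* own lock, the peer's, or &netlock */
};

/* when two unix(4) sockets get connected, make them share one rwlock */
void
unp_share_lock(struct socket *so, struct socket *so2)
{
	so2->so_lock = so->so_lock;
}

For TCP/UDP sockets `so_lock' could simply point to &netlock at creation
time, so solock() becomes a single rw_enter_write(so->so_lock) in every
case.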

I'm not saying your approach isn't working.  I'm just not convinced it is
the simplest path forward and I wish we could do without refcounting.



Re: Please test: UVM fault unlocking (aka vmobjlock)

2021-11-29 Thread Martin Pieuchot
On 24/11/21(Wed) 11:16, Martin Pieuchot wrote:
> Diff below unlock the bottom part of the UVM fault handler.  I'm
> interested in squashing the remaining bugs.  Please test with your usual
> setup & report back.

Thanks to all the testers, here's a new version that includes a bug fix.

Tests on !x86 architectures are much appreciated!

Thanks a lot,
Martin

diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP
index bb842f6d96e..e5334c19eac 100644
--- sys/arch/amd64/conf/GENERIC.MP
+++ sys/arch/amd64/conf/GENERIC.MP
@@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC"
 
 option MULTIPROCESSOR
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP
index 980a572b8fd..ef7ded61501 100644
--- sys/arch/i386/conf/GENERIC.MP
+++ sys/arch/i386/conf/GENERIC.MP
@@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC"
 
 option MULTIPROCESSOR  # Multiple processor support
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c 
sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
index ce8e2eca141..47b567087e7 100644
--- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
+++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
@@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj)
 #ifdef __linux__
shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 #else
+   rw_enter(obj->base.uao->vmobjlock, RW_WRITE);
obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size,
PGO_ALLPAGES | PGO_FREE);
+   rw_exit(obj->base.uao->vmobjlock);
 #endif
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
diff --git sys/dev/pci/drm/radeon/radeon_ttm.c 
sys/dev/pci/drm/radeon/radeon_ttm.c
index eb879b5c72c..837a9f94298 100644
--- sys/dev/pci/drm/radeon/radeon_ttm.c
+++ sys/dev/pci/drm/radeon/radeon_ttm.c
@@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t 
vaddr, vm_page_t *pps,
struct radeon_device *rdev;
int r;
 
+   KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock));
+
bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj;
rdev = bo->dev->dev_private;
	down_read(&rdev->pm.mclk_lock);
diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c
index 20051d95dc1..a5c403ab67d 100644
--- sys/uvm/uvm_aobj.c
+++ sys/uvm/uvm_aobj.c
@@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = {
  * deadlock.
  */
 static LIST_HEAD(aobjlist, uvm_aobj) uao_list = 
LIST_HEAD_INITIALIZER(uao_list);
-static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE);
+static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
 
 
 /*
@@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx)
  * uao_set_swslot: set the swap slot for a page in an aobj.
  *
  * => setting a slot to zero frees the slot
+ * => object must be locked by caller
  * => we return the old slot number, or -1 if we failed to allocate
  *memory to record the new slot number
  */
@@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int 
slot)
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
int oldslot;
 
-   KERNEL_ASSERT_LOCKED();
+   KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0);
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
/*
@@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj)
	struct uvm_object *uobj = &aobj->u_obj;
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
+   KASSERT(rw_write_held(uobj->vmobjlock));
uao_dropswap_range(uobj, 0, 0);
+   rw_exit(uobj->vmobjlock);
 
if (UAO_USES_SWHASH(aobj)) {
/*
@@ -671,6 +674,7 @@ struct uvm_object *
 uao_create(vsize_t size, int flags)
 {
static struct uvm_aobj kernel_object_store;
+   static struct rwlock bootstrap_kernel_object_lock;
static int kobj_alloced = 0;
int pages = round_page(size) >> PAGE_SHIFT;
struct uvm_aobj *aobj;
@@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags)
 * Initialise UVM object.
 */
	uvm_obj_init(&aobj->u_obj, &aobj_pager, refs);
+   if (flags & UAO_FLAG_KERNOBJ) {
+   /* Use a temporary static lock for kernel_object. */
+   rw_init(&bootstrap_kernel_object_lock, "kobjlk");
+   uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock);
+   }
 
/*
 * now that aobj is ready, add it to the global list
@@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj)
 * involved in is complete), release any swap resources and free
 * the page itself.
 */
-   uvm_lock_pageq();
-   while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   

Re: Rework UNIX sockets locking to be fine grained

2021-11-24 Thread Martin Pieuchot
On 22/11/21(Mon) 14:42, Vitaliy Makkoveev wrote:
> On Sat, Nov 20, 2021 at 03:12:31AM +0300, Vitaliy Makkoveev wrote:
> > Updated diff. Re-lock dances were simplified in the unix(4) sockets
> > layer.
> > 
> > Reference counters added to unix(4) sockets layer too. This makes 
> > pointer dereference of peer's control block always safe after re-lock.
> > 
> > The `unp_refs' list cleanup done in the unp_detach(). This removes the
> > case where the socket connected to our dying socket could be passed to
> > unp_disconnect() and the check of it's connection state became much
> > easier.
> >
> 
> Another re-lock simplification. We could enforce the lock order between
> the listening socket `head' and the socket `so' linked to its `so_q0'
> or `so_q' to solock(head) -> solock(so).
> 
> This removes re-lock from accept(2) and the accepting socket couldn't be
> stolen by concurrent accept(2) thread. This removes re-lock from `so_q0'
> and `so_q' cleanup on dying listening socket.
> 
> The previous incarnation of this diff does re-lock in a half of
> doaccept(), soclose(), sofree() and soisconnected() calls. The current
> diff does not re-lock in doaccept() and soclose() and always so re-lock
> in sofree() and soisconnected().
> 
> I guess this is the latest simplification and this diff could be pushed
> forward.

This diff is really interesting.  It shows that the current locking
design needs to be reworked.

I don't think we should expose the locking strategy with a `persocket'
variable then use if/else dances to decide if one of two locks need to
be taken/released.  Instead could we fold the TCP/UDP locking into more
generic functions?  For example connect() could be:

int
soconnect2(struct socket *so1, struct socket *so2)
{
int s, error;

s = solock_pair(so1, so2);
error = (*so1->so_proto->pr_usrreq)(so1, PRU_CONNECT2, NULL,
(struct mbuf *)so2, NULL, curproc);
sounlock_pair(so1, so2, s);
return (error);
}

And solock_pair() would do the right thing(tm) based on the socket type.
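
A sketch of what I mean (solock_persocket() is taken from your diff, the
rest is illustrative and the return value handling is hand-waved):

int
solock_pair(struct socket *so1, struct socket *so2)
{
	if (!solock_persocket(so1))
		return solock(so1);	/* TCP/UDP: the net lock covers both */

	/* per-socket locks: always lock in a fixed (address) order */
	if (so1 == so2) {
		solock(so1);
	} else if (so1 < so2) {
		solock(so1);
		solock(so2);
	} else {
		solock(so2);
		solock(so1);
	}
	return (SL_LOCKED);
}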

Because in the end we want to prepare this layer to use per-socket locks
with TCP/UDP sockets as well.

Could something similar be done for doaccept()?

I'm afraid about introducing reference counting.  Once there is reference
counting it tends to be abused.  It's not clear to me for which reason it
is added.  It looks like to work around lock ordering issues, could you
talk a bit about this?  Is there any alternative?

I also don't understand the problem behind:

> + unp_ref(unp2);
> + sounlock(so, SL_LOCKED);
> + solock(so2);
> + solock(so);
> +
> + /* Datagram socket could be reconnected due to re-lock. */
> + if (unp->unp_conn != unp2) {
> + sounlock(so2, SL_LOCKED);
> + unp_rele(unp2);
> + goto again;
> + }
> +
> + unp_rele(unp2);


It seems that doing an unlock/relock dance requires a lot of added
complexity, why is it done this way?

Thanks for dealing with this!

> Index: sys/kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.269
> diff -u -p -r1.269 uipc_socket.c
> --- sys/kern/uipc_socket.c11 Nov 2021 16:35:09 -  1.269
> +++ sys/kern/uipc_socket.c22 Nov 2021 11:36:40 -
> @@ -52,6 +52,7 @@
>  #include 
>  #include 
>  #include 
> +#include <sys/refcnt.h>
>  
>  #ifdef DDB
>  #include 
> @@ -156,7 +157,9 @@ soalloc(int prflags)
>   so = pool_get(_pool, prflags);
>   if (so == NULL)
>   return (NULL);
> - rw_init(&so->so_lock, "solock");
> + rw_init_flags(&so->so_lock, "solock", RWL_DUPOK);
> + refcnt_init(&so->so_refcnt);
> +
>   return (so);
>  }
>  
> @@ -257,6 +260,8 @@ solisten(struct socket *so, int backlog)
>  void
>  sofree(struct socket *so, int s)
>  {
> + int persocket = solock_persocket(so);
> +
>   soassertlocked(so);
>  
>   if (so->so_pcb || (so->so_state & SS_NOFDREF) == 0) {
> @@ -264,16 +269,53 @@ sofree(struct socket *so, int s)
>   return;
>   }
>   if (so->so_head) {
> + struct socket *head = so->so_head;
> +
>   /*
>* We must not decommission a socket that's on the accept(2)
>* queue.  If we do, then accept(2) may hang after select(2)
>* indicated that the listening socket was ready.
>*/
> - if (!soqremque(so, 0)) {
> + if (so->so_onq == >so_q) {
>   sounlock(so, s);
>   return;
>   }
> +
> + if (persocket) {
> + /*
> +  * Concurrent close of `head' could
> +  * abort `so' due to re-lock.
> +  */
> + soref(so);
> + soref(head);
> +  

Please test: UVM fault unlocking (aka vmobjlock)

2021-11-24 Thread Martin Pieuchot
Diff below unlock the bottom part of the UVM fault handler.  I'm
interested in squashing the remaining bugs.  Please test with your usual
setup & report back.

Thanks,
Martin

diff --git sys/arch/amd64/conf/GENERIC.MP sys/arch/amd64/conf/GENERIC.MP
index bb842f6d96e..e5334c19eac 100644
--- sys/arch/amd64/conf/GENERIC.MP
+++ sys/arch/amd64/conf/GENERIC.MP
@@ -4,6 +4,6 @@ include "arch/amd64/conf/GENERIC"
 
 option MULTIPROCESSOR
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/arch/i386/conf/GENERIC.MP sys/arch/i386/conf/GENERIC.MP
index 980a572b8fd..ef7ded61501 100644
--- sys/arch/i386/conf/GENERIC.MP
+++ sys/arch/i386/conf/GENERIC.MP
@@ -7,6 +7,6 @@ include "arch/i386/conf/GENERIC"
 
 option MULTIPROCESSOR  # Multiple processor support
 #optionMP_LOCKDEBUG
-#optionWITNESS
+option WITNESS
 
 cpu*   at mainbus?
diff --git sys/dev/pci/drm/i915/gem/i915_gem_shmem.c 
sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
index ce8e2eca141..47b567087e7 100644
--- sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
+++ sys/dev/pci/drm/i915/gem/i915_gem_shmem.c
@@ -268,8 +268,10 @@ shmem_truncate(struct drm_i915_gem_object *obj)
 #ifdef __linux__
shmem_truncate_range(file_inode(obj->base.filp), 0, (loff_t)-1);
 #else
+   rw_enter(obj->base.uao->vmobjlock, RW_WRITE);
obj->base.uao->pgops->pgo_flush(obj->base.uao, 0, obj->base.size,
PGO_ALLPAGES | PGO_FREE);
+   rw_exit(obj->base.uao->vmobjlock);
 #endif
obj->mm.madv = __I915_MADV_PURGED;
obj->mm.pages = ERR_PTR(-EFAULT);
diff --git sys/dev/pci/drm/radeon/radeon_ttm.c 
sys/dev/pci/drm/radeon/radeon_ttm.c
index eb879b5c72c..837a9f94298 100644
--- sys/dev/pci/drm/radeon/radeon_ttm.c
+++ sys/dev/pci/drm/radeon/radeon_ttm.c
@@ -1006,6 +1006,8 @@ radeon_ttm_fault(struct uvm_faultinfo *ufi, vaddr_t 
vaddr, vm_page_t *pps,
struct radeon_device *rdev;
int r;
 
+   KASSERT(rw_write_held(ufi->entry->object.uvm_obj->vmobjlock));
+
bo = (struct drm_gem_object *)ufi->entry->object.uvm_obj;
rdev = bo->dev->dev_private;
	down_read(&rdev->pm.mclk_lock);
diff --git sys/uvm/uvm_aobj.c sys/uvm/uvm_aobj.c
index 20051d95dc1..127218c4c40 100644
--- sys/uvm/uvm_aobj.c
+++ sys/uvm/uvm_aobj.c
@@ -31,7 +31,7 @@
 /*
  * uvm_aobj.c: anonymous memory uvm_object pager
  *
- * author: Chuck Silvers 
+* author: Chuck Silvers 
  * started: Jan-1998
  *
  * - design mostly from Chuck Cranor
@@ -184,7 +184,7 @@ const struct uvm_pagerops aobj_pager = {
  * deadlock.
  */
 static LIST_HEAD(aobjlist, uvm_aobj) uao_list = 
LIST_HEAD_INITIALIZER(uao_list);
-static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_NONE);
+static struct mutex uao_list_lock = MUTEX_INITIALIZER(IPL_MPFLOOR);
 
 
 /*
@@ -277,6 +277,7 @@ uao_find_swslot(struct uvm_object *uobj, int pageidx)
  * uao_set_swslot: set the swap slot for a page in an aobj.
  *
  * => setting a slot to zero frees the slot
+ * => object must be locked by caller
  * => we return the old slot number, or -1 if we failed to allocate
  *memory to record the new slot number
  */
@@ -286,7 +287,7 @@ uao_set_swslot(struct uvm_object *uobj, int pageidx, int 
slot)
struct uvm_aobj *aobj = (struct uvm_aobj *)uobj;
int oldslot;
 
-   KERNEL_ASSERT_LOCKED();
+   KASSERT(rw_write_held(uobj->vmobjlock) || uobj->uo_refs == 0);
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
/*
@@ -358,7 +359,9 @@ uao_free(struct uvm_aobj *aobj)
	struct uvm_object *uobj = &aobj->u_obj;
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
+   KASSERT(rw_write_held(uobj->vmobjlock));
uao_dropswap_range(uobj, 0, 0);
+   rw_exit(uobj->vmobjlock);
 
if (UAO_USES_SWHASH(aobj)) {
/*
@@ -671,6 +674,7 @@ struct uvm_object *
 uao_create(vsize_t size, int flags)
 {
static struct uvm_aobj kernel_object_store;
+   static struct rwlock bootstrap_kernel_object_lock;
static int kobj_alloced = 0;
int pages = round_page(size) >> PAGE_SHIFT;
struct uvm_aobj *aobj;
@@ -742,6 +746,11 @@ uao_create(vsize_t size, int flags)
 * Initialise UVM object.
 */
	uvm_obj_init(&aobj->u_obj, &aobj_pager, refs);
+   if (flags & UAO_FLAG_KERNOBJ) {
+   /* Use a temporary static lock for kernel_object. */
+   rw_init(&bootstrap_kernel_object_lock, "kobjlk");
+   uvm_obj_setlock(&aobj->u_obj, &bootstrap_kernel_object_lock);
+   }
 
/*
 * now that aobj is ready, add it to the global list
@@ -822,20 +831,20 @@ uao_detach(struct uvm_object *uobj)
 * involved in is complete), release any swap resources and free
 * the page itself.
 */
-   uvm_lock_pageq();
-   while((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   rw_enter(uobj->vmobjlock, RW_WRITE);
+   while ((pg = RBT_ROOT(uvm_objtree, &uobj->memt)) != NULL) {
+   pmap_page_protect(pg, PROT_NONE);

Re: Retry sleep in poll/select

2021-11-18 Thread Martin Pieuchot
On 17/11/21(Wed) 09:51, Scott Cheloha wrote:
> > On Nov 17, 2021, at 03:22, Martin Pieuchot  wrote:
> > 
> > On 16/11/21(Tue) 13:55, Visa Hankala wrote:
> >> Currently, dopselect() and doppoll() call tsleep_nsec() without retry.
> >> cheloha@ asked if the functions should handle spurious wakeups. I guess
> >> such wakeups are unlikely with the nowake wait channel, but I am not
> >> sure if that is a safe guess.
> > 
I'm not sure I understand: are we afraid that a thread sleeping on `nowake'
can be awakened?  Is that the assumption here?
> 
> Yes, but I don't know how.

Then I'd suggest we start with understanding how this can happen, otherwise
I fear we are adding more complexity for reasons we don't understand.

> kettenis@ said spurious wakeups were
> possible on a similar loop in sigsuspend(2)
> so I mentioned this to visa@ off-list.

I don't understand how this can happen.

> If we added an assert to panic in wakeup(9)
> if the channel is &nowake, would that be
> sufficient?

I guess so.
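
Sketched, assuming `nowake' is reachable from where wakeup(9) lives:

	/* in wakeup_n(): nobody is ever allowed to wake this channel */
	KASSERT(ident != &nowake);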

> Ideally if you sleep on &nowake you should
> never get a zero status from the sleep
> functions.  It should be impossible… if that
> is possible to ensure.



Re: Retry sleep in poll/select

2021-11-17 Thread Martin Pieuchot
On 16/11/21(Tue) 13:55, Visa Hankala wrote:
> Currently, dopselect() and doppoll() call tsleep_nsec() without retry.
> cheloha@ asked if the functions should handle spurious wakeups. I guess
> such wakeups are unlikely with the nowake wait channel, but I am not
> sure if that is a safe guess.

I'm not sure I understand: are we afraid that a thread sleeping on `nowake'
can be awakened?  Is that the assumption here?

> The following diff adds the retrying. The code is a bit arduous, so the
> retry loop is put in a separate function that both poll and select use.

Using a separate function makes sense anyway.

> Index: kern/sys_generic.c
> ===
> RCS file: src/sys/kern/sys_generic.c,v
> retrieving revision 1.141
> diff -u -p -r1.141 sys_generic.c
> --- kern/sys_generic.c16 Nov 2021 13:48:23 -  1.141
> +++ kern/sys_generic.c16 Nov 2021 13:50:08 -
> @@ -90,6 +90,7 @@ int dopselect(struct proc *, int, fd_set
>  int doppoll(struct proc *, struct pollfd *, u_int, struct timespec *,
>  const sigset_t *, register_t *);
>  void doselwakeup(struct selinfo *);
> +int selsleep(struct timespec *);
>  
>  int
>  iovec_copyin(const struct iovec *uiov, struct iovec **iovp, struct iovec 
> *aiov,
> @@ -664,19 +665,7 @@ dopselect(struct proc *p, int nd, fd_set
>* there's nothing to wait for.
>*/
>   if (nevents == 0 && ncollected == 0) {
> - uint64_t nsecs = INFSLP;
> -
> - if (timeout != NULL) {
> - if (!timespecisset(timeout))
> - goto done;
> - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> - }
> - error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqsel", nsecs);
> - /* select is not restarted after signals... */
> - if (error == ERESTART)
> - error = EINTR;
> - if (error == EWOULDBLOCK)
> - error = 0;
> + error = selsleep(timeout);
>   goto done;
>   }
>  
> @@ -849,6 +838,46 @@ selfalse(dev_t dev, int events, struct p
>  }
>  
>  /*
> + * Sleep until a signal arrives or the optional timeout expires.
> + */
> +int
> +selsleep(struct timespec *timeout)
> +{
> + uint64_t end, now, nsecs;
> + int error;
> +
> + if (timeout != NULL) {
> + if (!timespecisset(timeout))
> + return (0);
> + now = getnsecuptime();
> + end = MIN(now + TIMESPEC_TO_NSEC(timeout), MAXTSLP);
> + if (end < now)
> + end = MAXTSLP;
> + }
> +
> + do {
> + if (timeout != NULL)
> + nsecs = MAX(1, end - now);
> + else
> + nsecs = INFSLP;
> + error = tsleep_nsec(&nowake, PSOCK | PCATCH, "selslp", nsecs);
> + if (timeout != NULL) {
> + now = getnsecuptime();
> + if (now >= end)
> + break;
> + }
> + } while (error == 0);
> +
> + /* poll/select is not restarted after signals... */
> + if (error == ERESTART)
> + error = EINTR;
> + if (error == EWOULDBLOCK)
> + error = 0;
> +
> + return (error);
> +}
> +
> +/*
>   * Record a select request.
>   */
>  void
> @@ -1158,19 +1187,7 @@ doppoll(struct proc *p, struct pollfd *f
>* there's nothing to wait for.
>*/
>   if (nevents == 0 && ncollected == 0) {
> - uint64_t nsecs = INFSLP;
> -
> - if (timeout != NULL) {
> - if (!timespecisset(timeout))
> - goto done;
> - nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
> - }
> -
> - error = tsleep_nsec(&nowake, PSOCK | PCATCH, "kqpoll", nsecs);
> - if (error == ERESTART)
> - error = EINTR;
> - if (error == EWOULDBLOCK)
> - error = 0;
> + error = selsleep(timeout);
>   goto done;
>   }
>  
> 



Re: bt.5 document count()

2021-11-16 Thread Martin Pieuchot
On 16/11/21(Tue) 11:07, Claudio Jeker wrote:
> This documents count(). This function only works when used like this
>   @map[key] = count();
> But it is implemented and works. If used differently you get a syntax
> error which is not helpful. This is why I chose to document it like this.
> Another option would be to document the language (so it is clear where it
> is possible to use what). 

ok mpi@

> max(), min() and sum() are other functions that behave like this. Their
> documentation should also be adjusted IMO.
> 
> -- 
> :wq Claudio
> 
> Index: bt.5
> ===
> RCS file: /cvs/src/usr.sbin/btrace/bt.5,v
> retrieving revision 1.13
> diff -u -p -r1.13 bt.5
> --- bt.5  12 Nov 2021 16:57:24 -  1.13
> +++ bt.5  16 Nov 2021 09:50:52 -
> @@ -120,6 +120,11 @@ Functions:
>  .It Fn clear "@map"
>  Delete all (key, value) pairs from
>  .Va @map .
> +.It "@map[key]" = Fn count
> +Increment the value of
> +.Va key
> +from
> +.Va @map .
>  .It Fn delete "@map[key]"
>  Delete the pair indexed by
>  .Va key
> 



Re: poll/select: Lazy removal of knotes

2021-11-06 Thread Martin Pieuchot
On 06/11/21(Sat) 15:53, Visa Hankala wrote:
> On Fri, Nov 05, 2021 at 10:04:50AM +0100, Martin Pieuchot wrote:
> > The new poll/select(2) implementation converts 'struct pollfd' and 'fdset' to
> > knotes (kqueue event descriptors), then passes them to the kqueue subsystem.
> > A knote is allocated, with kqueue_register(), for every read, write and
> > except condition watched on a given FD.  That means at most 3 allocations
> > might be necessary per FD.
> > 
> > The diff below reduces the overhead of per-syscall allocation/free of those
> > descriptors by leaving those which didn't trigger on the kqueue across
> > syscalls.  Leaving knotes on the kqueue allows kqueue_register() to re-use
> > an existing descriptor instead of allocating a new one.
> > 
> > With this knotes are now lazily removed.  The mechanism uses a serial
> > number which is incremented for every syscall that indicates if a knote
> > sitting in the kqueue is still valid or should be freed.
> > 
> > Note that performance improvements might not be visible with this diff
> > alone because kqueue_register() still pre-allocates a descriptor and then drops
> > it.
> > 
> > visa@ already pointed out that the lazy removal logic could be integrated
> > in kqueue_scan() which would reduce the complexity of those two syscalls.
> > I'm arguing for doing this in a next step in-tree.
> 
> I think it would make more sense to add the removal logic to the scan
> function first as doing so would keep the code modifications more
> logical and simpler. This would also avoid the need to go through
> a temporary removal approach.

I totally support your effort and your design; however, I don't have the
time to do another round of test/debugging.  So please, can you take
care of doing these cleanups afterward?  If not, please send a full diff
and take over this feature, it's too much effort for me to work out of
tree.

> Index: kern/kern_event.c
> ===
> RCS file: src/sys/kern/kern_event.c,v
> retrieving revision 1.170
> diff -u -p -r1.170 kern_event.c
> --- kern/kern_event.c 6 Nov 2021 05:48:47 -   1.170
> +++ kern/kern_event.c 6 Nov 2021 15:31:04 -
> @@ -73,6 +73,7 @@ voidkqueue_terminate(struct proc *p, st
>  void KQREF(struct kqueue *);
>  void KQRELE(struct kqueue *);
>  
> +void kqueue_purge(struct proc *, struct kqueue *);
>  int  kqueue_sleep(struct kqueue *, struct timespec *);
>  
>  int  kqueue_read(struct file *, struct uio *, int);
> @@ -806,6 +807,22 @@ kqpoll_exit(void)
>  }
>  
>  void
> +kqpoll_done(unsigned int num)
> +{
> + struct proc *p = curproc;
> +
> + KASSERT(p->p_kq != NULL);
> +
> + if (p->p_kq_serial + num >= p->p_kq_serial) {
> + p->p_kq_serial += num;
> + } else {
> + /* Clear all knotes after serial wraparound. */
> + kqueue_purge(p, p->p_kq);
> + p->p_kq_serial = 1;
> + }
> +}
> +
> +void
>  kqpoll_dequeue(struct proc *p, int all)
>  {
>   struct knote marker;
> @@ -1383,6 +1400,15 @@ retry:
>  
>   mtx_leave(>kq_lock);
>  
> + /* Drop spurious events. */
> + if (p->p_kq == kq &&
> + p->p_kq_serial > (unsigned long)kn->kn_udata) {
> + filter_detach(kn);
> + knote_drop(kn, p);
> + mtx_enter(>kq_lock);
> + continue;
> + }
> +
>   memset(kevp, 0, sizeof(*kevp));
>   if (filter_process(kn, kevp) == 0) {
>   mtx_enter(>kq_lock);
> Index: kern/sys_generic.c
> ===
> RCS file: src/sys/kern/sys_generic.c,v
> retrieving revision 1.139
> diff -u -p -r1.139 sys_generic.c
> --- kern/sys_generic.c29 Oct 2021 15:52:44 -  1.139
> +++ kern/sys_generic.c6 Nov 2021 15:31:04 -
> @@ -730,8 +730,7 @@ done:
>   if (pibits[0] != (fd_set *)[0])
>   free(pibits[0], M_TEMP, 6 * ni);
>  
> - kqueue_purge(p, p->p_kq);
> - p->p_kq_serial += nd;
> + kqpoll_done(nd);
>  
>   return (error);
>  }
> @@ -1230,8 +1229,7 @@ bad:
>   if (pl != pfds)
>   free(pl, M_TEMP, sz);
>  
> - kqueue_purge(p, p->p_kq);
> - p->p_kq_serial += nfds;
> + kqpoll_done(nfds);
>  
>   return (error);
>  }
> @@ -1251,8 +1249,7 @@ ppollcollect(struct proc *p, struct keve
>   /*
>* Lazily delete spurious events.
>*
> -   

poll/select: Lazy removal of knotes

2021-11-05 Thread Martin Pieuchot
The new poll/select(2) implementation converts 'struct pollfd' and 'fdset'
entries to knotes (kqueue event descriptors) and then passes them to the
kqueue subsystem.  A knote is allocated, with kqueue_register(), for every
read, write and except condition watched on a given FD.  That means at
most 3 allocations might be necessary per FD.

The diff below reduces the overhead of the per-syscall allocation/free of
those descriptors by leaving those which didn't trigger on the kqueue
across syscalls.  Leaving knotes on the kqueue allows kqueue_register() to
re-use existing descriptors instead of re-allocating new ones.

With this, knotes are now lazily removed.  The mechanism uses a serial
number which is incremented for every syscall and indicates whether a
knote sitting in the kqueue is still valid or should be freed.
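
For illustration, here is a minimal sketch of that check, mirroring the
kqueue_scan() hunk further down: a knote records the registering
syscall's serial in kn_udata, and anything older than the thread's
current serial is a leftover to drop.

	/* Sketch only: drop knotes left over from previous syscalls. */
	if (p->p_kq == kq &&
	    p->p_kq_serial > (unsigned long)kn->kn_udata) {
		filter_detach(kn);
		knote_drop(kn, p);	/* lazy removal happens here */
	}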

Note that performance improvements might not be visible with this diff
alone because kqueue_register() still pre-allocates a descriptor and then
drops it.

visa@ already pointed out that the lazy removal logic could be integrated
in kqueue_scan(), which would reduce the complexity of those two syscalls.
I'm arguing for doing this as a next step, in-tree.

Please test and review :)

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.139
diff -u -p -r1.139 sys_generic.c
--- kern/sys_generic.c  29 Oct 2021 15:52:44 -  1.139
+++ kern/sys_generic.c  5 Nov 2021 08:11:05 -
@@ -598,7 +598,7 @@ sys_pselect(struct proc *p, void *v, reg
 
 int
 dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex,
-struct timespec *timeout, const sigset_t *sigmask, register_t *retval)
+struct timespec *tsp, const sigset_t *sigmask, register_t *retval)
 {
struct kqueue_scan_state scan;
fd_mask bits[6];
@@ -666,10 +666,10 @@ dopselect(struct proc *p, int nd, fd_set
if (nevents == 0 && ncollected == 0) {
uint64_t nsecs = INFSLP;
 
-   if (timeout != NULL) {
-   if (!timespecisset(timeout))
+   if (tsp != NULL) {
+   if (!timespecisset(tsp))
goto done;
-   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
+   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(tsp), MAXTSLP));
}
error = tsleep_nsec(>p_kq, PSOCK | PCATCH, "kqsel", nsecs);
/* select is not restarted after signals... */
@@ -682,28 +682,37 @@ dopselect(struct proc *p, int nd, fd_set
 
/* Collect at most `nevents' possibly waiting in kqueue_scan() */
kqueue_scan_setup(, p->p_kq);
-   while (nevents > 0) {
+   while ((nevents - ncollected) > 0) {
struct kevent kev[KQ_NEVENTS];
int i, ready, count;
 
-   /* Maximum number of events per iteration */
-   count = MIN(nitems(kev), nevents);
-   ready = kqueue_scan(, count, kev, timeout, p, );
+   /*
+* Maximum number of events per iteration.  Use the whole
+* array to gather as many spurious events as possible.
+*/
+   count = nitems(kev);
+   ready = kqueue_scan(, count, kev, tsp, p, );
 #ifdef KTRACE
if (KTRPOINT(p, KTR_STRUCT))
ktrevent(p, kev, ready);
 #endif
-   /* Convert back events that are ready. */
+   /* Convert back events that are ready/delete spurious ones. */
for (i = 0; i < ready && error == 0; i++)
error = pselcollect(p, [i], pobits, );
+
/*
-* Stop if there was an error or if we had enough
-* space to collect all events that were ready.
+* Stop if there was an error or if we had enough space
+* to collect all non-spurious events that were ready.
 */
-   if (error || ready < count)
+   if (error || !ready || (ncollected > 0 && ready < count))
break;
 
-   nevents -= ready;
+   /*
+* If we only got spurious events try again repositioning
+* the marker.
+*/
+   if (ncollected == 0 && ((tsp == NULL) || timespecisset(tsp)))
+   scan.kqs_nevent = 0;
}
kqueue_scan_finish();
*retval = ncollected;
@@ -730,7 +739,7 @@ done:
if (pibits[0] != (fd_set *)[0])
free(pibits[0], M_TEMP, 6 * ni);
 
-   kqueue_purge(p, p->p_kq);
+   /* Needed to remove events lazily. */
p->p_kq_serial += nd;
 
return (error);
@@ -759,7 +768,7 @@ pselregister(struct proc *p, fd_set *pib
DPRINTFN(2, "select fd %d mask %d serial %lu\n",
fd, msk, p->p_kq_serial);

Re: UNIX sockets: use vnode(9) lock to protect `v_socket' dereference

2021-11-05 Thread Martin Pieuchot
On 26/10/21(Tue) 14:12, Vitaliy Makkoveev wrote:
> Another step to make UNIX sockets locking fine grained.
> 
> The listening socket has the references from file descriptors layer and
> from the vnode(9) layer. This means when we close(2)'ing such socket it
> still referenced by concurrent thread through connect(2) path.
> 
> When we bind(2) UNIX socket we link it to vnode(9) by assigning
> `v_socket'. When we connect(2)'ing socket to the socket we previously
> bind(2)'ed we finding it by namei(9) and obtain it's reference through
> `v_socket'. This socket has no extra reference in file descriptor
> layer and could be closed by concurrent thread.
> 
> This time we have `unp_lock' rwlock(9) which protects the whole layer
> and the dereference of `v_socket' is safe. But with the fine grained
> locking the `v_socket' will not be protected by global lock. When we
> obtain the vnode(9) by namei(9) in connect(9) or bind(9) paths it is
> already exclusively locked by vlode(9) lock. But in unp_detach() which
> is called on the close(2)'ing socket we assume `unp_lock' protects
> `v_socket'.
> 
> I propose to use exclusive vnode(9) lock to protect `v_socket'. With the
> fine grained locking, the `v_socket' dereference in unp_bind() or
> unp_connect() threads will be safe because unp_detach() thread will wait
> the vnode(9) lock release. The vnode referenced by `unp_vnod' has
> reference counter bumped so it's dereference is also safe without
> `unp_lock' held.

This makes sense to me.  Using the vnode lock here seems the simplest
approach.

> The `i_lock' should be take before `unp_lock' and unp_detach() should
> release solock(). To prevent connections on this socket the
> 'SO_ACCEPTCONN' bit cleared in soclose().

This is done to prevent races when solock() is released inside soabort(),
right?  Is it the only such case or is some more care needed?
Will this stay with per-socket locks or is this only necessary because of
the global `unp_lock'?

> Index: sys/kern/uipc_socket.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_socket.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 uipc_socket.c
> --- sys/kern/uipc_socket.c14 Oct 2021 23:05:10 -  1.265
> +++ sys/kern/uipc_socket.c26 Oct 2021 11:05:59 -
> @@ -315,6 +315,8 @@ soclose(struct socket *so, int flags)
>   /* Revoke async IO early. There is a final revocation in sofree(). */
>   sigio_free(>so_sigio);
>   if (so->so_options & SO_ACCEPTCONN) {
> + so->so_options &= ~SO_ACCEPTCONN;
> +
>   while ((so2 = TAILQ_FIRST(>so_q0)) != NULL) {
>   (void) soqremque(so2, 0);
>   (void) soabort(so2);
> Index: sys/kern/uipc_usrreq.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
> retrieving revision 1.150
> diff -u -p -r1.150 uipc_usrreq.c
> --- sys/kern/uipc_usrreq.c21 Oct 2021 22:11:07 -  1.150
> +++ sys/kern/uipc_usrreq.c26 Oct 2021 11:05:59 -
> @@ -474,20 +474,30 @@ void
>  unp_detach(struct unpcb *unp)
>  {
>   struct socket *so = unp->unp_socket;
> - struct vnode *vp = NULL;
> + struct vnode *vp = unp->unp_vnode;
>  
>   rw_assert_wrlock(_lock);
>  
>   LIST_REMOVE(unp, unp_link);
> - if (unp->unp_vnode) {
> +
> + if (vp) {
> + unp->unp_vnode = NULL;
> +
>   /*
> -  * `v_socket' is only read in unp_connect and
> -  * unplock prevents concurrent access.
> +  * Enforce `i_lock' -> `unp_lock' because fifo
> +  * subsystem requires it.
>*/
>  
> - unp->unp_vnode->v_socket = NULL;
> - vp = unp->unp_vnode;
> - unp->unp_vnode = NULL;
> + sounlock(so, SL_LOCKED);
> +
> + VOP_LOCK(vp, LK_EXCLUSIVE);
> + vp->v_socket = NULL;
> +
> + KERNEL_LOCK();
> + vput(vp);
> + KERNEL_UNLOCK();
> +
> + solock(so);
>   }
>  
>   if (unp->unp_conn)
> @@ -500,21 +510,6 @@ unp_detach(struct unpcb *unp)
>   pool_put(_pool, unp);
>   if (unp_rights)
>   task_add(systqmp, _gc_task);
> -
> - if (vp != NULL) {
> - /*
> -  * Enforce `i_lock' -> `unplock' because fifo subsystem
> -  * requires it. The socket can't be closed concurrently
> -  * because the file descriptor reference is
> -  * still hold.
> -  */
> -
> - sounlock(so, SL_LOCKED);
> - KERNEL_LOCK();
> - vrele(vp);
> - KERNEL_UNLOCK();
> - solock(so);
> - }
>  }
>  
>  int
> 



Re: UNIX sockets: make `unp_rights', `unp_msgcount' and `unp_file' atomic

2021-11-05 Thread Martin Pieuchot
On 30/10/21(Sat) 21:22, Vitaliy Makkoveev wrote:
> This completely removes global rwlock(9) from the unp_internalize() and
> unp_externalize() normal paths but only leaves it in unp_externalize()
> error path. Also we don't need to simultaneously hold both fdplock()
> and `unp_lock' in unp_internalize(). As non obvious profit this
> simplifies the future lock dances in the UNIX sockets layer.
> 
> It's safe to call fptounp() without `unp_lock' held. We always got this
> file descriptor by fd_getfile(9) so we always have the extra reference
> and this descriptor can't be closed by concurrent thread. Some sockets
> could be destroyed through 'PRU_ABORT' path but they don't have
> associated file descriptor and they are not accessible in the
> unp_internalize() path.
> 
> The `unp_file' access without `unp_lock' held is also safe. Each socket
> could have the only associated file descriptor and each file descriptor
> could have the only associated socket. We only assign `unp_file' in the
> unp_internalize() path where we got the socket by fd_getfile(9). This
> descriptor has the extra reference and couldn't be closed concurrently.
> We could override `unp_file' but with the same address because the
> associated file descriptor can't be changed so the address will be also
> the same. So while unp_gc() concurrently runs the dereference of
> non-NULL `unp_file' is always safe.

Using an atomic operation for `unp_msgcount' is ok with me, one comment
about `unp_rights' below.

> Index: sys/kern/uipc_usrreq.c
> ===
> RCS file: /cvs/src/sys/kern/uipc_usrreq.c,v
> retrieving revision 1.153
> diff -u -p -r1.153 uipc_usrreq.c
> --- sys/kern/uipc_usrreq.c30 Oct 2021 16:35:31 -  1.153
> +++ sys/kern/uipc_usrreq.c30 Oct 2021 18:41:25 -
> @@ -58,6 +58,7 @@
>   * Locks used to protect global data and struct members:
>   *  I   immutable after creation
>   *  U   unp_lock
> + *  a   atomic
>   */
>  struct rwlock unp_lock = RWLOCK_INITIALIZER("unplock");
>  
> @@ -99,7 +100,7 @@ SLIST_HEAD(,unp_deferral)  unp_deferred =
>   SLIST_HEAD_INITIALIZER(unp_deferred);
>  
>  ino_tunp_ino;/* [U] prototype for fake inode numbers */
> -int  unp_rights; /* [U] file descriptors in flight */
> +int  unp_rights; /* [a] file descriptors in flight */
>  int  unp_defer;  /* [U] number of deferred fp to close by the GC task */
>  int  unp_gcing;  /* [U] GC task currently running */
>  
> @@ -927,17 +928,16 @@ restart:
>*/
>   rp = (struct fdpass *)CMSG_DATA(cm);
>  
> - rw_enter_write(_lock);
>   for (i = 0; i < nfds; i++) {
>   struct unpcb *unp;
>  
>   fp = rp->fp;
>   rp++;
>   if ((unp = fptounp(fp)) != NULL)
> - unp->unp_msgcount--;
> - unp_rights--;
> + atomic_dec_long(>unp_msgcount);
>   }
> - rw_exit_write(_lock);
> +
> + atomic_sub_int(_rights, nfds);
>  
>   /*
>* Copy temporary array to message and adjust length, in case of
> @@ -985,13 +985,10 @@ unp_internalize(struct mbuf *control, st
>   return (EINVAL);
>   nfds = (cm->cmsg_len - CMSG_ALIGN(sizeof(*cm))) / sizeof (int);
>  
> - rw_enter_write(_lock);
> - if (unp_rights + nfds > maxfiles / 10) {
> - rw_exit_write(_lock);
> + if (atomic_add_int_nv(_rights, nfds) > maxfiles / 10) {
> + atomic_sub_int(_rights, nfds);

I don't believe this is race free.  If two threads, T1 and T2, call
atomic_add at the same time, both might end up returning EMFILE even
though only one of them actually exceeds the limit.  This could happen
if T1 exceeds the limit and T2 does its atomic_add on the already-exceeded
`unp_rights' before T1 could do its atomic_sub.

I suggest using a mutex to protect `unp_rights' instead to solve this
issue.
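
A minimal sketch of that suggestion, assuming a mutex dedicated to the
in-flight descriptor accounting (the `unp_rights_mtx' name is only
illustrative):

	/* Sketch only: serialize the limit check and the increment. */
	struct mutex unp_rights_mtx = MUTEX_INITIALIZER(IPL_NONE);

	mtx_enter(&unp_rights_mtx);
	if (unp_rights + nfds > maxfiles / 10) {
		mtx_leave(&unp_rights_mtx);
		return (EMFILE);
	}
	unp_rights += nfds;
	mtx_leave(&unp_rights_mtx);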

>   return (EMFILE);
>   }
> - unp_rights += nfds;
> - rw_exit_write(_lock);
>  
>   /* Make sure we have room for the struct file pointers */
>  morespace:
> @@ -1031,7 +1028,6 @@ morespace:
>   ip = ((int *)CMSG_DATA(cm)) + nfds - 1;
>   rp = ((struct fdpass *)CMSG_DATA(cm)) + nfds - 1;
>   fdplock(fdp);
> - rw_enter_write(_lock);
>   for (i = 0; i < nfds; i++) {
>   memcpy(, ip, sizeof fd);
>   ip--;
> @@ -1056,15 +1052,13 @@ morespace:
>   rp->flags = fdp->fd_ofileflags[fd] & UF_PLEDGED;
>   rp--;
>   if ((unp = fptounp(fp)) != NULL) {
> + atomic_inc_long(>unp_msgcount);
>   unp->unp_file = fp;
> - unp->unp_msgcount++;
>   }
>   }
> - rw_exit_write(_lock);
>   fdpunlock(fdp);
>   return (0);
>  fail:
> - rw_exit_write(_lock);
>   fdpunlock(fdp);
>   if (fp != NULL)
>   FRELE(fp, p);
> @@ -1072,17 +1066,13 @@ fail:
>   for ( ; i > 

Re: Please test: full poll/select(2) switch

2021-10-29 Thread Martin Pieuchot
On 29/10/21(Fri) 14:48, Alexandre Ratchov wrote:
> On Fri, Oct 29, 2021 at 01:12:06PM +0100, Martin Pieuchot wrote:
> > On 29/10/21(Fri) 13:12, Alexandre Ratchov wrote:
> > > On Sat, Oct 23, 2021 at 10:40:56AM +0100, Martin Pieuchot wrote:
> > > > Diff below switches both poll(2) and select(2) to the kqueue-based
> > > > implementation.
> > > > 
> > > > In addition it switches libevent(3) to use poll(2) by default for
> > > > testing purposes.
> > > > 
> > > > I don't have any open bug left with this diff and I'm happily running
> > > > GNOME with it.  So I'd be happy if you could try to break it and report
> > > > back.
> > > > 
> > > 
> > > Without the below diff (copied from audio(4) driver), kernel panics
> > > upon the first MIDI input byte.
> > 
> > What is the panic?  The mutex is taken recursively, right?
> >  
> 
> Exactly, this is the "locking against myself", panic.
> 
> AFAIU, the interrupt handler grabs the audio_lock and calls
> midi_iintr(). It calls selwakeup(), which in turn calls
> filt_midiread(), which attempts to grab the audio_lock a second time.
> 
> > > ok? suggestion for a better fix?
> > 
> > Without seeing the panic, I'm guessing this is correct.
> > 
> > That suggest kevent(2) wasn't safe to use with midi(4).
> > 
> 
> Yes, this is the very first time midi(4) is used with kevent(2).

Then this is correct, thanks a lot.  Please go ahead, ok mpi@



Re: Please test: full poll/select(2) switch

2021-10-29 Thread Martin Pieuchot
On 29/10/21(Fri) 13:12, Alexandre Ratchov wrote:
> On Sat, Oct 23, 2021 at 10:40:56AM +0100, Martin Pieuchot wrote:
> > Diff below switches both poll(2) and select(2) to the kqueue-based
> > implementation.
> > 
> > In addition it switches libevent(3) to use poll(2) by default for
> > testing purposes.
> > 
> > I don't have any open bug left with this diff and I'm happily running
> > GNOME with it.  So I'd be happy if you could try to break it and report
> > back.
> > 
> 
> Without the below diff (copied from audio(4) driver), kernel panics
> upon the first MIDI input byte.

What is the panic?  The mutex is taken recursively, right?
 
> ok? suggestion for a better fix?

Without seeing the panic, I'm guessing this is correct.

That suggests kevent(2) wasn't safe to use with midi(4).

> Index: midi.c
> ===
> RCS file: /cvs/src/sys/dev/midi.c,v
> retrieving revision 1.48
> diff -u -p -r1.48 midi.c
> --- midi.c25 Dec 2020 12:59:52 -  1.48
> +++ midi.c29 Oct 2021 11:09:47 -
> @@ -386,9 +386,11 @@ filt_midiread(struct knote *kn, long hin
>   struct midi_softc *sc = (struct midi_softc *)kn->kn_hook;
>   int retval;
>  
> - mtx_enter(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_enter(_lock);
>   retval = !MIDIBUF_ISEMPTY(>inbuf);
> - mtx_leave(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_leave(_lock);
>  
>   return (retval);
>  }
> @@ -409,9 +411,11 @@ filt_midiwrite(struct knote *kn, long hi
>   struct midi_softc *sc = (struct midi_softc *)kn->kn_hook;
>   intretval;
>  
> - mtx_enter(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_enter(_lock);
>   retval = !MIDIBUF_ISFULL(>outbuf);
> - mtx_leave(_lock);
> + if ((hint & NOTE_SUBMIT) == 0)
> + mtx_leave(_lock);
>  
>   return (retval);
>  }
> 
> 



Re: uvm_km_pgremove() tweak

2021-10-24 Thread Martin Pieuchot
On 24/10/21(Sun) 14:49, Martin Pieuchot wrote:
> Here's another small tweak I could extract from the UVM unlocking diff.
> This doesn't introduce any functional change. uvm_km_pgremove() is used
> in only one place.

Updated diff that also moves pmap_kremove() into the intrsafe variant to
be coherent, as pointed out by kettenis@.  This also reduces differences
with NetBSD.

ok?

Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.145
diff -u -p -r1.145 uvm_km.c
--- uvm/uvm_km.c15 Jun 2021 16:38:09 -  1.145
+++ uvm/uvm_km.c24 Oct 2021 14:08:42 -
@@ -239,8 +239,10 @@ uvm_km_suballoc(struct vm_map *map, vadd
  *the pages right away.(this gets called from uvm_unmap_...).
  */
 void
-uvm_km_pgremove(struct uvm_object *uobj, vaddr_t start, vaddr_t end)
+uvm_km_pgremove(struct uvm_object *uobj, vaddr_t startva, vaddr_t endva)
 {
+   const voff_t start = startva - vm_map_min(kernel_map);
+   const voff_t end = endva - vm_map_min(kernel_map);
struct vm_page *pp;
voff_t curoff;
int slot;
@@ -248,6 +250,7 @@ uvm_km_pgremove(struct uvm_object *uobj,
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
+   pmap_remove(pmap_kernel(), startva, endva);
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
@@ -301,6 +304,7 @@ uvm_km_pgremove_intrsafe(vaddr_t start, 
panic("uvm_km_pgremove_intrsafe: no page");
uvm_pagefree(pg);
}
+   pmap_kremove(start, end - start);
 }
 
 /*
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.278
diff -u -p -r1.278 uvm_map.c
--- uvm/uvm_map.c   5 Oct 2021 15:37:21 -   1.278
+++ uvm/uvm_map.c   24 Oct 2021 14:09:13 -
@@ -2116,8 +2116,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
/* Nothing to be done for holes. */
} else if (map->flags & VM_MAP_INTRSAFE) {
KASSERT(vm_map_pmap(map) == pmap_kernel());
+
uvm_km_pgremove_intrsafe(entry->start, entry->end);
-   pmap_kremove(entry->start, entry->end - entry->start);
} else if (UVM_ET_ISOBJ(entry) &&
UVM_OBJ_IS_KERN_OBJECT(entry->object.uvm_obj)) {
KASSERT(vm_map_pmap(map) == pmap_kernel());
@@ -2155,10 +2155,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
 * from the object.  offsets are always relative
 * to vm_map_min(kernel_map).
 */
-   pmap_remove(pmap_kernel(), entry->start, entry->end);
-   uvm_km_pgremove(entry->object.uvm_obj,
-   entry->start - vm_map_min(kernel_map),
-   entry->end - vm_map_min(kernel_map));
+   uvm_km_pgremove(entry->object.uvm_obj, entry->start,
+   entry->end);
 
/*
 * null out kernel_object reference, we've just



uvm_km_pgremove() tweak

2021-10-24 Thread Martin Pieuchot
Here's another small tweak I could extract from the UVM unlocking diff.
This doesn't introduce any functional change. uvm_km_pgremove() is used
in only one place.

Ok?

Index: uvm/uvm_km.c
===
RCS file: /cvs/src/sys/uvm/uvm_km.c,v
retrieving revision 1.145
diff -u -p -r1.145 uvm_km.c
--- uvm/uvm_km.c15 Jun 2021 16:38:09 -  1.145
+++ uvm/uvm_km.c24 Oct 2021 13:23:22 -
@@ -239,8 +239,10 @@ uvm_km_suballoc(struct vm_map *map, vadd
  *the pages right away.(this gets called from uvm_unmap_...).
  */
 void
-uvm_km_pgremove(struct uvm_object *uobj, vaddr_t start, vaddr_t end)
+uvm_km_pgremove(struct uvm_object *uobj, vaddr_t startva, vaddr_t endva)
 {
+   const voff_t start = startva - vm_map_min(kernel_map);
+   const voff_t end = endva - vm_map_min(kernel_map);
struct vm_page *pp;
voff_t curoff;
int slot;
@@ -248,6 +250,7 @@ uvm_km_pgremove(struct uvm_object *uobj,
 
KASSERT(UVM_OBJ_IS_AOBJ(uobj));
 
+   pmap_remove(pmap_kernel(), startva, endva);
for (curoff = start ; curoff < end ; curoff += PAGE_SIZE) {
pp = uvm_pagelookup(uobj, curoff);
if (pp && pp->pg_flags & PG_BUSY) {
Index: uvm/uvm_map.c
===
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.278
diff -u -p -r1.278 uvm_map.c
--- uvm/uvm_map.c   5 Oct 2021 15:37:21 -   1.278
+++ uvm/uvm_map.c   24 Oct 2021 13:24:21 -
@@ -2155,10 +2155,8 @@ uvm_unmap_kill_entry(struct vm_map *map,
 * from the object.  offsets are always relative
 * to vm_map_min(kernel_map).
 */
-   pmap_remove(pmap_kernel(), entry->start, entry->end);
-   uvm_km_pgremove(entry->object.uvm_obj,
-   entry->start - vm_map_min(kernel_map),
-   entry->end - vm_map_min(kernel_map));
+   uvm_km_pgremove(entry->object.uvm_obj, entry->start,
+   entry->end);
 
/*
 * null out kernel_object reference, we've just



More uvm_obj_destroy()

2021-10-23 Thread Martin Pieuchot
Diff below is extracted from the current UVM unlocking diff.  It adds a
couple of uvm_obj_destroy() calls and moves some uvm_obj_init() calls
around.

uvm_obj_destroy() will be used to release the memory of the, possibly
shared, lock allocated in uvm_obj_init().  When it is called the object
should no longer have any pages attached to it, that's why I added the
corresponding KASSERT().

The uvm_obj_init() calls have been moved to satisfy lock assertions and
to reduce differences with NetBSD.  The tricky one is for vnodes, which
are never freed.

Comments?  Oks?

Index: kern/vfs_subr.c
===
RCS file: /cvs/src/sys/kern/vfs_subr.c,v
retrieving revision 1.309
diff -u -p -r1.309 vfs_subr.c
--- kern/vfs_subr.c 21 Oct 2021 09:59:14 -  1.309
+++ kern/vfs_subr.c 23 Oct 2021 09:53:48 -
@@ -410,6 +410,7 @@ getnewvnode(enum vtagtype tag, struct mo
vp = pool_get(_pool, PR_WAITOK | PR_ZERO);
vp->v_uvm = pool_get(_vnode_pool, PR_WAITOK | PR_ZERO);
vp->v_uvm->u_vnode = vp;
+   uvm_obj_init(>v_uvm->u_obj, _vnodeops, 0);
RBT_INIT(buf_rb_bufs, >v_bufs_tree);
cache_tree_init(>v_nc_tree);
TAILQ_INIT(>v_cache_dst);
Index: uvm/uvm_aobj.c
===
RCS file: /cvs/src/sys/uvm/uvm_aobj.c,v
retrieving revision 1.99
diff -u -p -r1.99 uvm_aobj.c
--- uvm/uvm_aobj.c  28 Jun 2021 11:19:01 -  1.99
+++ uvm/uvm_aobj.c  23 Oct 2021 09:52:02 -
@@ -372,6 +372,7 @@ uao_free(struct uvm_aobj *aobj)
/*
 * finally free the aobj itself
 */
+   uvm_obj_destroy(uobj);
pool_put(_aobj_pool, aobj);
 }
 
Index: uvm/uvm_device.c
===
RCS file: /cvs/src/sys/uvm/uvm_device.c,v
retrieving revision 1.64
diff -u -p -r1.64 uvm_device.c
--- uvm/uvm_device.c29 Jun 2021 01:46:35 -  1.64
+++ uvm/uvm_device.c23 Oct 2021 09:49:16 -
@@ -182,6 +182,7 @@ udv_attach(dev_t device, vm_prot_t acces
mtx_leave(_lock);
/* NOTE: we could sleep in the following malloc() */
udv = malloc(sizeof(*udv), M_TEMP, M_WAITOK);
+   uvm_obj_init(>u_obj, _deviceops, 1);
mtx_enter(_lock);
 
/*
@@ -199,6 +200,7 @@ udv_attach(dev_t device, vm_prot_t acces
 */
if (lcv) {
mtx_leave(_lock);
+   uvm_obj_destroy(>u_obj);
free(udv, M_TEMP, sizeof(*udv));
continue;
}
@@ -207,7 +209,6 @@ udv_attach(dev_t device, vm_prot_t acces
 * we have it!   init the data structures, add to list
 * and return.
 */
-   uvm_obj_init(>u_obj, _deviceops, 1);
udv->u_flags = 0;
udv->u_device = device;
LIST_INSERT_HEAD(_list, udv, u_list);
@@ -275,6 +276,8 @@ again:
if (udv->u_flags & UVM_DEVICE_WANTED)
wakeup(udv);
mtx_leave(_lock);
+
+   uvm_obj_destroy(uobj);
free(udv, M_TEMP, sizeof(*udv));
 }
 
Index: uvm/uvm_object.c
===
RCS file: /cvs/src/sys/uvm/uvm_object.c,v
retrieving revision 1.21
diff -u -p -r1.21 uvm_object.c
--- uvm/uvm_object.c12 Oct 2021 18:16:51 -  1.21
+++ uvm/uvm_object.c23 Oct 2021 09:49:57 -
@@ -66,9 +66,13 @@ uvm_obj_init(struct uvm_object *uobj, co
uobj->uo_refs = refs;
 }
 
+/*
+ * uvm_obj_destroy: destroy UVM memory object.
+ */
 void
 uvm_obj_destroy(struct uvm_object *uo)
 {
+   KASSERT(RBT_EMPTY(uvm_objtree, >memt));
 }
 
 #ifndef SMALL_KERNEL
Index: uvm/uvm_vnode.c
===
RCS file: /cvs/src/sys/uvm/uvm_vnode.c,v
retrieving revision 1.118
diff -u -p -r1.118 uvm_vnode.c
--- uvm/uvm_vnode.c 20 Oct 2021 06:35:40 -  1.118
+++ uvm/uvm_vnode.c 23 Oct 2021 09:56:32 -
@@ -229,7 +229,8 @@ uvn_attach(struct vnode *vp, vm_prot_t a
 #endif
 
/* now set up the uvn. */
-   uvm_obj_init(>u_obj, _vnodeops, 1);
+   KASSERT(uvn->u_obj.uo_refs == 0);
+   uvn->u_obj.uo_refs++;
oldflags = uvn->u_flags;
uvn->u_flags = UVM_VNODE_VALID|UVM_VNODE_CANPERSIST;
uvn->u_nio = 0;



Please test: full poll/select(2) switch

2021-10-23 Thread Martin Pieuchot
Diff below switches both poll(2) and select(2) to the kqueue-based
implementation.

In addition it switches libevent(3) to use poll(2) by default for
testing purposes.

I don't have any open bug left with this diff and I'm happily running
GNOME with it.  So I'd be happy if you could try to break it and report
back.

Index: lib/libevent/event.c
===
RCS file: /cvs/src/lib/libevent/event.c,v
retrieving revision 1.41
diff -u -p -r1.41 event.c
--- lib/libevent/event.c1 May 2019 19:14:25 -   1.41
+++ lib/libevent/event.c23 Oct 2021 09:36:10 -
@@ -53,9 +53,9 @@ extern const struct eventop kqops;
 
 /* In order of preference */
 static const struct eventop *eventops[] = {
-   ,
,
,
+   ,
NULL
 };
 
Index: sys/kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.137
diff -u -p -r1.137 sys_generic.c
--- sys/kern/sys_generic.c  15 Oct 2021 06:59:57 -  1.137
+++ sys/kern/sys_generic.c  23 Oct 2021 09:14:59 -
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef KTRACE
 #include 
 #endif
@@ -66,8 +67,23 @@
 
 #include 
 
-int selscan(struct proc *, fd_set *, fd_set *, int, int, register_t *);
-void pollscan(struct proc *, struct pollfd *, u_int, register_t *);
+/*
+ * Debug values:
+ *  1 - print implementation errors, things that should not happen.
+ *  2 - print ppoll(2) information, somewhat verbose
+ *  3 - print pselect(2) and ppoll(2) information, very verbose
+ */
+int kqpoll_debug = 0;
+#define DPRINTFN(v, x...) if (kqpoll_debug > v) {  \
+   printf("%s(%d): ", curproc->p_p->ps_comm, curproc->p_tid);  \
+   printf(x);  \
+}
+
+int pselregister(struct proc *, fd_set *[], fd_set *[], int, int *, int *);
+int pselcollect(struct proc *, struct kevent *, fd_set *[], int *);
+int ppollregister(struct proc *, struct pollfd *, int, int *);
+int ppollcollect(struct proc *, struct kevent *, struct pollfd *, u_int);
+
 int pollout(struct pollfd *, struct pollfd *, u_int);
 int dopselect(struct proc *, int, fd_set *, fd_set *, fd_set *,
 struct timespec *, const sigset_t *, register_t *);
@@ -584,11 +600,10 @@ int
 dopselect(struct proc *p, int nd, fd_set *in, fd_set *ou, fd_set *ex,
 struct timespec *timeout, const sigset_t *sigmask, register_t *retval)
 {
+   struct kqueue_scan_state scan;
fd_mask bits[6];
fd_set *pibits[3], *pobits[3];
-   struct timespec elapsed, start, stop;
-   uint64_t nsecs;
-   int s, ncoll, error = 0;
+   int error, ncollected = 0, nevents = 0;
u_int ni;
 
if (nd < 0)
@@ -618,6 +633,8 @@ dopselect(struct proc *p, int nd, fd_set
pobits[2] = (fd_set *)[5];
}
 
+   kqpoll_init();
+
 #definegetbits(name, x) \
if (name && (error = copyin(name, pibits[x], ni))) \
goto done;
@@ -636,43 +653,61 @@ dopselect(struct proc *p, int nd, fd_set
if (sigmask)
dosigsuspend(p, *sigmask &~ sigcantmask);
 
-retry:
-   ncoll = nselcoll;
-   atomic_setbits_int(>p_flag, P_SELECT);
-   error = selscan(p, pibits[0], pobits[0], nd, ni, retval);
-   if (error || *retval)
+   /* Register kqueue events */
+   error = pselregister(p, pibits, pobits, nd, , );
+   if (error != 0)
goto done;
-   if (timeout == NULL || timespecisset(timeout)) {
-   if (timeout != NULL) {
-   getnanouptime();
-   nsecs = MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP);
-   } else
-   nsecs = INFSLP;
-   s = splhigh();
-   if ((p->p_flag & P_SELECT) == 0 || nselcoll != ncoll) {
-   splx(s);
-   goto retry;
-   }
-   atomic_clearbits_int(>p_flag, P_SELECT);
-   error = tsleep_nsec(, PSOCK | PCATCH, "select", nsecs);
-   splx(s);
+
+   /*
+* The poll/select family of syscalls has been designed to
+* block when file descriptors are not available, even if
+* there's nothing to wait for.
+*/
+   if (nevents == 0 && ncollected == 0) {
+   uint64_t nsecs = INFSLP;
+
if (timeout != NULL) {
-   getnanouptime();
-   timespecsub(, , );
-   timespecsub(timeout, , timeout);
-   if (timeout->tv_sec < 0)
-   timespecclear(timeout);
+   if (!timespecisset(timeout))
+   goto done;
+   nsecs = MAX(1, MIN(TIMESPEC_TO_NSEC(timeout), MAXTSLP));
}
-   if (error == 0 || error == EWOULDBLOCK)
-  

Re: xhci uhub on arm64: handle device in SS_INACTIVE state

2021-10-22 Thread Martin Pieuchot
On 17/10/21(Sun) 09:06, Christopher Zimmermann wrote:
> Hi,
> 
> on my RK3399, a usb device connected to the USB 3 port is not detected
> during boot because it is in SS_INACTIVE (0x00c0) state:
> 
> uhub3 at usb3 configuration 1 interface 0 "Generic xHCI root hub" rev
> 3.00/1.00 addr 1
> uhub3: uhub_attach
> uhub3: 2 ports with 2 removable, self powered
> uhub3: intr status=0
> usb_needs_explore: usb3: not exploring before first explore
> uhub3: uhub_explore
> uhub3: port 1 status=0x02a0 change=0x
> uhub3: port 2 status=0x02c0 change=0x0040
> usb_explore: usb3: first explore done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x02c0 change=0x0040
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x02c0 change=0x0040
> 
> [...]
> 
> [turn the usb device off and on again]
> 
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x0203 change=0x0001
> usbd_reset_port: port 2 reset done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: port 2 status=0x0203 change=0x
> umass0 at uhub3 port 2 configuration 1 interface 0 "ATEC Dual Disk Drive" rev 
> 3.00/1.08 addr 2
> 
> This might be because u-boot-aarch64-2021.10 from packages left it in that
> state.
> I added this code to reset a device locked in such a state:

It's not clear to me whether this is a warm reset or not.  If the port
is in SS.Inactive it needs a warm reset, no?

If so, could you add a comment on top of the block?  I'd also suggest
moving the block after the "warm reset change" (BH_PORT_RESET); this
might matter and would at least match Linux's logic.

I wish you could add the logic to properly check if a warm reset is
required by checking the proper bits against the port number, but we
can rely on UPS_C_PORT_LINK_STATE for now and do that in a second step.
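
A minimal sketch of what such a check could look like, written as a
hypothetical helper and reusing only the status bits already mentioned
in this thread:

	/* Hypothetical helper, sketch only: does this port need a warm reset? */
	static int
	uhub_port_needs_warm_reset(struct uhub_softc *sc, int status)
	{
		int ls = UPS_PORT_LS_GET(status);

		return (sc->sc_hub->speed == USB_SPEED_SUPER &&
		    (ls == UPS_PORT_LS_SS_INACTIVE ||
		     ls == UPS_PORT_LS_COMP_MOD));
	}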

Comments below:

> Index: uhub.c
> ===
> RCS file: /cvs/src/sys/dev/usb/uhub.c,v
> retrieving revision 1.95
> diff -u -p -r1.95 uhub.c
> --- uhub.c  31 Jul 2020 10:49:33 -  1.95
> +++ uhub.c  17 Oct 2021 06:44:14 -
> @@ -414,6 +414,24 @@ uhub_explore(struct usbd_device *dev)
> change |= UPS_C_CONNECT_STATUS;
> }
> 
> +   if (change & UPS_C_PORT_LINK_STATE &&
> +   UPS_PORT_LS_GET(status) == UPS_PORT_LS_SS_INACTIVE &&

This should check for the speed of the HUB: 

sc->sc_hub->speed == USB_SPEED_SUPER

Should we also check if the link state is UPS_PORT_LS_COMP_MOD?

> +   ! (status & UPS_CURRENT_CONNECT_STATUS)) {
   ^
 Please drop the space here

> +   DPRINTF("%s: port %d is in in SS_INACTIVE.Quiet 
> state. "
> + "Reset port.\n",
> + sc->sc_dev.dv_xname, port);
> +   usbd_clear_port_feature(sc->sc_hub, port,
> +   UHF_C_PORT_RESET);
> +
> +   if (usbd_reset_port(sc->sc_hub, port)) {
> +   printf("%s: port %d reset failed\n",
> + DEVNAME(sc), port);
> +   return (-1);
> +   }
> +
> +   change |= UPS_C_CONNECT_STATUS;
> +   }
> +
> if (change & UPS_C_BH_PORT_RESET &&
> sc->sc_hub->speed == USB_SPEED_SUPER) {
> usbd_clear_port_feature(sc->sc_hub, port,
> 
> 
> Now the device attaches during boot. A redundant second reset of the device
> is performed during uhub_port_connect():
> 
> uhub3 at usb3 configuration 1 interface 0 "Generic xHCI root hub" rev
> 3.00/1.00 addr 1
> uhub3: uhub_attach
> uhub3: 2 ports with 2 removable, self powered
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> usb_needs_explore: usb3: not exploring before first explore
> uhub3: uhub_explore
> uhub3: port 1 status=0x02a0 change=0x
> uhub3: port 2 status=0x02c0 change=0x0040
> uhub3: port 2 is in in SS_INACTIVE.Quiet state. Reset port.
> usbd_reset_port: port 2 reset done
> usb_explore: usb3: first explore done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: uhub_explore
> uhub3: port 2 status=0x0203 change=0x0031
> uhub3: uhub_port_connect
> usbd_reset_port: port 2 reset done
> xhci1: port=2 change=0x04
> uhub3: intr status=0
> uhub3: port 2 status=0x0203 change=0x
> umass0 at uhub3 port 2 configuration 1 interface 0 "ATEC Dual Disk Drive" rev 
> 3.00/1.08 addr 2
> 
> 
> OK to commit this diff? Or should this be done some other way?
> 
> 
> Christopher
> 



Re: Make pipe event filters MP-safe

2021-10-22 Thread Martin Pieuchot
On 22/10/21(Fri) 13:15, Visa Hankala wrote:
> This diff makes pipe event filters ready to run without the kernel lock.
> The code pattern in the callbacks is the same as in sockets. Pipes
> have a klist lock already.
> 
> So far, pipe event filters have used read-locking. The patch changes
> that to write-locking for clarity. This should not be a real loss,
> though, because the lock is fine-grained and there is little multiple-
> readers parallelism to be utilized.

The removal of the KERNEL_LOCK() in pipeselwakeup() makes me very happy.
As found with patrick@, this accounted for a non-negligible amount of
spinning time in:
  https://undeadly.org/features/2021/09/2ytHD+googlemap_arm64.svg

ok mpi@

> Index: kern/sys_pipe.c
> ===
> RCS file: src/sys/kern/sys_pipe.c,v
> retrieving revision 1.127
> diff -u -p -r1.127 sys_pipe.c
> --- kern/sys_pipe.c   22 Oct 2021 05:00:26 -  1.127
> +++ kern/sys_pipe.c   22 Oct 2021 12:17:57 -
> @@ -78,20 +78,30 @@ static const struct fileops pipeops = {
>  
>  void filt_pipedetach(struct knote *kn);
>  int  filt_piperead(struct knote *kn, long hint);
> +int  filt_pipereadmodify(struct kevent *kev, struct knote *kn);
> +int  filt_pipereadprocess(struct knote *kn, struct kevent *kev);
> +int  filt_piperead_common(struct knote *kn, struct pipe *rpipe);
>  int  filt_pipewrite(struct knote *kn, long hint);
> +int  filt_pipewritemodify(struct kevent *kev, struct knote *kn);
> +int  filt_pipewriteprocess(struct knote *kn, struct kevent *kev);
> +int  filt_pipewrite_common(struct knote *kn, struct pipe *rpipe);
>  
>  const struct filterops pipe_rfiltops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_pipedetach,
>   .f_event= filt_piperead,
> + .f_modify   = filt_pipereadmodify,
> + .f_process  = filt_pipereadprocess,
>  };
>  
>  const struct filterops pipe_wfiltops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_pipedetach,
>   .f_event= filt_pipewrite,
> + .f_modify   = filt_pipewritemodify,
> + .f_process  = filt_pipewriteprocess,
>  };
>  
>  /*
> @@ -362,9 +372,7 @@ pipeselwakeup(struct pipe *cpipe)
>   cpipe->pipe_state &= ~PIPE_SEL;
>   selwakeup(>pipe_sel);
>   } else {
> - KERNEL_LOCK();
> - KNOTE(>pipe_sel.si_note, NOTE_SUBMIT);
> - KERNEL_UNLOCK();
> + KNOTE(>pipe_sel.si_note, 0);
>   }
>  
>   if (cpipe->pipe_state & PIPE_ASYNC)
> @@ -929,45 +937,76 @@ filt_pipedetach(struct knote *kn)
>  }
>  
>  int
> -filt_piperead(struct knote *kn, long hint)
> +filt_piperead_common(struct knote *kn, struct pipe *rpipe)
>  {
> - struct pipe *rpipe = kn->kn_fp->f_data, *wpipe;
> - struct rwlock *lock = rpipe->pipe_lock;
> + struct pipe *wpipe;
> +
> + rw_assert_wrlock(rpipe->pipe_lock);
>  
> - if ((hint & NOTE_SUBMIT) == 0)
> - rw_enter_read(lock);
>   wpipe = pipe_peer(rpipe);
>  
>   kn->kn_data = rpipe->pipe_buffer.cnt;
>  
>   if ((rpipe->pipe_state & PIPE_EOF) || wpipe == NULL) {
> - if ((hint & NOTE_SUBMIT) == 0)
> - rw_exit_read(lock);
>   kn->kn_flags |= EV_EOF; 
>   if (kn->kn_flags & __EV_POLL)
>   kn->kn_flags |= __EV_HUP;
>   return (1);
>   }
>  
> - if ((hint & NOTE_SUBMIT) == 0)
> - rw_exit_read(lock);
> -
>   return (kn->kn_data > 0);
>  }
>  
>  int
> -filt_pipewrite(struct knote *kn, long hint)
> +filt_piperead(struct knote *kn, long hint)
>  {
> - struct pipe *rpipe = kn->kn_fp->f_data, *wpipe;
> - struct rwlock *lock = rpipe->pipe_lock;
> + struct pipe *rpipe = kn->kn_fp->f_data;
> +
> + return (filt_piperead_common(kn, rpipe));
> +}
> +
> +int
> +filt_pipereadmodify(struct kevent *kev, struct knote *kn)
> +{
> + struct pipe *rpipe = kn->kn_fp->f_data;
> + int active;
> +
> + rw_enter_write(rpipe->pipe_lock);
> + knote_modify(kev, kn);
> + active = filt_piperead_common(kn, rpipe);
> + rw_exit_write(rpipe->pipe_lock);
> +
> + return (active);
> +}
> +
> +int
> +filt_pipereadprocess(struct knote *kn, struct kevent *kev)
> +{
> + struct pipe *rpipe = kn->kn_fp->f_data;
> + int active;
> +
> + rw_enter_write(rpipe->pipe_lock);
> + if (kev != NULL && (kn->kn_flags & EV_ONESHOT))
> + active = 1;
> + else
> + active = filt_piperead_common(kn, rpipe);
> + if (active)
> + knote_submit(kn, kev);
> + rw_exit_write(rpipe->pipe_lock);
> +
> + return (active);
> +}
> +
> +int
> +filt_pipewrite_common(struct knote *kn, struct pipe *rpipe)
> +{
> + struct pipe *wpipe;
> +
> + 

Re: Set klist lock for sockets, v2

2021-10-22 Thread Martin Pieuchot
On 22/10/21(Fri) 13:11, Visa Hankala wrote:
> Here is another attempt to set klist lock for sockets. This is a revised
> version of a patch that I posted in January [1].
> 
> Using solock() for the klists is probably the easiest way at the time
> being. However, the lock is a potential point of contention because of
> the underlying big-lock design. The increase of overhead is related to
> adding and removing event registrations. With persistent registrations
> the overhead is unchanged.
> 
> As a result, socket and named FIFO event filters should be ready to run
> without the kernel lock. The f_event, f_modify and f_process callbacks
> should be MP-safe already.
> 
> [1] https://marc.info/?l=openbsd-tech=160986578724696
> 
> OK?

I've been running with this and unlocked sowakeup() for quite some time
now.

ok mpi@

> Index: kern/uipc_socket.c
> ===
> RCS file: src/sys/kern/uipc_socket.c,v
> retrieving revision 1.265
> diff -u -p -r1.265 uipc_socket.c
> --- kern/uipc_socket.c14 Oct 2021 23:05:10 -  1.265
> +++ kern/uipc_socket.c22 Oct 2021 12:17:57 -
> @@ -84,7 +84,7 @@ int filt_solistenprocess(struct knote *k
>  int  filt_solisten_common(struct knote *kn, struct socket *so);
>  
>  const struct filterops solisten_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sordetach,
>   .f_event= filt_solisten,
> @@ -93,7 +93,7 @@ const struct filterops solisten_filtops 
>  };
>  
>  const struct filterops soread_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sordetach,
>   .f_event= filt_soread,
> @@ -102,7 +102,7 @@ const struct filterops soread_filtops = 
>  };
>  
>  const struct filterops sowrite_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sowdetach,
>   .f_event= filt_sowrite,
> @@ -111,7 +111,7 @@ const struct filterops sowrite_filtops =
>  };
>  
>  const struct filterops soexcept_filtops = {
> - .f_flags= FILTEROP_ISFD,
> + .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
>   .f_attach   = NULL,
>   .f_detach   = filt_sordetach,
>   .f_event= filt_soread,
> @@ -169,6 +169,8 @@ socreate(int dom, struct socket **aso, i
>   return (EPROTOTYPE);
>   so = pool_get(_pool, PR_WAITOK | PR_ZERO);
>   rw_init(>so_lock, "solock");
> + klist_init(>so_rcv.sb_sel.si_note, _klistops, so);
> + klist_init(>so_snd.sb_sel.si_note, _klistops, so);
>   sigio_init(>so_sigio);
>   TAILQ_INIT(>so_q0);
>   TAILQ_INIT(>so_q);
> @@ -258,6 +260,8 @@ sofree(struct socket *so, int s)
>   }
>   }
>   sigio_free(>so_sigio);
> + klist_free(>so_rcv.sb_sel.si_note);
> + klist_free(>so_snd.sb_sel.si_note);
>  #ifdef SOCKET_SPLICE
>   if (so->so_sp) {
>   if (issplicedback(so)) {
> @@ -2038,9 +2042,9 @@ soo_kqfilter(struct file *fp, struct kno
>  {
>   struct socket *so = kn->kn_fp->f_data;
>   struct sockbuf *sb;
> + int s;
>  
> - KERNEL_ASSERT_LOCKED();
> -
> + s = solock(so);
>   switch (kn->kn_filter) {
>   case EVFILT_READ:
>   if (so->so_options & SO_ACCEPTCONN)
> @@ -2058,10 +2062,12 @@ soo_kqfilter(struct file *fp, struct kno
>   sb = >so_rcv;
>   break;
>   default:
> + sounlock(so, s);
>   return (EINVAL);
>   }
>  
>   klist_insert_locked(>sb_sel.si_note, kn);
> + sounlock(so, s);
>  
>   return (0);
>  }
> @@ -2071,9 +2077,7 @@ filt_sordetach(struct knote *kn)
>  {
>   struct socket *so = kn->kn_fp->f_data;
>  
> - KERNEL_ASSERT_LOCKED();
> -
> - klist_remove_locked(>so_rcv.sb_sel.si_note, kn);
> + klist_remove(>so_rcv.sb_sel.si_note, kn);
>  }
>  
>  int
> @@ -2159,9 +2163,7 @@ filt_sowdetach(struct knote *kn)
>  {
>   struct socket *so = kn->kn_fp->f_data;
>  
> - KERNEL_ASSERT_LOCKED();
> -
> - klist_remove_locked(>so_snd.sb_sel.si_note, kn);
> + klist_remove(>so_snd.sb_sel.si_note, kn);
>  }
>  
>  int
> @@ -2284,6 +2286,36 @@ filt_solistenprocess(struct knote *kn, s
>   return (rv);
>  }
>  
> +void
> +klist_soassertlk(void *arg)
> +{
> + struct socket *so = arg;
> +
> + soassertlocked(so);
> +}
> +
> +int
> +klist_solock(void *arg)
> +{
> + struct socket *so = arg;
> +
> + return (solock(so));
> +}
> +
> +void
> +klist_sounlock(void *arg, int ls)
> +{
> + struct socket *so = arg;
> +
> + sounlock(so, ls);
> +}
> +
> +const struct klistops socket_klistops = {
> + .klo_assertlk   = klist_soassertlk,
> + .klo_lock   = 

POLLHUP vs EVFILT_EXCEPT semantic

2021-10-22 Thread Martin Pieuchot
Last year we added the new EVFILT_EXCEPT filter type to kqueue in
order to report conditions currently available via POLLPRI/POLLRDBAND
in poll(2) and select(2).

This new filter has been implemented in tty and socket by re-using the
existing kqueue "read" filter.  This has a downside: the filter will also
trigger if any data is available for reading.

This "feature" makes it impossible to correctly implement poll(2)'s
"empty" condition mode.  If no bit are set in the `events' pollfd
structure we still need to return POLLHUP.  But if the filter triggers
when there's data to read, it means POLLIN not POLLHUP.

So I'd like to change the existing EVFILT_EXCEPT filters to no longer
fire if there is something to read.  Diff below does that and adds a
new filter for FIFOs necessary for poll(2) support.
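
With that semantic, a userland consumer interested only in exceptional
conditions would register the filter along these lines (hypothetical
snippet; `fd' could be e.g. a pty master, `kq' an existing kqueue, and
it needs <sys/types.h>, <sys/event.h> and <err.h>):

	struct kevent kev;

	/* Watch for OOB/exceptional conditions only, not readability. */
	EV_SET(&kev, fd, EVFILT_EXCEPT, EV_ADD, NOTE_OOB, 0, NULL);
	if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
		err(1, "kevent");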

Ok?

Index: kern/tty_pty.c
===
RCS file: /cvs/src/sys/kern/tty_pty.c,v
retrieving revision 1.108
diff -u -p -r1.108 tty_pty.c
--- kern/tty_pty.c  8 Feb 2021 09:18:30 -   1.108
+++ kern/tty_pty.c  22 Oct 2021 12:49:12 -
@@ -107,6 +107,7 @@ voidfilt_ptcrdetach(struct knote *);
 intfilt_ptcread(struct knote *, long);
 void   filt_ptcwdetach(struct knote *);
 intfilt_ptcwrite(struct knote *, long);
+intfilt_ptcexcept(struct knote *, long);
 
 static struct pt_softc **ptyarralloc(int);
 static int check_pty(int);
@@ -670,16 +671,6 @@ filt_ptcread(struct knote *kn, long hint
tp = pti->pt_tty;
kn->kn_data = 0;
 
-   if (kn->kn_sfflags & NOTE_OOB) {
-   /* If in packet or user control mode, check for data. */
-   if (((pti->pt_flags & PF_PKT) && pti->pt_send) ||
-   ((pti->pt_flags & PF_UCNTL) && pti->pt_ucntl)) {
-   kn->kn_fflags |= NOTE_OOB;
-   kn->kn_data = 1;
-   return (1);
-   }
-   return (0);
-   }
if (ISSET(tp->t_state, TS_ISOPEN)) {
if (!ISSET(tp->t_state, TS_TTSTOP))
kn->kn_data = tp->t_outq.c_cc;
@@ -731,6 +722,34 @@ filt_ptcwrite(struct knote *kn, long hin
return (kn->kn_data > 0);
 }
 
+int
+filt_ptcexcept(struct knote *kn, long hint)
+{
+   struct pt_softc *pti = (struct pt_softc *)kn->kn_hook;
+   struct tty *tp;
+
+   tp = pti->pt_tty;
+
+   if (kn->kn_sfflags & NOTE_OOB) {
+   /* If in packet or user control mode, check for data. */
+   if (((pti->pt_flags & PF_PKT) && pti->pt_send) ||
+   ((pti->pt_flags & PF_UCNTL) && pti->pt_ucntl)) {
+   kn->kn_fflags |= NOTE_OOB;
+   kn->kn_data = 1;
+   return (1);
+   }
+   return (0);
+   }
+   if (!ISSET(tp->t_state, TS_CARR_ON)) {
+   kn->kn_flags |= EV_EOF;
+   if (kn->kn_flags & __EV_POLL)
+   kn->kn_flags |= __EV_HUP;
+   return (1);
+   }
+
+   return (0);
+}
+
 const struct filterops ptcread_filtops = {
.f_flags= FILTEROP_ISFD,
.f_attach   = NULL,
@@ -749,7 +768,7 @@ const struct filterops ptcexcept_filtops
.f_flags= FILTEROP_ISFD,
.f_attach   = NULL,
.f_detach   = filt_ptcrdetach,
-   .f_event= filt_ptcread,
+   .f_event= filt_ptcexcept,
 };
 
 int
Index: kern/uipc_socket.c
===
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.265
diff -u -p -r1.265 uipc_socket.c
--- kern/uipc_socket.c  14 Oct 2021 23:05:10 -  1.265
+++ kern/uipc_socket.c  22 Oct 2021 12:49:12 -
@@ -78,6 +78,10 @@ int  filt_sowrite(struct knote *kn, long 
 intfilt_sowritemodify(struct kevent *kev, struct knote *kn);
 intfilt_sowriteprocess(struct knote *kn, struct kevent *kev);
 intfilt_sowrite_common(struct knote *kn, struct socket *so);
+intfilt_soexcept(struct knote *kn, long hint);
+intfilt_soexceptmodify(struct kevent *kev, struct knote *kn);
+intfilt_soexceptprocess(struct knote *kn, struct kevent *kev);
+intfilt_soexcept_common(struct knote *kn, struct socket *so);
 intfilt_solisten(struct knote *kn, long hint);
 intfilt_solistenmodify(struct kevent *kev, struct knote *kn);
 intfilt_solistenprocess(struct knote *kn, struct kevent *kev);
@@ -114,9 +118,9 @@ const struct filterops soexcept_filtops 
.f_flags= FILTEROP_ISFD,
.f_attach   = NULL,
.f_detach   = filt_sordetach,
-   .f_event= filt_soread,
-   .f_modify   = filt_soreadmodify,
-   .f_process  = filt_soreadprocess,
+   .f_event= filt_soexcept,
+   .f_modify   = filt_soexceptmodify,
+   .f_process  = filt_soexceptprocess,
 };
 
 #ifndef SOMINCONN
@@ -2089,13 +2093,7 @@ 

Re: vnode lock: remove VLOCKSWORK flag

2021-10-15 Thread Martin Pieuchot
On 15/10/21(Fri) 09:27, Sebastien Marie wrote:
> Hi,
> 
> The following diff removes VLOCKSWORK flag.

Nice.

> This flag is currently used to mark or unmark a vnode to actively
> check vnode locking semantic (when compiled with VFSLCKDEBUG).
>  
> Currently, VLOCKSWORK flag isn't properly set for several FS
> implementation which have full locking support, specially:
>  - cd9660
>  - udf
>  - fuse
>  - msdosfs
>  - tmpfs
> 
> Instead of using a particular flag, I propose to directly check if
> v_op->vop_islocked is nullop or not to activate or not the vnode
> locking checks.

I wonder if we shouldn't get rid of those checks and instead make
VOP_ISLOCKED() deal with that.

VOP_ISLOCKED() is inconsistent.  It returns the value of rrw_status(9)
or EOPNOTSUPP if `vop_islocked' is NULL.

But this is a change in behavior that has a broader scope, so it should
be done separately.

> Some alternate methods might be possible, like having a specific
> member inside struct vops. But it will only duplicate the fact that
> nullop is used as lock mecanism.
> 
> I also slightly changed ASSERT_VP_ISLOCKED(vp) macro:
> - evaluate vp argument only once
> - explicitly check if VOP_ISLOCKED() != LK_EXCLUSIVE (it might returns
>   error or 'locked by some else', and it doesn't mean "locked by me")
> - show the VOP_ISLOCKED returned code in panic message
> 
> Some code are using ASSERT_VP_ISLOCKED() like code. I kept them simple.
> 
> The direct impact on snapshots should be low as VFSLCKDEBUG isn't set
> by default.
> 
> Comments or OK ?

ok mpi@

> diff e44725a8dd99f82f94f37ecff5c0e710c4dba97e 
> /home/semarie/repos/openbsd/sys-clean
> blob - c752dd99e9ef62b05162cfeda67913ab5bccf06e
> file + kern/vfs_subr.c
> --- kern/vfs_subr.c
> +++ kern/vfs_subr.c
> @@ -1075,9 +1075,6 @@ vclean(struct vnode *vp, int flags, struct proc *p)
>   vp->v_op = _vops;
>   VN_KNOTE(vp, NOTE_REVOKE);
>   vp->v_tag = VT_NON;
> -#ifdef VFSLCKDEBUG
> - vp->v_flag &= ~VLOCKSWORK;
> -#endif
>   mtx_enter(_mtx);
>   vp->v_lflag &= ~VXLOCK;
>   if (vp->v_lflag & VXWANT) {
> @@ -1930,7 +1927,7 @@ vinvalbuf(struct vnode *vp, int flags, struct ucred *c
>   int s, error;
>  
>  #ifdef VFSLCKDEBUG
> - if ((vp->v_flag & VLOCKSWORK) && !VOP_ISLOCKED(vp))
> + if ((vp->v_op->vop_islocked != nullop) && !VOP_ISLOCKED(vp))
>   panic("%s: vp isn't locked, vp %p", __func__, vp);
>  #endif
>  
> blob - caf2dc327bfc2f5a001bcee80edd90938497ef99
> file + kern/vfs_vops.c
> --- kern/vfs_vops.c
> +++ kern/vfs_vops.c
> @@ -48,11 +48,15 @@
>  #include 
>  
>  #ifdef VFSLCKDEBUG
> -#define ASSERT_VP_ISLOCKED(vp) do {  \
> - if (((vp)->v_flag & VLOCKSWORK) && !VOP_ISLOCKED(vp)) { \
> - VOP_PRINT(vp);  \
> - panic("vp not locked"); \
> - }   \
> +#define ASSERT_VP_ISLOCKED(vp) do {  \
> + struct vnode *_vp = (vp);   \
> + int r;  \
> + if (_vp->v_op->vop_islocked == nullop)  \
> + break;  \
> + if ((r = VOP_ISLOCKED(_vp)) != LK_EXCLUSIVE) {  \
> + VOP_PRINT(_vp); \
> + panic("%s: vp not locked, vp %p, %d", __func__, _vp, r);\
> + }   \
>  } while (0)
>  #else
>  #define ASSERT_VP_ISLOCKED(vp)  /* nothing */
> blob - 81b900e83d2071d8450f35cfae42c6cb91f1a414
> file + nfs/nfs_node.c
> --- nfs/nfs_node.c
> +++ nfs/nfs_node.c
> @@ -133,9 +133,6 @@ loop:
>   }
>  
>   vp = nvp;
> -#ifdef VFSLCKDEBUG
> - vp->v_flag |= VLOCKSWORK;
> -#endif
>   rrw_init_flags(>n_lock, "nfsnode", RWL_DUPOK | RWL_IS_VNODE);
>   vp->v_data = np;
>   /* we now have an nfsnode on this vnode */
> blob - 3668f954a9aab3fd49ed5e41e7d4ab51b4bf0a90
> file + sys/vnode.h
> --- sys/vnode.h
> +++ sys/vnode.h
> @@ -146,8 +146,7 @@ struct vnode {
>  #define  VCLONED 0x0400  /* vnode was cloned */
>  #define  VALIASED0x0800  /* vnode has an alias */
>  #define  VLARVAL 0x1000  /* vnode data not yet set up by higher 
> level */
> -#define  VLOCKSWORK  0x4000  /* FS supports locking discipline */
> -#define  VCLONE  0x8000  /* vnode is a clone */
> +#define  VCLONE  0x4000  /* vnode is a clone */
>  
>  /*
>   * (v_bioflag) Flags that may be manipulated by interrupt handlers
> blob - d859d216b40ebb2f5cce1eb5cf0becbfff21a638
> file + ufs/ext2fs/ext2fs_subr.c
> --- ufs/ext2fs/ext2fs_subr.c
> +++ ufs/ext2fs/ext2fs_subr.c
> @@ -170,9 +170,6 @@ ext2fs_vinit(struct mount *mp, struct vnode **vpp)
>   nvp->v_data = vp->v_data;
>  

poll(2) on top of kqueue

2021-10-14 Thread Martin Pieuchot
Diff below is the counterpart of the select(2) one I just committed to
make poll(2) and ppoll(2) use kqueue internally.

They use the same logic as select(2): convert each pollfd into kqueue
events with EV_SET(2), then wait in kqueue_scan().
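
As an illustration, here is a minimal sketch of that conversion for a
single pollfd.  It is simplified: the real registration, including the
FIFO and error handling, lives in ppollregister()/ppollregister_evts()
in the diff below and also sets internal flags omitted here.

	/* Sketch only: map the basic pollfd events to kqueue filters. */
	struct kevent kev[2];
	int nkev = 0;

	if (pl->events & (POLLIN | POLLRDNORM))
		EV_SET(&kev[nkev++], pl->fd, EVFILT_READ,
		    EV_ADD | EV_ENABLE, 0, 0,
		    (void *)(p->p_kq_serial));
	if (pl->events & (POLLOUT | POLLWRNORM))
		EV_SET(&kev[nkev++], pl->fd, EVFILT_WRITE,
		    EV_ADD | EV_ENABLE, 0, 0,
		    (void *)(p->p_kq_serial));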

To make this implementation compatible with the existing poll(2) semantics
I added a new, FIFO-specific kqueue filter to handle the case where POLLOUT
is specified on a FIFO open in read-only mode.  Thanks to millert@ for the
idea.  The sys/fifofs regress tests pass with that.

As for the select(2) diff, I'm currently interested in knowing whether
you find any incompatibility with the current behavior.

Thanks for testing,
Martin

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.136
diff -u -p -r1.136 sys_generic.c
--- kern/sys_generic.c  14 Oct 2021 08:46:01 -  1.136
+++ kern/sys_generic.c  14 Oct 2021 09:00:22 -
@@ -81,6 +81,8 @@ int kqpoll_debug = 0;
 
 int pselregister(struct proc *, fd_set *[], int, int *);
 int pselcollect(struct proc *, struct kevent *, fd_set *[], int *);
+int ppollregister(struct proc *, struct pollfd *, int, int *);
+int ppollcollect(struct proc *, struct kevent *, struct pollfd *, u_int);
 
 int pollout(struct pollfd *, struct pollfd *, u_int);
 int dopselect(struct proc *, int, fd_set *, fd_set *, fd_set *,
@@ -769,6 +771,7 @@ pselregister(struct proc *p, fd_set *pib
/* FALLTHROUGH */
case EOPNOTSUPP:/* No underlying kqfilter */
case EINVAL:/* Unimplemented filter */
+   case EPERM: /* Specific to FIFO */
error = 0;
break;
case ENXIO: /* Device has been detached */
@@ -899,31 +902,132 @@ doselwakeup(struct selinfo *sip)
}
 }
 
-void
-pollscan(struct proc *p, struct pollfd *pl, u_int nfd, register_t *retval)
+int
+ppollregister_evts(struct proc *p, struct kevent *kevp, int nkev,
+struct pollfd *pl)
 {
-   struct filedesc *fdp = p->p_fd;
-   struct file *fp;
-   u_int i;
-   int n = 0;
+   int i, error, nevents = 0;
 
-   for (i = 0; i < nfd; i++, pl++) {
-   /* Check the file descriptor. */
-   if (pl->fd < 0) {
-   pl->revents = 0;
-   continue;
+   KASSERT(pl->revents == 0);
+
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktrevent(p, kevp, nkev);
+#endif
+   for (i = 0; i < nkev; i++, kevp++) {
+again:
+   error = kqueue_register(p->p_kq, kevp, p);
+   switch (error) {
+   case 0:
+   nevents++;
+   break;
+   case EOPNOTSUPP:/* No underlying kqfilter */
+   case EINVAL:/* Unimplemented filter */
+   break;
+   case EBADF: /* Bad file descriptor */
+   pl->revents |= POLLNVAL;
+   break;
+   case EPERM: /* Specific to FIFO */
+   KASSERT(kevp->filter == EVFILT_WRITE);
+   if (nkev == 1) {
+   /*
+* If this is the only filter make sure
+* POLLHUP is passed to userland.
+*/
+   kevp->filter = EVFILT_EXCEPT;
+   goto again;
+   }
+   break;
+   case EPIPE: /* Specific to pipes */
+   KASSERT(kevp->filter == EVFILT_WRITE);
+   pl->revents |= POLLHUP;
+   break;
+   default:
+#ifdef DIAGNOSTIC
+   DPRINTFN(0, "poll err %lu fd %d revents %02x serial"
+   " %lu filt %d ERROR=%d\n",
+   ((unsigned long)kevp->udata - p->p_kq_serial),
+   pl->fd, pl->revents, p->p_kq_serial, kevp->filter,
+   error);
+#endif
+   /* FALLTHROUGH */
+   case ENXIO: /* Device has been detached */
+   pl->revents |= POLLERR;
+   break;
}
-   if ((fp = fd_getfile(fdp, pl->fd)) == NULL) {
-   pl->revents = POLLNVAL;
-   n++;
+   }
+
+   return (nevents);
+}
+
+/*
+ * Convert pollfd into kqueue events and register them on the
+ * per-thread queue.
+ *
+ * Return the number of pollfd that triggered at least one error and aren't
+ * completly monitored.  These pollfd should have the correponding error bit
+ * set in `revents'.
+ *
+ * At most 3 events can correspond to a single pollfd.
+ */
+int

Re: Switch to kqueue based select(2)

2021-10-13 Thread Martin Pieuchot
On 13/10/21(Wed) 11:41, Alexander Bluhm wrote:
> On Sat, Oct 02, 2021 at 09:10:13AM +0200, Martin Pieuchot wrote:
> > ok?
> 
> OK bluhm@
> 
> > +   /* Maxium number of events per iteration */
> 
> Maximum
> 
> > +int
> > +pselcollect(struct proc *p, struct kevent *kevp, fd_set *pobits[3],
> > +int *ncollected)
> > +{
> > +#ifdef DIAGNOSTIC
> > +   /* Filter out and lazily delete spurious events */
> > +   if ((unsigned long)kevp->udata != p->p_kq_serial) {
> > +   DPRINTFN(0, "select fd %u mismatched serial %lu\n",
> > +   (int)kevp->ident, p->p_kq_serial);
> > +   kevp->flags = EV_DISABLE|EV_DELETE;
> > +   kqueue_register(p->p_kq, kevp, p);
> > +   return (0);
> > +   }
> > +#endif
> 
> Why is it DIAGNOSTIC?  Either it should not happen, then call panic().
> Or it is a valid corner case, then remove #ifdef DIAGNOSTIC.
> 
> Different behavior with and without DIAGNOSTIC seems bad.

Indeed.  It should not be in DIAGNOSTIC; that's a leftover from a previous
iteration of the diff.  I'll fix both points before committing.
Thanks for the review.



Re: mi_switch() & setting `p_stat'

2021-10-03 Thread Martin Pieuchot
On 02/10/21(Sat) 21:09, Mark Kettenis wrote:
> > Date: Sat, 2 Oct 2021 20:35:41 +0200
> > From: Martin Pieuchot 
> > [...] 
> > There's no sleeping point but a call to wakeup().  This wakeup() is
> > supposed to wake a btrace(8) process.  But if the curproc, which just
> > added itself to the global sleep queue, ends up in the same bucket as
> > the btrace process, the KASSERT() line 565 of kern/kern_synch.c will
> > trigger:
> > 
> > /*
> >  * If the rwlock passed to rwsleep() is contended, the
> >  * CPU will end up calling wakeup() between sleep_setup()
> >  * and sleep_finish().
> >  */
> > if (p == curproc) {
> > KASSERT(p->p_stat == SONPROC);
> > continue;
> > }
> 
> Ah, right.  But that means the comment isn't accurate.  At least there
> are other cases that make us hit that codepath.
> 
> How useful is that KASSERT in catching actual bugs?

I added the KASSERT() to limit the scope of the check.  If the test is
true, `curproc' is obviously on the CPU.  Its usefulness is questionable.

So a simpler fix would be to remove the assert; the diff below does that
and updates the comment, ok?

Index: kern/kern_synch.c
===
RCS file: /cvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.179
diff -u -p -r1.179 kern_synch.c
--- kern/kern_synch.c   9 Sep 2021 18:41:39 -   1.179
+++ kern/kern_synch.c   3 Oct 2021 08:48:28 -
@@ -558,14 +558,11 @@ wakeup_n(const volatile void *ident, int
for (p = TAILQ_FIRST(qp); p != NULL && n != 0; p = pnext) {
pnext = TAILQ_NEXT(p, p_runq);
/*
-* If the rwlock passed to rwsleep() is contended, the
-* CPU will end up calling wakeup() between sleep_setup()
-* and sleep_finish().
+* This happens if wakeup(9) is called after enqueuing
+* itself on the sleep queue and both `ident' collide.
 */
-   if (p == curproc) {
-   KASSERT(p->p_stat == SONPROC);
+   if (p == curproc)
continue;
-   }
 #ifdef DIAGNOSTIC
if (p->p_stat != SSLEEP && p->p_stat != SSTOP)
panic("wakeup: p_stat is %d", (int)p->p_stat);


