Re: ix(4)/riscv64: Make ix(4) work when MSI-X interrupts aren't available

2021-07-20 Thread Jonathan Matthew
On Tue, Jul 20, 2021 at 02:21:39PM +0200, Mark Kettenis wrote:
> > Date: Tue, 20 Jul 2021 21:55:56 +1000
> > From: Jonathan Matthew 
> > 
> > On Mon, Jul 19, 2021 at 07:37:10PM -0400, Ashton Fagg wrote:
> > > I have an Intel 82599 10 gigabit ethernet card I wanted to get working
> > > on my SiFive Unmatched board.
> > > 
> > > I found the ix(4) driver has some weirdness around MSI-X
> > > interrupts. While the driver supports operating both with and without
> > > MSI-X support, it's hard-coded via a flag rather than dynamically checking
> > > if it's available. If the flag is set (which it always is right now),
> > > but MSI-X isn't available, the driver will throw an error and the device
> > > won't work:
> > > 
> > > ix0 at pci7 dev 0 function 0 "Intel 82599" rev 0x01ixgbe_allocate_msix: 
> > > pci_intr_map_msix vec 0 failed
> > > 
> > > The root cause is this call failing in if_ix.c:
> > > 
> > >   if (pci_intr_map_msix(pa, i, )) {
> > >   printf("ixgbe_allocate_msix: "
> > >   "pci_intr_map_msix vec %d failed\n", i);
> > >   error = ENOMEM;
> > >   goto fail;
> > >   }
> > > 
> > > 
> > > Because in _pci_intr_map_msix (in sys/arch/riscv64/dev/pci_machdep.c):
> > > 
> > > if ((pa->pa_flags & PCI_FLAGS_MSI_ENABLED) == 0 ||
> > >   pci_get_capability(pc, tag, PCI_CAP_MSI, NULL, NULL) == 0)
> > >   return -1;
> > > 
> > > The PCI attach flags would not have PCI_FLAGS_MSI_ENABLED set.
> > > 
> > > The following diff remedies that by checking if PCI_FLAGS_MSI_ENABLED is
> > > actually set, rather than just trying and failing because the hard-coded
> > > flag says so. It also enables ix(4) in the kernel config for
> > > riscv64. Effectively, the driver will now only try to use MSI-X if the
> > > machine is advertising it to be available.
> > 
> > I'd rather not have to do this in every driver.  We otherwise check that 
> > flag
> > inside the pci interrupt functions rather than in the driver code, so we
> > should do so in pci_intr_msix_count() too, since that's what we call in
> > multi-queue nic drivers to decide whether to use MSI-X.  Drivers that only
> > want a single vector will just call pci_intr_map_msix() and fall back to MSI
> > or legacy interrupts if that fails.
> > 
> > I posted the alternate version of this diff to misc@ a few days ago,
> > which repeats the checks used to set PCI_FLAGS_MSI_ENABLED in
> > pci_intr_msix_count(), rather than passing in struct
> > pci_attach_args, in case we prefer to do it that way.
> 
> I don't really read misc@, so don't post your patches there.

Right, it was just there for testing.

> 
> > Mark, what do you think?
> 
> Yeah, making pci_intr_msix_count() should return 0 if MSIs are not
> supported.  A bit strange though to pass both pa and pa->pa_tag.  I'd
> change the function to only take pa as an argument.

Yes, on second look that makes sense.  Here's a better diff with that change,
and that also doesn't break arches without __HAVE_PCI_MSIX.  ok?

Index: if_bnxt.c
===
RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
retrieving revision 1.32
diff -u -p -u -p -r1.32 if_bnxt.c
--- if_bnxt.c   24 Apr 2021 09:37:46 -  1.32
+++ if_bnxt.c   21 Jul 2021 03:24:44 -
@@ -537,7 +537,7 @@ bnxt_attach(struct device *parent, struc
sc->sc_flags |= BNXT_FLAG_MSIX;
intrstr = pci_intr_string(sc->sc_pc, ih);
 
-   nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   nmsix = pci_intr_msix_count(pa);
if (nmsix > 1) {
sc->sc_ih = pci_intr_establish(sc->sc_pc, ih,
IPL_NET | IPL_MPSAFE, bnxt_admin_intr, sc, 
DEVNAME(sc));
Index: if_ix.c
===
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.178
diff -u -p -u -p -r1.178 if_ix.c
--- if_ix.c 22 Dec 2020 23:25:37 -  1.178
+++ if_ix.c 21 Jul 2021 03:24:44 -
@@ -1783,7 +1783,7 @@ ixgbe_setup_msix(struct ix_softc *sc)
if (!ixgbe_enable_msix)
return;
 
-   nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   nmsix = pci_intr_msix_count(pa);
if (nmsix <= 1)
return;
 
Index: if_ixl.c
===
RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
retrieving revision 1.74
diff -u -p -u -p -r1.74 if_ixl.c
--- if_ixl.c26 Mar 2021 08:02:34 -  1.74
+++ if_ixl.c21 Jul 2021 03:24:44 -
@@ -1795,7 +1795,7 @@ ixl_attach(struct device *parent, struct
}
 
if (pci_intr_map_msix(pa, 0, >sc_ih) == 0) {
-   int nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   int nmsix = pci_intr_msix_count(pa);
if (nmsix > 1) { /* we used 1 (the 0th) for the adminq */
nmsix--;
 

Re: nanosleep.2: miscellaneous cleanup

2021-07-20 Thread Theo de Raadt
Scott Cheloha  wrote:

> I'm not wed to keeping all the examples.  What if we merge examples 1
> and 2 and dumb the result down a bit?

Are you teaching people how to fill a structure, how to perform a loop,
or that they need to check errno?

I don't see the point of having any example.

Do we need an example in the close(2) manual page next?



Re: nanosleep.2: miscellaneous cleanup

2021-07-20 Thread Theo de Raadt
>  [EFAULT]  foo points outside the process's allocated address space.
>
> But i don't really i like that.  The word "allocated" makes me wonder
> because it sounds too much like malloc(3) for my taste.
> Usually, pointers to automatic and to static objects are acceptable,
> too, and are those "allocated"?  Some might say they are not.
> Besides, "process's" looks awkward.

Disagree on your dislike of this wording.

A process has an address space, VM_MIN_ADDRESS to VM_MAXUSER_ADDRESS.
Parts of this are not mapped, because nothing has allocated backing
resources.  Allocation happens via stack growth, execve, shm, mmap, etc
etc etc etc.  Those are all methods for "allocating" within the
processes' address space.

Unallocated space generates EFAULT.

Addresses outside the VM_MIN_ADDRESS to VM_MAXUSER_ADDRESS range also
generate EFAULT, and this is because backing resources are not permitted
to be allocated there.  No allocation, thus EFAULT.

"allocated" is correct terminology, and it has nothing to do with
malloc.

I don't see how trying to mince words is going to help anyone.

Unifying all the varients into one form will be quite a task.
I don't know if this is the right form to use, but I don't like the
argument you made.



Re: nanosleep.2: miscellaneous cleanup

2021-07-20 Thread Ingo Schwarze
Hi Scott,

Scott Cheloha wrote on Tue, Jul 20, 2021 at 05:20:16PM -0500:

> The nanosleep.2 page could use some cleanup.  Here's a bunch of fixes,
> rewrites, etc.
> 
> I've included my notes on the changes below.  I have some (mostly
> stylistic) questions in there, too.

Thanks for explaining what you are doing.  I'm snipping what i agree
with as well as what i have no opinion on and trust you on, but that
doesn't mean that mentioning it was useless.

> Should we reference sigaction(2) here alongside SA_RESTART?  e.g.:
> 
>   ... even if SA_RESTART is set for the interrupting signal
>   (see sigaction(2) for further details).

It isn't absolutely required.  On the one hand, people can use
"man -k any=SA_RESTART" to find out what it is.  On the other hand,
people can maybe be expected to know that sigaction(2) is the system
call to configure signal handling, and even if they ognore it, they
will easily find out with "man signal".

Then again, i can see how saying more precisely what SA_RESTART is
might be helpful especially for novice users, if it can be done
without being too much of a distraction.

In general, i dislike both parenthetic remarks and "see ... for details"
unless a more natural wording cannot be found.

Maybe something like:

  ... even if SA_RESTART was set with sigaction(2) for the
  interrupting signal.

Your call, i would say.

> I'm unsure about the correct voice here.  Should nanosleep actively
> return values?  Or should values "be returned"?

Both are acceptable.  In general, and not just in manual pages,
prefer the active voice when both work equally well.

> I'm unclear on whether I need the indent here in the .Bd block.  I
> think it looks better than unindented code, but maybe I'm flouting
> some practice we have.

Using tabs is permitted inside .Bd -literal and .Bd -unfilled.
They should definitely be used for indenting the bodies of for,
while, and if.

Whether you indent the whole thing depends on what looks better.
When the code is relatively narrow, indenting the whole example
is often a good idea.  When the code contains very long lines,
not indenting the whole example may be better.

If you choose to indent, i have no strong prefernce whether you
do it with tabs or with .Bd -literal -offset indent; maybe i very
slightly prefer the latter, because the global indentation is
a formatting choice rather than part of the displayed code.
But i really wouldn't complain about either.

> ERRORS
> - Simplify the opening sentence.  Yeesh, what a mouthful.

Indeed, ERRORS should not repeat the content of RETURN VALUES.

> Unsure if a second EFAULT case for remainder is a good thing here.
> Strictly speaking, NULL is not a valid part of the process address
> space.  Maybe that's too pedantic?

I don't think you can be too pedantic about where NULL is allowed
and where it isn't.  That's a notorious source of bugs, so precision
about NULL is usually a good thing.

> Also, do we have a more standard "boilerplate" sentence for EFAULT?

Judging from "man -l /usr/share/man/man2/*.2", the most common
wording is:

  [EFAULT]  foo points outside the process's allocated address space.

But i don't really i like that.  The word "allocated" makes me wonder
because it sounds too much like malloc(3) for my taste.
Usually, pointers to automatic and to static objects are acceptable,
too, and are those "allocated"?  Some might say they are not.
Besides, "process's" looks awkward.

These occur, too:

  [EFAULT]  The foo parameter is not in a valid part of the user
address space.

  [EFAULT]  foo references invalid memory.

  [EFAULT]  The name parameter specifies an area outside the
process address space.

  [EFAULT]  foo points to an illegal address.

  [EFAULT]  The userspace address uaddr is invalid.

  [EFAULT]  Part of buf points outside the process's allocated
address space.

  [EFAULT]  The buf parameter points to an invalid address.

  [EFAULT]  The argument foo specifies an invalid address.

  [EFAULT]  The foo parameter specified a bad address.

  [EFAULT]  The foo parameter points to memory not in
a valid part of the process address space.

  [EFAULT]  The address specified for foo is invalid.

  [EFAULT]  An argument address referenced invalid memory.

  In errno(2):
  14 EFAULT Bad address. The system detected an invalid address in
attempting to use an argument of a call.

  [EFAULT]  There was an error reading or writing the kevent
structure.

Quite a mess, i would say.

I think trying to explain over and over again what an invalid address
is, in each and every manual page, is neither reasonable nor feasible -
i'm not convinced the rules are really simple given how many types
of valid memory there are: static, stack, heap, ...

So i'd probably settle for a concise, simple wording, and i think
i like this one best:

  [EFAULT]  foo points to an invalid address.

What is valid can easily depend on the function 

Re: vscsi/iscsid: wait for scsi_probe to complete after connections are established

2021-07-20 Thread Ashton Fagg
Friendly ping.

Ashton Fagg  writes:

> I have been working on fixing an issue (which was partially fixed by a
> diff I sent in earlier this year) with iscsid. With iscsi disks in
> /etc/fstab, sometimes the devices aren't fully up and ready before fsck
> tries to run - causing the machine to go into single user mode on boot.
>
> The diff that was merged earlier in the year added a poll routine which
> monitors for connection success before letting iscsictl return. This
> fixed the issue in some cases, however it's still only half the fix. The
> principled fix is to, additionally, wait until the scsi_probe calls are
> complete - at which point we can reasonably assume the disk device are
> ready for use. This requires some work to the vscsi driver to make this
> happen, as well as changes to both iscsid and iscsictl. The diffs
> attached here are a full implementation of this.
>
> I was still encountering this issue on one of my machines (much slower
> than my normal machine), where the connections would be up but the
> scsi_probe calls would not have completed - even with my earlier diff
> this would still cause the machine to go to single user mode. However,
> this indeed fixes that problem completely and I've been running it for a
> couple of weeks with no problems.
>
> The diffs are designed to be applied in the order they appear. In
> summary, the proposed changes are:
>
> (1) taskq.diff adds a taskq_empty function, which lets us check to see
> if a taskq is, well, empty. This is used in (2). Updates the man pages
> for taskq/task_add to reflect this.
>
> (2) vscsi.diff adds an ioctl to the vscsi driver which lets us monitor
> for device event queue completion. To aid in this it also separates
> calls to scsi_probe and scsi_detach to a dedicated taskq, rather than
> using systq. Updates the man pages for vscsi to reflect this.
>
> (3) iscsid.diff does the plumbing to actually call the new ioctl and
> provide that information back to iscsictl during status poll.
>
> (4) iscsictl.diff integrates the device queue information into the
> polling routine. Updates the man page for iscsictl to describe the new
> behavior.
>
> Based on commmit messages around vscsi it seems there was some plan to
> do this quite some years ago but it's hard for me to know what the
> proposed method was (though I assume what was envisaged might be similar
> to something like this).
>
> Feedback sought and greatly welcomed. I am basically certain there are
> ways this can be improved.
>
> Thanks,
>
> Ash

Index: share/man/man9/task_add.9
===
RCS file: /cvs/src/share/man/man9/task_add.9,v
retrieving revision 1.22
diff -u -p -u -p -r1.22 task_add.9
--- share/man/man9/task_add.9	8 Jun 2020 00:29:51 -	1.22
+++ share/man/man9/task_add.9	13 Jul 2021 23:45:59 -
@@ -20,6 +20,7 @@
 .Sh NAME
 .Nm taskq_create ,
 .Nm taskq_destroy ,
+.Nm taskq_empty ,
 .Nm taskq_barrier ,
 .Nm taskq_del_barrier ,
 .Nm task_set ,
@@ -43,6 +44,8 @@
 .Fn taskq_barrier "struct taskq *tq"
 .Ft void
 .Fn taskq_del_barrier "struct taskq *tq" "struct task *t"
+.Ft int
+.Fn taskq_empty "struct taskq *tq"
 .Ft void
 .Fn task_set "struct task *t" "void (*fn)(void *)" "void *arg"
 .Ft int
@@ -86,6 +89,9 @@ argument:
 The threads servicing the taskq will be run without the kernel big lock.
 .El
 .Pp
+.Fn taskq_empty
+indicates whether there are any queued tasks.
+.Pp
 .Fn taskq_destroy
 causes the resources associated with a previously created taskq to be freed.
 It will wait till all the tasks in the work queue are completed before
@@ -220,6 +226,9 @@ or 0 if the task was not already on the 
 .Fn task_pending
 will return non-zero if the task is queued to run, or 0 if the task
 is not queued.
+.Pp
+.Fn taskq_empty
+will return 1 if there are no tasks queued, or 0 otherwise.
 .Sh SEE ALSO
 .Xr autoconf 9 ,
 .Xr spl 9
Index: sys/kern/kern_task.c
===
RCS file: /cvs/src/sys/kern/kern_task.c,v
retrieving revision 1.31
diff -u -p -u -p -r1.31 kern_task.c
--- sys/kern/kern_task.c	1 Aug 2020 08:40:20 -	1.31
+++ sys/kern/kern_task.c	13 Jul 2021 23:46:00 -
@@ -394,6 +394,18 @@ task_del(struct taskq *tq, struct task *
 }
 
 int
+taskq_empty(struct taskq *tq)
+{
+	int rv;
+
+	mtx_enter(>tq_mtx);
+	rv = TAILQ_EMPTY(>tq_worklist);
+	mtx_leave(>tq_mtx);
+
+	return (rv);
+}
+
+int
 taskq_next_work(struct taskq *tq, struct task *work)
 {
 	struct task *next;
Index: sys/sys/task.h
===
RCS file: /cvs/src/sys/sys/task.h,v
retrieving revision 1.18
diff -u -p -u -p -r1.18 task.h
--- sys/sys/task.h	1 Aug 2020 08:40:20 -	1.18
+++ sys/sys/task.h	13 Jul 2021 23:46:00 -
@@ -51,6 +51,8 @@ void		 taskq_barrier(struct taskq *);
 
 void		 taskq_del_barrier(struct taskq *, struct task *);
 
+int		 taskq_empty(struct taskq *);
+
 void		 task_set(struct task *, void (*)(void *), void *);
 

Re: nanosleep.2: miscellaneous cleanup

2021-07-20 Thread Scott Cheloha
On Tue, Jul 20, 2021 at 06:26:45PM -0600, Theo de Raadt wrote:
> I think this is excessively verbose for such a simple function, especially
> the addition of examples. 
> 
> There are many APIs which are hundreds of times more complicated which
> don't have this level of detailing, and I would argue such manual pages
> have the correct tone and complexity.
> 
> It stops feeling like a historically+adapted BSD manual page.

The only major addition to the length here is the new examples.  I
would argue that everything else is either clarified or missing from
the current page.

I'm not wed to keeping all the examples.  What if we merge examples 1
and 2 and dumb the result down a bit?

Index: nanosleep.2
===
RCS file: /cvs/src/lib/libc/sys/nanosleep.2,v
retrieving revision 1.16
diff -u -p -r1.16 nanosleep.2
--- nanosleep.2 31 Dec 2018 18:54:00 -  1.16
+++ nanosleep.2 21 Jul 2021 01:02:20 -
@@ -41,53 +41,95 @@
 .Ft int
 .Fn nanosleep "const struct timespec *timeout" "struct timespec *remainder"
 .Sh DESCRIPTION
+The
 .Fn nanosleep
-suspends execution of the calling process for the duration specified by
+function suspends execution of the calling thread for at least the
+duration specified by
 .Fa timeout .
-An unmasked signal will cause it to terminate the sleep early,
-regardless of the
+Delivery of an unmasked signal to the calling thread will terminate the
+sleep early,
+even if
 .Dv SA_RESTART
-value on the interrupting signal.
+is set for the interrupting signal.
 .Sh RETURN VALUES
-If the
+If
 .Fn nanosleep
-function returns because the requested duration has elapsed, the value
-returned will be zero.
+sleeps the full
+.Fa timeout
+without interruption it returns 0.
+If
+.Fa remainder
+is
+.Pf non- Dv NULL
+it is set to zero.
 .Pp
-If the
+If
 .Fn nanosleep
-function returns due to the delivery of a signal, the value returned
-will be \-1, and the global variable
+is interrupted by a signal it returns -1 and the global variable
 .Va errno
-will be set to indicate the interruption.
+is set to
+.Dv EINTR .
 If
 .Fa remainder
-is non-null, the timespec structure it references is updated to contain the
-unslept amount (the requested duration minus the duration actually slept).
-.Sh ERRORS
-If any of the following conditions occur, the
+is
+.Pf non- Dv NULL
+it is set to the unslept portion of the
+.Fa timeout .
+.Pp
+Otherwise,
 .Fn nanosleep
-function shall return \-1 and set
+returns -1 and the global variable
 .Va errno
-to the corresponding value.
+is set to indicate the error.
+.Sh EXAMPLES
+Sleep for five and a half seconds:
+.Bd -literal
+   struct timespec timeout;
+
+   timeout.tv_sec = 5;
+   timeout.tv_nsec = 500 * 1000 * 1000;
+   if (nanosleep(, NULL) == -1)
+   err(1, "nanosleep");
+.Ed
+.Pp
+Sleep for at least sixty seconds and print the time remaining whenever
+a signal interrupts the sleep:
+.Bd -literal
+   struct timespec timeout;
+
+   timeout.tv_sec = 60;
+   timeout.tv_nsec = 0;
+
+   while (nanosleep(, ) == -1) {
+   if (errno != EINTR)
+   err(1, "nanosleep");
+   printf("time left: %lld.%09ld seconds\\n",
+   (long long)timeout.tv_sec, timeout.tv_nsec);
+   }
+.Ed
+.Sh ERRORS
+.Fn nanosleep
+will fail if:
 .Bl -tag -width Er
 .It Bq Er EINTR
-.Fn nanosleep
-was interrupted by the delivery of a signal.
+The call is interrupted by the delivery of a signal.
 .It Bq Er EINVAL
 .Fa timeout
-specified a nanosecond value less than zero or greater than or equal to
-1000 million,
+specifies a nanosecond value less than zero or greater than or equal to
+one billion,
 or a second value less than zero.
 .It Bq Er EFAULT
-Either
 .Fa timeout
-or
-.Fa remainder
 points to memory that is not a valid part of the process address space.
+.It Bq Er EFAULT
+.Fa remainder
+is
+.Pf non- Dv NULL
+and points to memory that is not a valid part of the process address space.
 .El
 .Sh SEE ALSO
 .Xr sleep 1 ,
+.Xr sigaction 2 ,
 .Xr sleep 3
 .Sh STANDARDS
 The
@@ -97,17 +139,23 @@ function conforms to
 .Sh HISTORY
 The predecessor of this system call,
 .Fn sleep ,
-appeared in
-.At v3 ,
-but was removed when
+first appeared in
+.At v3 .
+It was removed in
+.At v7
+and replaced with a C library implementation based on
 .Xr alarm 3
-was introduced into
-.At v7 .
+and
+.Xr signal 3 .
+.Pp
 The
 .Fn nanosleep
-system call has been available since
+function first appeared in
+.St -p1003.1b-93 .
+.Pp
+This implementation of
+.Fn nanosleep
+first appeared in
 .Nx 1.3
 and was ported to
-.Ox 2.1
-and
-.Fx 3.0 .
+.Ox 2.1 .



Re: update xf86-video-amdgpu to latest git

2021-07-20 Thread rgc
@tech

i've been using this patch for a week
i've experience 4 hangs ... X hangs ... network is alive so i can do a safe 
reboot
via ssh. last hang occured just 5 mins ago while watching some YouTube videos.

2 behaviours:
- video/screen freezes indifinitely, which requires me to login via ssh and 
reboot
- video/screen freezes and X restarts, allowing me to just re-login


this laptop acts mostly like a server. it gets rebooted only when it hangs, or 
if
there is a new snapshot.

rgc

On Sat, Jul 10, 2021 at 08:04:54AM +0900, rgc wrote:
> On Thu, Jul 08, 2021 at 05:29:01PM +1000, Jonathan Gray wrote:
> > The latest xf86-video-amdgpu release was in 2019.
> > 
> > xf86-video-amdgpu-19.1.0..origin/master
> 
> 8>< snipped
> 
> just finished rebuilding xenocara
> build fine
> booted fine
> none of my previous issues (see links below) happening yet.
> 
> -current:
> 
> kern.version=OpenBSD 6.9-current (GENERIC.MP) #120: Thu Jul  8 23:45:06 MDT 
> 2021
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
> amdgpu0: PICASSO 10 CU rev 0x01
> 
> 
> this were my system issues/reports:
> https://marc.info/?l=openbsd-misc=161131537616993=2
> https://marc.info/?l=openbsd-misc=161736536311231=2
> 
> since my last update i have updated the BIOS FW (3.06)
> 
> rgc
> 



crypto error ipsec counter

2021-07-20 Thread Alexander Bluhm
Hi,

Propagate the crypto errors and count them in ipsec.  This is part
of a larger diff where I disable the crypto queues for ipsec.  I
think it cannot happen, but errors should always be checked.

tq is never NULL.

I know that tdb->tdb_odrops++ is not MP safe.  I have just copied
the code.  My plan is to address this in multiple places later.

ok?

bluhm

Index: crypto/crypto.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/crypto/crypto.c,v
retrieving revision 1.83
diff -u -p -r1.83 crypto.c
--- crypto/crypto.c 30 Jun 2021 12:21:02 -  1.83
+++ crypto/crypto.c 20 Jul 2021 22:51:56 -
@@ -388,7 +388,7 @@ int
 crypto_dispatch(struct cryptop *crp)
 {
struct taskq *tq = crypto_taskq;
-   int s;
+   int error = 0, s;
u_int32_t hid;
 
s = splvm();
@@ -399,14 +399,14 @@ crypto_dispatch(struct cryptop *crp)
}
splx(s);
 
-   if (tq && !(crp->crp_flags & CRYPTO_F_NOQUEUE)) {
+   if ((crp->crp_flags & CRYPTO_F_NOQUEUE) == 0) {
task_set(>crp_task, (void (*))crypto_invoke, crp);
task_add(tq, >crp_task);
} else {
-   crypto_invoke(crp);
+   error = crypto_invoke(crp);
}
 
-   return 0;
+   return error;
 }
 
 /*
Index: netinet/ipsec_output.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/ipsec_output.c,v
retrieving revision 1.82
diff -u -p -r1.82 ipsec_output.c
--- netinet/ipsec_output.c  8 Jul 2021 15:13:14 -   1.82
+++ netinet/ipsec_output.c  20 Jul 2021 23:09:06 -
@@ -418,7 +418,12 @@ ipsec_output_cb(struct cryptop *crp)
if (tdb->tdb_cryptoid != 0)
tdb->tdb_cryptoid = crp->crp_sid;
NET_UNLOCK();
-   crypto_dispatch(crp);
+   error = crypto_dispatch(crp);
+   if (error) {
+   DPRINTF("crypto dispatch error %d", error);
+   ipsecstat_inc(ipsec_odrops);
+   tdb->tdb_odrops++;
+   }
return;
}
DPRINTF("crypto error %d", crp->crp_etype);



Re: nanosleep.2: miscellaneous cleanup

2021-07-20 Thread Theo de Raadt
I think this is excessively verbose for such a simple function, especially
the addition of examples. 

There are many APIs which are hundreds of times more complicated which
don't have this level of detailing, and I would argue such manual pages
have the correct tone and complexity.

It stops feeling like a historically+adapted BSD manual page.



Re: crypto error ipsec counter

2021-07-20 Thread Vitaliy Makkoveev
ok mvs@

> On 21 Jul 2021, at 02:13, Alexander Bluhm  wrote:
> 
> Hi,
> 
> Propagate the crypto errors and count them in ipsec.  This is part
> of a larger diff where I disable the crypto queues for ipsec.  I
> think it cannot happen, but errors should always be checked.
> 
> tq is never NULL.
> 
> I know that tdb->tdb_odrops++ is not MP safe.  I have just copied
> the code.  My plan is to address this in multiple places later.
> 
> ok?
> 
> bluhm
> 
> Index: crypto/crypto.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/crypto/crypto.c,v
> retrieving revision 1.83
> diff -u -p -r1.83 crypto.c
> --- crypto/crypto.c   30 Jun 2021 12:21:02 -  1.83
> +++ crypto/crypto.c   20 Jul 2021 22:51:56 -
> @@ -388,7 +388,7 @@ int
> crypto_dispatch(struct cryptop *crp)
> {
>   struct taskq *tq = crypto_taskq;
> - int s;
> + int error = 0, s;
>   u_int32_t hid;
> 
>   s = splvm();
> @@ -399,14 +399,14 @@ crypto_dispatch(struct cryptop *crp)
>   }
>   splx(s);
> 
> - if (tq && !(crp->crp_flags & CRYPTO_F_NOQUEUE)) {
> + if ((crp->crp_flags & CRYPTO_F_NOQUEUE) == 0) {
>   task_set(>crp_task, (void (*))crypto_invoke, crp);
>   task_add(tq, >crp_task);
>   } else {
> - crypto_invoke(crp);
> + error = crypto_invoke(crp);
>   }
> 
> - return 0;
> + return error;
> }
> 
> /*
> Index: netinet/ipsec_output.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/ipsec_output.c,v
> retrieving revision 1.82
> diff -u -p -r1.82 ipsec_output.c
> --- netinet/ipsec_output.c8 Jul 2021 15:13:14 -   1.82
> +++ netinet/ipsec_output.c20 Jul 2021 23:09:06 -
> @@ -418,7 +418,12 @@ ipsec_output_cb(struct cryptop *crp)
>   if (tdb->tdb_cryptoid != 0)
>   tdb->tdb_cryptoid = crp->crp_sid;
>   NET_UNLOCK();
> - crypto_dispatch(crp);
> + error = crypto_dispatch(crp);
> + if (error) {
> + DPRINTF("crypto dispatch error %d", error);
> + ipsecstat_inc(ipsec_odrops);
> + tdb->tdb_odrops++;
> + }
>   return;
>   }
>   DPRINTF("crypto error %d", crp->crp_etype);
> 



nanosleep.2: miscellaneous cleanup

2021-07-20 Thread Scott Cheloha
Hi,

The nanosleep.2 page could use some cleanup.  Here's a bunch of fixes,
rewrites, etc.

I've included my notes on the changes below.  I have some (mostly
stylistic) questions in there, too.

--

DESCRIPTION

- We suspend the calling thread, not the calling process.

- Note that we block for "at least" the given timeout.  After all, we
  might block for longer.

- Be more explicit about signal delivery.

Should we reference sigaction(2) here alongside SA_RESTART?  e.g.:

... even if SA_RESTART is set for the interrupting signal
(see sigaction(2) for further details).

RETURN VALUES

- Make it clear there are three cases here: success, signal,
  and other errors.  Separate them into three paragraphs.

- Note that we update remainder in the success case *and* the signal
  case, but not the other case.

I'm unsure about the correct voice here.  Should nanosleep actively
return values?  Or should values "be returned"?

EXAMPLES

- Add some examples.

  The first example is trivial.  Just warmup.  Show them how to
  pack a timespec and make the call.

  The second example demonstrates how to set up a sub-second timeout.
  Sometimes people screw this up, I think it's worth highlighting.

  The last example is more complex.  Show them how to differentiate
  between a signal and programmer error.  Show how the remainder
  argument works.  Show that timeout and remainder can point to the
  same struct.

I'm unclear on whether I need the indent here in the .Bd block.  I
think it looks better than unindented code, but maybe I'm flouting
some practice we have.

ERRORS

- Simplify the opening sentence.  Yeesh, what a mouthful.

- "1000 million" is better written "one billion".  We've done this
  change already in other pages.

- Note that remainder must be non-NULL for an EFAULT to occur.

- Use the past tense consistently.

Unsure if a second EFAULT case for remainder is a good thing here.
Strictly speaking, NULL is not a valid part of the process address
space.  Maybe that's too pedantic?

Also, do we have a more standard "boilerplate" sentence for EFAULT?

SEE ALSO

- Cross reference sigaction(2).  We mention SA_RESTART in the
  description.

HISTORY

- Clarify what happened to the historical sleep system call a bit.

- I'm pretty sure nanosleep() is a POSIX invention, so note that it
  really first appeared in POSIX.1b (realtime extensions), and then talk
  about the implementation we got from NetBSD.

  If someone can find nanosleep() in the wild before 1993 we can fix
  the attribution.

- No need to mention that FreeBSD also got their implementation from
  NetBSD.

--

Index: nanosleep.2
===
RCS file: /cvs/src/lib/libc/sys/nanosleep.2,v
retrieving revision 1.16
diff -u -p -r1.16 nanosleep.2
--- nanosleep.2 31 Dec 2018 18:54:00 -  1.16
+++ nanosleep.2 20 Jul 2021 22:14:50 -
@@ -41,53 +41,114 @@
 .Ft int
 .Fn nanosleep "const struct timespec *timeout" "struct timespec *remainder"
 .Sh DESCRIPTION
+The
 .Fn nanosleep
-suspends execution of the calling process for the duration specified by
+function suspends execution of the calling thread for at least the
+duration specified by
 .Fa timeout .
-An unmasked signal will cause it to terminate the sleep early,
-regardless of the
+Delivery of an unmasked signal to the calling thread will terminate the
+sleep early,
+even if
 .Dv SA_RESTART
-value on the interrupting signal.
+is set for the interrupting signal.
 .Sh RETURN VALUES
-If the
+If
 .Fn nanosleep
-function returns because the requested duration has elapsed, the value
-returned will be zero.
+sleeps the full
+.Fa timeout
+without interruption it returns 0.
+If
+.Fa remainder
+is
+.Pf non- Dv NULL
+it is set to zero.
 .Pp
-If the
+If
 .Fn nanosleep
-function returns due to the delivery of a signal, the value returned
-will be \-1, and the global variable
+is interrupted by a signal it returns -1 and the global variable
 .Va errno
-will be set to indicate the interruption.
+is set to
+.Dv EINTR .
 If
 .Fa remainder
-is non-null, the timespec structure it references is updated to contain the
-unslept amount (the requested duration minus the duration actually slept).
-.Sh ERRORS
-If any of the following conditions occur, the
+is
+.Pf non- Dv NULL
+it is set to the unslept portion of the
+.Fa timeout .
+.Pp
+Otherwise,
 .Fn nanosleep
-function shall return \-1 and set
+returns -1 and the global variable
 .Va errno
-to the corresponding value.
+is set to indicate the error.
+.Sh EXAMPLES
+Count down from ten:
+.Bd -literal
+   struct timespec one_second;
+   int i;
+
+   one_second.tv_sec = 1;
+   one_second.tv_nsec = 0;
+
+   for (i = 10; i >= 0; i--) {
+   if (nanosleep(_second, NULL) == -1)
+   err(1, "nanosleep");
+   printf("%4d\\n", i);
+   }
+.Ed
+.Pp
+Check a condition once every one hundred milliseconds:
+.Bd -literal
+   struct timespec 

unwind(8): WIP support using a custom CA

2021-07-20 Thread Lucas
Hi tech@,

I decided to give it a shot at specifying a custom CA for DoT validation
in unwind(8). Patch at the bottom. Selfish reason for this is that I run
my own DoT resolver, using a self-signed certificate. Am able to use
unbound to query it, but the lack of support for a custom CA in unwind
stopped me from switching to it since its inception. Now that the system
is becoming more integrated (dhcpleased and resolved enabled by default,
deep-ish integration between resolvd and unwind), I gave it a shot.

The patch is just the code part, manpage changes are pending but wanted
to know first if the approach is correct and willing to be accepted.

I successfully tried:
- using a custom CA
- adding a DoT forwarder not covered by it and having it rejected
- removing my DoT forwarders and the custom CA to fallback to the
  standard CA

I haven't tried yet (but will do very soon):
- verifying that unwind works in early system, placing the custom CA
  file in the / mount (ofc won't work if the file resides in another
  mountpoint, thing to be documented)

Worth noting, currently, CA changes come into action when forwarders
are added or removed. If a config file undergoes only a CA change, it
won't be reflected in the running unwind. Am unsure if this could be
solved without much refactor, but haven't given up yet.

The default CA file was unveiled. I removed that call, but dunno if
instead I should be unveiling every new CA file or not. Blocklist aren't
managed in that way, fwiw.

Comments?


Index: frontend.c
===
RCS file: /home/cvs/src/sbin/unwind/frontend.c,v
retrieving revision 1.68
diff -u -p -r1.68 frontend.c
--- frontend.c  6 Feb 2021 18:01:02 -   1.68
+++ frontend.c  20 Jul 2021 18:21:30 -
@@ -365,6 +365,7 @@ frontend_dispatch_main(int fd, short eve
case IMSG_RECONF_FORWARDER:
case IMSG_RECONF_DOT_FORWARDER:
case IMSG_RECONF_FORCE:
+   case IMSG_RECONF_CA_FILE:
imsg_receive_config(, );
break;
case IMSG_RECONF_END:
Index: parse.y
===
RCS file: /home/cvs/src/sbin/unwind/parse.y,v
retrieving revision 1.25
diff -u -p -r1.25 parse.y
--- parse.y 27 Feb 2021 10:32:28 -  1.25
+++ parse.y 20 Jul 2021 20:40:34 -
@@ -103,6 +103,7 @@ typedef struct {
 %token FORWARDER DOT PORT ODOT_FORWARDER ODOT_DHCP
 %token AUTHENTICATION NAME PREFERENCE RECURSOR DHCP STUB
 %token BLOCK LIST LOG FORCE ACCEPT BOGUS
+%token CA FILE
 
 %token   STRING
 %token   NUMBER
@@ -120,6 +121,7 @@ grammar : /* empty */
| grammar uw_forwarder '\n'
| grammar block_list '\n'
| grammar force '\n'
+   | grammar cafile '\n'
| grammar error '\n'{ file->errors++; }
;
 
@@ -343,7 +345,7 @@ acceptbogus:ACCEPT BOGUS{ $$ = 1; }
 
 force_list:force_list optnl STRING {
struct force_tree_entry *e;
-   size_t   len;
+   size_t   len;
 
len = strlen($3);
e = malloc(sizeof(*e));
@@ -374,6 +376,20 @@ force_list:force_list optnl STRING {
}
;
 
+cafile :   CA FILE STRING {
+   if (conf->ca_file != NULL) {
+   yyerror("CA file already configured");
+   free($3);
+   YYERROR;
+   } else {
+   conf->ca_file = strdup($3);
+   if (conf->ca_file == NULL)
+   err(1, "strdup");
+   free($3);
+   }
+   }
+   ;
+
 %%
 
 struct keywords {
@@ -413,8 +429,10 @@ lookup(char *s)
{"authentication",  AUTHENTICATION},
{"block",   BLOCK},
{"bogus",   BOGUS},
+   {"ca",  CA},
{"dhcp",DHCP},
{"dot", DOT},
+   {"file",FILE},
{"force",   FORCE},
{"forwarder",   FORWARDER},
{"include", INCLUDE},
Index: printconf.c
===
RCS file: /home/cvs/src/sbin/unwind/printconf.c,v
retrieving revision 1.16
diff -u -p -r1.16 printconf.c
--- printconf.c 1 Dec 2019 14:37:34 -   1.16
+++ printconf.c 20 Jul 2021 20:41:25 -
@@ -33,6 +33,9 @@ print_config(struct uw_conf *conf)
int  i;
enum uw_resolver_typej;
 
+   if (conf->ca_file != NULL)
+ 

Re: pipex(4): use per-CPU counters for session statistics

2021-07-20 Thread Alexander Bluhm
On Tue, Jul 20, 2021 at 02:25:34AM +0300, Vitaliy Makkoveev wrote:
> With bluhm@'s "forwarding in parallel" diff pipex(4) session could be
> accessed in parallel through (*ifp->if_input) -> ether_input().
> 
> Except statistics and MPPE data PPPOE sessions are mostly immutable. We
> have only place where we should reset 'idle_time' in input path with
> shared netlock. But since pipex(4) garbage collector uses exclusive
> netlock we could reset it with shared netlock.
> 
> At the first step I propose to turn session statistics to per-CPU
> counters. This make sense because we could have sessions with MPPE
> disabled and PPPOE input will be lockless for this case.

OK bluhm@ with one nit

> +void
> +pipex_export_session_stats(struct pipex_session *session,
> +struct pipex_statistics *stats)
> +{
> + uint64_t counters[pxc_ncounters];
> +
> + counters_read(session->stat_counters, counters, pxc_ncounters);
> +

Could you put a memset(stats, 0, sizeof(struct pipex_statistics))
here?  When copying to userland we have to be extra careful.  On
some architectures padding may be different.  Structs may change.
With memset() we cannot leak kernel memory.

> + stats->ipackets = counters[pxc_ipackets];
> + stats->ierrors = counters[pxc_ierrors];
> + stats->ibytes = counters[pxc_ibytes];
> + stats->opackets = counters[pxc_opackets];
> + stats->oerrors = counters[pxc_oerrors];
> + stats->obytes = counters[pxc_obytes];
> + stats->idle_time = session->idle_time;
> +}



Re: parallel ipsec workaround

2021-07-20 Thread Vitaliy Makkoveev
On Tue, Jul 20, 2021 at 03:41:32PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> The current workaround to disable parallel IPsec does not work.
> Variable nettaskqs must not change at runtime.  Interface input
> queues choose the thread during init with ifiq_softnet = net_tq().
> So it cannot be modified after pfkeyv2_send() sets the first SA in
> kernel.  Also changing the calculation in net_tq() may call task_del()
> with a different taskq than task_add().
> 
> Instead of restricting the index of the first softnet task, use an
> exclusive lock.
> 
> For now just move the comment.  We can still decide if a write net
> lock or kernel lock is better.
> 
> ok?
> 

ok mvs@

> bluhm
> 
> Index: net/if.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
> retrieving revision 1.642
> diff -u -p -r1.642 if.c
> --- net/if.c  30 Jun 2021 13:23:33 -  1.642
> +++ net/if.c  20 Jul 2021 13:38:51 -
> @@ -834,6 +834,12 @@ if_input_process(struct ifnet *ifp, stru
>* to PF globals, pipex globals, unicast and multicast addresses
>* lists and the socket layer.
>*/
> +
> + /*
> +  * XXXSMP IPsec data structures are not ready to be accessed
> +  * by multiple network threads in parallel.  In this case
> +  * use an exclusive lock.
> +  */
>   NET_LOCK();
>   while ((m = ml_dequeue(ml)) != NULL)
>   (*ifp->if_input)(ifp, m);
> @@ -3311,17 +3317,14 @@ unhandled_af(int af)
>   panic("unhandled af %d", af);
>  }
>  
> -/*
> - * XXXSMP This tunable is here to work around the fact that IPsec
> - * globals aren't ready to be accessed by multiple threads in
> - * parallel.
> - */
> -int   nettaskqs = NET_TASKQ;
> -
>  struct taskq *
>  net_tq(unsigned int ifindex)
>  {
>   struct taskq *t = NULL;
> + static int nettaskqs;
> +
> + if (nettaskqs == 0)
> + nettaskqs = min(NET_TASKQ, ncpus);
>  
>   t = nettqmp[ifindex % nettaskqs];
>  
> Index: net/pfkeyv2.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/pfkeyv2.c,v
> retrieving revision 1.218
> diff -u -p -r1.218 pfkeyv2.c
> --- net/pfkeyv2.c 14 Jul 2021 22:39:26 -  1.218
> +++ net/pfkeyv2.c 20 Jul 2021 12:48:30 -
> @@ -2019,14 +2019,6 @@ pfkeyv2_send(struct socket *so, void *me
>   }
>   TAILQ_INSERT_HEAD(_policy_head, ipo, ipo_list);
>   ipsec_in_use++;
> - /*
> -  * XXXSMP IPsec data structures are not ready to be
> -  * accessed by multiple Network threads in parallel,
> -  * so force all packets to be processed by the first
> -  * one.
> -  */
> - extern int nettaskqs;
> - nettaskqs = 1;
>   } else {
>   ipo->ipo_last_searched = ipo->ipo_flags = 0;
>   }
> 



Re: forwarding in parallel ipsec workaround

2021-07-20 Thread Alexander Bluhm
On Tue, Jul 20, 2021 at 02:26:02PM +0200, Alexander Bluhm wrote:
> > Note that having multiple threads competing for an exclusive rwlock will
> > generate unnecessary wakeup/sleep cycles every time the lock is released.
> > It is valuable to keep this in mind as it might add extra latency when
> > processing packets.
> 
> Of course.  What do you recommend?

We may have another alternative.

- Always use a shared net lock but also aquire kernel lock.

This seems to be a quick fix for all the MP problmes when running
in parallel.  Then the kernel lock can be removed step by step.

I will create a diff an test.

bluhm

> - Develop outside of the tree until all problems are fixed.
> - Delay work on parallel forwarding until IPsec is MP safe.
> - Accept a possible slowdown of IPsec.  In my measurements it gets
>   faster even with the exclusive lock.
> - Concentrate on making IPsec faster.  By removing the crypto
>   queues you gain much more performance than the exclusive lock may
>   cost.  Did you see the massive kernel locks in my graph?
>   
> http://bluhm.genua.de/perform/results/latest/patch-sys-ip-multiqueue.1/btrace/ssh_perform%40lt13_iperf3_-c10.4.56.36_-P10_-t10_-R-btrace-kstack.0.svg
> - Make ARP MP safe.  Currently we need the kernel lock there or
>   it crashes.  This creates latency for all kind of packets.
> - Convert the rwlock in pf to mutex.  I think your argument counts
>   much more there.  But I cannot prove it.
> 
> My plan is to commit what we have and improve where most pain is.
> This makes incremental steps easier.
> 
> bluhm



parallel ipsec workaround

2021-07-20 Thread Alexander Bluhm
Hi,

The current workaround to disable parallel IPsec does not work.
Variable nettaskqs must not change at runtime.  Interface input
queues choose the thread during init with ifiq_softnet = net_tq().
So it cannot be modified after pfkeyv2_send() sets the first SA in
kernel.  Also changing the calculation in net_tq() may call task_del()
with a different taskq than task_add().

Instead of restricting the index of the first softnet task, use an
exclusive lock.

For now just move the comment.  We can still decide if a write net
lock or kernel lock is better.

ok?

bluhm

Index: net/if.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
retrieving revision 1.642
diff -u -p -r1.642 if.c
--- net/if.c30 Jun 2021 13:23:33 -  1.642
+++ net/if.c20 Jul 2021 13:38:51 -
@@ -834,6 +834,12 @@ if_input_process(struct ifnet *ifp, stru
 * to PF globals, pipex globals, unicast and multicast addresses
 * lists and the socket layer.
 */
+
+   /*
+* XXXSMP IPsec data structures are not ready to be accessed
+* by multiple network threads in parallel.  In this case
+* use an exclusive lock.
+*/
NET_LOCK();
while ((m = ml_dequeue(ml)) != NULL)
(*ifp->if_input)(ifp, m);
@@ -3311,17 +3317,14 @@ unhandled_af(int af)
panic("unhandled af %d", af);
 }
 
-/*
- * XXXSMP This tunable is here to work around the fact that IPsec
- * globals aren't ready to be accessed by multiple threads in
- * parallel.
- */
-int nettaskqs = NET_TASKQ;
-
 struct taskq *
 net_tq(unsigned int ifindex)
 {
struct taskq *t = NULL;
+   static int nettaskqs;
+
+   if (nettaskqs == 0)
+   nettaskqs = min(NET_TASKQ, ncpus);
 
t = nettqmp[ifindex % nettaskqs];
 
Index: net/pfkeyv2.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/pfkeyv2.c,v
retrieving revision 1.218
diff -u -p -r1.218 pfkeyv2.c
--- net/pfkeyv2.c   14 Jul 2021 22:39:26 -  1.218
+++ net/pfkeyv2.c   20 Jul 2021 12:48:30 -
@@ -2019,14 +2019,6 @@ pfkeyv2_send(struct socket *so, void *me
}
TAILQ_INSERT_HEAD(_policy_head, ipo, ipo_list);
ipsec_in_use++;
-   /*
-* XXXSMP IPsec data structures are not ready to be
-* accessed by multiple Network threads in parallel,
-* so force all packets to be processed by the first
-* one.
-*/
-   extern int nettaskqs;
-   nettaskqs = 1;
} else {
ipo->ipo_last_searched = ipo->ipo_flags = 0;
}



Re: forwarding in parallel ipsec workaround

2021-07-20 Thread Alexander Bluhm
On Tue, Jul 20, 2021 at 10:08:09AM +0200, Martin Pieuchot wrote:
> On 19/07/21(Mon) 17:53, Alexander Bluhm wrote:
> > Hi,
> > 
> > I found why the IPsec workaround did not work.
> > 
> > At init time we set ifiq->ifiq_softnet = net_tq(ifp->if_index +
> > idx), but the workaround modifies net_tq() at runtime.  Modifying
> > net_tq() at runtime is bad anyway as task_add() and task_del() could
> > be called with different task queues.
> > 
> > So better use exclusive lock if IPsec is in use.  For me this is
> > running stable.
> 
> Note that having multiple threads competing for an exclusive rwlock will
> generate unnecessary wakeup/sleep cycles every time the lock is released.  
> It is valuable to keep this in mind as it might add extra latency when
> processing packets.

Of course.  What do you recommend?

- Develop outside of the tree until all problems are fixed.
- Delay work on parallel forwarding until IPsec is MP safe.
- Accept a possible slowdown of IPsec.  In my measurements it gets
  faster even with the exclusive lock.
- Concentrate on making IPsec faster.  By removing the crypto
  queues you gain much more performance than the exclusive lock may
  cost.  Did you see the massive kernel locks in my graph?
  
http://bluhm.genua.de/perform/results/latest/patch-sys-ip-multiqueue.1/btrace/ssh_perform%40lt13_iperf3_-c10.4.56.36_-P10_-t10_-R-btrace-kstack.0.svg
- Make ARP MP safe.  Currently we need the kernel lock there or
  it crashes.  This creates latency for all kind of packets.
- Convert the rwlock in pf to mutex.  I think your argument counts
  much more there.  But I cannot prove it.

My plan is to commit what we have and improve where most pain is.
This makes incremental steps easier.

bluhm

> > Index: net/if.c
> > ===
> > RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
> > retrieving revision 1.642
> > diff -u -p -r1.642 if.c
> > --- net/if.c30 Jun 2021 13:23:33 -  1.642
> > +++ net/if.c19 Jul 2021 14:51:31 -
> > @@ -109,6 +109,10 @@
> >  #include 
> >  #endif
> >  
> > +#ifdef IPSEC
> > +#include 
> > +#endif
> > +
> >  #ifdef MPLS
> >  #include 
> >  #endif
> > @@ -238,7 +242,7 @@ int ifq_congestion;
> >  
> >  int netisr;
> >  
> > -#defineNET_TASKQ   1
> > +#defineNET_TASKQ   4
> >  struct taskq   *nettqmp[NET_TASKQ];
> >  
> >  struct task if_input_task_locked = TASK_INITIALIZER(if_netisr, NULL);
> > @@ -815,6 +819,7 @@ void
> >  if_input_process(struct ifnet *ifp, struct mbuf_list *ml)
> >  {
> > struct mbuf *m;
> > +   int exclusive_lock = 0;
> >  
> > if (ml_empty(ml))
> > return;
> > @@ -834,10 +839,25 @@ if_input_process(struct ifnet *ifp, stru
> >  * to PF globals, pipex globals, unicast and multicast addresses
> >  * lists and the socket layer.
> >  */
> > -   NET_LOCK();
> > +
> > +   /*
> > +* XXXSMP IPsec data structures are not ready to be
> > +* accessed by multiple Network threads in parallel.
> > +*/
> > +   if (ipsec_in_use)
> > +   exclusive_lock = 1;
> > +   if (exclusive_lock)
> > +   NET_LOCK();
> > +   else
> > +   NET_RLOCK_IN_SOFTNET();
> > +
> > while ((m = ml_dequeue(ml)) != NULL)
> > (*ifp->if_input)(ifp, m);
> > -   NET_UNLOCK();
> > +
> > +   if (exclusive_lock)
> > +   NET_UNLOCK();
> > +   else
> > +   NET_RUNLOCK_IN_SOFTNET();
> >  }
> >  
> >  void
> > @@ -895,6 +915,12 @@ if_netisr(void *unused)
> > KERNEL_UNLOCK();
> > }
> >  #endif
> > +   if (n & (1 << NETISR_IP))
> > +   ipintr();
> > +#ifdef INET6
> > +   if (n & (1 << NETISR_IPV6))
> > +   ip6intr();
> > +#endif
> >  #if NPPP > 0
> > if (n & (1 << NETISR_PPP)) {
> > KERNEL_LOCK();
> > @@ -3311,17 +3337,14 @@ unhandled_af(int af)
> > panic("unhandled af %d", af);
> >  }
> >  
> > -/*
> > - * XXXSMP This tunable is here to work around the fact that IPsec
> > - * globals aren't ready to be accessed by multiple threads in
> > - * parallel.
> > - */
> > -int nettaskqs = NET_TASKQ;
> > -
> >  struct taskq *
> >  net_tq(unsigned int ifindex)
> >  {
> > struct taskq *t = NULL;
> > +   static int nettaskqs;
> > +
> > +   if (nettaskqs == 0)
> > +   nettaskqs = min(NET_TASKQ, ncpus);
> >  
> > t = nettqmp[ifindex % nettaskqs];
> >  
> > Index: net/if_ethersubr.c
> > ===
> > RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> > retrieving revision 1.275
> > diff -u -p -r1.275 if_ethersubr.c
> > --- net/if_ethersubr.c  7 Jul 2021 20:19:01 -   1.275
> > +++ net/if_ethersubr.c  19 Jul 2021 14:32:48 -
> > @@ -222,7 +222,10 @@ ether_resolve(struct ifnet *ifp, struct 
> >  
> > switch (af) {
> > case AF_INET:
> > +   

Re: ix(4)/riscv64: Make ix(4) work when MSI-X interrupts aren't available

2021-07-20 Thread Mark Kettenis
> Date: Tue, 20 Jul 2021 21:55:56 +1000
> From: Jonathan Matthew 
> 
> On Mon, Jul 19, 2021 at 07:37:10PM -0400, Ashton Fagg wrote:
> > I have an Intel 82599 10 gigabit ethernet card I wanted to get working
> > on my SiFive Unmatched board.
> > 
> > I found the ix(4) driver has some weirdness around MSI-X
> > interrupts. While the driver supports operating both with and without
> > MSI-X support, it's hard-coded via a flag rather than dynamically checking
> > if it's available. If the flag is set (which it always is right now),
> > but MSI-X isn't available, the driver will throw an error and the device
> > won't work:
> > 
> > ix0 at pci7 dev 0 function 0 "Intel 82599" rev 0x01ixgbe_allocate_msix: 
> > pci_intr_map_msix vec 0 failed
> > 
> > The root cause is this call failing in if_ix.c:
> > 
> > if (pci_intr_map_msix(pa, i, )) {
> > printf("ixgbe_allocate_msix: "
> > "pci_intr_map_msix vec %d failed\n", i);
> > error = ENOMEM;
> > goto fail;
> > }
> > 
> > 
> > Because in _pci_intr_map_msix (in sys/arch/riscv64/dev/pci_machdep.c):
> > 
> > if ((pa->pa_flags & PCI_FLAGS_MSI_ENABLED) == 0 ||
> > pci_get_capability(pc, tag, PCI_CAP_MSI, NULL, NULL) == 0)
> > return -1;
> > 
> > The PCI attach flags would not have PCI_FLAGS_MSI_ENABLED set.
> > 
> > The following diff remedies that by checking if PCI_FLAGS_MSI_ENABLED is
> > actually set, rather than just trying and failing because the hard-coded
> > flag says so. It also enables ix(4) in the kernel config for
> > riscv64. Effectively, the driver will now only try to use MSI-X if the
> > machine is advertising it to be available.
> 
> I'd rather not have to do this in every driver.  We otherwise check that flag
> inside the pci interrupt functions rather than in the driver code, so we
> should do so in pci_intr_msix_count() too, since that's what we call in
> multi-queue nic drivers to decide whether to use MSI-X.  Drivers that only
> want a single vector will just call pci_intr_map_msix() and fall back to MSI
> or legacy interrupts if that fails.
> 
> I posted the alternate version of this diff to misc@ a few days ago,
> which repeats the checks used to set PCI_FLAGS_MSI_ENABLED in
> pci_intr_msix_count(), rather than passing in struct
> pci_attach_args, in case we prefer to do it that way.

I don't really read misc@, so don't post your patches there.

> Mark, what do you think?

Yeah, making pci_intr_msix_count() should return 0 if MSIs are not
supported.  A bit strange though to pass both pa and pa->pa_tag.  I'd
change the function to only take pa as an argument.

> Index: if_bnxt.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
> retrieving revision 1.32
> diff -u -p -u -p -r1.32 if_bnxt.c
> --- if_bnxt.c 24 Apr 2021 09:37:46 -  1.32
> +++ if_bnxt.c 20 Jul 2021 11:23:22 -
> @@ -537,7 +537,7 @@ bnxt_attach(struct device *parent, struc
>   sc->sc_flags |= BNXT_FLAG_MSIX;
>   intrstr = pci_intr_string(sc->sc_pc, ih);
>  
> - nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
> + nmsix = pci_intr_msix_count(pa, pa->pa_tag);
>   if (nmsix > 1) {
>   sc->sc_ih = pci_intr_establish(sc->sc_pc, ih,
>   IPL_NET | IPL_MPSAFE, bnxt_admin_intr, sc, 
> DEVNAME(sc));
> Index: if_ix.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
> retrieving revision 1.178
> diff -u -p -u -p -r1.178 if_ix.c
> --- if_ix.c   22 Dec 2020 23:25:37 -  1.178
> +++ if_ix.c   20 Jul 2021 11:23:22 -
> @@ -1783,7 +1783,7 @@ ixgbe_setup_msix(struct ix_softc *sc)
>   if (!ixgbe_enable_msix)
>   return;
>  
> - nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
> + nmsix = pci_intr_msix_count(pa, pa->pa_tag);
>   if (nmsix <= 1)
>   return;
>  
> Index: if_ixl.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
> retrieving revision 1.74
> diff -u -p -u -p -r1.74 if_ixl.c
> --- if_ixl.c  26 Mar 2021 08:02:34 -  1.74
> +++ if_ixl.c  20 Jul 2021 11:23:22 -
> @@ -1795,7 +1795,7 @@ ixl_attach(struct device *parent, struct
>   }
>  
>   if (pci_intr_map_msix(pa, 0, >sc_ih) == 0) {
> - int nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
> + int nmsix = pci_intr_msix_count(pa, pa->pa_tag);
>   if (nmsix > 1) { /* we used 1 (the 0th) for the adminq */
>   nmsix--;
>  
> Index: if_mcx.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_mcx.c,v
> retrieving revision 1.101
> diff -u -p -u -p -r1.101 if_mcx.c
> --- if_mcx.c  2 Jun 2021 19:16:11 -   1.101
> +++ if_mcx.c  20 Jul 

Re: ix(4)/riscv64: Make ix(4) work when MSI-X interrupts aren't available

2021-07-20 Thread Jonathan Matthew
On Mon, Jul 19, 2021 at 07:37:10PM -0400, Ashton Fagg wrote:
> I have an Intel 82599 10 gigabit ethernet card I wanted to get working
> on my SiFive Unmatched board.
> 
> I found the ix(4) driver has some weirdness around MSI-X
> interrupts. While the driver supports operating both with and without
> MSI-X support, it's hard-coded via a flag rather than dynamically checking
> if it's available. If the flag is set (which it always is right now),
> but MSI-X isn't available, the driver will throw an error and the device
> won't work:
> 
> ix0 at pci7 dev 0 function 0 "Intel 82599" rev 0x01ixgbe_allocate_msix: 
> pci_intr_map_msix vec 0 failed
> 
> The root cause is this call failing in if_ix.c:
> 
>   if (pci_intr_map_msix(pa, i, )) {
>   printf("ixgbe_allocate_msix: "
>   "pci_intr_map_msix vec %d failed\n", i);
>   error = ENOMEM;
>   goto fail;
>   }
> 
> 
> Because in _pci_intr_map_msix (in sys/arch/riscv64/dev/pci_machdep.c):
> 
> if ((pa->pa_flags & PCI_FLAGS_MSI_ENABLED) == 0 ||
>   pci_get_capability(pc, tag, PCI_CAP_MSI, NULL, NULL) == 0)
>   return -1;
> 
> The PCI attach flags would not have PCI_FLAGS_MSI_ENABLED set.
> 
> The following diff remedies that by checking if PCI_FLAGS_MSI_ENABLED is
> actually set, rather than just trying and failing because the hard-coded
> flag says so. It also enables ix(4) in the kernel config for
> riscv64. Effectively, the driver will now only try to use MSI-X if the
> machine is advertising it to be available.

I'd rather not have to do this in every driver.  We otherwise check that flag
inside the pci interrupt functions rather than in the driver code, so we
should do so in pci_intr_msix_count() too, since that's what we call in
multi-queue nic drivers to decide whether to use MSI-X.  Drivers that only
want a single vector will just call pci_intr_map_msix() and fall back to MSI
or legacy interrupts if that fails.

I posted the alternate version of this diff to misc@ a few days ago, which
repeats the checks used to set PCI_FLAGS_MSI_ENABLED in pci_intr_msix_count(),
rather than passing in struct pci_attach_args, in case we prefer to do it that
way.

Mark, what do you think?


Index: if_bnxt.c
===
RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
retrieving revision 1.32
diff -u -p -u -p -r1.32 if_bnxt.c
--- if_bnxt.c   24 Apr 2021 09:37:46 -  1.32
+++ if_bnxt.c   20 Jul 2021 11:23:22 -
@@ -537,7 +537,7 @@ bnxt_attach(struct device *parent, struc
sc->sc_flags |= BNXT_FLAG_MSIX;
intrstr = pci_intr_string(sc->sc_pc, ih);
 
-   nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   nmsix = pci_intr_msix_count(pa, pa->pa_tag);
if (nmsix > 1) {
sc->sc_ih = pci_intr_establish(sc->sc_pc, ih,
IPL_NET | IPL_MPSAFE, bnxt_admin_intr, sc, 
DEVNAME(sc));
Index: if_ix.c
===
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.178
diff -u -p -u -p -r1.178 if_ix.c
--- if_ix.c 22 Dec 2020 23:25:37 -  1.178
+++ if_ix.c 20 Jul 2021 11:23:22 -
@@ -1783,7 +1783,7 @@ ixgbe_setup_msix(struct ix_softc *sc)
if (!ixgbe_enable_msix)
return;
 
-   nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   nmsix = pci_intr_msix_count(pa, pa->pa_tag);
if (nmsix <= 1)
return;
 
Index: if_ixl.c
===
RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
retrieving revision 1.74
diff -u -p -u -p -r1.74 if_ixl.c
--- if_ixl.c26 Mar 2021 08:02:34 -  1.74
+++ if_ixl.c20 Jul 2021 11:23:22 -
@@ -1795,7 +1795,7 @@ ixl_attach(struct device *parent, struct
}
 
if (pci_intr_map_msix(pa, 0, >sc_ih) == 0) {
-   int nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   int nmsix = pci_intr_msix_count(pa, pa->pa_tag);
if (nmsix > 1) { /* we used 1 (the 0th) for the adminq */
nmsix--;
 
Index: if_mcx.c
===
RCS file: /cvs/src/sys/dev/pci/if_mcx.c,v
retrieving revision 1.101
diff -u -p -u -p -r1.101 if_mcx.c
--- if_mcx.c2 Jun 2021 19:16:11 -   1.101
+++ if_mcx.c20 Jul 2021 11:23:22 -
@@ -2831,7 +2831,7 @@ mcx_attach(struct device *parent, struct
goto teardown;
}
 
-   msix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
+   msix = pci_intr_msix_count(pa, pa->pa_tag);
if (msix < 2) {
printf(": not enough msi-x vectors\n");
goto teardown;
Index: if_vmx.c
===
RCS file: 

Re: iwx(4): Whitespace fix

2021-07-20 Thread Stefan Sperling
On Mon, Jul 19, 2021 at 09:20:17PM -0400, Ashton Fagg wrote:
> Found this while poking around - an extra newline in if_iwx.c.
> 

> Index: sys/dev/pci/if_iwx.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_iwx.c,v
> retrieving revision 1.69
> diff -u -p -u -p -r1.69 if_iwx.c
> --- sys/dev/pci/if_iwx.c  18 Jul 2021 13:07:13 -  1.69
> +++ sys/dev/pci/if_iwx.c  20 Jul 2021 01:17:35 -
> @@ -502,7 +502,6 @@ iwx_is_mimo_mcs(int mcs)
>  {
>   int ridx = iwx_mcs2ridx[mcs];
>   return iwx_is_mimo_ht_plcp(iwx_rates[ridx].ht_plcp);
> - 
>  }
>  
>  int


Sorry, but this patch is so trivial that it isn't really worth it.
A patch which grouped several stylistic fixes together might be more useful.
Since some iwx code was derived from iwm and iwn, fixing such issues across
all these 3 drivers would be even better.



Re: forwarding in parallel ipsec workaround

2021-07-20 Thread Martin Pieuchot
On 19/07/21(Mon) 17:53, Alexander Bluhm wrote:
> Hi,
> 
> I found why the IPsec workaround did not work.
> 
> At init time we set ifiq->ifiq_softnet = net_tq(ifp->if_index +
> idx), but the workaround modifies net_tq() at runtime.  Modifying
> net_tq() at runtime is bad anyway as task_add() and task_del() could
> be called with different task queues.
> 
> So better use exclusive lock if IPsec is in use.  For me this is
> running stable.

Note that having multiple threads competing for an exclusive rwlock will
generate unnecessary wakeup/sleep cycles every time the lock is released.  
It is valuable to keep this in mind as it might add extra latency when
processing packets.

> Index: net/if.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
> retrieving revision 1.642
> diff -u -p -r1.642 if.c
> --- net/if.c  30 Jun 2021 13:23:33 -  1.642
> +++ net/if.c  19 Jul 2021 14:51:31 -
> @@ -109,6 +109,10 @@
>  #include 
>  #endif
>  
> +#ifdef IPSEC
> +#include 
> +#endif
> +
>  #ifdef MPLS
>  #include 
>  #endif
> @@ -238,7 +242,7 @@ int   ifq_congestion;
>  
>  int   netisr;
>  
> -#define  NET_TASKQ   1
> +#define  NET_TASKQ   4
>  struct taskq *nettqmp[NET_TASKQ];
>  
>  struct task if_input_task_locked = TASK_INITIALIZER(if_netisr, NULL);
> @@ -815,6 +819,7 @@ void
>  if_input_process(struct ifnet *ifp, struct mbuf_list *ml)
>  {
>   struct mbuf *m;
> + int exclusive_lock = 0;
>  
>   if (ml_empty(ml))
>   return;
> @@ -834,10 +839,25 @@ if_input_process(struct ifnet *ifp, stru
>* to PF globals, pipex globals, unicast and multicast addresses
>* lists and the socket layer.
>*/
> - NET_LOCK();
> +
> + /*
> +  * XXXSMP IPsec data structures are not ready to be
> +  * accessed by multiple Network threads in parallel.
> +  */
> + if (ipsec_in_use)
> + exclusive_lock = 1;
> + if (exclusive_lock)
> + NET_LOCK();
> + else
> + NET_RLOCK_IN_SOFTNET();
> +
>   while ((m = ml_dequeue(ml)) != NULL)
>   (*ifp->if_input)(ifp, m);
> - NET_UNLOCK();
> +
> + if (exclusive_lock)
> + NET_UNLOCK();
> + else
> + NET_RUNLOCK_IN_SOFTNET();
>  }
>  
>  void
> @@ -895,6 +915,12 @@ if_netisr(void *unused)
>   KERNEL_UNLOCK();
>   }
>  #endif
> + if (n & (1 << NETISR_IP))
> + ipintr();
> +#ifdef INET6
> + if (n & (1 << NETISR_IPV6))
> + ip6intr();
> +#endif
>  #if NPPP > 0
>   if (n & (1 << NETISR_PPP)) {
>   KERNEL_LOCK();
> @@ -3311,17 +3337,14 @@ unhandled_af(int af)
>   panic("unhandled af %d", af);
>  }
>  
> -/*
> - * XXXSMP This tunable is here to work around the fact that IPsec
> - * globals aren't ready to be accessed by multiple threads in
> - * parallel.
> - */
> -int   nettaskqs = NET_TASKQ;
> -
>  struct taskq *
>  net_tq(unsigned int ifindex)
>  {
>   struct taskq *t = NULL;
> + static int nettaskqs;
> +
> + if (nettaskqs == 0)
> + nettaskqs = min(NET_TASKQ, ncpus);
>  
>   t = nettqmp[ifindex % nettaskqs];
>  
> Index: net/if_ethersubr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> retrieving revision 1.275
> diff -u -p -r1.275 if_ethersubr.c
> --- net/if_ethersubr.c7 Jul 2021 20:19:01 -   1.275
> +++ net/if_ethersubr.c19 Jul 2021 14:32:48 -
> @@ -222,7 +222,10 @@ ether_resolve(struct ifnet *ifp, struct 
>  
>   switch (af) {
>   case AF_INET:
> + KERNEL_LOCK();
> + /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> + KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IP);
> @@ -245,7 +248,10 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #ifdef INET6
>   case AF_INET6:
> + KERNEL_LOCK();
> + /* XXXSMP there is a MP race in nd6_resolve() */
>   error = nd6_resolve(ifp, rt, m, dst, eh->ether_dhost);
> + KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IPV6);
> Index: net/ifq.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
> retrieving revision 1.44
> diff -u -p -r1.44 ifq.c
> --- net/ifq.c 9 Jul 2021 01:22:05 -   1.44
> +++ net/ifq.c 19 Jul 2021 14:32:48 -
> @@ -243,7 +243,7 @@ void
>  ifq_init(struct ifqueue *ifq, struct ifnet *ifp, unsigned int idx)
>  {
>   ifq->ifq_if = ifp;
> - ifq->ifq_softnet =