Re: timecounting: use full 96-bit product when computing high-res time

2021-10-14 Thread Scott Cheloha
> On Oct 14, 2021, at 18:09, Scott Cheloha  wrote:
> 
> On Thu, Oct 14, 2021 at 11:37:44PM +0200, Mark Kettenis wrote:
>>> Date: Thu, 14 Oct 2021 16:13:17 -0500
>>> From: Scott Cheloha 
>>> 
>>> [...]
>>> 
>>> Thoughts?
>> 
>> I never understood this code.  But I don't understand that if our
>> timecounters are only 32 bits, we need more than 64 bits to store a
>> time difference on the order of seconds...
> 
> Let's use the TSC on my laptop as a practical example.
> 
> […]
> 
> If we capture the full 96-bit result we can handle a blackout of
> 
>37513334511718751490 / 19066590438009199874 = 2.033601938739046
> 
> or roughly 2 seconds on the primary CPU.

Shoot, I typo'd this.

This divisor should be UINT64_MAX.  The
resulting quotient is still the same.

Must have copied the wrong number,
my bad.



missing logbl(3) where 64-bit long double

2021-10-14 Thread George Koehler
Hello tech list,

OpenBSD's libm is missing logbl(3) function on archs where long double
is double.  The missing logbl breaks my builds of "NEW: devel/dtools"
(ports list) on macppc and powerpc64.  I guess that other software
avoids logbl and prefers ilogbl(3) or frexp(3).

This diff adds logbl and fixes dtools, ok?
--George

Index: src/s_logb.c
===
RCS file: /cvs/src/lib/libm/src/s_logb.c,v
retrieving revision 1.4
diff -u -p -r1.4 s_logb.c
--- src/s_logb.c12 Sep 2016 19:47:02 -  1.4
+++ src/s_logb.c14 Oct 2021 21:49:22 -
@@ -33,3 +33,4 @@ logb(double x)
return (double) (ix-1023); 
 }
 DEF_STD(logb);
+LDBL_MAYBE_UNUSED_CLONE(logb);



Re: timecounting: use full 96-bit product when computing high-res time

2021-10-14 Thread Scott Cheloha
On Thu, Oct 14, 2021 at 11:37:44PM +0200, Mark Kettenis wrote:
> > Date: Thu, 14 Oct 2021 16:13:17 -0500
> > From: Scott Cheloha 
> > 
> > [...]
> > 
> > Thoughts?
> 
> I never understood this code.  But I don't understand that if our
> timecounters are only 32 bits, we need more than 64 bits to store a
> time difference on the order of seconds...

Let's use the TSC on my laptop as a practical example.

We have a 64-bit register, so it provides a 32-bit counter.  The
maximum value tc_delta() can return is UINT_MAX, i.e. UINT32_MAX.

What is the frequency of my TSC?

$ sysctl machdep.tscfreq
machdep.tscfreq=211200

Given that frequency, the unadjusted scale for the counter is:

scale = 2^64 / frequency = 2 ^64 / 211200 = 8734253822

Of course, we can't do that math with 64-bit integers, so we
approximate (see tc_windup() in kern_tc.c):

scale = (1 << 63) / frequency * 2 = 8734253822

(In this case it makes no difference, the scale is the same.)

Assuming the unadjusted scale, the maximum possible product is:

UINT32_MAX * 8734253822 = 4294967295 * 8734253822 = 37513334511718751490

Recall that UINT64_MAX is 18446744073709551615.  It's considerably
smaller than 37513334511718751490.

If we capture the full 96-bit result we can handle a blackout of

37513334511718751490 / 19066590438009199874 = 2.033601938739046

or roughly 2 seconds on the primary CPU.

If we don't handle the full 96-bit result we overflow after 1 second.

(It follows, then, that lower frequency timecounters yield longer grace
periods before we lose time.)



Re: timecounting: use full 96-bit product when computing high-res time

2021-10-14 Thread Mark Kettenis
> Date: Thu, 14 Oct 2021 16:13:17 -0500
> From: Scott Cheloha 
> 
> Hi,
> 
> When we compute high resolution time, both in the kernel and in libc,
> we get a 32-bit (or smaller) value from the active timecounter and
> scale it up into a 128-bit bintime.
> 
> The scaling math currently looks like this in the kernel:
> 
>   *bt = th->th_offset;
>   bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
> 
> It looks similar in libc.
> 
> The problem with this code is that if the product
> 
>   th->tc_scale * tc_delta(th)
> 
> exceeds UINT64_MAX, the result overflows and we lose time.
> 
> This is unlikely to happen under normal circumstances.  Normally, the
> clock interrupt fires about HZ times per second, so the product
> doesn't get very close to overflow.  You'd need to run with interrupts
> disabled on CPU0 for about 1 second to see an overflow.
> 
> However, I have seen this happen before in the wild.  If you try to
> build LLVM on an EdgeRouter Lite (512MB RAM), it will swap *very
> slowly* to the SD card and eventually lose time.
> 
> I think it is inevitable that we will encounter bugs of this nature at
> some point.  In general, I would like the timecounting layer to be
> more robust to nasty situations like these.  Losing time is a
> catastrophic failure.  It screws everything up.
> 
> The solution to this problem is to use the full 96-bit product when we
> scale the count up into a bintime.  We're multiplying a u_int
> (32-bit), the count, by a uint64_t, the scale, but we're not capturing
> the upper 32 bits of that product.  If we did, we would have a longer
> grace period between clock interrupts before we lost time.
> 
> The attached patch adds a TIMECOUNT_TO_BINTIME() function to sys/time.h
> and puts it to use in sys/kern/kern_tc.c and lib/libc/sys/microtime.c.
> The math is a bit boring, see the patch if you are curious.
> 
> As for the cost, there is a small but significant increase in overhead
> when reading the clock with the TSC.  Slower timecounters (HPET, ACPI
> timer) are so slow the extra overhead is noise.
> 
> I ran my usual synthetic benchmark against a patched and unpatched
> libc for a number of different sample sizes.  I ran the benchmark in
> single-user mode to avoid ntpd(8) interference (adjtime(2), adjfreq(2))
> and set hw.setperf=100 to ensure uniform results.
> 
> It looks to me that on amd64, userspace clock_gettime(2) is up to ~10%
> slower with the patch.  But there is a lot of variation between the
> comparisons, so I don't think it's a consistent 10%.  I'd say 10% is
> an upper bound.
> 
> To ensure the discussion is well-framed: we're talking about a 10%
> overhead on a 40-80 nanosecond function call.
> 
> Obviously the overhead will vary by platform.  I imagine 32-bit
> platforms will exceed 10% because we're now doing even more 64-bit
> math.  I'd be interested in results from e.g. macppc or armv7.
> 
> ministat summaries are below.  My benchmark program and script is
> below that.  The patch is attached below that.
> 
> Thoughts?

I never understood this code.  But I don't understand that if our
timecounters are only 32 bits, we need more than 64 bits to store a
time difference on the order of seconds...

> --
> 
> All values (min, max, med, avg, stddev) are in seconds.
> 
> Call clock_gettime(2) 10,000 times.  Do that 1000 times.  ~5.88% increase.
> 
> x clock-gettime-tsc-user-head-1K-10K.dat1
> + clock-gettime-tsc-user-patch-1K-10K.dat1
> N   Min   MaxMedian   AvgStddev
> x 1000   0.000329069   0.000362506   0.000335691 0.00033582683 3.6951597e-06
> + 1000   0.000346276   0.000381319   0.000353605 0.00035557716  3.962225e-06
> Difference at 99.5% confidence
> 1.97503e-05 +/- 5.29405e-07
> 5.88111% +/- 0.157642%
> (Student's t, pooled s = 3.83102e-06)
> 
> Call clock_gettime(2) 100,000 times.  Do that 1000 times.  ~1.31% increase.
> 
> x clock-gettime-tsc-user-head-1K-100K.dat1
> + clock-gettime-tsc-user-patch-1K-100K.dat1
> N   Min   MaxMedian   AvgStddev
> x 1000   0.003432839   0.003679475   0.003472749  0.0034827782 3.0275757e-05
> + 1000   0.003426934   0.003776351   0.003524453  0.0035284519 3.8631925e-05
> Difference at 99.5% confidence
> 4.56738e-05 +/- 4.79602e-06
> 1.31142% +/- 0.137707%
> (Student's t, pooled s = 3.47062e-05)
> 
> Call clock_gettime(2) 5,000 times.  Do that 10,000 times.  ~8.53% increase.
> 
> x clock-gettime-tsc-user-head-10K-5K.dat1
> + clock-gettime-tsc-user-patch-10K-5K.dat1
> N   Min   MaxMedian   AvgStddev
> x 1   0.000159681   0.000220758   0.000186826 0.00017805393  1.177249e-05
> + 1   0.000171105   0.000217009   0.000200789 0.00019325052 1.1280074e-05
> Difference at 99.5% confidence
> 1.51966e-05 +/- 5.03804e-07
> 8.53482% +/- 0.28295%
> (Student's t, pooled s = 1.15289e-05)
> 
> Call clock_gettime(2) 

Re: head(1): fully support the legacy -count syntax

2021-10-14 Thread Alexander Hall
On Mon, Oct 11, 2021 at 09:13:16AM -0500, Scott Cheloha wrote:
> On Sun, Oct 10, 2021 at 08:26:04PM -0400, gwes wrote:
> > On 10/10/21 5:03 PM, Scott Cheloha wrote:
> > > [...]
> > > 
> > > If we want to have the unportable legacy syntax then it should work
> > > like other option arguments.  Option arguments can be respecified
> > > multiple times in most other utilities.  The last such appearance of
> > > an option argument is the one the utility uses.  That's how option
> > > arguments work, for the most part.
> > > 
> > > We could remove the legacy syntax and shave a couple lines of code.
> > > 
> > > OTOH, supporting it fully is super easy.  I've provided a patch.
> > > 
> > The man page says head [-count | -n count] [file ...]
> > There are only two valid switches.
> > One, the other, or none.
> > Using more than one is -undefined-.
> > I doubt it ever has been defined.
> > It could be asserted that any more than 1 switch is an error.
> 
> No.
> 
> It wouldn't be an error, it would be undefined.  Just like you said.

If we have survived with a crippled backwards compat since '95, it's
pretty pointless to re-add it now.

I also started thinking about e.g. kill(1). We don't support multiple
signals there.

Thinking a bit more, and testing some, I'm pretty sure the following 
change to usage() and the man page would suffice.

-   usage: head [-count | -n count] [file ...]
+   usage: head [-count] [-n count] [file ...]

/Alexander


Index: head.1
===
RCS file: /cvs/src/usr.bin/head/head.1,v
retrieving revision 1.23
diff -u -p -r1.23 head.1
--- head.1  25 Oct 2015 21:50:32 -  1.23
+++ head.1  14 Oct 2021 21:28:31 -
@@ -37,7 +37,8 @@
 .Nd display first few lines of files
 .Sh SYNOPSIS
 .Nm head
-.Op Fl Ar count | Fl n Ar count
+.Op Fl Ar count
+.Op Fl n Ar count
 .Op Ar
 .Sh DESCRIPTION
 The
Index: head.c
===
RCS file: /cvs/src/usr.bin/head/head.c,v
retrieving revision 1.22
diff -u -p -r1.22 head.c
--- head.c  10 Oct 2021 15:57:25 -  1.22
+++ head.c  14 Oct 2021 21:28:31 -
@@ -114,6 +114,6 @@ main(int argc, char *argv[])
 static void
 usage(void)
 {
-   fputs("usage: head [-count | -n count] [file ...]\n", stderr);
+   fputs("usage: head [-count] [-n count] [file ...]\n", stderr);
exit(1);
 }



timecounting: use full 96-bit product when computing high-res time

2021-10-14 Thread Scott Cheloha
Hi,

When we compute high resolution time, both in the kernel and in libc,
we get a 32-bit (or smaller) value from the active timecounter and
scale it up into a 128-bit bintime.

The scaling math currently looks like this in the kernel:

*bt = th->th_offset;
bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);

It looks similar in libc.

The problem with this code is that if the product

th->tc_scale * tc_delta(th)

exceeds UINT64_MAX, the result overflows and we lose time.

This is unlikely to happen under normal circumstances.  Normally, the
clock interrupt fires about HZ times per second, so the product
doesn't get very close to overflow.  You'd need to run with interrupts
disabled on CPU0 for about 1 second to see an overflow.

However, I have seen this happen before in the wild.  If you try to
build LLVM on an EdgeRouter Lite (512MB RAM), it will swap *very
slowly* to the SD card and eventually lose time.

I think it is inevitable that we will encounter bugs of this nature at
some point.  In general, I would like the timecounting layer to be
more robust to nasty situations like these.  Losing time is a
catastrophic failure.  It screws everything up.

The solution to this problem is to use the full 96-bit product when we
scale the count up into a bintime.  We're multiplying a u_int
(32-bit), the count, by a uint64_t, the scale, but we're not capturing
the upper 32 bits of that product.  If we did, we would have a longer
grace period between clock interrupts before we lost time.

The attached patch adds a TIMECOUNT_TO_BINTIME() function to sys/time.h
and puts it to use in sys/kern/kern_tc.c and lib/libc/sys/microtime.c.
The math is a bit boring, see the patch if you are curious.

As for the cost, there is a small but significant increase in overhead
when reading the clock with the TSC.  Slower timecounters (HPET, ACPI
timer) are so slow the extra overhead is noise.

I ran my usual synthetic benchmark against a patched and unpatched
libc for a number of different sample sizes.  I ran the benchmark in
single-user mode to avoid ntpd(8) interference (adjtime(2), adjfreq(2))
and set hw.setperf=100 to ensure uniform results.

It looks to me that on amd64, userspace clock_gettime(2) is up to ~10%
slower with the patch.  But there is a lot of variation between the
comparisons, so I don't think it's a consistent 10%.  I'd say 10% is
an upper bound.

To ensure the discussion is well-framed: we're talking about a 10%
overhead on a 40-80 nanosecond function call.

Obviously the overhead will vary by platform.  I imagine 32-bit
platforms will exceed 10% because we're now doing even more 64-bit
math.  I'd be interested in results from e.g. macppc or armv7.

ministat summaries are below.  My benchmark program and script is
below that.  The patch is attached below that.

Thoughts?

--

All values (min, max, med, avg, stddev) are in seconds.

Call clock_gettime(2) 10,000 times.  Do that 1000 times.  ~5.88% increase.

x clock-gettime-tsc-user-head-1K-10K.dat1
+ clock-gettime-tsc-user-patch-1K-10K.dat1
N   Min   MaxMedian   AvgStddev
x 1000   0.000329069   0.000362506   0.000335691 0.00033582683 3.6951597e-06
+ 1000   0.000346276   0.000381319   0.000353605 0.00035557716  3.962225e-06
Difference at 99.5% confidence
1.97503e-05 +/- 5.29405e-07
5.88111% +/- 0.157642%
(Student's t, pooled s = 3.83102e-06)

Call clock_gettime(2) 100,000 times.  Do that 1000 times.  ~1.31% increase.

x clock-gettime-tsc-user-head-1K-100K.dat1
+ clock-gettime-tsc-user-patch-1K-100K.dat1
N   Min   MaxMedian   AvgStddev
x 1000   0.003432839   0.003679475   0.003472749  0.0034827782 3.0275757e-05
+ 1000   0.003426934   0.003776351   0.003524453  0.0035284519 3.8631925e-05
Difference at 99.5% confidence
4.56738e-05 +/- 4.79602e-06
1.31142% +/- 0.137707%
(Student's t, pooled s = 3.47062e-05)

Call clock_gettime(2) 5,000 times.  Do that 10,000 times.  ~8.53% increase.

x clock-gettime-tsc-user-head-10K-5K.dat1
+ clock-gettime-tsc-user-patch-10K-5K.dat1
N   Min   MaxMedian   AvgStddev
x 1   0.000159681   0.000220758   0.000186826 0.00017805393  1.177249e-05
+ 1   0.000171105   0.000217009   0.000200789 0.00019325052 1.1280074e-05
Difference at 99.5% confidence
1.51966e-05 +/- 5.03804e-07
8.53482% +/- 0.28295%
(Student's t, pooled s = 1.15289e-05)

Call clock_gettime(2) 10,000 times.  Do that 10,000 times.  ~8.74% increase.

x clock-gettime-tsc-user-head-10K-10K.dat1
+ clock-gettime-tsc-user-patch-10K-10K.dat1
N   Min   MaxMedian   AvgStddev
x 1   0.000320472   0.000410915   0.000371971 0.00035789663 2.2141645e-05
+ 1   0.000340087   0.000436915   0.000393461 0.00038918836 1.8884201e-05
Difference at 99.5% confidence
3.12917e-05 +/- 8.99219e-07
8.74323% 

Re: lrint(3) and llrint(3) implementation

2021-10-14 Thread Todd C . Miller
On Thu, 14 Oct 2021 23:06:01 +0200, Mark Kettenis wrote:

> This doesn't work because of the hidden symbol madness.  The way
> things currently are we need one copy (s_lrint.c) with:
>
> DEF_STD(fn);
> LDBL_MAYBE_CLONE(fn);
>
> another version (s_lrintf.c) with
>
> DEF_STD(fn);
>
> and a final version (s_lrintl.c) without any magic.
>
> I suppose I could have the s_lrint.c and s_lrintf.c include
> s_lrintfl.c if you think that would be better?

Don't worry about it.  I forgot about the symbol hiding magic^Wmadness.

 - todd



Re: lrint(3) and llrint(3) implementation

2021-10-14 Thread Mark Kettenis
> From: "Todd C. Miller" 
> Date: Thu, 14 Oct 2021 14:40:13 -0600
> 
> On Thu, 14 Oct 2021 01:15:56 +0200, Mark Kettenis wrote:
> 
> > Currently the lib/libm/msun/run-lrint_test regress fails on powerpc64
> > and other platforms.  Our implementation came from NetBSD, but NetBSD
> > switched to the implementation from FreeBSD some time ago.  That is
> > the same implementation that we already use for lrintl(3) and
> > llrintl(3).
> >
> > Diff below makes us use that implementation for lrint(3), lrintf(3),
> > llrint(3) and llrintf(3) as well.  This makes the regress test pass on
> > powerpc64.
> 
> Doesn't this mean we end up with three copies of what is essentially
> the same code in s_lrint.c, s_lrintf.c and s_lrintl.c?
> 
> I know it's not much but why not just move the actual code to
> s_lrint.c and include that in the others?

This doesn't work because of the hidden symbol madness.  The way
things currently are we need one copy (s_lrint.c) with:

DEF_STD(fn);
LDBL_MAYBE_CLONE(fn);

another version (s_lrintf.c) with

DEF_STD(fn);

and a final version (s_lrintl.c) without any magic.

I suppose I could have the s_lrint.c and s_lrintf.c include
s_lrintfl.c if you think that would be better?



Re: lrint(3) and llrint(3) implementation

2021-10-14 Thread Mark Kettenis
> Date: Thu, 14 Oct 2021 22:23:25 +0200
> From: Alexander Bluhm 
> 
> On Thu, Oct 14, 2021 at 01:15:56AM +0200, Mark Kettenis wrote:
> > Currently the lib/libm/msun/run-lrint_test regress fails on powerpc64
> > and other platforms.  Our implementation came from NetBSD, but NetBSD
> > switched to the implementation from FreeBSD some time ago.  That is
> > the same implementation that we already use for lrintl(3) and
> > llrintl(3).
> >
> > Diff below makes us use that implementation for lrint(3), lrintf(3),
> > llrint(3) and llrintf(3) as well.  This makes the regress test pass on
> > powerpc64.
> >
> > ok?
> 
> libm Tests on powerpc64 pass now.
> amd64 and i386 still passing.
> arm64 and sparc64 still failing.

The remaining arm64 failures are for the "long double" versions.  The
problem there is that "long double" is implemented as soft-float and
the soft-float implementation in libcompiler_rt doesn't set the
appropriate floating-point status flags.  I suspect that the sparc64
failures are similar since it is in the same boat (but uses libgcc of
course).

So don't expect this to be fixed anytime soon.

> So it is an improvement.
> 
> OK bluhm@
> 
> > Index: lib/libm/src/s_llrint.c
> > ===
> > RCS file: /cvs/src/lib/libm/src/s_llrint.c,v
> > retrieving revision 1.6
> > diff -u -p -r1.6 s_llrint.c
> > --- lib/libm/src/s_llrint.c 12 Sep 2016 19:47:02 -  1.6
> > +++ lib/libm/src/s_llrint.c 13 Oct 2021 23:12:11 -
> > @@ -1,14 +1,12 @@
> > -/* $OpenBSD: s_llrint.c,v 1.6 2016/09/12 19:47:02 guenther Exp $   */
> > -/* $NetBSD: llrint.c,v 1.2 2004/10/13 15:18:32 drochner Exp $ */
> > +/* $OpenBSD$   */
> >
> >  /*
> > - * Written by Matthias Drochner .
> > - * Public domain.
> > + * Written by Martynas Venckus.  Public domain
> >   */
> >
> > -#define LRINTNAME llrint
> > -#define RESTYPE long long int
> > -#define RESTYPE_MIN LLONG_MIN
> > -#define RESTYPE_MAX LLONG_MAX
> > +#define type   double
> > +#define rounditrint
> > +#define dtype  long long
> > +#define fn llrint
> >
> >  #include "s_lrint.c"
> > Index: lib/libm/src/s_llrintf.c
> > ===
> > RCS file: /cvs/src/lib/libm/src/s_llrintf.c,v
> > retrieving revision 1.2
> > diff -u -p -r1.2 s_llrintf.c
> > --- lib/libm/src/s_llrintf.c25 Sep 2006 22:16:48 -  1.2
> > +++ lib/libm/src/s_llrintf.c13 Oct 2021 23:12:11 -
> > @@ -1,14 +1,12 @@
> > -/* $OpenBSD: s_llrintf.c,v 1.2 2006/09/25 22:16:48 kettenis Exp $  */
> > -/* $NetBSD: llrintf.c,v 1.2 2004/10/13 15:18:32 drochner Exp $ */
> > +/* $OpenBSD$   */
> >
> >  /*
> > - * Written by Matthias Drochner .
> > - * Public domain.
> > + * Written by Martynas Venckus.  Public domain
> >   */
> >
> > -#define LRINTNAME llrintf
> > -#define RESTYPE long long int
> > -#define RESTYPE_MIN LLONG_MIN
> > -#define RESTYPE_MAX LLONG_MAX
> > +#define type   float
> > +#define rounditrintf
> > +#define dtype  long long
> > +#define fn llrintf
> >
> >  #include "s_lrintf.c"
> > Index: lib/libm/src/s_lrint.c
> > ===
> > RCS file: /cvs/src/lib/libm/src/s_lrint.c,v
> > retrieving revision 1.11
> > diff -u -p -r1.11 s_lrint.c
> > --- lib/libm/src/s_lrint.c  12 Sep 2016 19:47:02 -  1.11
> > +++ lib/libm/src/s_lrint.c  13 Oct 2021 23:12:11 -
> > @@ -1,9 +1,8 @@
> > -/* $OpenBSD: s_lrint.c,v 1.11 2016/09/12 19:47:02 guenther Exp $   */
> > -/* $NetBSD: lrint.c,v 1.3 2004/10/13 15:18:32 drochner Exp $ */
> > +/* $OpenBSD$   */
> >
> >  /*-
> > - * Copyright (c) 2004
> > - * Matthias Drochner. All rights reserved.
> > + * Copyright (c) 2005 David Schultz 
> > + * All rights reserved.
> >   *
> >   * Redistribution and use in source and binary forms, with or without
> >   * modification, are permitted provided that the following conditions
> > @@ -27,75 +26,35 @@
> >   * SUCH DAMAGE.
> >   */
> >
> > -#include 
> > -#include 
> > -#include 
> > +#include 
> >  #include 
> > -#include 
> > -#include 
> >
> > -#include "math_private.h"
> > -
> > -#ifndef LRINTNAME
> > -#define LRINTNAME lrint
> > -#define RESTYPE long int
> > -#define RESTYPE_MIN LONG_MIN
> > -#define RESTYPE_MAX LONG_MAX
> > +#ifndef type
> > +#define type   double
> > +#define rounditrint
> > +#define dtype  long
> > +#define fn lrint
> >  #endif
> >
> > -#define RESTYPE_BITS (sizeof(RESTYPE) * 8)
> > -
> > -static const double
> > -TWO52[2]={
> > -  4.5035996273704960e+15, /* 0x4330, 0x */
> > - -4.5035996273704960e+15, /* 0xC330, 0x */
> > -};
> > -
> > -RESTYPE
> > -LRINTNAME(double x)
> > +/*
> > + * C99 says we should not raise a spurious inexact exception when an
> > + * invalid exception is raised.  Unfortunately, the set of inputs
> > + * that 

Re: lrint(3) and llrint(3) implementation

2021-10-14 Thread Todd C . Miller
On Thu, 14 Oct 2021 01:15:56 +0200, Mark Kettenis wrote:

> Currently the lib/libm/msun/run-lrint_test regress fails on powerpc64
> and other platforms.  Our implementation came from NetBSD, but NetBSD
> switched to the implementation from FreeBSD some time ago.  That is
> the same implementation that we already use for lrintl(3) and
> llrintl(3).
>
> Diff below makes us use that implementation for lrint(3), lrintf(3),
> llrint(3) and llrintf(3) as well.  This makes the regress test pass on
> powerpc64.

Doesn't this mean we end up with three copies of what is essentially
the same code in s_lrint.c, s_lrintf.c and s_lrintl.c?

I know it's not much but why not just move the actual code to
s_lrint.c and include that in the others?

 - todd



Re: lrint(3) and llrint(3) implementation

2021-10-14 Thread Alexander Bluhm
On Thu, Oct 14, 2021 at 01:15:56AM +0200, Mark Kettenis wrote:
> Currently the lib/libm/msun/run-lrint_test regress fails on powerpc64
> and other platforms.  Our implementation came from NetBSD, but NetBSD
> switched to the implementation from FreeBSD some time ago.  That is
> the same implementation that we already use for lrintl(3) and
> llrintl(3).
> 
> Diff below makes us use that implementation for lrint(3), lrintf(3),
> llrint(3) and llrintf(3) as well.  This makes the regress test pass on
> powerpc64.
> 
> ok?

libm Tests on powerpc64 pass now.
amd64 and i386 still passing.
arm64 and sparc64 still failing.
So it is an improvement.

OK bluhm@

> Index: lib/libm/src/s_llrint.c
> ===
> RCS file: /cvs/src/lib/libm/src/s_llrint.c,v
> retrieving revision 1.6
> diff -u -p -r1.6 s_llrint.c
> --- lib/libm/src/s_llrint.c   12 Sep 2016 19:47:02 -  1.6
> +++ lib/libm/src/s_llrint.c   13 Oct 2021 23:12:11 -
> @@ -1,14 +1,12 @@
> -/*   $OpenBSD: s_llrint.c,v 1.6 2016/09/12 19:47:02 guenther Exp $   */
> -/* $NetBSD: llrint.c,v 1.2 2004/10/13 15:18:32 drochner Exp $ */
> +/*   $OpenBSD$   */
>  
>  /*
> - * Written by Matthias Drochner .
> - * Public domain.
> + * Written by Martynas Venckus.  Public domain
>   */
>  
> -#define LRINTNAME llrint
> -#define RESTYPE long long int
> -#define RESTYPE_MIN LLONG_MIN
> -#define RESTYPE_MAX LLONG_MAX
> +#define type double
> +#define roundit  rint
> +#define dtypelong long
> +#define fn   llrint
>  
>  #include "s_lrint.c"
> Index: lib/libm/src/s_llrintf.c
> ===
> RCS file: /cvs/src/lib/libm/src/s_llrintf.c,v
> retrieving revision 1.2
> diff -u -p -r1.2 s_llrintf.c
> --- lib/libm/src/s_llrintf.c  25 Sep 2006 22:16:48 -  1.2
> +++ lib/libm/src/s_llrintf.c  13 Oct 2021 23:12:11 -
> @@ -1,14 +1,12 @@
> -/*   $OpenBSD: s_llrintf.c,v 1.2 2006/09/25 22:16:48 kettenis Exp $  */
> -/* $NetBSD: llrintf.c,v 1.2 2004/10/13 15:18:32 drochner Exp $ */
> +/*   $OpenBSD$   */
>  
>  /*
> - * Written by Matthias Drochner .
> - * Public domain.
> + * Written by Martynas Venckus.  Public domain
>   */
>  
> -#define LRINTNAME llrintf
> -#define RESTYPE long long int
> -#define RESTYPE_MIN LLONG_MIN
> -#define RESTYPE_MAX LLONG_MAX
> +#define type float
> +#define roundit  rintf
> +#define dtypelong long
> +#define fn   llrintf
>  
>  #include "s_lrintf.c"
> Index: lib/libm/src/s_lrint.c
> ===
> RCS file: /cvs/src/lib/libm/src/s_lrint.c,v
> retrieving revision 1.11
> diff -u -p -r1.11 s_lrint.c
> --- lib/libm/src/s_lrint.c12 Sep 2016 19:47:02 -  1.11
> +++ lib/libm/src/s_lrint.c13 Oct 2021 23:12:11 -
> @@ -1,9 +1,8 @@
> -/*   $OpenBSD: s_lrint.c,v 1.11 2016/09/12 19:47:02 guenther Exp $   */
> -/* $NetBSD: lrint.c,v 1.3 2004/10/13 15:18:32 drochner Exp $ */
> +/*   $OpenBSD$   */
>  
>  /*-
> - * Copyright (c) 2004
> - *   Matthias Drochner. All rights reserved.
> + * Copyright (c) 2005 David Schultz 
> + * All rights reserved.
>   *
>   * Redistribution and use in source and binary forms, with or without
>   * modification, are permitted provided that the following conditions
> @@ -27,75 +26,35 @@
>   * SUCH DAMAGE.
>   */
>  
> -#include 
> -#include 
> -#include 
> +#include 
>  #include 
> -#include 
> -#include 
>  
> -#include "math_private.h"
> -
> -#ifndef LRINTNAME
> -#define LRINTNAME lrint
> -#define RESTYPE long int
> -#define RESTYPE_MIN LONG_MIN
> -#define RESTYPE_MAX LONG_MAX
> +#ifndef type
> +#define type double
> +#define roundit  rint
> +#define dtypelong
> +#define fn   lrint
>  #endif
>  
> -#define RESTYPE_BITS (sizeof(RESTYPE) * 8)
> -
> -static const double
> -TWO52[2]={
> -  4.5035996273704960e+15, /* 0x4330, 0x */
> - -4.5035996273704960e+15, /* 0xC330, 0x */
> -};
> -
> -RESTYPE
> -LRINTNAME(double x)
> +/*
> + * C99 says we should not raise a spurious inexact exception when an
> + * invalid exception is raised.  Unfortunately, the set of inputs
> + * that overflows depends on the rounding mode when 'dtype' has more
> + * significant bits than 'type'.  Hence, we bend over backwards for the
> + * sake of correctness; an MD implementation could be more efficient.
> + */
> +dtype
> +fn(type x)
>  {
> - u_int32_t i0, i1;
> - int e, s, shift;
> - RESTYPE res;
> -
> - GET_HIGH_WORD(i0, x);
> - e = i0 >> DBL_FRACHBITS;
> - s = e >> DBL_EXPBITS;
> - e = (e & 0x7ff) - DBL_EXP_BIAS;
> -
> - /* 1.0 x 2^31 (or 2^63) is already too large */
> - if (e >= (int)RESTYPE_BITS - 1)
> - return (s ? RESTYPE_MIN : RESTYPE_MAX); /* ??? unspecified */
> -
> - /* >= 2^52 is already an exact integer */
> - if (e < DBL_FRACBITS) 

Re: iwx(4) 40MHz channel support

2021-10-14 Thread Stefan Hagen
joshua stein wrote:
> On Thu, 14 Oct 2021 at 09:21:18 +0200, Stefan Hagen wrote:
> > This ramp up time is reproducible. It always starts slower in the first
> > 1 or 2 seconds and then ramps up to full speed.
> 
> I've observed this in ethernet transfers on OpenBSD for a long time, 
> it's not limited to WiFi.
> 

I see it, but only upstream. This is the upper link speed end for the
iperf3 server (FreeBSD) as well, so the test may be skewed.

Ethernet:

[x13](0)(~)$ iperf3 -c 192.168.1.20
Connecting to host 192.168.1.20, port 5201
[  5] local 192.168.1.69 port 37380 connected to 192.168.1.20 port 5201
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec   104 MBytes   876 Mbits/sec
[  5]   1.00-2.00   sec   104 MBytes   870 Mbits/sec
[  5]   2.00-3.00   sec   104 MBytes   870 Mbits/sec
[  5]   3.00-4.00   sec   104 MBytes   870 Mbits/sec

[x13](1)(~)$ iperf3 -R -c 192.168.1.20
Connecting to host 192.168.1.20, port 5201
Reverse mode, remote host 192.168.1.20 is sending
[  5] local 192.168.1.69 port 44067 connected to 192.168.1.20 port 5201
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec  78.9 MBytes   662 Mbits/sec
[  5]   1.00-2.00   sec   110 MBytes   924 Mbits/sec
[  5]   2.00-3.00   sec   111 MBytes   929 Mbits/sec
[  5]   3.00-4.00   sec   110 MBytes   926 Mbits/sec

Best regards,
Stefan



Re: iwx(4) 40MHz channel support

2021-10-14 Thread joshua stein
On Thu, 14 Oct 2021 at 09:21:18 +0200, Stefan Hagen wrote:
> This ramp up time is reproducible. It always starts slower in the first
> 1 or 2 seconds and then ramps up to full speed.

I've observed this in ethernet transfers on OpenBSD for a long time, 
it's not limited to WiFi.



Re: unp_externalize() and `unp_lock'

2021-10-14 Thread Alexander Bluhm
OK bluhm@

> - out:
> - fdpunlock(fdp);
> +out:

With a space before the label, diff -p shows the function name and
not some label within the function.  I consider this helpful when
reading diffs.



poll(2) on top of kqueue

2021-10-14 Thread Martin Pieuchot
Diff below is the counterpart of the select(2) one I just committed to
make poll(2) and ppoll(2) use kqueue internally.

They use the same logic as select(2): convert pollfd into kqueue events
with EV_SET(2) then wait in kqueue_scan().

To make this implementation compatible with the existing poll(2) semantic  
I added a new specific kqueue-filter to FIFOs to handle the case where
POLLOUT is specified on a read-only event.  Thanks to millert@ for the
idea.  The regress sys/fifofs is passing with that.

As for the select(2) diff I'm currently interested in knowing if you
find any incompatibility with the current behavior. 

Thanks for testing,
Martin

Index: kern/sys_generic.c
===
RCS file: /cvs/src/sys/kern/sys_generic.c,v
retrieving revision 1.136
diff -u -p -r1.136 sys_generic.c
--- kern/sys_generic.c  14 Oct 2021 08:46:01 -  1.136
+++ kern/sys_generic.c  14 Oct 2021 09:00:22 -
@@ -81,6 +81,8 @@ int kqpoll_debug = 0;
 
 int pselregister(struct proc *, fd_set *[], int, int *);
 int pselcollect(struct proc *, struct kevent *, fd_set *[], int *);
+int ppollregister(struct proc *, struct pollfd *, int, int *);
+int ppollcollect(struct proc *, struct kevent *, struct pollfd *, u_int);
 
 int pollout(struct pollfd *, struct pollfd *, u_int);
 int dopselect(struct proc *, int, fd_set *, fd_set *, fd_set *,
@@ -769,6 +771,7 @@ pselregister(struct proc *p, fd_set *pib
/* FALLTHROUGH */
case EOPNOTSUPP:/* No underlying kqfilter */
case EINVAL:/* Unimplemented filter */
+   case EPERM: /* Specific to FIFO */
error = 0;
break;
case ENXIO: /* Device has been detached */
@@ -899,31 +902,132 @@ doselwakeup(struct selinfo *sip)
}
 }
 
-void
-pollscan(struct proc *p, struct pollfd *pl, u_int nfd, register_t *retval)
+int
+ppollregister_evts(struct proc *p, struct kevent *kevp, int nkev,
+struct pollfd *pl)
 {
-   struct filedesc *fdp = p->p_fd;
-   struct file *fp;
-   u_int i;
-   int n = 0;
+   int i, error, nevents = 0;
 
-   for (i = 0; i < nfd; i++, pl++) {
-   /* Check the file descriptor. */
-   if (pl->fd < 0) {
-   pl->revents = 0;
-   continue;
+   KASSERT(pl->revents == 0);
+
+#ifdef KTRACE
+   if (KTRPOINT(p, KTR_STRUCT))
+   ktrevent(p, kevp, nkev);
+#endif
+   for (i = 0; i < nkev; i++, kevp++) {
+again:
+   error = kqueue_register(p->p_kq, kevp, p);
+   switch (error) {
+   case 0:
+   nevents++;
+   break;
+   case EOPNOTSUPP:/* No underlying kqfilter */
+   case EINVAL:/* Unimplemented filter */
+   break;
+   case EBADF: /* Bad file descriptor */
+   pl->revents |= POLLNVAL;
+   break;
+   case EPERM: /* Specific to FIFO */
+   KASSERT(kevp->filter == EVFILT_WRITE);
+   if (nkev == 1) {
+   /*
+* If this is the only filter make sure
+* POLLHUP is passed to userland.
+*/
+   kevp->filter = EVFILT_EXCEPT;
+   goto again;
+   }
+   break;
+   case EPIPE: /* Specific to pipes */
+   KASSERT(kevp->filter == EVFILT_WRITE);
+   pl->revents |= POLLHUP;
+   break;
+   default:
+#ifdef DIAGNOSTIC
+   DPRINTFN(0, "poll err %lu fd %d revents %02x serial"
+   " %lu filt %d ERROR=%d\n",
+   ((unsigned long)kevp->udata - p->p_kq_serial),
+   pl->fd, pl->revents, p->p_kq_serial, kevp->filter,
+   error);
+#endif
+   /* FALLTHROUGH */
+   case ENXIO: /* Device has been detached */
+   pl->revents |= POLLERR;
+   break;
}
-   if ((fp = fd_getfile(fdp, pl->fd)) == NULL) {
-   pl->revents = POLLNVAL;
-   n++;
+   }
+
+   return (nevents);
+}
+
+/*
+ * Convert pollfd into kqueue events and register them on the
+ * per-thread queue.
+ *
+ * Return the number of pollfd that triggered at least one error and aren't
+ * completly monitored.  These pollfd should have the correponding error bit
+ * set in `revents'.
+ *
+ * At most 3 events can correspond to a single pollfd.
+ */
+int

Re: iwx(4) 40MHz channel support

2021-10-14 Thread Hrvoje Popovski
On 12.10.2021. 16:29, Hrvoje Popovski wrote:
> On 12.10.2021. 14:47, Stefan Sperling wrote:
>> This patch adds support for 40MHz channels to iwx(4).
>>
>> Please sync your source tree before attempting to apply this patch.
>> I have committed some changes to this driver today which this patch
>> is based on.
>>
>> Works for me on AX200/AX201. Does anyone else want to do a pre-commit test?
> 
> with this diff i'm getting 150Mbps vs 100Mbps without it here at home
> ... will test it at work ...
> 
> Thank you ..
> 
> 
> 

Hi,

on work i'm getting 170Mbps

iwx0 at pci3 dev 0 function 0 "Intel Wi-Fi 6 AX200" rev 0x1a, msix

tcpbench:

Conn: 1 Mbps:166.087 Peak Mbps:172.858 Avg Mbps:166.087
Conn: 1 Mbps:171.568 Peak Mbps:172.858 Avg Mbps:171.568
Conn: 1 Mbps:169.212 Peak Mbps:172.858 Avg Mbps:169.212
Conn: 1 Mbps:167.841 Peak Mbps:172.858 Avg Mbps:167.841
Conn: 1 Mbps:169.302 Peak Mbps:172.858 Avg Mbps:169.302


tcpdump -n -i iwx0 -v -y IEEE802_11_RADIO -s 4096 type mgt and subtype
beacon

12:06:52.870418 802.11 flags=0<>: beacon,
caps=421, ssid (eduroam), rates 12M* 18M 24M*
36M 48M 54M 12M* 12M*, ds (chan 36), tim 0x0001, country 'HR ',
channel 36 limit 23dB, channel 40 limit 23dB, channel 44 limit 23dB,
channel 48 limit 23dB, channel 52 limit 23dB, channel 56 limit 23dB,
channel 60 limit 23dB, channel 64 limit 23dB, channel 100 limit 30dB,
channel 104 limit 30dB, channel 108 limit 30dB, channel 112 limit 30dB,
channel 116 limit 30dB, channel 120 limit 30dB, channel 124 limit 30dB,
channel 128 limit 30dB, channel 132 limit 30dB, channel 136 limit 30dB,
channel 140 limit 30dB, channel 149 limit 23dB, channel 153 limit 23dB,
channel 157 limit 23dB, channel 161 limit 23dB, channel 165 limit 23dB,
power constraint 0dB, tpcreport 0x0b00, rsn=, 9 stations, 14% utilization,
admission capacity 0us/s, 71:1 0x04,
htcaps=<20/40MHz,LDPC,SGI@20MHz,SGI@40MHz,TXSTBC,RXSTBC 1 stream,A-MSDU
7935,A-MPDU max 65535,A-MPDU spacing 4.00us,RxMCS
0xff000100>, htop=<40MHz chan 36:40,RIFS,htprot
none,non-greenfield STA,basic MCS set 0x>, 127:8
0x04000140, 191:12 0x92018003eaffeaff, 192:5
0x002600, 195:3 0x010202, vendor 0x00904c0407, vendor
0x0010180209001c, vendor 0x0050f202000100, vendor
0x001977010603140d04175cf482000a8894ae7735c413e21d478026d8c413e21d47a01803085af60227e13e0808030427a309120064000100,
vendor 0x0019772101000c535243452d41502d53313500, 



Re: iwx(4) 40MHz channel support

2021-10-14 Thread Stefan Hagen
Stefan Sperling wrote:
> This patch adds support for 40MHz channels to iwx(4).
> 
> Please sync your source tree before attempting to apply this patch.
> I have committed some changes to this driver today which this patch
> is based on.
> 
> Works for me on AX200/AX201. Does anyone else want to do a pre-commit test?

*jaw drops to floor*

I double checked. There was no cable connected.

[x13](0)(~)$ iperf3 -c 192.168.1.20
Connecting to host 192.168.1.20, port 5201
[  5] local 192.168.1.226 port 40211 connected to 192.168.1.20 port 5201
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec  20.7 MBytes   173 Mbits/sec
[  5]   1.00-2.00   sec  24.3 MBytes   204 Mbits/sec
[  5]   2.00-3.00   sec  24.6 MBytes   206 Mbits/sec
[  5]   3.00-4.00   sec  24.6 MBytes   207 Mbits/sec
[  5]   4.00-5.00   sec  24.6 MBytes   206 Mbits/sec
[  5]   5.00-6.00   sec  24.8 MBytes   208 Mbits/sec
[  5]   6.00-7.00   sec  24.8 MBytes   208 Mbits/sec
[  5]   7.00-8.00   sec  24.6 MBytes   207 Mbits/sec
[  5]   8.00-9.00   sec  24.8 MBytes   208 Mbits/sec
[  5]   9.00-10.00  sec  24.8 MBytes   209 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bitrate
[  5]   0.00-10.00  sec   243 MBytes   203 Mbits/sec  sender
[  5]   0.00-10.02  sec   243 MBytes   203 Mbits/sec  receiver

iperf Done.
[x13](0)(~)$ iperf3 -R -c 192.168.1.20
Connecting to host 192.168.1.20, port 5201
Reverse mode, remote host 192.168.1.20 is sending
[  5] local 192.168.1.226 port 19869 connected to 192.168.1.20 port 5201
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec  7.21 MBytes  60.5 Mbits/sec
[  5]   1.00-2.00   sec  12.7 MBytes   106 Mbits/sec
[  5]   2.00-3.00   sec  15.6 MBytes   131 Mbits/sec
[  5]   3.00-4.00   sec  17.5 MBytes   147 Mbits/sec
[  5]   4.00-5.00   sec  18.6 MBytes   156 Mbits/sec
[  5]   5.00-6.00   sec  19.3 MBytes   162 Mbits/sec
[  5]   6.00-7.00   sec  19.8 MBytes   166 Mbits/sec
[  5]   7.00-8.00   sec  20.2 MBytes   170 Mbits/sec
[  5]   8.00-9.00   sec  20.9 MBytes   175 Mbits/sec
[  5]   9.00-10.00  sec  21.0 MBytes   176 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval   Transfer Bitrate Retr
[  5]   0.00-10.01  sec   173 MBytes   145 Mbits/sec0 sender
[  5]   0.00-10.00  sec   173 MBytes   145 Mbits/sec  receiver

iperf Done.

This ramp up time is reproducible. It always starts slower in the first
1 or 2 seconds and then ramps up to full speed.

More testing + log in case you want to skim through it.

Test 1: Walking around and got it to roam to the next AP
Test 2: Going into suspend for ~30sec with running NFS transfer, resuming, NFS 
transfer continues
Test 3: Going into suspend, walk to the next room (AP change), resuming, NFS 
transfer continues

[...]
Oct 14 08:23:41 x13 /bsd: iwx0 at pci3 dev 0 function 0 "Intel Wi-Fi 6 AX200" 
rev 0x1a, msix
Oct 14 08:23:41 x13 /bsd: iwx0: hw rev 0x340, fw ver 63.c04f3485.0, address 
70:9c:d1:b1:2d:bc
[...]
Oct 14 08:23:41 x13 /bsd: iwx0: begin active scan
Oct 14 08:23:41 x13 /bsd: iwx0: INIT -> SCAN
Oct 14 08:23:41 x13 /bsd: iwx0: SCAN -> INIT
Oct 14 08:23:41 x13 /bsd: iwx0: begin active scan
Oct 14 08:23:41 x13 /bsd: iwx0: INIT -> SCAN
Oct 14 08:23:41 x13 /bsd: iwx0: end active scan
Oct 14 08:23:41 x13 /bsd: iwx0: best AP e0:63:da:7e:f8:24 "DiscMate" score 75
Oct 14 08:23:41 x13 /bsd: iwx0: switching to network "DiscMate"
Oct 14 08:23:41 x13 /bsd: iwx0: + 18:e8:29:57:3d:061   +27 54M   ess  
privacy   rsn  "DiscMate"
Oct 14 08:23:41 x13 /bsd: iwx0: + 18:e8:29:58:3d:06   36   +29 54M   ess  
privacy   rsn  "DiscMate"
Oct 14 08:23:41 x13 /bsd: iwx0: - 1c:3a:de:74:c2:50   13   +13 54M   ess  
privacy   rsn  "HZN242371510"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 1e:e8:29:57:3d:061   +29 54M   ess  
privacy   rsn  ""!
Oct 14 08:23:41 x13 /bsd: iwx0: - 1e:e8:29:58:3d:06   36   +29 54M   ess  
privacy   rsn  ""!
Oct 14 08:23:41 x13 /bsd: iwx0: - 2c:91:ab:82:87:43   36+8 54M   ess  
privacy   rsn  "FRITZ!Box 6591 Cable AQ"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 2c:91:ab:82:87:446   +15 54M   ess  
privacy   rsn  "FRITZ!Box 6591 Cable AQ"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 38:10:d5:f7:c4:b61   +17 54M   ess  
privacy   rsn  "FRITZ!Box 7412"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 38:43:7d:a5:ea:43   44   +17 54M   ess  
privacy   rsn  "UPC5673586"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 38:43:7d:a5:ea:691   +16 54M   ess  
privacy   rsn  "UPC5673586"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 3a:9d:92:ad:3d:c76+8 54M   ess  
privacy   rsn  "DIRECT-bL-EPSON-XP-7100 Series"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 3c:a6:2f:b4:99:f51   +18 54M   ess  
privacy   rsn  "MAESTRO"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 44:4e:6d:4f:c7:356   +12 54M   ess  
privacy   rsn  "FRITZ!Box 6590 Cable MF"!
Oct 14 08:23:41 x13 /bsd: iwx0: - 4c:1b:86:17:91:62   11   +21 54M   ess  
privacy   rsn  "WLAN-791054"!
Oct