Re: amd64: i8254_delay(): simpler microsecond->ticks conversion

2019-05-20 Thread Joerg Sonnenberger
On Sun, May 19, 2019 at 12:25:04PM -0500, Scott Cheloha wrote:
> guenther@ has pointed out in a separate mail that I can make the diff
> smaller with a cast and that it's acceptable here.  As with the code
> it's replacing I've left a comment explaining what we're doing.

There is a rather obvious bug in the function for n < 0, but that's
older. You should get better assembly by using an intermediate cast to
uint64_t anyway, esp. if the compiler can't prove that n is positive.

Joerg



Re: amd64: i8254_delay(): simpler microsecond->ticks conversion

2019-05-19 Thread Scott Cheloha
On Sun, May 19, 2019 at 01:23:17AM -0500, Scott Cheloha wrote:
> On Sat, May 18, 2019 at 10:57:32PM -0700, Mike Larkin wrote:
> > On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> > > This code is really fidgety and I think we can do better.
> > > 
> > > If we use a 64-bit value here for microseconds the compiler arranges
> > > for all the math to be done with 64-bit quantities.  I'm pretty sure
> > > this is standard C numeric type upconversion.  As noted in the comment
> > > for the assembly, use of the 64-bit intermediate avoids overflow for
> > > most plausible inputs.
> > > 
> > > So with one line of code we get the same result as we do with the
> > > __GNUC__ assembly and the nigh incomprehensible default computation.
> > > 
> > > [...]
> > 
> > What bug or problem is this fixing?
> 
> The code is more complicated than it needs to be.
> 
> There are two different ways to convert microseconds to i8254 ticks
> in this function.  One is assembly, the other is a non-obvious method
> that avoids overflowing an int while using ints for partial sums.
> 
> You can replace both of them with legible, plain C that does exactly
> what it looks like it is doing by using a larger intermediate type.

guenther@ has pointed out in a separate mail that I can make the diff
smaller with a cast and that it's acceptable here.  As with the code
it's replacing I've left a comment explaining what we're doing.

ok?

Index: arch/amd64/isa/clock.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
retrieving revision 1.28
diff -u -p -r1.28 clock.c
--- arch/amd64/isa/clock.c  27 Jul 2018 21:11:31 -0000  1.28
+++ arch/amd64/isa/clock.c  19 May 2019 17:11:15 -0000
@@ -242,31 +242,8 @@ i8254_delay(int n)
if (n <= 25)
n = delaytab[n];
else {
-#ifdef __GNUC__
-   /*
-* Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
-* code so we can take advantage of the intermediate 64-bit
-* quantity to prevent loss of significance.
-*/
-   int m;
-   __asm volatile("mul %3"
-: "=a" (n), "=d" (m)
-: "0" (n), "r" (TIMER_FREQ));
-   __asm volatile("div %4"
-: "=a" (n), "=d" (m)
-: "0" (n), "1" (m), "r" (1000000));
-#else
-   /*
-* Calculate ((n * TIMER_FREQ) / 1e6) without using floating
-* point and without any avoidable overflows.
-*/
-   int sec = n / 1000000,
-   usec = n % 1000000;
-   n = sec * TIMER_FREQ +
-   usec * (TIMER_FREQ / 1000000) +
-   usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
-   usec * (TIMER_FREQ % 1000) / 1000000;
-#endif
+   /* Force 64-bit math to avoid 32-bit overflow if possible. */
+   n = (int64_t)n * TIMER_FREQ / 1000000;
}
 
limit = TIMER_FREQ / hz;



Re: amd64: i8254_delay(): simpler microsecond->ticks conversion

2019-05-19 Thread Scott Cheloha
On Sat, May 18, 2019 at 10:57:32PM -0700, Mike Larkin wrote:
> On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> > This code is really fidgety and I think we can do better.
> > 
> > If we use a 64-bit value here for microseconds the compiler arranges
> > for all the math to be done with 64-bit quantities.  I'm pretty sure
> > this is standard C numeric type upconversion.  As noted in the comment
> > for the assembly, use of the 64-bit intermediate avoids overflow for
> > most plausible inputs.
> > 
> > So with one line of code we get the same result as we do with the
> > __GNUC__ assembly and the nigh incomprehensible default computation.
> > 
> > We could add error checking, but it wasn't here before, and you won't
> > overflow an int of 8254 ticks until you try to delay for more than
> > 1799795545 microseconds.  Maybe a KASSERT...
> > 
> > My toy program suggests that this simpler code is actually faster when
> > compiled with both base gcc and clang.  But that's just a userspace
> > benchmark, not measurement during practical execution in the kernel.
> > Is the setup for delay(9) considered a hot path?
> > 
> > Near-identical simplification could be done in arch/i386/isa/clock.c.
> > I don't have an i386 though so I don't know whether this code might be
> > faster on such hardware.  It'd definitely be smaller though.
> > 
> > thoughts?
> 
> What bug or problem is this fixing?

The code is more complicated than it needs to be.

There are two different ways to convert microseconds to i8254 ticks
in this function.  One is assembly, the other is a non-obvious method
that avoids overflowing an int while using ints for partial sums.

You can replace both of them with legible, plain C that does exactly
what it looks like it is doing by using a larger intermediate type.



Re: amd64: i8254_delay(): simpler microsecond->ticks conversion

2019-05-18 Thread Mike Larkin
On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> This code is really fidgety and I think we can do better.
> 
> If we use a 64-bit value here for microseconds the compiler arranges
> for all the math to be done with 64-bit quantities.  I'm pretty sure
> this is standard C numeric type upconversion.  As noted in the comment
> for the assembly, use of the 64-bit intermediate avoids overflow for
> most plausible inputs.
> 
> So with one line of code we get the same result as we do with the
> __GNUC__ assembly and the nigh incomprehensible default computation.
> 
> We could add error checking, but it wasn't here before, and you won't
> overflow an int of 8254 ticks until you try to delay for more than
> 1799795545 microseconds.  Maybe a KASSERT...
> 
> My toy program suggests that this simpler code is actually faster when
> compiled with both base gcc and clang.  But that's just a userspace
> benchmark, not measurement during practical execution in the kernel.
> Is the setup for delay(9) considered a hot path?
> 
> Near-identical simplification could be done in arch/i386/isa/clock.c.
> I don't have an i386 though so I don't know whether this code might be
> faster on such hardware.  It'd definitely be smaller though.
> 
> thoughts?
> 

What bug or problem is this fixing?

-ml

> Index: arch/amd64/isa/clock.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 clock.c
> --- arch/amd64/isa/clock.c  27 Jul 2018 21:11:31 -0000  1.28
> +++ arch/amd64/isa/clock.c  19 May 2019 05:31:04 -0000
> @@ -226,6 +226,7 @@ gettick(void)
>  void
>  i8254_delay(int n)
>  {
> + int64_t microseconds;
>   int limit, tick, otick;
>   static const int delaytab[26] = {
>0,  2,  3,  4,  5,  6,  7,  9, 10, 11,
> @@ -242,31 +243,9 @@ i8254_delay(int n)
>   if (n <= 25)
>   n = delaytab[n];
>   else {
> -#ifdef __GNUC__
> - /*
> -  * Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
> -  * code so we can take advantage of the intermediate 64-bit
> -  * quantity to prevent loss of significance.
> -  */
> - int m;
> - __asm volatile("mul %3"
> -  : "=a" (n), "=d" (m)
> -  : "0" (n), "r" (TIMER_FREQ));
> - __asm volatile("div %4"
> -  : "=a" (n), "=d" (m)
> -  : "0" (n), "1" (m), "r" (1000000));
> -#else
> - /*
> -  * Calculate ((n * TIMER_FREQ) / 1e6) without using floating
> -  * point and without any avoidable overflows.
> -  */
> - int sec = n / 1000000,
> - usec = n % 1000000;
> - n = sec * TIMER_FREQ +
> - usec * (TIMER_FREQ / 1000000) +
> - usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
> - usec * (TIMER_FREQ % 1000) / 1000000;
> -#endif
> + /* Using a 64-bit intermediate here avoids most overflow. */
> + microseconds = n;
> + n = microseconds * TIMER_FREQ / 1000000;
>   }
>  
>   limit = TIMER_FREQ / hz;
> 



amd64: i8254_delay(): simpler microsecond->ticks conversion

2019-05-18 Thread Scott Cheloha
This code is really fidgety and I think we can do better.

If we use a 64-bit value here for microseconds the compiler arranges
for all the math to be done with 64-bit quantities.  I'm pretty sure
this is standard C numeric type upconversion.  As noted in the comment
for the assembly, use of the 64-bit intermediate avoids overflow for
most plausible inputs.

So with one line of code we get the same result as we do with the
__GNUC__ assembly and the nigh incomprehensible default computation.

We could add error checking, but it wasn't here before, and you won't
overflow an int of 8254 ticks until you try to delay for more than
1799795545 microseconds.  Maybe a KASSERT...

My toy program suggests that this simpler code is actually faster when
compiled with both base gcc and clang.  But that's just a userspace
benchmark, not measurement during practical execution in the kernel.
Is the setup for delay(9) considered a hot path?

Near-identical simplification could be done in arch/i386/isa/clock.c.
I don't have an i386 though so I don't know whether this code might be
faster on such hardware.  It'd definitely be smaller though.

thoughts?

Index: arch/amd64/isa/clock.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
retrieving revision 1.28
diff -u -p -r1.28 clock.c
--- arch/amd64/isa/clock.c  27 Jul 2018 21:11:31 -0000  1.28
+++ arch/amd64/isa/clock.c  19 May 2019 05:31:04 -0000
@@ -226,6 +226,7 @@ gettick(void)
 void
 i8254_delay(int n)
 {
+   int64_t microseconds;
int limit, tick, otick;
static const int delaytab[26] = {
 0,  2,  3,  4,  5,  6,  7,  9, 10, 11,
@@ -242,31 +243,9 @@ i8254_delay(int n)
if (n <= 25)
n = delaytab[n];
else {
-#ifdef __GNUC__
-   /*
-* Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
-* code so we can take advantage of the intermediate 64-bit
-* quantity to prevent loss of significance.
-*/
-   int m;
-   __asm volatile("mul %3"
-: "=a" (n), "=d" (m)
-: "0" (n), "r" (TIMER_FREQ));
-   __asm volatile("div %4"
-: "=a" (n), "=d" (m)
-: "0" (n), "1" (m), "r" (1000000));
-#else
-   /*
-* Calculate ((n * TIMER_FREQ) / 1e6) without using floating
-* point and without any avoidable overflows.
-*/
-   int sec = n / 1000000,
-   usec = n % 1000000;
-   n = sec * TIMER_FREQ +
-   usec * (TIMER_FREQ / 1000000) +
-   usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
-   usec * (TIMER_FREQ % 1000) / 1000000;
-#endif
+   /* Using a 64-bit intermediate here avoids most overflow. */
+   microseconds = n;
+   n = microseconds * TIMER_FREQ / 1000000;
}
 
limit = TIMER_FREQ / hz;