Re: amd64: i8254_delay(): simpler microsecond->ticks conversion
On Sun, May 19, 2019 at 12:25:04PM -0500, Scott Cheloha wrote:
> guenther@ has pointed out in a separate mail that I can make the diff
> smaller with a cast and that it's acceptable here.  As with the code
> it's replacing I've left a comment explaining what we're doing.

There is a rather obvious bug in the function for n < 0, but that's
older.  You should get better assembly by using an intermediate cast
to uint64_t anyway, esp. if the compiler can't prove that n is
positive.

Joerg
Re: amd64: i8254_delay(): simpler microsecond->ticks conversion
On Sun, May 19, 2019 at 01:23:17AM -0500, Scott Cheloha wrote:
> On Sat, May 18, 2019 at 10:57:32PM -0700, Mike Larkin wrote:
> > On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> > > This code is really fidgety and I think we can do better.
> > >
> > > If we use a 64-bit value here for microseconds the compiler
> > > arranges for all the math to be done with 64-bit quantities.  I'm
> > > pretty sure this is standard C numeric type upconversion.  As
> > > noted in the comment for the assembly, use of the 64-bit
> > > intermediate avoids overflow for most plausible inputs.
> > >
> > > So with one line of code we get the same result as we do with the
> > > __GNUC__ assembly and the nigh incomprehensible default
> > > computation.
> > >
> > > [...]
> >
> > What bug or problem is this fixing?
>
> The code is more complicated than it needs to be.
>
> There are two different ways to convert microseconds to i8254 ticks
> in this function.  One is assembly, the other is a non-obvious method
> that avoids overflowing an int while using ints for partial sums.
>
> You can replace both of them with legible, plain C that does exactly
> what it looks like it is doing by using a larger intermediate type.

guenther@ has pointed out in a separate mail that I can make the diff
smaller with a cast and that it's acceptable here.  As with the code
it's replacing I've left a comment explaining what we're doing.

ok?

Index: arch/amd64/isa/clock.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
retrieving revision 1.28
diff -u -p -r1.28 clock.c
--- arch/amd64/isa/clock.c	27 Jul 2018 21:11:31 -0000	1.28
+++ arch/amd64/isa/clock.c	19 May 2019 17:11:15 -0000
@@ -242,31 +242,8 @@ i8254_delay(int n)
 	if (n <= 25)
 		n = delaytab[n];
 	else {
-#ifdef __GNUC__
-		/*
-		 * Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
-		 * code so we can take advantage of the intermediate 64-bit
-		 * quantity to prevent loss of significance.
-		 */
-		int m;
-		__asm volatile("mul %3"
-			 : "=a" (n), "=d" (m)
-			 : "0" (n), "r" (TIMER_FREQ));
-		__asm volatile("div %4"
-			 : "=a" (n), "=d" (m)
-			 : "0" (n), "1" (m), "r" (1000000));
-#else
-		/*
-		 * Calculate ((n * TIMER_FREQ) / 1e6) without using floating
-		 * point and without any avoidable overflows.
-		 */
-		int sec = n / 1000000,
-		    usec = n % 1000000;
-		n = sec * TIMER_FREQ +
-		    usec * (TIMER_FREQ / 1000000) +
-		    usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
-		    usec * (TIMER_FREQ % 1000) / 1000000;
-#endif
+		/* Force 64-bit math to avoid 32-bit overflow if possible. */
+		n = (int64_t)n * TIMER_FREQ / 1000000;
	}

	limit = TIMER_FREQ / hz;
Re: amd64: i8254_delay(): simpler microsecond->ticks conversion
On Sat, May 18, 2019 at 10:57:32PM -0700, Mike Larkin wrote:
> On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> > This code is really fidgety and I think we can do better.
> >
> > If we use a 64-bit value here for microseconds the compiler
> > arranges for all the math to be done with 64-bit quantities.  I'm
> > pretty sure this is standard C numeric type upconversion.  As noted
> > in the comment for the assembly, use of the 64-bit intermediate
> > avoids overflow for most plausible inputs.
> >
> > So with one line of code we get the same result as we do with the
> > __GNUC__ assembly and the nigh incomprehensible default computation.
> >
> > We could add error checking, but it wasn't here before, and you
> > won't overflow an int of 8254 ticks until you try to delay for more
> > than 1799795545 microseconds.  Maybe a KASSERT...
> >
> > My toy program suggests that this simpler code is actually faster
> > when compiled with both base gcc and clang.  But that's just a
> > userspace benchmark, not measurement during practical execution in
> > the kernel.  Is the setup for delay(9) considered a hot path?
> >
> > Near-identical simplification could be done in
> > arch/i386/isa/clock.c.  I don't have an i386 though so I don't know
> > whether this code might be faster on such hardware.  It'd
> > definitely be smaller though.
> >
> > thoughts?
>
> What bug or problem is this fixing?

The code is more complicated than it needs to be.

There are two different ways to convert microseconds to i8254 ticks
in this function.  One is assembly, the other is a non-obvious method
that avoids overflowing an int while using ints for partial sums.

You can replace both of them with legible, plain C that does exactly
what it looks like it is doing by using a larger intermediate type.
Re: amd64: i8254_delay(): simpler microsecond->ticks conversion
On Sun, May 19, 2019 at 12:47:11AM -0500, Scott Cheloha wrote:
> This code is really fidgety and I think we can do better.
>
> If we use a 64-bit value here for microseconds the compiler arranges
> for all the math to be done with 64-bit quantities.  I'm pretty sure
> this is standard C numeric type upconversion.  As noted in the comment
> for the assembly, use of the 64-bit intermediate avoids overflow for
> most plausible inputs.
>
> So with one line of code we get the same result as we do with the
> __GNUC__ assembly and the nigh incomprehensible default computation.
>
> We could add error checking, but it wasn't here before, and you won't
> overflow an int of 8254 ticks until you try to delay for more than
> 1799795545 microseconds.  Maybe a KASSERT...
>
> My toy program suggests that this simpler code is actually faster when
> compiled with both base gcc and clang.  But that's just a userspace
> benchmark, not measurement during practical execution in the kernel.
> Is the setup for delay(9) considered a hot path?
>
> Near-identical simplification could be done in arch/i386/isa/clock.c.
> I don't have an i386 though so I don't know whether this code might be
> faster on such hardware.  It'd definitely be smaller though.
>
> thoughts?

What bug or problem is this fixing?

-ml

> Index: arch/amd64/isa/clock.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
> retrieving revision 1.28
> diff -u -p -r1.28 clock.c
> --- arch/amd64/isa/clock.c	27 Jul 2018 21:11:31 -0000	1.28
> +++ arch/amd64/isa/clock.c	19 May 2019 05:31:04 -0000
> @@ -226,6 +226,7 @@ gettick(void)
>  void
>  i8254_delay(int n)
>  {
> +	int64_t microseconds;
>  	int limit, tick, otick;
>  	static const int delaytab[26] = {
>  		 0,  2,  3,  4,  5,  6,  7,  9, 10, 11,
> @@ -242,31 +243,9 @@ i8254_delay(int n)
>  	if (n <= 25)
>  		n = delaytab[n];
>  	else {
> -#ifdef __GNUC__
> -		/*
> -		 * Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
> -		 * code so we can take advantage of the intermediate 64-bit
> -		 * quantity to prevent loss of significance.
> -		 */
> -		int m;
> -		__asm volatile("mul %3"
> -			 : "=a" (n), "=d" (m)
> -			 : "0" (n), "r" (TIMER_FREQ));
> -		__asm volatile("div %4"
> -			 : "=a" (n), "=d" (m)
> -			 : "0" (n), "1" (m), "r" (1000000));
> -#else
> -		/*
> -		 * Calculate ((n * TIMER_FREQ) / 1e6) without using floating
> -		 * point and without any avoidable overflows.
> -		 */
> -		int sec = n / 1000000,
> -		    usec = n % 1000000;
> -		n = sec * TIMER_FREQ +
> -		    usec * (TIMER_FREQ / 1000000) +
> -		    usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
> -		    usec * (TIMER_FREQ % 1000) / 1000000;
> -#endif
> +		/* Using a 64-bit intermediate here avoids most overflow. */
> +		microseconds = n;
> +		n = microseconds * TIMER_FREQ / 1000000;
> 	}
>
> 	limit = TIMER_FREQ / hz;
>
amd64: i8254_delay(): simpler microsecond->ticks conversion
This code is really fidgety and I think we can do better.

If we use a 64-bit value here for microseconds the compiler arranges
for all the math to be done with 64-bit quantities.  I'm pretty sure
this is standard C numeric type upconversion.  As noted in the comment
for the assembly, use of the 64-bit intermediate avoids overflow for
most plausible inputs.

So with one line of code we get the same result as we do with the
__GNUC__ assembly and the nigh incomprehensible default computation.

We could add error checking, but it wasn't here before, and you won't
overflow an int of 8254 ticks until you try to delay for more than
1799795545 microseconds.  Maybe a KASSERT...

My toy program suggests that this simpler code is actually faster when
compiled with both base gcc and clang.  But that's just a userspace
benchmark, not measurement during practical execution in the kernel.
Is the setup for delay(9) considered a hot path?

Near-identical simplification could be done in arch/i386/isa/clock.c.
I don't have an i386 though so I don't know whether this code might be
faster on such hardware.  It'd definitely be smaller though.

thoughts?

Index: arch/amd64/isa/clock.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/isa/clock.c,v
retrieving revision 1.28
diff -u -p -r1.28 clock.c
--- arch/amd64/isa/clock.c	27 Jul 2018 21:11:31 -0000	1.28
+++ arch/amd64/isa/clock.c	19 May 2019 05:31:04 -0000
@@ -226,6 +226,7 @@ gettick(void)
 void
 i8254_delay(int n)
 {
+	int64_t microseconds;
 	int limit, tick, otick;
 	static const int delaytab[26] = {
 		 0,  2,  3,  4,  5,  6,  7,  9, 10, 11,
@@ -242,31 +243,9 @@ i8254_delay(int n)
 	if (n <= 25)
 		n = delaytab[n];
 	else {
-#ifdef __GNUC__
-		/*
-		 * Calculate ((n * TIMER_FREQ) / 1e6) using explicit assembler
-		 * code so we can take advantage of the intermediate 64-bit
-		 * quantity to prevent loss of significance.
-		 */
-		int m;
-		__asm volatile("mul %3"
-			 : "=a" (n), "=d" (m)
-			 : "0" (n), "r" (TIMER_FREQ));
-		__asm volatile("div %4"
-			 : "=a" (n), "=d" (m)
-			 : "0" (n), "1" (m), "r" (1000000));
-#else
-		/*
-		 * Calculate ((n * TIMER_FREQ) / 1e6) without using floating
-		 * point and without any avoidable overflows.
-		 */
-		int sec = n / 1000000,
-		    usec = n % 1000000;
-		n = sec * TIMER_FREQ +
-		    usec * (TIMER_FREQ / 1000000) +
-		    usec * ((TIMER_FREQ % 1000000) / 1000) / 1000 +
-		    usec * (TIMER_FREQ % 1000) / 1000000;
-#endif
+		/* Using a 64-bit intermediate here avoids most overflow. */
+		microseconds = n;
+		n = microseconds * TIMER_FREQ / 1000000;
	}

	limit = TIMER_FREQ / hz;