> Date: Thu, 10 Aug 2023 16:01:10 -0500 > From: Scott Cheloha <scottchel...@gmail.com> > > On Tue, Aug 08, 2023 at 08:00:47PM +0200, Mark Kettenis wrote: > > > From: Dale Rahn <dale.r...@gmail.com> > > > Date: Tue, 8 Aug 2023 12:36:45 -0400 > > > > > > Switching the computation of cycles/delaycnt to a proper 64 value math > > > instead of the '32 bit safe' complex math is likely a good idea. > > > However I am not completely convinced that switching to 'yield' (current > > > CPU_BUSY_CYCLE implementation) for every loop of a 'short wait' in the > > > wait > > > loop makes sense. In a hypervisor environment, this could cause a very > > > short wait between register writes to become very long, In a > > > non-hypervisor > > > environment there is essentially no improvement because the yield doesn't > > > really have any benefits on non-hypervisor. > > > > Dale, I think you're confused here. There is no arcitectural way to > > trap a YIELD instruction. The docs explicitly state that SMT and SMP > > systems. I suspect that this instruction was primarily introduced for > > SMT systems that never materialized; I completely believe you that on > > current hardware it does nothing. > > > > Even without a YIELD instruction a hypervisor might interrupt us and > > schedule a different vCPU onto the core. And on real hardware > > external interrupts may happen. So you really can't count on delay(9) > > being accurate. > > > > Linux does use YIELD in it delay loop. > > Okay good. > > > > To my current understanding, there is no useful 'wait short period' on arm > > > cores. > > > > There is WFET (and WFIT), but that is Armv8.7 material, so not > > availble on actual hardware that we run on. Linux uses that > > instruction in its delay loop when available. > > These look great. Can't wait for them to be implemented outside of a > functional simulator. > > > On machines with a working Generic Timer event stream, Linux uses WFE > > if the delay is long enough. > > I have a prototype for this. Obviously it's a separate patch. > > Anyway, can I go ahead with the patch below to start? > > - Use 64-bit arithmetic to simplify agtimer_delay(). > > - Use CPU_BUSY_CYCLE() ("yield" on arm64) in the loop to match other > delay(9) implementations.
ok kettenis@ > Index: agtimer.c > =================================================================== > RCS file: /cvs/src/sys/arch/arm64/dev/agtimer.c,v > retrieving revision 1.23 > diff -u -p -r1.23 agtimer.c > --- agtimer.c 25 Jul 2023 18:16:19 -0000 1.23 > +++ agtimer.c 10 Aug 2023 20:59:27 -0000 > @@ -323,32 +323,12 @@ agtimer_cpu_initclocks(void) > void > agtimer_delay(u_int usecs) > { > - uint64_t clock, oclock, delta, delaycnt; > - uint64_t csec, usec; > - volatile int j; > + uint64_t cycles, start; > > - if (usecs > (0x80000000 / agtimer_frequency)) { > - csec = usecs / 10000; > - usec = usecs % 10000; > - > - delaycnt = (agtimer_frequency / 100) * csec + > - (agtimer_frequency / 100) * usec / 10000; > - } else { > - delaycnt = agtimer_frequency * usecs / 1000000; > - } > - if (delaycnt <= 1) > - for (j = 100; j > 0; j--) > - ; > - > - oclock = agtimer_readcnt64(); > - while (1) { > - for (j = 100; j > 0; j--) > - ; > - clock = agtimer_readcnt64(); > - delta = clock - oclock; > - if (delta > delaycnt) > - break; > - } > + start = agtimer_readcnt64(); > + cycles = (uint64_t)usecs * agtimer_frequency / 1000000; > + while (agtimer_readcnt64() - start < cycles) > + CPU_BUSY_CYCLE(); > } > > void > >