On Thu, Sep 08, 2022 at 08:32:27AM -0500, Scott Cheloha wrote:
> On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
> > On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
> > > On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
> > > > On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
> > > > > On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
> > > > > > > On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
> > > > > > >
> > > > > > > On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
> > > > > > >> dv@ suggested coming to the list to request testing for the 
> > > > > > >> pvclock(4)
> > > > > > >> driver.  Attached is a patch that corrects several bugs.  Most of
> > > > > > >> these changes will only matter in the non-TSC_STABLE case on a
> > > > > > >> multiprocessor VM.
> > > > > > >>
> > > > > > >> Ideally, nothing should break.
> > > > > > >>
> > > > > > >> - pvclock yields a 64-bit value.  The BSD timecounter layer can 
> > > > > > >> only
> > > > > > >>  use the lower 32 bits, but internally we need to track the full
> > > > > > >>  64-bit value to allow comparisons with the full value in the
> > > > > > >>  non-TSC_STABLE case.  So make pvclock_lastcount a 64-bit 
> > > > > > >> quantity.
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), move rdtsc() up into the lockless 
> > > > > > >> read
> > > > > > >>  loop to get a more accurate timestamp.
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), check that our TSC value doesn't 
> > > > > > >> predate
> > > > > > >>  ti->ti_tsc_timestamp, otherwise we will produce an enormous 
> > > > > > >> value.
> > > > > > >>
> > > > > > >> - In pvclock_get_timecount(), update pvclock_lastcount in the
> > > > > > >>  non-TSC_STABLE case with more care.  On amd64 we can do this 
> > > > > > >> with an
> > > > > > >>  atomic_cas_ulong(9) loop because u_long is 64 bits.  On i386 we 
> > > > > > >> need
> > > > > > >>  to introduce a mutex to protect our comparison and read/write.
> > > > > > >
> > > > > > > i386 has cmpxchg8b, no need to disable interrupts
> > > > > > > the ifdefs seem excessive
> > > > > >
> > > > > > How do I make use of CMPXCHG8B on i386
> > > > > > in this context?
> > > > > >
> > > > > > atomic_cas_ulong(9) is a 32-bit CAS on
> > > > > > i386.
> > > > >
> > > > > static inline uint64_t
> > > > > atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
> > > > > {
> > > > >       return __sync_val_compare_and_swap(p, o, n);
> > > > > }
> > > > >
> > > > > Or md atomic.h files could have an equivalent.
> > > > > Not possible on all 32-bit archs.
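(Just to illustrate what an MD version might look like: an untested
sketch of an i386 atomic_cas_64() built directly on CMPXCHG8B, not
taken from the diff below.)

/*
 * CMPXCHG8B compares EDX:EAX with the 64-bit memory operand; on a
 * match it stores ECX:EBX there, otherwise it loads the memory
 * operand into EDX:EAX.  Either way EDX:EAX ends up holding the old
 * value, which is what a CAS is expected to return.
 */
static inline uint64_t
atomic_cas_64(volatile uint64_t *p, uint64_t e, uint64_t n)
{
	__asm volatile("lock cmpxchg8b %0"
	    : "+m" (*p), "+A" (e)
	    : "b" ((uint32_t)n), "c" ((uint32_t)(n >> 32))
	    : "cc");
	return (e);
}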
> > > > >
> > > > > >
> > > > > > We can't use FP registers in the kernel, no?
> > > > >
> > > > > What do FP registers have to do with it?
> > > > >
> > > > > >
> > > > > > Am I missing some other avenue?
> > > > >
> > > > > There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
> > > >
> > > > LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
> > > > drop rdtsc_lfence() into cpufunc.h and proceed without causing some
> > > > kind of fault on an older CPU.
> > > >
> > > > What are my options on a 586-class CPU for forcing RDTSC to complete
> > > > before later instructions?
> > >
> > > "3.3.2. Serializing Operations
> > > After executing certain instructions the Pentium processor serializes
> > > instruction execution. This means that any modifications to flags,
> > > registers, and memory for previous instructions are completed before
> > > the next instruction is fetched and executed. The prefetch queue
> > > is flushed as a result of serializing operations.
> > >
> > > The Pentium processor serializes instruction execution after executing
> > > one of the following instructions: Move to Special Register (except
> > > CR0), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
> > > CPUID, RSM and WRMSR."
> > >
> > > from:
> > > Pentium Processor User's Manual
> > > Volume 1: Pentium Processor Data Book
> > > Order Number 241428
> > >
> > > http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
> > >
> > > So it could be rdtsc ; cpuid.
> > > lfence; rdtsc should still be preferred.
> > >
> > > It could be tested during boot and set a function pointer.
> > > Or the codepatch bits could be used.
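(For reference, an untested sketch of what a CPUID-serialized fallback
for a pre-SSE2 CPU could look like; the name rdtsc_cpuid() is made up.
The compiler won't reorder the two volatile asm statements, and CPUID
is serializing, so nothing after it starts before RDTSC has completed.)

static inline uint64_t
rdtsc_cpuid(void)
{
	uint64_t tsc;
	uint32_t eax, ebx, ecx, edx;

	__asm volatile("rdtsc" : "=A" (tsc));
	__asm volatile("cpuid"
	    : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
	    : "0" (0));
	return (tsc);
}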
> > >
> > > In the specific case of pvclock, can it be assumed that the host
> > > has hardware virt and would then have lfence?
> > >
> >
> > I think this is a fair assumption.
>
> Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
> and, just to be safe, check for SSE2 during pvclock_match().
>
> If in the future someone wants pvclock(4) support on guests without
> SSE2 support (if any exist) we can deal with it at that time.
>
> Updated patch:
>
> - Add atomic_cas_64() to amd64/include/atomic.h.
>
>   (NetBSD calls it atomic_cas_64(), so this is consistent with their
>   atomic function naming scheme.)
>
>   I am unsure whether it would be better to just do this:
>
> #define atomic_cas_64(_p, _e, _n) _atomic_cas_ulong((unsigned long *)(_p), 
> (_e), (_n))
>
>   If I don't cast the pointer, the compiler complains, so I added
>   _atomic_cas_ullong().
>
>   What is preferred here?
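(For what it's worth, the cast-only version works on amd64 because
unsigned long is 64 bits there; keeping the volatile qualifier in the
cast seems cleaner.  Untested sketch:)

#define atomic_cas_64(_p, _e, _n) \
	_atomic_cas_ulong((volatile unsigned long *)(_p), (_e), (_n))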
>
> - Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
>   __sync_val_compare_and_swap() builtin.
>
> - Add rdtsc_lfence() to i386/include/cpufunc.h.
>
> - In pvclock.c, make pvclock_lastcount a volatile 64-bit value to
>   fix the non-PVCLOCK_FLAG_TSC_STABLE case.
>
> - In pvclock_match(), check for SSE2 support in ci_feature_flags.
>   If we don't have it, we don't have LFENCE and we can't match.
>
> - In pvclock_get_timecount(), do RDTSC as early as possible in the
>   lockless read loop to get a better timestamp.
>
> - In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().
>
> - In pvclock_get_timecount(), check whether our TSC lags
>   ti_tsc_timestamp so we don't produce an enormous, invalid delta.
>   If we're lagging, set delta to zero.
>
> - In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
>   case.
>
>   Because multiple vCPUs can be in pvclock_get_timecount()
>   concurrently, we need to read, compare, and (maybe) replace
>   pvclock_lastcount in an atomic CAS loop.
>
>   The only thing that I'm still unsure about is whether this:
>
> volatile uint64_t pvclock_lastcount;
>
> void
> foo(void)
> {
>       uint64_t last;
>
>       last = pvclock_lastcount;       /* atomic on i386? */
> }
>
>   is a safe, atomic 8-byte load on i386 on all the CPUs we support,
>   i.e. 586 and up.
>
>   Can someone confirm this?  I didn't know you could do this with
>   32-bit registers.
>

You could compile this and then objdump -D it and see for yourself...
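Something along these lines would do it (hypothetical file name):

/*
 * load64.c -- build for i386 and disassemble to see how the 64-bit
 * load is emitted, e.g. two 32-bit movl instructions vs. a single
 * 8-byte access:
 *
 *	cc -O2 -m32 -c load64.c && objdump -d load64.o
 */
volatile unsigned long long pvclock_lastcount;

unsigned long long
foo(void)
{
	return (pvclock_lastcount);
}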

> --
>
> Should I put this out in a second request-for-test email?  The patch
> has drifted a bit.
>
> Index: arch/amd64/include/atomic.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/atomic.h,v
> retrieving revision 1.22
> diff -u -p -r1.22 atomic.h
> --- arch/amd64/include/atomic.h       29 Aug 2022 02:01:18 -0000      1.22
> +++ arch/amd64/include/atomic.h       8 Sep 2022 13:28:48 -0000
> @@ -77,6 +77,18 @@ _atomic_cas_ulong(volatile unsigned long
>  }
>  #define atomic_cas_ulong(_p, _e, _n) _atomic_cas_ulong((_p), (_e), (_n))
>
> +static inline unsigned long long
> +_atomic_cas_ullong(volatile unsigned long long *p, unsigned long long e,
> +    unsigned long long n)
> +{
> +     __asm volatile(_LOCK " cmpxchgq %2, %1"
> +         : "=a" (n), "=m" (*p)
> +         : "r" (n), "a" (e), "m" (*p));
> +
> +     return (n);
> +}
> +#define atomic_cas_64(_p, _e, _n) _atomic_cas_ullong((_p), (_e), (_n))
> +
>  static inline void *
>  _atomic_cas_ptr(volatile void *p, void *e, void *n)
>  {
> Index: arch/i386/include/atomic.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/i386/include/atomic.h,v
> retrieving revision 1.20
> diff -u -p -r1.20 atomic.h
> --- arch/i386/include/atomic.h        29 Aug 2022 02:01:18 -0000      1.20
> +++ arch/i386/include/atomic.h        8 Sep 2022 13:28:49 -0000
> @@ -83,6 +83,12 @@ _atomic_cas_ptr(volatile void *p, void *
>  }
>  #define atomic_cas_ptr(_p, _e, _n) _atomic_cas_ptr((_p), (_e), (_n))
>
> +static inline uint64_t
> +atomic_cas_64(volatile uint64_t *p, uint64_t e, uint64_t n)
> +{
> +     return __sync_val_compare_and_swap(p, e, n);
> +}
> +
>  static inline unsigned int
>  _atomic_swap_uint(volatile unsigned int *p, unsigned int n)
>  {
> Index: arch/i386/include/cpufunc.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/i386/include/cpufunc.h,v
> retrieving revision 1.33
> diff -u -p -r1.33 cpufunc.h
> --- arch/i386/include/cpufunc.h       13 Sep 2020 11:53:16 -0000      1.33
> +++ arch/i386/include/cpufunc.h       8 Sep 2022 13:28:49 -0000
> @@ -229,6 +229,15 @@ rdtsc(void)
>       return (tsc);
>  }
>
> +static inline uint64_t
> +rdtsc_lfence(void)
> +{
> +     uint64_t tsc;
> +
> +     __asm volatile("lfence; rdtsc" : "=A" (tsc));
> +     return tsc;
> +}
> +
>  static __inline void
>  wrmsr(u_int msr, u_int64_t newval)
>  {
> Index: dev/pv/pvclock.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pv/pvclock.c,v
> retrieving revision 1.8
> diff -u -p -r1.8 pvclock.c
> --- dev/pv/pvclock.c  5 Nov 2021 11:38:29 -0000       1.8
> +++ dev/pv/pvclock.c  8 Sep 2022 13:28:49 -0000
> @@ -21,21 +21,23 @@
>  #endif
>
>  #include <sys/param.h>
> +#include <sys/stdint.h>
>  #include <sys/systm.h>
>  #include <sys/kernel.h>
>  #include <sys/timetc.h>
>  #include <sys/timeout.h>
>  #include <sys/malloc.h>
> -#include <sys/atomic.h>
>
>  #include <machine/cpu.h>
>  #include <machine/atomic.h>
> +#include <machine/specialreg.h>
> +
>  #include <uvm/uvm_extern.h>
>
>  #include <dev/pv/pvvar.h>
>  #include <dev/pv/pvreg.h>
>
> -uint pvclock_lastcount;
> +volatile uint64_t pvclock_lastcount;
>
>  struct pvclock_softc {
>       struct device            sc_dev;
> @@ -116,6 +118,12 @@ pvclock_match(struct device *parent, voi
>                   (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) == 0)
>                       return (0);
>
> +             /*
> +              * LFENCE is also required, so check for SSE2 support.
> +              */
> +             if (!ISSET(curcpu()->ci_feature_flags, CPUID_SSE2))
> +                     return (0);
> +
>               return (1);
>       }
>
> @@ -213,6 +221,7 @@ pvclock_get_timecount(struct timecounter
>       struct pvclock_softc            *sc = tc->tc_priv;
>       struct pvclock_time_info        *ti;
>       uint64_t                         tsc_timestamp, system_time, delta, ctr;
> +     uint64_t                         last, tsc;
>       uint32_t                         version, mul_frac;
>       int8_t                           shift;
>       uint8_t                          flags;
> @@ -220,6 +229,7 @@ pvclock_get_timecount(struct timecounter
>       ti = sc->sc_time;
>       do {
>               version = pvclock_read_begin(ti);
> +             tsc = rdtsc_lfence();
>               system_time = ti->ti_system_time;
>               tsc_timestamp = ti->ti_tsc_timestamp;
>               mul_frac = ti->ti_tsc_to_system_mul;
> @@ -231,7 +241,10 @@ pvclock_get_timecount(struct timecounter
>        * The algorithm is described in
>        * linux/Documentation/virtual/kvm/msr.txt
>        */
> -     delta = rdtsc() - tsc_timestamp;
> +     if (tsc > tsc_timestamp)
> +             delta = tsc - tsc_timestamp;
> +     else
> +             delta = 0;
>       if (shift < 0)
>               delta >>= -shift;
>       else
> @@ -241,10 +254,11 @@ pvclock_get_timecount(struct timecounter
>       if ((flags & PVCLOCK_FLAG_TSC_STABLE) != 0)
>               return (ctr);
>
> -     if (ctr < pvclock_lastcount)
> -             return (pvclock_lastcount);
> -
> -     atomic_swap_uint(&pvclock_lastcount, ctr);
> +     do {
> +             last = pvclock_lastcount;
> +             if (ctr < last)
> +                     return (last);
> +     } while (atomic_cas_64(&pvclock_lastcount, last, ctr) != last);
>
>       return (ctr);
>  }
