> On Sep 8, 2022, at 9:27 AM, Mark Kettenis <mark.kette...@xs4all.nl> wrote:
> 
>> Date: Thu, 8 Sep 2022 08:32:27 -0500
>> From: Scott Cheloha <scottchel...@gmail.com>
>> 
>> On Tue, Sep 06, 2022 at 03:30:44AM -0700, Mike Larkin wrote:
>>> On Sun, Sep 04, 2022 at 02:50:10PM +1000, Jonathan Gray wrote:
>>>> On Sat, Sep 03, 2022 at 05:33:01PM -0500, Scott Cheloha wrote:
>>>>> On Sat, Sep 03, 2022 at 10:37:31PM +1000, Jonathan Gray wrote:
>>>>>> On Sat, Sep 03, 2022 at 06:52:20AM -0500, Scott Cheloha wrote:
>>>>>>>> On Sep 3, 2022, at 02:22, Jonathan Gray <j...@jsg.id.au> wrote:
>>>>>>>> 
>>>>>>>> On Fri, Sep 02, 2022 at 06:00:25PM -0500, Scott Cheloha wrote:
>>>>>>>>> dv@ suggested coming to the list to request testing for the pvclock(4)
>>>>>>>>> driver.  Attached is a patch that corrects several bugs.  Most of
>>>>>>>>> these changes will only matter in the non-TSC_STABLE case on a
>>>>>>>>> multiprocessor VM.
>>>>>>>>> 
>>>>>>>>> Ideally, nothing should break.
>>>>>>>>> 
>>>>>>>>> - pvclock yields a 64-bit value.  The BSD timecounter layer can only
>>>>>>>>> use the lower 32 bits, but internally we need to track the full
>>>>>>>>> 64-bit value to allow comparisons with the full value in the
>>>>>>>>> non-TSC_STABLE case.  So make pvclock_lastcount a 64-bit quantity.
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), move rdtsc() up into the lockless read
>>>>>>>>> loop to get a more accurate timestamp.
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), use rdtsc_lfence(), not rdtsc().
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), check that our TSC value doesn't predate
>>>>>>>>> ti->ti_tsc_timestamp; otherwise we will produce an enormous value.
>>>>>>>>> 
>>>>>>>>> - In pvclock_get_timecount(), update pvclock_lastcount in the
>>>>>>>>> non-TSC_STABLE case with more care.  On amd64 we can do this with an
>>>>>>>>> atomic_cas_ulong(9) loop because u_long is 64 bits.  On i386 we need
>>>>>>>>> to introduce a mutex to protect our comparison and read/write.
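>>>>>>>>> 
>>>>>>>>> To illustrate that last change, the non-TSC_STABLE tail on amd64
>>>>>>>>> would look roughly like this (a sketch, not the literal diff;
>>>>>>>>> "ctr" holds the freshly computed pvclock time):
>>>>>>>>> 
>>>>>>>>> volatile u_long pvclock_lastcount;
>>>>>>>>> 
>>>>>>>>> 	u_long ctr, last;
>>>>>>>>> 
>>>>>>>>> 	/* Never let the counter run backwards. */
>>>>>>>>> 	do {
>>>>>>>>> 		last = pvclock_lastcount;
>>>>>>>>> 		if (ctr <= last)
>>>>>>>>> 			return ((u_int)last);
>>>>>>>>> 	} while (atomic_cas_ulong(&pvclock_lastcount, last, ctr) != last);
>>>>>>>>> 	return ((u_int)ctr);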
>>>>>>>> 
>>>>>>>> i386 has cmpxchg8b, no need to disable interrupts
>>>>>>>> the ifdefs seem excessive
>>>>>>> 
>>>>>>> How do I make use of CMPXCHG8B on i386
>>>>>>> in this context?
>>>>>>> 
>>>>>>> atomic_cas_ulong(9) is a 32-bit CAS on
>>>>>>> i386.
>>>>>> 
>>>>>> static inline uint64_t
>>>>>> atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
>>>>>> {
>>>>>>  return __sync_val_compare_and_swap(p, o, n);
>>>>>> }
>>>>>> 
>>>>>> Or md atomic.h files could have an equivalent.
>>>>>> Not possible on all 32-bit archs.
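>>>>>> 
>>>>>> An md version could also use CMPXCHG8B directly.  Untested sketch
>>>>>> ("A" is the edx:eax register pair):
>>>>>> 
>>>>>> static inline uint64_t
>>>>>> atomic_cas_64(volatile uint64_t *p, uint64_t o, uint64_t n)
>>>>>> {
>>>>>> 	/*
>>>>>> 	 * CMPXCHG8B: if *p == edx:eax then *p = ecx:ebx, else
>>>>>> 	 * edx:eax = *p.  Either way "o" ends up holding the old
>>>>>> 	 * value, which is what we return.
>>>>>> 	 */
>>>>>> 	__asm volatile("lock cmpxchg8b %1"
>>>>>> 	    : "+A" (o), "+m" (*p)
>>>>>> 	    : "b" ((uint32_t)n), "c" ((uint32_t)(n >> 32)));
>>>>>> 	return (o);
>>>>>> }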
>>>>>> 
>>>>>>> 
>>>>>>> We can't use FP registers in the kernel, no?
>>>>>> 
>>>>>> What do FP registers have to do with it?
>>>>>> 
>>>>>>> 
>>>>>>> Am I missing some other avenue?
>>>>>> 
>>>>>> There is no rdtsc_lfence() on i386.  Initial diff doesn't build.
>>>>> 
>>>>> LFENCE is an SSE2 extension.  As is MFENCE.  I don't think I can just
>>>>> drop rdtsc_lfence() into cpufunc.h and proceed without causing some
>>>>> kind of fault on an older CPU.
>>>>> 
>>>>> What are my options on a 586-class CPU for forcing RDTSC to complete
>>>>> before later instructions?
>>>> 
>>>> "3.3.2. Serializing Operations
>>>> After executing certain instructions the Pentium processor serializes
>>>> instruction execution. This means that any modifications to flags,
>>>> registers, and memory for previous instructions are completed before
>>>> the next instruction is fetched and executed. The prefetch queue
>>>> is flushed as a result of serializing operations.
>>>> 
>>>> The Pentium processor serializes instruction execution after executing
>>>> one of the following instructions: Move to Special Register (except
>>>> CR0), INVD, INVLPG, IRET, IRETD, LGDT, LLDT, LIDT, LTR, WBINVD,
>>>> CPUID, RSM and WRMSR."
>>>> 
>>>> from:
>>>> Pentium Processor User's Manual
>>>> Volume 1: Pentium Processor Data Book
>>>> Order Number 241428
>>>> 
>>>> http://bitsavers.org/components/intel/pentium/1993_Intel_Pentium_Processor_Users_Manual_Volume_1.pdf
>>>> 
>>>> So it could be "rdtsc; cpuid".
>>>> "lfence; rdtsc" should still be preferred where available.
>>>> 
>>>> It could be tested during boot and set a function pointer.
>>>> Or the codepatch bits could be used.
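>>>> 
>>>> A sketch of the function pointer approach (illustrative only;
>>>> rdtsc_cpuid, rdtsc_fenced and tsc_fence_init are made-up names,
>>>> and cpu_feature/CPUID_SSE2 are assumed from the i386 headers):
>>>> 
>>>> static inline uint64_t
>>>> rdtsc_cpuid(void)
>>>> {
>>>> 	uint32_t lo, hi;
>>>> 
>>>> 	/*
>>>> 	 * CPUID is serializing: nothing after it starts until
>>>> 	 * RDTSC has completed.  Save the result first, CPUID
>>>> 	 * clobbers eax and edx.
>>>> 	 */
>>>> 	__asm volatile(
>>>> 	    "rdtsc\n\t"
>>>> 	    "movl %%eax, %0\n\t"
>>>> 	    "movl %%edx, %1\n\t"
>>>> 	    "xorl %%eax, %%eax\n\t"
>>>> 	    "cpuid"
>>>> 	    : "=r" (lo), "=r" (hi) : : "eax", "ebx", "ecx", "edx");
>>>> 	return ((uint64_t)hi << 32 | lo);
>>>> }
>>>> 
>>>> uint64_t (*rdtsc_fenced)(void) = rdtsc_cpuid;
>>>> 
>>>> void
>>>> tsc_fence_init(void)
>>>> {
>>>> 	if (cpu_feature & CPUID_SSE2)
>>>> 		rdtsc_fenced = rdtsc_lfence;
>>>> }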
>>>> 
>>>> In the specific case of pvclock, can it be assumed that the host
>>>> has hardware virt and would then have lfence?
>>>> 
>>> 
>>> I think this is a fair assumption.
>> 
>> Okay, so how about we use rdtsc_lfence() in pvclock_get_timecount()
>> and, just to be safe, check for SSE2 during pvclock_match().
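>> 
>> Concretely, the match check would amount to something like this
>> (a sketch; assumes the CPUID feature bits are already gathered in
>> ci_feature_flags and CPUID_SSE2 comes from specialreg.h):
>> 
>> 	/* LFENCE is an SSE2 extension; without it we can't match. */
>> 	if ((curcpu()->ci_feature_flags & CPUID_SSE2) == 0)
>> 		return (0);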
>> 
>> If in the future someone wants pvclock(4) support on guests without
>> SSE2 support (if any exist) we can deal with it at that time.
>> 
>> Updated patch:
>> 
>> - Add atomic_cas_64() to amd64/include/atomic.h.
>> 
>>  (NetBSD calls it atomic_cas_64(), so this is consistent with their
>>  atomic function naming scheme.)
>> 
>>  I am unsure whether it would be better to just do this:
>> 
>> #define atomic_cas_64(_p, _e, _n) \
>> 	_atomic_cas_ulong((unsigned long *)(_p), (_e), (_n))
>> 
>>  If I don't cast the pointer, the compiler complains, so I added
>>  _atomic_cas_ullong().
>> 
>>  What is preferred here?
>> 
>> - Add atomic_cas_64() to i386/include/atomic.h.  Implemented with the
>>  __sync_val_compare_and_swap() builtin.
>> 
>> - Add rdtsc_lfence() to i386/include/cpufunc.h (a sketch of it
>>  follows this list).
>> 
>> - In pvclock.c, make pvclock_lastcount a volatile 64-bit value to
>>  fix the non-PVCLOCK_FLAG_TSC_STABLE case.
>> 
>> - In pvclock_match(), check for SSE2 support in ci_feature_flags.
>>  If we don't have it, we don't have LFENCE and we can't match.
>> 
>> - In pvclock_get_timecount(), do RDTSC as early as possible in the
>>  lockless read loop to get a better timestamp.
>> 
>> - In pvclock_get_timecount(), use rdtsc_lfence() instead of rdtsc().
>> 
>> - In pvclock_get_timecount(), check whether our TSC lags
>>  ti_tsc_timestamp so we don't produce an enormous, invalid delta.
>>  If we're lagging, set the delta to zero.
>> 
>> - In pvclock_get_timecount(), fix the non-PVCLOCK_FLAG_TSC_STABLE
>>  case.
>> 
>>  Because we could be in pvclock_get_timecount() with other vCPUs, we
>>  need to read, compare, and (maybe) replace pvclock_lastcount in an
>>  atomic CAS loop.
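>> 
>>  For reference, the i386 rdtsc_lfence() mentioned above is just the
>>  amd64 version split into 32-bit halves (sketch):
>> 
>> static inline uint64_t
>> rdtsc_lfence(void)
>> {
>> 	uint32_t hi, lo;
>> 
>> 	/* LFENCE keeps RDTSC from executing ahead of earlier loads. */
>> 	__asm volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
>> 	return ((uint64_t)hi << 32 | lo);
>> }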
>> 
>>  The only thing that I'm still unsure about is whether this:
>> 
>> volatile uint64_t pvclock_lastcount;
>> 
>> void
>> foo(void)
>> {
>>      uint64_t last;
>> 
>>      last = pvclock_lastcount;       /* atomic on i386? */
>> }
>> 
>>  is a safe, atomic 8-byte load on i386 on all the CPUs we support,
>>  i.e. 586 and up.
>> 
>>  Can someone confirm this?  I didn't know you could do this with
>>  32-bit registers.
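>> 
>>  (If the plain load turns out not to be atomic, one fallback is to
>>  bounce the read through the CAS itself, at the cost of a locked
>>  bus cycle:
>> 
>> 	/* A CAS with equal old/new values never modifies *p. */
>> 	last = atomic_cas_64(&pvclock_lastcount, 0, 0);
>> 
>>  but hopefully that isn't necessary.)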
>> 
>> --
>> 
>> Should I put this out in a second request-for-test email?  The patch
>> has drifted a bit.
>> 
>> Index: arch/amd64/include/atomic.h
>> ===================================================================
>> RCS file: /cvs/src/sys/arch/amd64/include/atomic.h,v
>> retrieving revision 1.22
>> diff -u -p -r1.22 atomic.h
>> --- arch/amd64/include/atomic.h      29 Aug 2022 02:01:18 -0000      1.22
>> +++ arch/amd64/include/atomic.h      8 Sep 2022 13:28:48 -0000
>> @@ -77,6 +77,18 @@ _atomic_cas_ulong(volatile unsigned long
>> }
>> #define atomic_cas_ulong(_p, _e, _n) _atomic_cas_ulong((_p), (_e), (_n))
>> 
>> +static inline unsigned long long
>> +_atomic_cas_ullong(volatile unsigned long long *p, unsigned long long e,
>> +    unsigned long long n)
>> +{
>> +    __asm volatile(_LOCK " cmpxchgq %2, %1"
>> +        : "=a" (n), "=m" (*p)
>> +        : "r" (n), "a" (e), "m" (*p));
>> +
>> +    return (n);
>> +}
>> +#define atomic_cas_64(_p, _e, _n) _atomic_cas_ullong((_p), (_e), (_n))
>> +
> 
> Please don't do this.  64-bit atomics are generally not available on
> 32-bit systems.  Putting this function in <machine/atomic.h> will make
> it visible from <sys/atomic.h>, which makes it way too easy for people
> who only test on amd64 to accidentally use it and break other
> architectures.

I introduced this updated patch to address jsg@'s concerns about muddying
up the driver with a bunch of unneeded #ifdef cruft.

If putting atomic_cas_64() in <machine/atomic.h> is bad from a project
maintenance standpoint, where would you prefer I put it?
