This definitely wants testing on Ryzen ThinkPads (e.g. E485/E585/X395/T495s)
or the Inspiron 5505; I see the user TSC disabled on a lot of those in dmesg logs.


On 2022/01/27 10:28, Scott Cheloha wrote:
> Hi,
> 
> sthen@ complained recently about a multisocket system not being able
> to use the TSC in userspace because the sync test measured too much
> skew and disabled it.
> 
> I don't think there is any real skew on that system.  I think the sync
> test is confusing NUMA overhead for skew and issuing a false positive
> result.
> 
> Now, we _could_ change the test to account for NUMA overhead.  I don't
> know exactly how you would do it, but I imagine you could devise a
> handshake to compute an estimate and then factor that into your skew
> measurement.
> 
> Another approach is to drop the current skew measurement handshake in
> favor of a dumber approach without that particular false positive case.
> 
> This patch changes our sync test to a dumber approach.  Instead of
> trying to measure and correct for skew we only test for lag, and if
> we detect any we make no attempt to correct for it.
> 
> Two CPUs enter a loop and continuously check whether the other CPU's
> TSC lags their own.  With this approach the false positive we are
> seeing on sthen@'s machine is impossible because we are only checking
> whether one value lags the other, not whether their difference exceeds
> an arbitrary value.  Synchronization is tested to within a margin of
> error because both CPUs are checking for lag at the same time.
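> 
> A simplified sketch of that loop ("lag_test" is a made-up name; the
> real thing is tsc_test_sync() in the patch below, which adds the SMT
> yield and the shared regression counting):
> 
> 	uint64_t
> 	lag_test(volatile const uint64_t *remote, uint64_t *local)
> 	{
> 		uint64_t latest, max_lag = 0;
> 		uint64_t end = rdtsc_lfence() + tsc_sync_cycles;
> 
> 		while (*local < end) {
> 			latest = *remote;		/* their latest TSC */
> 			*local = rdtsc_lfence();	/* publish our own */
> 			if (latest >= *local) {		/* we lag them */
> 				if (max_lag < latest - *local)
> 					max_lag = latest - *local;
> 			}
> 		}
> 		return max_lag;
> 	}
> 
> Each CPU runs it with "remote" and "local" swapped, so lag in either
> direction is caught.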
> 
> To keep the margin of error as small as possible for a given clock
> rate we spin for a relatively long time.  Right now we spin for 1
> millisecond per test round.  This is arbitrary but it seems to work.
> There is a point of diminishing returns for round duration.  This
> sync test approach takes much more time than the current handshake
> approach and I'd like to minimize our impact on boot time.
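> 
> (Concretely, the budget is TSC_SYNC_MS * tsc_frequency / 1000, so on
> a 2.4 GHz machine each 1 millisecond round spins through roughly 2.4
> million TSC cycles per CPU.)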
> 
> To actually shrink the margin of error you need to run the CPUs at the
> highest possible clock rate.  If they are underclocked due to e.g.
> SpeedStep your margin of error grows and the test may fail to detect
> lag.
> 
> We do two rounds of testing for each CPU.  This is arbitrary.  You
> could do more.  I think at least one additional test round is a good
> idea, just in case we "get lucky" in the first round.  I think this
> could help mitigate measurement problems introduced by factors beyond
> our control.  For example, if one of the CPUs blacks out for the
> duration of the test because it is preempted by the hypervisor, the
> test will pass but the result is bogus.  If we do more test rounds we
> have a better chance of getting a meaningful result even if we get
> unlucky with hypervisor preemption or something similar.
> 
> Misc. notes:
> 
> - We no longer have a per-CPU skew value.  If any TSC lags another we
>   just set the timecounter quality to a negative number.  We don't
>   disable userspace or attempt to correct the skew in the kernel.
> 
>   I think that bears repeating: we are no longer correcting the skew.
>   If we detect any lag the TSC is effectively broken.
> 
> - There is no longer any concept of TSC drift.  Drift is computed
>   from skew change and we have no skew value anymore.
> 
> - Am I allowed to printf(9) from a secondary CPU during boot?  It
>   seems to hang the kernel but I don't know why.
> 
> - I have no idea how to use the volatile keyword correctly.  I am
>   trying to keep the compiler from optimizing away my stores.  I don't
>   think it implicitly understands that two threads are looking at these
>   variables at the same time.
> 
>   If I am abusing it please tell me.  I'm trying to avoid a situation
>   where some later compiler change subtly breaks the test.  If I
>   sprinkle volatile everywhere my impression is that it forces the
>   compiler to actually do the store.
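> 
>   The failure mode I'm trying to rule out looks like this: with a
>   plain u_int the compiler may load tsc_ingress_barrier once, keep
>   it in a register, and spin forever,
> 
> 	while (tsc_ingress_barrier != 2)	/* load may be hoisted */
> 		CPU_BUSY_CYCLE();		/* out of the loop */
> 
>   whereas declaring it volatile forces a fresh load on every
>   iteration, so the other CPU's store is eventually observed.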
> 
> - I have aligned the global TSC values to 64 bytes to try to minimize
>   "cache bounce".  Each value has one reader and one writer so if the
>   two values are on different cache lines a write to one value shouldn't
>   cause a cache miss for the writer of the other value.
> 
>   ... right?
> 
>   I'm not an expert on cache stuff.  Can someone shed light on whether
>   I am doing the right thing here?
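> 
>   Concretely, the intent of the alignment is:
> 
> 	uint64_t tsc_ap_val __aligned(64);	/* written only by AP */
> 	uint64_t tsc_bp_val __aligned(64);	/* written only by BP */
> 
>   With 64 being the cache line size on amd64, each value sits on its
>   own line, so one CPU's stores never invalidate the line the other
>   CPU is writing, i.e. no false sharing between the two writers.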
> 
> - I rolled my own thread barriers for the test.  Would a generic
>   thread barrier be useful elsewhere in the kernel?  Feels wrong to roll
>   my own synchronization primitive, but maybe we actually don't need
>   one anywhere else?
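> 
>   For what it's worth, both barriers here are just a shared counter
>   each CPU increments and then spins on.  A generic two-party version
>   might look like this (a sketch, not a proposal):
> 
> 	struct barrier {
> 		volatile u_int count;
> 	};
> 
> 	void
> 	barrier_wait(struct barrier *b, u_int parties)
> 	{
> 		atomic_inc_int(&b->count);
> 		while (b->count < parties)
> 			CPU_BUSY_CYCLE();
> 	}
> 
>   The awkward part is reuse: a generic version would need a safe way
>   to reset the counter (e.g. a generation number), which is exactly
>   what the ingress/egress pair in the patch handles by hand.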
> 
> - I would like to forcibly reset IA32_TSC_ADJUST before running the
>   test but I don't think the CPU feature flags are set early enough
>   for tsc_reset_adjust() to see the relevant flag even if the CPU
>   has that register.
> 
>   Could we initialize the flags earlier in boot, before the sync test?
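> 
>   The reset itself is trivial once the flag is visible; the condensed
>   shape of tsc_reset_adjust() below is just:
> 
> 	if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST)) {
> 		if (rdmsr(MSR_TSC_ADJUST) != 0)
> 			wrmsr(MSR_TSC_ADJUST, 0);
> 	}
> 
>   It's the feature-flag check that comes too late in boot, not the
>   MSR write.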
> 
> Testing notes:
> 
> - Tests on multisocket systems, multiprocessor VMs on various hypervisors,
>   and on systems where the TSC is currently disabled by the kernel are
>   appreciated!  Just send your dmesg.
> 
> - TSC_DEBUG is set for now to force the test to run on all CPUs even if
>   a prior test has already failed.  This is useful for testing but in
>   practice if we fail one test there is no point in running further
>   tests.
> 
>   I'm actually unsure about this.  If we take the time to run the test
>   on every processor it might give us a better "picture" of what is
>   wrong in a bug report.
> 
> - I left the TSC_FORCE_DESYNC preprocessor clause in there in case you
>   would like to try forcibly desynchronizing the TSC on your AP to see
>   what the results look like in the console buffer.  It relies on the
>   IA32_TSC_ADJUST register, so if you don't have one you'll get a #GP
>   fault.  Proceed at your own risk.
> 
> Thoughts?  Test results?

Index: amd64/tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.24
diff -u -p -r1.24 tsc.c
--- amd64/tsc.c 31 Aug 2021 15:11:54 -0000      1.24
+++ amd64/tsc.c 27 Jan 2022 16:27:49 -0000
@@ -36,13 +36,6 @@ int          tsc_recalibrate;
 uint64_t       tsc_frequency;
 int            tsc_is_invariant;
 
-#define        TSC_DRIFT_MAX                   250
-#define TSC_SKEW_MAX                   100
-int64_t        tsc_drift_observed;
-
-volatile int64_t       tsc_sync_val;
-volatile struct cpu_info       *tsc_sync_cpu;
-
 u_int          tsc_get_timecount(struct timecounter *tc);
 void           tsc_delay(int usecs);
 
@@ -236,22 +229,12 @@ cpu_recalibrate_tsc(struct timecounter *
 u_int
 tsc_get_timecount(struct timecounter *tc)
 {
-       return rdtsc_lfence() + curcpu()->ci_tsc_skew;
+       return rdtsc_lfence();
 }
 
 void
 tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
 {
-#ifdef TSC_DEBUG
-       printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
-           (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
-#endif
-       if (ci->ci_tsc_skew < -TSC_SKEW_MAX || ci->ci_tsc_skew > TSC_SKEW_MAX) {
-               printf("%s: disabling user TSC (skew=%lld)\n",
-                   ci->ci_dev->dv_xname, (long long)ci->ci_tsc_skew);
-               tsc_timecounter.tc_user = 0;
-       }
-
        if (!(ci->ci_flags & CPUF_PRIMARY) ||
            !(ci->ci_flags & CPUF_CONST_TSC) ||
            !(ci->ci_flags & CPUF_INVAR_TSC))
@@ -268,111 +251,237 @@ tsc_timecounter_init(struct cpu_info *ci
                calibrate_tsc_freq();
        }
 
-       if (tsc_drift_observed > TSC_DRIFT_MAX) {
-               printf("ERROR: %lld cycle TSC drift observed\n",
-                   (long long)tsc_drift_observed);
-               tsc_timecounter.tc_quality = -1000;
-               tsc_timecounter.tc_user = 0;
-               tsc_is_invariant = 0;
-       }
-
        tc_init(&tsc_timecounter);
 }
 
-/*
- * Record drift (in clock cycles).  Called during AP startup.
- */
 void
-tsc_sync_drift(int64_t drift)
+tsc_delay(int usecs)
 {
-       if (drift < 0)
-               drift = -drift;
-       if (drift > tsc_drift_observed)
-               tsc_drift_observed = drift;
+       uint64_t interval, start;
+
+       interval = (uint64_t)usecs * tsc_frequency / 1000000;
+       start = rdtsc_lfence();
+       while (rdtsc_lfence() - start < interval)
+               CPU_BUSY_CYCLE();
 }
 
-/*
- * Called during startup of APs, by the boot processor.  Interrupts
- * are disabled on entry.
- */
-void
-tsc_read_bp(struct cpu_info *ci, uint64_t *bptscp, uint64_t *aptscp)
-{
-       uint64_t bptsc;
+#ifdef MULTIPROCESSOR
 
-       if (atomic_swap_ptr(&tsc_sync_cpu, ci) != NULL)
-               panic("tsc_sync_bp: 1");
+#define TSC_DEBUG 1
 
-       /* Flag it and read our TSC. */
-       atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-       bptsc = (rdtsc_lfence() >> 1);
+/*
+ * Protections for global variables in this code:
+ *
+ *     1       Only set from zero to one
+ *     a       Modified atomically
+ *     b       Protected by a barrier
+ *     p       Only modified by primary CPU
+ *     s       Only modified by secondary CPU
+ */
 
-       /* Wait for remote to complete, and read ours again. */
-       while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
-               membar_consumer();
-       bptsc += (rdtsc_lfence() >> 1);
+#define TSC_SYNC_MS            1       /* Test round duration */
+#define TSC_SYNC_ROUNDS                2       /* Number of test rounds */
 
-       /* Wait for the results to come in. */
-       while (tsc_sync_cpu == ci)
-               CPU_BUSY_CYCLE();
-       if (tsc_sync_cpu != NULL)
-               panic("tsc_sync_bp: 2");
+uint64_t tsc_ap_val __aligned(64);     /* [s] Latest TSC value from AP */
+uint64_t tsc_bp_val __aligned(64);     /* [p] Latest TSC value from BP */
+uint64_t tsc_sync_cycles;              /* [p] TSC cycles per test round */
+const char *tsc_ap_name;               /* [b] Name of AP */
+volatile uint64_t tsc_ap_lag;          /* [s] Biggest regression seen by AP */
+volatile u_int tsc_egress_barrier;     /* [a] Sync test end barrier */
+volatile u_int tsc_ingress_barrier;    /* [a] Sync test start barrier */
+volatile u_int tsc_sync_rounds;                /* [b] Remaining test rounds */
+volatile u_int tsc_lag_count;          /* [a] No. of regressions this round */
+int tsc_simulag;                       /* [1] Regressions seen on both CPUs? */
+int tsc_is_synchronized = 1;           /* [p] TSC sync'd across all CPUs? */
 
-       *bptscp = bptsc;
-       *aptscp = tsc_sync_val;
-}
+void tsc_reset_adjust(const struct cpu_info *);
+uint64_t tsc_test_sync(volatile const uint64_t *, uint64_t *);
 
 void
-tsc_sync_bp(struct cpu_info *ci)
+tsc_test_sync_bp(struct cpu_info *ci)
 {
-       uint64_t bptsc, aptsc;
+       uint64_t bp_lag;
+       unsigned int round;
+
+       /* TSC must be constant and invariant to use it as a timecounter. */
+       if (!tsc_is_invariant)
+               return;
+#ifndef TSC_DEBUG
+       /* No point in testing again if a prior test failed. */
+       if (!tsc_is_synchronized)
+               return;
+#endif
 
-       tsc_read_bp(ci, &bptsc, &aptsc); /* discarded - cache effects */
-       tsc_read_bp(ci, &bptsc, &aptsc);
+       /* Reset IA32_TSC_ADJUST if it exists. */
+       tsc_reset_adjust(ci);
 
-       /* Compute final value to adjust for skew. */
-       ci->ci_tsc_skew = bptsc - aptsc;
+       /* Reset the test cycle limit and round count. */
+       tsc_sync_cycles = TSC_SYNC_MS * tsc_frequency / 1000;
+       tsc_sync_rounds = TSC_SYNC_ROUNDS;
+
+       do {
+               /*
+                * Pass through the ingress barrier, run the test,
+                * then wait for the AP to reach the egress barrier.
+                */
+               atomic_inc_int(&tsc_ingress_barrier);
+               while (tsc_ingress_barrier != 2)
+                       CPU_BUSY_CYCLE();
+               bp_lag = tsc_test_sync(&tsc_ap_val, &tsc_bp_val);
+               while (tsc_egress_barrier != 1)
+                       CPU_BUSY_CYCLE();
+
+               /*
+                * Report what went wrong, if anything.
+                */
+               if (tsc_lag_count != 0) {
+                       round = TSC_SYNC_ROUNDS - tsc_sync_rounds + 1;
+                       printf("tsc: cpu0/%s sync round %u: %u regressions\n",
+                           tsc_ap_name, round, tsc_lag_count);
+                       if (tsc_simulag) {
+                               printf("tsc: cpu0/%s sync round %u: "
+                                   "regressions seen by both CPUs\n",
+                                   tsc_ap_name, round);
+                       }
+                       printf("tsc: cpu0/%s sync round %u: "
+                           "cpu0 lags %s by %llu cycles\n",
+                           tsc_ap_name, round, tsc_ap_name, bp_lag);
+                       printf("tsc: cpu0/%s sync round %u: "
+                           "%s lags cpu0 by %llu cycles\n",
+                           tsc_ap_name, round, tsc_ap_name, tsc_ap_lag);
+                       /*
+                        * XXX We need a tc_detach() function to actually
+                        * disable a given timecounter.  Lowering the quality
+                        * like this is a nasty hack.
+                        */
+                       tsc_timecounter.tc_quality = -1000;
+                       tsc_is_synchronized = 0;
+                       tsc_sync_rounds = 0;
+               } else
+                       tsc_sync_rounds--;
+
+               /*
+                * Clean up for the next round.
+                *
+                * It is safe to reset the ingress barrier because
+                * at this point we know the AP has reached the egress
+                * barrier.
+                */
+               tsc_bp_val = 0;
+               tsc_ap_val = 0;
+               tsc_ap_lag = 0;
+               tsc_ingress_barrier = 0;
+               tsc_simulag = 0;
+               tsc_lag_count = 0;
+               if (tsc_sync_rounds == 0)
+                       tsc_ap_name = NULL;
+
+               /*
+                * Pass through the egress barrier and release the AP.
+                * The AP is responsible for resetting the barrier.
+                */
+               if (atomic_inc_int_nv(&tsc_egress_barrier) != 2)
+                       panic("%s: unexpected egress count", __func__);
+       } while (tsc_sync_rounds > 0);
 }
 
-/*
- * Called during startup of AP, by the AP itself.  Interrupts are
- * disabled on entry.
- */
 void
-tsc_post_ap(struct cpu_info *ci)
+tsc_test_sync_ap(struct cpu_info *ci)
 {
-       uint64_t tsc;
+       if (!tsc_is_invariant)
+               return;
+#ifndef TSC_DEBUG
+       if (!tsc_is_synchronized)
+               return;
+#endif
 
-       /* Wait for go-ahead from primary. */
-       while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
-               membar_consumer();
-       tsc = (rdtsc_lfence() >> 1);
+       tsc_reset_adjust(ci);
 
-       /* Instruct primary to read its counter. */
-       atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-       tsc += (rdtsc_lfence() >> 1);
+/* #define TSC_FORCE_DESYNC (150) */
+#ifdef TSC_FORCE_DESYNC
+       wrmsr(MSR_TSC_ADJUST, TSC_FORCE_DESYNC);
+#endif
 
-       /* Post result.  Ensure the whole value goes out atomically. */
-       (void)atomic_swap_64(&tsc_sync_val, tsc);
+       /* The BP needs our name in order to log any problems. */
+       tsc_ap_name = ci->ci_dev->dv_xname;
 
-       if (atomic_swap_ptr(&tsc_sync_cpu, NULL) != ci)
-               panic("tsc_sync_ap");
+       /*
+        * As the AP we are only responsible for running the test,
+        * reporting our lag, and resetting the egress barrier.
+        * The BP sets up and tears down everything else.
+        */
+       do {
+               atomic_inc_int(&tsc_ingress_barrier);
+               while (tsc_ingress_barrier != 2)
+                       CPU_BUSY_CYCLE();
+               tsc_ap_lag = tsc_test_sync(&tsc_bp_val, &tsc_ap_val);
+               atomic_inc_int(&tsc_egress_barrier);
+               while (tsc_egress_barrier != 2)
+                       CPU_BUSY_CYCLE();
+               tsc_egress_barrier = 0;
+       } while (tsc_sync_rounds > 0);
 }
 
 void
-tsc_sync_ap(struct cpu_info *ci)
+tsc_reset_adjust(const struct cpu_info *ci)
 {
-       tsc_post_ap(ci);
-       tsc_post_ap(ci);
+       int64_t adj;
+
+       if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST)) {
+               adj = rdmsr(MSR_TSC_ADJUST);
+               if (adj != 0) {
+                       wrmsr(MSR_TSC_ADJUST, 0);
+#if 0
+                       /* XXX APs can't printf during boot or resume? */
+                       printf("tsc: %s: IA32_TSC_ADJUST: %+lld -> 0\n",
+                           ci->ci_dev->dv_xname, (long long)adj);
+#endif
+               }
+       }
 }
 
-void
-tsc_delay(int usecs)
+uint64_t
+tsc_test_sync(volatile const uint64_t *reference, uint64_t *local)
 {
-       uint64_t interval, start;
+       uint64_t end, lag, max_lag, ref_latest;
+       u_int i, lag_count;
 
-       interval = (uint64_t)usecs * tsc_frequency / 1000000;
-       start = rdtsc_lfence();
-       while (rdtsc_lfence() - start < interval)
-               CPU_BUSY_CYCLE();
+       max_lag = 0;
+       i = lag_count = 0;
+
+       end = rdtsc_lfence() + tsc_sync_cycles;
+       while (*local < end) {
+               ref_latest = *reference;                /* atomic */
+               *local = rdtsc_lfence();
+
+               /*
+                * Ensure the other CPU has a chance to run.  This is
+                * crucial if the other CPU is our SMT sibling.  SMT
+                * starvation can prevent this test from detecting
+                * relatively large lags.  Eight is an arbitrary value
+                * but it works in practice without impacting the
+                * accuracy of the test for non-sibling threads.
+                */
+               if (++i % 8 == 0)
+                       CPU_BUSY_CYCLE();
+
+               /* Everything is fine if the count is strictly monotonic. */
+               if (ref_latest < *local)
+                       continue;
+
+               /*
+                * Otherwise, our TSC lags the other TSC.  Record the
+                * magnitude of the problem.
+                */
+               lag = ref_latest - *local;
+               if (max_lag < lag)
+                       max_lag = lag;
+               if (atomic_inc_int_nv(&tsc_lag_count) != lag_count + 1)
+                       tsc_simulag = 1;
+               lag_count = tsc_lag_count;              /* atomic */
+       }
+
+       return max_lag;
 }
+
+#endif /* MULTIPROCESSOR */
Index: amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.154
diff -u -p -r1.154 cpu.c
--- amd64/cpu.c 31 Aug 2021 17:40:59 -0000      1.154
+++ amd64/cpu.c 27 Jan 2022 16:27:49 -0000
@@ -762,9 +762,9 @@ cpu_init(struct cpu_info *ci)
        lcr4(cr4 & ~CR4_PGE);
        lcr4(cr4);
 
-       /* Synchronize TSC */
+       /* Check if our TSC is synchronized with the BP's TSC. */
        if (cold && !CPU_IS_PRIMARY(ci))
-             tsc_sync_ap(ci);
+             tsc_test_sync_ap(ci);
 #endif
 }
 
@@ -844,18 +844,14 @@ cpu_start_secondary(struct cpu_info *ci)
 #endif
        } else {
                /*
-                * Synchronize time stamp counters. Invalidate cache and
-                * synchronize twice (in tsc_sync_bp) to minimize possible
-                * cache effects. Disable interrupts to try and rule out any
-                * external interference.
+                * Check if our TSC is synchronized with the AP's TSC.
+                * Invalidate cache to minimize possible cache effects
+                * and disable interrupts to minimize test interference.
                 */
                s = intr_disable();
                wbinvd();
-               tsc_sync_bp(ci);
+               tsc_test_sync_bp(ci);
                intr_restore(s);
-#ifdef TSC_DEBUG
-               printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
        }
 
        if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
@@ -880,7 +876,6 @@ void
 cpu_boot_secondary(struct cpu_info *ci)
 {
        int i;
-       int64_t drift;
        u_long s;
 
        atomic_setbits_int(&ci->ci_flags, CPUF_GO);
@@ -895,18 +890,14 @@ cpu_boot_secondary(struct cpu_info *ci)
                db_enter();
 #endif
        } else if (cold) {
-               /* Synchronize TSC again, check for drift. */
-               drift = ci->ci_tsc_skew;
+               /*
+                * Test TSC synchronization again.  Something may
+                * have screwed it up while we were down.
+                */
                s = intr_disable();
                wbinvd();
-               tsc_sync_bp(ci);
+               tsc_test_sync_bp(ci);
                intr_restore(s);
-               drift -= ci->ci_tsc_skew;
-#ifdef TSC_DEBUG
-               printf("TSC skew=%lld drift=%lld\n",
-                   (long long)ci->ci_tsc_skew, (long long)drift);
-#endif
-               tsc_sync_drift(drift);
        }
 }
 
@@ -932,16 +923,17 @@ cpu_hatch(void *v)
 #endif
 
        /*
-        * Synchronize the TSC for the first time. Note that interrupts are
-        * off at this point.
+        * Test if our TSC is synchronized with the BP's TSC.
+        * Note that interrupts are off at this point.  Invalidate
+        * cache to minimize cache effects on the test.
         */
        wbinvd();
        ci->ci_flags |= CPUF_PRESENT;
-       ci->ci_tsc_skew = 0;    /* reset on resume */
-       tsc_sync_ap(ci);
+       tsc_test_sync_ap(ci);
 
        lapic_enable();
        lapic_startclock();
+
        cpu_ucode_apply(ci);
        cpu_tsx_disable(ci);
 
Index: include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.141
diff -u -p -r1.141 cpu.h
--- include/cpu.h       31 Aug 2021 17:40:59 -0000      1.141
+++ include/cpu.h       27 Jan 2022 16:27:49 -0000
@@ -209,8 +209,6 @@ struct cpu_info {
        paddr_t         ci_vmxon_region_pa;
        struct vmxon_region *ci_vmxon_region;
 
-       int64_t         ci_tsc_skew;            /* counter skew vs cpu0 */
-
        char            ci_panicbuf[512];
 
        paddr_t         ci_vmcs_pa;
@@ -230,7 +228,6 @@ struct cpu_info {
 #define CPUF_INVAR_TSC 0x0100          /* CPU has invariant TSC */
 #define CPUF_USERXSTATE        0x0200          /* CPU has curproc's xsave state */
 
-#define CPUF_SYNCTSC   0x0800          /* Synchronize TSC */
 #define CPUF_PRESENT   0x1000          /* CPU is present */
 #define CPUF_RUNNING   0x2000          /* CPU is running */
 #define CPUF_PAUSE     0x4000          /* CPU is paused in DDB */
Index: include/cpuvar.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h    16 May 2021 04:33:05 -0000      1.11
+++ include/cpuvar.h    27 Jan 2022 16:27:49 -0000
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
 void cpu_init(struct cpu_info *);
 void cpu_init_first(void);
 
-void tsc_sync_drift(int64_t);
-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
 
 #endif
