This definitely wants testing on Ryzen ThinkPads (e.g. E485/E585/X395/T495s) or the Inspiron 5505; I see the user TSC disabled on a lot of those in dmesg logs.
On 2022/01/27 10:28, Scott Cheloha wrote:
> Hi,
>
> sthen@ complained recently about a multisocket system not being able
> to use the TSC in userspace because the sync test measured too much
> skew and disabled it.
>
> I don't think there is any real skew on that system. I think the
> sync test is confusing NUMA overhead for skew and issuing a false
> positive result.
>
> Now, we _could_ change the test to account for NUMA overhead. I
> don't know exactly how you would do it, but I imagine you could
> devise a handshake to compute an estimate and then factor that into
> your skew measurement.
>
> Another approach is to drop the current skew measurement handshake
> in favor of a dumber approach without that particular false positive
> case.
>
> This patch changes our sync test to a dumber approach. Instead of
> trying to measure and correct for skew we only test for lag and do
> not attempt to correct for it if we detect it.
>
> Two CPUs enter a loop and continuously check whether the other CPU's
> TSC lags their own. With this approach the false positive we are
> seeing on sthen@'s machine is impossible because we are only
> checking whether one value lags the other, not whether their
> difference exceeds an arbitrary value. Synchronization is tested to
> within a margin of error because both CPUs are checking for lag at
> the same time.
>
> To keep the margin of error as small as possible for a given clock
> rate we spin for a relatively long time. Right now we spin for 1
> millisecond per test round. This is arbitrary but it seems to work.
> There is a point of diminishing returns for round duration. This
> sync test approach takes much more time than the current handshake
> approach and I'd like to minimize our impact on boot time.
>
> To actually shrink the margin of error you need to run the CPUs at
> the highest possible clock rate. If they are underclocked due to
> e.g. SpeedStep your margin of error grows and the test may fail to
> detect lag.
>
> We do two rounds of testing for each CPU. This is arbitrary. You
> could do more. I think at least one additional test round is a good
> idea, just in case we "get lucky" in the first round. I think this
> could help mitigate measurement problems introduced by factors
> beyond our control. For example, if one of the CPUs blacks out for
> the duration of the test because it is preempted by the hypervisor
> the test will pass but the result is bogus. If we do more test
> rounds we have a better chance of getting a meaningful result even
> if we get unlucky with hypervisor preemption or something similar.
>
> Misc. notes:
>
> - We no longer have a per-CPU skew value. If any TSC lags another we
>   just set the timecounter quality to a negative number. We don't
>   disable userspace or attempt to correct the skew in the kernel.
>
>   I think that bears repeating: we are no longer correcting the
>   skew. If we detect any lag the TSC is effectively broken.
>
> - There is no longer any concept of TSC drift. Drift is computed
>   from skew change and we have no skew value anymore.
>
> - Am I allowed to printf(9) from a secondary CPU during boot? It
>   seems to hang the kernel but I don't know why.
>
> - I have no idea how to use the volatile keyword correctly. I am
>   trying to keep the compiler from optimizing away my stores. I
>   don't think it implicitly understands that two threads are looking
>   at these variables at the same time.
>
>   If I am abusing it please tell me.
>   I'm trying to avoid a situation where some later compiler change
>   subtly breaks the test. If I sprinkle volatile everywhere my
>   impression is that it forces the compiler to actually do the
>   store.
>
> - I have aligned the global TSC values to 64 bytes to try to
>   minimize "cache bounce". Each value has one reader and one writer
>   so if the two values are on different cache lines a write to one
>   value shouldn't cause a cache miss for the writer of the other
>   value.
>
>   ... right?
>
>   I'm not an expert on cache stuff. Can someone shed light on
>   whether I am doing the right thing here?
>
> - I rolled my own thread barriers for the test. Would a generic
>   thread barrier be useful elsewhere in the kernel? Feels wrong to
>   roll my own synchronization primitive, but maybe we actually don't
>   need them anywhere else?
>
> - I would like to forcibly reset IA32_TSC_ADJUST before running the
>   test but I don't think the CPU feature flags are set early enough
>   for tsc_reset_adjust() to see the relevant flag even if the CPU
>   has that register.
>
>   Could we initialize the flags earlier in boot, before the sync
>   test?
>
> Testing notes:
>
> - Tests on multisocket systems, multiprocessor VMs on various
>   hypervisors, and on systems where the TSC is currently disabled by
>   the kernel are appreciated! Just send your dmesg.
>
> - TSC_DEBUG is set for now to force the test to run on all CPUs even
>   if a prior test has already failed. This is useful for testing but
>   in practice if we fail one test there is no point in running
>   further tests.
>
>   I'm actually unsure about this. If we take the time to run the
>   test on every processor it might give us a better "picture" of
>   what is wrong in a bug report.
>
> - I left the TSC_FORCE_DESYNC preprocessor clause in there if you
>   would like to try forcibly desynchronizing the TSC on your AP to
>   see what the results look like in the console buffer. It relies on
>   the IA32_TSC_ADJUST register, so if you don't have one you'll get
>   a #GP fault. Proceed at your own risk.
>
> Thoughts? Test results?
Index: amd64/tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.24
diff -u -p -r1.24 tsc.c
--- amd64/tsc.c    31 Aug 2021 15:11:54 -0000    1.24
+++ amd64/tsc.c    27 Jan 2022 16:27:49 -0000
@@ -36,13 +36,6 @@ int tsc_recalibrate;
 uint64_t        tsc_frequency;
 int             tsc_is_invariant;
 
-#define TSC_DRIFT_MAX           250
-#define TSC_SKEW_MAX            100
-int64_t tsc_drift_observed;
-
-volatile int64_t        tsc_sync_val;
-volatile struct cpu_info        *tsc_sync_cpu;
-
 u_int           tsc_get_timecount(struct timecounter *tc);
 void            tsc_delay(int usecs);
 
@@ -236,22 +229,12 @@ cpu_recalibrate_tsc(struct timecounter *
 u_int
 tsc_get_timecount(struct timecounter *tc)
 {
-        return rdtsc_lfence() + curcpu()->ci_tsc_skew;
+        return rdtsc_lfence();
 }
 
 void
 tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
 {
-#ifdef TSC_DEBUG
-        printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
-            (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
-#endif
-        if (ci->ci_tsc_skew < -TSC_SKEW_MAX || ci->ci_tsc_skew > TSC_SKEW_MAX) {
-                printf("%s: disabling user TSC (skew=%lld)\n",
-                    ci->ci_dev->dv_xname, (long long)ci->ci_tsc_skew);
-                tsc_timecounter.tc_user = 0;
-        }
-
         if (!(ci->ci_flags & CPUF_PRIMARY) ||
             !(ci->ci_flags & CPUF_CONST_TSC) ||
             !(ci->ci_flags & CPUF_INVAR_TSC))
@@ -268,111 +251,237 @@ tsc_timecounter_init(struct cpu_info *ci
                 calibrate_tsc_freq();
         }
 
-        if (tsc_drift_observed > TSC_DRIFT_MAX) {
-                printf("ERROR: %lld cycle TSC drift observed\n",
-                    (long long)tsc_drift_observed);
-                tsc_timecounter.tc_quality = -1000;
-                tsc_timecounter.tc_user = 0;
-                tsc_is_invariant = 0;
-        }
-
         tc_init(&tsc_timecounter);
 }
 
-/*
- * Record drift (in clock cycles). Called during AP startup.
- */
 void
-tsc_sync_drift(int64_t drift)
+tsc_delay(int usecs)
 {
-        if (drift < 0)
-                drift = -drift;
-        if (drift > tsc_drift_observed)
-                tsc_drift_observed = drift;
+        uint64_t interval, start;
+
+        interval = (uint64_t)usecs * tsc_frequency / 1000000;
+        start = rdtsc_lfence();
+        while (rdtsc_lfence() - start < interval)
+                CPU_BUSY_CYCLE();
 }
 
-/*
- * Called during startup of APs, by the boot processor. Interrupts
- * are disabled on entry.
- */
-void
-tsc_read_bp(struct cpu_info *ci, uint64_t *bptscp, uint64_t *aptscp)
-{
-        uint64_t bptsc;
+#ifdef MULTIPROCESSOR
 
-        if (atomic_swap_ptr(&tsc_sync_cpu, ci) != NULL)
-                panic("tsc_sync_bp: 1");
+#define TSC_DEBUG 1
 
-        /* Flag it and read our TSC. */
-        atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-        bptsc = (rdtsc_lfence() >> 1);
+/*
+ * Protections for global variables in this code:
+ *
+ *      1       Only set from zero to one
+ *      a       Modified atomically
+ *      b       Protected by a barrier
+ *      p       Only modified by primary CPU
+ *      s       Only modified by secondary CPU
+ */
 
-        /* Wait for remote to complete, and read ours again. */
-        while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
-                membar_consumer();
-        bptsc += (rdtsc_lfence() >> 1);
+#define TSC_SYNC_MS             1       /* Test round duration */
+#define TSC_SYNC_ROUNDS         2       /* Number of test rounds */
 
-        /* Wait for the results to come in. */
-        while (tsc_sync_cpu == ci)
-                CPU_BUSY_CYCLE();
-        if (tsc_sync_cpu != NULL)
-                panic("tsc_sync_bp: 2");
+uint64_t tsc_ap_val __aligned(64);      /* [s] Latest TSC value from AP */
+uint64_t tsc_bp_val __aligned(64);      /* [p] Latest TSC value from BP */
+uint64_t tsc_sync_cycles;               /* [p] TSC cycles per test round */
+const char *tsc_ap_name;                /* [b] Name of AP */
+volatile uint64_t tsc_ap_lag;           /* [s] Biggest regression seen by AP */
+volatile u_int tsc_egress_barrier;      /* [a] Sync test end barrier */
+volatile u_int tsc_ingress_barrier;     /* [a] Sync test start barrier */
+volatile u_int tsc_sync_rounds;         /* [b] Remaining test rounds */
+volatile u_int tsc_lag_count;           /* [a] No. of regressions this round */
+int tsc_simulag;                        /* [1] Regressions seen on both CPUs? */
+int tsc_is_synchronized = 1;            /* [p] TSC sync'd across all CPUs? */
 
-        *bptscp = bptsc;
-        *aptscp = tsc_sync_val;
-}
+void tsc_reset_adjust(const struct cpu_info *);
+uint64_t tsc_test_sync(volatile const uint64_t *, uint64_t *);
 
 void
-tsc_sync_bp(struct cpu_info *ci)
+tsc_test_sync_bp(struct cpu_info *ci)
 {
-        uint64_t bptsc, aptsc;
+        uint64_t bp_lag;
+        unsigned int round;
+
+        /* TSC must be constant and invariant to use it as a timecounter. */
+        if (!tsc_is_invariant)
+                return;
+#ifndef TSC_DEBUG
+        /* No point in testing again if a prior test failed. */
+        if (!tsc_is_synchronized)
+                return;
+#endif
 
-        tsc_read_bp(ci, &bptsc, &aptsc);        /* discarded - cache effects */
-        tsc_read_bp(ci, &bptsc, &aptsc);
+        /* Reset IA32_TSC_ADJUST if it exists. */
+        tsc_reset_adjust(ci);
 
-        /* Compute final value to adjust for skew. */
-        ci->ci_tsc_skew = bptsc - aptsc;
+        /* Reset the test cycle limit and round count. */
+        tsc_sync_cycles = TSC_SYNC_MS * tsc_frequency / 1000;
+        tsc_sync_rounds = TSC_SYNC_ROUNDS;
+
+        do {
+                /*
+                 * Pass through the ingress barrier, run the test,
+                 * then wait for the AP to reach the egress barrier.
+                 */
+                atomic_inc_int(&tsc_ingress_barrier);
+                while (tsc_ingress_barrier != 2)
+                        CPU_BUSY_CYCLE();
+                bp_lag = tsc_test_sync(&tsc_ap_val, &tsc_bp_val);
+                while (tsc_egress_barrier != 1)
+                        CPU_BUSY_CYCLE();
+
+                /*
+                 * Report what went wrong, if anything.
+                 */
+                if (tsc_lag_count != 0) {
+                        round = TSC_SYNC_ROUNDS - tsc_sync_rounds + 1;
+                        printf("tsc: cpu0/%s sync round %u: %u regressions\n",
+                            tsc_ap_name, round, tsc_lag_count);
+                        if (tsc_simulag) {
+                                printf("tsc: cpu0/%s sync round %u: "
+                                    "regressions seen by both CPUs\n",
+                                    tsc_ap_name, round);
+                        }
+                        printf("tsc: cpu0/%s sync round %u: "
+                            "cpu0 lags %s by %llu cycles\n",
+                            tsc_ap_name, round, tsc_ap_name, bp_lag);
+                        printf("tsc: cpu0/%s sync round %u: "
+                            "%s lags cpu0 by %llu cycles\n",
+                            tsc_ap_name, round, tsc_ap_name, tsc_ap_lag);
+                        /*
+                         * XXX We need a tc_detach() function to actually
+                         * disable a given timecounter. Lowering the quality
+                         * like this is a nasty hack.
+                         */
+                        tsc_timecounter.tc_quality = -1000;
+                        tsc_is_synchronized = 0;
+                        tsc_sync_rounds = 0;
+                } else
+                        tsc_sync_rounds--;
+
+                /*
+                 * Clean up for the next round.
+                 *
+                 * It is safe to reset the ingress barrier because
+                 * at this point we know the AP has reached the egress
+                 * barrier.
+                 */
+                tsc_bp_val = 0;
+                tsc_ap_val = 0;
+                tsc_ap_lag = 0;
+                tsc_ingress_barrier = 0;
+                tsc_simulag = 0;
+                tsc_lag_count = 0;
+                if (tsc_sync_rounds == 0)
+                        tsc_ap_name = NULL;
+
+                /*
+                 * Pass through the egress barrier and release the AP.
+                 * The AP is responsible for resetting the barrier.
+                 */
+                if (atomic_inc_int_nv(&tsc_egress_barrier) != 2)
+                        panic("%s: unexpected egress count", __func__);
+        } while (tsc_sync_rounds > 0);
 }
 
-/*
- * Called during startup of AP, by the AP itself. Interrupts are
- * disabled on entry.
- */
 void
-tsc_post_ap(struct cpu_info *ci)
+tsc_test_sync_ap(struct cpu_info *ci)
 {
-        uint64_t tsc;
+        if (!tsc_is_invariant)
+                return;
+#ifndef TSC_DEBUG
+        if (!tsc_is_synchronized)
+                return;
+#endif
 
-        /* Wait for go-ahead from primary. */
-        while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
-                membar_consumer();
-        tsc = (rdtsc_lfence() >> 1);
+        tsc_reset_adjust(ci);
 
-        /* Instruct primary to read its counter. */
-        atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-        tsc += (rdtsc_lfence() >> 1);
+/* #define TSC_FORCE_DESYNC (150) */
+#ifdef TSC_FORCE_DESYNC
+        wrmsr(MSR_TSC_ADJUST, TSC_FORCE_DESYNC);
+#endif
 
-        /* Post result. Ensure the whole value goes out atomically. */
-        (void)atomic_swap_64(&tsc_sync_val, tsc);
+        /* The BP needs our name in order to log any problems. */
+        tsc_ap_name = ci->ci_dev->dv_xname;
 
-        if (atomic_swap_ptr(&tsc_sync_cpu, NULL) != ci)
-                panic("tsc_sync_ap");
+        /*
+         * As the AP we are only responsible for running the test,
+         * reporting our lag, and resetting the egress barrier.
+         * The BP sets up and tears down everything else.
+         */
+        do {
+                atomic_inc_int(&tsc_ingress_barrier);
+                while (tsc_ingress_barrier != 2)
+                        CPU_BUSY_CYCLE();
+                tsc_ap_lag = tsc_test_sync(&tsc_bp_val, &tsc_ap_val);
+                atomic_inc_int(&tsc_egress_barrier);
+                while (tsc_egress_barrier != 2)
+                        CPU_BUSY_CYCLE();
+                tsc_egress_barrier = 0;
+        } while (tsc_sync_rounds > 0);
 }
 
 void
-tsc_sync_ap(struct cpu_info *ci)
+tsc_reset_adjust(const struct cpu_info *ci)
 {
-        tsc_post_ap(ci);
-        tsc_post_ap(ci);
+        int64_t adj;
+
+        if (ISSET(ci->ci_feature_sefflags_ebx, SEFF0EBX_TSC_ADJUST)) {
+                adj = rdmsr(MSR_TSC_ADJUST);
+                if (adj != 0) {
+                        wrmsr(MSR_TSC_ADJUST, 0);
+#if 0
+                        /* XXX APs can't printf during boot or resume? */
+                        printf("tsc: %s: IA32_TSC_ADJUST: %+lld -> 0",
+                            ci->ci_dev->dv_xname, (long long)adj);
+#endif
+                }
+        }
 }
 
-void
-tsc_delay(int usecs)
+uint64_t
+tsc_test_sync(volatile const uint64_t *reference, uint64_t *local)
 {
-        uint64_t interval, start;
+        uint64_t end, lag, max_lag, ref_latest;
+        u_int i, lag_count;
 
-        interval = (uint64_t)usecs * tsc_frequency / 1000000;
-        start = rdtsc_lfence();
-        while (rdtsc_lfence() - start < interval)
-                CPU_BUSY_CYCLE();
+        max_lag = 0;
+        i = lag_count = 0;
+
+        end = rdtsc_lfence() + tsc_sync_cycles;
+        while (*local < end) {
+                ref_latest = *reference;        /* atomic */
+                *local = rdtsc_lfence();
+
+                /*
+                 * Ensure the other CPU has a chance to run. This is
+                 * crucial if the other CPU is our SMT sibling. SMT
+                 * starvation can prevent this test from detecting
+                 * relatively large lags. Eight is an arbitrary value
+                 * but it works in practice without impacting the
+                 * accuracy of the test for non-sibling threads.
+                 */
+                if (++i % 8 == 0)
+                        CPU_BUSY_CYCLE();
+
+                /* Everything is fine if the count is strictly monotonic. */
+                if (ref_latest < *local)
+                        continue;
+
+                /*
+                 * Otherwise, our TSC lags the other TSC. Record the
+                 * magnitude of the problem.
+                 */
+                lag = ref_latest - *local;
+                if (max_lag < lag)
+                        max_lag = lag;
+                if (atomic_inc_int_nv(&tsc_lag_count) != lag_count + 1)
+                        tsc_simulag = 1;
+                lag_count = tsc_lag_count;      /* atomic */
+        }
+
+        return max_lag;
 }
+
+#endif /* MULTIPROCESSOR */
Index: amd64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/cpu.c,v
retrieving revision 1.154
diff -u -p -r1.154 cpu.c
--- amd64/cpu.c    31 Aug 2021 17:40:59 -0000    1.154
+++ amd64/cpu.c    27 Jan 2022 16:27:49 -0000
@@ -762,9 +762,9 @@ cpu_init(struct cpu_info *ci)
         lcr4(cr4 & ~CR4_PGE);
         lcr4(cr4);
 
-        /* Synchronize TSC */
+        /* Check if our TSC is synchronized with the BP's TSC. */
         if (cold && !CPU_IS_PRIMARY(ci))
-                tsc_sync_ap(ci);
+                tsc_test_sync_ap(ci);
 #endif
 }
 
@@ -844,18 +844,14 @@ cpu_start_secondary(struct cpu_info *ci)
 #endif
         } else {
                 /*
-                 * Synchronize time stamp counters. Invalidate cache and
-                 * synchronize twice (in tsc_sync_bp) to minimize possible
-                 * cache effects. Disable interrupts to try and rule out any
-                 * external interference.
+                 * Check if our TSC is synchronized with the AP's TSC.
+                 * Invalidate cache to minimize possible cache effects
+                 * and disable interrupts to minimize test interference.
                  */
                 s = intr_disable();
                 wbinvd();
-                tsc_sync_bp(ci);
+                tsc_test_sync_bp(ci);
                 intr_restore(s);
-#ifdef TSC_DEBUG
-                printf("TSC skew=%lld\n", (long long)ci->ci_tsc_skew);
-#endif
         }
 
         if ((ci->ci_flags & CPUF_IDENTIFIED) == 0) {
@@ -880,7 +876,6 @@ void
 cpu_boot_secondary(struct cpu_info *ci)
 {
         int i;
-        int64_t drift;
         u_long s;
 
         atomic_setbits_int(&ci->ci_flags, CPUF_GO);
@@ -895,18 +890,14 @@ cpu_boot_secondary(struct cpu_info *ci)
                 db_enter();
 #endif
         } else if (cold) {
-                /* Synchronize TSC again, check for drift. */
-                drift = ci->ci_tsc_skew;
+                /*
+                 * Test TSC synchronization again. Something may
+                 * have screwed it up while we were down.
+                 */
                 s = intr_disable();
                 wbinvd();
-                tsc_sync_bp(ci);
+                tsc_test_sync_bp(ci);
                 intr_restore(s);
-                drift -= ci->ci_tsc_skew;
-#ifdef TSC_DEBUG
-                printf("TSC skew=%lld drift=%lld\n",
-                    (long long)ci->ci_tsc_skew, (long long)drift);
-#endif
-                tsc_sync_drift(drift);
         }
 }
 
@@ -932,16 +923,17 @@ cpu_hatch(void *v)
 #endif
 
         /*
-         * Synchronize the TSC for the first time. Note that interrupts are
-         * off at this point.
+         * Test if our TSC is synchronized with the BP's TSC.
+         * Note that interrupts are off at this point. Invalidate
+         * cache to minimize cache effects on the test.
          */
         wbinvd();
         ci->ci_flags |= CPUF_PRESENT;
-        ci->ci_tsc_skew = 0;    /* reset on resume */
-        tsc_sync_ap(ci);
+        tsc_test_sync_ap(ci);
 
         lapic_enable();
         lapic_startclock();
+        cpu_ucode_apply(ci);
         cpu_tsx_disable(ci);
Index: include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
retrieving revision 1.141
diff -u -p -r1.141 cpu.h
--- include/cpu.h    31 Aug 2021 17:40:59 -0000    1.141
+++ include/cpu.h    27 Jan 2022 16:27:49 -0000
@@ -209,8 +209,6 @@ struct cpu_info {
         paddr_t         ci_vmxon_region_pa;
         struct vmxon_region *ci_vmxon_region;
 
-        int64_t         ci_tsc_skew;            /* counter skew vs cpu0 */
-
         char            ci_panicbuf[512];
 
         paddr_t         ci_vmcs_pa;
@@ -230,7 +228,6 @@ struct cpu_info {
 #define CPUF_INVAR_TSC  0x0100          /* CPU has invariant TSC */
 #define CPUF_USERXSTATE 0x0200          /* CPU has curproc's xsave state */
 
-#define CPUF_SYNCTSC    0x0800          /* Synchronize TSC */
 #define CPUF_PRESENT    0x1000          /* CPU is present */
 #define CPUF_RUNNING    0x2000          /* CPU is running */
 #define CPUF_PAUSE      0x4000          /* CPU is paused in DDB */
Index: include/cpuvar.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpuvar.h,v
retrieving revision 1.11
diff -u -p -r1.11 cpuvar.h
--- include/cpuvar.h    16 May 2021 04:33:05 -0000    1.11
+++ include/cpuvar.h    27 Jan 2022 16:27:49 -0000
@@ -97,8 +97,7 @@ void identifycpu(struct cpu_info *);
 void cpu_init(struct cpu_info *);
 void cpu_init_first(void);
 
-void tsc_sync_drift(int64_t);
-void tsc_sync_bp(struct cpu_info *);
-void tsc_sync_ap(struct cpu_info *);
+void tsc_test_sync_bp(struct cpu_info *);
+void tsc_test_sync_ap(struct cpu_info *);
 
 #endif
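
The lag-check loop in tsc_test_sync() above is easy to play with outside
the kernel. Below is a rough userland approximation; it is a sketch
only: the names are made up, pthreads and a fixed iteration count stand
in for the patch's CPU hatching and per-round cycle budget, and
__rdtsc()/_mm_lfence() stand in for rdtsc_lfence(). It also mirrors the
64-byte alignment point from the mail, so each thread's published value
sits on its own cache line. Build with "cc -O2 -pthread".

/* tsc_lag_demo.c: userland sketch of the mutual TSC lag test. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc(), _mm_lfence() */

#define ITERATIONS      10000000L

/*
 * One slot per thread, padded out to a cache line so that
 * publishing one value never invalidates the line holding the
 * other.  volatile forces the compiler to emit every load/store.
 */
struct slot {
        volatile uint64_t tsc;  /* latest published TSC read */
        uint64_t max_lag;       /* worst regression this thread saw */
} __attribute__((aligned(64)));

static struct slot slot[2];

static uint64_t
rdtsc_ordered(void)
{
        _mm_lfence();           /* stand-in for rdtsc_lfence() */
        return __rdtsc();
}

static void *
lag_test(void *arg)
{
        int self = (int)(intptr_t)arg, other = !self;
        uint64_t mine, theirs;
        long i;

        for (i = 0; i < ITERATIONS; i++) {
                theirs = slot[other].tsc;       /* their latest read */
                mine = rdtsc_ordered();
                slot[self].tsc = mine;          /* publish our read */

                /*
                 * Their value was read in our past, so if the TSCs
                 * are synchronized it must be smaller than a value
                 * we read now.  If not, our TSC lags theirs.
                 */
                if (theirs >= mine && theirs - mine > slot[self].max_lag)
                        slot[self].max_lag = theirs - mine;
        }
        return NULL;
}

int
main(void)
{
        pthread_t t[2];

        pthread_create(&t[0], NULL, lag_test, (void *)0);
        pthread_create(&t[1], NULL, lag_test, (void *)1);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        printf("max lag: thread0=%llu thread1=%llu cycles\n",
            (unsigned long long)slot[0].max_lag,
            (unsigned long long)slot[1].max_lag);
        return 0;
}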
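
On the generic thread barrier question: the rendezvous the patch
open-codes with tsc_ingress_barrier/tsc_egress_barrier could plausibly
be factored into something like the sketch below. This is only a guess
at what a generic primitive might look like, written with C11 atomics
for illustration; a kernel version would presumably use
atomic_inc_int(9) and spin with CPU_BUSY_CYCLE() instead.

#include <stdatomic.h>

struct barrier2 {
        atomic_uint count;
};

/* Arrive, then spin until both parties have arrived. */
static void
barrier2_wait(struct barrier2 *b)
{
        atomic_fetch_add(&b->count, 1);
        while (atomic_load(&b->count) != 2)
                ;       /* spin; kernel code would CPU_BUSY_CYCLE() */
}

/* Exactly one party resets the barrier before it is reused. */
static void
barrier2_reset(struct barrier2 *b)
{
        atomic_store(&b->count, 0);
}

Note why the patch wants a pair of these per round: a barrier cannot be
reset while the other party may still be spinning on it. In the code
above, the ingress barrier is reset by the BP while the AP is still held
at the egress barrier, and the egress barrier is reset by the AP before
the next round's ingress rendezvous completes, so each reset happens
while the other barrier is doing the holding.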