Re: lfence for rdtsc
Mark Kettenis wrote:

> But maybe the default rdtsc() should include the lfence.  And then we
> could have rdtsc_unordered() for these cases that don't care about
> ordering.

Right.  But I don't like the word 'order', because it is too vague.
There are layers of ordering, speculation, asynchronous execution, etc.,
and lfence just deals with some of them.

> As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
> And maybe one of the network stack people should investigate what the
> impact of having the fence in the timecounter is?

cpu_rnd_messybits is indifferent.  It is amusing to capture some
speculation side effect, but if we don't, it isn't the end of the world.
A few days ago, we didn't even know this aspect existed...
Re: lfence for rdtsc
> Date: Sun, 21 Jun 2020 16:55:56 +0100
> From: Stuart Henderson
>
> On 2020/06/21 18:46, Paul Irofti wrote:
> > On 21 June 2020 16:30:43 EEST, Theo de Raadt wrote:
> > > Paul Irofti wrote:
> > >
> > > > If you change the name to rdtsc_ordered(), OK.
> > >
> > > That is a weaker name.
> > >
> > > Ordered in what way, at what level; ordered against what?
> > >
> > > This is using a specific pipeline ordering known as lfence.
> > > So it might as well say lfence.  That is the technical name for
> > > that type of ordering.  Being vague is unhelpful.
> >
> > Ok then, if you think that's best.
>
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

Yes; LFENCE only exists on machines with SSE2.  So if you want something
that works on older (32-bit) CPUs you need to use a different
instruction.  CPUID is the canonical choice for that.
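For illustration (not part of the thread), the two approaches can be put side by side.  This is a userland sketch assuming GCC/Clang extended asm on x86-64; the function names are made up, and only rdtsc_lfence matches the helper added by the diff later in the thread:

```c
#include <stdint.h>

/*
 * Fenced TSC read for SSE2-capable CPUs: LFENCE waits for all prior
 * instructions to complete locally before RDTSC can execute.
 */
static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Pre-SSE2 fallback: CPUID is a fully serializing instruction on every
 * IA-32 CPU, so it also keeps RDTSC from executing early.  CPUID
 * overwrites eax/ebx/ecx/edx, hence the extra clobbers; eax is zeroed
 * first to select a cheap, always-valid leaf.
 */
static inline uint64_t
rdtsc_cpuid(void)
{
	uint32_t hi, lo;

	__asm volatile("xorl %%eax, %%eax; cpuid; rdtsc"
	    : "=d" (hi), "=a" (lo) : : "ebx", "ecx");
	return ((uint64_t)hi << 32) | lo;
}
```

Either helper returns the 64-bit counter assembled from the EDX:EAX halves that RDTSC produces.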
Re: lfence for rdtsc
On Sun, Jun 21, 2020 at 04:55:56PM +0100, Stuart Henderson wrote:
> On 2020/06/21 18:46, Paul Irofti wrote:
> > On 21 June 2020 16:30:43 EEST, Theo de Raadt wrote:
> > > Paul Irofti wrote:
> > >
> > > > If you change the name to rdtsc_ordered(), OK.
> > >
> > > That is a weaker name.
> > >
> > > Ordered in what way, at what level; ordered against what?
> > >
> > > This is using a specific pipeline ordering known as lfence.
> > > So it might as well say lfence.  That is the technical name for
> > > that type of ordering.  Being vague is unhelpful.
> >
> > Ok then, if you think that's best.
>
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

If I remember correctly, it is because CPUID is also a serializing
instruction, but nowadays it is more expensive than lfence.
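The relative cost is easy to check yourself.  The sketch below is not from the thread: the loop count is arbitrary, the timing is rough (it times the loop with plain RDTSC), and absolute numbers vary by microarchitecture; it only shows the shape of such a measurement.

```c
#include <stdint.h>

#define ITERS 100000	/* arbitrary; large enough to average out noise */

/* Unfenced read, used only to timestamp the measurement loops. */
static inline uint64_t
rdtsc_plain(void)
{
	uint32_t hi, lo;

	__asm volatile("rdtsc" : "=d" (hi), "=a" (lo));
	return ((uint64_t)hi << 32) | lo;
}

/* Average cycles per lfence;rdtsc pair over ITERS back-to-back runs. */
static uint64_t
avg_cycles_lfence(void)
{
	uint64_t start, end;
	int i;

	start = rdtsc_plain();
	for (i = 0; i < ITERS; i++)
		__asm volatile("lfence; rdtsc" : : : "eax", "edx");
	end = rdtsc_plain();
	return (end - start) / ITERS;
}

/* Same measurement for cpuid;rdtsc; CPUID clobbers all four GPRs. */
static uint64_t
avg_cycles_cpuid(void)
{
	uint64_t start, end;
	int i;

	start = rdtsc_plain();
	for (i = 0; i < ITERS; i++)
		__asm volatile("xorl %%eax, %%eax; cpuid; rdtsc"
		    : : : "eax", "ebx", "ecx", "edx");
	end = rdtsc_plain();
	return (end - start) / ITERS;
}
```

On recent Intel hardware one would expect the CPUID variant to average noticeably more cycles than the LFENCE one, which is the point being made above.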
Re: lfence for rdtsc
> From: "Theo de Raadt"
> Date: Sun, 21 Jun 2020 07:30:43 -0600
>
> Paul Irofti wrote:
>
> > If you change the name to rdtsc_ordered(), OK.
>
> That is a weaker name.
>
> Ordered in what way, at what level; ordered against what?
>
> This is using a specific pipeline ordering known as lfence.
> So it might as well say lfence.  That is the technical name for
> that type of ordering.  Being vague is unhelpful.

But maybe the default rdtsc() should include the lfence.  And then we
could have rdtsc_unordered() for these cases that don't care about
ordering.

As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
And maybe one of the network stack people should investigate what the
impact of having the fence in the timecounter is?
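The proposed pairing might look like this.  A sketch only: the names rdtsc()/rdtsc_unordered() come from the mail above, while the bodies assume the lfence variant discussed elsewhere in the thread.

```c
#include <stdint.h>

/*
 * Proposed default: the fence is built in, so naive callers get a TSC
 * value that cannot be sampled ahead of earlier instructions.
 */
static inline uint64_t
rdtsc(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
	return ((uint64_t)hi << 32) | lo;
}

/*
 * Explicit opt-out for callers that don't care about ordering, e.g. a
 * noise source like cpu_rnd_messybits() where a speculatively hoisted
 * read is harmless.
 */
static inline uint64_t
rdtsc_unordered(void)
{
	uint32_t hi, lo;

	__asm volatile("rdtsc" : "=d" (hi), "=a" (lo));
	return ((uint64_t)hi << 32) | lo;
}
```

The design trade-off is that the safe behavior becomes the default and the cheaper, unordered read has to be asked for by name.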
Re: lfence for rdtsc
On 2020/06/21 18:46, Paul Irofti wrote:
> On 21 June 2020 16:30:43 EEST, Theo de Raadt wrote:
> > Paul Irofti wrote:
> >
> > > If you change the name to rdtsc_ordered(), OK.
> >
> > That is a weaker name.
> >
> > Ordered in what way, at what level; ordered against what?
> >
> > This is using a specific pipeline ordering known as lfence.
> > So it might as well say lfence.  That is the technical name for
> > that type of ordering.  Being vague is unhelpful.
>
> Ok then, if you think that's best.

Any idea why in
https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
they are using cpuid to serialize access instead of lfence?
Re: lfence for rdtsc
On 21 June 2020 16:30:43 EEST, Theo de Raadt wrote:
> Paul Irofti wrote:
>
> > If you change the name to rdtsc_ordered(), OK.
>
> That is a weaker name.
>
> Ordered in what way, at what level; ordered against what?
>
> This is using a specific pipeline ordering known as lfence.
> So it might as well say lfence.  That is the technical name for
> that type of ordering.  Being vague is unhelpful.

Ok then, if you think that's best.
Re: lfence for rdtsc
Paul Irofti wrote:

> If you change the name to rdtsc_ordered(), OK.

That is a weaker name.

Ordered in what way, at what level; ordered against what?

This is using a specific pipeline ordering known as lfence.
So it might as well say lfence.  That is the technical name for
that type of ordering.  Being vague is unhelpful.
Re: lfence for rdtsc
it definitely got better:

cpu0: TSC skew=0 observed drift=0
cpu0: TSC skew=0 observed drift=0
cpu1: TSC skew=51 observed drift=0
cpu2: TSC skew=68 observed drift=0
cpu3: TSC skew=68 observed drift=0
cpu4: TSC skew=0 observed drift=0
cpu5: TSC skew=0 observed drift=0
cpu6: TSC skew=85 observed drift=0
cpu7: TSC skew=51 observed drift=0
cpu8: TSC skew=17 observed drift=0
cpu9: TSC skew=34 observed drift=0
cpu10: TSC skew=0 observed drift=0
cpu11: TSC skew=17 observed drift=0
cpu12: TSC skew=0 observed drift=0
cpu13: TSC skew=-51 observed drift=0
cpu14: TSC skew=-17 observed drift=0
cpu15: TSC skew=-17 observed drift=0
Re: lfence for rdtsc
On Sat, Jun 20, 2020 at 10:02:19PM +0200, Mark Kettenis wrote:
> RDTSC is not a serializing instruction; to make sure we get the TSC
> value corresponding to the position of RDTSC in the instruction stream
> we need a barrier.  Linux uses LFENCE on machines where it is
> available.  FreeBSD seems to prefer MFENCE for AMD CPUs but uses
> LFENCE for Intel CPUs.  For now my thinking is that what's good enough
> for Linux should be good enough for us.  And on amd64 LFENCE is always
> available.
>
> This diff reduces the scatter in the skew values.  Before, I had
> occasional outliers of more than 200 cycles.  Now the maximum values I
> see are around 60 cycles.
>
> I didn't change the rdtsc() call that reads the timecounter.  But
> maybe that one should change as well?  Bit of a tradeoff between
> performance and accuracy, I think.
>
> This also changes the skew print message (stolen from what Theo put in
> snaps).  Printing the CPU number makes it easier to get statistics for
> a specific CPU.  The diff also enables the debug message.  Maybe it
> should be committed this way and then disabled again later such that
> we can get some statistics?
>
> comments? ok?

If you change the name to rdtsc_ordered(), OK.

By the way, if you want to continue in this direction you can look into
adding support for the TSC_ADJUST MSR to synchronize TSC across CPUs,
as described in Section 17.17.3 of the Intel manual.
> Index: arch/amd64/amd64/tsc.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 tsc.c
> --- arch/amd64/amd64/tsc.c	6 Apr 2020 00:01:08 -0000	1.16
> +++ arch/amd64/amd64/tsc.c	20 Jun 2020 20:01:46 -0000
> @@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
>  	int i;
>  
>  	for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
> -		tsc1 = rdtsc();
> +		tsc1 = rdtsc_lfence();
>  		n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
> -		tsc2 = rdtsc();
> +		tsc2 = rdtsc_lfence();
>  
>  		if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
>  			*count = n;
> @@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
>  
>  void
>  tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
>  {
> +#define TSC_DEBUG
>  #ifdef TSC_DEBUG
> -	printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
> +	printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
>  	    (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
>  #endif
>  
> @@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
>  
>  	/* Flag it and read our TSC. */
>  	atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> -	bptsc = (rdtsc() >> 1);
> +	bptsc = (rdtsc_lfence() >> 1);
>  
>  	/* Wait for remote to complete, and read ours again. */
>  	while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
>  		membar_consumer();
> -	bptsc += (rdtsc() >> 1);
> +	bptsc += (rdtsc_lfence() >> 1);
>  
>  	/* Wait for the results to come in. */
>  	while (tsc_sync_cpu == ci)
> @@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
>  	/* Wait for go-ahead from primary. */
>  	while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
>  		membar_consumer();
> -	tsc = (rdtsc() >> 1);
> +	tsc = (rdtsc_lfence() >> 1);
>  
>  	/* Instruct primary to read its counter. */
>  	atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> -	tsc += (rdtsc() >> 1);
> +	tsc += (rdtsc_lfence() >> 1);
>  
>  	/* Post result.  Ensure the whole value goes out atomically. */
>  	(void)atomic_swap_64(&tsc_sync_val, tsc);
> Index: arch/amd64/include/cpufunc.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
> retrieving revision 1.34
> diff -u -p -r1.34 cpufunc.h
> --- arch/amd64/include/cpufunc.h	28 Jun 2019 21:54:05 -0000	1.34
> +++ arch/amd64/include/cpufunc.h	20 Jun 2020 20:01:46 -0000
> @@ -292,6 +292,15 @@ rdtsc(void)
>  }
>  
>  static __inline u_int64_t
> +rdtsc_lfence(void)
> +{
> +	uint32_t hi, lo;
> +
> +	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> +	return (((uint64_t)hi << 32) | (uint64_t)lo);
> +}
> +
> +static __inline u_int64_t
>  rdpmc(u_int pmc)
>  {
>  	uint32_t hi, lo;
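The TSC_ADJUST suggestion in the reply above could be sketched as follows.  This is a hypothetical, kernel-only sketch: RDMSR/WRMSR fault outside ring 0, the MSR number 0x3b is IA32_TSC_ADJUST per the Intel SDM, and tsc_cancel_skew() and the accessors are illustrative names, not OpenBSD code (the kernel has its own MSR helpers).

```c
#include <stdint.h>

#define MSR_TSC_ADJUST	0x3b	/* IA32_TSC_ADJUST, Intel SDM vol. 4 */

/* Illustrative MSR accessors; privileged, ring-0 only. */
static inline uint64_t
rdmsr(uint32_t msr)
{
	uint32_t hi, lo;

	__asm volatile("rdmsr" : "=d" (hi), "=a" (lo) : "c" (msr));
	return ((uint64_t)hi << 32) | lo;
}

static inline void
wrmsr(uint32_t msr, uint64_t v)
{
	__asm volatile("wrmsr" : : "c" (msr),
	    "a" ((uint32_t)v), "d" ((uint32_t)(v >> 32)));
}

/*
 * Writing IA32_TSC_ADJUST shifts this CPU's TSC by the delta between
 * the new and old adjust values without stopping the counter: a
 * measured skew can be cancelled in hardware instead of being carried
 * around as a per-CPU software offset.  The new adjust value is just
 * the old one minus the skew.
 */
static inline uint64_t
tsc_adjust_delta(uint64_t cur_adj, int64_t skew)
{
	return cur_adj - (uint64_t)skew;
}

static void
tsc_cancel_skew(int64_t skew)
{
	wrmsr(MSR_TSC_ADJUST,
	    tsc_adjust_delta(rdmsr(MSR_TSC_ADJUST), skew));
}
```

A real implementation would also have to check CPUID leaf 7 for TSC_ADJUST support before touching the MSR.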