Re: lfence for rdtsc
Mark Kettenis wrote:

> But maybe the default rdtsc() should include the lfence.  And then we
> could have rdtsc_unordered() for those cases that don't care about
> ordering.

Right.  But I don't like the word 'order', because it is too vague.
There are layers of ordering, speculation, asynchronous execution,
etc., and lfence deals with only some of them.

> As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
> And maybe one of the network stack people should investigate what the
> impact of having the fence in the timecounter is?

cpu_rnd_messybits() is indifferent.  It is amusing to capture some
speculation side effect, but if we don't, it isn't the end of the
world.  A few days ago, we didn't even know this aspect existed...
Re: lfence for rdtsc
> Date: Sun, 21 Jun 2020 16:55:56 +0100
> From: Stuart Henderson
>
> On 2020/06/21 18:46, Paul Irofti wrote:
> > On 21 June 2020, 16:30:43 EEST, Theo de Raadt wrote:
> > > Paul Irofti wrote:
> > >
> > > > If you change the name to rdtsc_ordered(), OK.
> > >
> > > That is a weaker name.
> > >
> > > Ordered in what way, at what level; ordered against what?
> > >
> > > This is using a specific pipeline ordering known as lfence.
> > > So it might as well say lfence.  That is the technical name for
> > > that type of ordering.  Being vague is unhelpful.
> >
> > Ok then, if you think that's best.
>
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

Yes; LFENCE only exists on machines with SSE2.  So if you want
something that works on older (32-bit) CPUs you need to use a
different instruction.  CPUID is the canonical choice for that.
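The CPUID fallback described above can be sketched in userland C.  This is
illustrative only: `rdtsc_cpuid` is a hypothetical name, not something in
any tree discussed in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a CPUID-serialized TSC read for CPUs that
 * predate SSE2 (and therefore lack LFENCE).  CPUID is a fully
 * serializing instruction: RDTSC cannot begin executing until every
 * preceding instruction has retired.
 */
static inline uint64_t
rdtsc_cpuid(void)
{
	uint32_t hi, lo;

	__asm volatile(
	    "cpuid\n\t"		/* serialize; clobbers eax, ebx, ecx, edx */
	    "rdtsc"		/* edx:eax = TSC */
	    : "=d" (hi), "=a" (lo)
	    : "a" (0)		/* cpuid leaf 0 */
	    : "rbx", "rcx");
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}
```

The cost of this approach is that CPUID destroys four general-purpose
registers and is far slower than LFENCE, which is why LFENCE wins where
it is available.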
Re: lfence for rdtsc
On Sun, Jun 21, 2020 at 04:55:56PM +0100, Stuart Henderson wrote:
> On 2020/06/21 18:46, Paul Irofti wrote:
> > On 21 June 2020, 16:30:43 EEST, Theo de Raadt wrote:
> > > Paul Irofti wrote:
> > >
> > > > If you change the name to rdtsc_ordered(), OK.
> > >
> > > That is a weaker name.
> > >
> > > Ordered in what way, at what level; ordered against what?
> > >
> > > This is using a specific pipeline ordering known as lfence.
> > > So it might as well say lfence.  That is the technical name for
> > > that type of ordering.  Being vague is unhelpful.
> >
> > Ok then, if you think that's best.
>
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

If I remember correctly, it is because CPUID is also a serializing
instruction, but nowadays it is more expensive than LFENCE.
Re: lfence for rdtsc
> From: "Theo de Raadt"
> Date: Sun, 21 Jun 2020 07:30:43 -0600
>
> Paul Irofti wrote:
>
> > If you change the name to rdtsc_ordered(), OK.
>
> That is a weaker name.
>
> Ordered in what way, at what level; ordered against what?
>
> This is using a specific pipeline ordering known as lfence.
> So it might as well say lfence.  That is the technical name for
> that type of ordering.  Being vague is unhelpful.

But maybe the default rdtsc() should include the lfence.  And then we
could have rdtsc_unordered() for those cases that don't care about
ordering.

As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
And maybe one of the network stack people should investigate what the
impact of having the fence in the timecounter is?
Re: lfence for rdtsc
On 2020/06/21 18:46, Paul Irofti wrote:
> On 21 June 2020, 16:30:43 EEST, Theo de Raadt wrote:
> > Paul Irofti wrote:
> >
> > > If you change the name to rdtsc_ordered(), OK.
> >
> > That is a weaker name.
> >
> > Ordered in what way, at what level; ordered against what?
> >
> > This is using a specific pipeline ordering known as lfence.
> > So it might as well say lfence.  That is the technical name for
> > that type of ordering.  Being vague is unhelpful.
>
> Ok then, if you think that's best.

Any idea why in
https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
they are using cpuid to serialize access instead of lfence?
Re: lfence for rdtsc
On 21 June 2020, 16:30:43 EEST, Theo de Raadt wrote:
> Paul Irofti wrote:
>
> > If you change the name to rdtsc_ordered(), OK.
>
> That is a weaker name.
>
> Ordered in what way, at what level; ordered against what?
>
> This is using a specific pipeline ordering known as lfence.
> So it might as well say lfence.  That is the technical name for
> that type of ordering.  Being vague is unhelpful.

Ok then, if you think that's best.
Re: lfence for rdtsc
Paul Irofti wrote:

> If you change the name to rdtsc_ordered(), OK.

That is a weaker name.

Ordered in what way, at what level; ordered against what?

This is using a specific pipeline ordering known as lfence.  So it
might as well say lfence.  That is the technical name for that type of
ordering.  Being vague is unhelpful.
Re: lfence for rdtsc
it definitely got better:

cpu0: TSC skew=0 observed drift=0
cpu0: TSC skew=0 observed drift=0
cpu1: TSC skew=51 observed drift=0
cpu2: TSC skew=68 observed drift=0
cpu3: TSC skew=68 observed drift=0
cpu4: TSC skew=0 observed drift=0
cpu5: TSC skew=0 observed drift=0
cpu6: TSC skew=85 observed drift=0
cpu7: TSC skew=51 observed drift=0
cpu8: TSC skew=17 observed drift=0
cpu9: TSC skew=34 observed drift=0
cpu10: TSC skew=0 observed drift=0
cpu11: TSC skew=17 observed drift=0
cpu12: TSC skew=0 observed drift=0
cpu13: TSC skew=-51 observed drift=0
cpu14: TSC skew=-17 observed drift=0
cpu15: TSC skew=-17 observed drift=0
Re: lfence for rdtsc
On Sat, Jun 20, 2020 at 10:02:19PM +0200, Mark Kettenis wrote:
> RDTSC is not a serializing instruction; to make sure we get the TSC
> value corresponding to the position of RDTSC in the instruction stream
> we need a barrier.  Linux uses LFENCE on machines where it is
> available.  FreeBSD seems to prefer MFENCE for AMD CPUs but uses
> LFENCE for Intel CPUs.  For now my thinking is that what's good
> enough for Linux should be good enough for us.  And on amd64 LFENCE
> is always available.
>
> This diff reduces the scatter in the skew values.  Before I had
> occasional outliers of more than 200 cycles.  Now the maximum values
> I see are around 60 cycles.
>
> I didn't change the rdtsc() call that reads the timecounter.  But
> maybe that one should change as well?  Bit of a tradeoff between
> performance and accuracy I think.
>
> This also changes the skew print message (stolen from what Theo put
> in snaps).  Printing the CPU number makes it easier to get statistics
> for a specific CPU.  The diff also enables the debug message.  Maybe
> it should be committed this way and then disabled again later such
> that we can get some statistics?
>
> comments? ok?

If you change the name to rdtsc_ordered(), OK.

By the way, if you want to continue in this direction you can look
into adding support for the TSC_ADJUST MSR to synchronize the TSC
across CPUs, as described in Section 17.17.3 of the Intel manual.
> Index: arch/amd64/amd64/tsc.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 tsc.c
> --- arch/amd64/amd64/tsc.c	6 Apr 2020 00:01:08 -0000	1.16
> +++ arch/amd64/amd64/tsc.c	20 Jun 2020 20:01:46 -0000
> @@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
>  	int i;
>  
>  	for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
> -		tsc1 = rdtsc();
> +		tsc1 = rdtsc_lfence();
>  		n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
> -		tsc2 = rdtsc();
> +		tsc2 = rdtsc_lfence();
>  
>  		if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
>  			*count = n;
> @@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
>  void
>  tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
>  {
> +#define TSC_DEBUG
>  #ifdef TSC_DEBUG
> -	printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
> +	printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
>  	    (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
>  #endif
>  
> @@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
>  
>  	/* Flag it and read our TSC. */
>  	atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> -	bptsc = (rdtsc() >> 1);
> +	bptsc = (rdtsc_lfence() >> 1);
>  
>  	/* Wait for remote to complete, and read ours again. */
>  	while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
>  		membar_consumer();
> -	bptsc += (rdtsc() >> 1);
> +	bptsc += (rdtsc_lfence() >> 1);
>  
>  	/* Wait for the results to come in. */
>  	while (tsc_sync_cpu == ci)
> @@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
>  	/* Wait for go-ahead from primary. */
>  	while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
>  		membar_consumer();
> -	tsc = (rdtsc() >> 1);
> +	tsc = (rdtsc_lfence() >> 1);
>  
>  	/* Instruct primary to read its counter. */
>  	atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> -	tsc += (rdtsc() >> 1);
> +	tsc += (rdtsc_lfence() >> 1);
>  
>  	/* Post result.  Ensure the whole value goes out atomically. */
>  	(void)atomic_swap_64(&tsc_sync_val, tsc);
> Index: arch/amd64/include/cpufunc.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
> retrieving revision 1.34
> diff -u -p -r1.34 cpufunc.h
> --- arch/amd64/include/cpufunc.h	28 Jun 2019 21:54:05 -0000	1.34
> +++ arch/amd64/include/cpufunc.h	20 Jun 2020 20:01:46 -0000
> @@ -292,6 +292,15 @@ rdtsc(void)
>  }
>  
>  static __inline u_int64_t
> +rdtsc_lfence(void)
> +{
> +	uint32_t hi, lo;
> +
> +	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> +	return (((uint64_t)hi << 32) | (uint64_t)lo);
> +}
> +
> +static __inline u_int64_t
>  rdpmc(u_int pmc)
>  {
>  	uint32_t hi, lo;
lfence for rdtsc
RDTSC is not a serializing instruction; to make sure we get the TSC
value corresponding to the position of RDTSC in the instruction stream
we need a barrier.  Linux uses LFENCE on machines where it is
available.  FreeBSD seems to prefer MFENCE for AMD CPUs but uses
LFENCE for Intel CPUs.  For now my thinking is that what's good enough
for Linux should be good enough for us.  And on amd64 LFENCE is always
available.

This diff reduces the scatter in the skew values.  Before I had
occasional outliers of more than 200 cycles.  Now the maximum values I
see are around 60 cycles.

I didn't change the rdtsc() call that reads the timecounter.  But
maybe that one should change as well?  Bit of a tradeoff between
performance and accuracy I think.

This also changes the skew print message (stolen from what Theo put in
snaps).  Printing the CPU number makes it easier to get statistics for
a specific CPU.  The diff also enables the debug message.  Maybe it
should be committed this way and then disabled again later such that
we can get some statistics?

comments? ok?
Index: arch/amd64/amd64/tsc.c
===================================================================
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.16
diff -u -p -r1.16 tsc.c
--- arch/amd64/amd64/tsc.c	6 Apr 2020 00:01:08 -0000	1.16
+++ arch/amd64/amd64/tsc.c	20 Jun 2020 20:01:46 -0000
@@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
 	int i;
 
 	for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
-		tsc1 = rdtsc();
+		tsc1 = rdtsc_lfence();
 		n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
-		tsc2 = rdtsc();
+		tsc2 = rdtsc_lfence();
 
 		if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
 			*count = n;
@@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
 void
 tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
 {
+#define TSC_DEBUG
 #ifdef TSC_DEBUG
-	printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
+	printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
 	    (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
 #endif
 
@@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
 
 	/* Flag it and read our TSC. */
 	atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-	bptsc = (rdtsc() >> 1);
+	bptsc = (rdtsc_lfence() >> 1);
 
 	/* Wait for remote to complete, and read ours again. */
 	while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
 		membar_consumer();
-	bptsc += (rdtsc() >> 1);
+	bptsc += (rdtsc_lfence() >> 1);
 
 	/* Wait for the results to come in. */
 	while (tsc_sync_cpu == ci)
@@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
 	/* Wait for go-ahead from primary. */
 	while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
 		membar_consumer();
-	tsc = (rdtsc() >> 1);
+	tsc = (rdtsc_lfence() >> 1);
 
 	/* Instruct primary to read its counter. */
 	atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-	tsc += (rdtsc() >> 1);
+	tsc += (rdtsc_lfence() >> 1);
 
 	/* Post result.  Ensure the whole value goes out atomically. */
 	(void)atomic_swap_64(&tsc_sync_val, tsc);
Index: arch/amd64/include/cpufunc.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
retrieving revision 1.34
diff -u -p -r1.34 cpufunc.h
--- arch/amd64/include/cpufunc.h	28 Jun 2019 21:54:05 -0000	1.34
+++ arch/amd64/include/cpufunc.h	20 Jun 2020 20:01:46 -0000
@@ -292,6 +292,15 @@ rdtsc(void)
 }
 
 static __inline u_int64_t
+rdtsc_lfence(void)
+{
+	uint32_t hi, lo;
+
+	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
+	return (((uint64_t)hi << 32) | (uint64_t)lo);
+}
+
+static __inline u_int64_t
 rdpmc(u_int pmc)
 {
 	uint32_t hi, lo;