Re: lfence for rdtsc

2020-06-21 Thread Theo de Raadt
Mark Kettenis  wrote:

> But maybe the default rdtsc() should include the lfence.  And then we
> could have rdtsc_unordered() for those cases that don't care about
> ordering.

Right.

But I don't like the word 'order', because it is too vague.  There
are layers of ordering, speculation, asynchronous execution, etc.,
and lfence just deals with some of them.

> As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
> And maybe one of the network stack people should investigate what the
> impact of having the fence in the timecounter is?

cpu_rnd_messybits is indifferent.  It is amusing to capture some
speculation side effect, but if we don't, it isn't the end of the world.
A few days ago, we didn't even know this aspect existed...



Re: lfence for rdtsc

2020-06-21 Thread Mark Kettenis
> Date: Sun, 21 Jun 2020 16:55:56 +0100
> From: Stuart Henderson 
> 
> On 2020/06/21 18:46, Paul Irofti wrote:
> > 
> > 
> > On 21 June 2020 16:30:43 EEST, Theo de Raadt  wrote:
> > >Paul Irofti  wrote:
> > >
> > >> If you change the name to rdtsc_ordered(), OK.
> > >
> > >That is a weaker name.
> > >
> > >Ordered in what way, at what level; ordered against what?
> > >
> > >This is using a specific pipeline ordering known as lfence.
> > >So it might as well say lfence.  That is the technical name for
> > >that type of ordering.  Being vague is unhelpful.
> > 
> > 
> > Ok then, if you think that's best.
> > 
> 
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

Yes; LFENCE only exists on machines with SSE2.  So if you want
something that works on older (32-bit) CPUs you need to use a
different instruction.  CPUID is the canonical choice for that.
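
For comparison, here is a rough sketch (not part of the diff; the
rdtsc_cpuid() name is invented for illustration) of what a
CPUID-serialized read could look like on CPUs without SSE2:

static __inline u_int64_t
rdtsc_cpuid(void)
{
	uint32_t hi, eax = 0;

	/*
	 * CPUID (leaf 0) serializes the instruction stream, so the
	 * pipeline is drained before RDTSC reads the counter.  CPUID
	 * clobbers EBX/ECX; RDTSC then overwrites EAX and EDX.
	 */
	__asm volatile("cpuid; rdtsc"
	    : "+a" (eax), "=d" (hi)
	    :
	    : "ebx", "ecx");
	return (((uint64_t)hi << 32) | (uint64_t)eax);
}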



Re: lfence for rdtsc

2020-06-21 Thread Paul Irofti
On Sun, Jun 21, 2020 at 04:55:56PM +0100, Stuart Henderson wrote:
> On 2020/06/21 18:46, Paul Irofti wrote:
> > 
> > 
> > On 21 June 2020 16:30:43 EEST, Theo de Raadt  wrote:
> > >Paul Irofti  wrote:
> > >
> > >> If you change the name to rdtsc_ordered(), OK.
> > >
> > >That is a weaker name.
> > >
> > >Ordered in what way, at what level; ordered against what?
> > >
> > >This is using a specific pipeline ordering known as lfence.
> > >So it might as well say lfence.  That is the technical name for
> > >that type of ordering.  Being vague is unhelpful.
> > 
> > 
> > Ok then, if you think that's best.
> > 
> 
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

If I remember correctly, it is because CPUID is also a serializing
instruction, but nowadays it is more expensive than lfence.
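
A quick userland sketch (not from the thread; only meant to put rough
numbers on that claim) comparing the per-call cost of the two
sequences:

/* Compile with something like: cc -O2 tsc_cost.c */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
	return (((uint64_t)hi << 32) | lo);
}

static inline uint64_t
rdtsc_cpuid(void)
{
	uint32_t hi, eax = 0;

	__asm volatile("cpuid; rdtsc"
	    : "+a" (eax), "=d" (hi) : : "ebx", "ecx");
	return (((uint64_t)hi << 32) | eax);
}

int
main(void)
{
	uint64_t start, end;
	int i, n = 1000000;

	/* Average cost of lfence; rdtsc, including loop overhead. */
	start = rdtsc_lfence();
	for (i = 0; i < n; i++)
		(void)rdtsc_lfence();
	end = rdtsc_lfence();
	printf("lfence;rdtsc: ~%llu cycles/call\n",
	    (unsigned long long)((end - start) / n));

	/* Average cost of cpuid; rdtsc, including loop overhead. */
	start = rdtsc_lfence();
	for (i = 0; i < n; i++)
		(void)rdtsc_cpuid();
	end = rdtsc_lfence();
	printf("cpuid;rdtsc:  ~%llu cycles/call\n",
	    (unsigned long long)((end - start) / n));
	return 0;
}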



Re: lfence for rdtsc

2020-06-21 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Sun, 21 Jun 2020 07:30:43 -0600
> 
> Paul Irofti  wrote:
> 
> > If you change the name to rdtsc_ordered(), OK.
> 
> That is a weaker name.
> 
> Ordered in what way, at what level; ordered against what?
> 
> This is using a specific pipeline ordering known as lfence.
> So it might as well say lfence.  That is the technical name for
> that type of ordering.  Being vague is unhelpful.

But maybe the default rdtsc() should include the lfence.  And then we
could have rdtsc_unordered() for those cases that don't care about
ordering.
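
A minimal sketch of what that naming split could look like in
cpufunc.h (just illustrating the suggestion, not a committed diff):

/* Default read: fenced, so the value matches program order. */
static __inline u_int64_t
rdtsc(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}

/* Cheaper read for callers that do not care about ordering. */
static __inline u_int64_t
rdtsc_unordered(void)
{
	uint32_t hi, lo;

	__asm volatile("rdtsc" : "=d" (hi), "=a" (lo));
	return (((uint64_t)hi << 32) | (uint64_t)lo);
}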

As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
And maybe one of the network stack people should investigate what the
impact of having the fence in the timecounter is?



Re: lfence for rdtsc

2020-06-21 Thread Stuart Henderson
On 2020/06/21 18:46, Paul Irofti wrote:
> 
> 
> On 21 June 2020 16:30:43 EEST, Theo de Raadt  wrote:
> >Paul Irofti  wrote:
> >
> >> If you change the name to rdtsc_ordered(), OK.
> >
> >That is a weaker name.
> >
> >Ordered in what way, at what level; ordered against what?
> >
> >This is using a specific pipeline ordering known as lfence.
> >So it might as well say lfence.  That is the technical name for
> >that type of ordering.  Being vague is unhelpful.
> 
> 
> Ok then, if you think that's best.
> 

Any idea why in
https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
they are using cpuid to serialize access instead of lfence?



Re: lfence for rdtsc

2020-06-21 Thread Paul Irofti



On 21 June 2020 16:30:43 EEST, Theo de Raadt  wrote:
>Paul Irofti  wrote:
>
>> If you change the name to rdtsc_ordered(), OK.
>
>That is a weaker name.
>
>Ordered in what way, at what level; ordered against what?
>
>This is using a specific pipeline ordering known as lfence.
>So it might as well say lfence.  That is the technical name for
>that type of ordering.  Being vague is unhelpful.


Ok then, if you think that's best.



Re: lfence for rdtsc

2020-06-21 Thread Theo de Raadt
Paul Irofti  wrote:

> If you change the name to rdtsc_ordered(), OK.

That is a weaker name.

Ordered in what way, at what level; ordered against what?

This is using a specific pipeline ordering known as lfence.
So it might as well say lfence.  That is the technical name for
that type of ordering.  Being vague is unhelpful.



Re: lfence for rdtsc

2020-06-21 Thread Robert Nagy
it definitely got better:

cpu0: TSC skew=0 observed drift=0
cpu0: TSC skew=0 observed drift=0
cpu1: TSC skew=51 observed drift=0
cpu2: TSC skew=68 observed drift=0
cpu3: TSC skew=68 observed drift=0
cpu4: TSC skew=0 observed drift=0
cpu5: TSC skew=0 observed drift=0
cpu6: TSC skew=85 observed drift=0
cpu7: TSC skew=51 observed drift=0
cpu8: TSC skew=17 observed drift=0
cpu9: TSC skew=34 observed drift=0
cpu10: TSC skew=0 observed drift=0
cpu11: TSC skew=17 observed drift=0
cpu12: TSC skew=0 observed drift=0
cpu13: TSC skew=-51 observed drift=0
cpu14: TSC skew=-17 observed drift=0
cpu15: TSC skew=-17 observed drift=0



Re: lfence for rdtsc

2020-06-21 Thread Paul Irofti
On Sat, Jun 20, 2020 at 10:02:19PM +0200, Mark Kettenis wrote:
> RDTSC is not a serializing instruction; to make sure we get the TSC
> value corresponding to the position of RDTSC in the instruction stream
> we need a barrier.  Linux uses LFENCE on machines where it is
> available.  FreeBSD seems to prefer MFENCE for AMD CPUs but uses
> LFENCE for Intel CPUs.  For now my thinking is that what's good enough
> for Linux should be good enough for us.  And on amd64 LFENCE is always
> available.
> 
> This diff reduces the scatter in the skew values.  Before I had
> occasional outliers of more than 200 cycles.  Now the maximum values I see
> are around 60 cycles.
> 
> I didn't change the rdtsc() call that reads the timecounter.  But
> maybe that one should change as well?  Bit of a tradeoff between
> performance and accuracy, I think.
> 
> This also changes the skew print message (stolen from what Theo put in
> snaps).  Printing the CPU number makes it easier to get statistics for
> a specific CPU.  Diff also enables the debug message.  Maybe it should
> be committed this way and then disabled again later such that we can
> get some statistics?
> 
> comments?  ok?

If you change the name to rdtsc_ordered(), OK.

By the way, if you want to continue in this direction you can look into
adding support for the TSC_ADJUST MSR to synchronize TSC across CPUs
as described in Section 17.17.3 of the Intel manual.
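
A very rough sketch of that direction (hypothetical code; it assumes an
MSR_TSC_ADJUST define for IA32_TSC_ADJUST, MSR 0x3b, and that the
TSC_ADJUST feature bit in CPUID.(EAX=7,ECX=0):EBX[1] has been checked
first):

#ifndef MSR_TSC_ADJUST
#define MSR_TSC_ADJUST	0x03b		/* IA32_TSC_ADJUST */
#endif

/*
 * Zero IA32_TSC_ADJUST on the calling CPU so that all TSCs run from
 * the same baseline (Intel SDM Vol. 3, Section 17.17.3).
 */
static void
tsc_adjust_reset(void)
{
	if (rdmsr(MSR_TSC_ADJUST) != 0)
		wrmsr(MSR_TSC_ADJUST, 0);
}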

> Index: arch/amd64/amd64/tsc.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 tsc.c
> --- arch/amd64/amd64/tsc.c  6 Apr 2020 00:01:08 -   1.16
> +++ arch/amd64/amd64/tsc.c  20 Jun 2020 20:01:46 -
> @@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
>   int i;
>  
>   for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
> - tsc1 = rdtsc();
> + tsc1 = rdtsc_lfence();
>   n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
> - tsc2 = rdtsc();
> + tsc2 = rdtsc_lfence();
>  
>   if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
>   *count = n;
> @@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
>  void
>  tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
>  {
> +#define TSC_DEBUG
>  #ifdef TSC_DEBUG
> - printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
> + printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
>   (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
>  #endif
>  
> @@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
>  
>   /* Flag it and read our TSC. */
>   atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> - bptsc = (rdtsc() >> 1);
> + bptsc = (rdtsc_lfence() >> 1);
>  
>   /* Wait for remote to complete, and read ours again. */
>   while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
>   membar_consumer();
> - bptsc += (rdtsc() >> 1);
> + bptsc += (rdtsc_lfence() >> 1);
>  
>   /* Wait for the results to come in. */
>   while (tsc_sync_cpu == ci)
> @@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
>   /* Wait for go-ahead from primary. */
>   while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
>   membar_consumer();
> - tsc = (rdtsc() >> 1);
> + tsc = (rdtsc_lfence() >> 1);
>  
>   /* Instruct primary to read its counter. */
>   atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
> - tsc += (rdtsc() >> 1);
> + tsc += (rdtsc_lfence() >> 1);
>  
>   /* Post result.  Ensure the whole value goes out atomically. */
>   (void)atomic_swap_64(&tsc_sync_val, tsc);
> Index: arch/amd64/include/cpufunc.h
> ===
> RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
> retrieving revision 1.34
> diff -u -p -r1.34 cpufunc.h
> --- arch/amd64/include/cpufunc.h  28 Jun 2019 21:54:05 -  1.34
> +++ arch/amd64/include/cpufunc.h  20 Jun 2020 20:01:46 -
> @@ -292,6 +292,15 @@ rdtsc(void)
>  }
>  
>  static __inline u_int64_t
> +rdtsc_lfence(void)
> +{
> + uint32_t hi, lo;
> +
> + __asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> + return (((uint64_t)hi << 32) | (uint64_t) lo);
> +}
> +
> +static __inline u_int64_t
>  rdpmc(u_int pmc)
>  {
>   uint32_t hi, lo;



lfence for rdtsc

2020-06-20 Thread Mark Kettenis
RDTSC is not a serializing instruction; to make sure we get the TSC
value corresponding to the position of RDTSC in the instruction stream
we need a barrier.  Linux uses LFENCE on machines where it is
available.  FreeBSD seems to prefer MFENCE for AMD CPUs but uses
LFENCE for Intel CPUs.  For now my thinking is that what's good enough
for Linux should be good enough for us.  And on amd64 LFENCE is always
available.

This diff reduces the scatter in the skew values.  Before I had
occasional outliers of more than 200 cycles.  Now the maximum values I see are
around 60 cycles.

I didn't change the rdtsc() call that reads the timecounter.  But
maybe that one should change as well?  Bit of a tradeoff between
performance and accuracy, I think.

This also changes the skew print message (stolen from what Theo put in
snaps).  Printing the CPU number makes it easier to get statistics for
a specific CPU.  Diff also enables the debug message.  Maybe it should
be committed this way and then disabled again later such that we can
get some statistics?

comments?  ok?


Index: arch/amd64/amd64/tsc.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
retrieving revision 1.16
diff -u -p -r1.16 tsc.c
--- arch/amd64/amd64/tsc.c  6 Apr 2020 00:01:08 -   1.16
+++ arch/amd64/amd64/tsc.c  20 Jun 2020 20:01:46 -
@@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
int i;
 
for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
-   tsc1 = rdtsc();
+   tsc1 = rdtsc_lfence();
n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
-   tsc2 = rdtsc();
+   tsc2 = rdtsc_lfence();
 
if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
*count = n;
@@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
 void
 tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
 {
+#define TSC_DEBUG
 #ifdef TSC_DEBUG
-   printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
+   printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
(long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
 #endif
 
@@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
 
/* Flag it and read our TSC. */
	atomic_setbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-   bptsc = (rdtsc() >> 1);
+   bptsc = (rdtsc_lfence() >> 1);
 
/* Wait for remote to complete, and read ours again. */
while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
membar_consumer();
-   bptsc += (rdtsc() >> 1);
+   bptsc += (rdtsc_lfence() >> 1);
 
/* Wait for the results to come in. */
while (tsc_sync_cpu == ci)
@@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
/* Wait for go-ahead from primary. */
while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
membar_consumer();
-   tsc = (rdtsc() >> 1);
+   tsc = (rdtsc_lfence() >> 1);
 
/* Instruct primary to read its counter. */
	atomic_clearbits_int(&ci->ci_flags, CPUF_SYNCTSC);
-   tsc += (rdtsc() >> 1);
+   tsc += (rdtsc_lfence() >> 1);
 
/* Post result.  Ensure the whole value goes out atomically. */
	(void)atomic_swap_64(&tsc_sync_val, tsc);
Index: arch/amd64/include/cpufunc.h
===
RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
retrieving revision 1.34
diff -u -p -r1.34 cpufunc.h
--- arch/amd64/include/cpufunc.h28 Jun 2019 21:54:05 -  1.34
+++ arch/amd64/include/cpufunc.h20 Jun 2020 20:01:46 -
@@ -292,6 +292,15 @@ rdtsc(void)
 }
 
 static __inline u_int64_t
+rdtsc_lfence(void)
+{
+   uint32_t hi, lo;
+
+   __asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
+   return (((uint64_t)hi << 32) | (uint64_t) lo);
+}
+
+static __inline u_int64_t
 rdpmc(u_int pmc)
 {
uint32_t hi, lo;