Re: lfence for rdtsc

2020-06-21 Thread Theo de Raadt
Mark Kettenis  wrote:

> But maybe the default rdtsc() should include the lfence.  And then we
> could have rdtsc_unordered() for this cases that don't care about
> ordering.

Right.

But I don't like the word 'order', because it is too vague.  There
are layers of ordering, speculation, asyncronous execution, etc.
and lfence just deals with some of them.

> As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
> And maybe one of the network stack people should investigate what the
> impact of having the fence in the timecounter is?

cpu_rnd_messybits is indifferent.  It is amusing to capture some
speculation side effect, but if we don't, it isn't the end of the world.
A few days ago, we didn't even know this aspect existed...



Re: lfence for rdtsc

2020-06-21 Thread Mark Kettenis
> Date: Sun, 21 Jun 2020 16:55:56 +0100
> From: Stuart Henderson 
> 
> On 2020/06/21 18:46, Paul Irofti wrote:
> > 
> > 
> > În 21 iunie 2020 16:30:43 EEST, Theo de Raadt  a scris:
> > >Paul Irofti  wrote:
> > >
> > >> If you change the name to rdtsc_ordered(), OK.
> > >
> > >That is a weaker name.
> > >
> > >Ordered in what way, at what level; ordered against what?
> > >
> > >This is using a specific pipeline ordering known as lfence.
> > >So it might as well say lfence.  That is the technical name for
> > >that type of ordering.  Being vague is unhelpful.
> > 
> > 
> > Ok then, if you think that's best.
> > 
> 
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

Yes; LFENCE only exists on machines with SSE2.  So if you want
something that works on older (32-bit) CPUs you need to use a
different instruction.  CPUID is the canonical choice for that.



Re: lfence for rdtsc

2020-06-21 Thread Paul Irofti
On Sun, Jun 21, 2020 at 04:55:56PM +0100, Stuart Henderson wrote:
> On 2020/06/21 18:46, Paul Irofti wrote:
> > 
> > 
> > În 21 iunie 2020 16:30:43 EEST, Theo de Raadt  a scris:
> > >Paul Irofti  wrote:
> > >
> > >> If you change the name to rdtsc_ordered(), OK.
> > >
> > >That is a weaker name.
> > >
> > >Ordered in what way, at what level; ordered against what?
> > >
> > >This is using a specific pipeline ordering known as lfence.
> > >So it might as well say lfence.  That is the technical name for
> > >that type of ordering.  Being vague is unhelpful.
> > 
> > 
> > Ok then, if you think that's best.
> > 
> 
> Any idea why in
> https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
> they are using cpuid to serialize access instead of lfence?

If I remember correctly it is because it is also a serializing
instruction, but nowadays it is more expensive than lfence.



Re: lfence for rdtsc

2020-06-21 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Sun, 21 Jun 2020 07:30:43 -0600
> 
> Paul Irofti  wrote:
> 
> > If you change the name to rdtsc_ordered(), OK.
> 
> That is a weaker name.
> 
> Ordered in what way, at what level; ordered against what?
> 
> This is using a specific pipeline ordering known as lfence.
> So it might as well say lfence.  That is the technical name for
> that type of ordering.  Being vague is unhelpful.

But maybe the default rdtsc() should include the lfence.  And then we
could have rdtsc_unordered() for this cases that don't care about
ordering.

As I wrote in my first mail, cpu_rnd_messybits() may want to use that.
And maybe one of the network stack people should investigate what the
impact of having the fence in the timecounter is?



Re: lfence for rdtsc

2020-06-21 Thread Stuart Henderson
On 2020/06/21 18:46, Paul Irofti wrote:
> 
> 
> În 21 iunie 2020 16:30:43 EEST, Theo de Raadt  a scris:
> >Paul Irofti  wrote:
> >
> >> If you change the name to rdtsc_ordered(), OK.
> >
> >That is a weaker name.
> >
> >Ordered in what way, at what level; ordered against what?
> >
> >This is using a specific pipeline ordering known as lfence.
> >So it might as well say lfence.  That is the technical name for
> >that type of ordering.  Being vague is unhelpful.
> 
> 
> Ok then, if you think that's best.
> 

Any idea why in
https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html
they are using cpuid to serialize access instead of lfence?



Re: lfence for rdtsc

2020-06-21 Thread Paul Irofti



În 21 iunie 2020 16:30:43 EEST, Theo de Raadt  a scris:
>Paul Irofti  wrote:
>
>> If you change the name to rdtsc_ordered(), OK.
>
>That is a weaker name.
>
>Ordered in what way, at what level; ordered against what?
>
>This is using a specific pipeline ordering known as lfence.
>So it might as well say lfence.  That is the technical name for
>that type of ordering.  Being vague is unhelpful.


Ok then, if you think that's best.



Re: lfence for rdtsc

2020-06-21 Thread Theo de Raadt
Paul Irofti  wrote:

> If you change the name to rdtsc_ordered(), OK.

That is a weaker name.

Ordered in what way, at what level; ordered against what?

This is using a specific pipeline ordering known as lfence.
So it might as well say lfence.  That is the technical name for
that type of ordering.  Being vague is unhelpful.



Re: lfence for rdtsc

2020-06-21 Thread Robert Nagy
it definitely got better:

cpu0: TSC skew=0 observed drift=0
cpu0: TSC skew=0 observed drift=0
cpu1: TSC skew=51 observed drift=0
cpu2: TSC skew=68 observed drift=0
cpu3: TSC skew=68 observed drift=0
cpu4: TSC skew=0 observed drift=0
cpu5: TSC skew=0 observed drift=0
cpu6: TSC skew=85 observed drift=0
cpu7: TSC skew=51 observed drift=0
cpu8: TSC skew=17 observed drift=0
cpu9: TSC skew=34 observed drift=0
cpu10: TSC skew=0 observed drift=0
cpu11: TSC skew=17 observed drift=0
cpu12: TSC skew=0 observed drift=0
cpu13: TSC skew=-51 observed drift=0
cpu14: TSC skew=-17 observed drift=0
cpu15: TSC skew=-17 observed drift=0



Re: lfence for rdtsc

2020-06-21 Thread Paul Irofti
On Sat, Jun 20, 2020 at 10:02:19PM +0200, Mark Kettenis wrote:
> RDTSC is not a serializing instruction; to make sure we get the TSC
> value corresponding to the position of RDTSC in te instruction stream
> we need a barrier.  Linux uses LFENCE on machines where it is
> available.  FreeBSD seems to prefer MFENCE for AMD CPUs but uses
> LFENCE for Intel CPUs.  For now my thinjing is that what's good enough
> for Linux should be good enough for us.  And on amd64 LFENCE is always
> available.
> 
> This diff reduces the scatter in the skew values.  Before I had
> occasional outliers of more than 200 cycles.  Now the maximem values I see 
> are around 60 cycles.
> 
> I din't changes the rdtsc() call that reads the timecounter.  But
> maybe that one should change as well?  Bit of a tradeof between
> performance and accoracy I think.
> 
> This also changes the skew print message (stolen from what Theo put in
> snaps).  Printing the CPU number makes it easier to get statistics for
> a specific CPU.  Diff also enabled the debug message.  Maybe it should
> be committed this way and then disabled again later such that we can
> get some statistics?
> 
> comments?  ok?

If you change the name to rdtsc_ordered(), OK.

By the way, if you want to continue in this direction you can look into
adding support for the TSC_ADJUST MSR to synchronize TSC across CPUs
as described in Section 17.17.3 from the Intel manual.

> Index: arch/amd64/amd64/tsc.c
> ===
> RCS file: /cvs/src/sys/arch/amd64/amd64/tsc.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 tsc.c
> --- arch/amd64/amd64/tsc.c6 Apr 2020 00:01:08 -   1.16
> +++ arch/amd64/amd64/tsc.c20 Jun 2020 20:01:46 -
> @@ -100,9 +100,9 @@ get_tsc_and_timecount(struct timecounter
>   int i;
>  
>   for (i = 0; i < RECALIBRATE_MAX_RETRIES; i++) {
> - tsc1 = rdtsc();
> + tsc1 = rdtsc_lfence();
>   n = (tc->tc_get_timecount(tc) & tc->tc_counter_mask);
> - tsc2 = rdtsc();
> + tsc2 = rdtsc_lfence();
>  
>   if ((tsc2 - tsc1) < RECALIBRATE_SMI_THRESHOLD) {
>   *count = n;
> @@ -216,8 +216,9 @@ tsc_get_timecount(struct timecounter *tc
>  void
>  tsc_timecounter_init(struct cpu_info *ci, uint64_t cpufreq)
>  {
> +#define TSC_DEBUG
>  #ifdef TSC_DEBUG
> - printf("%s: TSC skew=%lld observed drift=%lld\n", __func__,
> + printf("%s: TSC skew=%lld observed drift=%lld\n", ci->ci_dev->dv_xname,
>   (long long)ci->ci_tsc_skew, (long long)tsc_drift_observed);
>  #endif
>  
> @@ -276,12 +277,12 @@ tsc_read_bp(struct cpu_info *ci, uint64_
>  
>   /* Flag it and read our TSC. */
>   atomic_setbits_int(>ci_flags, CPUF_SYNCTSC);
> - bptsc = (rdtsc() >> 1);
> + bptsc = (rdtsc_lfence() >> 1);
>  
>   /* Wait for remote to complete, and read ours again. */
>   while ((ci->ci_flags & CPUF_SYNCTSC) != 0)
>   membar_consumer();
> - bptsc += (rdtsc() >> 1);
> + bptsc += (rdtsc_lfence() >> 1);
>  
>   /* Wait for the results to come in. */
>   while (tsc_sync_cpu == ci)
> @@ -317,11 +318,11 @@ tsc_post_ap(struct cpu_info *ci)
>   /* Wait for go-ahead from primary. */
>   while ((ci->ci_flags & CPUF_SYNCTSC) == 0)
>   membar_consumer();
> - tsc = (rdtsc() >> 1);
> + tsc = (rdtsc_lfence() >> 1);
>  
>   /* Instruct primary to read its counter. */
>   atomic_clearbits_int(>ci_flags, CPUF_SYNCTSC);
> - tsc += (rdtsc() >> 1);
> + tsc += (rdtsc_lfence() >> 1);
>  
>   /* Post result.  Ensure the whole value goes out atomically. */
>   (void)atomic_swap_64(_sync_val, tsc);
> Index: arch/amd64/include/cpufunc.h
> ===
> RCS file: /cvs/src/sys/arch/amd64/include/cpufunc.h,v
> retrieving revision 1.34
> diff -u -p -r1.34 cpufunc.h
> --- arch/amd64/include/cpufunc.h  28 Jun 2019 21:54:05 -  1.34
> +++ arch/amd64/include/cpufunc.h  20 Jun 2020 20:01:46 -
> @@ -292,6 +292,15 @@ rdtsc(void)
>  }
>  
>  static __inline u_int64_t
> +rdtsc_lfence(void)
> +{
> + uint32_t hi, lo;
> +
> + __asm volatile("lfence; rdtsc" : "=d" (hi), "=a" (lo));
> + return (((uint64_t)hi << 32) | (uint64_t) lo);
> +}
> +
> +static __inline u_int64_t
>  rdpmc(u_int pmc)
>  {
>   uint32_t hi, lo;