Hi,

Since this is addressed to me I will reply, but note that I am in the
mountains until the end of the month with a poor internet connection.

First, it was not me who said that the network stack is blocking the change.
Somebody said that during the initial development of the userland
timecounter, when I proposed the same change.

I love that you actually provided data to analyze the issue! Statements like
that should always come with supporting data, as you have done here.

Just to be clear, I am OK with this, but I am not a network person, so I
don't know what the real issue was (if there ever was one).

Paul


On 23 August 2020 06:05:44 EEST, Scott Cheloha <[email protected]> wrote:
>On Tue, Jul 28, 2020 at 10:02:07AM +0300, Paul Irofti wrote:
>> 
>> [...]
>> 
>> Is the issue with LFENCE slowing down the network stack settled?
>> That was the argument against it last time.
>
>... a month passes.  Nobody says anything.
>
>This "it might slow down the network stack" thing keeps coming up, and
>yet nobody can point to (a) who expressed this concern or (b) what the
>penalty is in practice.
>
>Note that the alternative is "your timecounter might not be monotonic
>between threads".  For me, that's already a dealbreaker.
>
>But for sake of discussion let's look at some data.  For those of you
>watching from home, please follow along!  I would like to know what
>your results look like.
>
>To start, here is a microbenchmarking program for clock_gettime(2) on
>amd64.  If you have the userspace timecounter, then
>
>       clock_gettime(CLOCK_MONOTONIC, ...);
>
>is a suitable surrogate for nanouptime(9), so this microbenchmark can
>actually tell us about how nanouptime(9) or nanotime(9) would be
>impacted by a comparable change in the kernel timecounter.
>
>--
>
>/*
> * clock_gettime-bench.c
> */
>#include <err.h>
>#include <limits.h>
>#include <stdint.h>
>#include <stdio.h>
>#include <stdlib.h>
>#include <time.h>
>
>/*
> * Fence on both sides of RDTSC so the timestamp reads cannot be
> * reordered with the loop being measured.
> */
>static uint64_t
>rdtsc_lfence(void)
>{
>       uint32_t hi, lo;
>
>       __asm volatile("lfence; rdtsc; lfence" : "=d" (hi), "=a" (lo));
>       return ((uint64_t)hi << 32) | lo;
>}
>
>int
>main(int argc, char *argv[])
>{
>       struct timespec now;
>       uint64_t begin, end;
>       long long count, i;
>       const char *errstr;
>
>       if (argc != 2) {
>               fprintf(stderr, "usage: %s count\n", getprogname());
>               return 1;
>       }
>       count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
>       if (errstr != NULL)
>               errx(1, "count is %s: %s", errstr, argv[1]);
>
>       begin = rdtsc_lfence();
>       for (i = 0; i < count; i++)
>               clock_gettime(CLOCK_MONOTONIC, &now);
>       end = rdtsc_lfence();
>
>       printf("%lld\t%llu\n", count, end - begin);
>
>       return 0;
>}
>
>--
>
>Now consider a benchmark of 100K clock_gettime(2) calls against the
>userspace timecounter.
>
>$ clock_gettime-bench 100000
>100000  15703664
>
>Let's collect 10K of these benchmarks -- our samples -- atop an
>unpatched libc.  Use the shell script below.  Note that we throw out
>samples where we hit a context switch.
>
>--
>
>#! /bin/sh
>
>[ $# -ne 1 ] && exit 1
>RESULTS=$1
>shift
>
>TIME=$(mktemp) || exit 1
>TMP=$(mktemp) || exit 1
>
># Collect 10K samples.
>i=0
>while [ $i -lt 10000 ]; do
>       # Call clock_gettime(2) 100K times.
>       /usr/bin/time -l ~/scratch/clock_gettime-bench 100000 > $TMP 2> $TIME
>       # Ignore this sample if a context switch occurred.
>       if egrep -q '[1-9][0-9]* +(in)?voluntary context' $TIME; then
>               continue
>       fi
>       cat $TMP >> $RESULTS
>       i=$((i + 1))
>done
>
>rm $TMP $TIME
>
>--
>
>Run it like this:
>
>$ ksh bench.sh unpatched.out
>
>That will take ~5-10 minutes at most.
>
>Next, we'll patch libc to add the LFENCE to the userspace timecounter.
>
>Index: usertc.c
>===================================================================
>RCS file: /cvs/src/lib/libc/arch/amd64/gen/usertc.c,v
>retrieving revision 1.2
>diff -u -p -r1.2 usertc.c
>--- usertc.c   8 Jul 2020 09:17:48 -0000       1.2
>+++ usertc.c   22 Aug 2020 22:18:47 -0000
>@@ -19,10 +19,10 @@
> #include <sys/timetc.h>
> 
> static inline u_int
>-rdtsc(void)
>+rdtsc_lfence(void)
> {
>       uint32_t hi, lo;
>-      asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
>+      asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
>       return ((uint64_t)lo)|(((uint64_t)hi)<<32);
> }
> 
>@@ -31,7 +31,7 @@ tc_get_timecount(struct timekeep *tk, u_
> {
>       switch (tk->tk_user) {
>       case TC_TSC:
>-              *tc = rdtsc();
>+              *tc = rdtsc_lfence();
>               return 0;
>       }
> 
>--
>
>Recompile and reinstall libc.
>
>Then rerun the benchmark.  Be careful not to overwrite our results
>from the unpatched libc:
>
>$ ksh bench.sh patched.out
>
>--
>
>Alright, now let's compare the results.  I'm not a mathemagician so I
>use ministat and trust it implicitly.  A stat jock could probably do
>this in R or with some python, but I am not that clever, so I will
>stick with ministat.
>
>There is no ministat port for OpenBSD, but it is pretty trivial to
>clone this github repo and build it on -current:
>
>https://github.com/thorduri/ministat
>
>--
>
>Okay, you have ministat?
>
>Let's compare the results.  We want the 2nd column in the output
>(-C2).  I'm not interested in the graph (-q), given our population
>size.  We have N=10000, so let's push the CI up (-c 99.5).
>
>$ ~/repo/ministat/ministat -C2 -q -c99.5 unpatched.out patched.out
>x unpatched.out
>+ patched.out
>    N           Min           Max        Median           Avg        Stddev
>x 10000      13752102      18019218      14442398      14431918     237842.31
>+ 10000      15196970      16992030      15721390      15779178      181623.5
>Difference at 99.5% confidence
>        1.34726e+06 +/- 9247.11
>        9.33528% +/- 0.064074%
>        (Student's t, pooled s = 211608)
>
>So, in conclusion, prefixing RDTSC with LFENCE incurs a penalty.  It
>is unambiguously slower.
>
>What is the penalty?  No more than ~10%, which translates to maybe
>10-15 cycles per call on my machine.
>
>Something else worth noting is that with the LFENCE in place the
>stddev is smaller.  So, reading the clock, though slower, takes a more
>*consistent* amount of time with the LFENCE in place.
>
>--
>
>I don't think this penalty is so bad.
>
>Thoughts?
