> Date: Thu, 14 Oct 2021 16:13:17 -0500
> From: Scott Cheloha <scottchel...@gmail.com>
> 
> Hi,
> 
> When we compute high resolution time, both in the kernel and in libc,
> we get a 32-bit (or smaller) value from the active timecounter and
> scale it up into a 128-bit bintime.
> 
> The scaling math currently looks like this in the kernel:
> 
>       *bt = th->th_offset;
>       bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
> 
> It looks similar in libc.
> 
> The problem with this code is that if the product
> 
> 	th->th_scale * tc_delta(th)
> 
> exceeds UINT64_MAX, the result overflows and we lose time.
> 
> This is unlikely to happen under normal circumstances.  Normally, the
> clock interrupt fires about HZ times per second, so the product
> doesn't get very close to overflow.  You'd need to run with interrupts
> disabled on CPU0 for about 1 second to see an overflow.
> 
> However, I have seen this happen before in the wild.  If you try to
> build LLVM on an EdgeRouter Lite (512MB RAM), it will swap *very
> slowly* to the SD card and eventually lose time.
> 
> I think it is inevitable that we will encounter bugs of this nature at
> some point.  In general, I would like the timecounting layer to be
> more robust to nasty situations like these.  Losing time is a
> catastrophic failure.  It screws everything up.
> 
> The solution to this problem is to use the full 96-bit product when we
> scale the count up into a bintime.  We're multiplying a u_int
> (32-bit), the count, by a uint64_t, the scale, but we're not capturing
> the upper 32 bits of that product.  If we did, we would have a longer
> grace period between clock interrupts before we lost time.
> 
> The attached patch adds a TIMECOUNT_TO_BINTIME() function to sys/time.h
> and puts it to use in sys/kern/kern_tc.c and lib/libc/sys/microtime.c.
> The math is a bit boring; see the patch if you are curious.
> 
> As for the cost, there is a small but significant increase in overhead
> when reading the clock with the TSC.  Slower timecounters (HPET, ACPI
> timer) are so slow the extra overhead is noise.
> 
> I ran my usual synthetic benchmark against a patched and unpatched
> libc for a number of different sample sizes.  I ran the benchmark in
> single-user mode to avoid ntpd(8) interference (adjtime(2), adjfreq(2))
> and set hw.setperf=100 to ensure uniform results.
> 
> It looks like on amd64, userspace clock_gettime(2) is up to ~10%
> slower with the patch.  But there is a lot of variation between the
> comparisons, so I don't think it's a consistent 10%.  I'd say 10% is
> an upper bound.
> 
> To ensure the discussion is well-framed: we're talking about a 10%
> overhead on a 40-80 nanosecond function call.
> 
> Obviously the overhead will vary by platform.  I imagine 32-bit
> platforms will exceed 10% because we're now doing even more 64-bit
> math.  I'd be interested in results from e.g. macppc or armv7.
> 
> ministat summaries are below.  My benchmark program and script are
> below that.  The patch is attached below that.
> 
> Thoughts?

I never understood this code.  But I don't understand why, if our
timecounters are only 32 bits, we need more than 64 bits to store a
time difference on the order of seconds...

> --
> 
> All values (min, max, med, avg, stddev) are in seconds.
> 
> Call clock_gettime(2) 10,000 times.  Do that 1000 times.  ~5.88% increase.
> 
> x clock-gettime-tsc-user-head-1K-10K.dat1
> + clock-gettime-tsc-user-patch-1K-10K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 1000   0.000329069   0.000362506   0.000335691 0.00033582683 3.6951597e-06
> + 1000   0.000346276   0.000381319   0.000353605 0.00035557716  3.962225e-06
> Difference at 99.5% confidence
>         1.97503e-05 +/- 5.29405e-07
>         5.88111% +/- 0.157642%
>         (Student's t, pooled s = 3.83102e-06)
> 
> Call clock_gettime(2) 100,000 times.  Do that 1000 times.  ~1.31% increase.
> 
> x clock-gettime-tsc-user-head-1K-100K.dat1
> + clock-gettime-tsc-user-patch-1K-100K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 1000   0.003432839   0.003679475   0.003472749  0.0034827782 3.0275757e-05
> + 1000   0.003426934   0.003776351   0.003524453  0.0035284519 3.8631925e-05
> Difference at 99.5% confidence
>         4.56738e-05 +/- 4.79602e-06
>         1.31142% +/- 0.137707%
>         (Student's t, pooled s = 3.47062e-05)
> 
> Call clock_gettime(2) 5,000 times.  Do that 10,000 times.  ~8.53% increase.
> 
> x clock-gettime-tsc-user-head-10K-5K.dat1
> + clock-gettime-tsc-user-patch-10K-5K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 10000   0.000159681   0.000220758   0.000186826 0.00017805393  1.177249e-05
> + 10000   0.000171105   0.000217009   0.000200789 0.00019325052 1.1280074e-05
> Difference at 99.5% confidence
>         1.51966e-05 +/- 5.03804e-07
>         8.53482% +/- 0.28295%
>         (Student's t, pooled s = 1.15289e-05)
> 
> Call clock_gettime(2) 10,000 times.  Do that 10,000 times.  ~8.74% increase.
> 
> x clock-gettime-tsc-user-head-10K-10K.dat1
> + clock-gettime-tsc-user-patch-10K-10K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 10000   0.000320472   0.000410915   0.000371971 0.00035789663 2.2141645e-05
> + 10000   0.000340087   0.000436915   0.000393461 0.00038918836 1.8884201e-05
> Difference at 99.5% confidence
>         3.12917e-05 +/- 8.99219e-07
>         8.74323% +/- 0.251251%
>         (Student's t, pooled s = 2.05775e-05)
> 
> Call clock_gettime(2) 15,000 times.  Do that 10,000 times.  ~2.46% increase.
> 
> x clock-gettime-tsc-user-head-10K-15K.dat1
> + clock-gettime-tsc-user-patch-10K-15K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 10000    0.00049023   0.000660955   0.000573344 0.00055863893 2.9447818e-05
> + 10000   0.000510269   0.000651218   0.000584569 0.00057240886 3.4920043e-05
> Difference at 99.5% confidence
>         1.37699e-05 +/- 1.41149e-06
>         2.46491% +/- 0.252665%
>         (Student's t, pooled s = 3.23e-05)
> 
> --
> 
> /*
>  * clock-gettime-bench.c
>  *
>  * Public domain.
>  */
> 
> #include <sys/time.h>
> 
> #include <err.h>
> #include <limits.h>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> int
> main(int argc, char *argv[])
> {
>       struct timespec begin, elapsed, end, now;
>       long long count, i;
>       const char *errstr;
> 
>       if (argc != 2) {
>               fprintf(stderr, "usage: %s count\n", getprogname());
>               return 1;
>       }
>       count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
>       if (errstr != NULL)
>               errx(1, "count is %s: %s", errstr, argv[1]);
> 
>       clock_gettime(CLOCK_MONOTONIC, &begin);
>       for (i = 0; i < count; i++)
>               clock_gettime(CLOCK_MONOTONIC, &now);
>       clock_gettime(CLOCK_MONOTONIC, &end);
> 
>       timespecsub(&end, &begin, &elapsed);
>       printf("%lld\t%lld.%09ld\n",
>           count, (long long)elapsed.tv_sec, elapsed.tv_nsec);
> 
>       return 0;
> }
> 
> --
> 
> #! /bin/sh
> #
> # clock-gettime-bench.sh
> #
> # Public domain.
> #
> # Driver program for clock-gettime-bench.c that discards runs that are
> # context switched.  We want to measure *our* computation time, not
> # the time some other program spends running.
> #
> # Run as root to nice the program down to -20.
> #
> # In general, run from single-user mode at hw.setperf=100 to limit
> # interference from other variables and ensure a fair comparison.
> 
> warn()
> {
>       printf "$0: $@\n" >&2
> }
> 
> set -e
> 
> if [ $# -ne 2 ]; then
>       printf "usage: $0 nrounds nsamples\n" >&2
>       exit 1
> fi
> 
> nrounds=$1
> nsamples=$2
> 
> i=0
> while [ $i -lt $nrounds ]; do
> 	result=$(/usr/bin/time -l nice -n -20 ./clock-gettime-bench $nsamples 2>&1)
> 	if echo "$result" | tail -n 1 | egrep -q '^( ){9}0  involuntary context switches$'; then
>               echo "$result" | head -n 1
>               i=$((i + 1))
>       else
>               warn "context switched: retrying round $i"
>       fi
> done
> 
> exit 0
> 
> --
> 
> Index: sys/sys/time.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/time.h,v
> retrieving revision 1.61
> diff -u -p -r1.61 time.h
> --- sys/sys/time.h    19 Jun 2021 13:49:39 -0000      1.61
> +++ sys/sys/time.h    14 Oct 2021 20:58:38 -0000
> @@ -222,6 +222,17 @@ bintimesub(const struct bintime *bt, con
>   *   time_second ticks after N.999999999 not after N.4999999999
>   */
>  
> +static inline void
> +TIMECOUNT_TO_BINTIME(u_int count, uint64_t scale, struct bintime *bt)
> +{
> +     uint64_t hi64;
> +
> +     hi64 = count * (scale >> 32);
> +     bt->sec = hi64 >> 32;
> +     bt->frac = hi64 << 32;
> +     bintimeaddfrac(bt, count * (scale & 0xffffffff), bt);
> +}
> +
>  static inline uint32_t
>  FRAC_TO_NSEC(uint64_t frac)
>  {
> Index: sys/kern/kern_tc.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_tc.c,v
> retrieving revision 1.74
> diff -u -p -r1.74 kern_tc.c
> --- sys/kern/kern_tc.c        19 Jun 2021 13:49:39 -0000      1.74
> +++ sys/kern/kern_tc.c        14 Oct 2021 20:58:38 -0000
> @@ -189,8 +189,8 @@ binuptime(struct bintime *bt)
>               th = timehands;
>               gen = th->th_generation;
>               membar_consumer();
> -             *bt = th->th_offset;
> -             bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
> +             TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
> +             bintimeadd(bt, &th->th_offset, bt);
>               membar_consumer();
>       } while (gen == 0 || gen != th->th_generation);
>  }
> @@ -278,7 +278,8 @@ binruntime(struct bintime *bt)
>               th = timehands;
>               gen = th->th_generation;
>               membar_consumer();
> -             bintimeaddfrac(&th->th_offset, th->th_scale * tc_delta(th), bt);
> +             TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
> +             bintimeadd(bt, &th->th_offset, bt);
>               bintimesub(bt, &th->th_naptime, bt);
>               membar_consumer();
>       } while (gen == 0 || gen != th->th_generation);
> @@ -303,8 +304,8 @@ bintime(struct bintime *bt)
>               th = timehands;
>               gen = th->th_generation;
>               membar_consumer();
> -             *bt = th->th_offset;
> -             bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
> +             TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
> +             bintimeadd(bt, &th->th_offset, bt);
>               bintimeadd(bt, &th->th_boottime, bt);
>               membar_consumer();
>       } while (gen == 0 || gen != th->th_generation);
> @@ -641,7 +642,8 @@ tc_windup(struct bintime *new_boottime, 
>               ncount = 0;
>       th->th_offset_count += delta;
>       th->th_offset_count &= th->th_counter->tc_counter_mask;
> -     bintimeaddfrac(&th->th_offset, th->th_scale * delta, &th->th_offset);
> +     TIMECOUNT_TO_BINTIME(delta, th->th_scale, &bt);
> +     bintimeadd(&th->th_offset, &bt, &th->th_offset);
>  
>       /*
>        * Ignore new offsets that predate the current offset.
> Index: lib/libc/sys/microtime.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/sys/microtime.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 microtime.c
> --- lib/libc/sys/microtime.c  6 Jul 2020 13:33:06 -0000       1.1
> +++ lib/libc/sys/microtime.c  14 Oct 2021 20:58:38 -0000
> @@ -45,10 +45,10 @@ binuptime(struct bintime *bt, struct tim
>       do {
>               gen = tk->tk_generation;
>               membar_consumer();
> -             *bt = tk->tk_offset;
>               if (tc_delta(tk, &delta))
>                       return -1;
> -             bintimeaddfrac(bt, tk->tk_scale * delta, bt);
> +             TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
> +             bintimeadd(bt, &tk->tk_offset, bt);
>               membar_consumer();
>       } while (gen == 0 || gen != tk->tk_generation);
>  
> @@ -65,7 +65,8 @@ binruntime(struct bintime *bt, struct ti
>               membar_consumer();
>               if (tc_delta(tk, &delta))
>                       return -1;
> -             bintimeaddfrac(&tk->tk_offset, tk->tk_scale * delta, bt);
> +             TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
> +             bintimeadd(bt, &tk->tk_offset, bt);
>               bintimesub(bt, &tk->tk_naptime, bt);
>               membar_consumer();
>       } while (gen == 0 || gen != tk->tk_generation);
> @@ -81,10 +82,10 @@ bintime(struct bintime *bt, struct timek
>       do {
>               gen = tk->tk_generation;
>               membar_consumer();
> -             *bt = tk->tk_offset;
>               if (tc_delta(tk, &delta))
>                       return -1;
> -             bintimeaddfrac(bt, tk->tk_scale * delta, bt);
> +             TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
> +             bintimeadd(bt, &tk->tk_offset, bt);
>               bintimeadd(bt, &tk->tk_boottime, bt);
>               membar_consumer();
>       } while (gen == 0 || gen != tk->tk_generation);
> 
> 
