> Date: Thu, 14 Oct 2021 16:13:17 -0500
> From: Scott Cheloha <scottchel...@gmail.com>
>
> Hi,
>
> When we compute high resolution time, both in the kernel and in libc,
> we get a 32-bit (or smaller) value from the active timecounter and
> scale it up into a 128-bit bintime.
>
> The scaling math currently looks like this in the kernel:
>
>         *bt = th->th_offset;
>         bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
>
> It looks similar in libc.
>
> The problem with this code is that if the product
>
>         th->th_scale * tc_delta(th)
>
> exceeds UINT64_MAX, the result overflows and we lose time.
>
> This is unlikely to happen under normal circumstances. Normally, the
> clock interrupt fires about HZ times per second, so the product
> doesn't get very close to overflow. You'd need to run with interrupts
> disabled on CPU0 for about 1 second to see an overflow.
>
> However, I have seen this happen before in the wild. If you try to
> build LLVM on an EdgeRouter Lite (512MB RAM), it will swap *very
> slowly* to the SD card and eventually lose time.
>
> I think it is inevitable that we will encounter bugs of this nature
> at some point. In general, I would like the timecounting layer to be
> more robust to nasty situations like these. Losing time is a
> catastrophic failure. It screws everything up.
>
> The solution to this problem is to use the full 96-bit product when
> we scale the count up into a bintime. We're multiplying a u_int
> (32-bit), the count, by a uint64_t, the scale, but we're not
> capturing the upper 32 bits of that product. If we did, we would have
> a longer grace period between clock interrupts before we lost time.
>
> The attached patch adds a TIMECOUNT_TO_BINTIME() function to
> sys/time.h and puts it to use in sys/kern/kern_tc.c and
> lib/libc/sys/microtime.c. The math is a bit boring, see the patch if
> you are curious.
>
> As for the cost, there is a small but significant increase in
> overhead when reading the clock with the TSC. Slower timecounters
> (HPET, ACPI timer) are so slow the extra overhead is noise.
>
> I ran my usual synthetic benchmark against a patched and unpatched
> libc for a number of different sample sizes. I ran the benchmark in
> single-user mode to avoid ntpd(8) interference (adjtime(2),
> adjfreq(2)) and set hw.setperf=100 to ensure uniform results.
>
> It looks to me like on amd64, userspace clock_gettime(2) is up to
> ~10% slower with the patch. But there is a lot of variation between
> the comparisons, so I don't think it's a consistent 10%. I'd say 10%
> is an upper bound.
>
> To keep the discussion well-framed: we're talking about a 10%
> overhead on a 40-80 nanosecond function call.
>
> Obviously the overhead will vary by platform. I imagine 32-bit
> platforms will exceed 10% because we're now doing even more 64-bit
> math. I'd be interested in results from e.g. macppc or armv7.
>
> ministat summaries are below. My benchmark program and script are
> below that. The patch is attached below that.
>
> Thoughts?

I never understood this code. But I don't understand why, if our
timecounters are only 32 bits, we need more than 64 bits to store a
time difference on the order of seconds...
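As I understand it, it isn't about storing a bigger difference: the
scale is a 64-bit binary fraction, roughly 2^64 / frequency, i.e.
units of 2^-64 seconds per tick. So the full product of the 32-bit
count and the 64-bit scale is up to 96 bits wide, and the upper 32
bits are the whole seconds that the truncated 64-bit product silently
drops once the delta spans about one second. A standalone sketch of
the widened multiply (illustrative names only; the real thing is
TIMECOUNT_TO_BINTIME() in the patch below):

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for struct bintime: 64.64 fixed-point seconds. */
struct sketch_bintime {
        int64_t sec;            /* whole seconds */
        uint64_t frac;          /* units of 2^-64 seconds */
};

/*
 * Compute the full 96-bit product count * scale by splitting the
 * scale into 32-bit halves, so no intermediate multiply exceeds
 * 64 bits, then fold the result into sec/frac.
 */
static void
sketch_scale(uint32_t count, uint64_t scale, struct sketch_bintime *bt)
{
        uint64_t hi, lo;

        hi = (uint64_t)count * (scale >> 32);           /* bits 32-95 */
        lo = (uint64_t)count * (scale & 0xffffffff);    /* bits 0-63 */
        bt->sec = hi >> 32;
        bt->frac = (hi << 32) + lo;
        if (bt->frac < lo)              /* carry out of the fraction */
                bt->sec++;
}

int
main(void)
{
        uint64_t scale = UINT64_MAX / 1000000000;       /* ~1 GHz counter */
        uint32_t delta = 3000000000U;   /* ~3 seconds between updates */
        struct sketch_bintime bt;

        sketch_scale(delta, scale, &bt);
        printf("widened:     %.9f sec\n", bt.sec + bt.frac / 0x1p64);
        printf("64-bit only: %.9f sec\n", scale * delta / 0x1p64);
        return 0;
}

With a ~1 GHz counter and ~3 seconds of ticks this prints ~3.0
seconds for the widened math and ~1.0 second for the wrapped 64-bit
product, i.e. two whole seconds silently lost.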
> --
>
> All values (min, max, med, avg, stddev) are in seconds.
>
> Call clock_gettime(2) 10,000 times. Do that 1000 times. ~5.88% increase.
>
> x clock-gettime-tsc-user-head-1K-10K.dat1
> + clock-gettime-tsc-user-patch-1K-10K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 1000   0.000329069   0.000362506   0.000335691 0.00033582683 3.6951597e-06
> + 1000   0.000346276   0.000381319   0.000353605 0.00035557716  3.962225e-06
> Difference at 99.5% confidence
>         1.97503e-05 +/- 5.29405e-07
>         5.88111% +/- 0.157642%
>         (Student's t, pooled s = 3.83102e-06)
>
> Call clock_gettime(2) 100,000 times. Do that 1000 times. ~1.31% increase.
>
> x clock-gettime-tsc-user-head-1K-100K.dat1
> + clock-gettime-tsc-user-patch-1K-100K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 1000   0.003432839   0.003679475   0.003472749  0.0034827782 3.0275757e-05
> + 1000   0.003426934   0.003776351   0.003524453  0.0035284519 3.8631925e-05
> Difference at 99.5% confidence
>         4.56738e-05 +/- 4.79602e-06
>         1.31142% +/- 0.137707%
>         (Student's t, pooled s = 3.47062e-05)
>
> Call clock_gettime(2) 5,000 times. Do that 10,000 times. ~8.53% increase.
>
> x clock-gettime-tsc-user-head-10K-5K.dat1
> + clock-gettime-tsc-user-patch-10K-5K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 10000  0.000159681   0.000220758   0.000186826 0.00017805393  1.177249e-05
> + 10000  0.000171105   0.000217009   0.000200789 0.00019325052 1.1280074e-05
> Difference at 99.5% confidence
>         1.51966e-05 +/- 5.03804e-07
>         8.53482% +/- 0.28295%
>         (Student's t, pooled s = 1.15289e-05)
>
> Call clock_gettime(2) 10,000 times. Do that 10,000 times. ~8.74% increase.
>
> x clock-gettime-tsc-user-head-10K-10K.dat1
> + clock-gettime-tsc-user-patch-10K-10K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 10000  0.000320472   0.000410915   0.000371971 0.00035789663 2.2141645e-05
> + 10000  0.000340087   0.000436915   0.000393461 0.00038918836 1.8884201e-05
> Difference at 99.5% confidence
>         3.12917e-05 +/- 8.99219e-07
>         8.74323% +/- 0.251251%
>         (Student's t, pooled s = 2.05775e-05)
>
> Call clock_gettime(2) 15,000 times. Do that 10,000 times. ~2.47% increase.
>
> x clock-gettime-tsc-user-head-10K-15K.dat1
> + clock-gettime-tsc-user-patch-10K-15K.dat1
>     N           Min           Max        Median           Avg        Stddev
> x 10000   0.00049023   0.000660955   0.000573344 0.00055863893 2.9447818e-05
> + 10000  0.000510269   0.000651218   0.000584569 0.00057240886 3.4920043e-05
> Difference at 99.5% confidence
>         1.37699e-05 +/- 1.41149e-06
>         2.46491% +/- 0.252665%
>         (Student's t, pooled s = 3.23e-05)
>
> --
>
> /*
>  * clock-gettime-bench.c
>  *
>  * Public domain.
>  */
>
> #include <sys/time.h>
>
> #include <err.h>
> #include <limits.h>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int
> main(int argc, char *argv[])
> {
>         struct timespec begin, elapsed, end, now;
>         long long count, i;
>         const char *errstr;
>
>         if (argc != 2) {
>                 fprintf(stderr, "usage: %s count\n", getprogname());
>                 return 1;
>         }
>         count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
>         if (errstr != NULL)
>                 errx(1, "count is %s: %s", errstr, argv[1]);
>
>         clock_gettime(CLOCK_MONOTONIC, &begin);
>         for (i = 0; i < count; i++)
>                 clock_gettime(CLOCK_MONOTONIC, &now);
>         clock_gettime(CLOCK_MONOTONIC, &end);
>
>         timespecsub(&end, &begin, &elapsed);
>         printf("%lld\t%lld.%09ld\n",
>             count, (long long)elapsed.tv_sec, elapsed.tv_nsec);
>
>         return 0;
> }
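For anyone who wants to rerun the comparison: I assume the benchmark
builds with the base compiler and nothing special, something like

        $ cc -O2 -o clock-gettime-bench clock-gettime-bench.c
        $ ./clock-gettime-bench 10000

printing one elapsed-time sample per invocation; the driver script
below takes care of collecting only clean (uninterrupted) samples.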
> --
>
> #! /bin/sh
> #
> # clock-gettime-bench.sh
> #
> # Public domain.
> #
> # Driver program for clock-gettime-bench.c that discards runs that
> # are context switched. We want to measure *our* computation time,
> # not the time some other program spends running.
> #
> # Run as root to nice the program down to -20.
> #
> # In general, run from single-user mode at hw.setperf=100 to limit
> # interference from other variables and ensure a fair comparison.
>
> warn()
> {
>         printf "$0: $@\n" >&2
> }
>
> set -e
>
> if [ $# -ne 2 ]; then
>         printf "usage: $0 nrounds nsamples\n" >&2
>         exit 1
> fi
>
> nrounds=$1
> nsamples=$2
>
> i=0
> while [ $i -lt $nrounds ]; do
>         result=$(/usr/bin/time -l nice -n -20 ./clock-gettime-bench $nsamples 2>&1)
>         if echo "$result" | tail -n 1 | \
>             egrep -q '^( ){9}0 involuntary context switches$'; then
>                 echo "$result" | head -n 1
>                 i=$((i + 1))
>         else
>                 warn "context switched: retrying round $i"
>         fi
> done
>
> exit 0
>
> --
>
> Index: sys/sys/time.h
> ===================================================================
> RCS file: /cvs/src/sys/sys/time.h,v
> retrieving revision 1.61
> diff -u -p -r1.61 time.h
> --- sys/sys/time.h      19 Jun 2021 13:49:39 -0000      1.61
> +++ sys/sys/time.h      14 Oct 2021 20:58:38 -0000
> @@ -222,6 +222,17 @@ bintimesub(const struct bintime *bt, con
>   * time_second ticks after N.999999999 not after N.4999999999
>   */
>  
> +static inline void
> +TIMECOUNT_TO_BINTIME(u_int count, uint64_t scale, struct bintime *bt)
> +{
> +        uint64_t hi64;
> +
> +        hi64 = count * (scale >> 32);
> +        bt->sec = hi64 >> 32;
> +        bt->frac = hi64 << 32;
> +        bintimeaddfrac(bt, count * (scale & 0xffffffff), bt);
> +}
> +
>  static inline uint32_t
>  FRAC_TO_NSEC(uint64_t frac)
>  {
> Index: sys/kern/kern_tc.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_tc.c,v
> retrieving revision 1.74
> diff -u -p -r1.74 kern_tc.c
> --- sys/kern/kern_tc.c  19 Jun 2021 13:49:39 -0000      1.74
> +++ sys/kern/kern_tc.c  14 Oct 2021 20:58:38 -0000
> @@ -189,8 +189,8 @@ binuptime(struct bintime *bt)
>                  th = timehands;
>                  gen = th->th_generation;
>                  membar_consumer();
> -                *bt = th->th_offset;
> -                bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
> +                TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
> +                bintimeadd(bt, &th->th_offset, bt);
>                  membar_consumer();
>          } while (gen == 0 || gen != th->th_generation);
>  }
> @@ -278,7 +278,8 @@ binruntime(struct bintime *bt)
>                  th = timehands;
>                  gen = th->th_generation;
>                  membar_consumer();
> -                bintimeaddfrac(&th->th_offset, th->th_scale * tc_delta(th), bt);
> +                TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
> +                bintimeadd(bt, &th->th_offset, bt);
>                  bintimesub(bt, &th->th_naptime, bt);
>                  membar_consumer();
>          } while (gen == 0 || gen != th->th_generation);
> @@ -303,8 +304,8 @@ bintime(struct bintime *bt)
>                  th = timehands;
>                  gen = th->th_generation;
>                  membar_consumer();
> -                *bt = th->th_offset;
> -                bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
> +                TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
> +                bintimeadd(bt, &th->th_offset, bt);
>                  bintimeadd(bt, &th->th_boottime, bt);
>                  membar_consumer();
>          } while (gen == 0 || gen != th->th_generation);
> @@ -641,7 +642,8 @@ tc_windup(struct bintime *new_boottime,
>          ncount = 0;
>          th->th_offset_count += delta;
>          th->th_offset_count &= th->th_counter->tc_counter_mask;
> -        bintimeaddfrac(&th->th_offset, th->th_scale * delta, &th->th_offset);
> +        TIMECOUNT_TO_BINTIME(delta, th->th_scale, &bt);
> +        bintimeadd(&th->th_offset, &bt, &th->th_offset);
>  
>          /*
>           * Ignore new offsets that predate the current offset.
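For what it's worth, the split multiply plus the carry that
bintimeaddfrac() performs can be checked for exactness against a
compiler-provided 128-bit product. A quick standalone test, assuming
__int128 (i.e. a 64-bit platform); split_mul() is an illustrative
copy of the math, not the patch itself:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same split multiply as TIMECOUNT_TO_BINTIME(), carry included. */
static void
split_mul(uint32_t count, uint64_t scale, uint64_t *sec, uint64_t *frac)
{
        uint64_t hi = (uint64_t)count * (scale >> 32);
        uint64_t lo = (uint64_t)count * (scale & 0xffffffff);

        *sec = hi >> 32;
        *frac = (hi << 32) + lo;
        if (*frac < lo)                 /* carry into the seconds */
                (*sec)++;
}

int
main(void)
{
        int i;

        for (i = 0; i < 1000000; i++) {
                uint32_t count = arc4random();
                uint64_t scale = (uint64_t)arc4random() << 32 | arc4random();
                unsigned __int128 ref = (unsigned __int128)count * scale;
                uint64_t sec, frac;

                /* The split result must match the full product exactly. */
                split_mul(count, scale, &sec, &frac);
                assert(sec == (uint64_t)(ref >> 64));
                assert(frac == (uint64_t)ref);
        }
        return 0;
}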
> Index: lib/libc/sys/microtime.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/sys/microtime.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 microtime.c
> --- lib/libc/sys/microtime.c    6 Jul 2020 13:33:06 -0000      1.1
> +++ lib/libc/sys/microtime.c    14 Oct 2021 20:58:38 -0000
> @@ -45,10 +45,10 @@ binuptime(struct bintime *bt, struct tim
>          do {
>                  gen = tk->tk_generation;
>                  membar_consumer();
> -                *bt = tk->tk_offset;
>                  if (tc_delta(tk, &delta))
>                          return -1;
> -                bintimeaddfrac(bt, tk->tk_scale * delta, bt);
> +                TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
> +                bintimeadd(bt, &tk->tk_offset, bt);
>                  membar_consumer();
>          } while (gen == 0 || gen != tk->tk_generation);
>  
> @@ -65,7 +65,8 @@ binruntime(struct bintime *bt, struct ti
>                  membar_consumer();
>                  if (tc_delta(tk, &delta))
>                          return -1;
> -                bintimeaddfrac(&tk->tk_offset, tk->tk_scale * delta, bt);
> +                TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
> +                bintimeadd(bt, &tk->tk_offset, bt);
>                  bintimesub(bt, &tk->tk_naptime, bt);
>                  membar_consumer();
>          } while (gen == 0 || gen != tk->tk_generation);
> @@ -81,10 +82,10 @@ bintime(struct bintime *bt, struct timek
>          do {
>                  gen = tk->tk_generation;
>                  membar_consumer();
> -                *bt = tk->tk_offset;
>                  if (tc_delta(tk, &delta))
>                          return -1;
> -                bintimeaddfrac(bt, tk->tk_scale * delta, bt);
> +                TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
> +                bintimeadd(bt, &tk->tk_offset, bt);
>                  bintimeadd(bt, &tk->tk_boottime, bt);
>                  membar_consumer();
>          } while (gen == 0 || gen != tk->tk_generation);