On Thu, Oct 14, 2021 at 04:13:18PM -0500, Scott Cheloha wrote:
>
> [...]
>
> When we compute high resolution time, both in the kernel and in libc,
> we get a 32-bit (or smaller) value from the active timecounter and
> scale it up into a 128-bit bintime.
>
> The scaling math currently looks like this in the kernel:
>
> [...]
>
> The problem with this code is that if the product
>
> th->tc_scale * tc_delta(th)
>
> exceeds UINT64_MAX, the result overflows and we lose time.
>
> [...]
>
> The solution to this problem is to use the full 96-bit product when we
> scale the count up into a bintime. We're multiplying a u_int
> (32-bit), the count, by a uint64_t, the scale, but we're not capturing
> the upper 32 bits of that product. If we did, we would have a longer
> grace period between clock interrupts before we lost time.
>
> The attached patch adds a TIMECOUNT_TO_BINTIME() function to sys/time.h
> and puts it to use in sys/kern/kern_tc.c and lib/libc/sys/microtime.c.
> The math is a bit boring, see the patch if you are curious.
>
> As for the cost, there is a small but significant increase in overhead
> when reading the clock with the TSC. Slower timecounters (HPET, ACPI
> timer) are so slow the extra overhead is noise.
>
> [...]
>
> It looks to me that on amd64, userspace clock_gettime(2) is up to ~10%
> slower with the patch. But there is a lot of variation between the
> comparisons, so I don't think it's a consistent 10%. I'd say 10% is
> an upper bound.
6 month bump. Absent some sort of discussion I'm going to commit this
in a week, I'm confident the math is correct.
As before, the only potentially contentious aspect to this change is
the additional overhead that we will carry on every gettimeofday(2)
call, clock_gettime(2) call, and high-res time call in the kernel.
Profiling the overhead for this is very difficult, at least on amd64.
I am still confident about my 10% upper bound on amd64 with the
userspace TSC, but the actual overhead in a given a/b test varies a
lot within that range. I've seen everything from 2% to 9%. It's all
over the place, just never more than 10%.
I think 10% is tolerable. If you object to that, PLEASE speak up.
There are more complex ways to fix this problem that might eliminate
the overhead, but they are uglier.
If you are an evangelist for a non-amd64 platform I would really
appreciate an a/b test for the overhead in userspace. If you need
test code please ask on-list and I'll provide it.
Index: lib/libc/sys/microtime.c
===================================================================
RCS file: /cvs/src/lib/libc/sys/microtime.c,v
retrieving revision 1.1
diff -u -p -r1.1 microtime.c
--- lib/libc/sys/microtime.c 6 Jul 2020 13:33:06 -0000 1.1
+++ lib/libc/sys/microtime.c 29 Apr 2022 01:07:26 -0000
@@ -45,10 +45,10 @@ binuptime(struct bintime *bt, struct tim
do {
gen = tk->tk_generation;
membar_consumer();
- *bt = tk->tk_offset;
if (tc_delta(tk, &delta))
return -1;
- bintimeaddfrac(bt, tk->tk_scale * delta, bt);
+ TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+ bintimeadd(bt, &tk->tk_offset, bt);
membar_consumer();
} while (gen == 0 || gen != tk->tk_generation);
@@ -65,7 +65,8 @@ binruntime(struct bintime *bt, struct ti
membar_consumer();
if (tc_delta(tk, &delta))
return -1;
- bintimeaddfrac(&tk->tk_offset, tk->tk_scale * delta, bt);
+ TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+ bintimeadd(bt, &tk->tk_offset, bt);
bintimesub(bt, &tk->tk_naptime, bt);
membar_consumer();
} while (gen == 0 || gen != tk->tk_generation);
@@ -81,10 +82,10 @@ bintime(struct bintime *bt, struct timek
do {
gen = tk->tk_generation;
membar_consumer();
- *bt = tk->tk_offset;
if (tc_delta(tk, &delta))
return -1;
- bintimeaddfrac(bt, tk->tk_scale * delta, bt);
+ TIMECOUNT_TO_BINTIME(delta, tk->tk_scale, bt);
+ bintimeadd(bt, &tk->tk_offset, bt);
bintimeadd(bt, &tk->tk_boottime, bt);
membar_consumer();
} while (gen == 0 || gen != tk->tk_generation);
Index: sys/kern/kern_tc.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_tc.c,v
retrieving revision 1.75
diff -u -p -r1.75 kern_tc.c
--- sys/kern/kern_tc.c 24 Oct 2021 00:02:25 -0000 1.75
+++ sys/kern/kern_tc.c 29 Apr 2022 01:07:26 -0000
@@ -189,8 +189,8 @@ binuptime(struct bintime *bt)
th = timehands;
gen = th->th_generation;
membar_consumer();
- *bt = th->th_offset;
- bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
+ TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+ bintimeadd(bt, &th->th_offset, bt);
membar_consumer();
} while (gen == 0 || gen != th->th_generation);
}
@@ -278,7 +278,8 @@ binruntime(struct bintime *bt)
th = timehands;
gen = th->th_generation;
membar_consumer();
- bintimeaddfrac(&th->th_offset, th->th_scale * tc_delta(th), bt);
+ TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+ bintimeadd(bt, &th->th_offset, bt);
bintimesub(bt, &th->th_naptime, bt);
membar_consumer();
} while (gen == 0 || gen != th->th_generation);
@@ -303,8 +304,8 @@ bintime(struct bintime *bt)
th = timehands;
gen = th->th_generation;
membar_consumer();
- *bt = th->th_offset;
- bintimeaddfrac(bt, th->th_scale * tc_delta(th), bt);
+ TIMECOUNT_TO_BINTIME(tc_delta(th), th->th_scale, bt);
+ bintimeadd(bt, &th->th_offset, bt);
bintimeadd(bt, &th->th_boottime, bt);
membar_consumer();
} while (gen == 0 || gen != th->th_generation);
@@ -641,7 +642,8 @@ tc_windup(struct bintime *new_boottime,
ncount = 0;
th->th_offset_count += delta;
th->th_offset_count &= th->th_counter->tc_counter_mask;
- bintimeaddfrac(&th->th_offset, th->th_scale * delta, &th->th_offset);
+ TIMECOUNT_TO_BINTIME(delta, th->th_scale, &bt);
+ bintimeadd(&th->th_offset, &bt, &th->th_offset);
/*
* Ignore new offsets that predate the current offset.
Index: sys/sys/time.h
===================================================================
RCS file: /cvs/src/sys/sys/time.h,v
retrieving revision 1.61
diff -u -p -r1.61 time.h
--- sys/sys/time.h 19 Jun 2021 13:49:39 -0000 1.61
+++ sys/sys/time.h 29 Apr 2022 01:07:26 -0000
@@ -208,6 +208,17 @@ bintimesub(const struct bintime *bt, con
dt->frac = bt->frac - ct->frac;
}
+static inline void
+TIMECOUNT_TO_BINTIME(u_int count, uint64_t scale, struct bintime *bt)
+{
+ uint64_t hi64;
+
+ hi64 = count * (scale >> 32);
+ bt->sec = hi64 >> 32;
+ bt->frac = hi64 << 32;
+ bintimeaddfrac(bt, count * (scale & 0xffffffff), bt);
+}
+
/*-
* Background information:
*