Re: timeout(9): add clock-based timeouts (attempt 2)

2020-10-09 Thread Theo de Raadt
> I think this is good to go.  ok kettenis@
> 
> Did briefly discuss with Theo during k2k20 and the consensus was it
> should go in after release.  Which is now!

Agree.



Re: timeout(9): add clock-based timeouts (attempt 2)

2020-10-09 Thread Mark Kettenis
> From: Scott Cheloha 
> Date: Fri, 9 Oct 2020 13:03:05 -0500
> 
> Hey,
> 
> > On Oct 7, 2020, at 8:49 PM, 内藤 祐一郎  wrote:
> > 
> > Hi.
> > 
> > I'm looking forward to this patch being committed, because it solves
> > my problem with CARP timeouts.
> > 
> > IIJ, the company I work for, uses carp(4) on VMware ESXi hosts
> > for VPN and web gateway services.
> > 
> > One host is the carp(4) master and the other is the backup.
> > The active host sometimes fails over to the backup when the ESXi host
> > is under high CPU usage, and the OpenBSD machine's CPU-ready time
> > also averages high in the ESXi monitor.
> > 
> > A machine with high CPU-ready time delays sending carp advertisements
> > by 3 or 4 seconds, which is enough to trigger a failover to the backup.
> > 
> > In my investigation, the OpenBSD machine does not always get the CPU
> > under high CPU-ready conditions, even when the CPU is needed for the
> > interrupt handler.  Each delayed hardclock() call delays the tick
> > count.  Any single delay is small, but the delays are never recovered,
> > so the total delay can reach 3 or 4 seconds while the tick count
> > climbs to 100.  The tickless patch solves this delay.
> > 
> > I have tried adapting ip_carp.c to the tickless attempt 2 diff.
> > The delay of carp advertisements was reduced to about 2 seconds.
> 
> I'm glad to hear it improves things.  Thanks for testing it out.
> 
> >> On 2020/09/09 4:00, Mark Kettenis wrote:
> >> The diff looks reasonable to me, but I'd like to discuss the path
> >> forward with some people during the hackathon next week.
> > 
> > Was there any discussion at the hackathon?
> 
> Not that I heard.  I wasn't at the hackathon, though.
> 
> --
> 
> If I get an OK from someone I will commit what I have so far.
> 
> Where do we stand?
> 
> - The nitty-gritty details in this commit -- the hashing,
>   the loops, and the basic algorithm -- haven't changed
>   in almost a year.  I'm confident they work.
> 
> - The commit itself doesn't change any behavior because no
>   existing timeouts are converted to use timeout_set_kclock().
>   So we shouldn't see any regressions like last time until
>   someone deliberately changes an existing timeout to use the
>   kclock interfaces.
> 
> The thing that needs to be decided is how to go about dragging
> the rest of the tree into using the kclock timeout interfaces.
> 
> - Should we keep a tick-based timeout interface?  If so,
>   for how long?  Linux kept theirs as a distinct interface.
>   FreeBSD discarded theirs.
> 
> - Should we quietly reimplement timeout_add_sec(9), etc.,
>   in terms of kclock timeouts or should we do a full-tree
>   API change to explicitly use timeout_in_nsec()?
> 
> I don't think we can make such decisions without putting kclock
> timeouts into the tree so people can use them.
> 
> So, are you OK with this as-is?
> 
> Anybody else?

I think this is good to go.  ok kettenis@

Did briefly discuss with Theo during k2k20 and the consensus was it
should go in after release.  Which is now!



Re: timeout(9): add clock-based timeouts (attempt 2)

2020-10-09 Thread Scott Cheloha
Hey,

> On Oct 7, 2020, at 8:49 PM, 内藤 祐一郎  wrote:
> 
> Hi.
> 
> I'm looking forward to this patch being committed, because it solves
> my problem with CARP timeouts.
> 
> IIJ, the company I work for, uses carp(4) on VMware ESXi hosts
> for VPN and web gateway services.
> 
> One host is the carp(4) master and the other is the backup.
> The active host sometimes fails over to the backup when the ESXi host
> is under high CPU usage, and the OpenBSD machine's CPU-ready time
> also averages high in the ESXi monitor.
> 
> A machine with high CPU-ready time delays sending carp advertisements
> by 3 or 4 seconds, which is enough to trigger a failover to the backup.
> 
> In my investigation, the OpenBSD machine does not always get the CPU
> under high CPU-ready conditions, even when the CPU is needed for the
> interrupt handler.  Each delayed hardclock() call delays the tick
> count.  Any single delay is small, but the delays are never recovered,
> so the total delay can reach 3 or 4 seconds while the tick count
> climbs to 100.  The tickless patch solves this delay.
> 
> I have tried adapting ip_carp.c to the tickless attempt 2 diff.
> The delay of carp advertisements was reduced to about 2 seconds.

I'm glad to hear it improves things.  Thanks for testing it out.

>> On 2020/09/09 4:00, Mark Kettenis wrote:
>> The diff looks reasonable to me, but I'd like to discuss the path
>> forward with some people during the hackathon next week.
> 
> Was there any discussion at the hackathon?

Not that I heard.  I wasn't at the hackathon, though.

--

If I get an OK from someone I will commit what I have so far.

Where do we stand?

- The nitty-gritty details in this commit -- the hashing,
  the loops, and the basic algorithm -- haven't changed
  in almost a year.  I'm confident they work.

- The commit itself doesn't change any behavior because no
  existing timeouts are converted to use timeout_set_kclock().
  So we shouldn't see any regressions like last time until
  someone deliberately changes an existing timeout to use the
  kclock interfaces.

The thing that needs to be decided is how to go about dragging
the rest of the tree into using the kclock timeout interfaces.

- Should we keep a tick-based timeout interface?  If so,
  for how long?  Linux kept theirs as a distinct interface.
  FreeBSD discarded theirs.

- Should we quietly reimplement timeout_add_sec(9), etc.,
  in terms of kclock timeouts or should we do a full-tree
  API change to explicitly use timeout_in_nsec()?

I don't think we can make such decisions without putting kclock
timeouts into the tree so people can use them.

So, are you OK with this as-is?

Anybody else?



Re: timeout(9): add clock-based timeouts (attempt 2)

2020-10-07 Thread 内藤 祐一郎
Hi.

I'm looking forward to this patch being committed, because it solves
my problem with CARP timeouts.

IIJ, the company I work for, uses carp(4) on VMware ESXi hosts
for VPN and web gateway services.

One host is the carp(4) master and the other is the backup.
The active host sometimes fails over to the backup when the ESXi host
is under high CPU usage, and the OpenBSD machine's CPU-ready time
also averages high in the ESXi monitor.

A machine with high CPU-ready time delays sending carp advertisements
by 3 or 4 seconds, which is enough to trigger a failover to the backup.

In my investigation, the OpenBSD machine does not always get the CPU
under high CPU-ready conditions, even when the CPU is needed for the
interrupt handler.  Each delayed hardclock() call delays the tick
count.  Any single delay is small, but the delays are never recovered,
so the total delay can reach 3 or 4 seconds while the tick count
climbs to 100.  The tickless patch solves this delay.

I have tried adapting ip_carp.c to the tickless attempt 2 diff.
The delay of carp advertisements was reduced to about 2 seconds.
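
For a rough sense of the arithmetic behind the 3-4 second delay (the
per-tick lateness below is an assumed number chosen to match the
observed delay, not a measurement):

/*
 * Illustration only: with hz=100, a 1-second carp advertisement
 * interval is 100 ticks.  Lateness in delivering each hardclock()
 * is never recovered, so it accumulates across the interval.
 */
#include <stdio.h>

int
main(void)
{
	int hz = 100;		/* ticks per second */
	int late_ms = 35;	/* assumed average hardclock() lateness */

	printf("extra delay: %d ms\n", hz * late_ms);	/* 3500 ms */
	return 0;
}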


> On 2020/09/09 4:00, Mark Kettenis wrote:
> The diff looks reasonable to me, but I'd like to discuss the path
> forward with some people during the hackathon next week.

Was there any discussion at the hackathon?

—
Yuichiro NAITO
naito.yuich...@gmail.com





Re: timeout(9): add clock-based timeouts (attempt 2)

2020-09-08 Thread Mark Kettenis
> Date: Mon, 7 Sep 2020 18:50:44 -0500
> From: Scott Cheloha 
> 
> On Sat, Sep 05, 2020 at 01:11:59PM +0200, Mark Kettenis wrote:
> > > Date: Fri, 4 Sep 2020 17:55:39 -0500
> > > From: Scott Cheloha 
> > > 
> > > On Sat, Jul 25, 2020 at 08:46:08PM -0500, Scott Cheloha wrote:
> > > > 
> > > > [...]
> > > > 
> > > > I want to add clock-based timeouts to the kernel because tick-based
> > > > timeouts suffer from a few problems:
> > > > 
> > > > [...]
> > > > 
> > > > Basically, ticks are a poor approximation for the system clock.  We
> > > > should use the real thing where possible.
> > > > 
> > > > [...]
> > > > 
> > > > Thoughts on this approach?  Thoughts on the proposed API?
> > > 
> > > 6 week bump.
> > > 
> > > Attached is a rebased and streamlined diff.
> > > 
> > > Let's try again:
> > > 
> > > This patch adds support for timeouts scheduled against the hardware
> > > timecounter.  I call these "kclock timeouts".  They are distinct from
> > > the current tick-based timeouts because ticks are "software time", not
> > > "real time".
> > 
> > So what's the end game here?  Are these kclock-based timeouts going to
> > replace the tick-based timeouts at some point in the future?  I can
> > see why you want to have both in parallel for a while, but long-term I
> > don't think we want to keep both.
> 
> Ideally we would replace tick-based timeouts entirely with kclock
> timeouts eventually.
> 
> There are a few roadblocks, though:
> 
> 1. The scheduler is tick-based.  If you want to wait until the next
>    tick, the easiest way to do that is with timeout_add(9) or tsleep(9).

I don't think this really matters in most cases.  Keeping the tick as
the base for a scheduling quantum is probably wise for now, but I
don't think it matters that timeouts and tsleeps (especially tsleeps)
are actually synchronized to the scheduling clock.

> 2. Linux has ktimers, which are tick-based.  drm uses them.  Shouldn't
>    we have a tick-based timeout interface for compatibility with them?
>    We could fake it, like FreeBSD does, but doing so is probably more
>    complicated than just keeping support for tick-based timeouts.

You can easily emulate this using an absolute timer that you keep
rescheduling.  I think that is preferable to keeping a complete
separate tick-based timeout system.
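
A sketch of that emulation (timeout_at_ts() is the absolute interface
mentioned elsewhere in this thread; its exact signature, and the names
and period here, are assumptions for illustration):

/*
 * Emulate a periodic "tick" by advancing an absolute deadline by a
 * fixed period.  Handler latency then cannot accumulate as drift.
 */
struct timeout tick_to;
struct timespec next, period = { 0, 10000000 };	/* 10 ms period */

void
tick(void *arg)
{
	/* ... run the emulated ktimer work ... */

	/* advance the absolute deadline; "next" is seeded from the
	 * current uptime once, before the first arm */
	timespecadd(&next, &period, &next);
	timeout_at_ts(&tick_to, &next);
}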

> 3. Scheduling a timeout with timeout_add(9) is fast.  Scheduling a
>    timeout with timeout_in_nsec(9) involves a clock read.  It is slower.
>    It is probably too slow for some code.
> 
> (1) will be overcome if ever the scheduler is no longer tick-based.
> 
> (2) is tricky.  Maybe you or jsg@ have an opinion?

Not really.  But I don't think the Linux ktimers tick at the same rate
as ours, so I don't think it matters.

> (3) is somewhat easier to fix.  I intend to introduce a TIMEOUT_COARSE
> flag in the future which causes timeout_in_nsec() to call
> getnanouptime(9) instead of nanouptime(9).  Reading the timestamp is
> faster than reading the clock.  You lose accuracy, but any code
> worried about the overhead of reading the clock is probably not very
> concerned with accuracy.

Right.

> > We don't really want to do a wholesale conversion of APIs again I'd
> > say.  So at some point the existing timeout_add_xxx() calls should be
> > implemented in terms of "kclock timeouts".
> 
> We can do this, but we'll still need to change the calls that
> reschedule a periodic timeout to use the dedicated rescheduling
> interface.  Otherwise those periodic timeouts will drift.  They don't
> currently drift because a tick is a very coarse unit of time.  With
> nanosecond resolution we'll get drift.

Periodic timeouts are rare.  At least those that care about drift.

> > This implementation is still tick driven, so it doesn't really provide
> > sub-tick resolution.
> 
> Yes, that's right.  Each timeout maintains nanosecond resolution for
> its expiration time but will only actually run after hardclock(9) runs
> and dumps the timeout to softclock().
> 
> We would need to implement a more flexible clock interrupt scheduler
> to run timeouts in between hardclocks.
> 
> > What does that mean for testing this?  I mean if we spend a lot of time
> > now to verify that subsystems can tolerate the more fine-grained timeouts,
> > we need to do that again when you switch from having a periodic interrupt
> > driving the wheel to having a scheduled interrupt, don't we?
> 
> Yes.  But both changes can break things.
> 
> I think we should do kclock timeouts before sub-tick timeouts.  The
> former is a lot less disruptive than the latter, as the timeouts still
> run right after the hardclock.
> 
> And you need kclock timeouts to even test sub-tick timeouts anyway.
> 
> > > For now we have one kclock, KCLOCK_UPTIME, which corresponds to
> > > nanouptime(9).  In the future I intend to add support for runtime and
> > > UTC kclocks.
> > 
> > Do we really need that?  I suppose it helps implementing something
> > like clock_nanosleep() with the TIMER_ABSTIME flag for various
> > clock_id values?

Re: timeout(9): add clock-based timeouts (attempt 2)

2020-09-07 Thread Scott Cheloha
On Sat, Sep 05, 2020 at 01:11:59PM +0200, Mark Kettenis wrote:
> > Date: Fri, 4 Sep 2020 17:55:39 -0500
> > From: Scott Cheloha 
> > 
> > On Sat, Jul 25, 2020 at 08:46:08PM -0500, Scott Cheloha wrote:
> > > 
> > > [...]
> > > 
> > > I want to add clock-based timeouts to the kernel because tick-based
> > > timeouts suffer from a few problems:
> > > 
> > > [...]
> > > 
> > > Basically, ticks are a poor approximation for the system clock.  We
> > > should use the real thing where possible.
> > > 
> > > [...]
> > > 
> > > Thoughts on this approach?  Thoughts on the proposed API?
> > 
> > 6 week bump.
> > 
> > Attached is a rebased and streamlined diff.
> > 
> > Let's try again:
> > 
> > This patch adds support for timeouts scheduled against the hardware
> > timecounter.  I call these "kclock timeouts".  They are distinct from
> > the current tick-based timeouts because ticks are "software time", not
> > "real time".
> 
> So what's the end game here?  Are these kclock-based timeouts going to
> replace the tick-based timeouts at some point in the future?  I can
> see why you want to have both in parallel for a while, but long-term I
> don't think we want to keep both.

Ideally we would replace tick-based timeouts entirely with kclock
timeouts eventually.

There are a few roadblocks, though:

1. The scheduler is tick-based.  If you want to wait until the next
   tick, the easiest way to do that is with timeout_add(9) or tsleep(9).

2. Linux has ktimers, which are tick-based.  drm uses them.  Shouldn't
   we have a tick-based timeout interface for compatibility with them?
   We could fake it, like FreeBSD does, but doing so is probably more
   complicated than just keeping support for tick-based timeouts.

3. Scheduling a timeout with timeout_add(9) is fast.  Scheduling a
   timeout with timeout_in_nsec(9) involves a clock read.  It is slower.
   It is probably too slow for some code.

(1) will be overcome if ever the scheduler is no longer tick-based.

(2) is tricky.  Maybe you or jsg@ have an opinion?

(3) is somewhat easier to fix.  I intend to introduce a TIMEOUT_COARSE
flag in the future which causes timeout_in_nsec() to call
getnanouptime(9) instead of nanouptime(9).  Reading the timestamp is
faster than reading the clock.  You lose accuracy, but any code
worried about the overhead of reading the clock is probably not very
concerned with accuracy.
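
As a sketch of the proposed shape (TIMEOUT_COARSE does not exist yet;
this is an assumption about how timeout_in_nsec() might choose the
clock read, not committed code):

	struct timespec now;

	if (ISSET(to->to_flags, TIMEOUT_COARSE))
		getnanouptime(&now);	/* cached timestamp: cheap, coarse */
	else
		nanouptime(&now);	/* timecounter read: precise, slower */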

> We don't really want to do a wholesale conversion of APIs again I'd
> say.  So at some point the existing timeout_add_xxx() calls should be
> implemented in terms of "kclock timeouts".

We can do this, but we'll still need to change the calls that
reschedule a periodic timeout to use the dedicated rescheduling
interface.  Otherwise those periodic timeouts will drift.  They don't
currently drift because a tick is a very coarse unit of time.  With
nanosecond resolution we'll get drift.
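
To make the drift concrete (a sketch; my_to and the period are
illustrative, and the relative-rearm pattern shown first is what the
tree does today):

	/*
	 * Relative rearming measures the period from "now", so handler
	 * and scheduling latency is added to every period and accumulates:
	 */
	void
	periodic(void *arg)
	{
		/* ... do work; takes some nonzero time ... */
		timeout_in_nsec(&my_to, MSEC_TO_NSEC(100));
	}

	/*
	 * A dedicated rescheduling interface would instead advance an
	 * absolute deadline (next += period), so one late run does not
	 * push back every run after it.
	 */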

> This implementation is still tick driven, so it doesn't really provide
> sub-tick resolution.

Yes, that's right.  Each timeout maintains nanosecond resolution for
its expiration time but will only actually run after hardclock(9) runs
and dumps the timeout to softclock().

We would need to implement a more flexible clock interrupt scheduler
to run timeouts in between hardclocks.

> What does that mean for testing this?  I mean if we spend a lot of time
> now to verify that subsystems can tolerate the more fine-grained timeouts,
> we need to do that again when you switch from having a periodic interrupt
> driving the wheel to having a scheduled interrupt, don't we?

Yes.  But both changes can break things.

I think we should do kclock timeouts before sub-tick timeouts.  The
former is a lot less disruptive than the latter, as the timeouts still
run right after the hardclock.

And you need kclock timeouts to even test sub-tick timeouts anyway.

> > For now we have one kclock, KCLOCK_UPTIME, which corresponds to
> > nanouptime(9).  In the future I intend to add support for runtime and
> > UTC kclocks.
> 
> Do we really need that?  I suppose it helps implementing something
> like clock_nanosleep() with the TIMER_ABSTIME flag for various
> clock_id values?

Exactly.

FreeBSD decided to not support multiple clocks in their revamped
callout(9).  The result is a bit simpler (one clock) but in order to
implement absolute CLOCK_REALTIME sleeps for userspace they have this
flag for each thread that causes the thread to wake up and reschedule
itself whenever settimeofday(2) happens.

It's clever, but it seems messy to me.

I would rather support UTC timeouts as "first class citizens" of the
timeout subsystem.  Linux's hrtimers API supports UTC timeouts
explicitly.  I prefer their approach.

The advantage of real support is that the timeout(9) subsystem will
transparently handle rebucketing UTC timeouts if the clock jumps
backwards.  This is nice.

> > Why do we want kclock timeouts at all?
> > 
> > 1. Kclock timeouts expire at an actual time, not a tick.  They
> >    have nanosecond resolution and are NTP-sensitive.  Thus, they
> >    will *never* fire early.

Re: timeout(9): add clock-based timeouts (attempt 2)

2020-09-05 Thread Mark Kettenis
> Date: Fri, 4 Sep 2020 17:55:39 -0500
> From: Scott Cheloha 
> 
> On Sat, Jul 25, 2020 at 08:46:08PM -0500, Scott Cheloha wrote:
> > 
> > [...]
> > 
> > I want to add clock-based timeouts to the kernel because tick-based
> > timeouts suffer from a few problems:
> > 
> > [...]
> > 
> > Basically, ticks are a poor approximation for the system clock.  We
> > should use the real thing where possible.
> > 
> > [...]
> > 
> > Thoughts on this approach?  Thoughts on the proposed API?
> 
> 6 week bump.
> 
> Attached is a rebased and streamlined diff.
> 
> Let's try again:
> 
> This patch adds support for timeouts scheduled against the hardware
> timecounter.  I call these "kclock timeouts".  They are distinct from
> the current tick-based timeouts because ticks are "software time", not
> "real time".

So what's the end game here?  Are these kclock-based timeouts going to
replace the tick-based timeouts at some point in the future?  I can
see why you want to have both in parallel for a while, but long-term I
don't think we want to keep both.  We don't really want to do a
wholesale conversion of APIs again I'd say.  So at some point the
existing timeout_add_xxx() calls should be implemented in terms of
"kclock timeouts".

This implementation is still tick driven, so it doesn't really provide
sub-tick resolution.  What does that mean for testing this?  I mean if
we spend a lot of time now to verify that subsystems can tolerate the
more fine-grained timeouts, we need to do that again when you switch
from having a periodic interrupt driving the wheel to having a
scheduled interrupt, don't we?

> For now we have one kclock, KCLOCK_UPTIME, which corresponds to
> nanouptime(9).  In the future I intend to add support for runtime and
> UTC kclocks.

Do we really need that?  I suppose it helps implementing something
like clock_nanosleep() with the TIMER_ABSTIME flag for various
clock_id values?

> Why do we want kclock timeouts at all?
> 
> 1. Kclock timeouts expire at an actual time, not a tick.  They
> >    have nanosecond resolution and are NTP-sensitive.  Thus, they
> >    will *never* fire early.

Is there a lot of overhead in these being NTP-sensitive?  I'm asking
because for short timeouts you don't really care about NTP
corrections.

> 2. One upshot of nanosecond resolution is that we don't need to
>"round up" to the next tick when scheduling a timeout to prevent
>early execution.  The extra resolution allows us to reduce
>latency in some contexts.
> 
> 3. Kclock timeouts cover the entire range of the kernel timeline.
> >    We can remove the "tick loops" like the one in sys_nanosleep().
> 
> 4. Kclock timeouts are scheduled as absolute deadlines.  This makes
> >    supporting absolute timeouts trivial, which means we can add support
> >    for clock_nanosleep(2) and the absolute pthread timeouts to the
> >    kernel.
> 
> Kclock timeouts aren't actually used anywhere yet, so merging this
> patch will not break anything like last time (CC bluhm@).
> 
> In a subsequent diff I will put them to use in tsleep_nsec(9) etc.
> This will enable those interfaces to block for less than a tick, which
> in turn will allow userspace to block for less than a tick in e.g.
> futex(2) and poll(2).  pd@ has verified that this fixes the "time
> problem" in OpenBSD vmm(4) VMs (CC pd@).

Like I said above, running the timeout is still tick-driven, isn't it?
This avoids having to wait at least a tick for timeouts that are
shorter than a tick, but it means the timeout can still be extended up
to a full tick.

> You initialize kclock timeouts with timeout_set_kclock().  You
> schedule them with timeout_in_nsec(), a relative timeout interface
> that accepts a count of nanoseconds.  If your timeout is in some
> other unit (seconds, milliseconds, whatever) you must convert it
> to nanoseconds before scheduling.  Something like this will work:
> 
>   timeout_in_nsec(&my_timeout, SEC_TO_NSEC(1));
> 
> There won't be a flavored API supporting every conceivable time unit.

So this is where I get worried.  What is the game plan?  Slowly
convert everything from timeout_add_xxx() to timeout_in_nsec()?  Or
offer this as a temporary interface for people to test some critical
subsystems after which we dump it and simply re-implement
timeout_add_xxx() as kclock-based timeouts?

> In the future I will expose an absolute timeout interface and a
> periodic timeout rescheduling interface.  We don't need either of
> these interfaces to start, though.

Not sure how useful a periodic timeout rescheduling interface really
is if you have an absolute timeout interface.  And isn't
timeout_at_ts() already implemented in the diff?

> Tick-based timeouts and kclock-based timeouts are *not* compatible.
> You cannot schedule a kclock timeout with timeout_add(9).  You cannot
> schedule a tick-based timeout with timeout_in_nsec(9).  I have added
> KASSERTs to prevent this.
> 
> Scheduling a kclock timeout with timeout_in_nsec() is more costly than
> scheduling a tick-based timeout with timeout_add(9) because you have
> to read the hardware timecounter.

Re: timeout(9): add clock-based timeouts (attempt 2)

2020-09-05 Thread Todd Carson


Scott Cheloha writes:

> On Fri, Sep 04, 2020 at 05:55:40PM -0500, Scott Cheloha wrote:
>> On Sat, Jul 25, 2020 at 08:46:08PM -0500, Scott Cheloha wrote:
>> > Basically, ticks are a poor approximation for the system clock.  We
>> > should use the real thing where possible.
>> > 
>> > [...]
>> > 
>> > Thoughts on this approach?  Thoughts on the proposed API?
>> 
>> 6 week bump.
>> 

This might not be a very useful data point, but I ran this aggressively
for a couple weeks using clock timeouts for process sleeps, like this:

--- kern/kern_fork.c	20 Mar 2020 08:14:07 -0000	1.225
+++ kern/kern_fork.c	26 Jul 2020 09:34:56 -0000
@@ -166,7 +166,7 @@ thread_new(struct proc *parent, vaddr_t 
/*
 * Initialize the timeouts.
 */
-   timeout_set(&p->p_sleep_to, endtsleep, p);
+   timeout_set_kclock(&p->p_sleep_to, endtsleep, p, 0, KCLOCK_UPTIME);
 
return p;
 }

On a VM which has always had a lot of problems with kqueue timers firing
late and causing dovecot to complain "time jumped forward" in syslog,
using the clock timeouts eliminated the complaints and caught a much
larger number of late timeouts in the kern.timeout_stats.

I didn't notice any adverse side effects and I didn't try to measure
performance impact.

> +uint32_t
> +timeout_maskwheel(uint32_t level, const struct timespec *abstime)
> +{
> + uint32_t hi, lo;
> +
> + hi = abstime->tv_sec << 7;
> + lo = abstime->tv_nsec / 7812500;
> +
> + return ((hi | lo) >> (level * WHEELBITS)) & WHEELMASK;
> +}

I found it a little hard to understand at first where 7812500 came from;
would it be too ugly to write it as (10^9 / (1 << 7))?
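
For reference, the constant is just the bucket width in nanoseconds:
one second split across 2^7 sub-second buckets, matching the
tv_sec << 7 shift above.  A quick check:

/* 10^9 ns / 2^7 buckets = 7812500 ns per bucket, so
 * tv_nsec / 7812500 yields the 7 low-order wheel bits
 * that slot in under tv_sec << 7. */
#include <assert.h>

int
main(void)
{
	assert(1000000000L / (1 << 7) == 7812500);
	return 0;
}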



Re: timeout(9): add clock-based timeouts (attempt 2)

2020-09-04 Thread Scott Cheloha
On Fri, Sep 04, 2020 at 05:55:40PM -0500, Scott Cheloha wrote:
> On Sat, Jul 25, 2020 at 08:46:08PM -0500, Scott Cheloha wrote:
> > 
> > [...]
> > 
> > I want to add clock-based timeouts to the kernel because tick-based
> > timeouts suffer from a few problems:
> > 
> > [...]
> > 
> > Basically, ticks are a poor approximation for the system clock.  We
> > should use the real thing where possible.
> > 
> > [...]
> > 
> > Thoughts on this approach?  Thoughts on the proposed API?
> 
> 6 week bump.
> 
> Attached is a rebased and streamlined diff.
> 
> [...]

Updated diff fixes a pasto.

Index: kern/kern_timeout.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.79
diff -u -p -r1.79 kern_timeout.c
--- kern/kern_timeout.c	7 Aug 2020 00:45:25 -0000	1.79
+++ kern/kern_timeout.c	5 Sep 2020 01:54:42 -0000
@@ -1,4 +1,4 @@
-/* $OpenBSD: kern_timeout.c,v 1.79 2020/08/07 00:45:25 cheloha Exp $ */
+/* $OpenBSD: kern_timeout.c,v 1.77 2020/08/01 08:40:20 anton Exp $ */
 /*
  * Copyright (c) 2001 Thomas Nordin 
  * Copyright (c) 2000-2001 Artur Grabowski 
@@ -64,16 +64,27 @@ struct timeoutstat tostat;  /* [T] stati
  * of the global variable "ticks" when the timeout should be called. There are
  * four levels with 256 buckets each.
  */
-#define BUCKETS 1024
+#define WHEELCOUNT 4
 #define WHEELSIZE 256
 #define WHEELMASK 255
 #define WHEELBITS 8
+#define BUCKETS (WHEELCOUNT * WHEELSIZE)
 
-struct circq timeout_wheel[BUCKETS];   /* [T] Queues of timeouts */
+struct circq timeout_wheel[BUCKETS];   /* [T] Tick-based timeouts */
+struct circq timeout_wheel_kc[BUCKETS];/* [T] Clock-based timeouts */
 struct circq timeout_new;  /* [T] New, unscheduled timeouts */
 struct circq timeout_todo; /* [T] Due or needs rescheduling */
 struct circq timeout_proc; /* [T] Due + needs process context */
 
+time_t timeout_level_width[WHEELCOUNT];/* [I] Wheel level width (seconds) */
+struct timespec tick_ts;   /* [I] Length of a tick (1/hz secs) */
+
+struct kclock {
+   struct timespec kc_lastscan;/* [T] Clock time at last wheel scan */
+   struct timespec kc_late;/* [T] Late if due prior */
+   struct timespec kc_offset;  /* [T] Offset from primary kclock */
+} timeout_kclock[KCLOCK_MAX];
+
 #define MASKWHEEL(wheel, time) (((time) >> ((wheel)*WHEELBITS)) & WHEELMASK)
 
 #define BUCKET(rel, abs)   \
@@ -155,9 +166,15 @@ struct lock_type timeout_spinlock_type =
((needsproc) ? &timeout_sleeplock_obj : &timeout_spinlock_obj)
 #endif
 
+void kclock_nanotime(int, struct timespec *);
 void softclock(void *);
 void softclock_create_thread(void *);
+void softclock_process_kclock_timeout(struct timeout *, int);
+void softclock_process_tick_timeout(struct timeout *, int);
 void softclock_thread(void *);
+uint32_t timeout_bucket(struct timeout *);
+uint32_t timeout_maskwheel(uint32_t, const struct timespec *);
+void timeout_run(struct timeout *);
 void timeout_proc_barrier(void *);
 
 /*
@@ -207,13 +224,19 @@ timeout_sync_leave(int needsproc)
 void
 timeout_startup(void)
 {
-   int b;
+   int b, level;
 
CIRCQ_INIT(&timeout_new);
CIRCQ_INIT(&timeout_todo);
CIRCQ_INIT(&timeout_proc);
for (b = 0; b < nitems(timeout_wheel); b++)
CIRCQ_INIT(&timeout_wheel[b]);
+   for (b = 0; b < nitems(timeout_wheel_kc); b++)
+   CIRCQ_INIT(&timeout_wheel_kc[b]);
+
+   for (level = 0; level < nitems(timeout_level_width); level++)
+   timeout_level_width[level] = 2 << (level * WHEELBITS);
+   NSEC_TO_TIMESPEC(tick_nsec, &tick_ts);
 }
 
 void
@@ -229,25 +252,39 @@ timeout_proc_init(void)
kthread_create_deferred(softclock_create_thread, NULL);
 }
 
+static inline void
+_timeout_set(struct timeout *to, void (*fn)(void *), void *arg, int flags,
+int kclock)
+{
+   to->to_func = fn;
+   to->to_arg = arg;
+   to->to_flags = flags | TIMEOUT_INITIALIZED;
+   to->to_kclock = kclock;
+}
+
 void
 timeout_set(struct timeout *new, void (*fn)(void *), void *arg)
 {
-   timeout_set_flags(new, fn, arg, 0);
+   _timeout_set(new, fn, arg, 0, KCLOCK_NONE);
 }
 
 void
 timeout_set_flags(struct timeout *to, void (*fn)(void *), void *arg, int flags)
 {
-   to->to_func = fn;
-   to->to_arg = arg;
-   to->to_process = NULL;
-   to->to_flags = flags | TIMEOUT_INITIALIZED;
+   _timeout_set(to, fn, arg, flags, KCLOCK_NONE);
 }
 
 void
 timeout_set_proc(struct timeout *new, void (*fn)(void *), void *arg)
 {
-   timeout_set_flags(new, fn, arg, TIMEOUT_PROC);
+   _timeout_set(new, fn, arg, TIMEOUT_PROC, KCLOCK_NONE);
+}
+
+void
+timeout_set_kclock(struct timeout *to, void (*fn)(void *), void *arg, int flags,
+int kclock)
+{
+   _timeout_set(to, fn, arg, flags | TIMEOUT_KCLOCK, kclock);
 }
 
 int
@@ -257,6 +294,8 @@ timeout_add(struct timeout 

Re: timeout(9): add clock-based timeouts (attempt 2)

2020-09-04 Thread Scott Cheloha
On Sat, Jul 25, 2020 at 08:46:08PM -0500, Scott Cheloha wrote:
> 
> [...]
> 
> I want to add clock-based timeouts to the kernel because tick-based
> timeouts suffer from a few problems:
> 
> [...]
> 
> Basically, ticks are a poor approximation for the system clock.  We
> should use the real thing where possible.
> 
> [...]
> 
> Thoughts on this approach?  Thoughts on the proposed API?

6 week bump.

Attached is a rebased and streamlined diff.

Let's try again:

This patch adds support for timeouts scheduled against the hardware
timecounter.  I call these "kclock timeouts".  They are distinct from
the current tick-based timeouts because ticks are "software time", not
"real time".

For now we have one kclock, KCLOCK_UPTIME, which corresponds to
nanouptime(9).  In the future I intend to add support for runtime and
UTC kclocks.

Why do we want kclock timeouts at all?

1. Kclock timeouts expire at an actual time, not a tick.  They
   have nanosecond resolution and are NTP-sensitive.  Thus, they
   will *never* fire early.

2. One upshot of nanosecond resolution is that we don't need to
   "round up" to the next tick when scheduling a timeout to prevent
   early execution.  The extra resolution allows us to reduce
   latency in some contexts.

3. Kclock timeouts cover the entire range of the kernel timeline.
   We can remove the "tick loops" like the one in sys_nanosleep().

4. Kclock timeouts are scheduled as absolute deadlines.  This makes
   supporting absolute timeouts trivial, which means we can add support
   for clock_nanosleep(2) and the absolute pthread timeouts to the
   kernel.

Kclock timeouts aren't actually used anywhere yet, so merging this
patch will not break anything like last time (CC bluhm@).

In a subsequent diff I will put them to use in tsleep_nsec(9) etc.
This will enable those interfaces to block for less than a tick, which
in turn will allow userspace to block for less than a tick in e.g.
futex(2) and poll(2).  pd@ has verified that this fixes the "time
problem" in OpenBSD vmm(4) VMs (CC pd@).

You initialize kclock timeouts with timeout_set_kclock().  You
schedule them with timeout_in_nsec(), a relative timeout interface
that accepts a count of nanoseconds.  If your timeout is in some
other unit (seconds, milliseconds, whatever) you must convert it
to nanoseconds before scheduling.  Something like this will work:

	timeout_in_nsec(&my_timeout, SEC_TO_NSEC(1));

There won't be a flavored API supporting every conceivable time unit.
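
Putting the two calls together, a minimal sketch using the interfaces
from the diff below (the handler and variable names are illustrative):

struct timeout my_timeout;

void
my_handler(void *arg)
{
	/* runs from softclock() once the deadline has passed */
}

/* at initialization: bind the timeout to the uptime kclock */
timeout_set_kclock(&my_timeout, my_handler, NULL, 0, KCLOCK_UPTIME);

/* later: fire roughly one second from now */
timeout_in_nsec(&my_timeout, SEC_TO_NSEC(1));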

In the future I will expose an absolute timeout interface and a
periodic timeout rescheduling interface.  We don't need either of
these interfaces to start, though.

Tick-based timeouts and kclock-based timeouts are *not* compatible.
You cannot schedule a kclock timeout with timeout_add(9).  You cannot
schedule a tick-based timeout with timeout_in_nsec(9).  I have added
KASSERTs to prevent this.

Scheduling a kclock timeout with timeout_in_nsec() is more costly than
scheduling a tick-based timeout with timeout_add(9) because you have
to read the hardware timecounter.  The cost will vary with your clock:
bad clocks have lots of overhead, good clocks have low-to-no overhead.
The programmer will need to decide if the potential overhead is too
high when employing these timeouts.  In most cases the overhead will
not be a problem.  The network stack is one spot where it might be.

Processing the kclock timeout wheel during hardclock(9) adds
negligible overhead to that routine.

Processing a kclock timeout during softclock() is roughly 4 times as
expensive as processing a tick-based timeout.  At idle on my 2 GHz
amd64 machine, tick-based timeouts take ~125 cycles to process while
kclock timeouts take ~500 cycles.  The average cost seems to drop as
more kclock timeouts are processed, though I can't really explain why.

Thoughts?  ok?

Index: kern/kern_timeout.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_timeout.c,v
retrieving revision 1.79
diff -u -p -r1.79 kern_timeout.c
--- kern/kern_timeout.c	7 Aug 2020 00:45:25 -0000	1.79
+++ kern/kern_timeout.c	4 Sep 2020 22:41:12 -0000
@@ -1,4 +1,4 @@
-/* $OpenBSD: kern_timeout.c,v 1.79 2020/08/07 00:45:25 cheloha Exp $ */
+/* $OpenBSD: kern_timeout.c,v 1.77 2020/08/01 08:40:20 anton Exp $ */
 /*
  * Copyright (c) 2001 Thomas Nordin 
  * Copyright (c) 2000-2001 Artur Grabowski 
@@ -64,16 +64,27 @@ struct timeoutstat tostat;  /* [T] stati
  * of the global variable "ticks" when the timeout should be called. There are
  * four levels with 256 buckets each.
  */
-#define BUCKETS 1024
+#define WHEELCOUNT 4
 #define WHEELSIZE 256
 #define WHEELMASK 255
 #define WHEELBITS 8
+#define BUCKETS (WHEELCOUNT * WHEELSIZE)
 
-struct circq timeout_wheel[BUCKETS];   /* [T] Queues of timeouts */
+struct circq timeout_wheel[BUCKETS];   /* [T] Tick-based timeouts */
+struct circq timeout_wheel_kc[BUCKETS];/* [T] Clock-based timeouts */
 struct circq