Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-18 Thread Martin Pieuchot
On 17/09/23(Sun) 11:22, Scott Cheloha wrote:
> v2 is attached.

Thanks.

> Clockintrs now have an argument.  If we pass the PCB as argument, we
> can avoid doing a linear search to find the PCB during the interrupt.
> 
> One thing I'm unsure about is whether I need to add a "barrier" flag
> to clockintr_cancel() and/or clockintr_disestablish().  This would
> cause the caller to block until the clockintr callback function has
> finished executing, which would ensure that it was safe to free the
> PCB.

Please do not reinvent the wheel.  Try to imitate what other kernel APIs
do.  Look at task_add(9) and timeout_add(9).  Call the functions add/del()
to match existing APIs, then we can add a clockintr_del_barrier() if needed.
Do not introduce functions before we need them.  I hope we won't need
it.

> On Mon, Sep 04, 2023 at 01:39:25PM +0100, Martin Pieuchot wrote:
> > On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> > > On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > [...]
> > > The goal of clockintr is to provide a machine-independent API for
> > > scheduling clock interrupts.  You can use it to implement something
> > > like hardclock() or statclock().  We are already using it to implement
> > > these functions, among others.
> > 
> > After reading all the code and the previous manuals, I understand it as
> > a low-level per-CPU timeout API with nanosecond precision.  Is that it?
> 
> Yes.
> 
> The distinguishing feature is that it is usually wired up to a
> platform backend, so it can deliver the interrupt at the requested
> expiration time with relatively low error.

When should we use it and when should we prefer timeout(9)?  That's
unclear to me and I'd love to have it clearly documented.

What happened to the manual? 
 
> > Apart from running periodically on a given CPU an important need for
> > dt(4) is to get a timestamp for every event.  Currently nanotime(9)
> > is used.  This is global to the system.  I saw that ftrace uses different
> > clocks per-CPU which might be something to consider now that we're
> > moving to a per-CPU API.
> > 
> > It's all about cost of the instrumentation.  Note that if we use a
> > different per-CPU timestamp it has to work outside of clockintr because
> > probes can be anywhere.
> > 
> > Regarding clockintr_establish(), I'd rather have it *not* allocate the
> > `clockintr'.  I'd prefer to waste some more space in every PCB.  The
> > reason for this is to keep allocation in dt(4) confined to
> > dt_pcb_alloc(), so that at some point in the future we can avoid going
> > through malloc(9) and not interfere with the rest of the kernel.  Is
> > there any downside to this?
> 
> You're asking me to change from callee-allocated clockintrs to
> caller-allocated clockintrs.  I don't want to do this.

Why not?  What is your plan?  Do you want to put many clockintrs in the
kernel?  I don't understand where you're going.  Can you explain?  From
my point of view this design decision only adds complexity for no good
reason.

> I am hoping to experiment with using a per-CPU pool for clockintrs
> during the next release cycle.  I think keeping all the clockintrs on
> a single page in memory will have cache benefits when reinserting
> clockintrs into the sorted data structure.

I do not understand this wish for micro optimization.

The API has only one dynamic consumer: dt(4).  I tell you I'd prefer to
allocate the structure outside of your API and you tell me you don't
want to.  You want to use a pool.  Really, I don't understand...  Please
let's design an API for our needs and not one based on outside best
practices.

Or am I completely missing something from the clockintr world conquest
plan?

If you want to play with pools, there is plenty of other stuff you can do :o)
I just don't understand.

> > Can we have a different hook for the interval provider?
> 
> I think I have added this now?
> 
> "profile" uses dt_prov_profile_intr().
> 
> "interval" uses dt_prov_interval_intr().
> 
> Did you mean a different kind of "hook"?

That's it, and please remove DT_ENTER().  There's no need to use the
macro inside dt(4).  I thought I had already mentioned it.

> > Since we need only one clockintr and we don't care about the CPU
> > should we pick a random one?  Could that be implemented by passing
> > a NULL "struct cpu_info *" pointer to clockintr_establish()?  So
> > multiple "interval" probes would run on different CPUs...
> 
> It would be simpler to just stick to running the "interval" provider
> on the primary CPU.

Well, the primary CPU is already too loaded with interrupts on OpenBSD.
So if we can avoid adding to it for no good reason, I'd prefer that.

> If you expect a large number of "interval" probes to be active at
> once, it might help to choose the CPU round-robin style.  But that
> would be an optimization for a later step.

Let's do that now!  There's no limitation in your API, right?

> > I'd rather keep "struct dt_softc" private to not create too convoluted
> > code.
> 
> I can't figure out how to 

Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-17 Thread Scott Cheloha
v2 is attached.

Clockintrs now have an argument.  If we pass the PCB as argument, we
can avoid doing a linear search to find the PCB during the interrupt.

One thing I'm unsure about is whether I need to add a "barrier" flag
to clockintr_cancel() and/or clockintr_disestablish().  This would
cause the caller to block until the clockintr callback function has
finished executing, which would ensure that it was safe to free the
PCB.

You're already using SMR, though, so a barrier may be totally
unnecessary.

On Mon, Sep 04, 2023 at 01:39:25PM +0100, Martin Pieuchot wrote:
> On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> > On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > > [...] 
> > > The only behavior that needs to be preserved is the output of dumping
> > > stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly need to
> > > be adapted with this change.  You can figure that out by looking at the
> > > output of /usr/src/share/btrace/kprofile.bt without and with this diff.
> > > 
> > > Please generate a FlameGraph to make sure they're still the same.
> > 
> > dt_prov_profile_intr() runs at the same stack depth as hardclock(), so
> > indeed they are still the same.
> 
> Lovely.
> 
> > > Apart from that I'd prefer if we could skip the mechanical change and
> > > go straight to what dt(4) needs.  Otherwise we will have to re-design
> > > everything.
> > 
> > I think a mechanical "move the code from point A to point B" patch is
> > useful.  It makes the changes easier to follow when tracing the
> > revision history in the future.
> > 
> > If you insist on skipping it, though, I think I can live without it.
> 
> I do insist.  It is really hard for me to follow and work with you
> because you're too verbose for my capacity.  If you want to work with
> me, please do smaller steps and do not mix so much in big diffs.  I
> have plenty of possible comments but can't deal with huge chunks.
> 
> > > The current code assumes the periodic entry points are external to dt(4).
> > > This diff moves them in the middle of dt(4) but keeps the existing flow
> > > which makes the code very convoluted.
> > > 
> > > A starting point to understand the added complexity is to see that the
> > > DT_ENTER() macros are no longer needed if we move the entry points inside
> > > dt(4).
> > 
> > I did see that.  It seems really easy to remove the macros in a
> > subsequent patch, though.
> > 
> > Again, if you want to do it all in one patch that's OK.
> 
> Yes please.
> 
> > > The first periodic timeout is dt_prov_interval_enter().  It could be
> > > implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
> > > approach is that it uses too much code in the kernel which is a problem
> > > when instrumenting the kernel itself.  Every subsystem used by dt(4) is
> > > impossible to instrument with btrace(8).
> > 
> > I think you can avoid this instrumentation problem by using clockintr,
> > where the callback functions are run from the hardware interrupt
> > context, just like hardclock().
> 
> Fair enough.
> 
> > > The second periodic timeout is dt_prov_profile_enter().  It is similar
> > > to the previous one and has to run on every CPU.
> > > 
> > > Both are currently bound to tick, but we want per-PCB time resolution.
> > > We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
> > > timeouts fire.
> > 
> > My current thought is that each PCB could have its own "dp_period" or
> > "dp_interval", a count of nanoseconds.
> 
> Indeed.  We can have `dp_nsecs' and use that to determine if
> clockintr_advance() needs to be called in dt_ioctl_record_start().

I have added dt_pcb.dp_nsecs.

> > > [...]
> > 
> > The goal of clockintr is to provide a machine-independent API for
> > scheduling clock interrupts.  You can use it to implement something
> > like hardclock() or statclock().  We are already using it to implement
> > these functions, among others.
> 
> After reading all the code and the previous manuals, I understand it as
> a low-level per-CPU timeout API with nanosecond precision.  Is that it?

Yes.

The distinguishing feature is that it is usually wired up to a
platform backend, so it can deliver the interrupt at the requested
expiration time with relatively low error.

> > > >   One alternative is to start running the clock interrupts when they
> > > >   are allocated in dtpv_alloc() and stop them when they are freed in
> > > >   dtpv_dealloc().  This is wasteful, though: the PCBs are not recording
> > > >   yet, so the interrupts won't perform any useful work until the
> > > >   controlling process enables recording.
> > > > 
> > > >   An additional pair of provider hooks, e.g. "dtpv_record_start" and
> > > >   "dtpv_record_stop", might resolve this.
> > > 
> > > Another alternative would be to have a single hardclock-like handler for
> > > dt(4).  Sorry, I don't understand the big picture behind clockintr, so I
> > > can't help.
> > > 
> > > > - We haven't needed to destroy clock 

Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-04 Thread Martin Pieuchot
On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > [...] 
> > The only behavior that needs to be preserved is the output of dumping
> > stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly need to
> > be adapted with this change.  You can figure that out by looking at the
> > output of /usr/src/share/btrace/kprofile.bt without and with this diff.
> > 
> > Please generate a FlameGraph to make sure they're still the same.
> 
> dt_prov_profile_intr() runs at the same stack depth as hardclock(), so
> indeed they are still the same.

Lovely.

> > Apart from that I'd prefer if we could skip the mechanical change and
> > go straight to what dt(4) needs.  Otherwise we will have to re-design
> > everything.
> 
> I think a mechanical "move the code from point A to point B" patch is
> useful.  It makes the changes easier to follow when tracing the
> revision history in the future.
> 
> If you insist on skipping it, though, I think I can live without it.

I do insist.  It is really hard for me to follow and work with you
because you're too verbose for my capacity.  If you want to work with
me, please do smaller steps and do not mix so much in big diffs.  I
have plenty of possible comments but can't deal with huge chunks.

> > The current code assumes the periodic entry points are external to dt(4).
> > This diff moves them in the middle of dt(4) but keeps the existing flow
> > which makes the code very convoluted.
> > 
> > A starting point to understand the added complexity is to see that the
> > DT_ENTER() macros are no longer needed if we move the entry points inside
> > dt(4).
> 
> I did see that.  It seems really easy to remove the macros in a
> subsequent patch, though.
> 
> Again, if you want to do it all in one patch that's OK.

Yes please.

> > The first periodic timeout is dt_prov_interval_enter().  It could be
> > implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
> > approach is that it uses too much code in the kernel which is a problem
> > when instrumenting the kernel itself.  Every subsystem used by dt(4) is
> > impossible to instrument with btrace(8).
> 
> I think you can avoid this instrumentation problem by using clockintr,
> where the callback functions are run from the hardware interrupt
> context, just like hardclock().

Fair enough.

> > The second periodic timeout is dt_prov_profile_enter().  It is similar
> > to the previous one and has to run on every CPU.
> > 
> > Both are currently bound to tick, but we want per-PCB time resolution.
> > We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
> > timeouts fire.
> 
> My current thought is that each PCB could have its own "dp_period" or
> "dp_interval", a count of nanoseconds.

Indeed.  We can have `dp_nsecs' and use that to determine if
clockintr_advance() needs to be called in dt_ioctl_record_start().

> > > - Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
> > >   PCB's implementing provider needs a clock interrupt to do its work, it
> > >   stores the handle in dp_clockintr.  The handle, if any, is established
> > >   during dtpv_alloc() and disestablished during dtpv_dealloc().
> > 
> > Sorry, but as I said I don't understand what is a clockintr.  How does it
> > fit in the kernel and how is it supposed to be used?
> 
> The goal of clockintr is to provide a machine-independent API for
> scheduling clock interrupts.  You can use it to implement something
> like hardclock() or statclock().  We are already using it to implement
> these functions, among others.

After reading all the code and the previous manuals, I understand it as
a low-level per-CPU timeout API with nanosecond precision.  Is that it?

> > Why have it per PCB and not per provider or for the whole driver instead?
> > Per-PCB implies that if I run 3 different profiling on a 32 CPU machines
> > I now have 96 different clockintr.  Is it what we want?
> 
> Yes, I think that sounds fine.  If we run into scaling problems we can
> always just change the underlying data structure to an RB tree or a
> minheap.

Fine.

> > If so, why not simply use timeout(9)?  What's the difference?
> 
> Some code needs to run from a hardware interrupt context.  It's going
> to be easier to profile more of the kernel if we're collecting stack
> traces during a clock interrupt.
> 
> Timeouts run at IPL_SOFTCLOCK.  You can profile process context
> code...  that's it.  Timeouts also only run on a single CPU.  Per-CPU
> timeout wheels are at least a year away if not longer.

It's not clear to me which code needs what.  I'd love to see that
explained in the manual.

> > >   One alternative is to start running the clock interrupts when they
> > >   are allocated in dtpv_alloc() and stop them when they are freed in
> > >   dtpv_dealloc().  This is wasteful, though: the PCBs are not recording
> > >   yet, so the interrupts won't perform any useful work until the
> > >   controlling 

Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-09-04 Thread Martin Pieuchot
On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > On 23/08/23(Wed) 18:52, Scott Cheloha wrote:
> > > This is the next patch in the clock interrupt reorganization series.
> > 
> > Thanks for your diff.  I'm sorry but it is really hard for me to help
> > review this diff because there is still no man page for this API+subsystem.
> > 
> > Can we start with that please?
> 
> Sure, a first draft of a clockintr_establish.9 manpage is included
> below.

Lovely, I'll answer that first in this email.  Please commit it, so
we can tweak it in tree.  ok mpi@

> We also have a manpage in the tree, clockintr.9.  It is a bit out of
> date, but it covers the broad strokes of how the driver-facing portion
> of the subsystem works.

Currently reading it.  Should we get rid of `schedhz' in the manpage and
of its remnants in the kernel?

Why isn't the statclock() always randomized?  I see a couple of
clocks/archs that do not use CL_RNDSTAT...  Is there any technical
reason apart from testing?

I don't understand what you mean with "Until we understand scheduler
lock contention better".  What does lock contention have to do with
firing hardclock and statclock at the same moment?

> Index: share/man/man9/clockintr_establish.9
> ===
> RCS file: share/man/man9/clockintr_establish.9
> diff -N share/man/man9/clockintr_establish.9
> --- /dev/null 1 Jan 1970 00:00:00 -
> +++ share/man/man9/clockintr_establish.9  26 Aug 2023 01:44:37 -
> @@ -0,0 +1,239 @@
> +.\" $OpenBSD$
> +.\"
> +.\" Copyright (c) 2020-2023 Scott Cheloha 
> +.\"
> +.\" Permission to use, copy, modify, and distribute this software for any
> +.\" purpose with or without fee is hereby granted, provided that the above
> +.\" copyright notice and this permission notice appear in all copies.
> +.\"
> +.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> +.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> +.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> +.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> +.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> +.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> +.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> +.\"
> +.Dd $Mdocdate$
> +.Dt CLOCKINTR_ESTABLISH 9
> +.Os
> +.Sh NAME
> +.Nm clockintr_advance ,
> +.Nm clockintr_cancel ,
> +.Nm clockintr_disestablish ,
> +.Nm clockintr_establish ,
> +.Nm clockintr_expiration ,
> +.Nm clockintr_nsecuptime ,
> +.Nm clockintr_schedule ,
> +.Nm clockintr_stagger
> +.Nd schedule a function for execution from clock interrupt context
> +.Sh SYNOPSIS
> +.In sys/clockintr.h
> +.Ft uint64_t
> +.Fo clockintr_advance
> +.Fa "struct clockintr *cl"
> +.Fa "uint64_t interval"

Can we do s/interval/nsecs/?

> +.Fc
> +.Ft void
> +.Fo clockintr_cancel
> +.Fa "struct clockintr *cl"
> +.Fc
> +.Ft void
> +.Fo clockintr_disestablish
> +.Fa "struct clockintr *cl"
> +.Fc
> +.Ft struct clockintr *
> +.Fo clockintr_establish
> +.Fa "struct clockintr_queue *queue"
> +.Fa "void (*func)(struct clockintr *, void *)"
> +.Fc
> +.Ft uint64_t
> +.Fo clockintr_expiration
> +.Fa "const struct clockintr *cl"
> +.Fc
> +.Ft uint64_t
> +.Fo clockintr_nsecuptime
> +.Fa "const struct clockintr *cl"
> +.Fc
> +.Ft void
> +.Fo clockintr_schedule
> +.Fa "struct clockintr *cl"
> +.Fa "uint64_t abs"
> +.Fc
> +.Ft void
> +.Fo clockintr_stagger
> +.Fa "struct clockintr *cl"
> +.Fa "uint64_t interval"
> +.Fa "u_int numerator"
> +.Fa "u_int denominator"
> +.Fc
> +.Sh DESCRIPTION
> +The clock interrupt subsystem schedules functions for asynchronous execution
> +in the future from a hardware interrupt context.

This is the same description as timeout_set(9) apart from "hardware
interrupt context".  Why should I use this one and not the other one?
I'm confused.  I dislike choices.

> +.Pp
> +The
> +.Fn clockintr_establish
> +function allocates a new clock interrupt object
> +.Po
> +a
> +.Dq clockintr
> +.Pc
> +and binds it to the given clock interrupt
> +.Fa queue .
> +When the clockintr is executed,
> +the callback function
> +.Fa func
> +will be called from a hardware interrupt context on the CPU in control of the
> +.Fa queue .

It seems clockintr queues are per-CPU.  Can we abstract this data
structure from the caller and pass CPUs instead?  I don't see the point
of manipulating such queues outside of kern/kern_clockintr.c.

> +The new clockintr is not initially scheduled for execution and has an
> +expiration time of zero.

Speaking of an expiration time confuses me.  I don't see this as
necessary, at least for the moment.  When writing an API I'd strongly
suggest not adding functions that are not used.  This applies to
clockintr_expiration() below.

> +.Pp
> +The
> +.Fn clockintr_schedule
> +function schedules the given clockintr 

Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-08-30 Thread George Koehler
On Fri, 25 Aug 2023 21:00:34 -0500
Scott Cheloha  wrote:

> @@ -148,4 +185,18 @@ dt_prov_interval_enter(struct dt_provide
>   }
>   smr_read_leave();
>   return 0;
> +}
> +
> +void
> +dt_prov_profile_intr(struct clockintr *cl, void *cf)
> +{
> + uint64_t count, i;
> + struct cpu_info *ci = curcpu();
> +
> + count = clockintr_advance(cl, hardclock_period);
> + for (i = 0; i < count; i++) {
> + DT_ENTER(profile, NULL);
> + if (CPU_IS_PRIMARY(ci))
> + DT_ENTER(interval, NULL);
> + }
>  }

In build of GENERIC bsd.sp,

/usr/src/sys/dev/dt/dt_prov_profile.c:194:19: error: unused variable
'ci' [-Werror,-Wunused-variable]
struct cpu_info *ci = curcpu();
 ^
1 error generated.

To unbreak it, I edited the function to be,

void
dt_prov_profile_intr(struct clockintr *cl, void *cf)
{
uint64_t count, i;
int primary = CPU_IS_PRIMARY(curcpu());

count = clockintr_advance(cl, hardclock_period);
for (i = 0; i < count; i++) {
DT_ENTER(profile, NULL);
if (primary)
DT_ENTER(interval, NULL);
}
}



Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-08-25 Thread Scott Cheloha
On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> On 23/08/23(Wed) 18:52, Scott Cheloha wrote:
> > This is the next patch in the clock interrupt reorganization series.
> 
> Thanks for your diff.  I'm sorry but it is really hard for me to help
> review this diff because there is still no man page for this API+subsystem.
> 
> Can we start with that please?

Sure, a first draft of a clockintr_establish.9 manpage is included
below.

We also have a manpage in the tree, clockintr.9.  It is a bit out of
date, but it covers the broad strokes of how the driver-facing portion
of the subsystem works.

> > This patch moves the entry points for the interval and profile dt(4)
> > providers from the hardclock(9) to a dedicated clock interrupt
> > callback, dt_prov_profile_intr(), in dev/dt/dt_prov_profile.c.
> > 
> > - To preserve current behavior, (1) both provider entrypoints have
> >   been moved into a single callback function, (2) the interrupt runs at
> >   the same frequency as the hardclock, and (3) the interrupt is
> >   staggered to co-occur with the hardclock on a given CPU.
> 
> The only behavior that needs to be preserved is the output of dumping
> stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly need to
> be adapted with this change.  You can figure that out by looking at the
> output of /usr/src/share/btrace/kprofile.bt without and with this diff.
> 
> Please generate a FlameGraph to make sure they're still the same.

dt_prov_profile_intr() runs at the same stack depth as hardclock(), so
indeed they are still the same.

> Apart from that I'd prefer if we could skip the mechanical change and
> go straight to what dt(4) needs.  Otherwise we will have to re-design
> everything.

I think a mechanical "move the code from point A to point B" patch is
useful.  It makes the changes easier to follow when tracing the
revision history in the future.

If you insist on skipping it, though, I think I can live without it.

> If you don't want to do this work, then leave it and tell
> me what you need and what is your plan so I can help you and do it
> myself.

I am very keen to do this, or at least to help with it.

> dt(4) needs a way to schedule two different kinds of periodic timeouts
> with the highest precision possible.  It is currently plugged into
> because there is nothing better.

Yes.

> The current code assumes the periodic entry points are external to dt(4).
> This diff moves them in the middle of dt(4) but keeps the existing flow
> which makes the code very convoluted.
> 
> A starting point to understand the added complexity is to see that the
> DT_ENTER() macros are no longer needed if we move the entry points inside
> dt(4).

I did see that.  It seems really easy to remove the macros in a
subsequent patch, though.

Again, if you want to do it all in one patch that's OK.

> The first periodic timeout is dt_prov_interval_enter().  It could be
> implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
> approach is that it uses too much code in the kernel which is a problem
> when instrumenting the kernel itself.  Every subsystem used by dt(4) is
> impossible to instrument with btrace(8).

I think you can avoid this instrumentation problem by using clockintr,
where the callback functions are run from the hardware interrupt
context, just like hardclock().

> The second periodic timeout is dt_prov_profile_enter().  It is similar
> to the previous one and has to run on every CPU.
> 
> Both are currently bound to tick, but we want per-PCB time resolution.
> We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
> timeouts fire.

My current thought is that each PCB could have its own "dp_period" or
"dp_interval", a count of nanoseconds.

> > - Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
> >   PCB's implementing provider needs a clock interrupt to do its work, it
> >   stores the handle in dp_clockintr.  The handle, if any, is established
> >   during dtpv_alloc() and disestablished during dtpv_dealloc().
> 
> Sorry, but as I said I don't understand what is a clockintr.  How does it
> fit in the kernel and how is it supposed to be used?

The goal of clockintr is to provide a machine-independent API for
scheduling clock interrupts.  You can use it to implement something
like hardclock() or statclock().  We are already using it to implement
these functions, among others.

> Why have it per PCB and not per provider or for the whole driver instead?
> Per-PCB implies that if I run 3 different profiling on a 32 CPU machines
> I now have 96 different clockintr.  Is it what we want?

Yes, I think that sounds fine.  If we run into scaling problems we can
always just change the underlying data structure to an RB tree or a
minheap.

> If so, why not simply use timeout(9)?  What's the difference?

Some code needs to run from a hardware interrupt context.  It's going
to be easier to profile more of the kernel if we're collecting stack

Re: dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-08-24 Thread Martin Pieuchot
On 23/08/23(Wed) 18:52, Scott Cheloha wrote:
> This is the next patch in the clock interrupt reorganization series.

Thanks for your diff.  I'm sorry but it is really hard for me to help
review this diff because there is still no man page for this API+subsystem.

Can we start with that please?

> This patch moves the entry points for the interval and profile dt(4)
> providers from the hardclock(9) to a dedicated clock interrupt
> callback, dt_prov_profile_intr(), in dev/dt/dt_prov_profile.c.
> 
> - To preserve current behavior, (1) both provider entrypoints have
>   been moved into a single callback function, (2) the interrupt runs at
>   the same frequency as the hardclock, and (3) the interrupt is
>   staggered to co-occur with the hardclock on a given CPU.

The only behavior that needs to be preserved is the output of dumping
stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly need to
be adapted with this change.  You can figure that out by looking at the
output of /usr/src/share/btrace/kprofile.bt without and with this diff.

Please generate a FlameGraph to make sure they're still the same.

Apart from that I'd prefer if we could skip the mechanical change and
go straight to what dt(4) needs.  Otherwise we will have to re-design
everything.   If you don't want to do this work, then leave it and tell
me what you need and what is your plan so I can help you and do it
myself.

dt(4) needs a way to schedule two different kinds of periodic timeouts
with the highest precision possible.  It is currently plugged into
hardclock because there is nothing better.

The current code assumes the periodic entry points are external to dt(4).
This diff moves them in the middle of dt(4) but keeps the existing flow
which makes the code very convoluted. 
A starting point to understand the added complexity is to see that the
DT_ENTER() macros are no longer needed if we move the entry points inside
dt(4).

The first periodic timeout is dt_prov_interval_enter().  It could be
implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
approach is that it uses too much code in the kernel which is a problem
when instrumenting the kernel itself.  Every subsystem used by dt(4) is
impossible to instrument with btrace(8).

The second periodic timeout is dt_prov_profile_enter().  It is similar
to the previous one and has to run on every CPU.

Both are currently bound to tick, but we want per-PCB time resolution.
We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
timeouts fire.

> - Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
>   PCB's implementing provider needs a clock interrupt to do its work, it
>   stores the handle in dp_clockintr.  The handle, if any, is established
>   during dtpv_alloc() and disestablished during dtpv_dealloc().

Sorry, but as I said I don't understand what is a clockintr.  How does it
fit in the kernel and how is it supposed to be used?

Why have it per PCB and not per provider or for the whole driver instead?
Per-PCB implies that if I run 3 different profiling on a 32 CPU machines
I now have 96 different clockintr.  Is it what we want?  If so, why not
simply use timeout(9)?  What's the difference?

>   One alternative is to start running the clock interrupts when they
>   are allocated in dtpv_alloc() and stop them when they are freed in
>   dtpv_dealloc().  This is wasteful, though: the PCBs are not recording
>   yet, so the interrupts won't perform any useful work until the
>   controlling process enables recording.
> 
>   An additional pair of provider hooks, e.g. "dtpv_record_start" and
>   "dtpv_record_stop", might resolve this.

Another alternative would be to have a single hardclock-like handler for
dt(4).  Sorry, I don't understand the big picture behind clockintr, so I
can't help.  

> - We haven't needed to destroy clock interrupts yet, so the
>   clockintr_disestablish() function used in this patch is new.

So maybe that's fishy.  Are they supposed to be destroyed?  What is the
goal of this API?



dt(4), hardclock(9): move interval, profile providers to dedicated callback

2023-08-23 Thread Scott Cheloha
This is the next patch in the clock interrupt reorganization series.

This patch moves the entry points for the interval and profile dt(4)
providers from the hardclock(9) to a dedicated clock interrupt
callback, dt_prov_profile_intr(), in dev/dt/dt_prov_profile.c.

- To preserve current behavior, (1) both provider entrypoints have
  been moved into a single callback function, (2) the interrupt runs at
  the same frequency as the hardclock, and (3) the interrupt is
  staggered to co-occur with the hardclock on a given CPU.

  All of this can be changed later.  This patch is strictly
  logistical: moving these parts out of the hardclock.

- Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
  PCB's implementing provider needs a clock interrupt to do its work, it
  stores the handle in dp_clockintr.  The handle, if any, is established
  during dtpv_alloc() and disestablished during dtpv_dealloc().

  Only the interval and profile providers use it at present.

- In order to implement dt_prov_profile_dealloc() I needed a complete
  definition of struct dt_softc, so I hoisted it up out of dt_dev.c
  into dtvar.h.

- A PCB's periodic clock interrupt, if any, is started during
  dt_ioctl_record_start() and stopped during dt_ioctl_record_stop().
  This is not a provider-agnostic approach, but I don't see where
  else to start/stop the clock interrupt.

  One alternative is to start running the clock interrupts when they
  are allocated in dtpv_alloc() and stop them when they are freed in
  dtpv_dealloc().  This is wasteful, though: the PCBs are not recording
  yet, so the interrupts won't perform any useful work until the
  controlling process enables recording.

  An additional pair of provider hooks, e.g. "dtpv_record_start" and
  "dtpv_record_stop", might resolve this.

- We haven't needed to destroy clock interrupts yet, so the
  clockintr_disestablish() function used in this patch is new.

  The implementation is extremely similar to that of clockintr_cancel().
  However, for sake of simplicity, a callback may not disestablish its
  clockintr while the callback is running.

  One tricky part: if a clockintr is disestablished while it is running,
  the dispatch thread takes care not to use-after-free when it re-enters
  the clockintr_queue mutex.

I have tested this on amd64.  It seems to work: the clock interrupts
fire, stack traces are recorded, and the handles are deallocated when
the process closes the file descriptor.

I will be testing it on other platforms over the next few days.

Thoughts?  Test results?

Index: kern/kern_clockintr.c
===
RCS file: /cvs/src/sys/kern/kern_clockintr.c,v
retrieving revision 1.32
diff -u -p -r1.32 kern_clockintr.c
--- kern/kern_clockintr.c   21 Aug 2023 17:22:04 -  1.32
+++ kern/kern_clockintr.c   23 Aug 2023 23:35:11 -
@@ -218,7 +218,7 @@ clockintr_dispatch(void *frame)
 {
uint64_t lateness, run = 0, start;
struct cpu_info *ci = curcpu();
-   struct clockintr *cl;
+   struct clockintr *cl, *shadow;
struct clockintr_queue *cq = &ci->ci_queue;
u_int ogen;
 
@@ -257,22 +257,26 @@ clockintr_dispatch(void *frame)
break;
}
clockintr_cancel_locked(cl);
-   cq->cq_shadow.cl_expiration = cl->cl_expiration;
+   shadow = &cq->cq_shadow;
+   shadow->cl_expiration = cl->cl_expiration;
+   shadow->cl_func = cl->cl_func;
cq->cq_running = cl;
mtx_leave(&cq->cq_mtx);
 
-   cl->cl_func(&cq->cq_shadow, frame);
+   shadow->cl_func(shadow, frame);
 
mtx_enter(&cq->cq_mtx);
-   cq->cq_running = NULL;
-   if (ISSET(cl->cl_flags, CLST_IGNORE_SHADOW)) {
-   CLR(cl->cl_flags, CLST_IGNORE_SHADOW);
-   CLR(cq->cq_shadow.cl_flags, CLST_SHADOW_PENDING);
-   }
-   if (ISSET(cq->cq_shadow.cl_flags, CLST_SHADOW_PENDING)) {
-   CLR(cq->cq_shadow.cl_flags, CLST_SHADOW_PENDING);
-   clockintr_schedule_locked(cl,
-   cq->cq_shadow.cl_expiration);
+   if (cq->cq_running != NULL) {
+   cq->cq_running = NULL;
+   if (ISSET(cl->cl_flags, CLST_IGNORE_SHADOW)) {
+   CLR(cl->cl_flags, CLST_IGNORE_SHADOW);
+   CLR(shadow->cl_flags, CLST_SHADOW_PENDING);
+   }
+   if (ISSET(shadow->cl_flags, CLST_SHADOW_PENDING)) {
+   CLR(shadow->cl_flags, CLST_SHADOW_PENDING);
+   clockintr_schedule_locked(cl,
+   shadow->cl_expiration);
+   }
}
run++;
}
@@ -382,6 +386,34 @@