On 25/08/23(Fri) 21:00, Scott Cheloha wrote:
> On Thu, Aug 24, 2023 at 07:21:29PM +0200, Martin Pieuchot wrote:
> > [...] 
> > The only behavior that needs to be preserved is the output of dumping
> > stacks.  That means DT_FA_PROFILE and DT_FA_STATIC certainly need to
> > be adapted with this change.  You can figure that out by looking at the
> > output of /usr/src/share/btrace/kprofile.bt without and with this diff.
> > 
> > Please generate a FlameGraph to make sure they're still the same.
> 
> dt_prov_profile_intr() runs at the same stack depth as hardclock(), so
> indeed they are still the same.

Lovely.

> > Apart from that I'd prefer if we could skip the mechanical change and
> > go straight to what dt(4) needs.  Otherwise we will have to re-design
> > everything.
> 
> I think a mechanical "move the code from point A to point B" patch is
> useful.  It makes the changes easier to follow when tracing the
> revision history in the future.
> 
> If you insist on skipping it, though, I think I can live without it.

I do insist.  It is really hard for me to follow and work with you
because you're too verbose for my capacity.  If you want to work with
me, please take smaller steps and do not mix so many changes into big
diffs.  I have plenty of possible comments but cannot deal with huge
chunks.

> > The current code assumes the periodic entry points are external to dt(4).
> > This diff moves them in the middle of dt(4) but keeps the existing flow
> > which makes the code very convoluted.
> > 
> > A starting point to understand the added complexity is to see that the
> > DT_ENTER() macros are no longer needed if we move the entry points inside
> > dt(4).
> 
> I did see that.  It seems really easy to remove the macros in a
> subsequent patch, though.
> 
> Again, if you want to do it all in one patch that's OK.

Yes please.

> > The first periodic timeout is dt_prov_interval_enter().  It could be
> > implemented as a per-PCB timeout_add_nsec(9).  The drawback of this
> > approach is that it uses too much code in the kernel which is a problem
> > when instrumenting the kernel itself.  Every subsystem used by dt(4) is
> > impossible to instrument with btrace(8).
> 
> I think you can avoid this instrumentation problem by using clockintr,
> where the callback functions are run from the hardware interrupt
> context, just like hardclock().

Fair enough.

> > The second periodic timeout is dt_prov_profile_enter().  It is similar
> > to the previous one and has to run on every CPU.
> > 
> > Both are currently bound to tick, but we want per-PCB time resolution.
> > We can get rid of `dp_nticks' and `dp_maxtick' if we control when the
> > timeouts fire.
> 
> My current thought is that each PCB could have its own "dp_period" or
> "dp_interval", a count of nanoseconds.

Indeed.  We can have `dp_nsecs' and use that to determine if
clockintr_advance() needs to be called in dt_ioctl_record_start().
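To make the idea concrete, here is a hypothetical userspace sketch (not
the actual dt(4) code; `struct dt_pcb_sketch', `dt_pcb_set_rate()' and
`dt_pcb_needs_clockintr()' are made-up names): each PCB stores its period
in nanoseconds, a profiling rate in Hz maps to `dp_nsecs', and the
record-start path would only arm a clock interrupt for PCBs that actually
have a period.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a per-PCB period, in nanoseconds. */
struct dt_pcb_sketch {
	uint64_t dp_nsecs;	/* interval between firings, 0 = none */
};

#define SEC_TO_NSEC_SKETCH(s)	((uint64_t)(s) * 1000000000ULL)

/* Convert a profiling rate in Hz into a per-PCB period. */
static void
dt_pcb_set_rate(struct dt_pcb_sketch *dp, uint64_t hz)
{
	dp->dp_nsecs = (hz == 0) ? 0 : SEC_TO_NSEC_SKETCH(1) / hz;
}

/* Only PCBs with a non-zero period need a clock interrupt armed. */
static int
dt_pcb_needs_clockintr(const struct dt_pcb_sketch *dp)
{
	return (dp->dp_nsecs != 0);
}
```

With this, dt_ioctl_record_start() would call clockintr_advance() only
when dt_pcb_needs_clockintr() is true.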

> > > - Each dt_pcb gets a provider-specific "dp_clockintr" pointer.  If the
> > >   PCB's implementing provider needs a clock interrupt to do its work, it
> > >   stores the handle in dp_clockintr.  The handle, if any, is established
> > >   during dtpv_alloc() and disestablished during dtpv_dealloc().
> > 
> > Sorry, but as I said I don't understand what is a clockintr.  How does it
> > fit in the kernel and how is it supposed to be used?
> 
> The goal of clockintr is to provide a machine-independent API for
> scheduling clock interrupts.  You can use it to implement something
> like hardclock() or statclock().  We are already using it to implement
> these functions, among others.

After reading all the code and the previous manuals, I understand it as
a low-level per-CPU timeout API with nanosecond precision.  Is that it?

> > Why have it per PCB and not per provider or for the whole driver instead?
> > Per-PCB implies that if I run 3 different profilings on a 32-CPU machine
> > I now have 96 different clockintrs.  Is that what we want?
> 
> Yes, I think that sounds fine.  If we run into scaling problems we can
> always just change the underlying data structure to an RB tree or a
> minheap.

Fine.

> > If so, why not simply use timeout(9)?  What's the difference?
> 
> Some code needs to run from a hardware interrupt context.  It's going
> to be easier to profile more of the kernel if we're collecting stack
> traces during a clock interrupt.
> 
> Timeouts run at IPL_SOFTCLOCK.  You can profile process context
> code...  that's it.  Timeouts also only run on a single CPU.  Per-CPU
> timeout wheels are at least a year away if not longer.

It's not clear to me which code needs what.  I'd love to see that
explained in the manual.

> > >   One alternative is to start running the clock interrupts when they
> > >   are allocated in dtpv_alloc() and stop them when they are freed in
> > >   dtpv_dealloc().  This is wasteful, though: the PCBs are not recording
> > >   yet, so the interrupts won't perform any useful work until the
> > >   controlling process enables recording.
> > > 
> > >   An additional pair of provider hooks, e.g. "dtpv_record_start" and
> > >   "dtpv_record_stop", might resolve this.
> > 
> > Another alternative would be to have a single hardclock-like handler for
> > dt(4).  Sorry, I don't understand the big picture behind clockintr, so I
> > can't help.
> > 
> > > - We haven't needed to destroy clock interrupts yet, so the
> > >   clockintr_disestablish() function used in this patch is new.
> > 
> > So maybe that's fishy.  Are they supposed to be destroyed? [...]
> 
> Let's start with what we have above and circle back to this.

I don't have time to waste circling back; better to go straight to the
wanted design.

Apart from running periodically on a given CPU, an important need for
dt(4) is to get a timestamp for every event.  Currently nanotime(9)
is used, which is global to the system.  I saw that ftrace uses
different per-CPU clocks, which might be something to consider now that
we're moving to a per-CPU API.

It's all about the cost of the instrumentation.  Note that if we use a
different per-CPU timestamp it has to work outside of clockintr, because
probes can be anywhere.

Regarding clockintr_establish(), I'd rather have it *not* allocate the
`clockintr'.  I'd prefer to waste some more space in every PCB.  The
reason for this is to keep live allocations in dt(4) in dt_pcb_alloc()
only, so that at some point in the future we can stop going through
malloc(9) and avoid interfering with the rest of the kernel.  Is there
any downside to this?

Can we have a different hook for the interval provider?  Since we need
only one clockintr and we don't care about the CPU, should we pick a
random one?  Could that be implemented by passing a NULL "struct
cpu_info *" pointer to clockintr_establish()?  Then multiple "interval"
probes would run on different CPUs...

I'd rather keep "struct dt_softc" private, so as not to create too
convoluted code.
