On Mon, Dec 28, 2020 at 06:56:08PM -0600, Scott Cheloha wrote:
> On Mon, Dec 28, 2020 at 10:49:59AM +1000, David Gwynne wrote:
> > On Sat, Dec 26, 2020 at 04:48:23PM -0600, Scott Cheloha wrote:
> > > Now that we've removed bd_rdStart from the bpf_d struct, removing
> > > ticks from bpf(4) itself is straightforward.
> > >
> > > - bd_rtout becomes a timespec; update bpfioctl() accordingly.
> > > Cap it at MAXTSLP nanoseconds to avoid arithmetic overflow
> > > in bpfread().
> > >
> > > - At the start of bpfread(), if a timeout is set, find the end
> > > of the read as an absolute uptime. This is the point where
> > > we want to avoid overflow: if bd_rtout is only MAXTSLP
> > > nanoseconds the timespecadd(3) will effectively never overflow.
> > >
> > > - Before going to sleep, if we have a timeout set, compute how
> > > much longer to sleep in nanoseconds.
> > >
> > > Here's a spot where an absolute timeout sleep would save a
> > > little code, but we don't have such an interface yet. Worth
> > > keeping in mind for the future, though.
> >
> > Are there any other places where that would be useful, though?  bpf is
> > pretty special.
>
> kqueue_scan() in kern_event.c can have a spurious wakeup, so an
> absolute sleep would be useful there. doppoll() in sys_generic.c is
> the same, though I think mpi@/visa@ intend to refactor it to use
> kqueue_scan().
>
> In general, if you have a thread that wants to do something on a
> strict period you need to use an absolute sleep to avoid drift.
True, something to keep in mind then.
> This code drifts:
>
> 	for (;;) {
> 		do_work();
> 		tsleep_nsec(&nowake, PPAUSE, "worker", SEC_TO_NSEC(1));
> 	}
>
> While this code will not:
>
> 	uint64_t deadline;
>
> 	deadline = nsecuptime();
> 	for (;;) {
> 		do_work();
> 		deadline = nsec_advance(deadline, SEC_TO_NSEC(1));
> 		tsleep_abs_nsec(&chan, PPAUSE, "worker", deadline);
> 	}
>
> (Some of those interfaces don't actually exist, but they are easy to
> write and you can infer how they work.)
>
> Most developers probably do not care about maintaining a strict period
> for periodic workloads, but I have a suspicion that it would keep
> system performance more deterministic because you don't have various
> periodic workloads drifting into one another, overlapping, and
> momentarily causing utilization spikes.
>
> I know that probably sounds far-fetched... it's an idea I've been
> fussing with.
>
> > > dlg@: You said you wanted to simplify this loop. Unsure what shape
> > > that cleanup would take. Should we clean it up before this change?
> >
> > I wanted to have a single msleep_nsec() call and pass INFSLP when it
> > should sleep forever.  You saw my first attempt at that.  It had
> > issues.
> >
> > > Thoughts? ok?
> >
> > How would this look if you used a uint64_t for nsecs for bd_rtout,
> > and the nsec uptime thing from your pool diff instead of timespecs
> > and nanouptime?
>
> See the attached patch. It is shorter because we can do more inline
> stuff with a uint64_t than with a timespec.
I like it.
>
> > What's the thinking behind nanouptime instead of getnanouptime?
>
> In general we should prefer high resolution time unless there is a
> compelling performance reason to use low-res time.
>
> In particular, we should prefer high res time whenever userspace
> timeouts are involved as userspace can only use high-res time.
>
> For instance, when tsleep_nsec/etc. are reimplemented with kclock
> timeouts (soon!) I will remove the use of low-res time from
> nanosleep(2), select(2)/pselect(2), poll(2)/ppoll(2), and kevent(2).
> The use of low-res time in these interfaces can cause undersleep right
> now. They're buggy.
>
> > More generally, what will getnanouptime do when ticks go away?
>
> Ticks will probably never "go away" entirely.
>
> In general, some CPU is always going to need to call tc_windup()
> regularly to keep time moving forward. When tc_windup() is called we
> update the low-res timestamps. So getnanouptime(9)/etc. will continue
> to work as they do today. I also imagine that the CPU responsible for
> calling tc_windup() will continue to increment `ticks' and `jiffies'.
Only one CPU needs to do that, though; not all CPUs need to tick.
> In the near-ish future I want to add support for dynamic clock
> interrupts. This would permit a CPU to stay idle, or drop into a
> deeper power-saving state, for longer than 1 tick if it has no work to
> do. This is not strictly-speaking "tickless" operation, as the clock
> interrupt is not disabled. But it is nice, so it is a near-term goal.
>
> In the Distant Future we could add support for disabling the clock
> interrupt on select CPUs. This might be useful for certain realtime
> applications. This would be "true tickless" operation. But for this
> to be useful we'd need to add support for realtime scheduling policies
> (e.g. SCHED_FIFO) and support for binding threads to (and excluding
> threads from) particular CPUs. So think Distant Future, if ever.
>
> --
>
> Anyway, updated patch here.
I like this one more. It's ok by me, but wait a bit to see if anyone
objects.
dlg.
>
> Index: bpf.c
> ===================================================================
> RCS file: /cvs/src/sys/net/bpf.c,v
> retrieving revision 1.199
> diff -u -p -r1.199 bpf.c
> --- bpf.c 26 Dec 2020 16:30:58 -0000 1.199
> +++ bpf.c 29 Dec 2020 00:21:45 -0000
> @@ -60,6 +60,7 @@
>  #include <sys/selinfo.h>
>  #include <sys/sigio.h>
>  #include <sys/task.h>
> +#include <sys/time.h>
>
>  #include <net/if.h>
>  #include <net/bpf.h>
> @@ -77,9 +78,6 @@
>
>  #define PRINET	26			/* interruptible */
>
> -/* from kern/kern_clock.c; incremented each clock tick. */
> -extern int ticks;
> -
>  /*
>   * The default read buffer size is patchable.
>   */
> @@ -422,15 +420,30 @@ bpfclose(dev_t dev, int flag, int mode,
>  	(d)->bd_sbuf = (d)->bd_fbuf; \
>  	(d)->bd_slen = 0; \
>  	(d)->bd_fbuf = NULL;
> +
> +/*
> + * TODO Move nsecuptime() into kern_tc.c and document it when we have
> + * more users elsewhere in the kernel.
> + */
> +static uint64_t
> +nsecuptime(void)
> +{
> +	struct timespec now;
> +
> +	nanouptime(&now);
> +	return TIMESPEC_TO_NSEC(&now);
> +}
> +
>  /*
>   * bpfread - read next chunk of packets from buffers
>   */
>  int
>  bpfread(dev_t dev, struct uio *uio, int ioflag)
>  {
> +	uint64_t end, now;
>  	struct bpf_d *d;
>  	caddr_t hbuf;
> -	int end, error, hlen, nticks;
> +	int error, hlen;
>
>  	KERNEL_ASSERT_LOCKED();
>
> @@ -453,8 +466,12 @@ bpfread(dev_t dev, struct uio *uio, int
>  	/*
>  	 * If there's a timeout, mark when the read should end.
>  	 */
> -	if (d->bd_rtout)
> -		end = ticks + (int)d->bd_rtout;
> +	if (d->bd_rtout != 0) {
> +		now = nsecuptime();
> +		end = now + d->bd_rtout;
> +		if (end < now)
> +			end = UINT64_MAX;
> +	}
>
>  	/*
>  	 * If the hold buffer is empty, then do a timed sleep, which
> @@ -489,11 +506,11 @@ bpfread(dev_t dev, struct uio *uio, int
>  			error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
>  			    "bpf", INFSLP);
>  			d->bd_nreaders--;
> -		} else if ((nticks = end - ticks) > 0) {
> +		} else if ((now = nsecuptime()) < end) {
>  			/* Read timeout has not expired yet. */
>  			d->bd_nreaders++;
> -			error = msleep(d, &d->bd_mtx, PRINET|PCATCH, "bpf",
> -			    nticks);
> +			error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
> +			    "bpf", end - now);
>  			d->bd_nreaders--;
>  		} else {
>  			/* Read timeout has expired. */
> @@ -861,27 +878,19 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t
>  	case BIOCSRTIMEOUT:
>  		{
>  			struct timeval *tv = (struct timeval *)addr;
> -			u_long rtout;
> +			uint64_t rtout;
>
> -			/* Compute number of ticks. */
>  			if (tv->tv_sec < 0 || !timerisvalid(tv)) {
>  				error = EINVAL;
>  				break;
>  			}
> -			if (tv->tv_sec > INT_MAX / hz) {
> -				error = EOVERFLOW;
> -				break;
> -			}
> -			rtout = tv->tv_sec * hz;
> -			if (tv->tv_usec / tick > INT_MAX - rtout) {
> +			rtout = TIMEVAL_TO_NSEC(tv);
> +			if (rtout > MAXTSLP) {
>  				error = EOVERFLOW;
>  				break;
>  			}
> -			rtout += tv->tv_usec / tick;
>  			mtx_enter(&d->bd_mtx);
>  			d->bd_rtout = rtout;
> -			if (d->bd_rtout == 0 && tv->tv_usec != 0)
> -				d->bd_rtout = 1;
>  			mtx_leave(&d->bd_mtx);
>  			break;
>  		}
> @@ -893,9 +902,9 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t
>  		{
>  			struct timeval *tv = (struct timeval *)addr;
>
> +			memset(tv, 0, sizeof(*tv));
>  			mtx_enter(&d->bd_mtx);
> -			tv->tv_sec = d->bd_rtout / hz;
> -			tv->tv_usec = (d->bd_rtout % hz) * tick;
> +			NSEC_TO_TIMEVAL(d->bd_rtout, tv);
>  			mtx_leave(&d->bd_mtx);
>  			break;
>  		}
> Index: bpfdesc.h
> ===================================================================
> RCS file: /cvs/src/sys/net/bpfdesc.h,v
> retrieving revision 1.43
> diff -u -p -r1.43 bpfdesc.h
> --- bpfdesc.h 26 Dec 2020 16:30:58 -0000 1.43
> +++ bpfdesc.h 29 Dec 2020 00:21:45 -0000
> @@ -78,7 +78,7 @@ struct bpf_d {
>  	int		bd_in_uiomove;	/* for debugging purpose */
>
>  	struct bpf_if  *bd_bif;		/* interface descriptor */
> -	u_long		bd_rtout;	/* [m] Read timeout in 'ticks' */
> +	uint64_t	bd_rtout;	/* [m] Read timeout in nanoseconds */
>  	u_long		bd_nreaders;	/* [m] # threads asleep in bpfread() */
>  	int		bd_rnonblock;	/* true if nonblocking reads are set */
>  	struct bpf_program_smr