On Mon, Dec 28, 2020 at 06:56:08PM -0600, Scott Cheloha wrote:
> On Mon, Dec 28, 2020 at 10:49:59AM +1000, David Gwynne wrote:
> > On Sat, Dec 26, 2020 at 04:48:23PM -0600, Scott Cheloha wrote:
> > > Now that we've removed bd_rdStart from the bpf_d struct, removing
> > > ticks from bpf(4) itself is straightforward.
> > >
> > > - bd_rtout becomes a timespec; update bpfioctl() accordingly.
> > > Cap it at MAXTSLP nanoseconds to avoid arithmetic overflow
> > > in bpfread().
> > >
> > > - At the start of bpfread(), if a timeout is set, find the end
> > > of the read as an absolute uptime. This is the point where
> > > we want to avoid overflow: if bd_rtout is only MAXTSLP
> > > nanoseconds the timespecadd(3) will effectively never overflow.
> > >
> > > - Before going to sleep, if we have a timeout set, compute how
> > > much longer to sleep in nanoseconds.
> > >
> > > Here's a spot where an absolute timeout sleep would save a
> > > little code, but we don't have such an interface yet. Worth
> > > keeping in mind for the future, though.
> >
> > Are there any other places where that would be useful, though?  bpf is
> > pretty special.
>
> kqueue_scan() in kern_event.c can have a spurious wakeup, so an
> absolute sleep would be useful there. doppoll() in sys_generic.c is
> the same, though I think mpi@/visa@ intend to refactor it to use
> kqueue_scan().
>
> In general, if you have a thread that wants to do something on a
> strict period you need to use an absolute sleep to avoid drift.
True, something to keep in mind then.
> This code drifts:
>
> 	for (;;) {
> 		do_work();
> 		tsleep_nsec(&nowake, PPAUSE, "worker", SEC_TO_NSEC(1));
> 	}
>
> While this code will not:
>
> 	uint64_t deadline;
>
> 	deadline = nsecuptime();
> 	for (;;) {
> 		do_work();
> 		deadline = nsec_advance(deadline, SEC_TO_NSEC(1));
> 		tsleep_abs_nsec(&chan, PPAUSE, "worker", deadline);
> 	}
>
> (Some of those interfaces don't actually exist, but they are easy to
> write and you can infer how they work.)
>
> Most developers probably do not care about maintaining a strict period
> for periodic workloads, but I have a suspicion that it would keep
> system performance more deterministic because you don't have various
> periodic workloads drifting into one another, overlapping, and
> momentarily causing utilization spikes.
>
> I know that probably sounds far-fetched... it's an idea I've been
> fussing with.
>
> > > dlg@: You said you wanted to simplify this loop. Unsure what shape
> > > that cleanup would take. Should we clean it up before this change?
> >
> > I wanted to have a single msleep_nsec() call and pass INFSLP when it
> > should sleep forever.  You saw my first attempt at that.  It had
> > issues.
> >
> > > Thoughts? ok?
> >
> > How would this look if you used a uint64_t for nsecs for bd_rtout,
> > and the nsec uptime thing from your pool diff instead of timespecs
> > and nanouptime?
>
> See the attached patch. It is shorter because we can do more inline
> stuff with a uint64_t than with a timespec.
I like it.
>
> > What's the thinking behind nanouptime instead of getnanouptime?
>
> In general we should prefer high resolution time unless there is a
> compelling performance reason to use low-res time.
>
> In particular, we should prefer high res time whenever userspace
> timeouts are involved as userspace can only use high-res time.
>
> For instance, when tsleep_nsec/etc. are reimplemented with kclock
> timeouts (soon!) I will remove the use of low-res time from
> nanosleep(2), select(2)/pselect(2), poll(2)/ppoll(2), and kevent(2).
> The use of low-res time in these interfaces can cause undersleep right
> now. They're buggy.
>
> > More generally, what will getnanouptime do when ticks go away?
>
> Ticks will probably never "go away" entirely.
>
> In general, some CPU is always going to need to call tc_windup()
> regularly to keep time moving forward. When tc_windup() is called we
> update the low-res timestamps. So getnanouptime(9)/etc. will continue
> to work as they do today. I also imagine that the CPU responsible for
> calling tc_windup() will continue to increment `ticks' and `jiffies'.
Only one CPU needs to do that, though; not all CPUs need to tick.
> In the near-ish future I want to add support for dynamic clock
> interrupts. This would permit a CPU to stay idle, or drop into a
> deeper power-saving state, for longer than 1 tick if it has no work to
> do. This is not strictly-speaking "tickless" operation, as the clock
> interrupt is not disabled. But it is nice, so it is a near-term goal.
>
> In the Distant Future we could add support for disabling the clock
> interrupt on select CPUs. This might be useful for certain realtime
> applications. This would be "true tickless" operation. But for this
> to be useful we'd need to add support for realtime scheduling policies
> (e.g. SCHED_FIFO) and support for binding threads to (and excluding
> threads from) particular CPUs. So think Distant Future, if ever.
>
> --
>
> Anyway, updated patch here.
I like this one more. It's ok by me, but wait a bit to see if anyone
objects.
dlg.
>
> Index: bpf.c
> ===================================================================
> RCS file: /cvs/src/sys/net/bpf.c,v
> retrieving revision 1.199
> diff -u -p -r1.199 bpf.c
> --- bpf.c 26 Dec 2020 16:30:58 -0000 1.199
> +++ bpf.c 29 Dec 2020 00:21:45 -0000
> @@ -60,6 +60,7 @@
>  #include <sys/selinfo.h>
>  #include <sys/sigio.h>
>  #include <sys/task.h>
> +#include <sys/time.h>
>
>  #include <net/if.h>
>  #include <net/bpf.h>
> @@ -77,9 +78,6 @@
>
>  #define PRINET	26			/* interruptible */
>
> -/* from kern/kern_clock.c; incremented each clock tick. */
> -extern int ticks;
> -
>  /*
>   * The default read buffer size is patchable.
>   */
> @@ -422,15 +420,30 @@ bpfclose(dev_t dev, int flag, int mode,
>  	(d)->bd_sbuf = (d)->bd_fbuf; \
>  	(d)->bd_slen = 0; \
>  	(d)->bd_fbuf = NULL;
> +
> +/*
> + * TODO Move nsecuptime() into kern_tc.c and document it when we have
> + * more users elsewhere in the kernel.
> + */
> +static uint64_t
> +nsecuptime(void)
> +{
> +	struct timespec now;
> +
> +	nanouptime(&now);
> +	return TIMESPEC_TO_NSEC(&now);
> +}
> +
>  /*
>   * bpfread - read next chunk of packets from buffers
>   */
>  int
>  bpfread(dev_t dev, struct uio *uio, int ioflag)
>  {
> +	uint64_t end, now;
>  	struct bpf_d *d;
>  	caddr_t hbuf;
> -	int end, error, hlen, nticks;
> +	int error, hlen;
>
>  	KERNEL_ASSERT_LOCKED();
>
> @@ -453,8 +466,12 @@ bpfread(dev_t dev, struct uio *uio, int
>  	/*
>  	 * If there's a timeout, mark when the read should end.
>  	 */
> -	if (d->bd_rtout)
> -		end = ticks + (int)d->bd_rtout;
> +	if (d->bd_rtout != 0) {
> +		now = nsecuptime();
> +		end = now + d->bd_rtout;
> +		if (end < now)
> +			end = UINT64_MAX;
> +	}
>
>  	/*
>  	 * If the hold buffer is empty, then do a timed sleep, which
> @@ -489,11 +506,11 @@ bpfread(dev_t dev, struct uio *uio, int
>  			error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
>  			    "bpf", INFSLP);
>  			d->bd_nreaders--;
> -		} else if ((nticks = end - ticks) > 0) {
> +		} else if ((now = nsecuptime()) < end) {
>  			/* Read timeout has not expired yet. */
>  			d->bd_nreaders++;
> -			error = msleep(d, &d->bd_mtx, PRINET|PCATCH, "bpf",
> -			    nticks);
> +			error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
> +			    "bpf", end - now);
>  			d->bd_nreaders--;
>  		} else {
>  			/* Read timeout has expired. */
> @@ -861,27 +878,19 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t
>  	case BIOCSRTIMEOUT:
>  		{
>  			struct timeval *tv = (struct timeval *)addr;
> -			u_long rtout;
> +			uint64_t rtout;
>
> -			/* Compute number of ticks. */
>  			if (tv->tv_sec < 0 || !timerisvalid(tv)) {
>  				error = EINVAL;
>  				break;
>  			}
> -			if (tv->tv_sec > INT_MAX / hz) {
> -				error = EOVERFLOW;
> -				break;
> -			}
> -			rtout = tv->tv_sec * hz;
> -			if (tv->tv_usec / tick > INT_MAX - rtout) {
> +			rtout = TIMEVAL_TO_NSEC(tv);
> +			if (rtout > MAXTSLP) {
>  				error = EOVERFLOW;
>  				break;
>  			}
> -			rtout += tv->tv_usec / tick;
>  			mtx_enter(&d->bd_mtx);
>  			d->bd_rtout = rtout;
> -			if (d->bd_rtout == 0 && tv->tv_usec != 0)
> -				d->bd_rtout = 1;
>  			mtx_leave(&d->bd_mtx);
>  			break;
>  		}
> @@ -893,9 +902,9 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t
>  		{
>  			struct timeval *tv = (struct timeval *)addr;
>
> +			memset(tv, 0, sizeof(*tv));
>  			mtx_enter(&d->bd_mtx);
> -			tv->tv_sec = d->bd_rtout / hz;
> -			tv->tv_usec = (d->bd_rtout % hz) * tick;
> +			NSEC_TO_TIMEVAL(d->bd_rtout, tv);
>  			mtx_leave(&d->bd_mtx);
>  			break;
>  		}
> Index: bpfdesc.h
> ===================================================================
> RCS file: /cvs/src/sys/net/bpfdesc.h,v
> retrieving revision 1.43
> diff -u -p -r1.43 bpfdesc.h
> --- bpfdesc.h 26 Dec 2020 16:30:58 -0000 1.43
> +++ bpfdesc.h 29 Dec 2020 00:21:45 -0000
> @@ -78,7 +78,7 @@ struct bpf_d {
>  	int		bd_in_uiomove;	/* for debugging purpose */
>
>  	struct bpf_if  *bd_bif;		/* interface descriptor */
> -	u_long		bd_rtout;	/* [m] Read timeout in 'ticks' */
> +	uint64_t	bd_rtout;	/* [m] Read timeout in nanoseconds */
>  	u_long		bd_nreaders;	/* [m] # threads asleep in bpfread() */
>  	int		bd_rnonblock;	/* true if nonblocking reads are set */
>  	struct bpf_program_smr