On Mon, Dec 28, 2020 at 10:49:59AM +1000, David Gwynne wrote:
> On Sat, Dec 26, 2020 at 04:48:23PM -0600, Scott Cheloha wrote:
> > Now that we've removed bd_rdStart from the bpf_d struct, removing
> > ticks from bpf(4) itself is straightforward.
> >
> > - bd_rtout becomes a timespec; update bpfioctl() accordingly.
> > Cap it at MAXTSLP nanoseconds to avoid arithmetic overflow
> > in bpfread().
> >
> > - At the start of bpfread(), if a timeout is set, find the end
> > of the read as an absolute uptime. This is the point where
> > we want to avoid overflow: if bd_rtout is only MAXTSLP
> > nanoseconds the timespecadd(3) will effectively never overflow.
> >
> > - Before going to sleep, if we have a timeout set, compute how
> > much longer to sleep in nanoseconds.
> >
> > Here's a spot where an absolute timeout sleep would save a
> > little code, but we don't have such an interface yet. Worth
> > keeping in mind for the future, though.
>
> Are there any other places that would be useful though? bpf is pretty
> special.

kqueue_scan() in kern_event.c can have a spurious wakeup, so an
absolute sleep would be useful there. doppoll() in sys_generic.c is
the same, though I think mpi@/visa@ intend to refactor it to use
kqueue_scan().

In general, if you have a thread that wants to do something on a
strict period, you need to use an absolute sleep to avoid drift.
This code drifts:

	for (;;) {
		do_work();
		tsleep_nsec(&nowake, PPAUSE, "worker", SEC_TO_NSEC(1));
	}

While this code will not:

	uint64_t deadline;

	deadline = nsecuptime();
	for (;;) {
		do_work();
		deadline = nsec_advance(deadline, SEC_TO_NSEC(1));
		tsleep_abs_nsec(&nowake, PPAUSE, "worker", deadline);
	}

(Some of those interfaces don't actually exist, but they are easy to
write and you can infer how they work.)

Most developers probably do not care about maintaining a strict period
for periodic workloads, but I have a suspicion that it would keep
system performance more deterministic because you don't have various
periodic workloads drifting into one another, overlapping, and
momentarily causing utilization spikes.

I know that probably sounds far-fetched... it's an idea I've been
fussing with.
> > dlg@: You said you wanted to simplify this loop. Unsure what shape
> > that cleanup would take. Should we clean it up before this change?
>
> I wanted to have a single msleep_nsec call and pass INFSLP when it
> should sleep forever. You saw my first attempt at that. It had
> issues.
>
> > Thoughts? ok?
>
> How would this look if you used a uint64_t for nsecs for bd_rtout,
> and the nsec uptime thing from your pool diff instead of timespecs
> and nanouptime?

See the attached patch. It is shorter because more of the arithmetic
can be done inline with a uint64_t than with a timespec.
> What's the thinking behind nanouptime instead of getnanouptime?

In general we should prefer high resolution time unless there is a
compelling performance reason to use low-res time.

In particular, we should prefer high-res time whenever userspace
timeouts are involved, as userspace can only use high-res time.

For instance, when tsleep_nsec/etc. are reimplemented with kclock
timeouts (soon!) I will remove the use of low-res time from
nanosleep(2), select(2)/pselect(2), poll(2)/ppoll(2), and kevent(2).
The use of low-res time in these interfaces can cause undersleep
right now. They're buggy.
> More generally, what will getnanouptime do when ticks go away?

Ticks will probably never "go away" entirely.

In general, some CPU is always going to need to call tc_windup()
regularly to keep time moving forward. When tc_windup() is called we
update the low-res timestamps. So getnanouptime(9)/etc. will continue
to work as they do today. I also imagine that the CPU responsible for
calling tc_windup() will continue to increment `ticks' and `jiffies'.

In the near-ish future I want to add support for dynamic clock
interrupts. This would permit a CPU to stay idle, or drop into a
deeper power-saving state, for longer than 1 tick if it has no work to
do. This is not, strictly speaking, "tickless" operation, as the clock
interrupt is not disabled. But it is nice, so it is a near-term goal.

In the Distant Future we could add support for disabling the clock
interrupt on select CPUs. This might be useful for certain realtime
applications. This would be "true tickless" operation. But for this
to be useful we'd need to add support for realtime scheduling policies
(e.g. SCHED_FIFO) and support for binding threads to (and excluding
threads from) particular CPUs. So think Distant Future, if ever.
--
Anyway, updated patch here.
Index: bpf.c
===================================================================
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.199
diff -u -p -r1.199 bpf.c
--- bpf.c 26 Dec 2020 16:30:58 -0000 1.199
+++ bpf.c 29 Dec 2020 00:21:45 -0000
@@ -60,6 +60,7 @@
#include <sys/selinfo.h>
#include <sys/sigio.h>
#include <sys/task.h>
+#include <sys/time.h>
#include <net/if.h>
#include <net/bpf.h>
@@ -77,9 +78,6 @@
#define PRINET 26 /* interruptible */
-/* from kern/kern_clock.c; incremented each clock tick. */
-extern int ticks;
-
/*
* The default read buffer size is patchable.
*/
@@ -422,15 +420,30 @@ bpfclose(dev_t dev, int flag, int mode,
(d)->bd_sbuf = (d)->bd_fbuf; \
(d)->bd_slen = 0; \
(d)->bd_fbuf = NULL;
+
+/*
+ * TODO Move nsecuptime() into kern_tc.c and document it when we have
+ * more users elsewhere in the kernel.
+ */
+static uint64_t
+nsecuptime(void)
+{
+ struct timespec now;
+
+ nanouptime(&now);
+ return TIMESPEC_TO_NSEC(&now);
+}
+
/*
* bpfread - read next chunk of packets from buffers
*/
int
bpfread(dev_t dev, struct uio *uio, int ioflag)
{
+ uint64_t end, now;
struct bpf_d *d;
caddr_t hbuf;
- int end, error, hlen, nticks;
+ int error, hlen;
KERNEL_ASSERT_LOCKED();
@@ -453,8 +466,12 @@ bpfread(dev_t dev, struct uio *uio, int
/*
* If there's a timeout, mark when the read should end.
*/
- if (d->bd_rtout)
- end = ticks + (int)d->bd_rtout;
+ if (d->bd_rtout != 0) {
+ now = nsecuptime();
+ end = now + d->bd_rtout;
+ if (end < now)
+ end = UINT64_MAX;
+ }
/*
* If the hold buffer is empty, then do a timed sleep, which
@@ -489,11 +506,11 @@ bpfread(dev_t dev, struct uio *uio, int
error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
"bpf", INFSLP);
d->bd_nreaders--;
- } else if ((nticks = end - ticks) > 0) {
+ } else if ((now = nsecuptime()) < end) {
/* Read timeout has not expired yet. */
d->bd_nreaders++;
- error = msleep(d, &d->bd_mtx, PRINET|PCATCH, "bpf",
- nticks);
+ error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
+ "bpf", end - now);
d->bd_nreaders--;
} else {
/* Read timeout has expired. */
@@ -861,27 +878,19 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t
case BIOCSRTIMEOUT:
{
struct timeval *tv = (struct timeval *)addr;
- u_long rtout;
+ uint64_t rtout;
- /* Compute number of ticks. */
if (tv->tv_sec < 0 || !timerisvalid(tv)) {
error = EINVAL;
break;
}
- if (tv->tv_sec > INT_MAX / hz) {
- error = EOVERFLOW;
- break;
- }
- rtout = tv->tv_sec * hz;
- if (tv->tv_usec / tick > INT_MAX - rtout) {
+ rtout = TIMEVAL_TO_NSEC(tv);
+ if (rtout > MAXTSLP) {
error = EOVERFLOW;
break;
}
- rtout += tv->tv_usec / tick;
mtx_enter(&d->bd_mtx);
d->bd_rtout = rtout;
- if (d->bd_rtout == 0 && tv->tv_usec != 0)
- d->bd_rtout = 1;
mtx_leave(&d->bd_mtx);
break;
}
@@ -893,9 +902,9 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t
{
struct timeval *tv = (struct timeval *)addr;
+ memset(tv, 0, sizeof(*tv));
mtx_enter(&d->bd_mtx);
- tv->tv_sec = d->bd_rtout / hz;
- tv->tv_usec = (d->bd_rtout % hz) * tick;
+ NSEC_TO_TIMEVAL(d->bd_rtout, tv);
mtx_leave(&d->bd_mtx);
break;
}
Index: bpfdesc.h
===================================================================
RCS file: /cvs/src/sys/net/bpfdesc.h,v
retrieving revision 1.43
diff -u -p -r1.43 bpfdesc.h
--- bpfdesc.h 26 Dec 2020 16:30:58 -0000 1.43
+++ bpfdesc.h 29 Dec 2020 00:21:45 -0000
@@ -78,7 +78,7 @@ struct bpf_d {
int bd_in_uiomove; /* for debugging purpose */
struct bpf_if *bd_bif; /* interface descriptor */
- u_long bd_rtout; /* [m] Read timeout in 'ticks' */
+ uint64_t bd_rtout; /* [m] Read timeout in nanoseconds */
u_long bd_nreaders; /* [m] # threads asleep in bpfread() */
int bd_rnonblock; /* true if nonblocking reads are set */
struct bpf_program_smr