On Tue, Aug 15, 2017 at 17:14 +0200, Mike Belopuhov wrote: > Hi, > > I've just triggered an assert in hfsc_deferred (a callout) on an > MP kernel running on an SP virtual machine: > > panic: kernel diagnostic assertion "HFSC_ENABLED(ifq)" failed: file > "/home/mike/src/openbsd/sys/net/hfsc.c", line 950 > Stopped at db_enter+0x9: leave > TID PID UID PRFLAGS PFLAGS CPU COMMAND > *247463 28420 0 0x3 0 0 pfctl > db_enter() at db_enter+0x9 > > panic(ffffffff817f78f0,4,ffffffff81a3ffc0,ffffffff8110c140,ffff8000000c2060,fff > fffff81598b1c) at panic+0x102 > __assert(ffffffff81769d93,ffffffff817d7350,3b6,ffffffff817d72bd) at > __assert+0x > 35 > hfsc_deferred(ffff8000000c2060) at hfsc_deferred+0x9e > timeout_run(ffff80000004adc8) at timeout_run+0x4c > softclock(0) at softclock+0x146 > softintr_dispatch(0) at softintr_dispatch+0x9f > Xsoftclock() at Xsoftclock+0x1f > --- interrupt --- > end of kernel > end trace frame: 0x728d481974c08548, count: 7 > 0x2cfe9c031c90000: > https://www.openbsd.org/ddb.html describes the minimum info required in bug > reports. Insufficient info makes it difficult to find and fix bugs. > ddb{0}> ps > PID TID PPID UID S FLAGS WAIT COMMAND > *28420 247463 5000 0 7 0x3 pfctl > > > pfctl runs in the loop reloading the ruleset. So at some point we > disable HFSC on the interface but lose a race with hfsc_deferred > before re-enabling it. > > IFQ has a mechanism to lock the underlying object and I believe this > is the right tool for this job. Any other ideas? > > I don't think it's a good idea to hold the mutex (ifq_q_enter and > ifq_q_leave effectively lock and unlock it) during the ifq_start, > so we have to make a concession and run the ifq_start before knowing > whether or not HFSC is attached. IMO, it's a small price to pay to > avoide clutter. Kernel lock assertion is pointless at this point. > > OK? >
I've been running with this while debugging the issue with the active class list ("panic: kernel diagnostic assertion" from Aug 12 on bugs@) and I'm quite confident that this works and I don't observe the race anymore. In addition, I've figured we can keep the HFSC_ENABLED check as there is no issue with bailing early here: diff --git sys/net/hfsc.c sys/net/hfsc.c index 12504267dc5..c51f1406a0b 100644 --- sys/net/hfsc.c +++ sys/net/hfsc.c @@ -950,10 +950,13 @@ hfsc_deferred(void *arg) { struct ifnet *ifp = arg; struct ifqueue *ifq = &ifp->if_snd; struct hfsc_if *hif; + if (!HFSC_ENABLED(ifq)) + return; + if (!ifq_empty(ifq)) ifq_start(ifq); hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops); if (hif == NULL) > diff --git sys/net/hfsc.c sys/net/hfsc.c > index 410bea733c6..3c5b6f6ef78 100644 > --- sys/net/hfsc.c > +++ sys/net/hfsc.c > @@ -944,20 +944,19 @@ hfsc_deferred(void *arg) > { > struct ifnet *ifp = arg; > struct ifqueue *ifq = &ifp->if_snd; > struct hfsc_if *hif; > > - KERNEL_ASSERT_LOCKED(); > - KASSERT(HFSC_ENABLED(ifq)); > - > if (!ifq_empty(ifq)) > ifq_start(ifq); > > - hif = ifq->ifq_q; > - > + hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops); > + if (hif == NULL) > + return; > /* XXX HRTIMER nearest virtual/fit time is likely less than 1/HZ. */ > timeout_add(&hif->hif_defer, 1); > + ifq_q_leave(&ifp->if_snd, hif); > } > > void > hfsc_cl_purge(struct hfsc_if *hif, struct hfsc_class *cl, struct mbuf_list > *ml) > {