On Tue, Aug 15, 2017 at 17:14 +0200, Mike Belopuhov wrote:
> Hi,
> 
> I've just triggered an assert in hfsc_deferred (a callout) on an
> MP kernel running on an SP virtual machine:
> 
>   panic: kernel diagnostic assertion "HFSC_ENABLED(ifq)" failed: file 
> "/home/mike/src/openbsd/sys/net/hfsc.c", line 950
>   Stopped at      db_enter+0x9:   leave
>       TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>   *247463  28420      0         0x3          0    0  pfctl
>   db_enter() at db_enter+0x9
>   
> panic(ffffffff817f78f0,4,ffffffff81a3ffc0,ffffffff8110c140,ffff8000000c2060,fff
>   fffff81598b1c) at panic+0x102
>   __assert(ffffffff81769d93,ffffffff817d7350,3b6,ffffffff817d72bd) at 
> __assert+0x
>   35
>   hfsc_deferred(ffff8000000c2060) at hfsc_deferred+0x9e
>   timeout_run(ffff80000004adc8) at timeout_run+0x4c
>   softclock(0) at softclock+0x146
>   softintr_dispatch(0) at softintr_dispatch+0x9f
>   Xsoftclock() at Xsoftclock+0x1f
>   --- interrupt ---
>   end of kernel
>   end trace frame: 0x728d481974c08548, count: 7
>   0x2cfe9c031c90000:
>   https://www.openbsd.org/ddb.html describes the minimum info required in bug
>   reports.  Insufficient info makes it difficult to find and fix bugs.
>   ddb{0}> ps
>      PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
>   *28420  247463   5000      0  7         0x3                pfctl
> 
> 
> pfctl runs in the loop reloading the ruleset.  So at some point we
> disable HFSC on the interface but lose a race with hfsc_deferred
> before re-enabling it.
> 
> IFQ has a mechanism to lock the underlying object and I believe this
> is the right tool for this job.  Any other ideas?
> 
> I don't think it's a good idea to hold the mutex (ifq_q_enter and
> ifq_q_leave effectively lock and unlock it) during the ifq_start,
> so we have to make a concession and run the ifq_start before knowing
> whether or not HFSC is attached.  IMO, it's a small price to pay to
> avoide clutter.  Kernel lock assertion is pointless at this point.
> 
> OK?
>

I've been running with this while debugging the issue with the active
class list ("panic: kernel diagnostic assertion" from Aug 12 on bugs@)
and I'm quite confident that this works and I don't observe the race
anymore.

In addition, I've figured we can keep the HFSC_ENABLED check as there
is no issue with bailing early here:

diff --git sys/net/hfsc.c sys/net/hfsc.c
index 12504267dc5..c51f1406a0b 100644
--- sys/net/hfsc.c
+++ sys/net/hfsc.c
@@ -950,10 +950,13 @@ hfsc_deferred(void *arg)
 {
        struct ifnet *ifp = arg;
        struct ifqueue *ifq = &ifp->if_snd;
        struct hfsc_if *hif;
 
+       if (!HFSC_ENABLED(ifq))
+               return;
+
        if (!ifq_empty(ifq))
                ifq_start(ifq);
 
        hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops);
        if (hif == NULL)


> diff --git sys/net/hfsc.c sys/net/hfsc.c
> index 410bea733c6..3c5b6f6ef78 100644
> --- sys/net/hfsc.c
> +++ sys/net/hfsc.c
> @@ -944,20 +944,19 @@ hfsc_deferred(void *arg)
>  {
>       struct ifnet *ifp = arg;
>       struct ifqueue *ifq = &ifp->if_snd;
>       struct hfsc_if *hif;
>  
> -     KERNEL_ASSERT_LOCKED();
> -     KASSERT(HFSC_ENABLED(ifq));
> -
>       if (!ifq_empty(ifq))
>               ifq_start(ifq);
>  
> -     hif = ifq->ifq_q;
> -
> +     hif = ifq_q_enter(&ifp->if_snd, ifq_hfsc_ops);
> +     if (hif == NULL)
> +             return;
>       /* XXX HRTIMER nearest virtual/fit time is likely less than 1/HZ. */
>       timeout_add(&hif->hif_defer, 1);
> +     ifq_q_leave(&ifp->if_snd, hif);
>  }
>  
>  void
>  hfsc_cl_purge(struct hfsc_if *hif, struct hfsc_class *cl, struct mbuf_list 
> *ml)
>  {

Reply via email to