Re: wg destroy hangs
On Wed, Oct 04, 2023 at 11:31:47PM +0200, Alexander Bluhm wrote:
> On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:
> > On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > > >
> > > > > $ ps axlww
> > > > >   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
> > > >
> > > > Here it happened again:
> > > >
> > > >     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
> > >
> > > wg_peer_destroy()
> > > ...
> > > 	NET_LOCK();
> > > 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> > > 		NET_UNLOCK();
> > > 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > > 		NET_LOCK();
> > > 	}
> > > 	NET_UNLOCK();
> > >
> > > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > > millisecond.  But that may be per packet.  So if you have a long
> > > queue or the queue refills somehow, it will take forever.
> > >
> > > I think the difference in the usage is constant traffic that keeps
> > > the send queue full.  The timeout hides the problem when there are
> > > only a few packets.
> >
> > This should ensure wg_qstart() will not dereference the dying `peer'.
> > Looks crappy and potentially could block forever, but should work.
> > However, the netlock is unnecessary here.  The netlocked wg_output()
> > could fill `if_snd' while the netlock is released before tsleep(), so
> > it serializes nothing but stops packet processing.
> >
> > Kirill, does this diff help?
>
> I doubt that it changes much.  When the netlock is not taken, the queue
> can still be filled with packets.
>
> Removing this ugly netlock makes sense anyway.  But without any
> synchronisation just reading a variable feels wrong.  Can we add a
> READ_ONCE() like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
> also looks very unsafe.
> For mbuf queues I added a mutex, interface queues should do the same.
>
> ok?

I reverted this diff since it breaks the API.  There are numerous cases
where ifq_set_maxlen() is called before the mutex is initialized, so
doing this does not work.

I also question this diff in general.  ifq_set_maxlen() is not called
concurrently, it is called when the interface is attached.  So there is
no need for a mutex here.  Also the READ_ONCE() added seems not needed.
ifiq_len() is unused and ifiq_empty() is only used by ifiq_process().
So maybe that call should be moved into the mutex protected block; the
task should only run when the ifiq_ml has data enqueued.

> Index: net/ifq.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
> retrieving revision 1.50
> diff -u -p -r1.50 ifq.c
> --- net/ifq.c	30 Jul 2023 05:39:52 -0000	1.50
> +++ net/ifq.c	4 Oct 2023 21:04:20 -0000
> @@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
>  	return (len);
>  }
>  
> +void
> +ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
> +{
> +	mtx_enter(&ifq->ifq_mtx);
> +	ifq->ifq_maxlen = maxlen;
> +	mtx_leave(&ifq->ifq_mtx);
> +}
> +
>  unsigned int
>  ifq_purge(struct ifqueue *ifq)
>  {
> Index: net/ifq.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
> retrieving revision 1.38
> diff -u -p -r1.38 ifq.h
> --- net/ifq.h	30 Jul 2023 05:39:52 -0000	1.38
> +++ net/ifq.h	4 Oct 2023 21:09:04 -0000
> @@ -435,6 +435,7 @@ void		ifq_deq_commit(struct ifqueue *,
>  void		ifq_deq_rollback(struct ifqueue *, struct mbuf *);
>  struct mbuf	*ifq_dequeue(struct ifqueue *);
>  int		ifq_hdatalen(struct ifqueue *);
> +void		ifq_set_maxlen(struct ifqueue *, unsigned int);
>  void		ifq_mfreem(struct ifqueue *, struct mbuf *);
>  void		ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
>  unsigned int	ifq_purge(struct ifqueue *);
> @@ -448,9 +449,8 @@ int	ifq_deq_sleep(struct ifqueue *, st
>  		    const char *, volatile unsigned int *,
>  		    volatile unsigned int *);
>  
> -#define ifq_len(_ifq)			((_ifq)->ifq_len)
> -#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
> -#define ifq_set_maxlen(_ifq, _l)	((_ifq)->ifq_maxlen = (_l))
> +#define ifq_len(_ifq)			READ_ONCE((_ifq)->ifq_len)
> +#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
>  
>  static inline int
>  ifq_is_priq(struct ifqueue *ifq)
> @@ -490,8 +490,8 @@ int	ifiq_input(struct ifiqueue *, stru
>  int	ifiq_enqueue(struct ifiqueue *, struct mbuf *);
>  void	ifiq_add_data(struct ifiqueue *, struct if_data *);
>  
> -#define
Re: wg destroy hangs
On Thu, Oct 05, 2023 at 07:15:23AM +0200, Kirill Miazine wrote:
> > This diff checks the IFF_RUNNING flag within the while (!ifq_empty())
> > loop of wg_peer_destroy().  If the flag is not set the queue will be
> > purged and the check performed again.  I intentionally keep the
> > netlock to prevent ifconfig manipulations on the interface.
>
> I confirm that just the diff below solved the issue

> > +	 * XXX: `if_snd' of stopped interface could still packets

This sentence is missing a verb.
... could still contain packets?
Or: `if_snd' of stopped interface does not consume packets

OK bluhm@

> > Index: sys/net/if_wg.c
> > ===================================================================
> > RCS file: /cvs/src/sys/net/if_wg.c,v
> > retrieving revision 1.31
> > diff -u -p -r1.31 if_wg.c
> > --- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
> > +++ sys/net/if_wg.c	4 Oct 2023 23:09:14 -0000
> > @@ -509,6 +509,13 @@ wg_peer_destroy(struct wg_peer *peer)
> >  
> >  	NET_LOCK();
> >  	while (!ifq_empty(&sc->sc_if.if_snd)) {
> > +		/*
> > +		 * XXX: `if_snd' of stopped interface could still packets
> > +		 */
> > +		if (!ISSET(sc->sc_if.if_flags, IFF_RUNNING)) {
> > +			ifq_purge(&sc->sc_if.if_snd);
> > +			continue;
> > +		}
> >  		NET_UNLOCK();
> >  		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> >  		NET_LOCK();
Re: wg destroy hangs
• Vitaliy Makkoveev [2023-10-05 01:10]:
> On Thu, Oct 05, 2023 at 12:08:55AM +0200, Kirill Miazine wrote:
> > • Vitaliy Makkoveev [2023-10-05 00:02]:
> > > > On 5 Oct 2023, at 00:56, Kirill Miazine wrote:
> > > >
> > > > new diff doesn't prevent hang in test scenario either.
> > >
> > > Which one?
> >
> > I meant to say new diffS, as I had applied both... what I have now is this:
>
> Understood.  The problem lies here:
>
> ifq_start_task(void *p)
> {
> 	struct ifqueue *ifq = p;
> 	struct ifnet *ifp = ifq->ifq_if;
>
> 	if (!ISSET(ifp->if_flags, IFF_RUNNING) ||
> 	    ifq_empty(ifq) || ifq_is_oactive(ifq))
> 		return;
>
> 	ifp->if_qstart(ifq);
> }
>
> wg_qstart(struct ifqueue *ifq)
> {
> 	/* [...] */
> 	while ((m = ifq_dequeue(ifq)) != NULL) {
> 	/* [...] */
> }
>
> wg_peer_destroy(struct wg_peer *peer)
> {
> 	/* [...] */
> 	NET_LOCK();
> 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> 		NET_UNLOCK();
> 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> 		NET_LOCK();
> 	}
> 	NET_UNLOCK();
> 	/* [...] */
> }
>
> 1. wg_output() placed some packets into sc->sc_if.if_snd and scheduled
>    ifq_start_task() to run.
> 2. You performed "ifconfig wg1 down", so wg_down() cleared the
>    IFF_RUNNING flag.
> 3. ifq_start_task() started to run.  IFF_RUNNING is not set, so
>    wg_qstart() will not be called and nothing does ifq_dequeue().
>    Packets rest within sc->sc_if.if_snd.  The interface is down, so
>    nothing will schedule ifq_start_task() to run again.
> 4. You performed "ifconfig wg1 destroy".  The while (!ifq_empty()) loop
>    is infinite because nothing will empty sc->sc_if.if_snd at this
>    point.
>
> The unlocked !ISSET(ifp->if_flags, IFF_RUNNING), ifq_empty() and
> ifq_is_oactive() are bad, but the netlock dances provide cache
> synchronisation.  I have no quick solution for this.  Probably we
> should rethink ifq_start_task().
>
> This diff checks the IFF_RUNNING flag within the while (!ifq_empty())
> loop of wg_peer_destroy().  If the flag is not set the queue will be
> purged and the check performed again.  I intentionally keep the
> netlock to prevent ifconfig manipulations on the interface.
I confirm that just the diff below solved the issue

> Index: sys/net/if_wg.c
> ===================================================================
> RCS file: /cvs/src/sys/net/if_wg.c,v
> retrieving revision 1.31
> diff -u -p -r1.31 if_wg.c
> --- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
> +++ sys/net/if_wg.c	4 Oct 2023 23:09:14 -0000
> @@ -509,6 +509,13 @@ wg_peer_destroy(struct wg_peer *peer)
>  
>  	NET_LOCK();
>  	while (!ifq_empty(&sc->sc_if.if_snd)) {
> +		/*
> +		 * XXX: `if_snd' of stopped interface could still packets
> +		 */
> +		if (!ISSET(sc->sc_if.if_flags, IFF_RUNNING)) {
> +			ifq_purge(&sc->sc_if.if_snd);
> +			continue;
> +		}
>  		NET_UNLOCK();
>  		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
>  		NET_LOCK();
Re: wg destroy hangs
On Thu, Oct 05, 2023 at 12:08:55AM +0200, Kirill Miazine wrote:
> • Vitaliy Makkoveev [2023-10-05 00:02]:
> > > On 5 Oct 2023, at 00:56, Kirill Miazine wrote:
> > >
> > > new diff doesn't prevent hang in test scenario either.
> >
> > Which one?
>
> I meant to say new diffS, as I had applied both... what I have now is this:

Understood.  The problem lies here:

ifq_start_task(void *p)
{
	struct ifqueue *ifq = p;
	struct ifnet *ifp = ifq->ifq_if;

	if (!ISSET(ifp->if_flags, IFF_RUNNING) ||
	    ifq_empty(ifq) || ifq_is_oactive(ifq))
		return;

	ifp->if_qstart(ifq);
}

wg_qstart(struct ifqueue *ifq)
{
	/* [...] */
	while ((m = ifq_dequeue(ifq)) != NULL) {
	/* [...] */
}

wg_peer_destroy(struct wg_peer *peer)
{
	/* [...] */
	NET_LOCK();
	while (!ifq_empty(&sc->sc_if.if_snd)) {
		NET_UNLOCK();
		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
		NET_LOCK();
	}
	NET_UNLOCK();
	/* [...] */
}

1. wg_output() placed some packets into sc->sc_if.if_snd and scheduled
   ifq_start_task() to run.
2. You performed "ifconfig wg1 down", so wg_down() cleared the
   IFF_RUNNING flag.
3. ifq_start_task() started to run.  IFF_RUNNING is not set, so
   wg_qstart() will not be called and nothing does ifq_dequeue().
   Packets rest within sc->sc_if.if_snd.  The interface is down, so
   nothing will schedule ifq_start_task() to run again.
4. You performed "ifconfig wg1 destroy".  The while (!ifq_empty()) loop
   is infinite because nothing will empty sc->sc_if.if_snd at this
   point.

The unlocked !ISSET(ifp->if_flags, IFF_RUNNING), ifq_empty() and
ifq_is_oactive() are bad, but the netlock dances provide cache
synchronisation.  I have no quick solution for this.  Probably we
should rethink ifq_start_task().

This diff checks the IFF_RUNNING flag within the while (!ifq_empty())
loop of wg_peer_destroy().  If the flag is not set the queue will be
purged and the check performed again.  I intentionally keep the netlock
to prevent ifconfig manipulations on the interface.
Index: sys/net/if_wg.c
===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
+++ sys/net/if_wg.c	4 Oct 2023 23:09:14 -0000
@@ -509,6 +509,13 @@ wg_peer_destroy(struct wg_peer *peer)
 
 	NET_LOCK();
 	while (!ifq_empty(&sc->sc_if.if_snd)) {
+		/*
+		 * XXX: `if_snd' of stopped interface could still packets
+		 */
+		if (!ISSET(sc->sc_if.if_flags, IFF_RUNNING)) {
+			ifq_purge(&sc->sc_if.if_snd);
+			continue;
+		}
 		NET_UNLOCK();
 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
 		NET_LOCK();
Re: wg destroy hangs
• Vitaliy Makkoveev [2023-10-05 00:02]:
> > On 5 Oct 2023, at 00:56, Kirill Miazine wrote:
> >
> > new diff doesn't prevent hang in test scenario either.
>
> Which one?

I meant to say new diffS, as I had applied both... what I have now is this:

===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
+++ net/if_wg.c	4 Oct 2023 22:05:05 -0000
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
 	noise_remote_clear(&peer->p_remote);
 
-	NET_LOCK();
-	while (!ifq_empty(&sc->sc_if.if_snd)) {
-		NET_UNLOCK();
+	while (!ifq_empty(&sc->sc_if.if_snd))
 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-		NET_LOCK();
-	}
-	NET_UNLOCK();
 
 	taskq_barrier(wg_crypt_taskq);
 	taskq_barrier(net_tq(sc->sc_if.if_index));
@@ -2580,6 +2575,7 @@ wg_down(struct wg_softc *sc)
 	wg_unbind(sc);
 	rw_exit_read(&sc->sc_lock);
 	NET_LOCK();
+	ifq_purge(&sc->sc_if.if_snd);
 }
 
 int
Index: net/ifq.c
===================================================================
RCS file: /cvs/src/sys/net/ifq.c,v
retrieving revision 1.50
diff -u -p -r1.50 ifq.c
--- net/ifq.c	30 Jul 2023 05:39:52 -0000	1.50
+++ net/ifq.c	4 Oct 2023 22:05:05 -0000
@@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
 	return (len);
 }
 
+void
+ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
+{
+	mtx_enter(&ifq->ifq_mtx);
+	ifq->ifq_maxlen = maxlen;
+	mtx_leave(&ifq->ifq_mtx);
+}
+
 unsigned int
 ifq_purge(struct ifqueue *ifq)
 {
Index: net/ifq.h
===================================================================
RCS file: /cvs/src/sys/net/ifq.h,v
retrieving revision 1.38
diff -u -p -r1.38 ifq.h
--- net/ifq.h	30 Jul 2023 05:39:52 -0000	1.38
+++ net/ifq.h	4 Oct 2023 22:05:05 -0000
@@ -435,6 +435,7 @@ void		ifq_deq_commit(struct ifqueue *,
 void		ifq_deq_rollback(struct ifqueue *, struct mbuf *);
 struct mbuf	*ifq_dequeue(struct ifqueue *);
 int		ifq_hdatalen(struct ifqueue *);
+void		ifq_set_maxlen(struct ifqueue *, unsigned int);
 void		ifq_mfreem(struct ifqueue *, struct mbuf *);
 void		ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
 unsigned int	ifq_purge(struct ifqueue *);
@@ -448,9 +449,8 @@ int	ifq_deq_sleep(struct ifqueue *, st
 		    const char *, volatile unsigned int *,
 		    volatile unsigned int *);
 
-#define ifq_len(_ifq)			((_ifq)->ifq_len)
-#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
-#define ifq_set_maxlen(_ifq, _l)	((_ifq)->ifq_maxlen = (_l))
+#define ifq_len(_ifq)			READ_ONCE((_ifq)->ifq_len)
+#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
 
 static inline int
 ifq_is_priq(struct ifqueue *ifq)
@@ -490,8 +490,8 @@ int	ifiq_input(struct ifiqueue *, stru
 int	ifiq_enqueue(struct ifiqueue *, struct mbuf *);
 void	ifiq_add_data(struct ifiqueue *, struct if_data *);
 
-#define ifiq_len(_ifiq)		ml_len(&(_ifiq)->ifiq_ml)
-#define ifiq_empty(_ifiq)	ml_empty(&(_ifiq)->ifiq_ml)
+#define ifiq_len(_ifiq)		READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
+#define ifiq_empty(_ifiq)	(ifiq_len(_ifiq) == 0)
 
 #endif /* _KERNEL */
Re: wg destroy hangs
> On 5 Oct 2023, at 00:56, Kirill Miazine wrote:
>
> new diff doesn't prevent hang in test scenario either.

Which one?
Re: wg destroy hangs
• Vitaliy Makkoveev [2023-10-04 23:38]:
> > On 5 Oct 2023, at 00:31, Alexander Bluhm wrote:
> >
> > On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:
> >> On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> >>> On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> >>>>> If it happens again, could you send an 'ps axlww | grep ifconfig'
> >>>>> output?  Then we see the wait channel where it hangs in the kernel.
> >>>>>
> >>>>> $ ps axlww
> >>>>>   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
> >>>>
> >>>> Here it happened again:
> >>>>
> >>>>     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
> >>>
> >>> wg_peer_destroy()
> >>> ...
> >>> 	NET_LOCK();
> >>> 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> >>> 		NET_UNLOCK();
> >>> 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> >>> 		NET_LOCK();
> >>> 	}
> >>> 	NET_UNLOCK();
> >>>
> >>> This net lock dance looks fishy.  And the sleep has a timeout of 1
> >>> millisecond.  But that may be per packet.  So if you have a long
> >>> queue or the queue refills somehow, it will take forever.
> >>>
> >>> I think the difference in the usage is constant traffic that keeps
> >>> the send queue full.  The timeout hides the problem when there are
> >>> only a few packets.
> >>
> >> This should ensure wg_qstart() will not dereference the dying `peer'.
> >> Looks crappy and potentially could block forever, but should work.
> >> However, the netlock is unnecessary here.  The netlocked wg_output()
> >> could fill `if_snd' while the netlock is released before tsleep(), so
> >> it serializes nothing but stops packet processing.
> >>
> >> Kirill, does this diff help?
> >
> > I doubt that it changes much.  When the netlock is not taken, the queue
> > can still be filled with packets.
> >
> > Removing this ugly netlock makes sense anyway.  But without any
> > synchronisation just reading a variable feels wrong.  Can we add a
> > READ_ONCE() like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
> > also looks very unsafe.  For mbuf queues I added a mutex, interface
> > queues should do the same.
> >
> > ok?
>
> I guess this is a uniprocessor machine, so synchronisation is not
> related.

new diff doesn't prevent hang in test scenario either.
wg destroy would hang on both UP and MP machines -- the fresh Vultr
test machine is MP.

> diff is ok mvs.
>
> > bluhm
> >
> > Index: net/ifq.c
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
> > retrieving revision 1.50
> > diff -u -p -r1.50 ifq.c
> > --- net/ifq.c	30 Jul 2023 05:39:52 -0000	1.50
> > +++ net/ifq.c	4 Oct 2023 21:04:20 -0000
> > @@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
> >  	return (len);
> >  }
> >  
> > +void
> > +ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
> > +{
> > +	mtx_enter(&ifq->ifq_mtx);
> > +	ifq->ifq_maxlen = maxlen;
> > +	mtx_leave(&ifq->ifq_mtx);
> > +}
> > +
> >  unsigned int
> >  ifq_purge(struct ifqueue *ifq)
> >  {
> > Index: net/ifq.h
> > ===================================================================
> > RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
> > retrieving revision 1.38
> > diff -u -p -r1.38 ifq.h
> > --- net/ifq.h	30 Jul 2023 05:39:52 -0000	1.38
> > +++ net/ifq.h	4 Oct 2023 21:09:04 -0000
> > @@ -435,6 +435,7 @@ void		ifq_deq_commit(struct ifqueue *,
> >  void		ifq_deq_rollback(struct ifqueue *, struct mbuf *);
> >  struct mbuf	*ifq_dequeue(struct ifqueue *);
> >  int		ifq_hdatalen(struct ifqueue *);
> > +void		ifq_set_maxlen(struct ifqueue *, unsigned int);
> >  void		ifq_mfreem(struct ifqueue *, struct mbuf *);
> >  void		ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
> >  unsigned int	ifq_purge(struct ifqueue *);
> > @@ -448,9 +449,8 @@ int	ifq_deq_sleep(struct ifqueue *, st
> >  		    const char *, volatile unsigned int *,
> >  		    volatile unsigned int *);
> >  
> > -#define ifq_len(_ifq)			((_ifq)->ifq_len)
> > -#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
> > -#define ifq_set_maxlen(_ifq, _l)	((_ifq)->ifq_maxlen = (_l))
> > +#define ifq_len(_ifq)			READ_ONCE((_ifq)->ifq_len)
> > +#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
> >  
> >  static inline int
> >  ifq_is_priq(struct ifqueue *ifq)
> > @@ -490,8 +490,8 @@ int	ifiq_input(struct ifiqueue *, stru
> >  int	ifiq_enqueue(struct ifiqueue *, struct mbuf *);
> >  void	ifiq_add_data(struct ifiqueue *, struct if_data *);
> >  
> > -#define ifiq_len(_ifiq)		ml_len(&(_ifiq)->ifiq_ml)
> > -#define ifiq_empty(_ifiq)	ml_empty(&(_ifiq)->ifiq_ml)
> > +#define ifiq_len(_ifiq)		READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
> > +#define ifiq_empty(_ifiq)	(ifiq_len(_ifiq) == 0)
> >  
> >  #endif /* _KERNEL */
Re: wg destroy hangs
> On 5 Oct 2023, at 00:31, Alexander Bluhm wrote:
>
> On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:
>> On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
>>> On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
>>>>> If it happens again, could you send an 'ps axlww | grep ifconfig'
>>>>> output?  Then we see the wait channel where it hangs in the kernel.
>>>>>
>>>>> $ ps axlww
>>>>>   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
>>>>
>>>> Here it happened again:
>>>>
>>>>     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
>>>
>>> wg_peer_destroy()
>>> ...
>>> 	NET_LOCK();
>>> 	while (!ifq_empty(&sc->sc_if.if_snd)) {
>>> 		NET_UNLOCK();
>>> 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
>>> 		NET_LOCK();
>>> 	}
>>> 	NET_UNLOCK();
>>>
>>> This net lock dance looks fishy.  And the sleep has a timeout of 1
>>> millisecond.  But that may be per packet.  So if you have a long
>>> queue or the queue refills somehow, it will take forever.
>>>
>>> I think the difference in the usage is constant traffic that keeps
>>> the send queue full.  The timeout hides the problem when there are
>>> only a few packets.
>>
>> This should ensure wg_qstart() will not dereference the dying `peer'.
>> Looks crappy and potentially could block forever, but should work.
>> However, the netlock is unnecessary here.  The netlocked wg_output()
>> could fill `if_snd' while the netlock is released before tsleep(), so
>> it serializes nothing but stops packet processing.
>>
>> Kirill, does this diff help?
>
> I doubt that it changes much.  When the netlock is not taken, the queue
> can still be filled with packets.
>
> Removing this ugly netlock makes sense anyway.  But without any
> synchronisation just reading a variable feels wrong.  Can we add a
> READ_ONCE() like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
> also looks very unsafe.  For mbuf queues I added a mutex, interface
> queues should do the same.
>
> ok?

I guess this is a uniprocessor machine, so synchronisation is not
related.

diff is ok mvs.
> bluhm
>
> Index: net/ifq.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
> retrieving revision 1.50
> diff -u -p -r1.50 ifq.c
> --- net/ifq.c	30 Jul 2023 05:39:52 -0000	1.50
> +++ net/ifq.c	4 Oct 2023 21:04:20 -0000
> @@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
>  	return (len);
>  }
>  
> +void
> +ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
> +{
> +	mtx_enter(&ifq->ifq_mtx);
> +	ifq->ifq_maxlen = maxlen;
> +	mtx_leave(&ifq->ifq_mtx);
> +}
> +
>  unsigned int
>  ifq_purge(struct ifqueue *ifq)
>  {
> Index: net/ifq.h
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
> retrieving revision 1.38
> diff -u -p -r1.38 ifq.h
> --- net/ifq.h	30 Jul 2023 05:39:52 -0000	1.38
> +++ net/ifq.h	4 Oct 2023 21:09:04 -0000
> @@ -435,6 +435,7 @@ void		ifq_deq_commit(struct ifqueue *,
>  void		ifq_deq_rollback(struct ifqueue *, struct mbuf *);
>  struct mbuf	*ifq_dequeue(struct ifqueue *);
>  int		ifq_hdatalen(struct ifqueue *);
> +void		ifq_set_maxlen(struct ifqueue *, unsigned int);
>  void		ifq_mfreem(struct ifqueue *, struct mbuf *);
>  void		ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
>  unsigned int	ifq_purge(struct ifqueue *);
> @@ -448,9 +449,8 @@ int	ifq_deq_sleep(struct ifqueue *, st
>  		    const char *, volatile unsigned int *,
>  		    volatile unsigned int *);
>  
> -#define ifq_len(_ifq)			((_ifq)->ifq_len)
> -#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
> -#define ifq_set_maxlen(_ifq, _l)	((_ifq)->ifq_maxlen = (_l))
> +#define ifq_len(_ifq)			READ_ONCE((_ifq)->ifq_len)
> +#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
>  
>  static inline int
>  ifq_is_priq(struct ifqueue *ifq)
> @@ -490,8 +490,8 @@ int	ifiq_input(struct ifiqueue *, stru
>  int	ifiq_enqueue(struct ifiqueue *, struct mbuf *);
>  void	ifiq_add_data(struct ifiqueue *, struct if_data *);
>  
> -#define ifiq_len(_ifiq)		ml_len(&(_ifiq)->ifiq_ml)
> -#define ifiq_empty(_ifiq)	ml_empty(&(_ifiq)->ifiq_ml)
> +#define ifiq_len(_ifiq)		READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
> +#define ifiq_empty(_ifiq)	(ifiq_len(_ifiq) == 0)
>  
>  #endif /* _KERNEL */
Re: wg destroy hangs
On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:
> On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > >
> > > > $ ps axlww
> > > >   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
> > >
> > > Here it happened again:
> > >
> > >     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
> >
> > wg_peer_destroy()
> > ...
> > 	NET_LOCK();
> > 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> > 		NET_UNLOCK();
> > 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > 		NET_LOCK();
> > 	}
> > 	NET_UNLOCK();
> >
> > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > millisecond.  But that may be per packet.  So if you have a long
> > queue or the queue refills somehow, it will take forever.
> >
> > I think the difference in the usage is constant traffic that keeps
> > the send queue full.  The timeout hides the problem when there are
> > only a few packets.
>
> This should ensure wg_qstart() will not dereference the dying `peer'.
> Looks crappy and potentially could block forever, but should work.
> However, the netlock is unnecessary here.  The netlocked wg_output()
> could fill `if_snd' while the netlock is released before tsleep(), so
> it serializes nothing but stops packet processing.
>
> Kirill, does this diff help?

I doubt that it changes much.  When the netlock is not taken, the queue
can still be filled with packets.

Removing this ugly netlock makes sense anyway.  But without any
synchronisation just reading a variable feels wrong.  Can we add a
READ_ONCE() like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
also looks very unsafe.  For mbuf queues I added a mutex, interface
queues should do the same.

ok?
bluhm

Index: net/ifq.c
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
retrieving revision 1.50
diff -u -p -r1.50 ifq.c
--- net/ifq.c	30 Jul 2023 05:39:52 -0000	1.50
+++ net/ifq.c	4 Oct 2023 21:04:20 -0000
@@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
 	return (len);
 }
 
+void
+ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
+{
+	mtx_enter(&ifq->ifq_mtx);
+	ifq->ifq_maxlen = maxlen;
+	mtx_leave(&ifq->ifq_mtx);
+}
+
 unsigned int
 ifq_purge(struct ifqueue *ifq)
 {
Index: net/ifq.h
===================================================================
RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
retrieving revision 1.38
diff -u -p -r1.38 ifq.h
--- net/ifq.h	30 Jul 2023 05:39:52 -0000	1.38
+++ net/ifq.h	4 Oct 2023 21:09:04 -0000
@@ -435,6 +435,7 @@ void		ifq_deq_commit(struct ifqueue *,
 void		ifq_deq_rollback(struct ifqueue *, struct mbuf *);
 struct mbuf	*ifq_dequeue(struct ifqueue *);
 int		ifq_hdatalen(struct ifqueue *);
+void		ifq_set_maxlen(struct ifqueue *, unsigned int);
 void		ifq_mfreem(struct ifqueue *, struct mbuf *);
 void		ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
 unsigned int	ifq_purge(struct ifqueue *);
@@ -448,9 +449,8 @@ int	ifq_deq_sleep(struct ifqueue *, st
 		    const char *, volatile unsigned int *,
 		    volatile unsigned int *);
 
-#define ifq_len(_ifq)			((_ifq)->ifq_len)
-#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
-#define ifq_set_maxlen(_ifq, _l)	((_ifq)->ifq_maxlen = (_l))
+#define ifq_len(_ifq)			READ_ONCE((_ifq)->ifq_len)
+#define ifq_empty(_ifq)			(ifq_len(_ifq) == 0)
 
 static inline int
 ifq_is_priq(struct ifqueue *ifq)
@@ -490,8 +490,8 @@ int	ifiq_input(struct ifiqueue *, stru
 int	ifiq_enqueue(struct ifiqueue *, struct mbuf *);
 void	ifiq_add_data(struct ifiqueue *, struct if_data *);
 
-#define ifiq_len(_ifiq)		ml_len(&(_ifiq)->ifiq_ml)
-#define ifiq_empty(_ifiq)	ml_empty(&(_ifiq)->ifiq_ml)
+#define ifiq_len(_ifiq)		READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
+#define ifiq_empty(_ifiq)	(ifiq_len(_ifiq) == 0)
 
 #endif /* _KERNEL */
Re: wg destroy hangs
On Wed, Oct 04, 2023 at 11:07:24PM +0200, Kirill Miazine wrote:
> • Vitaliy Makkoveev [2023-10-04 22:03]:
> > On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > > >
> > > > > $ ps axlww
> > > > >   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
> > > >
> > > > Here it happened again:
> > > >
> > > >     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
> > >
> > > wg_peer_destroy()
> > > ...
> > > 	NET_LOCK();
> > > 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> > > 		NET_UNLOCK();
> > > 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > > 		NET_LOCK();
> > > 	}
> > > 	NET_UNLOCK();
> > >
> > > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > > millisecond.  But that may be per packet.  So if you have a long
> > > queue or the queue refills somehow, it will take forever.
> > >
> > > I think the difference in the usage is constant traffic that keeps
> > > the send queue full.  The timeout hides the problem when there are
> > > only a few packets.
> >
> > This should ensure wg_qstart() will not dereference the dying `peer'.
> > Looks crappy and potentially could block forever, but should work.
> > However, the netlock is unnecessary here.  The netlocked wg_output()
> > could fill `if_snd' while the netlock is released before tsleep(), so
> > it serializes nothing but stops packet processing.
> >
> > Kirill, does this diff help?
>
> nope, same hang.
>
> tested on a fresh Vultr VM with -current and patch below.  VM got added
> to my normal WG network, and VM was accessed by SSH over that WG
> network.
> then:
>
> # ifconfig wg1 down (from ssh -- connection to ssh session disappears)
> # ifconfig wg1 delete (from console)
> # ifconfig wg1 destroy (from console -- command hangs)
>
> interestingly, destroy works fine from ssh when commands are entered in
> a tmux session and executed immediately after each other:
>
> # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy
>
> looks like a timing issue.

Looks like a packet got stuck in `if_snd'.  Hypothetically this hack
should help.  Please note, even if it works, I don't want to commit it.
Someone should introduce a reference counter to wg_peer and remove this
crap from wg_peer_destroy().

Index: sys/net/if_wg.c
===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
+++ sys/net/if_wg.c	4 Oct 2023 21:21:40 -0000
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
 	noise_remote_clear(&peer->p_remote);
 
-	NET_LOCK();
-	while (!ifq_empty(&sc->sc_if.if_snd)) {
-		NET_UNLOCK();
+	while (!ifq_empty(&sc->sc_if.if_snd))
 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-		NET_LOCK();
-	}
-	NET_UNLOCK();
 
 	taskq_barrier(wg_crypt_taskq);
 	taskq_barrier(net_tq(sc->sc_if.if_index));
@@ -2580,6 +2575,7 @@ wg_down(struct wg_softc *sc)
 	wg_unbind(sc);
 	rw_exit_read(&sc->sc_lock);
 	NET_LOCK();
+	ifq_purge(&sc->sc_if.if_snd);
 }
 
 int
Re: wg destroy hangs
• Vitaliy Makkoveev [2023-10-04 22:03]:
> On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > >
> > > > $ ps axlww
> > > >   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
> > >
> > > Here it happened again:
> > >
> > >     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
> >
> > wg_peer_destroy()
> > ...
> > 	NET_LOCK();
> > 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> > 		NET_UNLOCK();
> > 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > 		NET_LOCK();
> > 	}
> > 	NET_UNLOCK();
> >
> > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > millisecond.  But that may be per packet.  So if you have a long
> > queue or the queue refills somehow, it will take forever.
> >
> > I think the difference in the usage is constant traffic that keeps
> > the send queue full.  The timeout hides the problem when there are
> > only a few packets.
>
> This should ensure wg_qstart() will not dereference the dying `peer'.
> Looks crappy and potentially could block forever, but should work.
> However, the netlock is unnecessary here.  The netlocked wg_output()
> could fill `if_snd' while the netlock is released before tsleep(), so
> it serializes nothing but stops packet processing.
>
> Kirill, does this diff help?

nope, same hang.

tested on a fresh Vultr VM with -current and patch below.  VM got added
to my normal WG network, and VM was accessed by SSH over that WG
network.

then:

# ifconfig wg1 down (from ssh -- connection to ssh session disappears)
# ifconfig wg1 delete (from console)
# ifconfig wg1 destroy (from console -- command hangs)

interestingly, destroy works fine from ssh when commands are entered in
a tmux session and executed immediately after each other:

# ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy

looks like a timing issue.
Index: sys/net/if_wg.c
===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
+++ sys/net/if_wg.c	4 Oct 2023 20:01:16 -0000
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
 	noise_remote_clear(&peer->p_remote);
 
-	NET_LOCK();
-	while (!ifq_empty(&sc->sc_if.if_snd)) {
-		NET_UNLOCK();
+	while (!ifq_empty(&sc->sc_if.if_snd))
 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-		NET_LOCK();
-	}
-	NET_UNLOCK();
 
 	taskq_barrier(wg_crypt_taskq);
 	taskq_barrier(net_tq(sc->sc_if.if_index));
Re: wg destroy hangs
On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > > output?  Then we see the wait channel where it hangs in the kernel.
> > >
> > > $ ps axlww
> > >   UID   PID  PPID  CPU PRI  NI   VSZ   RSS WCHAN   STAT  TT       TIME COMMAND
> >
> > Here it happened again:
> >
> >     0 75339 23922    0  10   0   360   296 wg_ifq  D+U   p0    0:00.00 ifconfig wg1 destroy
>
> wg_peer_destroy()
> ...
> 	NET_LOCK();
> 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> 		NET_UNLOCK();
> 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> 		NET_LOCK();
> 	}
> 	NET_UNLOCK();
>
> This net lock dance looks fishy.  And the sleep has a timeout of 1
> millisecond.  But that may be per packet.  So if you have a long
> queue or the queue refills somehow, it will take forever.
>
> I think the difference in the usage is constant traffic that keeps
> the send queue full.  The timeout hides the problem when there are
> only a few packets.

This should ensure wg_qstart() will not dereference the dying `peer'.
Looks crappy and potentially could block forever, but should work.
However, the netlock is unnecessary here.  The netlocked wg_output()
could fill `if_snd' while the netlock is released before tsleep(), so
it serializes nothing but stops packet processing.

Kirill, does this diff help?

Index: sys/net/if_wg.c
===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
+++ sys/net/if_wg.c	4 Oct 2023 20:01:16 -0000
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
 	noise_remote_clear(&peer->p_remote);
 
-	NET_LOCK();
-	while (!ifq_empty(&sc->sc_if.if_snd)) {
-		NET_UNLOCK();
+	while (!ifq_empty(&sc->sc_if.if_snd))
 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-		NET_LOCK();
-	}
-	NET_UNLOCK();
 
 	taskq_barrier(wg_crypt_taskq);
 	taskq_barrier(net_tq(sc->sc_if.if_index));
Re: wg destroy hangs
On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > output?  Then we see the wait channel where it hangs in the kernel.
> >
> > $ ps axlww
> >  UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT  TIME COMMAND
>
> Here it happened again:
>
>    0 75339 23922   0  10  0  360  296 wg_ifq D+U  p0  0:00.00 ifconfig wg1 destroy

wg_peer_destroy()
...
	NET_LOCK();
	while (!ifq_empty(&sc->sc_if.if_snd)) {
		NET_UNLOCK();
		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
		NET_LOCK();
	}
	NET_UNLOCK();

This net lock dance looks fishy.  And the sleep has a timeout of 1
millisecond.  But that may be per packet.  So if you have a long
queue or the queue refills somehow, it will take forever.

I think the difference in the usage is constant traffic that keeps
the send queue full.  The timeout hides the problem when there are
only a few packets.

bluhm
Re: wg destroy hangs
• Kirill Miazine [2023-10-04 20:42]:
> I saw some changes to wg recently, so I wanted to report the issue in
> case a recent commit changed something in time for release.  I
> understand the issue is probably a year old by now.  I guess I hadn't
> destroyed wg for a while, although I do believe I have...
>
> • Alexander Bluhm [2023-10-04 16:31]:
> > On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:
> > > See the post: "Uninterruptible D State after ifconfig wg0 destroy"
> > > Oct. 31, 2022 in the Bugs archive.
> >
> > I have a test regress/sys/net/wg that configures a wg(4), sends some
> > traffic, and destroys it.  I have never seen this bug.  There must
> > be something special to trigger it.
> >
> > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > output?  Then we see the wait channel where it hangs in the kernel.
> >
> > $ ps axlww
> >  UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT  TIME COMMAND
>
> Here it happened again:
>
>    0 75339 23922   0  10  0  360  296 wg_ifq D+U  p0  0:00.00 ifconfig wg1 destroy
>
> > The WCHAN string can be found in the kernel sources and gives hints.
> > More sophisticated would be to break into ddb and show the kernel
> > stack trace of the ifconfig process.  If you want to do that, I can
> > give some advice.  But I recommend a serial console for that.
> >
> > Any idea what you did specially to trigger the problem?
>
> I did a down, delete and destroy sequence on a newly booted system.

Actually, after testing more, I see that doing a down, delete and
destroy of a wg interface works fine from console.

Also, I tried this command sequence in a tmux session in ssh over a
connection through the same wg interface:

root@fika ~ # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy

That sequence didn't give an error.  But then when I did wg1 down via
ssh over wg1, and _then_ delete and destroy from console, destroy in
console hanged.  Interestingly, when doing "ifconfig wg1 down; ifconfig
wg1 delete;" in tmux and then destroy in console, there's no hang
either.

OpenBSD 7.4 (GENERIC) #1332: Wed Oct  4 01:00:54 MDT 2023
Re: wg destroy hangs
I saw some changes to wg recently, so I wanted to report the issue in
case a recent commit changed something in time for release.  I
understand the issue is probably a year old by now.  I guess I hadn't
destroyed wg for a while, although I do believe I have...

• Alexander Bluhm [2023-10-04 16:31]:
> On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:
> > See the post: "Uninterruptible D State after ifconfig wg0 destroy"
> > Oct. 31, 2022 in the Bugs archive.
>
> I have a test regress/sys/net/wg that configures a wg(4), sends some
> traffic, and destroys it.  I have never seen this bug.  There must
> be something special to trigger it.
>
> If it happens again, could you send an 'ps axlww | grep ifconfig'
> output?  Then we see the wait channel where it hangs in the kernel.
>
> $ ps axlww
>  UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT  TIME COMMAND

Here it happened again:

   0 75339 23922   0  10  0  360  296 wg_ifq D+U  p0  0:00.00 ifconfig wg1 destroy

> The WCHAN string can be found in the kernel sources and gives hints.
> More sophisticated would be to break into ddb and show the kernel
> stack trace of the ifconfig process.  If you want to do that, I can
> give some advice.  But I recommend a serial console for that.
>
> Any idea what you did specially to trigger the problem?

I did a down, delete and destroy sequence on a newly booted system.

OpenBSD 7.4 (GENERIC) #1332: Wed Oct  4 01:00:54 MDT 2023
Re: wg destroy hangs
It works currently on my own firewall.  The problem occurred on a
client's firewall, which I can't test right now.  It's an amd64 system
(older Supermicro), and as I was shelled in remotely I was getting
worried about recovering, and then I discovered "reboot -q" - nothing
else was working to get the interface back up.  Was running -current
at the time as well as now.

On Wed, Oct 4, 2023 at 12:05 PM Alexander Bluhm wrote:
> On Wed, Oct 04, 2023 at 10:53:30AM -0400, Sonic wrote:
> > When it happened to me back then (2022) I'm pretty sure I did a
> > "down" followed by a "delete" and then the "destroy".
>
> root@ot6:.../~# cd /usr/src/regress/sys/net/wg
> root@ot6:.../wg# make ifconfig
> ...
> root@ot6:.../wg# ifconfig wg11
> wg11: flags=80c3<UP,BROADCAST,RUNNING,NOARP,MULTICAST> rdomain 11 mtu 1420
>         index 42 priority 0 llprio 3
>         wgport 211
>         wgpubkey uQP9F5afOHni9RObVahSPxeJgbsrqGw/P4t5Balpmkc=
>         wgpeer beT/atjwFPBo3Pv8IvFO5Wf/uVXfgZ5QLSSQIGm/sSc=
>                 wgendpoint 127.0.0.1 212
>                 tx: 0, rx: 0
>                 wgaip fdd7:e83e:66bc:46::2/128
>                 wgaip 10.188.44.2/32
>         groups: wg
>         inet 10.188.44.1 netmask 0xffffff00 broadcast 10.188.44.255
>         inet6 fdd7:e83e:66bc:46::1 prefixlen 64
> root@ot6:.../wg# ifconfig wg11 down
> root@ot6:.../wg# ifconfig wg11 delete
> root@ot6:.../wg# ifconfig wg11 destroy
> root@ot6:.../wg#
>
> For me it works.  Tested on i386 and amd64.
>
> > Have not tried to recreate since then.
>
> Can you try it again?  What is different in your setup?
>
> bluhm
Re: wg destroy hangs
On Wed, Oct 04, 2023 at 10:53:30AM -0400, Sonic wrote:
> When it happened to me back then (2022) I'm pretty sure I did a "down"
> followed by a "delete" and then the "destroy".

root@ot6:.../~# cd /usr/src/regress/sys/net/wg
root@ot6:.../wg# make ifconfig
...
root@ot6:.../wg# ifconfig wg11
wg11: flags=80c3<UP,BROADCAST,RUNNING,NOARP,MULTICAST> rdomain 11 mtu 1420
        index 42 priority 0 llprio 3
        wgport 211
        wgpubkey uQP9F5afOHni9RObVahSPxeJgbsrqGw/P4t5Balpmkc=
        wgpeer beT/atjwFPBo3Pv8IvFO5Wf/uVXfgZ5QLSSQIGm/sSc=
                wgendpoint 127.0.0.1 212
                tx: 0, rx: 0
                wgaip fdd7:e83e:66bc:46::2/128
                wgaip 10.188.44.2/32
        groups: wg
        inet 10.188.44.1 netmask 0xffffff00 broadcast 10.188.44.255
        inet6 fdd7:e83e:66bc:46::1 prefixlen 64
root@ot6:.../wg# ifconfig wg11 down
root@ot6:.../wg# ifconfig wg11 delete
root@ot6:.../wg# ifconfig wg11 destroy
root@ot6:.../wg#

For me it works.  Tested on i386 and amd64.

> Have not tried to recreate since then.

Can you try it again?  What is different in your setup?

bluhm
Re: wg destroy hangs
When it happened to me back then (2022) I'm pretty sure I did a "down"
followed by a "delete" and then the "destroy".  Have not tried to
recreate since then.

On Wed, Oct 4, 2023 at 10:31 AM Alexander Bluhm wrote:
> On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:
> > See the post:
> > "Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022
> > in the Bugs archive.
>
> I have a test regress/sys/net/wg that configures a wg(4), sends
> some traffic, and destroys it.  I have never seen this bug.  There
> must be something special to trigger it.
>
> If it happens again, could you send an 'ps axlww | grep ifconfig'
> output?  Then we see the wait channel where it hangs in the kernel.
>
> $ ps axlww
>  UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT  TIME COMMAND
>
> The WCHAN string can be found in the kernel sources and gives hints.
> More sophisticated would be to break into ddb and show the kernel
> stack trace of the ifconfig process.  If you want to do that, I can
> give some advice.  But I recommend a serial console for that.
>
> Any idea what you did specially to trigger the problem?
>
> bluhm
Re: wg destroy hangs
On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:
> See the post:
> "Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in
> the Bugs archive.

I have a test regress/sys/net/wg that configures a wg(4), sends
some traffic, and destroys it.  I have never seen this bug.  There
must be something special to trigger it.

If it happens again, could you send an 'ps axlww | grep ifconfig'
output?  Then we see the wait channel where it hangs in the kernel.

$ ps axlww
 UID   PID  PPID CPU PRI NI  VSZ  RSS WCHAN  STAT TT  TIME COMMAND

The WCHAN string can be found in the kernel sources and gives hints.
More sophisticated would be to break into ddb and show the kernel
stack trace of the ifconfig process.  If you want to do that, I can
give some advice.  But I recommend a serial console for that.

Any idea what you did specially to trigger the problem?

bluhm
Re: wg destroy hangs
See the post:
"Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in
the Bugs archive.

On Wed, Oct 4, 2023 at 10:04 AM Sonic wrote:
> This goes back a ways, to at least 7.2 in October 2022.
>
> On Wed, Oct 4, 2023 at 8:54 AM Kirill Miazine wrote:
> > Recently on snapshots I have noticed that ifconfig wgN destroy would
> > just hang there, without any way to get back control.  A power reset
> > would be the only way to reboot and regain control.
> >
> > I don't have an exact date when it happened first, maybe some 10
> > days ago, but the problem is present on the most recent snapshot
> > (amd64).
> >
> > -- Kirill
Re: wg destroy hangs
This goes back a ways, to at least 7.2 in October 2022.

On Wed, Oct 4, 2023 at 8:54 AM Kirill Miazine wrote:
> Recently on snapshots I have noticed that ifconfig wgN destroy would
> just hang there, without any way to get back control.  A power reset
> would be the only way to reboot and regain control.
>
> I don't have an exact date when it happened first, maybe some 10 days
> ago, but the problem is present on the most recent snapshot (amd64).
>
> -- Kirill
Re: wg destroy hangs
Can you try compiling without this:
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/net/if_wg.c.diff?r1=1.29&r2=1.30

Kirill Miazine [k...@krot.org] wrote:
> Recently on snapshots I have noticed that ifconfig wgN destroy would
> just hang there, without any way to get back control.  A power reset
> would be the only way to reboot and regain control.
>
> I don't have an exact date when it happened first, maybe some 10 days
> ago, but the problem is present on the most recent snapshot (amd64).
>
> -- Kirill