subject:"Re\: wg destroy hangs"





• Vitaliy Makkoveev [2023-10-05 01:10]:

On Thu, Oct 05, 2023 at 12:08:55AM +0200, Kirill Miazine wrote:

• Vitaliy Makkoveev [2023-10-05 00:02]:

On 5 Oct 2023, at 00:56, Kirill Miazine  wrote:

new diff doesn't prevent hang in test scenario either.



Which one?


I meant to say new diffS, as I had applied both... what I have now is this:



Understood.

The problem lays here:

ifq_start_task(void *p)
{
 struct ifqueue *ifq = p;
 struct ifnet *ifp = ifq->ifq_if;

 if (!ISSET(ifp->if_flags, IFF_RUNNING) ||
 ifq_empty(ifq) || ifq_is_oactive(ifq))
 return;

 ifp->if_qstart(ifq);
}

wg_qstart(struct ifqueue *ifq)
{
 /* [...] */
 while ((m = ifq_dequeue(ifq)) != NULL) {
 /* [...] */
}

wg_peer_destroy(struct wg_peer *peer)
{
 /* [...] */
 NET_LOCK();
 while (!ifq_empty(>sc_if.if_snd)) {
 NET_UNLOCK();
 tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
 NET_LOCK();
 }
 NET_UNLOCK();
 /* [...] */
}

1. wg_output() placed some packets to sc->sc_if.if_snd and scheduled
ifq_start_task() to run.

2. You performed "ifconfig wg1 down", so wg_down() cleared IFF_RUNNING
flag.

3. ifq_start_task() started to run, IFF_RUNNING is not set, so
wg_qstart() will not be called as the ifq_dequeue(). Packets rests
within sc->sc_if.if_snd. The interface is down, so nothing would
schedule ifq_start_task() to run.

4. You performed "ifconfig wg1 destroy". The while(!ifq_empty()) loop is
infinite because nothing would empty sc->sc_if.if_snd at this point.

The unlocked !ISSET(ifp->if_flags, IFF_RUNNING), ifq_empty() and
ifq_is_oactive() are bad, but netlock dances provide caches
synchronisation.

I have no quick solution for this. Probably we should rethink
ifq_start_task().

This diff checks IFF_RUNNING flag within while (!ifq_empty()) loop of
wg_peer_destroy(). If the flag is not set queue will be purged and check
performed again. I intentionally keep netlock to prevent ifconfig
manipulations on the interface.


I confirm that just the diff below solved the issue


Index: sys/net/if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c 26 Sep 2023 15:16:44 -  1.31
+++ sys/net/if_wg.c 4 Oct 2023 23:09:14 -
@@ -509,6 +509,13 @@ wg_peer_destroy(struct wg_peer *peer)
  
  	NET_LOCK();

while (!ifq_empty(>sc_if.if_snd)) {
+   /*
+* XXX: `if_snd' of stopped interface could still packets
+*/
+   if (!ISSET(sc->sc_if.if_flags, IFF_RUNNING)) {
+   ifq_purge(>sc_if.if_snd);
+   continue;
+   }
NET_UNLOCK();
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
NET_LOCK();

Re: wg destroy hangs

On Thu, Oct 05, 2023 at 12:08:55AM +0200, Kirill Miazine wrote:
> • Vitaliy Makkoveev [2023-10-05 00:02]:
> > > On 5 Oct 2023, at 00:56, Kirill Miazine  wrote:
> > > 
> > > new diff doesn't prevent hang in test scenario either.
> > > 
> > 
> > Which one?
> 
> I meant to say new diffS, as I had applied both... what I have now is this:
> 

Understood.

The problem lays here:

ifq_start_task(void *p)
{
struct ifqueue *ifq = p;
struct ifnet *ifp = ifq->ifq_if;

if (!ISSET(ifp->if_flags, IFF_RUNNING) ||
ifq_empty(ifq) || ifq_is_oactive(ifq))
return;

ifp->if_qstart(ifq);
}

wg_qstart(struct ifqueue *ifq)
{
/* [...] */ 
while ((m = ifq_dequeue(ifq)) != NULL) {
/* [...] */ 
}

wg_peer_destroy(struct wg_peer *peer)
{
/* [...] */ 
NET_LOCK();
while (!ifq_empty(>sc_if.if_snd)) {
NET_UNLOCK();
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
NET_LOCK();
}
NET_UNLOCK();
/* [...] */ 
}

1. wg_output() placed some packets to sc->sc_if.if_snd and scheduled
ifq_start_task() to run.

2. You performed "ifconfig wg1 down", so wg_down() cleared IFF_RUNNING
flag.

3. ifq_start_task() started to run, IFF_RUNNING is not set, so
wg_qstart() will not be called as the ifq_dequeue(). Packets rests
within sc->sc_if.if_snd. The interface is down, so nothing would
schedule ifq_start_task() to run.

4. You performed "ifconfig wg1 destroy". The while(!ifq_empty()) loop is
infinite because nothing would empty sc->sc_if.if_snd at this point.

The unlocked !ISSET(ifp->if_flags, IFF_RUNNING), ifq_empty() and 
ifq_is_oactive() are bad, but netlock dances provide caches
synchronisation.

I have no quick solution for this. Probably we should rethink
ifq_start_task().

This diff checks IFF_RUNNING flag within while (!ifq_empty()) loop of
wg_peer_destroy(). If the flag is not set queue will be purged and check
performed again. I intentionally keep netlock to prevent ifconfig
manipulations on the interface.

Index: sys/net/if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c 26 Sep 2023 15:16:44 -  1.31
+++ sys/net/if_wg.c 4 Oct 2023 23:09:14 -
@@ -509,6 +509,13 @@ wg_peer_destroy(struct wg_peer *peer)

NET_LOCK();
while (!ifq_empty(>sc_if.if_snd)) {
+   /*
+* XXX: `if_snd' of stopped interface could still packets
+*/
+   if (!ISSET(sc->sc_if.if_flags, IFF_RUNNING)) {
+   ifq_purge(>sc_if.if_snd);
+   continue;
+   }
NET_UNLOCK();
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
NET_LOCK();

Re: wg destroy hangs


• Vitaliy Makkoveev [2023-10-05 00:02]:

On 5 Oct 2023, at 00:56, Kirill Miazine  wrote:

new diff doesn't prevent hang in test scenario either.



Which one?


I meant to say new diffS, as I had applied both... what I have now is this:

===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- net/if_wg.c 26 Sep 2023 15:16:44 -  1.31
+++ net/if_wg.c 4 Oct 2023 22:05:05 -
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)

noise_remote_clear(>p_remote);

-   NET_LOCK();
-   while (!ifq_empty(>sc_if.if_snd)) {
-   NET_UNLOCK();
+   while (!ifq_empty(>sc_if.if_snd))
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-   NET_LOCK();
-   }
-   NET_UNLOCK();

taskq_barrier(wg_crypt_taskq);
taskq_barrier(net_tq(sc->sc_if.if_index));
@@ -2580,6 +2575,7 @@ wg_down(struct wg_softc *sc)
wg_unbind(sc);
rw_exit_read(>sc_lock);
NET_LOCK();
+   ifq_purge(>sc_if.if_snd);
 }

 int
Index: net/ifq.c
===
RCS file: /cvs/src/sys/net/ifq.c,v
retrieving revision 1.50
diff -u -p -r1.50 ifq.c
--- net/ifq.c   30 Jul 2023 05:39:52 -  1.50
+++ net/ifq.c   4 Oct 2023 22:05:05 -
@@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
return (len);
 }

+void
+ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
+{
+   mtx_enter(>ifq_mtx);
+   ifq->ifq_maxlen = maxlen;
+   mtx_leave(>ifq_mtx);
+}
+
 unsigned int
 ifq_purge(struct ifqueue *ifq)
 {
Index: net/ifq.h
===
RCS file: /cvs/src/sys/net/ifq.h,v
retrieving revision 1.38
diff -u -p -r1.38 ifq.h
--- net/ifq.h   30 Jul 2023 05:39:52 -  1.38
+++ net/ifq.h   4 Oct 2023 22:05:05 -
@@ -435,6 +435,7 @@ void ifq_deq_commit(struct ifqueue *,
 voidifq_deq_rollback(struct ifqueue *, struct mbuf *);
 struct mbuf*ifq_dequeue(struct ifqueue *);
 int ifq_hdatalen(struct ifqueue *);
+voidifq_set_maxlen(struct ifqueue *, unsigned int);
 voidifq_mfreem(struct ifqueue *, struct mbuf *);
 voidifq_mfreeml(struct ifqueue *, struct mbuf_list *);
 unsigned intifq_purge(struct ifqueue *);
@@ -448,9 +449,8 @@ int  ifq_deq_sleep(struct ifqueue *, st
 const char *, volatile unsigned int *,
 volatile unsigned int *);

-#defineifq_len(_ifq)   ((_ifq)->ifq_len)
-#defineifq_empty(_ifq) (ifq_len(_ifq) == 0)
-#defineifq_set_maxlen(_ifq, _l)((_ifq)->ifq_maxlen = (_l))
+#define ifq_len(_ifq)  READ_ONCE((_ifq)->ifq_len)
+#define ifq_empty(_ifq)(ifq_len(_ifq) == 0)

 static inline int
 ifq_is_priq(struct ifqueue *ifq)
@@ -490,8 +490,8 @@ int  ifiq_input(struct ifiqueue *, stru
 int ifiq_enqueue(struct ifiqueue *, struct mbuf *);
 voidifiq_add_data(struct ifiqueue *, struct if_data *);

-#defineifiq_len(_ifiq) ml_len(&(_ifiq)->ifiq_ml)
-#defineifiq_empty(_ifiq)   ml_empty(&(_ifiq)->ifiq_ml)
+#define ifiq_len(_ifiq)READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
+#define ifiq_empty(_ifiq)  (ifiq_len(_ifiq) == 0)

 #endif /* _KERNEL */

Re: wg destroy hangs

> On 5 Oct 2023, at 00:56, Kirill Miazine  wrote:
> 
> new diff doesn't prevent hang in test scenario either.
> 

Which one?

Re: wg destroy hangs





• Vitaliy Makkoveev [2023-10-04 23:38]:

On 5 Oct 2023, at 00:31, Alexander Bluhm  wrote:

On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:

On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:

On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:

If it happns again, could you send an 'ps axlww | grep ifconifg'
output?  Then we see the wait channel where it hangs in the kernel.

$ ps axlww
   UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME COMMAND


Here it happened again:

 0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00
ifconfig wg1 destroy


wg_peer_destroy()
...
NET_LOCK();
while (!ifq_empty(>sc_if.if_snd)) {
NET_UNLOCK();
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
NET_LOCK();
}
NET_UNLOCK();

This net lock dance looks fishy.  And the sleep has a timeout of 1
milli second.  But that is may be per packet.  So if you have a
long queue or the queue refills somehow, it will take forever.

I think the difference in the usage is constant traffic that keeps
the send queue full.  The timeout hides the problem when there are
only a few packets.



This should ensure wg_qstart() will not dereference dying `peer'. Looks
crappy and potentially could block forever, but should work. However
netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
while netlock released before tsleep(), so it serializes nothing but
stops packets processing.

Kirill, does this diff help?


I doubt that it changes much.  When netlock is not taken, the queue
can still be filled with packets.

Removing this ugly netlock makes sense anyway.  But without any
synchronisation just reading a variable feels wrong.  Can we add a
read once like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
also looks very unsafe.  For mbuf queues I added a mutex, interface
queues should do the same.

ok?



I guess this is uniprocessor machine, so synchronisation is not
related.

new diff doesn't prevent hang in test scenario either.

wg destroy would hang on both UP and MP machines -- the fresh Vultr test 
machine is MP.



diff is ok mvs.


bluhm

Index: net/ifq.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
retrieving revision 1.50
diff -u -p -r1.50 ifq.c
--- net/ifq.c   30 Jul 2023 05:39:52 -  1.50
+++ net/ifq.c   4 Oct 2023 21:04:20 -
@@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
return (len);
}

+void
+ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
+{
+   mtx_enter(>ifq_mtx);
+   ifq->ifq_maxlen = maxlen;
+   mtx_leave(>ifq_mtx);
+}
+
unsigned int
ifq_purge(struct ifqueue *ifq)
{
Index: net/ifq.h
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
retrieving revision 1.38
diff -u -p -r1.38 ifq.h
--- net/ifq.h   30 Jul 2023 05:39:52 -  1.38
+++ net/ifq.h   4 Oct 2023 21:09:04 -
@@ -435,6 +435,7 @@ void ifq_deq_commit(struct ifqueue *,
void ifq_deq_rollback(struct ifqueue *, struct mbuf *);
struct mbuf *ifq_dequeue(struct ifqueue *);
int  ifq_hdatalen(struct ifqueue *);
+voidifq_set_maxlen(struct ifqueue *, unsigned int);
void ifq_mfreem(struct ifqueue *, struct mbuf *);
void ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
unsigned int ifq_purge(struct ifqueue *);
@@ -448,9 +449,8 @@ int  ifq_deq_sleep(struct ifqueue *, st
 const char *, volatile unsigned int *,
 volatile unsigned int *);

-#defineifq_len(_ifq)   ((_ifq)->ifq_len)
-#defineifq_empty(_ifq) (ifq_len(_ifq) == 0)
-#defineifq_set_maxlen(_ifq, _l)((_ifq)->ifq_maxlen = (_l))
+#define ifq_len(_ifq)  READ_ONCE((_ifq)->ifq_len)
+#define ifq_empty(_ifq)(ifq_len(_ifq) == 0)

static inline int
ifq_is_priq(struct ifqueue *ifq)
@@ -490,8 +490,8 @@ int  ifiq_input(struct ifiqueue *, stru
int  ifiq_enqueue(struct ifiqueue *, struct mbuf *);
void ifiq_add_data(struct ifiqueue *, struct if_data *);

-#defineifiq_len(_ifiq) ml_len(&(_ifiq)->ifiq_ml)
-#defineifiq_empty(_ifiq)   ml_empty(&(_ifiq)->ifiq_ml)
+#define ifiq_len(_ifiq)READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
+#define ifiq_empty(_ifiq)  (ifiq_len(_ifiq) == 0)

#endif /* _KERNEL */

Re: wg destroy hangs

> On 5 Oct 2023, at 00:31, Alexander Bluhm  wrote:
> 
> On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:
>> On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
>>> On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> If it happns again, could you send an 'ps axlww | grep ifconifg'
> output?  Then we see the wait channel where it hangs in the kernel.
> 
> $ ps axlww
>   UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME 
> COMMAND
 
 Here it happened again:
 
 0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00 
 ifconfig wg1 destroy
>>> 
>>> wg_peer_destroy()
>>> ...
>>>NET_LOCK();
>>>while (!ifq_empty(>sc_if.if_snd)) {
>>>NET_UNLOCK();
>>>tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
>>>NET_LOCK();
>>>}
>>>NET_UNLOCK();
>>> 
>>> This net lock dance looks fishy.  And the sleep has a timeout of 1
>>> milli second.  But that is may be per packet.  So if you have a
>>> long queue or the queue refills somehow, it will take forever.
>>> 
>>> I think the difference in the usage is constant traffic that keeps
>>> the send queue full.  The timeout hides the problem when there are
>>> only a few packets.
>>> 
>> 
>> This should ensure wg_qstart() will not dereference dying `peer'. Looks
>> crappy and potentially could block forever, but should work. However
>> netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
>> while netlock released before tsleep(), so it serializes nothing but
>> stops packets processing.
>> 
>> Kirill, does this diff help? 
> 
> I doubt that it changes much.  When netlock is not taken, the queue
> can still be filled with packets.
> 
> Removing this ugly netlock makes sense anyway.  But without any
> synchronisation just reading a variable feels wrong.  Can we add a
> read once like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
> also looks very unsafe.  For mbuf queues I added a mutex, interface
> queues should do the same.
> 
> ok?
> 

I guess this is uniprocessor machine, so synchronisation is not
related.

diff is ok mvs.

> bluhm
> 
> Index: net/ifq.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
> retrieving revision 1.50
> diff -u -p -r1.50 ifq.c
> --- net/ifq.c 30 Jul 2023 05:39:52 -  1.50
> +++ net/ifq.c 4 Oct 2023 21:04:20 -
> @@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
>   return (len);
> }
> 
> +void
> +ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
> +{
> + mtx_enter(>ifq_mtx);
> + ifq->ifq_maxlen = maxlen;
> + mtx_leave(>ifq_mtx);
> +}
> +
> unsigned int
> ifq_purge(struct ifqueue *ifq)
> {
> Index: net/ifq.h
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
> retrieving revision 1.38
> diff -u -p -r1.38 ifq.h
> --- net/ifq.h 30 Jul 2023 05:39:52 -  1.38
> +++ net/ifq.h 4 Oct 2023 21:09:04 -
> @@ -435,6 +435,7 @@ void   ifq_deq_commit(struct ifqueue *, 
> void   ifq_deq_rollback(struct ifqueue *, struct mbuf *);
> struct mbuf   *ifq_dequeue(struct ifqueue *);
> intifq_hdatalen(struct ifqueue *);
> +void  ifq_set_maxlen(struct ifqueue *, unsigned int);
> void   ifq_mfreem(struct ifqueue *, struct mbuf *);
> void   ifq_mfreeml(struct ifqueue *, struct mbuf_list *);
> unsigned int   ifq_purge(struct ifqueue *);
> @@ -448,9 +449,8 @@ intifq_deq_sleep(struct ifqueue *, st
>const char *, volatile unsigned int *,
>volatile unsigned int *);
> 
> -#define  ifq_len(_ifq)   ((_ifq)->ifq_len)
> -#define  ifq_empty(_ifq) (ifq_len(_ifq) == 0)
> -#define  ifq_set_maxlen(_ifq, _l)((_ifq)->ifq_maxlen = (_l))
> +#define ifq_len(_ifq)READ_ONCE((_ifq)->ifq_len)
> +#define ifq_empty(_ifq)  (ifq_len(_ifq) == 0)
> 
> static inline int
> ifq_is_priq(struct ifqueue *ifq)
> @@ -490,8 +490,8 @@ intifiq_input(struct ifiqueue *, stru
> intifiq_enqueue(struct ifiqueue *, struct mbuf *);
> void   ifiq_add_data(struct ifiqueue *, struct if_data *);
> 
> -#define  ifiq_len(_ifiq) ml_len(&(_ifiq)->ifiq_ml)
> -#define  ifiq_empty(_ifiq)   ml_empty(&(_ifiq)->ifiq_ml)
> +#define ifiq_len(_ifiq)  READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
> +#define ifiq_empty(_ifiq)(ifiq_len(_ifiq) == 0)
> 
> #endif /* _KERNEL */

Re: wg destroy hangs

On Wed, Oct 04, 2023 at 11:03:27PM +0300, Vitaliy Makkoveev wrote:
> On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > If it happns again, could you send an 'ps axlww | grep ifconifg'
> > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > > 
> > > > $ ps axlww
> > > >UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME 
> > > > COMMAND
> > > 
> > > Here it happened again:
> > > 
> > >  0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00 
> > > ifconfig wg1 destroy
> > 
> > wg_peer_destroy()
> > ...
> > NET_LOCK();
> > while (!ifq_empty(>sc_if.if_snd)) {
> > NET_UNLOCK();
> > tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > NET_LOCK();
> > }
> > NET_UNLOCK();
> > 
> > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > milli second.  But that is may be per packet.  So if you have a
> > long queue or the queue refills somehow, it will take forever.
> > 
> > I think the difference in the usage is constant traffic that keeps
> > the send queue full.  The timeout hides the problem when there are
> > only a few packets.
> > 
> 
> This should ensure wg_qstart() will not dereference dying `peer'. Looks
> crappy and potentially could block forever, but should work. However
> netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
> while netlock released before tsleep(), so it serializes nothing but
> stops packets processing.
> 
> Kirill, does this diff help? 

I doubt that it changes much.  When netlock is not taken, the queue
can still be filled with packets.

Removing this ugly netlock makes sense anyway.  But without any
synchronisation just reading a variable feels wrong.  Can we add a
read once like for mq_len in sys/mbuf.h?  And the ifq_set_maxlen()
also looks very unsafe.  For mbuf queues I added a mutex, interface
queues should do the same.

ok?

bluhm

Index: net/ifq.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.c,v
retrieving revision 1.50
diff -u -p -r1.50 ifq.c
--- net/ifq.c   30 Jul 2023 05:39:52 -  1.50
+++ net/ifq.c   4 Oct 2023 21:04:20 -
@@ -529,6 +529,14 @@ ifq_hdatalen(struct ifqueue *ifq)
return (len);
 }
 
+void
+ifq_set_maxlen(struct ifqueue *ifq, unsigned int maxlen)
+{
+   mtx_enter(>ifq_mtx);
+   ifq->ifq_maxlen = maxlen;
+   mtx_leave(>ifq_mtx);
+}
+
 unsigned int
 ifq_purge(struct ifqueue *ifq)
 {
Index: net/ifq.h
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/ifq.h,v
retrieving revision 1.38
diff -u -p -r1.38 ifq.h
--- net/ifq.h   30 Jul 2023 05:39:52 -  1.38
+++ net/ifq.h   4 Oct 2023 21:09:04 -
@@ -435,6 +435,7 @@ void ifq_deq_commit(struct ifqueue *, 
 voidifq_deq_rollback(struct ifqueue *, struct mbuf *);
 struct mbuf*ifq_dequeue(struct ifqueue *);
 int ifq_hdatalen(struct ifqueue *);
+voidifq_set_maxlen(struct ifqueue *, unsigned int);
 voidifq_mfreem(struct ifqueue *, struct mbuf *);
 voidifq_mfreeml(struct ifqueue *, struct mbuf_list *);
 unsigned intifq_purge(struct ifqueue *);
@@ -448,9 +449,8 @@ int  ifq_deq_sleep(struct ifqueue *, st
 const char *, volatile unsigned int *,
 volatile unsigned int *);
 
-#defineifq_len(_ifq)   ((_ifq)->ifq_len)
-#defineifq_empty(_ifq) (ifq_len(_ifq) == 0)
-#defineifq_set_maxlen(_ifq, _l)((_ifq)->ifq_maxlen = (_l))
+#define ifq_len(_ifq)  READ_ONCE((_ifq)->ifq_len)
+#define ifq_empty(_ifq)(ifq_len(_ifq) == 0)
 
 static inline int
 ifq_is_priq(struct ifqueue *ifq)
@@ -490,8 +490,8 @@ int  ifiq_input(struct ifiqueue *, stru
 int ifiq_enqueue(struct ifiqueue *, struct mbuf *);
 voidifiq_add_data(struct ifiqueue *, struct if_data *);
 
-#defineifiq_len(_ifiq) ml_len(&(_ifiq)->ifiq_ml)
-#defineifiq_empty(_ifiq)   ml_empty(&(_ifiq)->ifiq_ml)
+#define ifiq_len(_ifiq)READ_ONCE(ml_len(&(_ifiq)->ifiq_ml))
+#define ifiq_empty(_ifiq)  (ifiq_len(_ifiq) == 0)
 
 #endif /* _KERNEL */

Re: wg destroy hangs

On Wed, Oct 04, 2023 at 11:07:24PM +0200, Kirill Miazine wrote:
> 
> 
> • Vitaliy Makkoveev [2023-10-04 22:03]:
> > On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > > If it happns again, could you send an 'ps axlww | grep ifconifg'
> > > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > > > 
> > > > > $ ps axlww
> > > > > UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   
> > > > > TIME COMMAND
> > > > 
> > > > Here it happened again:
> > > > 
> > > >   0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00
> > > > ifconfig wg1 destroy
> > > 
> > > wg_peer_destroy()
> > >   ...
> > >  NET_LOCK();
> > >  while (!ifq_empty(>sc_if.if_snd)) {
> > >  NET_UNLOCK();
> > >  tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > >  NET_LOCK();
> > >  }
> > >  NET_UNLOCK();
> > > 
> > > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > > milli second.  But that is may be per packet.  So if you have a
> > > long queue or the queue refills somehow, it will take forever.
> > > 
> > > I think the difference in the usage is constant traffic that keeps
> > > the send queue full.  The timeout hides the problem when there are
> > > only a few packets.
> > > 
> > 
> > This should ensure wg_qstart() will not dereference dying `peer'. Looks
> > crappy and potentially could block forever, but should work. However
> > netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
> > while netlock released before tsleep(), so it serializes nothing but
> > stops packets processing.
> > 
> > Kirill, does this diff help?
> 
> nope, same hang.
> 
> tested on a fresh Vultr VM with -current and patch below. VM got added to my
> normal WG network, and VM was accessed by SSH over that WG network.
> 
> then:
> 
>  # ifconfig wg1 down (from ssh -- connection to ssh session disappears)
>  # ifconfig wg1 delete (from console)
>  # ifconfig wg1 destroy" (from console -- command hangs)
> 
> interestingly, destroy works fine from ssh when commands are entered in a
> tmux session and executed immediately after each other:
> 
>   # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy
> 
> looks like a timing issue.
> 

Looks like packet stook in `if_snd'. Hypothetically this hack should
help. Please note, even it works, I don't want to commit it. Someone
should introduce reference counter to wg_peer and remove this crap from
wg_peer_destroy().

Index: sys/net/if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c 26 Sep 2023 15:16:44 -  1.31
+++ sys/net/if_wg.c 4 Oct 2023 21:21:40 -
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
noise_remote_clear(>p_remote);
 
-   NET_LOCK();
-   while (!ifq_empty(>sc_if.if_snd)) {
-   NET_UNLOCK();
+   while (!ifq_empty(>sc_if.if_snd))
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-   NET_LOCK();
-   }
-   NET_UNLOCK();
 
taskq_barrier(wg_crypt_taskq);
taskq_barrier(net_tq(sc->sc_if.if_index));
@@ -2580,6 +2575,7 @@ wg_down(struct wg_softc *sc)
wg_unbind(sc);
rw_exit_read(>sc_lock);
NET_LOCK();
+   ifq_purge(>sc_if.if_snd);
 }
 
 int

Re: wg destroy hangs





• Vitaliy Makkoveev [2023-10-04 22:03]:

On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:

On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:

If it happns again, could you send an 'ps axlww | grep ifconifg'
output?  Then we see the wait channel where it hangs in the kernel.

$ ps axlww
UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME COMMAND


Here it happened again:

  0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00
ifconfig wg1 destroy


wg_peer_destroy()
...
 NET_LOCK();
 while (!ifq_empty(>sc_if.if_snd)) {
 NET_UNLOCK();
 tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
 NET_LOCK();
 }
 NET_UNLOCK();

This net lock dance looks fishy.  And the sleep has a timeout of 1
milli second.  But that is may be per packet.  So if you have a
long queue or the queue refills somehow, it will take forever.

I think the difference in the usage is constant traffic that keeps
the send queue full.  The timeout hides the problem when there are
only a few packets.



This should ensure wg_qstart() will not dereference dying `peer'. Looks
crappy and potentially could block forever, but should work. However
netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
while netlock released before tsleep(), so it serializes nothing but
stops packets processing.

Kirill, does this diff help?


nope, same hang.

tested on a fresh Vultr VM with -current and patch below. VM got added 
to my normal WG network, and VM was accessed by SSH over that WG network.


then:

 # ifconfig wg1 down (from ssh -- connection to ssh session disappears)
 # ifconfig wg1 delete (from console)
 # ifconfig wg1 destroy" (from console -- command hangs)

interestingly, destroy works fine from ssh when commands are entered in 
a tmux session and executed immediately after each other:


  # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy

looks like a timing issue.



Index: sys/net/if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c 26 Sep 2023 15:16:44 -  1.31
+++ sys/net/if_wg.c 4 Oct 2023 20:01:16 -
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
  
  	noise_remote_clear(>p_remote);
  
-	NET_LOCK();

-   while (!ifq_empty(>sc_if.if_snd)) {
-   NET_UNLOCK();
+   while (!ifq_empty(>sc_if.if_snd))
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-   NET_LOCK();
-   }
-   NET_UNLOCK();
  
  	taskq_barrier(wg_crypt_taskq);

taskq_barrier(net_tq(sc->sc_if.if_index));

Re: wg destroy hangs

On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > If it happns again, could you send an 'ps axlww | grep ifconifg'
> > > output?  Then we see the wait channel where it hangs in the kernel.
> > > 
> > > $ ps axlww
> > >UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME 
> > > COMMAND
> > 
> > Here it happened again:
> > 
> >  0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00 
> > ifconfig wg1 destroy
> 
> wg_peer_destroy()
>   ...
> NET_LOCK();
> while (!ifq_empty(>sc_if.if_snd)) {
> NET_UNLOCK();
> tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> NET_LOCK();
> }
> NET_UNLOCK();
> 
> This net lock dance looks fishy.  And the sleep has a timeout of 1
> milli second.  But that is may be per packet.  So if you have a
> long queue or the queue refills somehow, it will take forever.
> 
> I think the difference in the usage is constant traffic that keeps
> the send queue full.  The timeout hides the problem when there are
> only a few packets.
> 

This should ensure wg_qstart() will not dereference dying `peer'. Looks
crappy and potentially could block forever, but should work. However
netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
while netlock released before tsleep(), so it serializes nothing but
stops packets processing.

Kirill, does this diff help? 

Index: sys/net/if_wg.c
===
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c 26 Sep 2023 15:16:44 -  1.31
+++ sys/net/if_wg.c 4 Oct 2023 20:01:16 -
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
noise_remote_clear(>p_remote);
 
-   NET_LOCK();
-   while (!ifq_empty(>sc_if.if_snd)) {
-   NET_UNLOCK();
+   while (!ifq_empty(>sc_if.if_snd))
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-   NET_LOCK();
-   }
-   NET_UNLOCK();
 
taskq_barrier(wg_crypt_taskq);
taskq_barrier(net_tq(sc->sc_if.if_index));

Re: wg destroy hangs

On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > If it happns again, could you send an 'ps axlww | grep ifconifg'
> > output?  Then we see the wait channel where it hangs in the kernel.
> > 
> > $ ps axlww
> >UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME 
> > COMMAND
> 
> Here it happened again:
> 
>  0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00 
> ifconfig wg1 destroy

wg_peer_destroy()
...
NET_LOCK();
while (!ifq_empty(>sc_if.if_snd)) {
NET_UNLOCK();
tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
NET_LOCK();
}
NET_UNLOCK();

This net lock dance looks fishy.  And the sleep has a timeout of 1
milli second.  But that is may be per packet.  So if you have a
long queue or the queue refills somehow, it will take forever.

I think the difference in the usage is constant traffic that keeps
the send queue full.  The timeout hides the problem when there are
only a few packets.

bluhm

Re: wg destroy hangs

• Kirill Miazine [2023-10-04 20:42]:

I saw some changes to wg recently, so I wanted to report the issue in
case recent commit changed something in time for release. I understand
the issue is probably a year old by now. I guess I hadn't destroyed wg
for a while, although I do believe I have...

• Alexander Bluhm [2023-10-04 16:31]:

On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:

See the post:
"Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in
the

Bugs archive.

I have a test regress/sys/net/wg that configures a wg(4), sends
some traffic, and destroys it. I have never seen this bug. There
must be something special to trigger it.

If it happns again, could you send an 'ps axlww | grep ifconifg'
output? Then we see the wait channel where it hangs in the kernel.

$ ps axlww
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT
TIME COMMAND

Here it happened again:

0 75339 23922 0 10 0 360 296 wg_ifq D+U p0 0:00.00
ifconfig wg1 destroy

The WCHAN string can be found in the kernel sources and gives hints.
More sophisticated would be to break into ddb and show the kernel
stack trace of the ifconfig process. If you want to do that, I can
give some advice. But I recommend a serial console for that.

Any idea what you did specially to trigger the problem?

I did a down, delete and destroy sequence on a newly booted system.

Actually, after testing more, I see that doing a down, delete and
destroy of a wg interface works fine from console.

Also, I tried this command sequence in a tmux session in ssh over a
connection though the same wg interface:

root@fika ~ # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy

That sequence didn't give an error.

But then when I did wg1 down via ssh over wg1, and _then_ delete and
destroy from console, destroy in console hanged.

Interestingly, when doing "ifconfig wg1 down; ifconfig wg1 delete;" in
tmux and then destroy in console, there's no hang either.

OpenBSD 7.4 (GENERIC) #1332: Wed Oct 4 01:00:54 MDT 2023

bluhm

Re: wg destroy hangs




I saw some changes to wg recently, so I wanted to report the issue in 
case recent commit changed something in time for release. I understand 
the issue is probably a year old by now. I guess I hadn't destroyed wg 
for a while, although I do believe I have...


• Alexander Bluhm [2023-10-04 16:31]:

On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:

See the post:
"Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in the
Bugs archive.


I have a test regress/sys/net/wg that configures a wg(4), sends
some traffic, and destroys it.  I have never seen this bug.  There
must be something special to trigger it.

If it happns again, could you send an 'ps axlww | grep ifconifg'
output?  Then we see the wait channel where it hangs in the kernel.

$ ps axlww
   UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME COMMAND


Here it happened again:

0 75339 23922   0  10   0   360   296 wg_ifq  D+Up00:00.00 
ifconfig wg1 destroy



The WCHAN string can be found in the kernel sources and gives hints.
More sophisticated would be to break into ddb and show the kernel
stack trace of the ifconfig process.  If you want to do that, I can
give some advice.  But I recommend a serial console for that.

Any idea what you did specially to trigger the problem?


I did a down, delete and destroy sequence on a newly booted system.

OpenBSD 7.4 (GENERIC) #1332: Wed Oct  4 01:00:54 MDT 2023



bluhm

Re: wg destroy hangs

It works currently on my own firewall. The problem occured on a client's
firewall, which I can't test right now. It's an amd64 system (older
Supermicro) and as I was shelled in remotely I was getting worried about
recovering and then I discovered "reboot -q" - nothing else was working to
get the interface back up.
Was running -current at the time as well as now.


On Wed, Oct 4, 2023 at 12:05 PM Alexander Bluhm 
wrote:

> On Wed, Oct 04, 2023 at 10:53:30AM -0400, Sonic wrote:
> > When it happened to me back then (2022) I'm pretty sure I did a "down"
> > followed by a "delete" and then the "destroy".
>
> root@ot6:.../~# cd /usr/src/regress/sys/net/wg
> root@ot6:.../wg# make ifconfig
> ...
> root@ot6:.../wg# ifconfig wg11
> wg11: flags=80c3 rdomain 11 mtu 1420
> index 42 priority 0 llprio 3
> wgport 211
> wgpubkey uQP9F5afOHni9RObVahSPxeJgbsrqGw/P4t5Balpmkc=
> wgpeer beT/atjwFPBo3Pv8IvFO5Wf/uVXfgZ5QLSSQIGm/sSc=
> wgendpoint 127.0.0.1 212
> tx: 0, rx: 0
> wgaip fdd7:e83e:66bc:46::2/128
> wgaip 10.188.44.2/32
> groups: wg
> inet 10.188.44.1 netmask 0xff00 broadcast 10.188.44.255
> inet6 fdd7:e83e:66bc:46::1 prefixlen 64
> root@ot6:.../wg# ifconfig wg11 down
> root@ot6:.../wg# ifconfig wg11 delete
> root@ot6:.../wg# ifconfig wg11 destroy
> root@ot6:.../wg#
>
> For me it works.  Tested on i386 and amd64.
>
> > Have not tried to recreate since then.
>
> Can you try it again?  What is different in your setup?
>
> bluhm
>

Re: wg destroy hangs

On Wed, Oct 04, 2023 at 10:53:30AM -0400, Sonic wrote:
> When it happened to me back then (2022) I'm pretty sure I did a "down"
> followed by a "delete" and then the "destroy".

root@ot6:.../~# cd /usr/src/regress/sys/net/wg
root@ot6:.../wg# make ifconfig
...
root@ot6:.../wg# ifconfig wg11
wg11: flags=80c3 rdomain 11 mtu 1420
index 42 priority 0 llprio 3
wgport 211
wgpubkey uQP9F5afOHni9RObVahSPxeJgbsrqGw/P4t5Balpmkc=
wgpeer beT/atjwFPBo3Pv8IvFO5Wf/uVXfgZ5QLSSQIGm/sSc=
wgendpoint 127.0.0.1 212
tx: 0, rx: 0
wgaip fdd7:e83e:66bc:46::2/128
wgaip 10.188.44.2/32
groups: wg
inet 10.188.44.1 netmask 0xff00 broadcast 10.188.44.255
inet6 fdd7:e83e:66bc:46::1 prefixlen 64
root@ot6:.../wg# ifconfig wg11 down
root@ot6:.../wg# ifconfig wg11 delete
root@ot6:.../wg# ifconfig wg11 destroy
root@ot6:.../wg#

For me it works.  Tested on i386 and amd64.

> Have not tried to recreate since then.

Can you try it again?  What is different in your setup?

bluhm

Re: wg destroy hangs

When it happened to me back then (2022) I'm pretty sure I did a "down"
followed by a "delete" and then the "destroy".
Have not tried to recreate since then.


On Wed, Oct 4, 2023 at 10:31 AM Alexander Bluhm 
wrote:

> On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:
> > See the post:
> > "Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in the
> > Bugs archive.
>
> I have a test regress/sys/net/wg that configures a wg(4), sends
> some traffic, and destroys it.  I have never seen this bug.  There
> must be something special to trigger it.
>
> If it happns again, could you send an 'ps axlww | grep ifconifg'
> output?  Then we see the wait channel where it hangs in the kernel.
>
> $ ps axlww
>   UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME
> COMMAND
>
> The WCHAN string can be found in the kernel sources and gives hints.
> More sophisticated would be to break into ddb and show the kernel
> stack trace of the ifconfig process.  If you want to do that, I can
> give some advice.  But I recommend a serial console for that.
>
> Any idea what you did specially to trigger the problem?
>
> bluhm
>

Re: wg destroy hangs

On Wed, Oct 04, 2023 at 10:08:01AM -0400, Sonic wrote:
> See the post:
> "Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in the
> Bugs archive.

I have a test regress/sys/net/wg that configures a wg(4), sends
some traffic, and destroys it.  I have never seen this bug.  There
must be something special to trigger it.

If it happns again, could you send an 'ps axlww | grep ifconifg'
output?  Then we see the wait channel where it hangs in the kernel.

$ ps axlww
  UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT   TIME COMMAND

The WCHAN string can be found in the kernel sources and gives hints.
More sophisticated would be to break into ddb and show the kernel
stack trace of the ifconfig process.  If you want to do that, I can
give some advice.  But I recommend a serial console for that.

Any idea what you did specially to trigger the problem?

bluhm

Re: wg destroy hangs

See the post:
"Uninterruptible D State after ifconfig wg0 destroy" Oct. 31, 2022 in the
Bugs archive.



On Wed, Oct 4, 2023 at 10:04 AM Sonic  wrote:

> This goes back a ways, to at least 7.2 in October 2022.
>
>
> On Wed, Oct 4, 2023 at 8:54 AM Kirill Miazine  wrote:
>
>> Recently on snapshots I have noticed that ifconfig wgN destroy would
>> just hang there, without any way to get back the control. Power reset
>> would be the only way to reboot and regain control.
>>
>> I don't have exact date when it happened first, maybe some 10 days ago,
>> but problem is present on most recent snapshot (amd64).
>>
>> -- Kirill
>>
>>

Re: wg destroy hangs