Re: [External] : Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Alexandr Nedvedicky
Hello,

my statement here is just for the record. we should have a follow-up
discussion in a different thread, which is yet to be started (when the
time comes).


> 
> > - Make ARP MP safe.  Currently we need the kernel lock there or
> >   it crashes.  This creates latency for all kind of packets.
> > - Convert the rwlock in pf to mutex.  I think your argument counts
> >   much more there.  But I cannot prove it.
> 
> My argument about what?  Didn't we all agree about converting the rwlock
> to a mutex for now?
> 

I disagree here. If we trade the current rw-lock, which protects the
state table in pf(4), for a mutex, we will see a significant drop in
throughput. A great deal of packets are forwarded as readers on
pf_state_lock; only those which don't match a state take the slow
path. To turn the rw-lock into a mutex, we must first break the
single giant tree into smaller buckets protected by mutexes, or
something along those lines must happen first.

thanks and
regards
sashan



pipex(4): introduce mutexes to protect 'pipex_session' context

2021-07-21 Thread Vitaliy Makkoveev
With bluhm@'s diff for parallel forwarding, pipex(4) can be accessed in
parallel through (*ifp->if_input)() -> ether_input() ->
pipex_pppoe_input(). PPPOE pipex(4) sessions are mostly immutable except
for the MPPE crypto state.

The diff below makes pipex(4) a little more ready for parallel
access.

The new per-session `pxs_mtx' mutex(9) is used to protect the session's
`ccp_id', which is incremented each time we send a CCP reset-request.

The new per-MPPE-context `pxm_mtx' mutex(9) is used to protect the MPPE
context. We have two of them: one for the input and one for the output
path. With bluhm@'s diff those paths are mostly serialized, except the
case when we send a CCP reset-request. I see no reason for other
pipex_pppoe_input() threads to spin while we perform pipex_ccp_output().

There are no lock order limitations because those new mutex(9)'es are
never held together.

I'm not a PPPOE user and I'll be happy if such a user tests this diff
in a real workflow.

Index: sys/net/pipex.c
===
RCS file: /cvs/src/sys/net/pipex.c,v
retrieving revision 1.134
diff -u -p -r1.134 pipex.c
--- sys/net/pipex.c 20 Jul 2021 16:44:55 -  1.134
+++ sys/net/pipex.c 21 Jul 2021 22:27:47 -
@@ -263,6 +263,7 @@ pipex_init_session(struct pipex_session 
 
/* prepare a new session */
session = pool_get(_session_pool, PR_WAITOK | PR_ZERO);
+   mtx_init(>pxs_mtx, IPL_SOFTNET);
session->state = PIPEX_STATE_INITIAL;
session->protocol = req->pr_protocol;
session->session_id = req->pr_session_id;
@@ -2099,6 +2100,7 @@ pipex_mppe_init(struct pipex_mppe *mppe,
 u_char *master_key, int has_oldkey)
 {
memset(mppe, 0, sizeof(struct pipex_mppe));
+   mtx_init(>pxm_mtx, IPL_SOFTNET);
if (stateless)
mppe->stateless = 1;
if (has_oldkey)
@@ -2238,10 +2240,6 @@ pipex_mppe_input(struct mbuf *m0, struct
coher_cnt &= PIPEX_COHERENCY_CNT_MASK;
pktloss = 0;
 
-   PIPEX_MPPE_DBG((session, LOG_DEBUG, "in coher_cnt=%03x %s%s",
-   mppe->coher_cnt, (flushed) ? "[flushed]" : "",
-   (encrypt) ? "[encrypt]" : ""));
-
if (encrypt == 0) {
pipex_session_log(session, LOG_DEBUG,
"Received unexpected MPPE packet.(no ecrypt)");
@@ -2251,6 +2249,12 @@ pipex_mppe_input(struct mbuf *m0, struct
/* adjust mbuf */
m_adj(m0, sizeof(coher_cnt));
 
+   mtx_enter(>pxm_mtx);
+
+   PIPEX_MPPE_DBG((session, LOG_DEBUG, "in coher_cnt=%03x %s%s",
+   mppe->coher_cnt, (flushed) ? "[flushed]" : "",
+   (encrypt) ? "[encrypt]" : ""));
+
/*
 * L2TP data session may be used without sequencing, PPP frames may
 * arrive in disorder.  The 'coherency counter' of MPPE detects such
@@ -2274,6 +2278,7 @@ pipex_mppe_input(struct mbuf *m0, struct
pipex_session_log(session, LOG_DEBUG,
"Workaround the out-of-sequence PPP framing problem: "
"%d => %d", mppe->coher_cnt, coher_cnt);
+   mtx_leave(>pxm_mtx);
goto drop;
}
rewind = 1;
@@ -2305,10 +2310,19 @@ pipex_mppe_input(struct mbuf *m0, struct
coher_cnt &= PIPEX_COHERENCY_CNT_MASK;
mppe->coher_cnt = coher_cnt;
} else if (mppe->coher_cnt != coher_cnt) {
+   int ccp_id;
+
+   mtx_leave(>pxm_mtx);
+
/* Send CCP ResetReq */
PIPEX_DBG((session, LOG_DEBUG, "CCP SendResetReq"));
-   pipex_ccp_output(session, CCP_RESETREQ,
-   session->ccp_id++);
+
+   mtx_enter(>pxs_mtx);
+   ccp_id = session->ccp_id;
+   session->ccp_id++;
+   mtx_leave(>pxs_mtx);
+
+   pipex_ccp_output(session, CCP_RESETREQ, ccp_id);
goto drop;
}
if ((coher_cnt & 0xff) == 0xff) {
@@ -2336,6 +2350,9 @@ pipex_mppe_input(struct mbuf *m0, struct
mppe->coher_cnt++;
mppe->coher_cnt &= PIPEX_COHERENCY_CNT_MASK;
}
+
+   mtx_leave(>pxm_mtx);
+
if (m0->m_pkthdr.len < PIPEX_PPPMINLEN)
goto drop;
 
@@ -2387,6 +2404,8 @@ pipex_mppe_output(struct mbuf *m0, struc
flushed = 0;
encrypt = 1;
 
+   mtx_enter(>pxm_mtx);
+
if (mppe->stateless != 0) {
flushed = 1;
mppe_key_change(mppe);
@@ -2429,6 +2448,8 @@ pipex_mppe_output(struct mbuf *m0, struc
pipex_mppe_crypt(mppe, len, cp, cp);
}
 
+   mtx_leave(>pxm_mtx);
+
pipex_ppp_output(m0, session, PPP_COMP);
 
return;
@@ -2455,7 +2476,9 @@ pipex_ccp_input(struct mbuf *m0, struct 
switch (code) {
case 

Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Alexander Bluhm
On Thu, Jul 22, 2021 at 12:06:09AM +0200, Hrvoje Popovski wrote:
> I'm combining this and last parallel diff and i can't see any drops in
> traffic. Even sending at high rate, traffic through iked or isakmpd is
> stable at 150Kpps, which is good ..

Thanks, good news.

> One funny thing is that with top -SHs1 crypto CPU usage is always at 0.00%
> Could it be because of aes-ni?

I don't use the crypto thread anymore.  It is idle.  All crypto is
done in softnet between handling the network part of ipsec.

Here is the flame graph.  Crypto is running on top of network.
http://bluhm.genua.de/perform/results/current/patch-sys-ipsec-noqueue.1/btrace/ssh_perform%40lt13_iperf3_-c10.4.56.36_-P10_-t10-btrace-kstack.0.svg

bluhm



Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Hrvoje Popovski
On 21.7.2021. 22:21, Alexander Bluhm wrote:
> Ahh, too many diffs in my tree.  I have forgotten the chunk
> crp->crp_flags = ... | CRYPTO_F_NOQUEUE
> 
> Try this.  Still testing it myself, it looks a bit faster.


I'm combining this and last parallel diff and i can't see any drops in
traffic. Even sending at high rate, traffic through iked or isakmpd is
stable at 150Kpps, which is good ..

One funny thing is that with top -SHs1 crypto CPU usage is always at 0.00%
Could it be because of aes-ni?


r620-1# cat /etc/ipsec.conf
ike active esp from 192.168.232.0/24 to 192.168.123.0/24 \
local 192.168.42.1 peer 192.168.42.111 \
main auth hmac-sha1 enc aes group modp1024 \
quick enc aes-128-gcm group modp1024 \
psk "123"

r620-1# cat /etc/iked.conf
ikev2 active esp from 192.168.232.0/24 to 192.168.123.0/24 \
local 192.168.42.1 peer 192.168.42.111 \
ikesa enc aes-128-gcm group modp1024 prf hmac-sha1 \
childsa enc aes-128-gcm group modp1024 \
psk "123"



Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Vitaliy Makkoveev
On Wed, Jul 21, 2021 at 06:41:30PM +0200, Alexander Bluhm wrote:
> On Mon, Jul 19, 2021 at 07:33:55PM +0300, Vitaliy Makkoveev wrote:
> > Hi, pipex(4) is also not ready for parallel access. In the chunk below
> > it will be accessed through (*ifp->if_input)() -> ether_input() ->
> > pipex_pppoe_input(). This looks not fatal but makes at least session
> > statistics inconsistent.
> 
> For pipex and pppoe we can put a kernel lock into ether_resolve().
> I also put locks in mpls.
> 
> I am not aware of any other issues.
> 
> Can we commit this now?
> 

Hmm, did I miss locks for mpls?

> bluhm
> 
> Index: net/if.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
> retrieving revision 1.643
> diff -u -p -r1.643 if.c
> --- net/if.c  20 Jul 2021 16:32:28 -  1.643
> +++ net/if.c  21 Jul 2021 16:07:15 -
> @@ -109,6 +109,10 @@
>  #include 
>  #endif
>  
> +#ifdef IPSEC
> +#include 
> +#endif
> +
>  #ifdef MPLS
>  #include 
>  #endif
> @@ -238,7 +242,7 @@ int   ifq_congestion;
>  
>  int   netisr;
>  
> -#define  NET_TASKQ   1
> +#define  NET_TASKQ   4
>  struct taskq *nettqmp[NET_TASKQ];
>  
>  struct task if_input_task_locked = TASK_INITIALIZER(if_netisr, NULL);
> @@ -815,6 +819,7 @@ void
>  if_input_process(struct ifnet *ifp, struct mbuf_list *ml)
>  {
>   struct mbuf *m;
> + int exclusive_lock = 0;
>  
>   if (ml_empty(ml))
>   return;
> @@ -835,15 +840,27 @@ if_input_process(struct ifnet *ifp, stru
>* lists and the socket layer.
>*/
>  
> +#ifdef IPSEC
>   /*
>* XXXSMP IPsec data structures are not ready to be accessed
>* by multiple network threads in parallel.  In this case
>* use an exclusive lock.
>*/
> - NET_LOCK();
> + if (ipsec_in_use)
> + exclusive_lock = 1;
> +#endif
> + if (exclusive_lock)
> + NET_LOCK();
> + else
> + NET_RLOCK_IN_SOFTNET();
> +
>   while ((m = ml_dequeue(ml)) != NULL)
>   (*ifp->if_input)(ifp, m);
> - NET_UNLOCK();
> +
> + if (exclusive_lock)
> + NET_UNLOCK();
> + else
> + NET_RUNLOCK_IN_SOFTNET();
>  }
>  
>  void
> @@ -900,6 +917,12 @@ if_netisr(void *unused)
>   arpintr();
>   KERNEL_UNLOCK();
>   }
> +#endif
> + if (n & (1 << NETISR_IP))
> + ipintr();
> +#ifdef INET6
> + if (n & (1 << NETISR_IPV6))
> + ip6intr();
>  #endif
>  #if NPPP > 0
>   if (n & (1 << NETISR_PPP)) {
> Index: net/if_ethersubr.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
> retrieving revision 1.275
> diff -u -p -r1.275 if_ethersubr.c
> --- net/if_ethersubr.c7 Jul 2021 20:19:01 -   1.275
> +++ net/if_ethersubr.c21 Jul 2021 16:34:51 -
> @@ -222,7 +222,10 @@ ether_resolve(struct ifnet *ifp, struct 
>  
>   switch (af) {
>   case AF_INET:
> + KERNEL_LOCK();
> + /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> + KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IP);
> @@ -245,7 +248,10 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #ifdef INET6
>   case AF_INET6:
> + KERNEL_LOCK();
> + /* XXXSMP there is a MP race in nd6_resolve() */
>   error = nd6_resolve(ifp, rt, m, dst, eh->ether_dhost);
> + KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   eh->ether_type = htons(ETHERTYPE_IPV6);
> @@ -271,13 +277,19 @@ ether_resolve(struct ifnet *ifp, struct 
>   break;
>  #ifdef INET6
>   case AF_INET6:
> + KERNEL_LOCK();
> + /* XXXSMP there is a MP race in nd6_resolve() */
>   error = nd6_resolve(ifp, rt, m, dst, eh->ether_dhost);
> + KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   break;
>  #endif
>   case AF_INET:
> + KERNEL_LOCK();
> + /* XXXSMP there is a MP race in arpresolve() */
>   error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
> + KERNEL_UNLOCK();
>   if (error)
>   return (error);
>   break;
> @@ -528,12 +540,14 @@ ether_input(struct ifnet *ifp, struct mb
>   case ETHERTYPE_PPPOE:
>   if (m->m_flags & (M_MCAST | M_BCAST))
>   goto dropanyway;
> + KERNEL_LOCK();
>  

Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Alexander Bluhm
On Wed, Jul 21, 2021 at 07:53:55PM +0200, Hrvoje Popovski wrote:
> i've applied this and ipsec crypto no queue diff and i'm getting
> splasserts below ... maybe it's something obvious, if not, i will try
> diff by diff ..

Ahh, too many diffs in my tree.  I have forgotten the chunk
crp->crp_flags = ... | CRYPTO_F_NOQUEUE

Try this.  Still testing it myself, it looks a bit faster.

bluhm

Index: crypto/crypto.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/crypto/crypto.c,v
retrieving revision 1.84
diff -u -p -r1.84 crypto.c
--- crypto/crypto.c 21 Jul 2021 11:11:41 -  1.84
+++ crypto/crypto.c 21 Jul 2021 19:51:14 -
@@ -387,23 +387,32 @@ crypto_unregister(u_int32_t driverid, in
 int
 crypto_dispatch(struct cryptop *crp)
 {
-   struct taskq *tq = crypto_taskq;
-   int error = 0, s;
+   int error = 0, lock = 1, s;
u_int32_t hid;
 
s = splvm();
hid = (crp->crp_sid >> 32) & 0x;
if (hid < crypto_drivers_num) {
if (crypto_drivers[hid].cc_flags & CRYPTOCAP_F_MPSAFE)
-   tq = crypto_taskq_mpsafe;
+   lock = 0;
}
splx(s);
 
-   if ((crp->crp_flags & CRYPTO_F_NOQUEUE) == 0) {
+   /* XXXSMP crypto_invoke() is not MP safe */
+   lock = 1;
+
+   if (crp->crp_flags & CRYPTO_F_NOQUEUE) {
+   if (lock)
+   KERNEL_LOCK();
+   error = crypto_invoke(crp);
+   if (lock)
+   KERNEL_UNLOCK();
+   } else {
+   struct taskq *tq;
+
+   tq = lock ? crypto_taskq : crypto_taskq_mpsafe;
task_set(>crp_task, (void (*))crypto_invoke, crp);
task_add(tq, >crp_task);
-   } else {
-   error = crypto_invoke(crp);
}
 
return error;
@@ -424,6 +433,8 @@ crypto_invoke(struct cryptop *crp)
if (crp == NULL || crp->crp_callback == NULL)
return EINVAL;
 
+   KERNEL_ASSERT_LOCKED();
+
s = splvm();
if (crp->crp_ndesc < 1 || crypto_drivers == NULL) {
crp->crp_etype = EINVAL;
@@ -535,11 +546,16 @@ void
 crypto_done(struct cryptop *crp)
 {
crp->crp_flags |= CRYPTO_F_DONE;
+
if (crp->crp_flags & CRYPTO_F_NOQUEUE) {
/* not from the crypto queue, wakeup the userland process */
crp->crp_callback(crp);
} else {
+   struct taskq *tq;
+
+   tq = (crp->crp_flags & CRYPTO_F_MPSAFE) ?
+   crypto_taskq_mpsafe : crypto_taskq;
task_set(>crp_task, (void (*))crp->crp_callback, crp);
-   task_add(crypto_taskq, >crp_task);
+   task_add(tq, >crp_task);
}
 }
Index: crypto/cryptodev.h
===
RCS file: /data/mirror/openbsd/cvs/src/sys/crypto/cryptodev.h,v
retrieving revision 1.73
diff -u -p -r1.73 cryptodev.h
--- crypto/cryptodev.h  9 Jul 2021 20:43:28 -   1.73
+++ crypto/cryptodev.h  21 Jul 2021 19:48:44 -
@@ -171,6 +171,7 @@ struct cryptop {
 
#define CRYPTO_F_IMBUF 0x0001  /* Input/output are mbuf chains, otherwise contig */
#define CRYPTO_F_IOV   0x0002  /* Input/output are uio */
+#define CRYPTO_F_MPSAFE        0x0004  /* Do not use kernel lock for callback */
 #define CRYPTO_F_NOQUEUE   0x0008  /* Don't use crypto queue/thread */
 #define CRYPTO_F_DONE  0x0010  /* request completed */
 
Index: netinet/ip_ah.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/ip_ah.c,v
retrieving revision 1.151
diff -u -p -r1.151 ip_ah.c
--- netinet/ip_ah.c 18 Jul 2021 14:38:20 -  1.151
+++ netinet/ip_ah.c 21 Jul 2021 19:51:14 -
@@ -684,7 +684,7 @@ ah_input(struct mbuf *m, struct tdb *tdb
 
/* Crypto operation descriptor. */
crp->crp_ilen = m->m_pkthdr.len; /* Total input length. */
-   crp->crp_flags = CRYPTO_F_IMBUF;
+   crp->crp_flags = CRYPTO_F_IMBUF | CRYPTO_F_MPSAFE | CRYPTO_F_NOQUEUE;
crp->crp_buf = (caddr_t)m;
crp->crp_callback = ipsec_input_cb;
crp->crp_sid = tdb->tdb_cryptoid;
@@ -699,9 +699,7 @@ ah_input(struct mbuf *m, struct tdb *tdb
memcpy(>tc_dst, >tdb_dst, sizeof(union sockaddr_union));
tc->tc_rpl = tdb->tdb_rpl;
 
-   KERNEL_LOCK();
error = crypto_dispatch(crp);
-   KERNEL_UNLOCK();
return error;
 
  drop:
@@ -1134,7 +1132,7 @@ ah_output(struct mbuf *m, struct tdb *td
 
/* Crypto operation descriptor. */
crp->crp_ilen = m->m_pkthdr.len; /* Total input length. */
-   crp->crp_flags = CRYPTO_F_IMBUF;
+   crp->crp_flags = CRYPTO_F_IMBUF | CRYPTO_F_MPSAFE | CRYPTO_F_NOQUEUE;
crp->crp_buf = (caddr_t)m;
crp->crp_callback = ipsec_output_cb;
crp->crp_sid = tdb->tdb_cryptoid;
@@ -1148,9 +1146,7 @@ 

Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Hrvoje Popovski
On 21.7.2021. 18:41, Alexander Bluhm wrote:
> On Mon, Jul 19, 2021 at 07:33:55PM +0300, Vitaliy Makkoveev wrote:
>> Hi, pipex(4) is also not ready for parallel access. In the chunk below
>> it will be accessed through (*ifp->if_input)() -> ether_input() ->
>> pipex_pppoe_input(). This looks not fatal but makes at least session
>> statistics inconsistent.
> For pipex and pppoe we can put a kernel lock into ether_resolve().
> I also put locks in mpls.
> 
> I am not aware of any other issues.

Hi,

i've applied this and ipsec crypto no queue diff and i'm getting
splasserts below ... maybe it's something obvious, if not, i will try
diff by diff ..

I'm running iked tunnel ...

log on box where traffic goes into tunnel

splassert: ipsp_process_done: want 2 have 0
Starting stack trace...
ipsp_process_done(fd80a8647400,811832a0) at
ipsp_process_done+0x59
esp_output_cb(811832a0,81194e00,fd80a8647400,21c,0)
at esp_output_cb+0x41
ipsec_output_cb(fd8397dad440) at ipsec_output_cb+0x177
taskq_thread(8013f480) at taskq_thread+0x81
end trace frame: 0x0, count: 253
End of stack trace.


splassert: ip_output: want 2 have 0
Starting stack trace...
ip_output(fd80a8647400,0,0,2,0,0) at ip_output+0x88
ipsp_process_done(fd80a8647400,811832a0) at
ipsp_process_done+0x384
esp_output_cb(811832a0,81194e00,fd80a8647400,21c,0)
at esp_output_cb+0x41
ipsec_output_cb(fd8397dad440) at ipsec_output_cb+0x177
taskq_thread(8013f480) at taskq_thread+0x81
end trace frame: 0x0, count: 252
End of stack trace.


logs on box where traffic goes from tunnel out
splassert: tdb_hash: want 2 have 0
Starting stack trace...
tdb_hash(36b50c1c,81199200,32) at tdb_hash+0x5c
gettdb_dir(0,36b50c1c,81199200,32,0) at gettdb_dir+0x83
ipsec_input_cb(fd839e0c6ca0) at ipsec_input_cb+0x7a
taskq_thread(80140d80) at taskq_thread+0x81
end trace frame: 0x0, count: 253
End of stack trace.

splassert: esp_input_cb: want 2 have 0
Starting stack trace...
esp_input_cb(8118a5f8,81199200,fd80026f5500,0) at
esp_input_cb+0x73
ipsec_input_cb(fd839e0c6ca0) at ipsec_input_cb+0x162
taskq_thread(80140d80) at taskq_thread+0x81
end trace frame: 0x0, count: 254
End of stack trace.

splassert: enc_getif: want 2 have 0
Starting stack trace...
enc_getif(0,0) at enc_getif+0x54
ipsec_common_input_cb(fd80026f5500,8118a5f8,14,9) at
ipsec_common_input_cb+0x934
esp_input_cb(8118a5f8,81199200,fd80026f5500,0) at
esp_input_cb+0x456
ipsec_input_cb(fd839e0c6ca0) at ipsec_input_cb+0x162
taskq_thread(80140d80) at taskq_thread+0x81
end trace frame: 0x0, count: 252
End of stack trace.



ipsec crypto no queue

2021-07-21 Thread Alexander Bluhm
Hi,

Hrvoje had some problems with too many network packets in the unlimited
queues of the crypto task queues.

Removing the queues and just calling the crypto operations directly
improves throughput by 20%.  See the sys-crypto-dispatch-klock
column.
http://bluhm.genua.de/perform/results/2021-07-21T06:50:10Z/perform.html
http://bluhm.genua.de/perform/results/2021-07-21T06:50:10Z/gnuplot/ipsec.html

Even more important, the functions inherit the locks and we have
much less contention.
With task queue:
http://bluhm.genua.de/perform/results/2021-07-21T06:50:10Z/2021-07-21T00%3A00%3A00Z/btrace/ssh_perform%40lt13_iperf3_-c10.4.56.36_-P10_-t10-btrace-kstack.0.svg
Direct call:
http://bluhm.genua.de/perform/results/2021-07-21T06:50:10Z/patch-sys-crypto-dispatch-klock.1/btrace/ssh_perform%40lt13_iperf3_-c10.4.56.36_-P10_-t10-btrace-kstack.0.svg

Note that I have software crypto.  Softcrypto is so expensive that
network operations are no longer visible.  Before, the latter were
spinning on the kernel lock.

I would be interested in results with AES-NI.  Maybe we needed the
queues when sleeping hardware drivers were used and softnet was
running as soft interrupt.

This diff makes IPsec faster, less complex, more reliable, and more
MP friendly.

There is also a small bugfix.  After a crypto dispatch error we should
goto drop to free the packet.  But this error should not happen anyway.

ok?

bluhm

Index: crypto/crypto.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/crypto/crypto.c,v
retrieving revision 1.84
diff -u -p -r1.84 crypto.c
--- crypto/crypto.c 21 Jul 2021 11:11:41 -  1.84
+++ crypto/crypto.c 21 Jul 2021 16:58:57 -
@@ -387,23 +387,32 @@ crypto_unregister(u_int32_t driverid, in
 int
 crypto_dispatch(struct cryptop *crp)
 {
-   struct taskq *tq = crypto_taskq;
-   int error = 0, s;
+   int error = 0, lock = 1, s;
u_int32_t hid;
 
s = splvm();
hid = (crp->crp_sid >> 32) & 0x;
if (hid < crypto_drivers_num) {
if (crypto_drivers[hid].cc_flags & CRYPTOCAP_F_MPSAFE)
-   tq = crypto_taskq_mpsafe;
+   lock = 0;
}
splx(s);
 
-   if ((crp->crp_flags & CRYPTO_F_NOQUEUE) == 0) {
+   /* XXXSMP crypto_invoke() is not MP safe */
+   lock = 1;
+
+   if (crp->crp_flags & CRYPTO_F_NOQUEUE) {
+   if (lock)
+   KERNEL_LOCK();
+   error = crypto_invoke(crp);
+   if (lock)
+   KERNEL_UNLOCK();
+   } else {
+   struct taskq *tq;
+
+   tq = lock ? crypto_taskq : crypto_taskq_mpsafe;
task_set(>crp_task, (void (*))crypto_invoke, crp);
task_add(tq, >crp_task);
-   } else {
-   error = crypto_invoke(crp);
}
 
return error;
@@ -424,6 +433,8 @@ crypto_invoke(struct cryptop *crp)
if (crp == NULL || crp->crp_callback == NULL)
return EINVAL;
 
+   KERNEL_ASSERT_LOCKED();
+
s = splvm();
if (crp->crp_ndesc < 1 || crypto_drivers == NULL) {
crp->crp_etype = EINVAL;
@@ -535,11 +546,16 @@ void
 crypto_done(struct cryptop *crp)
 {
crp->crp_flags |= CRYPTO_F_DONE;
+
if (crp->crp_flags & CRYPTO_F_NOQUEUE) {
/* not from the crypto queue, wakeup the userland process */
crp->crp_callback(crp);
} else {
+   struct taskq *tq;
+
+   tq = (crp->crp_flags & CRYPTO_F_MPSAFE) ?
+   crypto_taskq_mpsafe : crypto_taskq;
task_set(>crp_task, (void (*))crp->crp_callback, crp);
-   task_add(crypto_taskq, >crp_task);
+   task_add(tq, >crp_task);
}
 }
Index: crypto/cryptodev.h
===
RCS file: /data/mirror/openbsd/cvs/src/sys/crypto/cryptodev.h,v
retrieving revision 1.73
diff -u -p -r1.73 cryptodev.h
--- crypto/cryptodev.h  9 Jul 2021 20:43:28 -   1.73
+++ crypto/cryptodev.h  21 Jul 2021 16:58:57 -
@@ -171,6 +171,7 @@ struct cryptop {
 
#define CRYPTO_F_IMBUF 0x0001  /* Input/output are mbuf chains, otherwise contig */
#define CRYPTO_F_IOV   0x0002  /* Input/output are uio */
+#define CRYPTO_F_MPSAFE        0x0004  /* Do not use kernel lock for callback */
 #define CRYPTO_F_NOQUEUE   0x0008  /* Don't use crypto queue/thread */
 #define CRYPTO_F_DONE  0x0010  /* request completed */
 
Index: netinet/ip_ah.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/ip_ah.c,v
retrieving revision 1.151
diff -u -p -r1.151 ip_ah.c
--- netinet/ip_ah.c 18 Jul 2021 14:38:20 -  1.151
+++ netinet/ip_ah.c 21 Jul 2021 16:58:57 -
@@ -684,7 +684,7 @@ ah_input(struct mbuf *m, struct tdb *tdb
 
/* Crypto operation descriptor. */

Re: nanosleep.2: miscellaneous cleanup

2021-07-21 Thread Ingo Schwarze
Hi Scott,

Scott Cheloha wrote on Wed, Jul 21, 2021 at 11:02:00AM -0500:

[ EFAULT ]
> Given deraadt@'s response I'm just going to leave the existing
> language.

Fine with me.

> I guess I will need to dig into it a bit.  Finding the text of the
> really early documents, prior to SUSv2, is tricky.

I think you are right.  I recently tried to find an online copy
of XPG4.2 / SUSv1 (1994) and did not succeed.  What you want here
is even one step further back: POSIX.1 (1988), POSIX.1b (1993),
POSIX.1c (1995), POSIX.2 (1992), XPG4 (1992), see mdoc(7) for an
overview.

> One additional change:
> - Escape the minus sign for "-1".

I like that, using \- for mathematical minus is certainly correct in
roff(7) in general and perfectly acceptable in manual pages, too.
Our mandoc_char(7) page explains that we don't require it, but there
is a risk that groff(1) might start requiring it in the next release.
I tried to push back, but without success so far.

It seems you somewhere lost the s/one billion/1000 million/ tweak
that you advertised earlier.

Either way, OK schwarze@.

Yours,
  Ingo


> Index: nanosleep.2
> ===
> RCS file: /cvs/src/lib/libc/sys/nanosleep.2,v
> retrieving revision 1.16
> diff -u -p -r1.16 nanosleep.2
> --- nanosleep.2   31 Dec 2018 18:54:00 -  1.16
> +++ nanosleep.2   21 Jul 2021 15:57:55 -
> @@ -41,53 +41,71 @@
>  .Ft int
>  .Fn nanosleep "const struct timespec *timeout" "struct timespec *remainder"
>  .Sh DESCRIPTION
> +The
>  .Fn nanosleep
> -suspends execution of the calling process for the duration specified by
> +function suspends execution of the calling thread for at least the
> +duration specified by
>  .Fa timeout .
> -An unmasked signal will cause it to terminate the sleep early,
> -regardless of the
> +Delivery of an unmasked signal to the calling thread terminates the
> +sleep early,
> +even if
>  .Dv SA_RESTART
> -value on the interrupting signal.
> +is set with
> +.Xr sigaction 2
> +for the interrupting signal.
>  .Sh RETURN VALUES
> -If the
> +If
>  .Fn nanosleep
> -function returns because the requested duration has elapsed, the value
> -returned will be zero.
> +sleeps the full
> +.Fa timeout
> +without interruption it returns 0.
> +Unless
> +.Fa remainder
> +is
> +.Dv NULL ,
> +it is set to zero.
>  .Pp
> -If the
> +If
>  .Fn nanosleep
> -function returns due to the delivery of a signal, the value returned
> -will be \-1, and the global variable
> +is interrupted by a signal it returns \-1 and the global variable
>  .Va errno
> -will be set to indicate the interruption.
> -If
> +is set to
> +.Dv EINTR .
> +Unless
>  .Fa remainder
> -is non-null, the timespec structure it references is updated to contain the
> -unslept amount (the requested duration minus the duration actually slept).
> -.Sh ERRORS
> -If any of the following conditions occur, the
> +is
> +.Dv NULL ,
> +it is set to the unslept portion of the
> +.Fa timeout .
> +.Pp
> +Otherwise,
>  .Fn nanosleep
> -function shall return \-1 and set
> +returns \-1 and the global variable
>  .Va errno
> -to the corresponding value.
> +is set to indicate the error.
> +.Sh ERRORS
> +.Fn nanosleep
> +will fail if:
>  .Bl -tag -width Er
>  .It Bq Er EINTR
> -.Fn nanosleep
> -was interrupted by the delivery of a signal.
> +The call is interrupted by the delivery of a signal.
>  .It Bq Er EINVAL
>  .Fa timeout
> -specified a nanosecond value less than zero or greater than or equal to
> -1000 million,
> +specifies a nanosecond value less than zero or greater than or equal to
> +one billion,
>  or a second value less than zero.
>  .It Bq Er EFAULT
> -Either
>  .Fa timeout
> -or
> -.Fa remainder
>  points to memory that is not a valid part of the process address space.
> +.It Bq Er EFAULT
> +.Fa remainder
> +is not
> +.Dv NULL
> +and points to memory that is not a valid part of the process address space.
>  .El
>  .Sh SEE ALSO
>  .Xr sleep 1 ,
> +.Xr sigaction 2 ,
>  .Xr sleep 3
>  .Sh STANDARDS
>  The
> @@ -97,17 +115,23 @@ function conforms to
>  .Sh HISTORY
>  The predecessor of this system call,
>  .Fn sleep ,
> -appeared in
> -.At v3 ,
> -but was removed when
> +first appeared in
> +.At v3 .
> +It was removed in
> +.At v7
> +and replaced with a C library implementation based on
>  .Xr alarm 3
> -was introduced into
> -.At v7 .
> +and
> +.Xr signal 3 .
> +.Pp
>  The
>  .Fn nanosleep
> -system call has been available since
> +function first appeared in
> +.St -p1003.1b-93 .
> +.Pp
> +This implementation of
> +.Fn nanosleep
> +first appeared in
>  .Nx 1.3
>  and was ported to
> -.Ox 2.1
> -and
> -.Fx 3.0 .
> +.Ox 2.1 .



Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Alexander Bluhm
On Mon, Jul 19, 2021 at 07:33:55PM +0300, Vitaliy Makkoveev wrote:
> Hi, pipex(4) is also not ready for parallel access. In the chunk below
> it will be accessed through (*ifp->if_input)() -> ether_input() ->
> pipex_pppoe_input(). This looks not fatal but makes at least session
> statistics inconsistent.

For pipex and pppoe we can put a kernel lock into ether_resolve().
I also put locks in mpls.

I am not aware of any other issues.

Can we commit this now?

bluhm

Index: net/if.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/if.c,v
retrieving revision 1.643
diff -u -p -r1.643 if.c
--- net/if.c20 Jul 2021 16:32:28 -  1.643
+++ net/if.c21 Jul 2021 16:07:15 -
@@ -109,6 +109,10 @@
 #include 
 #endif
 
+#ifdef IPSEC
+#include 
+#endif
+
 #ifdef MPLS
 #include 
 #endif
@@ -238,7 +242,7 @@ int ifq_congestion;
 
 int netisr;
 
-#defineNET_TASKQ   1
+#defineNET_TASKQ   4
 struct taskq   *nettqmp[NET_TASKQ];
 
 struct task if_input_task_locked = TASK_INITIALIZER(if_netisr, NULL);
@@ -815,6 +819,7 @@ void
 if_input_process(struct ifnet *ifp, struct mbuf_list *ml)
 {
struct mbuf *m;
+   int exclusive_lock = 0;
 
if (ml_empty(ml))
return;
@@ -835,15 +840,27 @@ if_input_process(struct ifnet *ifp, stru
 * lists and the socket layer.
 */
 
+#ifdef IPSEC
/*
 * XXXSMP IPsec data structures are not ready to be accessed
 * by multiple network threads in parallel.  In this case
 * use an exclusive lock.
 */
-   NET_LOCK();
+   if (ipsec_in_use)
+   exclusive_lock = 1;
+#endif
+   if (exclusive_lock)
+   NET_LOCK();
+   else
+   NET_RLOCK_IN_SOFTNET();
+
while ((m = ml_dequeue(ml)) != NULL)
(*ifp->if_input)(ifp, m);
-   NET_UNLOCK();
+
+   if (exclusive_lock)
+   NET_UNLOCK();
+   else
+   NET_RUNLOCK_IN_SOFTNET();
 }
 
 void
@@ -900,6 +917,12 @@ if_netisr(void *unused)
arpintr();
KERNEL_UNLOCK();
}
+#endif
+   if (n & (1 << NETISR_IP))
+   ipintr();
+#ifdef INET6
+   if (n & (1 << NETISR_IPV6))
+   ip6intr();
 #endif
 #if NPPP > 0
if (n & (1 << NETISR_PPP)) {
Index: net/if_ethersubr.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_ethersubr.c,v
retrieving revision 1.275
diff -u -p -r1.275 if_ethersubr.c
--- net/if_ethersubr.c  7 Jul 2021 20:19:01 -   1.275
+++ net/if_ethersubr.c  21 Jul 2021 16:34:51 -
@@ -222,7 +222,10 @@ ether_resolve(struct ifnet *ifp, struct 
 
switch (af) {
case AF_INET:
+   KERNEL_LOCK();
+   /* XXXSMP there is a MP race in arpresolve() */
error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
+   KERNEL_UNLOCK();
if (error)
return (error);
eh->ether_type = htons(ETHERTYPE_IP);
@@ -245,7 +248,10 @@ ether_resolve(struct ifnet *ifp, struct 
break;
 #ifdef INET6
case AF_INET6:
+   KERNEL_LOCK();
+   /* XXXSMP there is a MP race in nd6_resolve() */
error = nd6_resolve(ifp, rt, m, dst, eh->ether_dhost);
+   KERNEL_UNLOCK();
if (error)
return (error);
eh->ether_type = htons(ETHERTYPE_IPV6);
@@ -271,13 +277,19 @@ ether_resolve(struct ifnet *ifp, struct 
break;
 #ifdef INET6
case AF_INET6:
+   KERNEL_LOCK();
+   /* XXXSMP there is a MP race in nd6_resolve() */
error = nd6_resolve(ifp, rt, m, dst, eh->ether_dhost);
+   KERNEL_UNLOCK();
if (error)
return (error);
break;
 #endif
case AF_INET:
+   KERNEL_LOCK();
+   /* XXXSMP there is a MP race in arpresolve() */
error = arpresolve(ifp, rt, m, dst, eh->ether_dhost);
+   KERNEL_UNLOCK();
if (error)
return (error);
break;
@@ -528,12 +540,14 @@ ether_input(struct ifnet *ifp, struct mb
case ETHERTYPE_PPPOE:
if (m->m_flags & (M_MCAST | M_BCAST))
goto dropanyway;
+   KERNEL_LOCK();
 #ifdef PIPEX
if (pipex_enable) {
struct pipex_session *session;
 
if ((session = pipex_pppoe_lookup_session(m)) != NULL) {
pipex_pppoe_input(m, session);
+  

Re: nanosleep.2: miscellaneous cleanup

2021-07-21 Thread Scott Cheloha
On Wed, Jul 21, 2021 at 04:00:51AM +0200, Ingo Schwarze wrote:
> Hi Scott,
> 
> Scott Cheloha wrote on Tue, Jul 20, 2021 at 05:20:16PM -0500:
> 
> > The nanosleep.2 page could use some cleanup.  Here's a bunch of fixes,
> > rewrites, etc.
> > 
> > I've included my notes on the changes below.  I have some (mostly
> > stylistic) questions in there, too.
> 
> Thanks for explaining what you are doing.  I'm snipping what i agree
> with as well as what i have no opinion on and trust you on, but that
> doesn't mean that mentioning it was useless.
> 
> > Should we reference sigaction(2) here alongside SA_RESTART?  e.g.:
> > 
> > ... even if SA_RESTART is set for the interrupting signal
> > (see sigaction(2) for further details).
> 
> [...]
> 
> Maybe something like:
> 
>   ... even if SA_RESTART was set with sigaction(2) for the
>   interrupting signal.
> 
> Your call, i would say.

I like that.  Using this snippet now.

> > I'm unsure about the correct voice here.  Should nanosleep actively
> > return values?  Or should values "be returned"?
> 
> Both are acceptable.  In general, and not just in manual pages,
> prefer the active voice when both work equally well.

Staying active, then.

> > I'm unclear on whether I need the indent here in the .Bd block.  I
> > think it looks better than unindented code, but maybe I'm flouting
> > some practice we have.
> 
> Using tabs is permitted inside .Bd -literal and .Bd -unfilled.
> They should definitely be used for indenting the bodies of for,
> while, and if.
> 
> Whether you indent the whole thing depends on what looks better.
> When the code is relatively narrow, indenting the whole example
> is often a good idea.  When the code contains very long lines,
> not indenting the whole example may be better.
> 
> If you choose to indent, i have no strong preference whether you
> do it with tabs or with .Bd -literal -offset indent; maybe i very
> slightly prefer the latter, because the global indentation is
> a formatting choice rather than part of the displayed code.
> But i really wouldn't complain about either.

deraadt@ has vetoed the examples so this is no longer an issue.

> > ERRORS
> > - Simplify the opening sentence.  Yeesh, what a mouthful.
> 
> Indeed, ERRORS should not repeat the content of RETURN VALUES.
> 
> > Unsure if a second EFAULT case for remainder is a good thing here.
> > Strictly speaking, NULL is not a valid part of the process address
> > space.  Maybe that's too pedantic?
> 
> I don't think you can be too pedantic about where NULL is allowed
> and where it isn't.  That's a notorious source of bugs, so precision
> about NULL is usually a good thing.

Keeping it, then.

> > Also, do we have a more standard "boilerplate" sentence for EFAULT?
> 
> Judging from "man -l /usr/share/man/man2/*.2", the most common
> wording is:
> 
>   [EFAULT]  foo points outside the process's allocated address space.
> 
> But i don't really like that.  The word "allocated" makes me wonder
> because it sounds too much like malloc(3) for my taste.
> Usually, pointers to automatic and to static objects are acceptable,
> too, and are those "allocated"?  Some might say they are not.
> Besides, "process's" looks awkward.
> 
> These occur, too:
> 
>   [EFAULT]  The foo parameter is not in a valid part of the user
> address space.
> 
>   [EFAULT]  foo references invalid memory.
> 
>   [EFAULT]  The name parameter specifies an area outside the
> process address space.
> 
>   [EFAULT]  foo points to an illegal address.
> 
>   [EFAULT]  The userspace address uaddr is invalid.
> 
>   [EFAULT]  Part of buf points outside the process's allocated
> address space.
> 
>   [EFAULT]  The buf parameter points to an invalid address.
> 
>   [EFAULT]  The argument foo specifies an invalid address.
> 
>   [EFAULT]  The foo parameter specified a bad address.
> 
>   [EFAULT]  The foo parameter points to memory not in
> a valid part of the process address space.
> 
>   [EFAULT]  The address specified for foo is invalid.
> 
>   [EFAULT]  An argument address referenced invalid memory.
> 
>   In errno(2):
>   14 EFAULT Bad address. The system detected an invalid address in
> attempting to use an argument of a call.
> 
>   [EFAULT]  There was an error reading or writing the kevent
> structure.
> 
> Quite a mess, i would say.
> 
> I think trying to explain over and over again what an invalid address
> is, in each and every manual page, is neither reasonable nor feasible -
> i'm not convinced the rules are really simple given how many types
> of valid memory there are: static, stack, heap, ...
> 
> So i'd probably settle for a concise, simple wording, and i think
> i like this one best:
> 
>   [EFAULT]  foo points to an invalid address.
> 
> What is valid can easily depend on the function being called and
> even on other function arguments; the same address can easily be
> valid for reading a handful of bytes but invalid for 

Re: nanosleep.2: miscellaneous cleanup

2021-07-21 Thread Ingo Schwarze
Hi Theo,

Theo de Raadt wrote on Tue, Jul 20, 2021 at 08:14:20PM -0600:
> Ingo Schwarze wrote:

>>  [EFAULT]  foo points outside the process's allocated address space.
>>
>> But i don't really like that.  The word "allocated" makes me wonder
>> because it sounds too much like malloc(3) for my taste.
>> Usually, pointers to automatic and to static objects are acceptable,
>> too, and are those "allocated"?  Some might say they are not.
>> Besides, "process's" looks awkward.

> Disagree on your dislike of this wording.
> 
> A process has an address space, VM_MIN_ADDRESS to VM_MAXUSER_ADDRESS.
> Parts of this are not mapped, because nothing has allocated backing
> resources.  Allocation happens via stack growth, execve, shm, mmap, etc
> etc etc etc.  Those are all methods for "allocating" within the
> process's address space.
> 
> Unallocated space generates EFAULT.
> 
> Addresses outside the VM_MIN_ADDRESS to VM_MAXUSER_ADDRESS range also
> generate EFAULT, and this is because backing resources are not permitted
> to be allocated there.  No allocation, thus EFAULT.
> 
> "allocated" is correct terminology, and it has nothing to do with
> malloc.

Fair enough, thank you for explaining.

Given that it is correct terminology, i recommend using it in the
nanosleep(2) manual, too, because it is the most widely used idiom
and it isn't excessively long.

> I don't see how trying to mince words is going to help anyone.
> 
> Unifying all the variants into one form will be quite a task.

Right, that would probably be a make-work project.  Having a wide
variety of wordings is slightly messy, but not enough of a problem
to warrant spending a significant effort on unification.

Yours,
  Ingo

> I don't know if this is the right form to use, but I don't like the
> argument you made.



ssh match.c: Remove always true condition

2021-07-21 Thread Martin Vahlensieck
Hi

After the last commit where consecutive `*' are folded, *pattern is
never '*' here.

Best,

Martin

Index: match.c
===
RCS file: /cvs/src/usr.bin/ssh/match.c,v
retrieving revision 1.43
diff -u -p -r1.43 match.c
--- match.c 3 Nov 2020 22:53:12 -   1.43
+++ match.c 21 Jul 2021 09:59:59 -
@@ -69,7 +69,7 @@ match_pattern(const char *s, const char 
return 1;
 
/* If next character in pattern is known, optimize. */
-   if (*pattern != '?' && *pattern != '*') {
+   if (*pattern != '?') {
/*
 * Look instances of the next character in
 * pattern, and try to match starting from



Re: Do not spin on the NET_LOCK() in kqueue

2021-07-21 Thread Martin Pieuchot
On 11/07/21(Sun) 14:45, Visa Hankala wrote:
> On Sat, Jul 10, 2021 at 05:26:57PM +0200, Martin Pieuchot wrote:
> > One of the reasons for the drop of performances in the kqueue-based
> > poll/select is the fact that kqueue filters are called up to 3 times
> > per syscall and that they all spin on the NET_LOCK() for TCP/UDP
> > packets.
> > 
> > Diff below is a RFC for improving the situation.
> > 
> > socket kqueue filters mainly check for the amount of available items to
> > read/write.  This involves comparing various socket buffer fields (sb_cc,
> > sb_lowat, etc).  The diff below introduces a new mutex to serialize
> > updates of those fields with reads in the kqueue filters.
> > 
> > Since these fields are always modified with the socket lock held, either
> > the mutex or the solock are enough to have a coherent view of them.
> > Note that either of these locks is necessary only if multiple fields
> > have to be read (like in sbspace()).
> > 
> > Other per-socket fields accessed in the kqueue filters are never
> > combined (with &&) to determine a condition.  So assuming it is fine to
> > read register-sized fields w/o the socket lock we can safely remove it
> > there.
> > 
> > Could such mutex also be used to serialize klist updates?
> 
> I think the lock should be such that it can serialize socket klists.
> 
> As the main motivator for this change is kqueue, the viability of using
> the mutex for the klist locking should be checked now. The mutex has to
> be held whenever calling KNOTE() on sb_sel.si_note, or selwakeup() on
> sb_sel. Then the socket f_event callbacks will not need to lock the
> mutex themselves.
> 
> I had a diff that serialized socket klists using solock(). It did not
> work well because it increased lock contention, especially when using
> kqueue as backend for poll(2) and select(2). The diff is not even
> correct any longer since recent changes to socket locking have
> introduced new lock order constraints that conflict with it.

Updated diff below does that.  It also uses a single per-socket mutex as
suggested by bluhm@.

Note that as long as poll(2) & select(2) use the current implementation a
KERNEL_LOCK()/UNLOCK() dance is necessary in sowakeup().  The goal of
this change combined with the poll/select rewrite is to get rid of this
dance.

Comments?

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.167
diff -u -p -r1.167 kern_event.c
--- kern/kern_event.c   16 Jun 2021 14:26:30 -  1.167
+++ kern/kern_event.c   21 Jul 2021 07:21:56 -
@@ -1884,6 +1884,9 @@ knote_dequeue(struct knote *kn)
 void
 knote_modify(const struct kevent *kev, struct knote *kn)
 {
+   if ((kn->kn_fop->f_flags & FILTEROP_MPSAFE) == 0)
+   KERNEL_ASSERT_LOCKED();
+
kn->kn_sfflags = kev->fflags;
kn->kn_sdata = kev->data;
kn->kn_udata = kev->udata;
@@ -1897,6 +1900,9 @@ knote_modify(const struct kevent *kev, s
 void
 knote_submit(struct knote *kn, struct kevent *kev)
 {
+   if ((kn->kn_fop->f_flags & FILTEROP_MPSAFE) == 0)
+   KERNEL_ASSERT_LOCKED();
+
if (kev != NULL) {
*kev = kn->kn_kevent;
if (kn->kn_flags & EV_CLEAR) {
Index: kern/uipc_socket.c
===
RCS file: /cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.263
diff -u -p -r1.263 uipc_socket.c
--- kern/uipc_socket.c  28 May 2021 16:24:53 -  1.263
+++ kern/uipc_socket.c  21 Jul 2021 07:21:56 -
@@ -84,7 +84,7 @@ int   filt_solistenprocess(struct knote *k
 intfilt_solisten_common(struct knote *kn, struct socket *so);
 
 const struct filterops solisten_filtops = {
-   .f_flags= FILTEROP_ISFD,
+   .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach   = NULL,
.f_detach   = filt_sordetach,
.f_event= filt_solisten,
@@ -93,7 +93,7 @@ const struct filterops solisten_filtops 
 };
 
 const struct filterops soread_filtops = {
-   .f_flags= FILTEROP_ISFD,
+   .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach   = NULL,
.f_detach   = filt_sordetach,
.f_event= filt_soread,
@@ -102,7 +102,7 @@ const struct filterops soread_filtops = 
 };
 
 const struct filterops sowrite_filtops = {
-   .f_flags= FILTEROP_ISFD,
+   .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach   = NULL,
.f_detach   = filt_sowdetach,
.f_event= filt_sowrite,
@@ -111,7 +111,7 @@ const struct filterops sowrite_filtops =
 };
 
 const struct filterops soexcept_filtops = {
-   .f_flags= FILTEROP_ISFD,
+   .f_flags= FILTEROP_ISFD | FILTEROP_MPSAFE,
.f_attach   = NULL,
.f_detach   = filt_sordetach,
.f_event= filt_soread,
@@ -181,6 +181,9 @@ socreate(int dom, 

Re: ix(4)/riscv64: Make ix(4) work when MSI-X interrupts aren't available

2021-07-21 Thread Mark Kettenis
> Date: Wed, 21 Jul 2021 15:15:11 +1000
> From: Jonathan Matthew 
> 
> On Tue, Jul 20, 2021 at 02:21:39PM +0200, Mark Kettenis wrote:
> > > Date: Tue, 20 Jul 2021 21:55:56 +1000
> > > From: Jonathan Matthew 
> > > 
> > > On Mon, Jul 19, 2021 at 07:37:10PM -0400, Ashton Fagg wrote:
> > > > I have an Intel 82599 10 gigabit ethernet card I wanted to get working
> > > > on my SiFive Unmatched board.
> > > > 
> > > > I found the ix(4) driver has some weirdness around MSI-X
> > > > interrupts. While the driver supports operating both with and without
> > > > MSI-X support, it's hard-coded via a flag rather than dynamically 
> > > > checking
> > > > if it's available. If the flag is set (which it always is right now),
> > > > but MSI-X isn't available, the driver will throw an error and the device
> > > > won't work:
> > > > 
> > > > ix0 at pci7 dev 0 function 0 "Intel 82599" rev 0x01
> > > > ixgbe_allocate_msix: pci_intr_map_msix vec 0 failed
> > > > 
> > > > The root cause is this call failing in if_ix.c:
> > > > 
> > > > if (pci_intr_map_msix(pa, i, &ih)) {
> > > > printf("ixgbe_allocate_msix: "
> > > > "pci_intr_map_msix vec %d failed\n", i);
> > > > error = ENOMEM;
> > > > goto fail;
> > > > }
> > > > 
> > > > 
> > > > Because in _pci_intr_map_msix (in sys/arch/riscv64/dev/pci_machdep.c):
> > > > 
> > > > if ((pa->pa_flags & PCI_FLAGS_MSI_ENABLED) == 0 ||
> > > > pci_get_capability(pc, tag, PCI_CAP_MSI, NULL, NULL) == 0)
> > > > return -1;
> > > > 
> > > > The PCI attach flags would not have PCI_FLAGS_MSI_ENABLED set.
> > > > 
> > > > The following diff remedies that by checking if PCI_FLAGS_MSI_ENABLED is
> > > > actually set, rather than just trying and failing because the hard-coded
> > > > flag says so. It also enables ix(4) in the kernel config for
> > > > riscv64. Effectively, the driver will now only try to use MSI-X if the
> > > > machine is advertising it to be available.
> > > 
> > > I'd rather not have to do this in every driver.  We otherwise check that 
> > > flag
> > > inside the pci interrupt functions rather than in the driver code, so we
> > > should do so in pci_intr_msix_count() too, since that's what we call in
> > > multi-queue nic drivers to decide whether to use MSI-X.  Drivers that only
> > > want a single vector will just call pci_intr_map_msix() and fall back to 
> > > MSI
> > > or legacy interrupts if that fails.
> > > 
> > > I posted the alternate version of this diff to misc@ a few days ago,
> > > which repeats the checks used to set PCI_FLAGS_MSI_ENABLED in
> > > pci_intr_msix_count(), rather than passing in struct
> > > pci_attach_args, in case we prefer to do it that way.
> > 
> > I don't really read misc@, so don't post your patches there.
> 
> Right, it was just there for testing.
> 
> > 
> > > Mark, what do you think?
> > 
> > Yeah, pci_intr_msix_count() should return 0 if MSIs are not
> > supported.  A bit strange though to pass both pa and pa->pa_tag.  I'd
> > change the function to only take pa as an argument.
> 
> Yes, on second look that makes sense.  Here's a better diff with that change,
> and that also doesn't break arches without __HAVE_PCI_MSIX.  ok?

ok kettenis@

> Index: if_bnxt.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_bnxt.c,v
> retrieving revision 1.32
> diff -u -p -u -p -r1.32 if_bnxt.c
> --- if_bnxt.c 24 Apr 2021 09:37:46 -  1.32
> +++ if_bnxt.c 21 Jul 2021 03:24:44 -
> @@ -537,7 +537,7 @@ bnxt_attach(struct device *parent, struc
>   sc->sc_flags |= BNXT_FLAG_MSIX;
>   intrstr = pci_intr_string(sc->sc_pc, ih);
>  
> - nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
> + nmsix = pci_intr_msix_count(pa);
>   if (nmsix > 1) {
>   sc->sc_ih = pci_intr_establish(sc->sc_pc, ih,
>   IPL_NET | IPL_MPSAFE, bnxt_admin_intr, sc, 
> DEVNAME(sc));
> Index: if_ix.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
> retrieving revision 1.178
> diff -u -p -u -p -r1.178 if_ix.c
> --- if_ix.c   22 Dec 2020 23:25:37 -  1.178
> +++ if_ix.c   21 Jul 2021 03:24:44 -
> @@ -1783,7 +1783,7 @@ ixgbe_setup_msix(struct ix_softc *sc)
>   if (!ixgbe_enable_msix)
>   return;
>  
> - nmsix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
> + nmsix = pci_intr_msix_count(pa);
>   if (nmsix <= 1)
>   return;
>  
> Index: if_ixl.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_ixl.c,v
> retrieving revision 1.74
> diff -u -p -u -p -r1.74 if_ixl.c
> --- if_ixl.c  26 Mar 2021 08:02:34 -  1.74
> +++ if_ixl.c  21 Jul 2021 03:24:44 -
> @@ -1795,7 +1795,7 @@ 

Re: new kqueue-based select(2) implementation

2021-07-21 Thread Martin Pieuchot
On 23/06/21(Wed) 15:53, Alexander Bluhm wrote:
> On Wed, Jun 23, 2021 at 11:40:18AM +0200, Martin Pieuchot wrote:
> > Our previous attempt [0] to replace the current select(2) implementation
> > has been reverted due to non-acceptable latency increase on sockets [1].
> 
> I have measured the performance difference.
> 
> http://bluhm.genua.de/perform/results/2021-06-21T09%3A44%3A18Z/perform.html
> 
> Worst 20% throughput drop is in 'iperf3 -c10.3.45.35 -u -b10G -w1m
> -t10 -R' which can be seen here.
> 
> http://bluhm.genua.de/perform/results/2021-06-21T09%3A44%3A18Z/gnuplot/udp.html
> 
> Note that iperf3 calls select(2) multiple times per UDP packet.
> 
> As a new feature I have links to btrace kstack flame graphs in the
> table.

Thanks a lot for the tests.  The FlameGraphs have shown that lazy
removal wasn't working correctly.  Updated diff below now works as
expected.

I'm aware of the throughput drop in the UDP iperf3 test, this is not a
real case scenario so I don't consider it as a blocker.  However it is
very useful to check the contention on the NET_LOCK() in select(2).  I'm
working on this issue on another thread, but there's an interdependency
between the two diffs, due to lock ordering. 

Comments?

Index: kern/kern_event.c
===
RCS file: /cvs/src/sys/kern/kern_event.c,v
retrieving revision 1.167
diff -u -p -r1.167 kern_event.c
--- kern/kern_event.c   16 Jun 2021 14:26:30 -  1.167
+++ kern/kern_event.c   13 Jul 2021 07:21:03 -
@@ -92,7 +92,7 @@ void  kqueue_do_check(struct kqueue *kq, 
 #define kqueue_check(kq)   do {} while (0)
 #endif
 
-void   kqpoll_dequeue(struct proc *p);
+void   kqpoll_dequeue(struct proc *p, int all);
 
 static int filter_attach(struct knote *kn);
 static voidfilter_detach(struct knote *kn);
@@ -720,12 +720,12 @@ kqpoll_init(void)
 
if (p->p_kq != NULL) {
/*
-* Discard any knotes that have been enqueued after
+* Discard any badfd knotes that have been enqueued after
 * previous scan.
-* This prevents accumulation of enqueued badfd knotes
-* in case scan does not make progress for some reason.
+* This prevents them from accumulating in case
+* scan does not make progress for some reason.
 */
-   kqpoll_dequeue(p);
+   kqpoll_dequeue(p, 0);
return;
}
 
@@ -747,7 +747,7 @@ kqpoll_exit(void)
 
kqueue_purge(p, p->p_kq);
/* Clear any detached knotes that remain in the queue. */
-   kqpoll_dequeue(p);
+   kqpoll_dequeue(p, 1);
kqueue_terminate(p, p->p_kq);
KASSERT(p->p_kq->kq_refs == 1);
KQRELE(p->p_kq);
@@ -755,33 +755,50 @@ kqpoll_exit(void)
 }
 
 void
-kqpoll_dequeue(struct proc *p)
+kqpoll_dequeue(struct proc *p, int all)
 {
+   struct knote marker;
struct knote *kn;
struct kqueue *kq = p->p_kq;
 
+   /* Bail out early without locking if the queue appears empty. */
+   if (kq->kq_count == 0)
+   return;
+
+   memset(&marker, 0, sizeof(marker));
+   marker.kn_filter = EVFILT_MARKER;
+   marker.kn_status = KN_PROCESSING;
+
mtx_enter(&kq->kq_lock);
-   while ((kn = TAILQ_FIRST(&kq->kq_head)) != NULL) {
+   kn = TAILQ_FIRST(&kq->kq_head);
+   while (kn != NULL) {
/* This kqueue should not be scanned by other threads. */
KASSERT(kn->kn_filter != EVFILT_MARKER);
 
-   if (!knote_acquire(kn, NULL, 0)) {
-   /* knote_acquire() has released kq_lock. */
-   mtx_enter(&kq->kq_lock);
+   if (all == 0 && (kn->kn_status & KN_ATTACHED)) {
+   kn = TAILQ_NEXT(kn, kn_tqe);
continue;
}
 
-   kqueue_check(kq);
-   TAILQ_REMOVE(&kq->kq_head, kn, kn_tqe);
-   kn->kn_status &= ~KN_QUEUED;
-   kq->kq_count--;
-   mtx_leave(&kq->kq_lock);
+   TAILQ_INSERT_BEFORE(kn, &marker, kn_tqe);
+
+   if (!knote_acquire(kn, NULL, 0)) {
+   /* knote_acquire() has released kq_lock. */
+   } else {
+   kqueue_check(kq);
+   TAILQ_REMOVE(&kq->kq_head, kn, kn_tqe);
+   kn->kn_status &= ~KN_QUEUED;
+   kq->kq_count--;
+   mtx_leave(&kq->kq_lock);
 
-   filter_detach(kn);
-   knote_drop(kn, p);
+   filter_detach(kn);
+   knote_drop(kn, p);
+   }
 
mtx_enter(&kq->kq_lock);
kqueue_check(kq);
+   kn = TAILQ_NEXT(&marker, kn_tqe);
+   TAILQ_REMOVE(&kq->kq_head, &marker, kn_tqe);
}
mtx_leave(&kq->kq_lock);
 }
Index: kern/sys_generic.c

Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Martin Pieuchot
On 20/07/21(Tue) 15:46, Alexander Bluhm wrote:
> On Tue, Jul 20, 2021 at 02:26:02PM +0200, Alexander Bluhm wrote:
> > > Note that having multiple threads competing for an exclusive rwlock will
> > > generate unnecessary wakeup/sleep cycles every time the lock is released.
> > > It is valuable to keep this in mind as it might add extra latency when
> > > processing packets.
> > 
> > Of course.  What do you recommend?
> 
> We may have another alternative.
> 
> - Always use a shared net lock but also aquire kernel lock.

Using an exclusive NET_LOCK() seems better, that's what we have now.

My point is just that if we're using an exclusive NET_LOCK() we lose the
gain of having multiple softnet threads, so it would be better to reduce
the number of threads.  That said, if it isn't trivial to do so, I'd
rather spend the time making IPsec work with parallel threads.

Another reason for not using an exclusive lock in the softnet thread is
to be able to execute read ioctls at the same time as the forwarding
path.  This has been reverted due to a bug elsewhere last year and never
got activated again.  But somebody might want to revisit this, 'cause
doing ifconfig(8) on a busy machine was hanging for a very long time.



Re: forwarding in parallel ipsec workaround

2021-07-21 Thread Martin Pieuchot
On 20/07/21(Tue) 14:26, Alexander Bluhm wrote:
> On Tue, Jul 20, 2021 at 10:08:09AM +0200, Martin Pieuchot wrote:
> > On 19/07/21(Mon) 17:53, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > I found why the IPsec workaround did not work.
> > > 
> > > At init time we set ifiq->ifiq_softnet = net_tq(ifp->if_index +
> > > idx), but the workaround modifies net_tq() at runtime.  Modifying
> > > net_tq() at runtime is bad anyway as task_add() and task_del() could
> > > be called with different task queues.
> > > 
> > > So better use exclusive lock if IPsec is in use.  For me this is
> > > running stable.
> > 
> > Note that having multiple threads competing for an exclusive rwlock will
> > generate unnecessary wakeup/sleep cycles every time the lock is released.  
> > It is valuable to keep this in mind as it might add extra latency when
> > processing packets.
> 
> Of course.  What do you recommend?

What you find easier to make progress with :)

> - Develop outside of the tree until all problems are fixed.
> - Delay work on parallel forwarding until IPsec is MP safe.
> - Accept a possible slowdown of IPsec.  In my measurements it gets
>   faster even with the exclusive lock.
> - Concentrate on making IPsec faster.  By removing the crypto
>   queues you gain much more performance than the exclusive lock may
>   cost.  Did you see the massive kernel locks in my graph?
>   
> http://bluhm.genua.de/perform/results/latest/patch-sys-ip-multiqueue.1/btrace/ssh_perform%40lt13_iperf3_-c10.4.56.36_-P10_-t10_-R-btrace-kstack.0.svg

I didn't see this, nice to know.

> - Make ARP MP safe.  Currently we need the kernel lock there or
>   it crashes.  This creates latency for all kind of packets.
> - Convert the rwlock in pf to mutex.  I think your argument counts
>   much more there.  But I cannot prove it.

My argument about what?  Didn't we all agree about converting the rwlock
to a mutex for now?

> My plan is to commit what we have and improve where most pain is.
> This makes incremental steps easier.

Makes sense.