On Wed, Apr 21, 2021 at 09:36:11PM +0200, Alexander Bluhm wrote:
> Hi,
>
> For a while now we have been running the network stack without the
> kernel lock, but with a network lock. The latter is an exclusive
> sleeping rwlock.
>
> It is possible to run the forwarding path in parallel on multiple
> cores. I use ix(4) interfaces which provide one input queue for
> each CPU. For that we have to start multiple softnet tasks and
> replace the exclusive lock with a shared lock. This works for IP
> and IPv6 input and forwarding, but not for higher protocols.
>
> So I implemented a queue between IP and the higher layers. We had
> that before, when we were using the netlock for IP and the kernel
> lock for TCP. Now we have a shared lock for IP and an exclusive
> lock for TCP. By using a queue, we can upgrade the lock once for
> multiple packets.
>
> As you can see here, forwarding performance doubles from 4.5x10^9
> to 9x10^9. The left column is current, the right column is with my
> diff. The other dots at 2x10^9 are with socket splicing, which is
> not affected.
> http://bluhm.genua.de/perform/results/2021-04-21T10%3A50%3A37Z/gnuplot/forward.png
>
> Here are all numbers with various network tests.
> http://bluhm.genua.de/perform/results/2021-04-21T10%3A50%3A37Z/perform.html
> TCP performance gets less deterministic due to the additional queue.
>
> The kernel stack flame graph looks like this. The machine uses 4 CPUs.
> http://bluhm.genua.de/files/kstack-multiqueue-forward.svg
>
> Note the kernel lock around nd6_resolve(). I had to put it there
> as I have seen an MP-related crash there. This can be fixed
> independently of this diff.
>
> We need more MP pressure to find such bugs and races. I think now
> is a good time to give this diff broader testing and commit it.
> You need interfaces with multiple queues to see a difference.
>
> ok?
>
> bluhm
>
Hi.
Did you test your diff with ipsec(4) enabled? I'm asking because we
have this in net/pfkeyv2.c:
1108 pfkeyv2_send(struct socket *so, void *message, int len)
1109 {
....
2013 ipsec_in_use++;
2014 /*
2015 * XXXSMP IPsec data structures are not ready to be
2016 * accessed by multiple Network threads in parallel,
2017 * so force all packets to be processed by the first
2018 * one.
2019 */
2020 extern int nettaskqs;
2021 nettaskqs = 1;