Re: [External] : running network stack forwarding in parallel

2021-04-21 Thread Alexandr Nedvedicky
Hello,

thanks for the good news.

On Wed, Apr 21, 2021 at 10:32:08PM +0200, Alexander Bluhm wrote:
> On Wed, Apr 21, 2021 at 09:59:53PM +0200, Alexandr Nedvedicky wrote:
> > was pf(4) enabled while running those tests?
> 
> Yes.
> 
> > if pf(4) was enabled while those tests were running,
> > what rules were loaded to pf(4)?
> 
> Default pf.conf:
> 

> 
> Linux iperf3 is sending 10 TCP streams in parallel over the OpenBSD
> forwarding machine.  I see 22 iperf3 states in pf(4).
> 
> > if I remember correctly I could see a performance boost by a factor of
> > ~1.5 when running those tests with a similar diff applied to machines
> > provided by hrvoje@.
> 
> Multiqueue support for ix(4) has improved.  Maybe that is why I see a
> factor of 2.  The machine has 4 cores.  The limit seems to be the 10Gig
> interface, although we do not use it optimally.
> 

in my testing I hit the state table size limit (1 million states). the test
tool (t-rex traffic generator from cisco [1]) was hammering the firewall
with various connections (pop/imap/http...) emulating real network clients
and servers. the throughput/latency got worse as soon as the state table
filled up.
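
The state ceiling is a pf.conf tunable; a minimal sketch of raising it and
watching the counters (the figure below simply mirrors the 1 million states
mentioned above, it is not a recommendation):

    set limit states 1000000

    # show the configured hard limits and the current state count
    pfctl -sm
    pfctl -si | grep -i entries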

I'll eventually repeat those tests to get fresh numbers.

thanks and
regards
sashan

[1] https://trex-tgn.cisco.com/



Re: [External] : running network stack forwarding in parallel

2021-04-21 Thread Alexander Bluhm
On Wed, Apr 21, 2021 at 09:59:53PM +0200, Alexandr Nedvedicky wrote:
> was pf(4) enabled while running those tests?

Yes.

> if pf(4) was enabled while those tests were running,
> what rules were loaded to pf(4)?

Default pf.conf:

#   $OpenBSD: pf.conf,v 1.55 2017/12/03 20:40:04 sthen Exp $
#
# See pf.conf(5) and /etc/examples/pf.conf

set skip on lo

block return            # block stateless traffic
pass                    # establish keep-state

# By default, do not permit remote connections to X11
block return in on ! lo0 proto tcp to port 6000:6010

# Port build user does not need network
block return out log proto {tcp udp} user _pbuild

> my guess is pf(4) was not enabled when running those tests.

Linux iperf3 is sending 10 TCP streams in parallel over the OpenBSD
forwarding machine.  I see 22 iperf3 states in pf(4).
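
A sketch of what such a test run might look like; the host name is a
placeholder and iperf3's default port 5201 is assumed:

    # on the Linux sender, 10 parallel TCP streams through the forwarder
    iperf3 -c <target-host> -P 10 -t 60

    # on the OpenBSD forwarder, count the matching pf states
    pfctl -ss | grep 5201 | wc -l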

> if I remember correctly I could see a performance boost by a factor of
> ~1.5 when running those tests with a similar diff applied to machines
> provided by hrvoje@.

Multiqueue support for ix(4) has improved.  Maybe that is why I see a
factor of 2.  The machine has 4 cores.  The limit seems to be the 10Gig
interface, although we do not use it optimally.
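
One way to sanity-check that the load really spreads across the queues and
cores (ix0 is just an example interface name):

    # one interrupt counter per configured ix(4) queue
    vmstat -i | grep ix0

    # watch per-CPU interrupt and system time while the test runs
    systat vmstat 1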

> I agree it's time to commit such a change.

cool

bluhm



Re: [External] : running network stack forwarding in parallel

2021-04-21 Thread Alexandr Nedvedicky
Hello,

just a quick question:
was pf(4) enabled while running those tests?

if pf(4) was enabled while those tests were running,
what rules were loaded to pf(4)?

my guess is pf(4) was not enabled when running those tests.  if I remember
correctly I could see a performance boost by a factor of ~1.5 when running
those tests with a similar diff applied to machines provided by hrvoje@.

I agree it's time to commit such a change.

thanks and
regards
sashan

On Wed, Apr 21, 2021 at 09:36:11PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> For a while we have been running the network stack without the kernel
> lock, but with a network lock.  The latter is an exclusive sleeping rwlock.
> 
> It is possible to run the forwarding path in parallel on multiple
> cores.  I use ix(4) interfaces which provide one input queue for
> each CPU.  For that we have to start multiple softnet tasks and
> replace the exclusive lock with a shared lock.  This works for IP
> and IPv6 input and forwarding, but not for higher protocols.
> 
> So I implement a queue between IP and the higher layers.  We had that
> before when we were using the netlock for IP and the kernel lock for
> TCP.  Now we have a shared lock for IP and an exclusive lock for TCP.
> By using a queue, we can upgrade the lock once for multiple packets.
> 
> As you can see here, forwarding performance doubles from 4.5x10^9
> to 9x10^9.  The left column is current, the right column is with my
> diff.  The other dots at 2x10^9 are with socket splicing, which is not
> affected.
> http://bluhm.genua.de/perform/results/2021-04-21T10:50:37Z/gnuplot/forward.png
> 
> Here are all the numbers for various network tests.
> http://bluhm.genua.de/perform/results/2021-04-21T10:50:37Z/perform.html
> TCP performance gets less deterministic due to the additional queue.
> 
> The kernel stack flame graph looks like this.  The machine uses 4 CPUs.
> http://bluhm.genua.de/files/kstack-multiqueue-forward.svg
> 
> Note the kernel lock around nd6_resolve().  I had to put it there
> as I have seen an MP-related crash there.  This can be fixed
> independently of this diff.
> 
> We need more MP pressure to find such bugs and races.  I think now
> is a good time to give this diff broader testing and commit it.
> You need interfaces with multiple queues to see a difference.
>
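
The locking pattern described above (per-packet work under the shared
netlock in parallel softnet tasks, one exclusive acquisition per batch for
the upper layer) can be illustrated with a small userland pthreads sketch.
This is only an analogy with made-up names (ip_input, upper_input_batch,
BATCH); it is not the kernel diff itself:

/* cc -o batch batch.c -lpthread */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS	4	/* stands in for the parallel softnet tasks */
#define NPACKETS	1000	/* "packets" per worker */
#define BATCH		16	/* packets handed over per exclusive acquisition */

static pthread_rwlock_t netlock = PTHREAD_RWLOCK_INITIALIZER;
static long upper_layer_total;	/* protected by the exclusive netlock */

/* "IP layer" work: cheap, runs under the shared lock in parallel. */
static long
ip_input(int worker, int seq)
{
	return (long)worker * NPACKETS + seq;
}

/* "Upper layer" work: needs the exclusive lock, consumes a whole batch. */
static void
upper_input_batch(const long *batch, int n)
{
	pthread_rwlock_wrlock(&netlock);	/* once per batch, not per packet */
	for (int i = 0; i < n; i++)
		upper_layer_total += batch[i];
	pthread_rwlock_unlock(&netlock);
}

static void *
softnet(void *arg)
{
	int worker = (int)(long)arg;
	long batch[BATCH];
	int n = 0;

	for (int seq = 0; seq < NPACKETS; seq++) {
		pthread_rwlock_rdlock(&netlock);	/* shared: all workers run */
		long pkt = ip_input(worker, seq);
		pthread_rwlock_unlock(&netlock);

		batch[n++] = pkt;			/* queue locally, no lock held */
		if (n == BATCH) {
			upper_input_batch(batch, n);
			n = 0;
		}
	}
	if (n > 0)
		upper_input_batch(batch, n);
	return NULL;
}

int
main(void)
{
	pthread_t tid[NWORKERS];

	for (long i = 0; i < NWORKERS; i++)
		pthread_create(&tid[i], NULL, softnet, (void *)i);
	for (int i = 0; i < NWORKERS; i++)
		pthread_join(tid[i], NULL);

	printf("%d packets processed, checksum %ld\n",
	    NWORKERS * NPACKETS, upper_layer_total);
	return 0;
}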