Re: Network stack changes

2013-09-24 Thread Marko Zec
On Tuesday 24 September 2013 00:46:46 Sami Halabi wrote:
> Hi,
>
> > http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> > http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>
> I've tried the diff in 10-current; it applied cleanly but I had errors
> compiling the new kernel... Is there any work to make it work? I'd love to
> test it.

Even if you got it to compile on current, you could only run synthetic tests 
measuring lookup performance using streams of random keys, as outlined in 
the paper (btw. the paper at Luigi's site is an older draft, the final 
version with slightly revised benchmarks is available here:
http://www.sigcomm.org/sites/default/files/ccr/papers/2012/October/2378956-2378961.pdf)

I.e. the code only hooks into the routing API for testing purposes, but is 
completely disconnected from the forwarding path.

We have a prototype in the works which combines DXR with Netmap in userspace 
and is capable of sustaining well above line rate forwarding with 
full-sized BGP views using Intel 10G cards on commodity multicore machines.  
The work somewhat stalled during the summer, but I plan to wrap it up 
and release the code by the end of this year.  With recent advances in 
netmap it might also be feasible to merge DXR and netmap entirely inside 
the kernel but I've not explored that path yet...
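
For readers who want a concrete picture of what such a userspace data plane
looks like, here is a minimal sketch of a receive/lookup/transmit loop on top
of the netmap API from net/netmap_user.h.  dxr_lookup() is a hypothetical
stand-in for the DXR routine; this is an illustration only, not the prototype
code:

    /* Sketch: single-queue userspace forwarding with netmap + a DXR-style lookup. */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <arpa/inet.h>
    #include <poll.h>
    #include <stdint.h>
    #include <string.h>

    extern int dxr_lookup(uint32_t dst);    /* hypothetical: returns egress port */

    static void
    forward_loop(struct nm_desc *rx, struct nm_desc *tx)
    {
        struct pollfd pfd[2] = {
            { .fd = rx->fd, .events = POLLIN },
            { .fd = tx->fd, .events = POLLOUT },
        };

        for (;;) {
            poll(pfd, 2, -1);
            struct netmap_ring *rxr = NETMAP_RXRING(rx->nifp, 0);
            struct netmap_ring *txr = NETMAP_TXRING(tx->nifp, 0);

            /* Drain whatever has already arrived; zero-copy by swapping buffers. */
            while (!nm_ring_empty(rxr) && !nm_ring_empty(txr)) {
                struct netmap_slot *rs = &rxr->slot[rxr->cur];
                struct netmap_slot *ts = &txr->slot[txr->cur];
                char *buf = NETMAP_BUF(rxr, rs->buf_idx);
                uint32_t dst, idx;

                memcpy(&dst, buf + 14 + 16, 4);   /* IPv4 daddr, assuming no vlan tag */
                (void)dxr_lookup(ntohl(dst));     /* egress choice ignored in this sketch */

                idx = ts->buf_idx;                /* swap buffer indices between rings */
                ts->buf_idx = rs->buf_idx;
                rs->buf_idx = idx;
                ts->len = rs->len;
                ts->flags |= NS_BUF_CHANGED;
                rs->flags |= NS_BUF_CHANGED;
                rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
                txr->head = txr->cur = nm_ring_next(txr, txr->cur);
            }
        }
    }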

Marko


> Sami
>
>
> On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov <
>
> melif...@yandex-team.ru> wrote:
> > On 29.08.2013 15:49, Adrian Chadd wrote:
> >> Hi,
> >
> > Hello Adrian!
> > I'm very sorry for the looong reply.
> >
> >> There's a lot of good stuff to review here, thanks!
> >>
> >> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to
> >> keep locking things like that on a per-packet basis. We should be able
> >> to do this in a cleaner way - we can defer RX into a CPU pinned
> >> taskqueue and convert the interrupt handler to a fast handler that
> >> just schedules that taskqueue. We can ignore the ithread entirely
> >> here.
> >>
> >> What do you think?
> >
> > Well, it sounds good :) But performance numbers and Jack's opinion are
> > more important :)
> >
> > Are you going to Malta?
> >
> >> Totally pie in the sky handwaving at this point:
> >>
> >> * create an array of mbuf pointers for completed mbufs;
> >> * populate the mbuf array;
> >> * pass the array up to ether_demux().
> >>
> >> For vlan handling, it may end up populating its own list of mbufs to
> >> push up to ether_demux(). So maybe we should extend the API to have a
> >> bitmap of packets to actually handle from the array, so we can pass up
> >> a larger array of mbufs, note which ones are for the destination and
> >> then the upcall can mark which frames it's consumed.
> >>
> >> I specifically wonder how much work/benefit we may see by doing:
> >>
> >> * batching packets into lists so various steps can batch process
> >> things rather than run to completion;
> >> * batching the processing of a list of frames under a single lock
> >> instance - eg, if the forwarding code could do the forwarding lookup
> >> for 'n' packets under a single lock, then pass that list of frames up
> >> to inet_pfil_hook() to do the work under one lock, etc, etc.
> >
> > I'm thinking the same way, but we're stuck with 'forwarding lookup' due
> > to problem with egress interface pointer, as I mention earlier. However
> > it is interesting to see how much it helps, regardless of locking.
> >
> > Currently I'm thinking that we should try to change radix to something
> > different (it seems that it can be checked fast) and see what happened.
> > Luigi's performance numbers for our radix are too awful, and there is a
> > patch implementing alternative trie:
> > http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> > http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
> >
> >> Here, the processing would look less like "grab lock and process to
> >> completion" and more like "mark and sweep" - ie, we have a list of
> >> frames that we mark as needing processing and mark as having been
> >> processed at each layer, so we know where to next dispatch them.
> >>
> >> I still have some tool coding to do with PMC before I even think about
> >> tinkering with this as I'd like to measure stuff like per-packet
> >> latency as well as top-level processing overhead (ie,
> >> CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC
> >> interrupts on that core, etc.)
> >
> > That will be great to see!
> >
> >> Thanks,
> >>
> >>
> >>
> >> -adrian
> >
> > ___
> > freebsd-...@freebsd.org mailing list
> > http://lists.freebsd.org/mailman/listinfo/freebsd-net
> > To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

Re: Network stack changes

2013-09-23 Thread Sami Halabi
Hi,
> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
I've tried the diff in 10-current; it applied cleanly but I had errors compiling
the new kernel... Is there any work to make it work? I'd love to test it.

Sami


On Sun, Sep 22, 2013 at 11:12 PM, Alexander V. Chernikov <
melif...@yandex-team.ru> wrote:

> On 29.08.2013 15:49, Adrian Chadd wrote:
>
>> Hi,
>>
> Hello Adrian!
> I'm very sorry for the looong reply.
>
>
>
>> There's a lot of good stuff to review here, thanks!
>>
>> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to
>> keep locking things like that on a per-packet basis. We should be able to
>> do this in a cleaner way - we can defer RX into a CPU pinned taskqueue and
>> convert the interrupt handler to a fast handler that just schedules that
>> taskqueue. We can ignore the ithread entirely here.
>>
>> What do you think?
>>
> Well, it sounds good :) But performance numbers and Jack's opinion are more
> important :)
>
> Are you going to Malta?
>
>
>> Totally pie in the sky handwaving at this point:
>>
>> * create an array of mbuf pointers for completed mbufs;
>> * populate the mbuf array;
>> * pass the array up to ether_demux().
>>
>> For vlan handling, it may end up populating its own list of mbufs to push
>> up to ether_demux(). So maybe we should extend the API to have a bitmap of
>> packets to actually handle from the array, so we can pass up a larger array
>> of mbufs, note which ones are for the destination and then the upcall can
>> mark which frames it's consumed.
>>
>> I specifically wonder how much work/benefit we may see by doing:
>>
>> * batching packets into lists so various steps can batch process things
>> rather than run to completion;
>> * batching the processing of a list of frames under a single lock
>> instance - eg, if the forwarding code could do the forwarding lookup for
>> 'n' packets under a single lock, then pass that list of frames up to
>> inet_pfil_hook() to do the work under one lock, etc, etc.
>>
> I'm thinking the same way, but we're stuck with 'forwarding lookup' due to
> problem with egress interface pointer, as I mention earlier. However it is
> interesting to see how much it helps, regardless of locking.
>
> Currently I'm thinking that we should try to change radix to something
> different (it seems that it can be checked fast) and see what happened.
> Luigi's performance numbers for our radix are too awful, and there is a
> patch implementing alternative trie:
> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>
>
>
>
>> Here, the processing would look less like "grab lock and process to
>> completion" and more like "mark and sweep" - ie, we have a list of frames
>> that we mark as needing processing and mark as having been processed at
>> each layer, so we know where to next dispatch them.
>>
>> I still have some tool coding to do with PMC before I even think about
>> tinkering with this as I'd like to measure stuff like per-packet latency as
>> well as top-level processing overhead (ie, CPU_CLK_UNHALTED.THREAD_P /
>> lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.)
>>
> That will be great to see!
>
>>
>> Thanks,
>>
>>
>>
>> -adrian
>>
>>
> ___
> freebsd-...@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>



-- 
Sami Halabi
Information Systems Engineer
NMS Projects Expert
FreeBSD SysAdmin Expert


Re: Network stack changes

2013-09-23 Thread Slawa Olhovchenkov
On Sun, Sep 22, 2013 at 11:58:37PM +0400, Alexander V. Chernikov wrote:

> I've found the paper I was talking about:
> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> 
> It claims that our radix is able to do 6MPPS/core and it does not scale 
> with number of cores.

Our radix is buggy and doesn't work correctly.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-22 Thread Adrian Chadd
Hi!



On 22 September 2013 13:12, Alexander V. Chernikov
wrote:


>  I'm thinking the same way, but we're stuck with 'forwarding lookup' due
> to problem with egress interface pointer, as I mention earlier. However it
> is interesting to see how much it helps, regardless of locking.
>
> Currently I'm thinking that we should try to change radix to something
> different (it seems that it can be checked fast) and see what happened.
> Luigi's performance numbers for our radix are too awful, and there is a
> patch implementing alternative trie:
> http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
> http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff
>
>
So, I can make educated guesses about why this is better for forwarding
workloads. I'd like to characterize it though. So, what's it doing that's
better? Better locking? Better caching behaviour? Fewer memory lookups? etc.

Thanks,



-adrian
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 29.08.2013 15:49, Adrian Chadd wrote:

Hi,

Hello Adrian!
I'm very sorry for the looong reply.



There's a lot of good stuff to review here, thanks!

Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to 
keep locking things like that on a per-packet basis. We should be able 
to do this in a cleaner way - we can defer RX into a CPU pinned 
taskqueue and convert the interrupt handler to a fast handler that 
just schedules that taskqueue. We can ignore the ithread entirely here.


What do you think?
Well, it sounds good :) But performance numbers and Jack's opinion are more 
important :)


Are you going to Malta?


Totally pie in the sky handwaving at this point:

* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().

For vlan handling, it may end up populating its own list of mbufs to 
push up to ether_demux(). So maybe we should extend the API to have a 
bitmap of packets to actually handle from the array, so we can pass up 
a larger array of mbufs, note which ones are for the destination and 
then the upcall can mark which frames it's consumed.


I specifically wonder how much work/benefit we may see by doing:

* batching packets into lists so various steps can batch process 
things rather than run to completion;
* batching the processing of a list of frames under a single lock 
instance - eg, if the forwarding code could do the forwarding lookup 
for 'n' packets under a single lock, then pass that list of frames up 
to inet_pfil_hook() to do the work under one lock, etc, etc.
I'm thinking the same way, but we're stuck with the 'forwarding lookup' due 
to the problem with the egress interface pointer, as I mentioned earlier. However, 
it is interesting to see how much it helps, regardless of locking.


Currently I'm thinking that we should try to change the radix to something 
different (it seems that it can be tested quickly) and see what happens.
Luigi's performance numbers for our radix are too awful, and there is a 
patch implementing alternative trie:

http://info.iet.unipi.it/~luigi/papers/20120601-dxr.pdf
http://www.nxlab.fer.hr/dxr/stable_8_20120824.diff




Here, the processing would look less like "grab lock and process to 
completion" and more like "mark and sweep" - ie, we have a list of 
frames that we mark as needing processing and mark as having been 
processed at each layer, so we know where to next dispatch them.


I still have some tool coding to do with PMC before I even think about 
tinkering with this as I'd like to measure stuff like per-packet 
latency as well as top-level processing overhead (ie, 
CPU_CLK_UNHALTED.THREAD_P / lagg0 TX bytes/pkts, RX bytes/pkts, NIC 
interrupts on that core, etc.)

That will be great to see!


Thanks,



-adrian



___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-22 Thread Slawa Olhovchenkov
On Mon, Sep 23, 2013 at 12:01:17AM +0400, Alexander V. Chernikov wrote:

> On 29.08.2013 05:32, Slawa Olhovchenkov wrote:
> > On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:
> >
> >>> ..
> >>> while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on 
> >>> the same-class hardware and
> >>> _userland_ forwarding.
> >> Those numbers sound a bit far out.  Maybe if the packet isn't touched
> >> or looked at at all in a pure netmap interface to interface bridging
> >> scenario.  I don't believe these numbers.
> > 80*64*8 = 40.960 Gb/s
> > Maybe DCA? And use a CPU with 40 PCIe lanes and 4 memory channels.
> Intel introduces DDIO instead of DCA: 
> http://www.intel.com/content/www/us/en/io/direct-data-i-o.html
> (and it seems DCA does not help much):
> https://www.myricom.com/software/myri10ge/790-how-do-i-enable-intel-direct-cache-access-dca-with-the-linux-myri10ge-driver.html
> https://www.myricom.com/software/myri10ge/783-how-do-i-get-the-best-performance-with-my-myri-10g-network-adapters-on-a-host-that-supports-intel-data-direct-i-o-ddio.html
> 
> (However, the DPDK paper notes DDIO is a significant help)

Ha, the Intel paper says SMT is significantly better than HT. In the real world --
same shit.

For a network application, if buffering needs more than the L3 cache, what
happens? Maybe some bad things...
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 14.09.2013 22:49, Olivier Cochard-Labbé wrote:

On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:

IXIA ? For the timescales we need to address we don't need an IXIA,
a netmap sender is more than enough


The great netmap generates only one IP flow (same src/dst IP and same
src/dst port).
This doesn't permit testing a multi-queue NIC (or SMP packet-filter) on a
simple lab like this:
netmap sender => freebsd router => netmap receiver
I've got a variant which is capable of doing line-rate pcap replays on a 
single queue.

(However, this is true for small pcaps only)


Regards,

Olivier


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 29.08.2013 02:24, Andre Oppermann wrote:

On 28.08.2013 20:30, Alexander V. Chernikov wrote:

Hello list!


Hello Alexander,

Hello Andre!
I'm very sorry to answer so late.


you sent quite a few things in the same email.  I'll try to respond
as much as I can right now.  Later you should split it up to have
more in-depth discussions on the individual parts.

If you could make it to the EuroBSDcon 2013 DevSummit that would be
even more awesome.  Most of the active network stack people will be
there too.
I've sent a presentation describing nearly the same things to devsummit@ 
so I hope this can be discussed in the Networking group.

I hope to attend DevSummit & EuroBSDcon.


There are a lot of constantly arising discussions related to networking 
stack performance/changes.


I'll try to summarize current problems and possible solutions from my 
point of view.
(Generally this is one problem: stack is slooow, but we need to know why 
and what to do).


Compared to others it's not thaaat slow. ;)


Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio doesn't have any 
'ascii-art' exporter).


Note that we are using process-to-completion model, e.g. process any 
packet in ISR until it is either consumed by L4+ stack or dropped or 
put to egress NIC queue.

(There is also deferred ISR model implemented inside netisr but it 
does not change much:
it can help to do more fine-grained hashing (for GRE or other similar 
traffic), but
1) it uses per-packet mutex locking which kills all performance
2) it currently does not have _any_ hashing functions (see absence of 
flags in `netstat -Q`)
People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or 
modified PPPoe/GRE version)
report some profit, but without fixing (1) it can't help much
)

So, let's start:

1) Ixgbe uses mutex to protect each RX ring which is perfectly fine 
since there is nearly no contention
(the only thing that can happen is driver reconfiguration which is 
rare and, more significant, we do this once
for the batch of packets received in given interrupt). However, due 
to some (im)possible deadlocks current code
does per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion ended with nothing:
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html

1*) Possible BPF users. Here we have one rlock if there are any 
readers present
(and mutex for any matching packets, but this is more or less OK. 
Additionally, there is WIP to implement multiqueue BPF
and there is chance that we can reduce lock contention there).


Rlock to rmlock?

Yes, probably.



There is also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers but not registering them as receivers 
(which implies rlock)


I believe longer term we should solve this with a protocol type "ethernet"
so that one can send/receive ethernet frames through a normal socket.

Yes. AF_LINK or any similar.


2/3) Virtual interfaces (laggs/vlans over lagg and other similar 
constructions).
Currently we simply use rlock to make s/ix0/lagg0/ and, what is much 
more funny - we use complex vlan_hash with another rlock to
get vlan interface from underlying one.

This is definitely not like things should be done and this can be 
changed more or less easily.


Indeed.

There are some useful terms/techniques in world of software/hardware 
routing: they have clear 'control plane' and 'data plane' separation.
Former one is for dealing control traffic (IGP, MLD, IGMP snooping, 
lagg hellos, ARP/NDP, etc..) and
some data traffic (packets with TTL=1, with options, destined to 
hosts without ARP/NDP record, and
similar). Latter one is done in hardware (or effective software 
implementation).
Control plane is responsible to provide data for efficient data plane 
operations. This is the point we are missing nearly everywhere.


ACK.

What I want to say is: lagg is pure control-plane stuff and vlan is 
nearly the same. We can't apply
this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
but we definitely can do this for most common setups like (igb* or 
ix* in lagg with or without vlans on top of lagg).


ACK.

We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can 
add some more. We even have per-driver hooks to program HW filtering.


We could.  Though for vlan it looks like it would be easier to remove the
hardware vlan tag stripping and insertion.  It only adds complexity in all
drivers for no gain.
No. Actually, as far as I understand it helps the driver to perform TSO. 
Anyway, IMO we should use HW capabilities if we can.
(this probably does not add much speed on 1G, but on 10/20/40G this can 
help much more).


One small step to do is to throw packet to vlan interface directly 
(P1), proof-of-concept (working in production):
http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html

Re: Network stack changes

2013-09-22 Thread Alexander V. Chernikov

On 29.08.2013 05:32, Slawa Olhovchenkov wrote:

On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:


..
while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on the 
same-class hardware and
_userland_ forwarding.

Those numbers sound a bit far out.  Maybe if the packet isn't touched
or looked at at all in a pure netmap interface to interface bridging
scenario.  I don't believe these numbers.

80*64*8 = 40.960 Gb/s
Maybe DCA? And use a CPU with 40 PCIe lanes and 4 memory channels.
Intel introduces DDIO instead of DCA: 
http://www.intel.com/content/www/us/en/io/direct-data-i-o.html

(and it seems DCA does not help much):
https://www.myricom.com/software/myri10ge/790-how-do-i-enable-intel-direct-cache-access-dca-with-the-linux-myri10ge-driver.html
https://www.myricom.com/software/myri10ge/783-how-do-i-get-the-best-performance-with-my-myri-10g-network-adapters-on-a-host-that-supports-intel-data-direct-i-o-ddio.html

(However, the DPDK paper notes DDIO is a significant help)
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-20 Thread George Neville-Neil

On Sep 19, 2013, at 16:08 , Luigi Rizzo  wrote:

> On Thu, Sep 19, 2013 at 03:54:34PM -0400, George Neville-Neil wrote:
>> 
>> On Sep 14, 2013, at 15:24 , Luigi Rizzo  wrote:
>> 
>>> 
>>> 
>>> On Saturday, September 14, 2013, Olivier Cochard-Labbé  
>>> wrote:
 On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:
> 
> IXIA ? For the timescales we need to address we don't need an IXIA,
> a netmap sender is more than enough
> 
 
 The great netmap generates only one IP flow (same src/dst IP and same
 src/dst port).
>>> 
>>> True the sample app generates only one flow but it is trivial to modify it 
>>> to generate multiple flows. My point was, we have the ability to generate 
>>> high rate traffic, as long as we do tolerate a .1-1us jitter. Beyond that, 
>>> you do need some ixia-like solution.
>>> 
>> 
>> On the bandwidth side, can a modern sender with netmap really do a full 10G? 
>>  I hate the cost of an
>> IXIA but I have not been able to destroy our stack as effectively with 
>> anything else.
> 
> yes george, you can download the picobsd image
> 
> http://info.iet.unipi.it/~luigi/netmap/20120618-netmap-picobsd-head-amd64.bin
> 
> and try for yourself.
> 
> Granted this does not have all the knobs of an ixia but it can
> surely blast the full 14.88 Mpps to the link, and it only takes a
> bit of userspace programming to generate reasonably arbitrary streams
> of packets. A netmap sender/receiver is not CPU bound even with 1 core.
> 

Interesting.  It's on my todo.

Best,
George






Re: Network stack changes

2013-09-19 Thread Luigi Rizzo
On Thu, Sep 19, 2013 at 03:54:34PM -0400, George Neville-Neil wrote:
> 
> On Sep 14, 2013, at 15:24 , Luigi Rizzo  wrote:
> 
> > 
> > 
> > On Saturday, September 14, 2013, Olivier Cochard-Labbé  
> > wrote:
> > > On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:
> > >>
> > >> IXIA ? For the timescales we need to address we don't need an IXIA,
> > >> a netmap sender is more than enough
> > >>
> > >
> > > The great netmap generates only one IP flow (same src/dst IP and same
> > > src/dst port).
> > 
> > True the sample app generates only one flow but it is trivial to modify it 
> > to generate multiple flows. My point was, we have the ability to generate 
> > high rate traffic, as long as we do tolerate a .1-1us jitter. Beyond that, 
> > you do need some ixia-like solution.
> > 
> 
> On the bandwidth side, can a modern sender with netmap really do a full 10G?  
> I hate the cost of an
> IXIA but I have not been able to destroy our stack as effectively with 
> anything else.

yes george, you can download the picobsd image

http://info.iet.unipi.it/~luigi/netmap/20120618-netmap-picobsd-head-amd64.bin

and try for yourself.

Granted this does not have all the knobs of an ixia but it can
surely blast the full 14.88 Mpps to the link, and it only takes a
bit of userspace programming to generate reasonably arbitrary streams
of packets. A netmap sender/receiver is not CPU bound even with 1 core.
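
For anyone who has not played with it yet, a minimal netmap sender really is
just a few lines around net/netmap_user.h.  The following is only a hedged
sketch (pre-built frame, single TX queue), not the pkt-gen tool that ships
with netmap; frame[]/frame_len are assumed to hold a pre-built test packet:

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>

    extern unsigned char frame[];       /* assumed: pre-built 64-byte packet */
    extern unsigned int  frame_len;

    static void
    blast(const char *port)             /* e.g. "netmap:ix0" */
    {
        struct nm_desc *d = nm_open(port, NULL, 0, NULL);
        struct pollfd pfd = { .fd = d->fd, .events = POLLOUT };

        for (;;) {
            poll(&pfd, 1, -1);                      /* kernel syncs the TX ring */
            struct netmap_ring *ring = NETMAP_TXRING(d->nifp, 0);
            while (!nm_ring_empty(ring)) {          /* fill every free slot */
                struct netmap_slot *slot = &ring->slot[ring->cur];
                nm_pkt_copy(frame, NETMAP_BUF(ring, slot->buf_idx), frame_len);
                slot->len = frame_len;
                ring->head = ring->cur = nm_ring_next(ring, ring->cur);
            }
        }
    }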

cheers
luigi


> Best,
> George


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-19 Thread George Neville-Neil

On Sep 14, 2013, at 15:24 , Luigi Rizzo  wrote:

> On Saturday, September 14, 2013, Olivier Cochard-Labbé 
> wrote:
>> On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:
>>> 
>>> IXIA ? For the timescales we need to address we don't need an IXIA,
>>> a netmap sender is more than enough
>>> 
>> 
>> The great netmap generates only one IP flow (same src/dst IP and same
>> src/dst port).
> 
> True the sample app generates only one flow but it is trivial to modify it
> to generate multiple flows. My point was, we have the ability to generate
> high rate traffic, as long as we do tolerate a .1-1us jitter. Beyond that,
> you do need some ixia-like solution.
> 
On the bandwidth side, can a modern sender with netmap really do a full 10G?  I 
hate the cost of an
IXIA but I have not been able to destroy our stack as effectively with anything 
else.

Best,
George



Re: Network stack changes

2013-09-14 Thread Luigi Rizzo
On Saturday, September 14, 2013, Olivier Cochard-Labbé 
wrote:
> On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:
>>
>> IXIA ? For the timescales we need to address we don't need an IXIA,
>> a netmap sender is more than enough
>>
>
> The great netmap generates only one IP flow (same src/dst IP and same
> src/dst port).

True the sample app generates only one flow but it is trivial to modify it
to generate multiple flows. My point was, we have the ability to generate
high rate traffic, as long as we do tolerate a .1-1us jitter. Beyond that,
you do need some ixia-like solution.

Cheers
Luigi


> This doesn't permit testing a multi-queue NIC (or SMP packet-filter) on a
> simple lab like this:
> netmap sender => freebsd router => netmap receiver
>
> Regards,
>
> Olivier
>

-- 
-+---
 Prof. Luigi RIZZO, ri...@iet.unipi.it  . Dip. di Ing. dell'Informazione
 http://www.iet.unipi.it/~luigi/. Universita` di Pisa
 TEL  +39-050-2211611   . via Diotisalvi 2
 Mobile   +39-338-6809875   . 56122 PISA (Italy)
-+---
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Olivier Cochard-Labbé
On Sat, Sep 14, 2013 at 4:28 PM, Luigi Rizzo  wrote:
>
> IXIA ? For the timescales we need to address we don't need an IXIA,
> a netmap sender is more than enough
>

The great netmap generates only one IP flow (same src/dst IP and same
src/dst port).
This doesn't permit testing a multi-queue NIC (or SMP packet-filter) on a
simple lab like this:
netmap sender => freebsd router => netmap receiver

Regards,

Olivier
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Luigi Rizzo
On Fri, Sep 13, 2013 at 11:08:27AM -0400, George Neville-Neil wrote:
> 
> On Aug 29, 2013, at 7:49 , Adrian Chadd  wrote:
...
> One quick note here.  Every time you increase batching you may increase 
> bandwidth
> but you will also increase per packet latency for the last packet in a batch.

The ones who suffer are the first ones, because their processing
is somewhat delayed to 1) let the input batch build up, and 2) complete
processing of the batch before pushing results to the next stage.

However one should never wait for an input batch to grow; you process
whatever your source gives you (one or more packets)
by the time you are ready (and if you are slow/overloaded, of course
you will get a large backlog at once). Either way, there is no
reason to create additional delay on input.
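
To make that concrete, here is a small sketch of an RX loop that batches
opportunistically: it blocks only until something is available, then drains
whatever has already arrived, so batching never adds input delay.  The
rx_fd/rx_next()/process_batch() names are hypothetical placeholders, not an
existing API:

    #include <poll.h>
    #include <stddef.h>

    struct pkt;                              /* opaque packet handle (assumed) */
    extern int  rx_fd;                       /* assumed: fd that signals packet arrival */
    struct pkt *rx_next(void);               /* assumed: NULL when the ring is empty */
    void        process_batch(struct pkt **p, size_t n);

    void
    rx_loop(void)
    {
        struct pkt *batch[256];
        struct pollfd pfd = { .fd = rx_fd, .events = POLLIN };

        for (;;) {
            poll(&pfd, 1, -1);               /* block only until something arrives */

            size_t n = 0;
            struct pkt *p;
            /* Drain what is already there; never wait for the batch to grow. */
            while (n < 256 && (p = rx_next()) != NULL)
                batch[n++] = p;

            if (n > 0)
                process_batch(batch, n);     /* locks/lookups amortized over n */
        }
    }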

cheers
luigi
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Luigi Rizzo
On Fri, Sep 13, 2013 at 11:08:27AM -0400, George Neville-Neil wrote:
> 
> On Aug 29, 2013, at 7:49 , Adrian Chadd  wrote:
...
> > I still have some tool coding to do with PMC before I even think about
> > tinkering with this as I'd like to measure stuff like per-packet latency as
> > well as top-level processing overhead (ie, CPU_CLK_UNHALTED.THREAD_P /
> > lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.)
> > 
> 
> This would be very useful in identifying the actual hot spots, and would be 
> helpful
> to anyone who can generate a decent stream of packets with, say, an IXIA.

IXIA ? For the timescales we need to address we don't need an IXIA,
a netmap sender is more than enough

cheers
luigi
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Adrian Chadd
On 13 September 2013 15:43, Rick Macklem  wrote:


> And any time you increase latency, that will have a negative impact on
> NFS performance. NFS RPCs are usually small messages (except Write requests
> and Read replies) and the RTT for these (mostly small, bidirectional)
> messages can have a significant impact on NFS perf.
>

Hi,

the penalty of going to main memory quite a few times each time we process a
frame is expensive. If we can get some better behaviour through batching
leading to more efficient cache usage, it may not actually _have_ a delay.

But, that requires a whole lot of design stars to align. And I'm still knee
deep elsewhere, so I haven't really finished getting up to speed with what
everyone else has done / said about it..



-adrian
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Rick Macklem
Sam Fourman Jr. wrote:
> >
> 
> > And any time you increase latency, that will have a negative impact
> > on
> > NFS performance. NFS RPCs are usually small messages (except Write
> > requests
> > and Read replies) and the RTT for these (mostly small,
> > bidirectional)
> > messages can have a significant impact on NFS perf.
> >
> > rick
> >
> >
> this may be a bit off topic but not much... I have wondered with all
> of the
> new
> tcp algorithms
> http://freebsdfoundation.blogspot.com/2011/03/summary-of-five-new-tcp-congestion.html
> 
> what algorithm is best suited for NFS over gigabit Ethernet, say
> FreeBSD to
> FreeBSD.
> and further more would a NFS optimized tcp algorithm be useful?
> 
I have no idea what effect they might have. NFS traffic is quite different than
streaming or bulk data transfer. I think this might make a nice research
project for someone.

rick

> Sam Fourman Jr.
> 
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-14 Thread Sam Fourman Jr.
>

> And any time you increase latency, that will have a negative impact on
> NFS performance. NFS RPCs are usually small messages (except Write requests
> and Read replies) and the RTT for these (mostly small, bidirectional)
> messages can have a significant impact on NFS perf.
>
> rick
>
>
this may be a bit off topic but not much... I have wondered with all of the
new
tcp algorithms
http://freebsdfoundation.blogspot.com/2011/03/summary-of-five-new-tcp-congestion.html

what algorithm is best suited for NFS over gigabit Ethernet, say FreeBSD to
FreeBSD.
And furthermore, would an NFS-optimized TCP algorithm be useful?

Sam Fourman Jr.
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-13 Thread Rick Macklem
George Neville-Neil wrote:
> 
> On Aug 29, 2013, at 7:49 , Adrian Chadd  wrote:
> 
> > Hi,
> > 
> > There's a lot of good stuff to review here, thanks!
> > 
> > Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless
> > to keep
> > locking things like that on a per-packet basis. We should be able
> > to do
> > this in a cleaner way - we can defer RX into a CPU pinned taskqueue
> > and
> > convert the interrupt handler to a fast handler that just schedules
> > that
> > taskqueue. We can ignore the ithread entirely here.
> > 
> > What do you think?
> > 
> > Totally pie in the sky handwaving at this point:
> > 
> > * create an array of mbuf pointers for completed mbufs;
> > * populate the mbuf array;
> > * pass the array up to ether_demux().
> > 
> > For vlan handling, it may end up populating its own list of mbufs
> > to push
> > up to ether_demux(). So maybe we should extend the API to have a
> > bitmap of
> > packets to actually handle from the array, so we can pass up a
> > larger array
> > of mbufs, note which ones are for the destination and then the
> > upcall can
> > mark which frames it's consumed.
> > 
> > I specifically wonder how much work/benefit we may see by doing:
> > 
> > * batching packets into lists so various steps can batch process
> > things
> > rather than run to completion;
> > * batching the processing of a list of frames under a single lock
> > instance
> > - eg, if the forwarding code could do the forwarding lookup for 'n'
> > packets
> > under a single lock, then pass that list of frames up to
> > inet_pfil_hook()
> > to do the work under one lock, etc, etc.
> > 
> > Here, the processing would look less like "grab lock and process to
> > completion" and more like "mark and sweep" - ie, we have a list of
> > frames
> > that we mark as needing processing and mark as having been
> > processed at
> > each layer, so we know where to next dispatch them.
> > 
> 
> One quick note here.  Every time you increase batching you may
> increase bandwidth
> but you will also increase per packet latency for the last packet in
> a batch.
> That is fine so long as we remember that and that this is a tuning
> knob
> to balance the two.
> 
And any time you increase latency, that will have a negative impact on
NFS performance. NFS RPCs are usually small messages (except Write requests
and Read replies) and the RTT for these (mostly small, bidirectional)
messages can have a significant impact on NFS perf.

rick

> > I still have some tool coding to do with PMC before I even think
> > about
> > tinkering with this as I'd like to measure stuff like per-packet
> > latency as
> > well as top-level processing overhead (ie,
> > CPU_CLK_UNHALTED.THREAD_P /
> > lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core,
> > etc.)
> > 
> 
> This would be very useful in identifying the actual hot spots, and
> would be helpful
> to anyone who can generate a decent stream of packets with, say, an
> IXIA.
> 
> Best,
> George
> 
> 
> 
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-09-13 Thread George Neville-Neil

On Aug 29, 2013, at 7:49 , Adrian Chadd  wrote:

> Hi,
> 
> There's a lot of good stuff to review here, thanks!
> 
> Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to keep
> locking things like that on a per-packet basis. We should be able to do
> this in a cleaner way - we can defer RX into a CPU pinned taskqueue and
> convert the interrupt handler to a fast handler that just schedules that
> taskqueue. We can ignore the ithread entirely here.
> 
> What do you think?
> 
> Totally pie in the sky handwaving at this point:
> 
> * create an array of mbuf pointers for completed mbufs;
> * populate the mbuf array;
> * pass the array up to ether_demux().
> 
> For vlan handling, it may end up populating its own list of mbufs to push
> up to ether_demux(). So maybe we should extend the API to have a bitmap of
> packets to actually handle from the array, so we can pass up a larger array
> of mbufs, note which ones are for the destination and then the upcall can
> mark which frames it's consumed.
> 
> I specifically wonder how much work/benefit we may see by doing:
> 
> * batching packets into lists so various steps can batch process things
> rather than run to completion;
> * batching the processing of a list of frames under a single lock instance
> - eg, if the forwarding code could do the forwarding lookup for 'n' packets
> under a single lock, then pass that list of frames up to inet_pfil_hook()
> to do the work under one lock, etc, etc.
> 
> Here, the processing would look less like "grab lock and process to
> completion" and more like "mark and sweep" - ie, we have a list of frames
> that we mark as needing processing and mark as having been processed at
> each layer, so we know where to next dispatch them.
> 

One quick note here.  Every time you increase batching you may increase 
bandwidth
but you will also increase per packet latency for the last packet in a batch.
That is fine so long as we remember that and that this is a tuning knob
to balance the two.

> I still have some tool coding to do with PMC before I even think about
> tinkering with this as I'd like to measure stuff like per-packet latency as
> well as top-level processing overhead (ie, CPU_CLK_UNHALTED.THREAD_P /
> lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.)
> 

This would be very useful in identifying the actual hot spots, and would be 
helpful
to anyone who can generate a decent stream of packets with, say, an IXIA.

Best,
George






Re: Network stack changes

2013-08-29 Thread Adrian Chadd
Hi,

There's a lot of good stuff to review here, thanks!

Yes, the ixgbe RX lock needs to die in a fire. It's kinda pointless to keep
locking things like that on a per-packet basis. We should be able to do
this in a cleaner way - we can defer RX into a CPU pinned taskqueue and
convert the interrupt handler to a fast handler that just schedules that
taskqueue. We can ignore the ithread entirely here.

What do you think?

Totally pie in the sky handwaving at this point:

* create an array of mbuf pointers for completed mbufs;
* populate the mbuf array;
* pass the array up to ether_demux().

For vlan handling, it may end up populating its own list of mbufs to push
up to ether_demux(). So maybe we should extend the API to have a bitmap of
packets to actually handle from the array, so we can pass up a larger array
of mbufs, note which ones are for the destination and then the upcall can
mark which frames it's consumed.
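
As a rough illustration of that API shape (a sketch only; ether_demux_batch()
and the field names are hypothetical, not an existing interface), the handoff
could look something like this:

    #include <sys/param.h>
    #include <sys/bitstring.h>
    #include <sys/mbuf.h>

    #define RX_BATCH_MAX    64

    struct ifnet;

    /* Hypothetical batch descriptor handed from the driver to the ether layer. */
    struct rx_batch {
        struct mbuf *rb_pkts[RX_BATCH_MAX];                 /* completed RX mbufs */
        bitstr_t     bit_decl(rb_todo, RX_BATCH_MAX);       /* frames still to handle */
        int          rb_count;
    };

    /*
     * Hypothetical upcall: each layer walks rb_todo, consumes the frames it
     * owns (clearing their bits), and leaves the rest marked for the next
     * consumer.
     */
    void ether_demux_batch(struct ifnet *ifp, struct rx_batch *rb);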

I specifically wonder how much work/benefit we may see by doing:

* batching packets into lists so various steps can batch process things
rather than run to completion;
* batching the processing of a list of frames under a single lock instance
- eg, if the forwarding code could do the forwarding lookup for 'n' packets
under a single lock, then pass that list of frames up to inet_pfil_hook()
to do the work under one lock, etc, etc.

Here, the processing would look less like "grab lock and process to
completion" and more like "mark and sweep" - ie, we have a list of frames
that we mark as needing processing and mark as having been processed at
each layer, so we know where to next dispatch them.
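
And a similarly rough sketch of the "n lookups under one lock" idea, reusing
the rx_batch structure sketched above; rt_table_rlock()/rt_lookup_locked()/
pkt_dst() are hypothetical stand-ins for whatever the real locking and lookup
primitives would be:

    struct nexthop;                              /* hypothetical lookup result */
    struct nexthop *rt_lookup_locked(uint32_t dst);
    uint32_t        pkt_dst(struct mbuf *m);     /* hypothetical: extract IPv4 daddr */
    void            rt_table_rlock(void);
    void            rt_table_runlock(void);

    static void
    forward_batch(struct rx_batch *rb, struct nexthop **nh)
    {
        int i;

        /* One lock acquisition amortized over the whole batch. */
        rt_table_rlock();
        for (i = 0; i < rb->rb_count; i++) {
            if (!bit_test(rb->rb_todo, i))
                continue;                        /* already consumed by an earlier layer */
            nh[i] = rt_lookup_locked(pkt_dst(rb->rb_pkts[i]));
        }
        rt_table_runlock();
        /* The list then moves on to inet_pfil_hook() etc., again once per batch. */
    }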

I still have some tool coding to do with PMC before I even think about
tinkering with this as I'd like to measure stuff like per-packet latency as
well as top-level processing overhead (ie, CPU_CLK_UNHALTED.THREAD_P /
lagg0 TX bytes/pkts, RX bytes/pkts, NIC interrupts on that core, etc.)

Thanks,



-adrian
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-08-28 Thread Bryan Venteicher


- Original Message -
> On 28.08.2013 20:30, Alexander V. Chernikov wrote:
> > Hello list!
> 
> Hello Alexander,
> 
> you sent quite a few things in the same email.  I'll try to respond
> as much as I can right now.  Later you should split it up to have
> more in-depth discussions on the individual parts.
> 

> 
> > We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add
> > some more. We even have
> > per-driver hooks to program HW filtering.
> 
> We could.  Though for vlan it looks like it would be easier to remove the
> hardware vlan tag stripping and insertion.  It only adds complexity in all
> drivers for no gain.
> 

In the shorter term, can we remove the requirement for the parent
interface to support IFCAP_VLAN_HWTAGGING in order to do checksum
offloading on the VLAN interface (see vlan_capabilities())?

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-08-28 Thread Slawa Olhovchenkov
On Thu, Aug 29, 2013 at 12:24:48AM +0200, Andre Oppermann wrote:

> > ..
> > while Intel DPDK claims 80MPPS (and 6windgate talks about 160 or so) on the 
> > same-class hardware and
> > _userland_ forwarding.
> 
> Those numbers sound a bit far out.  Maybe if the packet isn't touched
> or looked at at all in a pure netmap interface to interface bridging
> scenario.  I don't believe these numbers.

80*64*8 = 40.960 Gb/s
Maybe DCA? And use a CPU with 40 PCIe lanes and 4 memory channels.
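
A worked version of that arithmetic, for reference (added here, using the
standard Ethernet framing overhead of 20 bytes preamble/SFD/IFG per frame):

    80e6 pps * 64 B * 8        = 40.96 Gb/s of frame data
    80e6 pps * (64 + 20) B * 8 = 53.76 Gb/s on the wire
    10e9 b/s / (84 B * 8)      = 14.88 Mpps per 10G port at minimum frame size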
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"


Re: Network stack changes

2013-08-28 Thread Andre Oppermann

On 28.08.2013 20:30, Alexander V. Chernikov wrote:

Hello list!


Hello Alexander,

you sent quite a few things in the same email.  I'll try to respond
as much as I can right now.  Later you should split it up to have
more in-depth discussions on the individual parts.

If you could make it to the EuroBSDcon 2013 DevSummit that would be
even more awesome.  Most of the active network stack people will be
there too.


There are a lot of constantly arising discussions related to networking stack 
performance/changes.

I'll try to summarize current problems and possible solutions from my point of 
view.
(Generally this is one problem: stack is slooow, but we 
need to know why and
what to do).


Compared to others it's not thaaat slow. ;)


Let's start with current IPv4 packet flow on a typical router:
http://static.ipfw.ru/images/freebsd_ipv4_flow.png

(I'm sorry I can't provide this as text since Visio doesn't have any 'ascii-art' 
exporter).

Note that we are using process-to-completion model, e.g. process any packet in 
ISR until it is either
consumed by L4+ stack or dropped or put to egress NIC queue.

(There is also deferred ISR model implemented inside netisr but it does not 
change much:
it can help to do more fine-grained hashing (for GRE or other similar traffic), 
but
1) it uses per-packet mutex locking which kills all performance
2) it currently does not have _any_ hashing functions (see absence of flags in 
`netstat -Q`)
People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or modified 
PPPoe/GRE version)
report some profit, but without fixing (1) it can't help much
)

So, let's start:

1) Ixgbe uses mutex to protect each RX ring which is perfectly fine since there 
is nearly no contention
(the only thing that can happen is driver reconfiguration which is rare and, 
more significant, we
do this once
for the batch of packets received in given interrupt). However, due to some 
(im)possible deadlocks
current code
does per-packet ring unlock/lock (see ixgbe_rx_input()).
There was a discussion ended with nothing:
http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
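
As an illustration of the batched alternative being argued for here (a sketch
only: IXGBE_RX_LOCK/IXGBE_RX_UNLOCK are the driver's existing ring-lock macros,
rxr_next_completed() is a made-up helper standing in for the descriptor walk,
and this is not the actual ixgbe code):

    struct mbuf *rxr_next_completed(struct rx_ring *rxr);   /* hypothetical */

    /*
     * Sketch: drain the RX ring once per interrupt while holding the ring
     * lock, then hand the collected mbufs to the stack with no ring lock held.
     */
    static void
    ixgbe_rxeof_batched(struct rx_ring *rxr, struct ifnet *ifp)
    {
        struct mbuf *batch[32];          /* batch size picked arbitrarily */
        int i, n = 0;

        IXGBE_RX_LOCK(rxr);
        while (n < 32 && (batch[n] = rxr_next_completed(rxr)) != NULL)
            n++;
        IXGBE_RX_UNLOCK(rxr);

        for (i = 0; i < n; i++)
            (*ifp->if_input)(ifp, batch[i]);   /* no per-packet unlock/lock */
    }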

1*) Possible BPF users. Here we have one rlock if there are any readers present
(and mutex for any matching packets, but this is more or less OK. Additionally, 
there is WIP to
implement multiqueue BPF
and there is chance that we can reduce lock contention there).


Rlock to rmlock?


There is also an "optimize_writers" hack permitting applications
like CDP to use BPF as writers but not registering them as receivers (which 
implies rlock)


I believe longer term we should solve this with a protocol type "ethernet"
so that one can send/receive ethernet frames through a normal socket.


2/3) Virtual interfaces (laggs/vlans over lagg and other similar constructions).
Currently we simply use rlock to make s/ix0/lagg0/ and, what is much more funny 
- we use complex
vlan_hash with another rlock to
get vlan interface from underlying one.

This is definitely not like things should be done and this can be changed more 
or less easily.


Indeed.


There are some useful terms/techniques in world of software/hardware routing: 
they have clear
'control plane' and 'data plane' separation.
Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg 
hellos, ARP/NDP, etc..) and
some data traffic (packets with TTL=1, with options, destined to hosts without 
ARP/NDP record, and
similar). Latter one is done in hardware (or effective software implementation).
Control plane is responsible to provide data for efficient data plane 
operations. This is the point
we are missing nearly everywhere.


ACK.


What I want to say is: lagg is pure control-plane stuff and vlan is nearly the 
same. We can't apply
this approach to complex cases like 
lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
but we definitely can do this for most common setups like (igb* or ix* in lagg 
with or without vlans
on top of lagg).


ACK.


We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add some 
more. We even have
per-driver hooks to program HW filtering.


We could.  Though for vlan it looks like it would be easier to remove the
hardware vlan tag stripping and insertion.  It only adds complexity in all
drivers for no gain.


One small step to do is to throw packet to vlan interface directly (P1), 
proof-of-concept(working in
production):
http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
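
A very rough sketch of what such a direct dispatch can look like (illustration
only; vlan_ifp_for_tag() is a hypothetical per-parent lookup maintained from
the control plane, not the interface used in the patch above):

    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/if_var.h>
    #include <net/ethernet.h>

    /* Hypothetical per-parent table filled in on vlan create/destroy. */
    struct ifnet *vlan_ifp_for_tag(struct ifnet *parent, uint16_t tag);

    static void
    rx_dispatch(struct ifnet *parent, struct mbuf *m)
    {
        struct ifnet *vifp;

        if ((m->m_flags & M_VLANTAG) != 0 &&
            (vifp = vlan_ifp_for_tag(parent,
                EVL_VLANOFTAG(m->m_pkthdr.ether_vtag))) != NULL) {
            m->m_flags &= ~M_VLANTAG;
            m->m_pkthdr.rcvif = vifp;
            (*vifp->if_input)(vifp, m);   /* straight to the vlan ifnet */
            return;
        }
        (*parent->if_input)(parent, m);       /* normal ether_demux() path */
    }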

Another is to change lagg packet accounting:
http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
Again, this is more like HW boxes do (aggregate all counters including errors) 
(and I can't imagine
what real error we can get from _lagg_).

>

4) If we are router, we can do either slooow ip_input() -> ip_forward() -> 
ip_output() cycle or use
optimized ip_fastfwd() which falls back to 'slow' path for 
multicast/options/local traffic (e.g.
works exactly like 'data plane' part).

Re: Network stack changes

2013-08-28 Thread Jack Vogel
Very interesting material Alexander, only had time to glance at it now,
will look in more
depth later, thanks!

Jack



On Wed, Aug 28, 2013 at 11:30 AM, Alexander V. Chernikov <
melif...@yandex-team.ru> wrote:

> Hello list!
>
> There are a lot of constantly arising discussions related to networking stack
> performance/changes.
>
> I'll try to summarize current problems and possible solutions from my
> point of view.
> (Generally this is one problem: stack is slooow,
> but we need to know why and what to do).
>
> Let's start with current IPv4 packet flow on a typical router:
> http://static.ipfw.ru/images/freebsd_ipv4_flow.png
>
> (I'm sorry I can't provide this as text since Visio doesn't have any
> 'ascii-art' exporter).
>
> Note that we are using process-to-completion model, e.g. process any
> packet in ISR until it is either
> consumed by L4+ stack or dropped or put to egress NIC queue.
>
> (There is also deferred ISR model implemented inside netisr but it does
> not change much:
> it can help to do more fine-grained hashing (for GRE or other similar
> traffic), but
> 1) it uses per-packet mutex locking which kills all performance
> 2) it currently does not have _any_ hashing functions (see absence of
> flags in `netstat -Q`)
> People using http://static.ipfw.ru/patches/netisr_ip_flowid.diff (or
> modified PPPoe/GRE version)
> report some profit, but without fixing (1) it can't help much
> )
>
> So, let's start:
>
> 1) Ixgbe uses mutex to protect each RX ring which is perfectly fine since
> there is nearly no contention
> (the only thing that can happen is driver reconfiguration which is rare
> and, more significant, we do this once
> for the batch of packets received in given interrupt). However, due to
> some (im)possible deadlocks current code
> does per-packet ring unlock/lock (see ixgbe_rx_input()).
> There was a discussion ended with nothing:
> http://lists.freebsd.org/pipermail/freebsd-net/2012-October/033520.html
>
> 1*) Possible BPF users. Here we have one rlock if there are any readers
> present
> (and mutex for any matching packets, but this is more or less OK.
> Additionally, there is WIP to implement multiqueue BPF
> and there is chance that we can reduce lock contention there). There is
> also an "optimize_writers" hack permitting applications
> like CDP to use BPF as writers but not registering them as receivers
> (which implies rlock)
>
> 2/3) Virtual interfaces (laggs/vlans over lagg and other similar
> constructions).
> Currently we simply use rlock to make s/ix0/lagg0/ and, what is much more
> funny - we use complex vlan_hash with another rlock to
> get vlan interface from underlying one.
>
> This is definitely not like things should be done and this can be changed
> more or less easily.
>
> There are some useful terms/techniques in world of software/hardware
> routing: they have clear 'control plane' and 'data plane' separation.
> Former one is for dealing control traffic (IGP, MLD, IGMP snooping, lagg
> hellos, ARP/NDP, etc..) and some data traffic (packets with TTL=1, with
> options, destined to hosts without ARP/NDP record, and similar). Latter one
> is done in hardware (or effective software implementation).
> Control plane is responsible to provide data for efficient data plane
> operations. This is the point we are missing nearly everywhere.
>
> What I want to say is: lagg is pure control-plane stuff and vlan is nearly
> the same. We can't apply this approach to complex cases like
> lagg-over-vlans-over-vlans-over-(pppoe_ng0-and_wifi0)
> but we definitely can do this for most common setups like (igb* or ix* in
> lagg with or without vlans on top of lagg).
>
> We already have some capabilities like VLANHWFILTER/VLANHWTAG, we can add
> some more. We even have per-driver hooks to program HW filtering.
>
> One small step to do is to throw packet to vlan interface directly (P1),
> proof-of-concept(working in production):
> http://lists.freebsd.org/pipermail/freebsd-net/2013-April/035270.html
>
> Another is to change lagg packet accounting:
> http://lists.freebsd.org/pipermail/svn-src-all/2013-April/067570.html
> Again, this is more like HW boxes do (aggregate all counters including
> errors) (and I can't imagine what real error we can get from _lagg_).
>
> 4) If we are router, we can do either slooow ip_input() -> ip_forward() ->
> ip_output() cycle or use optimized ip_fastfwd() which falls back to 'slow'
> path for multicast/options/local traffic (e.g. works exactly like 'data
> plane' part).
> (Btw, we can consider net.inet.ip.fastforwarding to be turned on by
> default at least for non-IPSEC kernels)
>
> Here we have to