Re: Some performance measurements on the FreeBSD network stack

2012-05-03 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 03:30:18PM +0200, Luigi Rizzo wrote:
 I have been running some performance tests on UDP sockets,
 using the netsend program in tools/tools/netrate/netsend
 and instrumenting the source code and the kernel to return at
 various points of the path. Here are some results which
 I hope you find interesting.
...

I have summarized the info on this thread in the camera ready
version of an upcoming Usenix paper, which you can find here:

http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf

cheers
luigi


RE: Some performance measurements on the FreeBSD network stack

2012-04-25 Thread Maxim Konovalov
Hi,

On Tue, 24 Apr 2012, 17:40-, Li, Qing wrote:

 Yup, all good points. In fact we have considered all of these while doing
 the work. In case you haven't seen it already, we did write about these
 issues in our paper and how we tried to address those, flow-table was one
 of the solutions.

 http://dl.acm.org/citation.cfm?id=1592641

Is this article available for those without ACM subscription?

-- 
Maxim Konovalov


Re: Some performance measurements on the FreeBSD network stack

2012-04-25 Thread Slawa Olhovchenkov
On Wed, Apr 25, 2012 at 01:22:06PM +0400, Maxim Konovalov wrote:

 Hi,
 
 On Tue, 24 Apr 2012, 17:40-, Li, Qing wrote:
 
  Yup, all good points. In fact we have considered all of these while doing
  the work. In case you haven't seen it already, we did write about these
  issues in our paper and how we tried to address those, flow-table was one
  of the solutions.
 
  http://dl.acm.org/citation.cfm?id=1592641
 
 Is this article available for those without ACM subscription?

Tip: paste the citation from the abstract into Google.
3rd link:
http://conferences.sigcomm.org/sigcomm/2009/workshops/presto/papers/p37.pdf



Re: Some performance measurements on the FreeBSD network stack

2012-04-25 Thread K. Macy
 Because there were leaks, there were 100% panics for IPv6, ... at least on
 the version I had seen in autumn last year.

 There is certainly no one more interested than me in these, esp. for v6
 where the removal of route caching a long time ago made nd6_nud_hint() a NOP
 with dst and rt being passed down as NULL only, and where we are doing up to
 three route lookups in the output path if no cached rt is passed down along
 from the ULP.

 If there is an updated patch, I'd love to see it.

Ok, I'm following up as this seems to be getting some interest. This is
the relevant part of the last mail that I received from you. The final
part was dedicated to the narrow potential ABI changes that
were to make it into the release.

From: Bjoern A. Zeeb b...@freebsd.org
Date: Mon, Sep 19, 2011 at 3:19 PM
To: K. Macy km...@freebsd.org
Cc: Robert Watson rwat...@freebsd.org, rysto32 ryst...@gmail.com,
Qing Li qin...@freebsd.org

Sorry it's taking me so long while I was travelling but also now being
back home again.
I would yet have to find a code path through IPv6 that will a) not
panic on INVARIANTS
and b) actually update the inp_lle cache.

Once I stop finding the next hiccup going one step deeper into the
stack (and I made
it to if_ethersubr.c) I'll get to legacy IP and the beef, and I hope
that all of you will have reviewed and tested that thoroughly.

Checking whether a similar problem would exist in v4 I however found a possible
lle reference leak in the legacy IP path as well.

There's also a missed place where we do not update the generation counter
(even though it is a kind of pointless place, it is still to be done for completeness).

I am also pondering why we are not always invalidating the ro_lle cache (when we
update the ro_rt entry in the callgraph after tcp_output).  I wonder if we can
provoke strange results, say by changing the default route from something connected
on interface 1 to interface 2.

...

/bz

--
Bjoern A. Zeeb You have to have visions!
Stop bit received. Insert coin for new address family.

===

The only comment in here which was sufficiently specific to actually
take action on was: pondering why we are not always invalidating the
ro_lle cache (when we update the ro_rt entry in the callgraph after
tcp_output). That was subsequently addressed by ensuring that the
LLE_VALID flag was actually meaningful, clearing it when the llentry
is removed from the interface's hash table, in an unrelated commit made
because of weird behaviour observed with the flow.

a) Where is the possible leak in the legacy path?

b) Where were the panics in v6?

In light of the fact that I don't or at least didn't have any means of
testing v6 (I can probably get a testbed set up at iX now) and the
netinet6 specific portions of the patch consist of 4 lines of code
which should really be entrusted to you given that your performance
parity work for v6 is actively being funded, it was clearly a mistake
to tie the fate of the patch as a whole to those narrow bits.

Once I get a response to a) and b) I'll follow up with a patch against
head. I'm sure whatever I had has bitrotted somewhat in the meantime.

Thanks for your help,
Kip


Re: Some performance measurements on the FreeBSD network stack

2012-04-25 Thread Bjoern A. Zeeb
On 25. Apr 2012, at 15:45 , K. Macy wrote:

 a) Where is the possible leak in the legacy path?

It's been somewhere in ip_output() in one of the possible combinations
going through the code flow.  I'd probably need to apply a patch to a tree
to get there again.  It's been more than 6 months for me as well.  I think
it was related to the flowtable path, but I could be misremembering completely.


 b) Where were the panics in v6?

Again completely quoting from memory.
I think the problem was that the INVARIANTS check in what's currently
nd6_output_lle() was hit given both the rtentry and llentry were passed in
but no *chain.  Fixing this seems trivial even when trying to keep the
current invariants checked.
However the bigger problem then was that the cached value was never updated
as the *ro passed down had been lost on the way.  Whatever came after that
is, again, off the top of my head without the patch in front of me.

Btw. you don't need more than two machines connected, virtual or not, or at worst
two vnet instances on a lab machine, to enable and do IPv6. No need for
global connectivity at all, just as it would not be required for IPv4 either.


If you can get the patch updated to apply to a modern HEAD and compile (even
if as-is) I'll try to help solve those, to the best of my (though limited)
availability, to get that thing in.


/bz

-- 
Bjoern A. Zeeb You have to have visions!
   It does not matter how good you are. It matters what good you do!



Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Andre Oppermann

On 19.04.2012 22:46, Luigi Rizzo wrote:

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:

On 19.04.2012 15:30, Luigi Rizzo wrote:

I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.
- another big bottleneck is the route lookup in ip_output()
   (between entries 51 and 56). Not only it eats another
   100ns+ on an empty routing table, but it also
   causes huge contentions when multiple cores
   are involved.


This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?


I've completed the updating of the routing table rmlock patch.  There
are two steps.  Step one is just changing the rwlock to an rmlock.
Step two streamlines the route lookup in ip_output and ip_fastfwd by
copying out the relevant data while only holding the rmlock instead
of obtaining a reference to the route.

Would be very interesting to see how your benchmark/profiling changes
with these patches applied.
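
For readers who have not looked at the patch, a rough sketch of the
copy-out idea follows.  The rt_info structure and the rt_table_*
lock/lookup helpers are invented names for illustration only; the real
code is in the changesets below.

    /*
     * Sketch: look up a route under the read side of an rmlock and copy
     * out what ip_output() needs, instead of returning a locked,
     * refcounted rtentry.
     */
    struct rt_info {
            struct sockaddr_storage ri_gateway;     /* next hop */
            struct ifnet            *ri_ifp;        /* outgoing interface */
            u_long                  ri_mtu;
            int                     ri_flags;
    };

    static int
    rtlookup_sketch(struct sockaddr *dst, u_int fibnum, struct rt_info *ri)
    {
            struct rtentry *rt;
            int error = 0;

            rt_table_rlock(fibnum);            /* rmlock read side: cheap, shared */
            rt = rt_table_match(fibnum, dst);  /* radix lookup, no refcount, no RT_LOCK */
            if (rt == NULL) {
                    error = EHOSTUNREACH;
            } else {
                    /* copy out only the relevant data ... */
                    bcopy(rt->rt_gateway, &ri->ri_gateway, rt->rt_gateway->sa_len);
                    ri->ri_ifp   = rt->rt_ifp;
                    ri->ri_mtu   = rt->rt_rmx.rmx_mtu;
                    ri->ri_flags = rt->rt_flags;
            }
            rt_table_runlock(fibnum);          /* ... and drop the lock before using it */
            return (error);
    }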

http://svn.freebsd.org/changeset/base/234649
Log:
  Change the radix head lock to an rmlock (read mostly lock).

  There is some header pollution going on because rmlock's are
  not entirely abstracted and need per-CPU structures.

  A comment in _rmlock.h says this can be hidden if there were
  per-cpu linker magic/support.  I don't know if we have that
  already.

http://svn.freebsd.org/changeset/base/234650
Log:
  Add a function rtlookup() that copies out the relevant information
  from an rtentry instead of returning the rtentry.  This avoids the
  need to lock the rtentry and to increase the refcount on it.

  Convert ip_output() to use rtlookup() in a simplistic way.  Certain
  seldom used functionality may not work anymore and the flowtable
  isn't available at the moment.

  Convert ip_fastfwd() to use rtlookup().

  This code is meant to be used for profiling and to be experimented
  with further to determine which locking strategy returns the best
  results.

Make sure to apply this one as well:
http://svn.freebsd.org/changeset/base/234648
Log:
  Add INVARIANT and WITNESS support to rm_lock locks and optimize the
  synchronization path by replacing a LIST of active readers with a
  TAILQ.

  Obtained from:Isilon
  Submitted by: mlaier

--
Andre


Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Luigi Rizzo
On Tue, Apr 24, 2012 at 03:16:48PM +0200, Andre Oppermann wrote:
 On 19.04.2012 22:46, Luigi Rizzo wrote:
 On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
 On 19.04.2012 15:30, Luigi Rizzo wrote:
 I have been running some performance tests on UDP sockets,
 using the netsend program in tools/tools/netrate/netsend
 and instrumenting the source code and the kernel to return at
 various points of the path. Here are some results which
 I hope you find interesting.
 - another big bottleneck is the route lookup in ip_output()
(between entries 51 and 56). Not only it eats another
100ns+ on an empty routing table, but it also
causes huge contentions when multiple cores
are involved.
 
 This is indeed a big problem.  I'm working (rough edges remain) on
 changing the routing table locking to an rmlock (read-mostly) which
 
 i was wondering, is there a way (and/or any advantage) to use the
 fastforward code to look up the route for locally sourced packets ?
 
 I've completed the updating of the routing table rmlock patch.  There
 are two steps.  Step one is just changing the rwlock to an rmlock.
 Step two streamlines the route lookup in ip_output and ip_fastfwd by
 copying out the relevant data while only holding the rmlock instead
 of obtaining a reference to the route.
 
 Would be very interesting to see how your benchmark/profiling changes
 with these patches applied.

If you want to give it a try yourself, the high level benchmark is
just the 'netsend' program from tools/tools/netrate/netsend -- i
am running something like

for i in $X ; do
    netsend 10.0.0.2  18 0 5 &
done

and the cardinality of $X can be used to test contention on the
low layers (routing tables and interface/queues).

From previous tests, the difference between flowtable and
routing table was small with a single process (about 5% or 50ns
in the total packet processing time, if i remember well),
but there was a large gain with multiple concurrent processes.

Probably the change in throughput between HEAD and your
branch is all you need. The info below shows that your
gain is something around 100-200 ns depending on how good
the info that you return back is (see below).

My profiling changes were mostly aimed at charging the costs to the
various layers. With my current setting (single process i7-870 @2933
MHz+Turboboost, ixgbe, FreeBSD HEAD, FLOWTABLE enabled, UDP) i see
the following:

File            Function/description               Total  delta
                                                   (nanoseconds)

user program    sendto() system call                   8     96

uipc_syscalls.c sys_sendto                           104
uipc_syscalls.c sendit                               111
uipc_syscalls.c kern_sendit                          118
uipc_socket.c   sosend
uipc_socket.c   sosend_dgram                         146    137
                  sockbuf locking, mbuf alloc, copyin

udp_usrreq.c    udp_send                             273
udp_usrreq.c    udp_output                           273     57

ip_output.c     ip_output                            330    198
                  route lookup, ip header setup

if_ethersubr.c  ether_output                         528    162
                  MAC header lookup and construction,
                  loopback checks
if_ethersubr.c  ether_output_frame                   690

ixgbe.c         ixgbe_mq_start                       698
ixgbe.c         ixgbe_mq_start_locked                720
ixgbe.c         ixgbe_xmit                           730    220
                  mbuf mangling, device programming

--              packet on the wire                   950

Removing flowtable increases the cost in ip_output()
(obviously) but also in ether_output() (because the
route does not have a lle entry so you need to call
arpresolve on each packet). It also causes trouble
in the device driver because the mbuf does not have a
flowid set, so the ixgbe device driver puts the
packet on the queue corresponding to the current CPU.
If the process (as in my case) floats, one flow might end
up on multiple queues.
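
For context, the queue selection Luigi describes typically looks like
the sketch below; this is a simplified illustration with invented names
(example_softc, example_transmit_ring), not the actual ixgbe code.

    /*
     * Simplified multiqueue transmit entry point.  With a valid flowid
     * one flow always maps to the same ring; without it the choice
     * follows whatever CPU the sender happens to run on.
     */
    static int
    example_mq_start(struct ifnet *ifp, struct mbuf *m)
    {
            struct example_softc *sc = ifp->if_softc;   /* hypothetical softc */
            u_int i;

            if (m->m_flags & M_FLOWID)
                    i = m->m_pkthdr.flowid % sc->num_queues;
            else
                    i = curcpu % sc->num_queues;

            return (example_transmit_ring(sc, i, m));   /* enqueue on ring i */
    }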

So in revising the route lookup i believe it would be good
if we could also get at once most of the info that
ether_output() is computing again and again.

cheers
luigi


 http://svn.freebsd.org/changeset/base/234649
 Log:
   Change the radix head lock to an rmlock (read mostly lock).
 
   There is some header pollution going on because rmlock's are
   not entirely abstracted and need per-CPU structures.
 
   A comment in _rmlock.h says this can be hidden if there were
   per-cpu linker magic/support.  I don't know if we have that
   already.
 
 http://svn.freebsd.org/changeset/base/234650
 Log:
   Add a function rtlookup() that copies out the relevant information
   from an rtentry instead of returning the rtentry.  This avoids the
   need to lock the rtentry and to increase the refcount on it.
 
   Convert ip_output() to use rtlookup() in a simplistic way.  Certain
   seldom used functionality may not work anymore and the flowtable
   isn't available at the moment.
 
   

RE: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Li, Qing

From previous tests, the difference between flowtable and
routing table was small with a single process (about 5% or 50ns
in the total packet processing time, if i remember well),
but there was a large gain with multiple concurrent processes.


Yes, that sounds about right when we did the tests a long while ago.


 Removing flowtable increases the cost in ip_output()
 (obviously) but also in ether_output() (because the
 route does not have a lle entry so you need to call
 arpresolve on each packet). 


Yup.


 So in revising the route lookup i believe it would be good
 if we could also get at once most of the info that
 ether_output() is computing again and again.


Well, the routing table no longer maintains any lle info, so there
isn't much to copy out of the rtentry at the completion of the route
lookup.

If I understood you correctly, you do believe there is a lot of value
in the flowtable caching concept, but you are not suggesting we revert
back to having the routing table maintain L2 entries, are you ?

--Qing


Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread K. Macy
On Tue, Apr 24, 2012 at 4:16 PM, Li, Qing qing...@bluecoat.com wrote:

 From previous tests, the difference between flowtable and
routing table was small with a single process (about 5% or 50ns
in the total packet processing time, if i remember well),
but there was a large gain with multiple concurrent processes.


 Yes, that sounds about right when we did the tests a long while ago.


 Removing flowtable increases the cost in ip_output()
 (obviously) but also in ether_output() (because the
 route does not have a lle entry so you need to call
 arpresolve on each packet).


 Yup.


 So in revising the route lookup i believe it would be good
 if we could also get at once most of the info that
 ether_output() is computing again and again.


 Well, the routing table no longer maintains any lle info, so there
 isn't much to copy out the rtentry at the completion of route
 lookup.

 If I understood you correctly, you do believe there is a lot of value
 in Flowtable caching concept, but you are not suggesting we reverting
 back to having the routing table maintain L2 entries, are you ?



One could try a similar conversion of the L2 table to an rmlock
without copy while lock is held.

-Kip




Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread K. Macy
On Tue, Apr 24, 2012 at 5:03 PM, K. Macy km...@freebsd.org wrote:
 On Tue, Apr 24, 2012 at 4:16 PM, Li, Qing qing...@bluecoat.com wrote:

 From previous tests, the difference between flowtable and
routing table was small with a single process (about 5% or 50ns
in the total packet processing time, if i remember well),
but there was a large gain with multiple concurrent processes.


 Yes, that sounds about right when we did the tests a long while ago.


 Removing flowtable increases the cost in ip_output()
 (obviously) but also in ether_output() (because the
 route does not have a lle entry so you need to call
 arpresolve on each packet).


 Yup.


 So in revising the route lookup i believe it would be good
 if we could also get at once most of the info that
 ether_output() is computing again and again.


 Well, the routing table no longer maintains any lle info, so there
 isn't much to copy out the rtentry at the completion of route
 lookup.

 If I understood you correctly, you do believe there is a lot of value
 in Flowtable caching concept, but you are not suggesting we reverting
 back to having the routing table maintain L2 entries, are you ?



 One could try a similar conversion of the L2 table to an rmlock
 without copy while lock is held.

Odd .. *with* copy while lock is held.

-Kip


Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Luigi Rizzo
On Tue, Apr 24, 2012 at 02:16:18PM +, Li, Qing wrote:
 
 From previous tests, the difference between flowtable and
 routing table was small with a single process (about 5% or 50ns
 in the total packet processing time, if i remember well),
 but there was a large gain with multiple concurrent processes.
 
 
 Yes, that sounds about right when we did the tests a long while ago.
 
 
  Removing flowtable increases the cost in ip_output()
  (obviously) but also in ether_output() (because the
  route does not have a lle entry so you need to call
  arpresolve on each packet). 
 
 
 Yup.
 
 
  So in revising the route lookup i believe it would be good
  if we could also get at once most of the info that
  ether_output() is computing again and again.
 
 
 Well, the routing table no longer maintains any lle info, so there
 isn't much to copy out the rtentry at the completion of route
 lookup.
 
 If I understood you correctly, you do believe there is a lot of value
 in Flowtable caching concept, but you are not suggesting we reverting
 back to having the routing table maintain L2 entries, are you ?

I see a lot of value in caching in general.

Especially for a bound socket it seems pointless to look up the
route, iface and MAC address(es) on every single packet instead of
caching them. And, routes and MAC addresses are volatile anyways,
so making sure that we do the lookup 1us closer to the actual use
gives no additional guarantee.

The frequency with which this info (routes and MAC addresses)
changes clearly influences the mechanism used to validate the cache.
I suppose we have the following options:

- direct notification: a failure in a direct chain of calls
  can be used to invalidate the info cached in the socket.
  Similarly, some incoming traffic (e.g. TCP RST, FIN,
  ICMP messages) that reach a socket can invalidate the cached values
- assume a minimum lifetime for the info (i think this is what
  happens in the flowtable) and flush it unconditionally
  every such interval (say 10ms).
- if some info changes infrequently (e.g. MAC addresses) one could
  put a version number in the cached value and use it to validate
  the cache.

cheers
luigi


Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread K. Macy
On Tue, Apr 24, 2012 at 6:34 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
 On Tue, Apr 24, 2012 at 02:16:18PM +, Li, Qing wrote:
 
 From previous tests, the difference between flowtable and
 routing table was small with a single process (about 5% or 50ns
 in the total packet processing time, if i remember well),
 but there was a large gain with multiple concurrent processes.
 

 Yes, that sounds about right when we did the tests a long while ago.

 
  Removing flowtable increases the cost in ip_output()
  (obviously) but also in ether_output() (because the
  route does not have a lle entry so you need to call
  arpresolve on each packet).
 

 Yup.

 
  So in revising the route lookup i believe it would be good
  if we could also get at once most of the info that
  ether_output() is computing again and again.
 

 Well, the routing table no longer maintains any lle info, so there
 isn't much to copy out the rtentry at the completion of route
 lookup.

 If I understood you correctly, you do believe there is a lot of value
 in Flowtable caching concept, but you are not suggesting we reverting
 back to having the routing table maintain L2 entries, are you ?

 I see a lot of value in caching in general.

 Especially for a bound socket it seems pointless to lookup the
 route, iface and mac address(es) on every single packet instead of
 caching them. And, routes and MAC addresses are volatile anyways
 so making sure that we do the lookup 1us closer to the actual use
 gives no additional guarantee.

 The frequency with which these info (routes and MAC addresses)
 change clearly influences the mechanism to validate the cache.
 I suppose we have the following options:

 - direct notification: a failure in a direct chain of calls
  can be used to invalidate the info cached in the socket.
  Similarly, some incoming traffic (e.g. TCP RST, FIN,
  ICMP messages) that reach a socket can invalidate the cached values
 - assume a minimum lifetime for the info (i think this is what
  happens in the flowtable) and flush it unconditionally
  every such interval (say 10ms).
 - if some info changes infrequently (e.g. MAC addresses) one could
  put a version number in the cached value and use it to validate
  the cache.

I have a patch that has been sitting around for a long time due to
review cycle latency that caches a pointer to the rtentry (and
llentry) in the inpcb. Before each use the rtentry is checked
against a generation number in the routing tree that is incremented on
every routing table update.
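
From that description, the validation presumably looks something like
the sketch below.  The inp_rt and inp_rt_gencnt fields and the
rt_gencnt()/route_lookup_ref() helpers are invented names to illustrate
the idea, not the actual patch.

    /*
     * Per-inpcb route cache validated against a per-fib generation
     * number that is bumped on every routing table change.
     */
    static struct rtentry *
    inp_cached_route(struct inpcb *inp, struct sockaddr *dst)
    {
            u_int fib = inp->inp_inc.inc_fibnum;

            /* Fast path: cached rtentry still matches the table generation. */
            if (inp->inp_rt != NULL && inp->inp_rt_gencnt == rt_gencnt(fib))
                    return (inp->inp_rt);

            /* Slow path: drop the stale reference and look up again. */
            if (inp->inp_rt != NULL)
                    RTFREE(inp->inp_rt);
            inp->inp_rt = route_lookup_ref(dst, fib);   /* takes a refcount */
            inp->inp_rt_gencnt = rt_gencnt(fib);
            return (inp->inp_rt);
    }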


Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Fabien Thomas
 
 
 I have a patch that has been sitting around for a long time due to
 review cycle latency that caches a pointer to the rtentry (and
 llentry) in the the inpcb. Before each use the rtentry is checked
 against a generation number in the routing tree that is incremented on
 every routing table update.

Hi Kip,

Is there a public location for the patch ?
What can be done to speed up the commit: testing ?

Fabien



RE: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Li, Qing
Yup, all good points. In fact we have considered all of these while doing
the work. In case you haven't seen it already, we did write about these
issues in our paper and how we tried to address them; the flowtable was one
of the solutions.

http://dl.acm.org/citation.cfm?id=1592641

--Qing


 
  Well, the routing table no longer maintains any lle info, so there
  isn't much to copy out the rtentry at the completion of route
  lookup.
 
  If I understood you correctly, you do believe there is a lot of value
  in Flowtable caching concept, but you are not suggesting we reverting
  back to having the routing table maintain L2 entries, are you ?
 
 I see a lot of value in caching in general.
 
 Especially for a bound socket it seems pointless to lookup the
 route, iface and mac address(es) on every single packet instead of
 caching them. And, routes and MAC addresses are volatile anyways
 so making sure that we do the lookup 1us closer to the actual use
 gives no additional guarantee.
 
 The frequency with which these info (routes and MAC addresses)
 change clearly influences the mechanism to validate the cache.
 I suppose we have the following options:
 
 - direct notification: a failure in a direct chain of calls
   can be used to invalidate the info cached in the socket.
   Similarly, some incoming traffic (e.g. TCP RST, FIN,
   ICMP messages) that reach a socket can invalidate the cached values
 - assume a minimum lifetime for the info (i think this is what
   happens in the flowtable) and flush it unconditionally
   every such interval (say 10ms).
 - if some info changes infrequently (e.g. MAC addresses) one could
   put a version number in the cached value and use it to validate
   the cache.
 
 cheers
 luigi


RE: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Li, Qing
 
  I have a patch that has been sitting around for a long time due to
  review cycle latency that caches a pointer to the rtentry (and
  llentry) in the the inpcb. Before each use the rtentry is checked
  against a generation number in the routing tree that is incremented
 on
  every routing table update.
 
 Hi Kip,
 
 Is there a public location for the patch ?
 What can be done to speedup the commit: testing ?
 
 Fabien

I performed an extensive review of this patch from Kip, and it was
ready to go. Really good work.

Not sure what is stopping its commit into the tree.

--Qing





Re: Some performance measurements on the FreeBSD network stack

2012-04-24 Thread Bjoern A. Zeeb

On 24. Apr 2012, at 17:42 , Li, Qing wrote:

 
 I have a patch that has been sitting around for a long time due to
 review cycle latency that caches a pointer to the rtentry (and
 llentry) in the the inpcb. Before each use the rtentry is checked
 against a generation number in the routing tree that is incremented
 on
 every routing table update.
 
 Hi Kip,
 
 Is there a public location for the patch ?
 What can be done to speedup the commit: testing ?
 
 Fabien
 
 I performed extensive review of this patch from Kip, and it was
 ready to go. Really good work. 
 
 Not sure what is stopping its commit into the tree.

Because there were leaks, there were 100% panics for IPv6, ... at least on
the version I had seen in autumn last year.

There is certainly no one more interested than me in these, esp. for v6
where the removal of route caching a long time ago made nd6_nud_hint() a NOP
with dst and rt being passed down as NULL only, and where we are doing up to
three route lookups in the output path if no cached rt is passed down along
from the ULP.

If there is an updated patch, I'd love to see it.

/bz

-- 
Bjoern A. Zeeb You have to have visions!
   It does not matter how good you are. It matters what good you do!



Re: Some performance measurements on the FreeBSD network stack

2012-04-22 Thread K. Macy
Most of these issues are well known. Addressing the bottlenecks is
simply time consuming due to the fact that any bugs introduced during
development potentially impact many users.

-Kip
On Sun, Apr 22, 2012 at 4:14 AM, Adrian Chadd adr...@freebsd.org wrote:
 Hi,

 This honestly sounds like it's begging for an
 instrumentation/analysis/optimisation project.

 What do we need to do?


 Adrian





Re: Some performance measurements on the FreeBSD network stack

2012-04-21 Thread Bruce Evans

On Fri, 20 Apr 2012, K. Macy wrote:


On Fri, Apr 20, 2012 at 4:44 PM, Luigi Rizzo ri...@iet.unipi.it wrote:



The small penalty when flowtable is disabled but compiled in is
probably because the net.flowtable.enable flag is checked
a bit deep in the code.

The advantage with non-connect()ed sockets is huge. I don't
quite understand why disabling the flowtable still helps there.


Do you mean having it compiled in but disabled still helps
performance? Yes, that is extremely strange.


This reminds me that when I worked on this, I saw very large throughput
differences (in the 20-50% range) as a result of minor changes in
unrelated code.  I could get these changes intentionally by adding or
removing padding in unrelated unused text space, so the differences were
apparently related to text alignment.  I thought I had some significant
micro-optimizations, but it turned out that they were acting mainly by
changing the layout in related used text space where it is harder to
control.  Later, I suspected that the differences were more due to cache
misses for data than for text.  The CPU and its caching must affect this
significantly.  I tested on an AthlonXP and Athlon64, and the differences
were larger on the AthlonXP.  Both of these have a shared I/D cache so
pressure on the I part would affect the D part, but in this benchmark
the D part is much more active than the I part so it is unclear how
text layout could have such a large effect.

Anyway, the large differences made it impossible to trust the results
of benchmarking any single micro-benchmark.  Also, ministat is useless
for understanding the results.  (I note that luigi didn't provide any
standard deviations and neither would I. :-).  My results depended on
the cache behaviour but didn't change significantly when rerun, unless
the code was changed.

Bruce


Re: Some performance measurements on the FreeBSD network stack

2012-04-21 Thread Adrian Chadd
Hi,

This honestly sounds like it's begging for an
instrumentation/analysis/optimisation project.

What do we need to do?


Adrian


Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread Luigi Rizzo
On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote:
 On 20.04.2012 00:03, Luigi Rizzo wrote:
 On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
 On 19.04.2012 22:46, Luigi Rizzo wrote:
 The allocation happens while the code has already an exclusive
 lock on so->snd_buf so a pool of fresh buffers could be attached
 there.
 
 Ah, there it is not necessary to hold the snd_buf lock while
 doing the allocate+copyin.  With soreceive_stream() (which is
 
 it is not held in the tx path either -- but there is a short section
 before m_uiotombuf() which does
 
  ...
  SOCKBUF_LOCK(&so->so_snd);
  // check for pending errors, sbspace, so_state
  SOCKBUF_UNLOCK(&so->so_snd);
  ...
 
 (some of this is slightly dubious, but that's another story)
 
 Indeed the lock isn't held across the m_uiotombuf().  You're talking
 about filling an sockbuf mbuf cache while holding the lock?

all i am thinking is that when we have a serialization point we
could use it for multiple related purposes. In this case yes, we
could keep a small mbuf cache attached to so_snd. When the cache
is empty, either get a new batch (say 10-20 bufs) from the zone
allocator, possibly dropping and regaining the lock if the so_snd
lock must be a leaf.  Besides, for protocols like TCP (does it use the
same path ?) the mbufs are already there (released by incoming acks)
in the steady state, so it is not even necessary to refill the
cache.
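
As a rough illustration of the idea (the sb_mcache field and the
sbuf_mget() helper are invented for this sketch, they do not exist in
the tree):

    #define SB_MCACHE_BATCH 16

    /*
     * Grab an mbuf from a hypothetical per-sockbuf cache, refilling it
     * in a batch from the zone allocator when it runs dry.  The sockbuf
     * lock is already held here, so no extra serialization is needed.
     */
    static struct mbuf *
    sbuf_mget(struct sockbuf *sb)
    {
            struct mbuf *m;
            int i;

            SOCKBUF_LOCK_ASSERT(sb);
            if (sb->sb_mcache == NULL) {
                    for (i = 0; i < SB_MCACHE_BATCH; i++) {
                            m = m_get(M_NOWAIT, MT_DATA);
                            if (m == NULL)
                                    break;
                            m->m_next = sb->sb_mcache;
                            sb->sb_mcache = m;
                    }
            }
            if ((m = sb->sb_mcache) != NULL) {
                    sb->sb_mcache = m->m_next;
                    m->m_next = NULL;
            }
            return (m);
    }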

This said, i am not 100% sure that the 100ns I am seeing are all
spent in the zone allocator.  As i said the chain of indirect calls
and other ops is rather long on both acquire and release.

 But the other consideration is that one could defer the mbuf allocation
 to a later time when the packet is actually built (or anyways
 right before the thread returns).
 What i envision (and this would fit nicely with netmap) is the following:
 - have a (possibly readonly) template for the headers (MAC+IP+UDP)
attached to the socket, built on demand, and cached and managed
with similar invalidation rules as used by fastforward;
 
 That would require to cross-pointer the rtentry and whatnot again.
 
 i was planning to keep a copy, not a reference. If the copy becomes
 temporarily stale, no big deal, as long as you can detect it reasonably
 quickly -- routes are not guaranteed to be correct, anyways.
 
 Be wary of disappearing interface pointers...

(this reminds me, what prevents a route grabbed from the flowtable
from disappearing and releasing the ifp reference ?)

In any case, it seems better to keep a more persistent ifp reference
in the socket rather than grab and release one on every single
packet transmission.

 - possibly extend the pru_send interface so one can pass down the uio
instead of the mbuf;
 - make an opportunistic buffer allocation in some place downstream,
where the code already has an x-lock on some resource (could be
the snd_buf, the interface, ...) so the allocation comes for free.
 
 ETOOCOMPLEXOVERTIME.
 
 maybe. But i want to investigate this.
 
 I fail see what passing down the uio would gain you.  The snd_buf lock
 isn't obtained again after the copyin.  Not that I want to prevent you
 from investigating other ways. ;)

maybe it can open the way to other optimizations, such as reducing
the number of places where you need to lock, saving some data
copies, or reducing fragmentation, etc.

cheers
luigi


Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread Alexander V. Chernikov

On 20.04.2012 01:12, Andre Oppermann wrote:

On 19.04.2012 22:34, K. Macy wrote:

This is indeed a big problem. I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which




This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.


The rtentry lock isn't obtained anymore. While the rmlock read
lock is held on the rtable the relevant information like ifp and
such is copied out. No later referencing possible. In the end
any referencing of an rtentry would be forbidden and the rtentry
lock can be removed. The second step can be optional though.


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?



If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale
to arbitrary flow numbers.


From my experience, turning fastfwd on gives ~20-30% performance
increase (10G forwarding with firewalling, 1.4 MPPS). ip_forward() uses 2
lookups (ip_rtaddr + ip_output) vs 1 in ip_fastfwd().
The worst current problem IMHO is the number of locks a packet has to
traverse, not the number of lookups.




In theory a rmlock-only lookup into a default-route-only routing
table would be faster than creating a flow table entry for every
destination. It's a matter of churn though. The flowtable isn't
lockless in itself, is it?




--
WBR, Alexander


Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread Andre Oppermann

On 20.04.2012 10:26, Alexander V. Chernikov wrote:

On 20.04.2012 01:12, Andre Oppermann wrote:

On 19.04.2012 22:34, K. Macy wrote:

If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale



 From my experience, turning fastfwd on gives ~20-30% performance
increase (10G forwarding with firewalling, 1.4MPPS). ip_forward() uses 2
lookups (ip_rtaddr + ip_output) vs 1 ip_fastfwd().


Another difference is the packet copy the normal forwarding path
does to be able to send an ICMP redirect message if the packet is
forwarded to a different gateway on the same LAN.  fastforward
doesn't do that.


The worst current problem IMHO is the number of locks a packet has to
traverse, not the number of lookups.


Agreed.  Actually the locking in itself is not the problem.  It's
the side effects of cache line dirtying/bouncing and contention.
However, in the great majority of cases the data protected by
the lock is only read, not modified, making a 'full' lock expensive.

--
Andre


Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread Andre Oppermann

On 20.04.2012 08:35, Luigi Rizzo wrote:

On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote:

On 20.04.2012 00:03, Luigi Rizzo wrote:

On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:

On 19.04.2012 22:46, Luigi Rizzo wrote:

The allocation happens while the code has already an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached
there.


Ah, there it is not necessary to hold the snd_buf lock while
doing the allocate+copyin.  With soreceive_stream() (which is


it is not held in the tx path either -- but there is a short section
before m_uiotombuf() which does

...
SOCKBUF_LOCK(&so->so_snd);
// check for pending errors, sbspace, so_state
SOCKBUF_UNLOCK(&so->so_snd);
...

(some of this is slightly dubious, but that's another story)


Indeed the lock isn't held across the m_uiotombuf().  You're talking
about filling an sockbuf mbuf cache while holding the lock?


all i am thinking is that when we have a serialization point we
could use it for multiple related purposes. In this case yes we
could keep a small mbuf cache attached to so_snd. When the cache
is empty either get a new batch (say 10-20 bufs) from the zone
allocator, possibly dropping and regaining the lock if the so_snd
must be a leaf.  Besides for protocols like TCP (does it use the
same path ?) the mbufs are already there (released by incoming acks)
in the steady state, so it is not even necessary to to refill the
cache.


I'm sure things can be tuned towards particular cases, but almost
always that comes at the expense of versatility.  I was looking
at netmap for a project.  It's great when there is one thing being
done by one process at great speed.  However, as soon as I have to
dispatch certain packets somewhere else for further processing,
in another process, things quickly become complicated and fall
apart.  It would have meant replicating what the kernel does
with protosw & friends in userspace, coated with IPC.  Not to mention
re-inventing the socket layer abstraction again.

So netmap is fantastic for simple, bulk and repetitive tasks with
little variance.  Things like packet routing, bridging, encapsulation,
perhaps inspection and acting as a traffic sink/source.  There are
plenty of use cases for that.

Coming back to your UDP test case, while the 'hacks' you propose
may benefit the bulk sending of a bound socket, they may not help, or
may even pessimize, the DNS server case where a large number of packets
is sent to a large number of destinations.

The layering abstractions we have in BSD are excellent and have
served us quite well so far.  Adding new protocols is a simple
task, and so on.  Of course it has some trade-offs by having some
indirections and not being bare-metal fast.  Yes, there is a lot
of potential in optimizing the locking strategies we currently
have within the BSD network stack layering.  Your profiling work
is immensely helpful in identifying where to aim.  Once that
is fixed we should stop there.  Anyone who needs a particular,
as-close-to-the-bare-metal-as-possible UDP packet blaster should
fork the tree and do their own short-cuts and whatnot.  But FreeBSD
should stay a reasonable general-purpose system.  It won't be a Ferrari,
but an Audi S6 is a damn nice car as well and it can carry your
whole family. :)


This said, i am not 100% sure that the 100ns I am seeing are all
spent in the zone allocator.  As i said the chain of indirect calls
and other ops is rather long on both acquire and release.


But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
   attached to the socket, built on demand, and cached and managed
   with similar invalidation rules as used by fastforward;


That would require to cross-pointer the rtentry and whatnot again.


i was planning to keep a copy, not a reference. If the copy becomes
temporarily stale, no big deal, as long as you can detect it reasonably
quickly -- routes are not guaranteed to be correct, anyways.


Be wary of disappearing interface pointers...


(this reminds me, what prevents a route grabbed from the flowtable
from disappearing and releasing the ifp reference ?)


It has to keep a refcounted reference to the rtentry.


In any case, it seems better to keep a more persistent ifp reference
in the socket rather than grab and release one on every single
packet transmission.


The socket doesn't and shouldn't know anything about ifp's.


- possibly extend the pru_send interface so one can pass down the uio
   instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
   where the code already has an x-lock on some resource (could be
   the snd_buf, the interface, ...) so the allocation comes for free.


ETOOCOMPLEXOVERTIME.



Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread John Baldwin
On Thursday, April 19, 2012 4:46:22 pm Luigi Rizzo wrote:
 What might be moderately expensive are the critical_enter()/critical_exit()
 calls around individual allocations.
 The allocation happens while the code has already an exclusive
 lock on so->snd_buf so a pool of fresh buffers could be attached
 there.

Keep in mind that in the common case critical_enter() and critical_exit()
should be very cheap as they should just do td->td_critnest++ and
td->td_critnest--.  critical_enter() should probably be inlined if KTR
is not enabled.
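
For reference, a minimal sketch of that fast path (the real functions,
in kern_switch.c if I recall correctly, also do KTR tracing and handle
a deferred preemption on exit when td_owepreempt is set):

    static __inline void
    critical_enter_sketch(void)
    {
            curthread->td_critnest++;
    }

    static __inline void
    critical_exit_sketch(void)
    {
            /* the real code checks td_owepreempt before dropping the last level */
            curthread->td_critnest--;
    }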

-- 
John Baldwin


Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 11:06:38PM +0200, K. Macy wrote:
 On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
  On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
   This is indeed a big problem. ?I'm working (rough edges remain) on
   changing the routing table locking to an rmlock (read-mostly) which
  
 
  This only helps if your flows aren't hitting the same rtentry.
  Otherwise you still convoy on the lock for the rtentry itself to
  increment and decrement the rtentry's reference count.
 
   i was wondering, is there a way (and/or any advantage) to use the
   fastforward code to look up the route for locally sourced packets ?
 
  actually, now that i look at the code, both ip_output() and
  the ip_fastforward code use the same in_rtalloc_ign(...)
 
  
 
  If the number of peers is bounded then you can use the flowtable. Max
  PPS is much higher bypassing routing lookup. However, it doesn't scale
  to arbitrary flow numbers.
 
  re. flowtable, could you point me to what i should do instead of
  calling in_rtalloc_ign() ?
 
 If you build with it in your kernel config and enable the sysctl
 ip_output will automatically use it for TCP and UDP connections. If
 you're doing forwarding you'll need to patch the forwarding path.

cool.
For the record, with netsend 10.0.0.2 ports 18 0 5 on an ixgbe
talking to a remote host i get the following results (with a single
port netsend does a connect() and then send(), otherwise it
loops around a sendto() )
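
For anyone who hasn't read the tool, a minimal illustration of the two
modes (not the actual netsend source): a connected socket pays the
address/route work once in connect(), while an unconnected one passes
the destination on every sendto().

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    static void
    blast(int connected, const struct sockaddr_in *dst, const char *buf,
        size_t len, long npkts)
    {
            int s = socket(AF_INET, SOCK_DGRAM, 0);
            long i;

            if (connected) {
                    connect(s, (const struct sockaddr *)dst, sizeof(*dst));
                    for (i = 0; i < npkts; i++)
                            send(s, buf, len, 0);           /* single-port case */
            } else {
                    for (i = 0; i < npkts; i++)
                            sendto(s, buf, len, 0,          /* port-range case */
                                (const struct sockaddr *)dst, sizeof(*dst));
            }
            close(s);
    }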

net.flowtable.enabled   port         ns/pkt
---------------------------------------------------------
not compiled in         5000           944  M_FLOWID not set
0 (disable)             5000          1004
1 (enable)              5000           980

not compiled in         5000-5001     3400  M_FLOWID not set
0 (disable)             5000-5001     1418
1 (enable)              5000-5001     1230

The small penalty when flowtable is disabled but compiled in is
probably because the net.flowtable.enable flag is checked
a bit deep in the code.

The advantage with non-connect()ed sockets is huge. I don't
quite understand why disabling the flowtable still helps there.

cheers
luigi


Re: Some performance measurements on the FreeBSD network stack

2012-04-20 Thread K. Macy
Comments inline below:

On Fri, Apr 20, 2012 at 4:44 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
 On Thu, Apr 19, 2012 at 11:06:38PM +0200, K. Macy wrote:
 On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
  On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
   This is indeed a big problem. ?I'm working (rough edges remain) on
   changing the routing table locking to an rmlock (read-mostly) which
  
 
  This only helps if your flows aren't hitting the same rtentry.
  Otherwise you still convoy on the lock for the rtentry itself to
  increment and decrement the rtentry's reference count.
 
   i was wondering, is there a way (and/or any advantage) to use the
   fastforward code to look up the route for locally sourced packets ?
 
  actually, now that i look at the code, both ip_output() and
  the ip_fastforward code use the same in_rtalloc_ign(...)
 
  
 
  If the number of peers is bounded then you can use the flowtable. Max
  PPS is much higher bypassing routing lookup. However, it doesn't scale
  to arbitrary flow numbers.
 
  re. flowtable, could you point me to what i should do instead of
  calling in_rtalloc_ign() ?

 If you build with it in your kernel config and enable the sysctl
 ip_output will automatically use it for TCP and UDP connections. If
 you're doing forwarding you'll need to patch the forwarding path.

 cool.
 For the records, with netsend 10.0.0.2 ports 18 0 5 on an ixgbe
 talking to a remote host i get the following results (with a single
 port netsend does a connect() and then send(), otherwise it
 loops around a sendto() )


Sorry, 5000 vs 5000-5001 means 1 vs 2 streams? Does this mean for a
single socket the overhead is less without it compiled in than with it
compiled in but enabled? That is certainly different from what I see
with TCP where I see a 30% increase in aggregate throughput the last
time I tried this (on IPoIB).

For the record, the M_FLOWID is used to pick the transmit queue, so with
multiple streams you're best off setting it if your device has more
than one hardware device queue.

        net.flowtable.enabled   port            ns/pkt
        -
        not compiled in         5000             944    M_FLOWID not set
        0 (disable)             5000            1004
        1 (enable)              5000             980

        not compiled in         5000-5001       3400    M_FLOWID not set
        0 (disable)             5000-5001       1418
        1 (enable)              5000-5001       1230

 The small penalty when flowtable is disabled but compiled in is
 probably because the net.flowtable.enable flag is checked
 a bit deep in the code.

 The advantage with non-connect()ed sockets is huge. I don't
 quite understand why disabling the flowtable still helps there.

Do you mean having it compiled in but disabled still helps
performance? Yes, that is extremely strange.

-Kip


-- 
   “The real damage is done by those millions who want to 'get by.'
The ordinary men who just want to be left in peace. Those who don’t
want their little lives disturbed by anything bigger than themselves.
Those with no sides and no causes. Those who won’t take measure of
their own strength, for fear of antagonizing their own weakness. Those
who don’t like to make waves—or enemies.

   Those for whom freedom, honour, truth, and principles are only
literature. Those who live small, love small, die small. It’s the
reductionist approach to life: if you keep it small, you’ll keep it
under control. If you don’t make any noise, the bogeyman won’t find
you.

   But it’s all an illusion, because they die too, those people who
roll up their spirits into tiny little balls so as to be safe. Safe?!
From what? Life is always on the edge of death; narrow streets lead to
the same place as wide avenues, and a little candle burns itself out
just like a flaming torch does.

   I choose my own way to burn.”

   Sophie Scholl


Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.

Test conditions:
- intel i7-870 CPU running at 2.93 GHz + TurboBoost,
  all 4 cores enabled, no hyperthreading
- FreeBSD HEAD as of 15 april 2012, no ipfw, no other
  pfilter clients, no ipv6 or ipsec.
- userspace running 'netsend 10.0.0.2  18 0 5'
  (output to a physical interface, udp port , small
  frame, no rate limitations, 5sec experiments)
- the 'ns' column reports
  the total time divided by the number of successful
  transmissions; we report the min and max over 5 tests
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers which become
  larger as we reach the bottom of the stack

Caveats:
- in the table below, clock and pktlen are constant.
  I am including the info here so it is easier to compare
  the results with future experiments

- i have a small number of samples, so i am only reporting
  the min and the max in a handful of experiments.

- i am only measuring average values over millions of
  cycles. I have no info on what is the variance between
  the various executions.

- from what i have seen, numbers vary significantly on
  different systems, depending on memory speed, caches
  and other things. The big jumps are significant and present
  on all systems, but the small deltas (say  5%) are
  not even statistically significant.

- if someone is interested in replicating the experiments
  email me and i will post a link to a suitable picobsd image.

- i have not yet instrumented the bottom layers (if_output
  and below).

The results show a few interesting things:

- the packet-sending application is reasonably fast
  and certainly not a bottleneck (over 100Mpps before
  calling the system call);

- the system call is somewhat expensive, about 100ns.
  I am not sure where the time is spent (the amd64 code
  does a few pushes on the stack and then runs syscall,
  followed by a sysret). I am not sure how much
  room for improvement there is in this area.
  The relevant code is in lib/libc/i386/SYS.h and
  lib/libc/i386/sys/syscall.S (KERNCALL translates
  to syscall on amd64, and int 0x80 on the i386)

- the next expensive operation, consuming another 100ns,
  is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
  seems to scale decently at least with 4 cores.  The copyin() is
  relatively inexpensive (not reported in the data below, but
  disabling it saves only 15-20ns for a short packet).

  I have not followed the details, but the allocator calls the zone
  allocator and there is at least one critical_enter()/critical_exit()
  pair, and the highly modular architecture invokes long chains of
  indirect function calls both on allocation and release.

  It might make sense to keep a small pool of mbufs attached to the
  socket buffer instead of going to the zone allocator.
  Or defer the actual encapsulation to the
  (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
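
  As a rough illustration of that idea (nothing like this exists in the
  tree; struct sb_mbcache and sbc_alloc() are hypothetical, the refill and
  free paths are omitted, and the caller is assumed to already hold the
  sockbuf lock):

	/* Hypothetical: a tiny mbuf free list hanging off the socket buffer. */
	struct sb_mbcache {
		struct mbuf	*sbc_head;	/* linked through m_next */
		int		 sbc_count;
	};

	static struct mbuf *
	sbc_alloc(struct sb_mbcache *sbc)
	{
		struct mbuf *m;

		if ((m = sbc->sbc_head) != NULL) {
			/* fast path: no zone allocator, no critical_enter()/exit() */
			sbc->sbc_head = m->m_next;
			m->m_next = NULL;
			sbc->sbc_count--;
			return (m);
		}
		return (m_get(M_DONTWAIT, MT_DATA));	/* slow path: UMA, as today */
	}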

- another big bottleneck is the route lookup in ip_output()
  (between entries 51 and 56). Not only does it eat another
  100ns+ on an empty routing table, but it also
  causes huge contention when multiple cores
  are involved.

There is other bad stuff occurring in if_output() and
below (on this system it takes about 1300ns to send one
packet even with one core, and only 500-550 ns are consumed
before the call to if_output()) but i don't have
detailed information yet.


POS CPU clock pktlen  ns/pkt       --- EXIT POINT ---
                      min  max
------------------------------------------------------------
U   1   2934 18 88  userspace, before the send() call
  [ syscall ]
20  1   2934 18   103  107  sys_sendto(): begin
20  4   2934 18   104  107

21  1   2934 18   110  113  sendit(): begin
21  4   2934 18   111  116

22  1   2934 18   110  114  sendit() after getsockaddr(to, ...)
22  4   2934 18   111  124

23  1   2934 18   111  115  sendit() before kern_sendit
23  4   2934 18   112  120

24  1   2934 18   117  120  kern_sendit() after AUDIT_ARG_FD
24  4   2934 18   117  121

25  1   2934 18   134  140  kern_sendit() before sosend()
25  4   2934 18   134  146

40  1   2934 18   144  149  sosend_dgram(): start
40  4   2934 18   144  151

41  1   2934 18   157  166  sosend_dgram() before m_uiotombuf()
41  4   2934 18   157  168
   [ mbuf allocation and copy. The copy is relatively cheap ]
42  1   2934 18   264  268  sosend_dgram() after m_uiotombuf()
42  4   2934 18   265  269

30  1   2934 18   273  276  udp_send() begin
30  4   2934 18   274  278
   [ here we start seeing some contention with multiple threads ]
31  1   2934 18   323  324  udp_output() before ip_output()
31  4   2934 18   344  348

50  1   

Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Slawa Olhovchenkov
On Thu, Apr 19, 2012 at 03:30:18PM +0200, Luigi Rizzo wrote:

 I have been running some performance tests on UDP sockets,
 using the netsend program in tools/tools/netrate/netsend
 and instrumenting the source code and the kernel to return at
 various points of the path. Here are some results which
 I hope you find interesting.

I do some test in 2011.
May be this test is not actual now.
May be actual.

Initial message 
http://lists.freebsd.org/pipermail/freebsd-performance/2011-January/004156.html
UDP socket in FreeBSD 
http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004176.html
About 4BSD/ULE 
http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004181.html

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 15:30, Luigi Rizzo wrote:

I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.


Jumping over very interesting analysis...


- the next expensive operation, consuming another 100ns,
   is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
   seems to scale decently at least with 4 cores.  The copyin() is
   relatively inexpensive (not reported in the data below, but
   disabling it saves only 15-20ns for a short packet).

   I have not followed the details, but the allocator calls the zone
   allocator and there is at least one critical_enter()/critical_exit()
   pair, and the highly modular architecture invokes long chains of
   indirect function calls both on allocation and release.

   It might make sense to keep a small pool of mbufs attached to the
   socket buffer instead of going to the zone allocator.
   Or defer the actual encapsulation to the
   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.


The UMA mbuf allocator is certainly not perfect but rather good.
It has a per-CPU cache of mbuf's that are very fast to allocate
from.  Once it has used them it needs to refill from the global
pool which may happen from time to time and show up in the averages.


- another big bottleneck is the route lookup in ip_output()
   (between entries 51 and 56). Not only does it eat another
   100ns+ on an empty routing table, but it also
   causes huge contention when multiple cores
   are involved.


This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which
doesn't produce any lock contention or cache pollution.  Also skipping
the per-route lock while the table read-lock is held should help some
more.  All in all this should give a massive gain in high pps situations
at the expense of costlier routing table changes.  However, such changes
range from seldom to essentially never with a single default route.

After that the ARP table will get the same treatment and the low-stack
lock contention points should be gone for good.
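
For reference, the rmlock(9) read side looks roughly like the sketch
below; the lock name, struct nhop_copy and the empty lookup body are
placeholders of mine, not the actual patch:

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/rmlock.h>
	#include <netinet/in.h>

	/* Copied-out nexthop data; stands in for the rtentry fields we need. */
	struct nhop_copy {
		struct ifnet	*nh_ifp;
		struct in_addr	 nh_gw;
		int		 nh_mtu;
	};

	static struct rmlock rt_rmlock;	/* rm_init(&rt_rmlock, "rtable") at boot */

	static int
	rt_lookup_copy(struct in_addr dst, struct nhop_copy *nh)
	{
		struct rm_priotracker tracker;
		int error = 0;

		rm_rlock(&rt_rmlock, &tracker);
		/*
		 * The radix tree walk goes here; ifp, gateway and mtu are
		 * copied into the caller-owned *nh.  No rtentry reference
		 * is taken, so there is nothing to release afterwards and
		 * no per-rtentry lock to contend on.
		 */
		rm_runlock(&rt_rmlock, &tracker);
		return (error);
	}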

--
Andre
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
 On 19.04.2012 15:30, Luigi Rizzo wrote:
 I have been running some performance tests on UDP sockets,
 using the netsend program in tools/tools/netrate/netsend
 and instrumenting the source code and the kernel to return at
 various points of the path. Here are some results which
 I hope you find interesting.
 
 Jumping over very interesting analysis...
 
 - the next expensive operation, consuming another 100ns,
is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
seems to scale decently at least with 4 cores.  The copyin() is
relatively inexpensive (not reported in the data below, but
disabling it saves only 15-20ns for a short packet).
 
I have not followed the details, but the allocator calls the zone
allocator and there is at least one critical_enter()/critical_exit()
pair, and the highly modular architecture invokes long chains of
indirect function calls both on allocation and release.
 
It might make sense to keep a small pool of mbufs attached to the
socket buffer instead of going to the zone allocator.
Or defer the actual encapsulation to the
(*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
 
 The UMA mbuf allocator is certainly not perfect but rather good.
 It has a per-CPU cache of mbuf's that are very fast to allocate
 from.  Once it has used them it needs to refill from the global
 pool which may happen from time to time and show up in the averages.

indeed i was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and for short times, otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.
The allocation happens while the code has already an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached
there.

But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
  attached to the socket, built on demand, and cached and managed
  with similar invalidation rules as used by fastforward
  (a sketch of such a template follows below);
- possibly extend the pru_send interface so one can pass down the uio
  instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
  where the code already has an x-lock on some resource (could be
  the snd_buf, the interface, ...) so the allocation comes for free.
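
A sketch of what the cached template in the first point might look like
(none of these fields exist today; the generation counter is the
invalidation hook, bumped whenever the route/ARP state the template was
built from changes, in the spirit of the fastforward rules):

	#include <sys/types.h>
	#include <net/ethernet.h>
	#include <netinet/in_systm.h>
	#include <netinet/in.h>
	#include <netinet/ip.h>
	#include <netinet/udp.h>

	/* Hypothetical per-socket header template. */
	struct so_hdrtmpl {
		struct ether_header	ht_eh;	/* dst MAC from the resolved ARP entry */
		struct ip		ht_ip;	/* src/dst, ttl, proto; checksum seeded */
		struct udphdr		ht_udp;	/* ports fixed for a connected socket */
		uint32_t		ht_gen;	/* snapshot of a global generation
						   counter; if it no longer matches,
						   rebuild the template */
	} __packed;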

 - another big bottleneck is the route lookup in ip_output()
(between entries 51 and 56). Not only does it eat another
100ns+ on an empty routing table, but it also
causes huge contention when multiple cores
are involved.
 
 This is indeed a big problem.  I'm working (rough edges remain) on
 changing the routing table locking to an rmlock (read-mostly) which

i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
 This is indeed a big problem.  I'm working (rough edges remain) on
 changing the routing table locking to an rmlock (read-mostly) which


This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.

 i was wondering, is there a way (and/or any advantage) to use the
 fastforward code to look up the route for locally sourced packets ?


If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale
to arbitrary flow numbers.


-Kip
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
  This is indeed a big problem.  I'm working (rough edges remain) on
  changing the routing table locking to an rmlock (read-mostly) which
 
 
 This only helps if your flows aren't hitting the same rtentry.
 Otherwise you still convoy on the lock for the rtentry itself to
 increment and decrement the rtentry's reference count.
 
  i was wondering, is there a way (and/or any advantage) to use the
  fastforward code to look up the route for locally sourced packets ?

actually, now that i look at the code, both ip_output() and
the ip_fastforward code use the same in_rtalloc_ign(...)

 
 
 If the number of peers is bounded then you can use the flowtable. Max
 PPS is much higher bypassing routing lookup. However, it doesn't scale
 to arbitrary flow numbers.

re. flowtable, could you point me to what i should do instead of
calling in_rtalloc_ign() ?

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo ri...@iet.unipi.it wrote:
 On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
  This is indeed a big problem.  I'm working (rough edges remain) on
  changing the routing table locking to an rmlock (read-mostly) which
 

 This only helps if your flows aren't hitting the same rtentry.
 Otherwise you still convoy on the lock for the rtentry itself to
 increment and decrement the rtentry's reference count.

  i was wondering, is there a way (and/or any advantage) to use the
  fastforward code to look up the route for locally sourced packets ?

 actually, now that i look at the code, both ip_output() and
 the ip_fastforward code use the same in_rtalloc_ign(...)

 

 If the number of peers is bounded then you can use the flowtable. Max
 PPS is much higher bypassing routing lookup. However, it doesn't scale
 to arbitrary flow numbers.

 re. flowtable, could you point me to what i should do instead of
 calling in_rtalloc_ign() ?

If you build it into your kernel config and enable the sysctl,
ip_output will automatically use it for TCP and UDP connections. If
you're doing forwarding you'll need to patch the forwarding path.
Fabien Thomas has a patch for that; I just identified and fixed a bug
in it for him.
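
For the archive, enabling it looks roughly like this (option and sysctl
names as I remember them for this vintage of HEAD; double-check against
sys/conf/NOTES):

	# kernel configuration
	options 	FLOWTABLE

	# runtime toggle, e.g. in /etc/sysctl.conf
	net.inet.flowtable.enable=1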

-Kip


-- 
   “The real damage is done by those millions who want to 'get by.'
The ordinary men who just want to be left in peace. Those who don’t
want their little lives disturbed by anything bigger than themselves.
Those with no sides and no causes. Those who won’t take measure of
their own strength, for fear of antagonizing their own weakness. Those
who don’t like to make waves—or enemies.

   Those for whom freedom, honour, truth, and principles are only
literature. Those who live small, love small, die small. It’s the
reductionist approach to life: if you keep it small, you’ll keep it
under control. If you don’t make any noise, the bogeyman won’t find
you.

   But it’s all an illusion, because they die too, those people who
roll up their spirits into tiny little balls so as to be safe. Safe?!
From what? Life is always on the edge of death; narrow streets lead to
the same place as wide avenues, and a little candle burns itself out
just like a flaming torch does.

   I choose my own way to burn.”

   Sophie Scholl
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 22:34, K. Macy wrote:

This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which




This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.


The rtentry lock isn't obtained anymore.  While the rmlock read
lock is held on the rtable the relevant information like ifp and
such is copied out.  No later referencing possible.  In the end
any referencing of an rtentry would be forbidden and the rtentry
lock can be removed.  The second step can be optional though.


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?



If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale
to arbitrary flow numbers.


In theory a rmlock-only lookup into a default-route only routing
table would be faster than creating a flow table entry for every
destination.  It's a matter of churn though.  The flowtable isn't
lockless in itself, is it?

--
Andre
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
 This only helps if your flows aren't hitting the same rtentry.
 Otherwise you still convoy on the lock for the rtentry itself to
 increment and decrement the rtentry's reference count.


 The rtentry lock isn't obtained anymore.  While the rmlock read
 lock is held on the rtable the relevant information like ifp and
 such is copied out.  No later referencing possible.  In the end
 any referencing of an rtentry would be forbidden and the rtentry
 lock can be removed.  The second step can be optional though.

Can you point me to a tree where you've made these changes?

 i was wondering, is there a way (and/or any advantage) to use the
 fastforward code to look up the route for locally sourced packets ?


 If the number of peers is bounded then you can use the flowtable. Max
 PPS is much higher bypassing routing lookup. However, it doesn't scale
 to arbitrary flow numbers.


 In theory a rmlock-only lookup into a default-route only routing
 table would be faster than creating a flow table entry for every
 destination.  It a matter of churn though.  The flowtable isn't
 lockless in itself, is it?

It is. In a steady state where the working set of peers fits in the
table it should be just a simple hash of the ip and then a lookup.

-Kip
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 22:46, Luigi Rizzo wrote:

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:

On 19.04.2012 15:30, Luigi Rizzo wrote:

I have been running some performance tests on UDP sockets,
using the netsend program in tools/tools/netrate/netsend
and instrumenting the source code and the kernel to return at
various points of the path. Here are some results which
I hope you find interesting.


Jumping over very interesting analysis...


- the next expensive operation, consuming another 100ns,
   is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
   seems to scale decently at least with 4 cores.  The copyin() is
   relatively inexpensive (not reported in the data below, but
   disabling it saves only 15-20ns for a short packet).

   I have not followed the details, but the allocator calls the zone
   allocator and there is at least one critical_enter()/critical_exit()
   pair, and the highly modular architecture invokes long chains of
   indirect function calls both on allocation and release.

   It might make sense to keep a small pool of mbufs attached to the
   socket buffer instead of going to the zone allocator.
   Or defer the actual encapsulation to the
   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.


The UMA mbuf allocator is certainly not perfect but rather good.
It has a per-CPU cache of mbuf's that are very fast to allocate
from.  Once it has used them it needs to refill from the global
pool which may happen from time to time and show up in the averages.


indeed i was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and for short times, otherwise you'd see the effect with 4 threads.


Robert did the per-CPU mbuf allocator pools a few years ago.
Excellent engineering.


What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.


Can't get away from those as a thread must not migrate away
when manipulating the per-CPU mbuf pool.


The allocation happens while the code has already an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached
there.


Ah, there it is not necessary to hold the snd_buf lock while
doing the allocate+copyin.  With soreceive_stream() (which is
experimental and not enabled by default) I did just that for the
receive path.  It's quite a significant gain there.

IMHO better resolve the locking order than to juggle yet another
mbuf sink.


But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
   attached to the socket, built on demand, and cached and managed
   with similar invalidation rules as used by fastforward;


That would require cross-pointering the rtentry and whatnot again.
We want to get away from that to untangle the (locking) mess that
eventually results from it.


- possibly extend the pru_send interface so one can pass down the uio
   instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
   where the code already has an x-lock on some resource (could be
   the snd_buf, the interface, ...) so the allocation comes for free.


ETOOCOMPLEXOVERTIME.


- another big bottleneck is the route lookup in ip_output()
   (between entries 51 and 56). Not only does it eat another
   100ns+ on an empty routing table, but it also
   causes huge contention when multiple cores
   are involved.


This is indeed a big problem.  I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly) which


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?


No.  The main advantage/difference of fastforward is the short code
path and processing to completion.

--
Andre
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 19.04.2012 23:17, K. Macy wrote:

This only helps if your flows aren't hitting the same rtentry.
Otherwise you still convoy on the lock for the rtentry itself to
increment and decrement the rtentry's reference count.



The rtentry lock isn't obtained anymore.  While the rmlock read
lock is held on the rtable the relevant information like ifp and
such is copied out.  No later referencing possible.  In the end
any referencing of an rtentry would be forbidden and the rtentry
lock can be removed.  The second step can be optional though.


Can you point me to a tree where you've made these changes?


It's not in a public tree.  I just did a 'svn up' and the recent
pf and rtsocket changes created some conflicts.  Have to solve
them before posting.  Timeframe (early) next week.


i was wondering, is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets ?



If the number of peers is bounded then you can use the flowtable. Max
PPS is much higher bypassing routing lookup. However, it doesn't scale
to arbitrary flow numbers.



In theory a rmlock-only lookup into a default-route only routing
table would be faster than creating a flow table entry for every
destination.  It's a matter of churn though.  The flowtable isn't
lockless in itself, is it?


It is. In a steady state where the working set of peers fits in the
table it should be just a simple hash of the ip and then a lookup.


Yes, but the lookup requires a lock?  Or is every entry replicated
to every CPU?  So a number of concurrent CPUs sending to the same
UDP destination would contend on that lock?

--
Andre
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy

 Yes, but the lookup requires a lock?  Or is every entry replicated
 to every CPU?  So a number of concurrent CPUs sending to the same
 UDP destination would contend on that lock?

No. In the default case it's per CPU, thus no serialization is
required. But yes, if your transmitting thread manages to bounce to
every core during send within the flow expiration window you'll have
an extra 12 or however many bytes per peer times the number of cores.
There is usually a fair amount of CPU affinity over a given unit time.


___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread K. Macy
On Thu, Apr 19, 2012 at 11:27 PM, Andre Oppermann an...@freebsd.org wrote:
 On 19.04.2012 23:17, K. Macy wrote:

 This only helps if your flows aren't hitting the same rtentry.
 Otherwise you still convoy on the lock for the rtentry itself to
 increment and decrement the rtentry's reference count.



 The rtentry lock isn't obtained anymore.  While the rmlock read
 lock is held on the rtable the relevant information like ifp and
 such is copied out.  No later referencing possible.  In the end
 any referencing of an rtentry would be forbidden and the rtentry
 lock can be removed.  The second step can be optional though.


 Can you point me to a tree where you've made these changes?


 It's not in a public tree.  I just did a 'svn up' and the recent
 pf and rtsocket changes created some conflicts.  Have to solve
 them before posting.  Timeframe (early) next week.



Ok. Keep us posted.

Thanks,
Kip



___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Luigi Rizzo
On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
 On 19.04.2012 22:46, Luigi Rizzo wrote:
...
 What might be moderately expensive are the critical_enter()/critical_exit()
 calls around individual allocations.
 
 Can't get away from those as a thread must not migrate away
 when manipulating the per-CPU mbuf pool.

i understand.

 The allocation happens while the code has already an exclusive
 lock on so->snd_buf so a pool of fresh buffers could be attached
 there.
 
 Ah, there it is not necessary to hold the snd_buf lock while
 doing the allocate+copyin.  With soreceive_stream() (which is

it is not held in the tx path either -- but there is a short section
before m_uiotombuf() which does

...
SOCKBUF_LOCK(&so->so_snd);
// check for pending errors, sbspace, so_state
SOCKBUF_UNLOCK(&so->so_snd);
...

(some of this is slightly dubious, but that's another story)

 But the other consideration is that one could defer the mbuf allocation
 to a later time when the packet is actually built (or anyways
 right before the thread returns).
 What i envision (and this would fit nicely with netmap) is the following:
 - have a (possibly readonly) template for the headers (MAC+IP+UDP)
attached to the socket, built on demand, and cached and managed
with similar invalidation rules as used by fastforward;
 
 That would require cross-pointering the rtentry and whatnot again.

i was planning to keep a copy, not a reference. If the copy becomes
temporarily stale, no big deal, as long as you can detect it reasonably
quickly -- routes are not guaranteed to be correct, anyways.

 - possibly extend the pru_send interface so one can pass down the uio
instead of the mbuf;
 - make an opportunistic buffer allocation in some place downstream,
where the code already has an x-lock on some resource (could be
the snd_buf, the interface, ...) so the allocation comes for free.
 
 ETOOCOMPLEXOVERTIME.

maybe. But i want to investigate this.

cheers
luigi
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: Some performance measurements on the FreeBSD network stack

2012-04-19 Thread Andre Oppermann

On 20.04.2012 00:03, Luigi Rizzo wrote:

On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:

On 19.04.2012 22:46, Luigi Rizzo wrote:

The allocation happens while the code has already an exclusive
lock on so->snd_buf so a pool of fresh buffers could be attached
there.


Ah, there it is not necessary to hold the snd_buf lock while
doing the allocate+copyin.  With soreceive_stream() (which is


it is not held in the tx path either -- but there is a short section
before m_uiotombuf() which does

...
SOCKBUF_LOCK(&so->so_snd);
// check for pending errors, sbspace, so_state
SOCKBUF_UNLOCK(&so->so_snd);
...

(some of this is slightly dubious, but that's another story)


Indeed the lock isn't held across the m_uiotombuf().  You're talking
about filling a sockbuf mbuf cache while holding the lock?


But the other consideration is that one could defer the mbuf allocation
to a later time when the packet is actually built (or anyways
right before the thread returns).
What i envision (and this would fit nicely with netmap) is the following:
- have a (possibly readonly) template for the headers (MAC+IP+UDP)
   attached to the socket, built on demand, and cached and managed
   with similar invalidation rules as used by fastforward;


That would require cross-pointering the rtentry and whatnot again.


i was planning to keep a copy, not a reference. If the copy becomes
temporarily stale, no big deal, as long as you can detect it reasonably
quickly -- routes are not guaranteed to be correct, anyways.


Be wary of disappearing interface pointers...


- possibly extend the pru_send interface so one can pass down the uio
   instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
   where the code already has an x-lock on some resource (could be
   the snd_buf, the interface, ...) so the allocation comes for free.


ETOOCOMPLEXOVERTIME.


maybe. But i want to investigate this.


I fail to see what passing down the uio would gain you.  The snd_buf lock
isn't obtained again after the copyin.  Not that I want to prevent you
from investigating other ways. ;)

--
Andre
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org