Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 03:30:18PM +0200, Luigi Rizzo wrote:

I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting.

...

I have summarized the info in this thread in the camera-ready version of an upcoming USENIX paper, which you can find here: http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf

cheers
luigi

___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
RE: Some performance measurements on the FreeBSD network stack
Hi,

On Tue, 24 Apr 2012, 17:40-, Li, Qing wrote: Yup, all good points. In fact we have considered all of these while doing the work. In case you haven't seen it already, we did write about these issues in our paper and how we tried to address those, flow-table was one of the solutions. http://dl.acm.org/citation.cfm?id=1592641

Is this article available for those without an ACM subscription?

-- Maxim Konovalov
Re: Some performance measurements on the FreeBSD network stack
On Wed, Apr 25, 2012 at 01:22:06PM +0400, Maxim Konovalov wrote: Hi, On Tue, 24 Apr 2012, 17:40-, Li, Qing wrote: Yup, all good points. In fact we have considered all of these while doing the work. In case you haven't seen it already, we did write about these issues in our paper and how we tried to address those, flow-table was one of the solutions. http://dl.acm.org/citation.cfm?id=1592641 Is this article available for those without an ACM subscription?

Tip: feed the citation from the abstract to Google; the third link is: http://conferences.sigcomm.org/sigcomm/2009/workshops/presto/papers/p37.pdf
Re: Some performance measurements on the FreeBSD network stack
Because there were leaks, there were 100% panics for IPv6, ... at least in the version I had seen in autumn last year. There is certainly no one more interested in these than me, esp. for v6, where the removal of route caching a long time ago made nd6_nud_hint() a NOP with dst and rt being passed down as NULL only, and where we are doing up to three route lookups in the output path if no cached rt is passed down from the ULP. If there is an updated patch, I'd love to see it.

Ok, I'm following up as this seems to be getting some interest. This is the relevant part of the last mail that I received from you, the final part having been dedicated to the narrow potential ABI changes that were to make it into the release.

From: Bjoern A. Zeeb b...@freebsd.org
Date: Mon, Sep 19, 2011 at 3:19 PM
To: K. Macy km...@freebsd.org
Cc: Robert Watson rwat...@freebsd.org, rysto32 ryst...@gmail.com, Qing Li qin...@freebsd.org

Sorry it's taking me so long, between travelling and now being back home again. I have yet to find a code path through IPv6 that will a) not panic on INVARIANTS and b) actually update the inp_lle cache. Once I stop finding the next hiccup going one step deeper into the stack (and I made it to if_ethersubr.c), I'll get to legacy IP and the beef, and I hope that all of you will have reviewed and tested that thoroughly.

Checking whether a similar problem would exist in v4, I however found a possible lle reference leak in the legacy IP path as well. There's also a missed place where we do not update the generation counter (a kind of pointless place, but still to do for completeness). I am also pondering why we are not always invalidating the ro_lle cache (when we update the ro_rt entry in the callgraph after tcp_output). I wonder if we can provoke strange results, say, changing the default route from something connected on interface 1 to interface 2. ...

/bz
-- Bjoern A. Zeeb  You have to have visions! Stop bit received.
Insert coin for new address family.

===

The only comment in here which was sufficiently specific to actually take action on was: pondering why we are not always invalidating the ro_lle cache (when we update the ro_rt entry in the callgraph after tcp_output). Which was subsequently addressed by ensuring that the LLE_VALID flag was actually meaningful, by clearing it when the llentry is removed from the interface's hash table, in an unrelated commit made because of weird behaviour observed with the flow.

a) Where is the possible leak in the legacy path?
b) Where were the panics in v6?

In light of the fact that I don't, or at least didn't, have any means of testing v6 (I can probably get a testbed set up at iX now), and the netinet6-specific portions of the patch consist of 4 lines of code which should really be entrusted to you given that your performance parity work for v6 has actively been funded, it was clearly a mistake to tie the fate of the patch as a whole to those narrow bits. Once I get a response to a) and b) I'll follow up with a patch against head. I'm sure whatever I had has bitrotted somewhat in the meantime.

Thanks for your help,
Kip
Re: Some performance measurements on the FreeBSD network stack
On 25. Apr 2012, at 15:45 , K. Macy wrote: a) Where is the possible leak in the legacy path?

It's been somewhere in ip_output(), in one of the possible combinations going through the code flow. I'd probably need to apply a patch to a tree to get there again. It's been more than 6 months for me as well. I think it was related to the flowtable path, but I could completely misremember.

b) Where were the panics in v6?

Again, completely quoting from memory: I think the problem was that the INVARIANTS check in what's currently nd6_output_lle() was hit given both the rtentry and llentry were passed in but no *chain. Fixing this seems trivial, even when trying to keep the current invariants checked. However, the bigger problem then was that the cached value was never updated, as the *ro passed down had been lost on the way. Whatever came then is again off the top of my head without the patch in front of me.

Btw. you don't need more than two machines connected, virtual or not, at worst two vnet instances on a lab machine, to enable and do IPv6. No need for global connectivity at all, as would not be required for IPv4 either. If you can get the patch updated to apply to a modern HEAD and compile (even if as-is), I'll try to help solving those with my best (though limited) availability, to help you get that thing in.

/bz
-- Bjoern A. Zeeb  You have to have visions! It does not matter how good you are. It matters what good you do!
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 22:46, Luigi Rizzo wrote: On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote: On 19.04.2012 15:30, Luigi Rizzo wrote: I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting.

- another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes huge contention when multiple cores are involved.

This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) ...

I was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets?

I've completed updating the routing table rmlock patch. There are two steps. Step one is just changing the rwlock to an rmlock. Step two streamlines the route lookup in ip_output and ip_fastfwd by copying out the relevant data while only holding the rmlock instead of obtaining a reference to the route. It would be very interesting to see how your benchmark/profiling changes with these patches applied.

http://svn.freebsd.org/changeset/base/234649
Log: Change the radix head lock to an rmlock (read mostly lock). There is some header pollution going on because rmlocks are not entirely abstracted and need per-CPU structures. A comment in _rmlock.h says this can be hidden if there were per-cpu linker magic/support. I don't know if we have that already.

http://svn.freebsd.org/changeset/base/234650
Log: Add a function rtlookup() that copies out the relevant information from an rtentry instead of returning the rtentry. This avoids the need to lock the rtentry and to increase the refcount on it. Convert ip_output() to use rtlookup() in a simplistic way.
Certain seldom-used functionality may not work anymore and the flowtable isn't available at the moment. Convert ip_fastfwd() to use rtlookup(). This code is meant to be used for profiling and to be experimented with further to determine which locking strategy returns the best results.

Make sure to apply this one as well:

http://svn.freebsd.org/changeset/base/234648
Log: Add INVARIANT and WITNESS support to rm_lock locks and optimize the synchronization path by replacing a LIST of active readers with a TAILQ.
Obtained from: Isilon
Submitted by: mlaier

-- Andre
Re: Some performance measurements on the FreeBSD network stack
On Tue, Apr 24, 2012 at 03:16:48PM +0200, Andre Oppermann wrote: On 19.04.2012 22:46, Luigi Rizzo wrote: On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote: On 19.04.2012 15:30, Luigi Rizzo wrote: I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting. - another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes huge contention when multiple cores are involved. This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) ... I was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets? I've completed updating the routing table rmlock patch. There are two steps. Step one is just changing the rwlock to an rmlock. Step two streamlines the route lookup in ip_output and ip_fastfwd by copying out the relevant data while only holding the rmlock instead of obtaining a reference to the route. It would be very interesting to see how your benchmark/profiling changes with these patches applied.

If you want to give it a try yourself, the high-level benchmark is just the 'netsend' program from tools/tools/netrate/netsend -- I am running something like

for i in $X ; do
	netsend 10.0.0.2 18 0 5 &
done

and the cardinality of $X can be used to test contention on the low layers (routing tables and interface/queues). From previous tests, the difference between flowtable and routing table was small with a single process (about 5% or 50ns of the total packet processing time, if I remember well), but there was a large gain with multiple concurrent processes.
Probably the change in throughput between HEAD and your branch is all you need. The info below shows that your gain is somewhere around 100-200 ns, depending on how good the info you return back is (see below). My profiling changes were mostly aimed at charging the costs to the various layers. With my current setting (single process, i7-870 @ 2933 MHz + Turboboost, ixgbe, FreeBSD HEAD, FLOWTABLE enabled, UDP) I see the following:

File             Function/description    Total  delta  (nanoseconds)
user program     sendto()                    8     96  system call
uipc_syscalls.c  sys_sendto                104
uipc_syscalls.c  sendit                    111
uipc_syscalls.c  kern_sendit               118
uipc_socket.c    sosend
uipc_socket.c    sosend_dgram              146    137  sockbuf locking, mbuf alloc, copyin
udp_usrreq.c     udp_send                  273
udp_usrreq.c     udp_output                273     57
ip_output.c      ip_output                 330    198  route lookup, ip header setup
if_ethersubr.c   ether_output              528    162  MAC header lookup and construction, loopback checks
if_ethersubr.c   ether_output_frame        690
ixgbe.c          ixgbe_mq_start            698
ixgbe.c          ixgbe_mq_start_locked     720
ixgbe.c          ixgbe_xmit                730    220  mbuf mangling, device programming
--               packet on the wire        950

Removing the flowtable increases the cost in ip_output() (obviously) but also in ether_output() (because the route does not have an lle entry, so you need to call arpresolve on each packet). It also causes trouble in the device driver, because the mbuf does not have a flowid set, so the ixgbe device driver puts the packet on the queue corresponding to the current CPU. If the process (as in my case) floats, one flow might end up on multiple queues.

So in revising the route lookup I believe it would be good if we could also get at once most of the info that ether_output() is computing again and again.

cheers
luigi
RE: Some performance measurements on the FreeBSD network stack
From previous tests, the difference between flowtable and routing table was small with a single process (about 5% or 50ns of the total packet processing time, if I remember well), but there was a large gain with multiple concurrent processes.

Yes, that sounds about right from when we did the tests a long while ago.

Removing the flowtable increases the cost in ip_output() (obviously) but also in ether_output() (because the route does not have an lle entry, so you need to call arpresolve on each packet).

Yup.

So in revising the route lookup I believe it would be good if we could also get at once most of the info that ether_output() is computing again and again.

Well, the routing table no longer maintains any lle info, so there isn't much to copy out of the rtentry at the completion of a route lookup. If I understood you correctly, you do believe there is a lot of value in the flowtable caching concept, but you are not suggesting we revert back to having the routing table maintain L2 entries, are you?

--Qing
Re: Some performance measurements on the FreeBSD network stack
On Tue, Apr 24, 2012 at 4:16 PM, Li, Qing qing...@bluecoat.com wrote: From previous tests, the difference between flowtable and routing table was small with a single process (about 5% or 50ns in the total packet processing time, if i remember well), but there was a large gain with multiple concurrent processes. Yes, that sounds about right when we did the tests a long while ago. Removing flowtable increases the cost in ip_output() (obviously) but also in ether_output() (because the route does not have a lle entry so you need to call arpresolve on each packet). Yup. So in revising the route lookup i believe it would be good if we could also get at once most of the info that ether_output() is computing again and again. Well, the routing table no longer maintains any lle info, so there isn't much to copy out the rtentry at the completion of route lookup. If I understood you correctly, you do believe there is a lot of value in Flowtable caching concept, but you are not suggesting we reverting back to having the routing table maintain L2 entries, are you ? One could try a similar conversion of the L2 table to an rmlock without copy while lock is held. -Kip -- “The real damage is done by those millions who want to 'get by.' The ordinary men who just want to be left in peace. Those who don’t want their little lives disturbed by anything bigger than themselves. Those with no sides and no causes. Those who won’t take measure of their own strength, for fear of antagonizing their own weakness. Those who don’t like to make waves—or enemies. Those for whom freedom, honour, truth, and principles are only literature. Those who live small, love small, die small. It’s the reductionist approach to life: if you keep it small, you’ll keep it under control. If you don’t make any noise, the bogeyman won’t find you. But it’s all an illusion, because they die too, those people who roll up their spirits into tiny little balls so as to be safe. Safe?! From what? 
Life is always on the edge of death; narrow streets lead to the same place as wide avenues, and a little candle burns itself out just like a flaming torch does. I choose my own way to burn.” Sophie Scholl
Re: Some performance measurements on the FreeBSD network stack
On Tue, Apr 24, 2012 at 5:03 PM, K. Macy km...@freebsd.org wrote: On Tue, Apr 24, 2012 at 4:16 PM, Li, Qing qing...@bluecoat.com wrote: From previous tests, the difference between flowtable and routing table was small with a single process (about 5% or 50ns in the total packet processing time, if i remember well), but there was a large gain with multiple concurrent processes. Yes, that sounds about right when we did the tests a long while ago. Removing flowtable increases the cost in ip_output() (obviously) but also in ether_output() (because the route does not have a lle entry so you need to call arpresolve on each packet). Yup. So in revising the route lookup i believe it would be good if we could also get at once most of the info that ether_output() is computing again and again. Well, the routing table no longer maintains any lle info, so there isn't much to copy out the rtentry at the completion of route lookup. If I understood you correctly, you do believe there is a lot of value in Flowtable caching concept, but you are not suggesting we reverting back to having the routing table maintain L2 entries, are you ? One could try a similar conversion of the L2 table to an rmlock without copy while lock is held. Odd .. *with* copy while lock is held. -Kip
Re: Some performance measurements on the FreeBSD network stack
On Tue, Apr 24, 2012 at 02:16:18PM +, Li, Qing wrote: From previous tests, the difference between flowtable and routing table was small with a single process (about 5% or 50ns in the total packet processing time, if I remember well), but there was a large gain with multiple concurrent processes. Yes, that sounds about right when we did the tests a long while ago. Removing flowtable increases the cost in ip_output() (obviously) but also in ether_output() (because the route does not have an lle entry so you need to call arpresolve on each packet). Yup. So in revising the route lookup I believe it would be good if we could also get at once most of the info that ether_output() is computing again and again. Well, the routing table no longer maintains any lle info, so there isn't much to copy out of the rtentry at the completion of a route lookup. If I understood you correctly, you do believe there is a lot of value in the flowtable caching concept, but you are not suggesting we revert back to having the routing table maintain L2 entries, are you?

I see a lot of value in caching in general. Especially for a bound socket it seems pointless to look up the route, iface and MAC address(es) on every single packet instead of caching them. And routes and MAC addresses are volatile anyway, so making sure that we do the lookup 1us closer to the actual use gives no additional guarantee.

The frequency with which this info (routes and MAC addresses) changes clearly influences the mechanism to validate the cache. I suppose we have the following options:

- direct notification: a failure in a direct chain of calls can be used to invalidate the info cached in the socket. Similarly, some incoming traffic (e.g. TCP RST, FIN, ICMP messages) that reaches a socket can invalidate the cached values;
- assume a minimum lifetime for the info (I think this is what happens in the flowtable) and flush it unconditionally every such interval (say 10ms);
- if some info changes infrequently (e.g. MAC addresses), one could put a version number in the cached value and use it to validate the cache.

cheers
luigi
Re: Some performance measurements on the FreeBSD network stack
On Tue, Apr 24, 2012 at 6:34 PM, Luigi Rizzo ri...@iet.unipi.it wrote: On Tue, Apr 24, 2012 at 02:16:18PM +, Li, Qing wrote: From previous tests, the difference between flowtable and routing table was small with a single process (about 5% or 50ns in the total packet processing time, if i remember well), but there was a large gain with multiple concurrent processes. Yes, that sounds about right when we did the tests a long while ago. Removing flowtable increases the cost in ip_output() (obviously) but also in ether_output() (because the route does not have a lle entry so you need to call arpresolve on each packet). Yup. So in revising the route lookup i believe it would be good if we could also get at once most of the info that ether_output() is computing again and again. Well, the routing table no longer maintains any lle info, so there isn't much to copy out the rtentry at the completion of route lookup. If I understood you correctly, you do believe there is a lot of value in Flowtable caching concept, but you are not suggesting we reverting back to having the routing table maintain L2 entries, are you ? I see a lot of value in caching in general. Especially for a bound socket it seems pointless to lookup the route, iface and mac address(es) on every single packet instead of caching them. And, routes and MAC addresses are volatile anyways so making sure that we do the lookup 1us closer to the actual use gives no additional guarantee. The frequency with which these info (routes and MAC addresses) change clearly influences the mechanism to validate the cache. I suppose we have the following options: - direct notification: a failure in a direct chain of calls can be used to invalidate the info cached in the socket. Similarly, some incoming traffic (e.g. 
TCP RST, FIN, ICMP messages) that reach a socket can invalidate the cached values - assume a minimum lifetime for the info (i think this is what happens in the flowtable) and flush it unconditionally every such interval (say 10ms). - if some info changes infrequently (e.g. MAC addresses) one could put a version number in the cached value and use it to validate the cache.

I have a patch that has been sitting around for a long time due to review cycle latency that caches a pointer to the rtentry (and llentry) in the inpcb. Before each use the rtentry is checked against a generation number in the routing tree that is incremented on every routing table update.
Re: Some performance measurements on the FreeBSD network stack
I have a patch that has been sitting around for a long time due to review cycle latency that caches a pointer to the rtentry (and llentry) in the inpcb. Before each use the rtentry is checked against a generation number in the routing tree that is incremented on every routing table update.

Hi Kip,

Is there a public location for the patch? What can be done to speed up the commit: testing?

Fabien
RE: Some performance measurements on the FreeBSD network stack
Yup, all good points. In fact we have considered all of these while doing the work. In case you haven't seen it already, we did write about these issues in our paper and how we tried to address those, flow-table was one of the solutions. http://dl.acm.org/citation.cfm?id=1592641 --Qing Well, the routing table no longer maintains any lle info, so there isn't much to copy out the rtentry at the completion of route lookup. If I understood you correctly, you do believe there is a lot of value in Flowtable caching concept, but you are not suggesting we reverting back to having the routing table maintain L2 entries, are you ? I see a lot of value in caching in general. Especially for a bound socket it seems pointless to lookup the route, iface and mac address(es) on every single packet instead of caching them. And, routes and MAC addresses are volatile anyways so making sure that we do the lookup 1us closer to the actual use gives no additional guarantee. The frequency with which these info (routes and MAC addresses) change clearly influences the mechanism to validate the cache. I suppose we have the following options: - direct notification: a failure in a direct chain of calls can be used to invalidate the info cached in the socket. Similarly, some incoming traffic (e.g. TCP RST, FIN, ICMP messages) that reach a socket can invalidate the cached values - assume a minimum lifetime for the info (i think this is what happens in the flowtable) and flush it unconditionally every such interval (say 10ms). - if some info changes infrequently (e.g. MAC addresses) one could put a version number in the cached value and use it to validate the cache. cheers luigi
RE: Some performance measurements on the FreeBSD network stack
I have a patch that has been sitting around for a long time due to review cycle latency that caches a pointer to the rtentry (and llentry) in the the inpcb. Before each use the rtentry is checked against a generation number in the routing tree that is incremented on every routing table update. Hi Kip, Is there a public location for the patch ? What can be done to speedup the commit: testing ? Fabien I performed extensive review of this patch from Kip, and it was ready to go. Really good work. Not sure what is stopping its commit into the tree. --Qing
Re: Some performance measurements on the FreeBSD network stack
On 24. Apr 2012, at 17:42 , Li, Qing wrote: I have a patch that has been sitting around for a long time due to review cycle latency that caches a pointer to the rtentry (and llentry) in the inpcb. Before each use the rtentry is checked against a generation number in the routing tree that is incremented on every routing table update. Hi Kip, Is there a public location for the patch? What can be done to speed up the commit: testing? Fabien I performed extensive review of this patch from Kip, and it was ready to go. Really good work. Not sure what is stopping its commit into the tree.

Because there were leaks, there were 100% panics for IPv6, ... at least in the version I had seen in autumn last year. There is certainly no one more interested in these than me, esp. for v6, where the removal of route caching a long time ago made nd6_nud_hint() a NOP with dst and rt being passed down as NULL only, and where we are doing up to three route lookups in the output path if no cached rt is passed down from the ULP. If there is an updated patch, I'd love to see it.

/bz
-- Bjoern A. Zeeb  You have to have visions! It does not matter how good you are. It matters what good you do!
Re: Some performance measurements on the FreeBSD network stack
Most of these issues are well known. Addressing the bottlenecks is simply time-consuming, due to the fact that any bugs introduced during development potentially impact many users.

-Kip

On Sun, Apr 22, 2012 at 4:14 AM, Adrian Chadd adr...@freebsd.org wrote: Hi, This honestly sounds like it's begging for an instrumentation/analysis/optimisation project. What do we need to do? Adrian
Re: Some performance measurements on the FreeBSD network stack
On Fri, 20 Apr 2012, K. Macy wrote: On Fri, Apr 20, 2012 at 4:44 PM, Luigi Rizzo ri...@iet.unipi.it wrote: The small penalty when flowtable is disabled but compiled in is probably because the net.flowtable.enable flag is checked a bit deep in the code. The advantage with non-connect()ed sockets is huge. I don't quite understand why disabling the flowtable still helps there. Do you mean having it compiled in but disabled still helps performance? Yes, that is extremely strange. This reminds me that when I worked on this, I saw very large throughput differences (in the 20-50% range) as a result of minor changes in unrelated code. I could get these changes intentionally by adding or removing padding in unrelated unused text space, so the differences were apparently related to text alignment. I thought I had some significant micro-optimizations, but it turned out that they were acting mainly by changing the layout in related used text space where it is harder to control. Later, I suspected that the differences were more due to cache misses for data than for text. The CPU and its caching must affect this significantly. I tested on an AthlonXP and Athlon64, and the differences were larger on the AthlonXP. Both of these have a shared I/D cache so pressure on the I part would affect the D part, but in this benchmark the D part is much more active than the I part so it is unclear how text layout could have such a large effect. Anyway, the large differences made it impossible to trust the results of benchmarking any single micro-benchmark. Also, ministat is useless for understanding the results. (I note that luigi didn't provide any standard deviations and neither would I. :-). My results depended on the cache behaviour but didn't change significantly when rerun, unless the code was changed. Bruce
Re: Some performance measurements on the FreeBSD network stack
Hi, This honestly sounds like it's begging for an instrumentation/analysis/optimisation project. What do we need to do? Adrian
Re: Some performance measurements on the FreeBSD network stack
On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote: On 20.04.2012 00:03, Luigi Rizzo wrote: On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote: On 19.04.2012 22:46, Luigi Rizzo wrote: The allocation happens while the code has already an exclusive lock on so->snd_buf so a pool of fresh buffers could be attached there. Ah, there it is not necessary to hold the snd_buf lock while doing the allocate+copyin. With soreceive_stream() (which is ...) it is not held in the tx path either -- but there is a short section before m_uiotombuf() which does ... SOCKBUF_LOCK(so->so_snd); // check for pending errors, sbspace, so_state SOCKBUF_UNLOCK(so->so_snd); ... (some of this is slightly dubious, but that's another story) Indeed the lock isn't held across the m_uiotombuf(). You're talking about filling a sockbuf mbuf cache while holding the lock? all i am thinking is that when we have a serialization point we could use it for multiple related purposes. In this case yes we could keep a small mbuf cache attached to so_snd. When the cache is empty either get a new batch (say 10-20 bufs) from the zone allocator, possibly dropping and regaining the lock if the so_snd lock must be a leaf. Besides for protocols like TCP (does it use the same path ?) the mbufs are already there (released by incoming acks) in the steady state, so it is not even necessary to refill the cache. This said, i am not 100% sure that the 100ns I am seeing are all spent in the zone allocator. As i said the chain of indirect calls and other ops is rather long on both acquire and release. But the other consideration is that one could defer the mbuf allocation to a later time when the packet is actually built (or anyways right before the thread returns). 
What i envision (and this would fit nicely with netmap) is the following: - have a (possibly readonly) template for the headers (MAC+IP+UDP) attached to the socket, built on demand, and cached and managed with similar invalidation rules as used by fastforward; That would require cross-pointering the rtentry and whatnot again. i was planning to keep a copy, not a reference. If the copy becomes temporarily stale, no big deal, as long as you can detect it reasonably quickly -- routes are not guaranteed to be correct, anyways. Be wary of disappearing interface pointers... (this reminds me, what prevents a route grabbed from the flowtable from disappearing and releasing the ifp reference ?) In any case, it seems better to keep a more persistent ifp reference in the socket rather than grab and release one on every single packet transmission. - possibly extend the pru_send interface so one can pass down the uio instead of the mbuf; - make an opportunistic buffer allocation in some place downstream, where the code already has an x-lock on some resource (could be the snd_buf, the interface, ...) so the allocation comes for free. ETOOCOMPLEXOVERTIME. maybe. But i want to investigate this. I fail to see what passing down the uio would gain you. The snd_buf lock isn't obtained again after the copyin. Not that I want to prevent you from investigating other ways. ;) maybe it can open the way to other optimizations, such as reducing the number of places where you need to lock, or save some data copies, or reduce fragmentation, etc. cheers luigi
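The batched-refill idea above (keep a small mbuf cache on so_snd, refill from the zone allocator 10-20 at a time) can be sketched in userspace. Everything here is a hypothetical illustration: the names (sockbuf_cache, cache_get, CACHE_BATCH) are invented, and plain calloc() stands in for the zone allocator.

```c
#include <assert.h>
#include <stdlib.h>

#define CACHE_BATCH 16          /* hypothetical batch size, "say 10-20 bufs" */

struct mbuf { struct mbuf *m_next; };

/* Small free list attached to the (already locked) send buffer. */
struct sockbuf_cache {
    struct mbuf *head;
    int count;
};

/* Stand-in for the zone allocator. */
static struct mbuf *zone_alloc(void) { return calloc(1, sizeof(struct mbuf)); }

/* Refill in one batch so the allocator's long acquire/release path is
 * taken once per CACHE_BATCH packets instead of once per packet. */
static void cache_refill(struct sockbuf_cache *c)
{
    while (c->count < CACHE_BATCH) {
        struct mbuf *m = zone_alloc();
        m->m_next = c->head;
        c->head = m;
        c->count++;
    }
}

/* Called with the sockbuf lock already held, as in sosend_dgram(). */
static struct mbuf *cache_get(struct sockbuf_cache *c)
{
    if (c->head == NULL)
        cache_refill(c);
    struct mbuf *m = c->head;
    c->head = m->m_next;
    c->count--;
    return m;
}

/* Freed mbufs (e.g. released by incoming TCP acks) go straight back,
 * so in steady state the cache never needs refilling. */
static void cache_put(struct sockbuf_cache *c, struct mbuf *m)
{
    m->m_next = c->head;
    c->head = m;
    c->count++;
}
```

In the real stack the refill would have to decide whether dropping and regaining the sockbuf lock is acceptable, which the sketch sidesteps.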
Re: Some performance measurements on the FreeBSD network stack
On 20.04.2012 01:12, Andre Oppermann wrote: On 19.04.2012 22:34, K. Macy wrote: This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which This only helps if your flows aren't hitting the same rtentry. Otherwise you still convoy on the lock for the rtentry itself to increment and decrement the rtentry's reference count. The rtentry lock isn't obtained anymore. While the rmlock read lock is held on the rtable the relevant information like ifp and such is copied out. No later referencing possible. In the end any referencing of an rtentry would be forbidden and the rtentry lock can be removed. The second step can be optional though. i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ? If the number of peers is bounded then you can use the flowtable. Max PPS is much higher bypassing routing lookup. However, it doesn't scale to arbitrary flow numbers. From my experience, turning fastfwd on gives ~20-30% performance increase (10G forwarding with firewalling, 1.4MPPS). ip_forward() uses 2 lookups (ip_rtaddr + ip_output) vs 1 ip_fastfwd(). The worst current problem IMHO is the number of locks a packet has to traverse, not the number of lookups. In theory an rmlock-only lookup into a default-route-only routing table would be faster than creating a flowtable entry for every destination. It's a matter of churn though. The flowtable isn't lockless in itself, is it? -- WBR, Alexander
Re: Some performance measurements on the FreeBSD network stack
On 20.04.2012 10:26, Alexander V. Chernikov wrote: On 20.04.2012 01:12, Andre Oppermann wrote: On 19.04.2012 22:34, K. Macy wrote: If the number of peers is bounded then you can use the flowtable. Max PPS is much higher bypassing routing lookup. However, it doesn't scale [to arbitrary flow numbers]. From my experience, turning fastfwd on gives ~20-30% performance increase (10G forwarding with firewalling, 1.4MPPS). ip_forward() uses 2 lookups (ip_rtaddr + ip_output) vs 1 ip_fastfwd(). Another difference is the packet copy the normal forwarding path does to be able to send an ICMP redirect message if the packet is forwarded to a different gateway on the same LAN. fastforward doesn't do that. The worst current problem IMHO is the number of locks a packet has to traverse, not the number of lookups. Agreed. Actually the locking in itself is not the problem. It's the side effects of cache line dirtying/bouncing and contention. However in the great majority of the cases the data protected by the lock is only read, not modified, making a 'full' lock expensive. -- Andre
Re: Some performance measurements on the FreeBSD network stack
On 20.04.2012 08:35, Luigi Rizzo wrote: On Fri, Apr 20, 2012 at 12:37:21AM +0200, Andre Oppermann wrote: On 20.04.2012 00:03, Luigi Rizzo wrote: On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote: On 19.04.2012 22:46, Luigi Rizzo wrote: The allocation happens while the code has already an exclusive lock on so->snd_buf so a pool of fresh buffers could be attached there. Ah, there it is not necessary to hold the snd_buf lock while doing the allocate+copyin. With soreceive_stream() (which is ...) it is not held in the tx path either -- but there is a short section before m_uiotombuf() which does ... SOCKBUF_LOCK(so->so_snd); // check for pending errors, sbspace, so_state SOCKBUF_UNLOCK(so->so_snd); ... (some of this is slightly dubious, but that's another story) Indeed the lock isn't held across the m_uiotombuf(). You're talking about filling a sockbuf mbuf cache while holding the lock? all i am thinking is that when we have a serialization point we could use it for multiple related purposes. In this case yes we could keep a small mbuf cache attached to so_snd. When the cache is empty either get a new batch (say 10-20 bufs) from the zone allocator, possibly dropping and regaining the lock if the so_snd lock must be a leaf. Besides for protocols like TCP (does it use the same path ?) the mbufs are already there (released by incoming acks) in the steady state, so it is not even necessary to refill the cache. I'm sure things can be tuned towards particular cases but almost always that comes at the expense of versatility. I was looking at netmap for a project. It's great when there is one thing being done by one process at great speed. However as soon as I have to dispatch certain packets somewhere else for further processing, in another process, things quickly become complicated and fall apart. It would have meant replicating what the kernel does with protosw and friends in userspace, coated with IPC. Not to mention re-inventing the socket layer abstraction again. 
So netmap is fantastic for simple, bulk and repetitive tasks with little variance. Things like packet routing, bridging, encapsulation, perhaps inspection and acting as a traffic sink/source. There are plenty of use cases for that. Coming back to your UDP test case, while the 'hacks' you propose may benefit the bulk sending of a bound socket, they may not help, or may even pessimize, the DNS server case where a large number of packets is sent to a large number of destinations. The layering abstractions we have in BSD are excellent and have served us quite well so far. Adding new protocols is a simple task and so on. Of course it has some trade-offs by having some indirections and not being bare-metal fast. Yes, there is a lot of potential in optimizing the locking strategies we currently have within the BSD network stack layering. Your profiling work is immensely helpful in identifying where to aim at. Once that is fixed we should stop there. Anyone who needs a particular, as-close-to-the-bare-metal-as-possible UDP packet blaster should fork the tree and do their own short-cuts and whatnot. But FreeBSD should stay a reasonable general-purpose OS. It won't be a Ferrari, but an Audi S6 is a damn nice car as well and it can carry your whole family. :) This said, i am not 100% sure that the 100ns I am seeing are all spent in the zone allocator. As i said the chain of indirect calls and other ops is rather long on both acquire and release. But the other consideration is that one could defer the mbuf allocation to a later time when the packet is actually built (or anyways right before the thread returns). What i envision (and this would fit nicely with netmap) is the following: - have a (possibly readonly) template for the headers (MAC+IP+UDP) attached to the socket, built on demand, and cached and managed with similar invalidation rules as used by fastforward; That would require cross-pointering the rtentry and whatnot again. i was planning to keep a copy, not a reference. 
If the copy becomes temporarily stale, no big deal, as long as you can detect it reasonably quickly -- routes are not guaranteed to be correct, anyways. Be wary of disappearing interface pointers... (this reminds me, what prevents a route grabbed from the flowtable from disappearing and releasing the ifp reference ?) It has to keep a refcounted reference to the rtentry. In any case, it seems better to keep a more persistent ifp reference in the socket rather than grab and release one on every single packet transmission. The socket doesn't and shouldn't know anything about ifp's. - possibly extend the pru_send interface so one can pass down the uio instead of the mbuf; - make an opportunistic buffer allocation in some place downstream, where the code already has an x-lock on some resource (could be the snd_buf, the interface, ...) so the allocation comes for free. ETOOCOMPLEXOVERTIME.
Re: Some performance measurements on the FreeBSD network stack
On Thursday, April 19, 2012 4:46:22 pm Luigi Rizzo wrote: What might be moderately expensive are the critical_enter()/critical_exit() calls around individual allocations. The allocation happens while the code has already an exclusive lock on so->snd_buf so a pool of fresh buffers could be attached there. Keep in mind that in the common case critical_enter() and critical_exit() should be very cheap as they should just do td->td_critnest++ and td->td_critnest--. critical_enter() should probably be inlined if KTR is not enabled. -- John Baldwin
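The cheap path John describes reduces to a nesting counter on curthread. A userspace model of just that fast path (the field name mirrors the kernel's td_critnest; the struct, the global, and the preemption hook are simplifications, not the real implementation):

```c
#include <assert.h>

/* Minimal stand-in for struct thread. */
struct thread { int td_critnest; };

static struct thread td0;           /* plays the role of curthread */
#define curthread (&td0)

/* With KTR disabled this is just an increment... */
static inline void critical_enter(void)
{
    curthread->td_critnest++;
}

/* ...and this a decrement; only when the count drops back to zero
 * would the kernel check whether a deferred preemption is owed. */
static inline void critical_exit(void)
{
    assert(curthread->td_critnest > 0);
    curthread->td_critnest--;
    if (curthread->td_critnest == 0) {
        /* deferred-preemption check would go here */
    }
}
```

Nesting works naturally: two enters require two exits before the zero-count check fires, which is why the common case stays at a couple of arithmetic instructions.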
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:06:38PM +0200, K. Macy wrote: On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo ri...@iet.unipi.it wrote: On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote: This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which This only helps if your flows aren't hitting the same rtentry. Otherwise you still convoy on the lock for the rtentry itself to increment and decrement the rtentry's reference count. i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ? actually, now that i look at the code, both ip_output() and the ip_fastforward code use the same in_rtalloc_ign(...) If the number of peers is bounded then you can use the flowtable. Max PPS is much higher bypassing routing lookup. However, it doesn't scale to arbitrary flow numbers. re. flowtable, could you point me to what i should do instead of calling in_rtalloc_ign() ? If you build with it in your kernel config and enable the sysctl ip_output will automatically use it for TCP and UDP connections. If you're doing forwarding you'll need to patch the forwarding path. cool. For the record, with netsend 10.0.0.2 ports 18 0 5 on an ixgbe talking to a remote host i get the following results (with a single port netsend does a connect() and then send(), otherwise it loops around a sendto() ):

net.flowtable.enable       port        ns/pkt
---------------------------------------------
not compiled in            5000          944  (M_FLOWID not set)
0 (disable)                5000         1004
1 (enable)                 5000          980
not compiled in            5000-5001    3400  (M_FLOWID not set)
0 (disable)                5000-5001    1418
1 (enable)                 5000-5001    1230

The small penalty when flowtable is disabled but compiled in is probably because the net.flowtable.enable flag is checked a bit deep in the code. The advantage with non-connect()ed sockets is huge. I don't quite understand why disabling the flowtable still helps there. 
cheers luigi
Re: Some performance measurements on the FreeBSD network stack
Comments inline below: On Fri, Apr 20, 2012 at 4:44 PM, Luigi Rizzo ri...@iet.unipi.it wrote: On Thu, Apr 19, 2012 at 11:06:38PM +0200, K. Macy wrote: On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo ri...@iet.unipi.it wrote: On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote: This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which This only helps if your flows aren't hitting the same rtentry. Otherwise you still convoy on the lock for the rtentry itself to increment and decrement the rtentry's reference count. i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ? actually, now that i look at the code, both ip_output() and the ip_fastforward code use the same in_rtalloc_ign(...) If the number of peers is bounded then you can use the flowtable. Max PPS is much higher bypassing routing lookup. However, it doesn't scale to arbitrary flow numbers. re. flowtable, could you point me to what i should do instead of calling in_rtalloc_ign() ? If you build with it in your kernel config and enable the sysctl ip_output will automatically use it for TCP and UDP connections. If you're doing forwarding you'll need to patch the forwarding path. cool. For the record, with netsend 10.0.0.2 ports 18 0 5 on an ixgbe talking to a remote host i get the following results (with a single port netsend does a connect() and then send(), otherwise it loops around a sendto() ) Sorry, 5000 vs 5000-5001 means 1 vs 2 streams? Does this mean for a single socket the overhead is less without it compiled in than with it compiled in but enabled? That is certainly different from what I see with TCP, where I saw a 30% increase in aggregate throughput the last time I tried this (on IPoIB). 
For the record the M_FLOWID is used to pick the transmit queue, so with multiple streams you're best off setting it if your device has more than one hardware device queue.

net.flowtable.enable       port        ns/pkt
---------------------------------------------
not compiled in            5000          944  (M_FLOWID not set)
0 (disable)                5000         1004
1 (enable)                 5000          980
not compiled in            5000-5001    3400  (M_FLOWID not set)
0 (disable)                5000-5001    1418
1 (enable)                 5000-5001    1230

The small penalty when flowtable is disabled but compiled in is probably because the net.flowtable.enable flag is checked a bit deep in the code. The advantage with non-connect()ed sockets is huge. I don't quite understand why disabling the flowtable still helps there. Do you mean having it compiled in but disabled still helps performance? Yes, that is extremely strange. -Kip -- “The real damage is done by those millions who want to 'get by.' The ordinary men who just want to be left in peace. Those who don’t want their little lives disturbed by anything bigger than themselves. Those with no sides and no causes. Those who won’t take measure of their own strength, for fear of antagonizing their own weakness. Those who don’t like to make waves—or enemies. Those for whom freedom, honour, truth, and principles are only literature. Those who live small, love small, die small. It’s the reductionist approach to life: if you keep it small, you’ll keep it under control. If you don’t make any noise, the bogeyman won’t find you. But it’s all an illusion, because they die too, those people who roll up their spirits into tiny little balls so as to be safe. Safe?! From what? Life is always on the edge of death; narrow streets lead to the same place as wide avenues, and a little candle burns itself out just like a flaming torch does. I choose my own way to burn.” Sophie Scholl
Some performance measurements on the FreeBSD network stack
I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting.

Test conditions:
- intel i7-870 CPU running at 2.93 GHz + TurboBoost, all 4 cores enabled, no hyperthreading
- FreeBSD HEAD as of 15 april 2012, no ipfw, no other pfilter clients, no ipv6 or ipsec.
- userspace running 'netsend 10.0.0.2 18 0 5' (output to a physical interface, udp port , small frame, no rate limitations, 5sec experiments)
- the 'ns' column reports the total time divided by the number of successful transmissions; we report the min and max in 5 tests
- 1 to 4 parallel tasks, variable packet sizes
- there are variations in the numbers which become larger as we reach the bottom of the stack

Caveats:
- in the table below, clock and pktlen are constant. I am including the info here so it is easier to compare the results with future experiments
- i have a small number of samples, so i am only reporting the min and the max in a handful of experiments.
- i am only measuring average values over millions of cycles. I have no info on what is the variance between the various executions.
- from what i have seen, numbers vary significantly on different systems, depending on memory speed, caches and other things. The big jumps are significant and present on all systems, but the small deltas (say 5%) are not even statistically significant.
- if someone is interested in replicating the experiments email me and i will post a link to a suitable picobsd image.
- i have not yet instrumented the bottom layers (if_output and below).

The results show a few interesting things:
- the packet-sending application is reasonably fast and certainly not a bottleneck (over 100Mpps before calling the system call);
- the system call is somewhat expensive, about 100ns. 
I am not sure where the time is spent (the amd64 code does a few pushes on the stack and then runs syscall, followed by a sysret). I am not sure how much room for improvement there is in this area. The relevant code is in lib/libc/i386/SYS.h and lib/libc/i386/sys/syscall.S (KERNCALL translates to syscall on amd64, and int 0x80 on the i386)
- the next expensive operation, consuming another 100ns, is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator seems to scale decently at least with 4 cores. The copyin() is relatively inexpensive (not reported in the data below, but disabling it saves only 15-20ns for a short packet). I have not followed the details, but the allocator calls the zone allocator and there is at least one critical_enter()/critical_exit() pair, and the highly modular architecture invokes long chains of indirect function calls both on allocation and release. It might make sense to keep a small pool of mbufs attached to the socket buffer instead of going to the zone allocator. Or defer the actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
- another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes huge contentions when multiple cores are involved.

There is other bad stuff occurring in if_output() and below (on this system it takes about 1300ns to send one packet even with one core, and only 500-550 are consumed before the call to if_output()) but i don't have detailed information yet.

POS  CPU  clock  pktlen  ns/pkt (min  max)  EXIT POINT
------------------------------------------------------
 U    1   2934     18      88               userspace, before the send() call
[ syscall ]
20    1   2934     18     103   107         sys_sendto(): begin
20    4   2934     18     104   107
21    1   2934     18     110   113         sendit(): begin
21    4   2934     18     111   116
22    1   2934     18     110   114         sendit() after getsockaddr(to, ...)
22    4   2934     18     111   124
23    1   2934     18     111   115         sendit() before kern_sendit
23    4   2934     18     112   120
24    1   2934     18     117   120         kern_sendit() after AUDIT_ARG_FD
24    4   2934     18     117   121
25    1   2934     18     134   140         kern_sendit() before sosend()
25    4   2934     18     134   146
40    1   2934     18     144   149         sosend_dgram(): start
40    4   2934     18     144   151
41    1   2934     18     157   166         sosend_dgram() before m_uiotombuf()
41    4   2934     18     157   168
[ mbuf allocation and copy. The copy is relatively cheap ]
42    1   2934     18     264   268         sosend_dgram() after m_uiotombuf()
42    4   2934     18     265   269
30    1   2934     18     273   276         udp_send() begin
30    4   2934     18     274   278
[ here we start seeing some contention with multiple threads ]
31    1   2934     18     323   324         udp_output() before ip_output()
31    4   2934     18     344   348
50    1
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 03:30:18PM +0200, Luigi Rizzo wrote: I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting. I did some tests in 2011. They may or may not still be relevant. Initial message http://lists.freebsd.org/pipermail/freebsd-performance/2011-January/004156.html UDP socket in FreeBSD http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004176.html About 4BSD/ULE http://lists.freebsd.org/pipermail/freebsd-performance/2011-February/004181.html
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 15:30, Luigi Rizzo wrote: I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting. Jumping over very interesting analysis... - the next expensive operation, consuming another 100ns, is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator seems to scale decently at least with 4 cores. The copyin() is relatively inexpensive (not reported in the data below, but disabling it saves only 15-20ns for a short packet). I have not followed the details, but the allocator calls the zone allocator and there is at least one critical_enter()/critical_exit() pair, and the highly modular architecture invokes long chains of indirect function calls both on allocation and release. It might make sense to keep a small pool of mbufs attached to the socket buffer instead of going to the zone allocator. Or defer the actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways. The UMA mbuf allocator is certainly not perfect but rather good. It has a per-CPU cache of mbufs that are very fast to allocate from. Once it has used them it needs to refill from the global pool which may happen from time to time and show up in the averages. - another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes huge contentions when multiple cores are involved. This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which doesn't produce any lock contention or cache pollution. Also skipping the per-route lock while the table read-lock is held should help some more. 
All in all this should give a massive gain in high pps situations at the expense of costlier routing table changes. However, changes are seldom to essentially nonexistent with a single default route. After that the ARP table will get the same treatment and the low stack lock contention points should be gone for good. -- Andre
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote: On 19.04.2012 15:30, Luigi Rizzo wrote: I have been running some performance tests on UDP sockets, using the netsend program in tools/tools/netrate/netsend and instrumenting the source code and the kernel to return at various points of the path. Here are some results which I hope you find interesting. Jumping over very interesting analysis... - the next expensive operation, consuming another 100ns, is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator seems to scale decently at least with 4 cores. The copyin() is relatively inexpensive (not reported in the data below, but disabling it saves only 15-20ns for a short packet). I have not followed the details, but the allocator calls the zone allocator and there is at least one critical_enter()/critical_exit() pair, and the highly modular architecture invokes long chains of indirect function calls both on allocation and release. It might make sense to keep a small pool of mbufs attached to the socket buffer instead of going to the zone allocator. Or defer the actual encapsulation to the (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways. The UMA mbuf allocator is certainly not perfect but rather good. It has a per-CPU cache of mbufs that are very fast to allocate from. Once it has used them it needs to refill from the global pool which may happen from time to time and show up in the averages. indeed i was pleased to see no difference between 1 and 4 threads. This also suggests that the global pool is accessed very seldom, and for short times, otherwise you'd see the effect with 4 threads. What might be moderately expensive are the critical_enter()/critical_exit() calls around individual allocations. The allocation happens while the code has already an exclusive lock on so->snd_buf so a pool of fresh buffers could be attached there. 
But the other consideration is that one could defer the mbuf allocation to a later time when the packet is actually built (or anyways right before the thread returns). What i envision (and this would fit nicely with netmap) is the following: - have a (possibly readonly) template for the headers (MAC+IP+UDP) attached to the socket, built on demand, and cached and managed with similar invalidation rules as used by fastforward; - possibly extend the pru_send interface so one can pass down the uio instead of the mbuf; - make an opportunistic buffer allocation in some place downstream, where the code already has an x-lock on some resource (could be the snd_buf, the interface, ...) so the allocation comes for free. - another big bottleneck is the route lookup in ip_output() (between entries 51 and 56). Not only does it eat another 100ns+ on an empty routing table, but it also causes huge contentions when multiple cores are involved. This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ? cheers luigi
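The cached header template Luigi envisions can be sketched as follows. The layout and the generation-counter invalidation are assumptions (loosely modeled on how fastforward revalidates state); build_headers() is a placeholder for filling in the socket's real MAC+IP+UDP headers.

```c
#include <stdint.h>
#include <string.h>

#define HDR_LEN 42              /* 14 MAC + 20 IP + 8 UDP */

/* Assumed hook: a global generation bumped on any route/ARP change. */
static uint32_t route_gen = 1;

/* Per-socket cached header template, built on demand. */
struct hdr_template {
    uint32_t gen;               /* generation it was built against */
    uint8_t  hdr[HDR_LEN];      /* prebuilt MAC+IP+UDP headers */
};

/* Placeholder for building the real headers from the socket's
 * addresses; here it just writes a recognizable pattern. */
static void build_headers(uint8_t *dst)
{
    memset(dst, 0xaa, HDR_LEN);
}

/* Copy the cached headers into an outgoing frame, rebuilding lazily
 * when the generation shows the copy has gone stale. A stale copy is
 * tolerable briefly, per the "keep a copy, not a reference" idea. */
static void emit_headers(struct hdr_template *t, uint8_t *frame)
{
    if (t->gen != route_gen) {  /* stale, or never built */
        build_headers(t->hdr);
        t->gen = route_gen;
    }
    memcpy(frame, t->hdr, HDR_LEN);
}
```

The per-packet cost collapses to one compare and one small memcpy, with no rtentry reference held by the socket.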
Re: Some performance measurements on the FreeBSD network stack
This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which This only helps if your flows aren't hitting the same rtentry. Otherwise you still convoy on the lock for the rtentry itself to increment and decrement the rtentry's reference count. i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ? If the number of peers is bounded then you can use the flowtable. Max PPS is much higher bypassing routing lookup. However, it doesn't scale to arbitrary flow numbers. -Kip
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote: This is indeed a big problem. I'm working (rough edges remain) on changing the routing table locking to an rmlock (read-mostly) which This only helps if your flows aren't hitting the same rtentry. Otherwise you still convoy on the lock for the rtentry itself to increment and decrement the rtentry's reference count. i was wondering, is there a way (and/or any advantage) to use the fastforward code to look up the route for locally sourced packets ? actually, now that i look at the code, both ip_output() and the ip_fastforward code use the same in_rtalloc_ign(...) If the number of peers is bounded then you can use the flowtable. Max PPS is much higher bypassing routing lookup. However, it doesn't scale to arbitrary flow numbers. re. flowtable, could you point me to what i should do instead of calling in_rtalloc_ign() ? cheers luigi
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:22 PM, Luigi Rizzo ri...@iet.unipi.it wrote:

> On Thu, Apr 19, 2012 at 10:34:45PM +0200, K. Macy wrote:
> > > This is indeed a big problem. I'm working (rough edges remain) on
> > > changing the routing table locking to an rmlock (read-mostly) which
> >
> > This only helps if your flows aren't hitting the same rtentry.
> > Otherwise you still convoy on the lock for the rtentry itself to
> > increment and decrement the rtentry's reference count.
> >
> > > i was wondering, is there a way (and/or any advantage) to use the
> > > fastforward code to look up the route for locally sourced packets ?
>
> actually, now that i look at the code, both ip_output() and the
> ip_fastforward code use the same in_rtalloc_ign(...)
>
> > If the number of peers is bounded then you can use the flowtable.
> > Max PPS is much higher bypassing routing lookup. However, it doesn't
> > scale to arbitrary flow numbers.
>
> re. flowtable, could you point me to what i should do instead of
> calling in_rtalloc_ign() ?

If you build with it in your kernel config and enable the sysctl, ip_output will automatically use it for TCP and UDP connections. If you're doing forwarding you'll need to patch the forwarding path. Fabien Thomas has a patch for that that I just fixed/identified a bug in for him.

-Kip

--
“The real damage is done by those millions who want to 'get by.' The ordinary men who just want to be left in peace. Those who don’t want their little lives disturbed by anything bigger than themselves. Those with no sides and no causes. Those who won’t take measure of their own strength, for fear of antagonizing their own weakness. Those who don’t like to make waves—or enemies. Those for whom freedom, honour, truth, and principles are only literature. Those who live small, love small, die small. It’s the reductionist approach to life: if you keep it small, you’ll keep it under control. If you don’t make any noise, the bogeyman won’t find you. But it’s all an illusion, because they die too, those people who roll up their spirits into tiny little balls so as to be safe. Safe?! From what? Life is always on the edge of death; narrow streets lead to the same place as wide avenues, and a little candle burns itself out just like a flaming torch does. I choose my own way to burn.” Sophie Scholl
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 22:34, K. Macy wrote:

> > This is indeed a big problem. I'm working (rough edges remain) on
> > changing the routing table locking to an rmlock (read-mostly) which
>
> This only helps if your flows aren't hitting the same rtentry.
> Otherwise you still convoy on the lock for the rtentry itself to
> increment and decrement the rtentry's reference count.

The rtentry lock isn't obtained anymore. While the rmlock read lock is held on the rtable the relevant information like ifp and such is copied out. No later referencing possible. In the end any referencing of an rtentry would be forbidden and the rtentry lock can be removed. The second step can be optional though.

> > i was wondering, is there a way (and/or any advantage) to use the
> > fastforward code to look up the route for locally sourced packets ?
>
> If the number of peers is bounded then you can use the flowtable. Max
> PPS is much higher bypassing routing lookup. However, it doesn't scale
> to arbitrary flow numbers.

In theory a rmlock-only lookup into a default-route-only routing table would be faster than creating a flow table entry for every destination. It's a matter of churn though. The flowtable isn't lockless in itself, is it?

--
Andre
Re: Some performance measurements on the FreeBSD network stack
> > This only helps if your flows aren't hitting the same rtentry.
> > Otherwise you still convoy on the lock for the rtentry itself to
> > increment and decrement the rtentry's reference count.
>
> The rtentry lock isn't obtained anymore. While the rmlock read lock is
> held on the rtable the relevant information like ifp and such is
> copied out. No later referencing possible. In the end any referencing
> of an rtentry would be forbidden and the rtentry lock can be removed.
> The second step can be optional though.

Can you point me to a tree where you've made these changes?

> > > i was wondering, is there a way (and/or any advantage) to use the
> > > fastforward code to look up the route for locally sourced packets ?
> >
> > If the number of peers is bounded then you can use the flowtable.
> > Max PPS is much higher bypassing routing lookup. However, it doesn't
> > scale to arbitrary flow numbers.
>
> In theory a rmlock-only lookup into a default-route-only routing table
> would be faster than creating a flow table entry for every
> destination. It's a matter of churn though. The flowtable isn't
> lockless in itself, is it?

It is. In a steady state where the working set of peers fits in the table it should be just a simple hash of the IP and then a lookup.

-Kip
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 22:46, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> > On 19.04.2012 15:30, Luigi Rizzo wrote:
> > > I have been running some performance tests on UDP sockets, using
> > > the netsend program in tools/tools/netrate/netsend and
> > > instrumenting the source code and the kernel to return at various
> > > points of the path. Here are some results which I hope you find
> > > interesting.
> >
> > Jumping over very interesting analysis...
> >
> > > - the next expensive operation, consuming another 100ns, is the
> > >   mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
> > >   seems to scale decently at least with 4 cores. The copyin() is
> > >   relatively inexpensive (not reported in the data below, but
> > >   disabling it saves only 15-20ns for a short packet).
> > >   I have not followed the details, but the allocator calls the
> > >   zone allocator and there is at least one
> > >   critical_enter()/critical_exit() pair, and the highly modular
> > >   architecture invokes long chains of indirect function calls both
> > >   on allocation and release.
> > >   It might make sense to keep a small pool of mbufs attached to
> > >   the socket buffer instead of going to the zone allocator. Or
> > >   defer the actual encapsulation to the
> > >   (*so->so_proto->pr_usrreqs->pru_send)() which is called inline,
> > >   anyways.
> >
> > The UMA mbuf allocator is certainly not perfect but rather good. It
> > has a per-CPU cache of mbufs that are very fast to allocate from.
> > Once it has used them it needs to refill from the global pool which
> > may happen from time to time and show up in the averages.
>
> indeed i was pleased to see no difference between 1 and 4 threads.
> This also suggests that the global pool is accessed very seldom, and
> for short times, otherwise you'd see the effect with 4 threads.

Robert did the per-CPU mbuf allocator pools a few years ago. Excellent engineering. What might be moderately expensive are the critical_enter()/critical_exit() calls around individual allocations. Can't get away from those as a thread must not migrate away when manipulating the per-CPU mbuf pool.

> The allocation happens while the code has already an exclusive lock on
> so->snd_buf so a pool of fresh buffers could be attached there.

Ah, there it is not necessary to hold the snd_buf lock while doing the allocate+copyin. With soreceive_stream() (which is experimental and not enabled by default) I did just that for the receive path. It's quite a significant gain there. IMHO better to resolve the locking order than to juggle yet another mbuf sink.

> But the other consideration is that one could defer the mbuf
> allocation to a later time when the packet is actually built (or
> anyways right before the thread returns).
>
> What i envision (and this would fit nicely with netmap) is the
> following:
> - have a (possibly readonly) template for the headers (MAC+IP+UDP)
>   attached to the socket, built on demand, and cached and managed with
>   similar invalidation rules as used by fastforward;

That would require to cross-pointer the rtentry and whatnot again. We want to get away from that to untangle the (locking) mess that eventually results from it.

> - possibly extend the pru_send interface so one can pass down the uio
>   instead of the mbuf;
> - make an opportunistic buffer allocation in some place downstream,
>   where the code already has an x-lock on some resource (could be the
>   snd_buf, the interface, ...) so the allocation comes for free.

ETOOCOMPLEXOVERTIME.

> - another big bottleneck is the route lookup in ip_output() (between
>   entries 51 and 56). Not only it eats another 100ns+ on an empty
>   routing table, but it also causes huge contentions when multiple
>   cores are involved.
>
> > This is indeed a big problem. I'm working (rough edges remain) on
> > changing the routing table locking to an rmlock (read-mostly) which
>
> i was wondering, is there a way (and/or any advantage) to use the
> fastforward code to look up the route for locally sourced packets ?

No. The main advantage/difference of fastforward is the short code path and processing to completion.

--
Andre
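The allocator structure Andre and Luigi are discussing -- a private per-CPU cache consumed without locking (the thread is merely pinned with critical_enter()), refilled in batches from a mutex-protected global pool -- can be sketched as a userspace toy. Names, batch size, and pool size are invented; critical_enter() is a no-op here where the kernel would prevent CPU migration.

```c
#include <pthread.h>

#define CACHE_BATCH 4                   /* mbufs moved per refill */

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int global_pool = 1024;          /* mbufs left in the zone */

struct pcpu_cache { int count; };       /* per-CPU private free list */
static struct pcpu_cache cache0;        /* cache for "CPU 0" */

static void critical_enter(void) {}     /* kernel: pin thread to CPU */
static void critical_exit(void)  {}

/* Allocate one mbuf; returns 1 on success, 0 when the zone is empty.
 * The fast path touches only the per-CPU cache -- no lock at all. */
int mbuf_alloc(struct pcpu_cache *c)
{
    int ok = 1;
    critical_enter();
    if (c->count == 0) {                /* slow path: batch refill */
        pthread_mutex_lock(&pool_lock);
        if (global_pool >= CACHE_BATCH) {
            global_pool -= CACHE_BATCH;
            c->count = CACHE_BATCH;
        } else
            ok = 0;
        pthread_mutex_unlock(&pool_lock);
    }
    if (ok)
        c->count--;                     /* fast path: private decrement */
    critical_exit();
    return ok;
}
```

The occasional locked refill is what shows up in the averages; with 1 vs. 4 threads showing no difference, the slow path must be rare and short, as Luigi observed.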
Re: Some performance measurements on the FreeBSD network stack
On 19.04.2012 23:17, K. Macy wrote:

> > > This only helps if your flows aren't hitting the same rtentry.
> > > Otherwise you still convoy on the lock for the rtentry itself to
> > > increment and decrement the rtentry's reference count.
> >
> > The rtentry lock isn't obtained anymore. While the rmlock read lock
> > is held on the rtable the relevant information like ifp and such is
> > copied out. No later referencing possible. In the end any
> > referencing of an rtentry would be forbidden and the rtentry lock
> > can be removed. The second step can be optional though.
>
> Can you point me to a tree where you've made these changes?

It's not in a public tree. I just did a 'svn up' and the recent pf and rtsocket changes created some conflicts. Have to solve them before posting. Timeframe (early) next week.

> > > > i was wondering, is there a way (and/or any advantage) to use
> > > > the fastforward code to look up the route for locally sourced
> > > > packets ?
> > >
> > > If the number of peers is bounded then you can use the flowtable.
> > > Max PPS is much higher bypassing routing lookup. However, it
> > > doesn't scale to arbitrary flow numbers.
> >
> > In theory a rmlock-only lookup into a default-route-only routing
> > table would be faster than creating a flow table entry for every
> > destination. It's a matter of churn though. The flowtable isn't
> > lockless in itself, is it?
>
> It is. In a steady state where the working set of peers fits in the
> table it should be just a simple hash of the IP and then a lookup.

Yes, but the lookup requires a lock? Or is every entry replicated to every CPU? So a number of concurrent CPUs sending to the same UDP destination would contend on that lock?

--
Andre
Re: Some performance measurements on the FreeBSD network stack
> Yes, but the lookup requires a lock? Or is every entry replicated to
> every CPU? So a number of concurrent CPUs sending to the same UDP
> destination would contend on that lock?

No. In the default case it's per-CPU, thus no serialization is required. But yes, if your transmitting thread manages to bounce to every core during send within the flow expiration window you'll have an extra 12 or however many bytes per peer times the number of cores. There is usually a fair amount of CPU affinity over a given unit time.
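The per-CPU replication Kip describes -- each core keeps its own table, so a lookup needs no serialization at all, at the cost of duplicating an entry on every core the sending thread lands on -- can be sketched as follows. All names and sizes are hypothetical.

```c
#include <stdint.h>

#define NCPU  4
#define FT_SZ 64

struct flent { uint32_t dst; int ifidx; };

/* One independent table per CPU: lookups index by curcpu, no lock. */
static struct flent tables[NCPU][FT_SZ];

static unsigned fhash(uint32_t d) { return (d * 2654435761u) % FT_SZ; }

/* Stand-in for the full (serialized) routing lookup on a miss. */
static int slow_lookup(uint32_t d) { return (int)(d & 1) + 1; }

int flow_lookup(int cpu, uint32_t dst)   /* cpu models curcpu */
{
    struct flent *fe = &tables[cpu][fhash(dst)];
    if (fe->dst != dst) {                /* miss: this CPU pays once */
        fe->dst = dst;
        fe->ifidx = slow_lookup(dst);
    }
    return fe->ifidx;
}
```

If a sender bounces across cores, each core takes its own miss and stores its own copy -- the "extra bytes per peer times the number of cores" above -- but steady-state sends from an affine thread never touch shared state.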
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:27 PM, Andre Oppermann an...@freebsd.org wrote:

> > Can you point me to a tree where you've made these changes?
>
> It's not in a public tree. I just did a 'svn up' and the recent pf and
> rtsocket changes created some conflicts. Have to solve them before
> posting. Timeframe (early) next week.

Ok. Keep us posted.

Thanks,
Kip
Re: Some performance measurements on the FreeBSD network stack
On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote: On 19.04.2012 22:46, Luigi Rizzo wrote: ... What might be moderately expensive are the critical_enter()/critical_exit() calls around individual allocations. Can't get away from those as a thread must not migrate away when manipulating the per-CPU mbuf pool. i understand. The allocation happens while the code has already an exclusive lock on so-snd_buf so a pool of fresh buffers could be attached there. Ah, there it is not necessary to hold the snd_buf lock while doing the allocate+copyin. With soreceive_stream() (which is it is not held in the tx path either -- but there is a short section before m_uiotombuf() which does ... SOCKBUF_LOCK(so-so_snd); // check for pending errors, sbspace, so_state SOCKBUF_UNLOCK(so-so_snd); ... (some of this is slightly dubious, but that's another story) But the other consideration is that one could defer the mbuf allocation to a later time when the packet is actually built (or anyways right before the thread returns). What i envision (and this would fit nicely with netmap) is the following: - have a (possibly readonly) template for the headers (MAC+IP+UDP) attached to the socket, built on demand, and cached and managed with similar invalidation rules as used by fastforward; That would require to cross-pointer the rtentry and whatnot again. i was planning to keep a copy, not a reference. If the copy becomes temporarily stale, no big deal, as long as you can detect it reasonably quiclky -- routes are not guaranteed to be correct, anyways. - possibly extend the pru_send interface so one can pass down the uio instead of the mbuf; - make an opportunistic buffer allocation in some place downstream, where the code already has an x-lock on some resource (could be the snd_buf, the interface, ...) so the allocation comes for free. ETOOCOMPLEXOVERTIME. maybe. But i want to investigate this. 
cheers luigi ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
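Luigi's copy-not-reference idea -- a prebuilt MAC+IP+UDP header cached on the socket, rebuilt only when a change is detected -- can be sketched with a generation counter as the staleness check. This is entirely hypothetical: no such field exists on struct socket, and a real implementation would have to tie the counter to routing/ARP events.

```c
#include <stdint.h>
#include <string.h>

#define HDR_LEN 42                       /* 14 MAC + 20 IP + 8 UDP */

/* Bumped whenever the routing/ARP state changes (modeled globally). */
static uint32_t route_generation = 1;

struct hdr_template {
    uint32_t      gen;                   /* generation it was built for */
    unsigned char hdr[HDR_LEN];          /* cached MAC+IP+UDP header */
};

static struct hdr_template tmpl;         /* would live on the socket */

static void build_headers(unsigned char *hdr)
{
    memset(hdr, 0, HDR_LEN);             /* real code fills MAC/IP/UDP */
}

/* Returns 1 if the template had to be (re)built, 0 on a cache hit.
 * A stale copy is harmless as long as this check catches it soon. */
int hdr_template_get(struct hdr_template *t)
{
    if (t->gen == route_generation)
        return 0;                        /* still valid: just prepend */
    build_headers(t->hdr);               /* stale: rebuild on demand */
    t->gen = route_generation;
    return 1;
}
```

The point of the copy is that the hot path never dereferences (or refcounts) an rtentry; it only compares an integer, which sidesteps the cross-pointer locking mess Andre objects to, at the price of transiently stale headers.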
Re: Some performance measurements on the FreeBSD network stack
On 20.04.2012 00:03, Luigi Rizzo wrote:
> On Thu, Apr 19, 2012 at 11:20:00PM +0200, Andre Oppermann wrote:
> > On 19.04.2012 22:46, Luigi Rizzo wrote:
> > > The allocation happens while the code has already an exclusive
> > > lock on so->snd_buf so a pool of fresh buffers could be attached
> > > there.
> >
> > Ah, there it is not necessary to hold the snd_buf lock while doing
> > the allocate+copyin. With soreceive_stream() (which is
>
> it is not held in the tx path either -- but there is a short section
> before m_uiotombuf() which does
>
>	...
>	SOCKBUF_LOCK(&so->so_snd);
>	// check for pending errors, sbspace, so_state
>	SOCKBUF_UNLOCK(&so->so_snd);
>	...
>
> (some of this is slightly dubious, but that's another story)

Indeed the lock isn't held across the m_uiotombuf(). You're talking about filling a sockbuf mbuf cache while holding the lock?

> > > But the other consideration is that one could defer the mbuf
> > > allocation to a later time when the packet is actually built (or
> > > anyways right before the thread returns).
> > >
> > > What i envision (and this would fit nicely with netmap) is the
> > > following:
> > > - have a (possibly readonly) template for the headers (MAC+IP+UDP)
> > >   attached to the socket, built on demand, and cached and managed
> > >   with similar invalidation rules as used by fastforward;
> >
> > That would require to cross-pointer the rtentry and whatnot again.
>
> i was planning to keep a copy, not a reference. If the copy becomes
> temporarily stale, no big deal, as long as you can detect it
> reasonably quickly -- routes are not guaranteed to be correct,
> anyways.

Be wary of disappearing interface pointers...

> > > - possibly extend the pru_send interface so one can pass down the
> > >   uio instead of the mbuf;
> > > - make an opportunistic buffer allocation in some place
> > >   downstream, where the code already has an x-lock on some
> > >   resource (could be the snd_buf, the interface, ...) so the
> > >   allocation comes for free.
> >
> > ETOOCOMPLEXOVERTIME.
>
> maybe. But i want to investigate this.

I fail to see what passing down the uio would gain you. The snd_buf lock isn't obtained again after the copyin.

Not that I want to prevent you from investigating other ways. ;)

--
Andre