Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
As Martin said, in some cases cbench may significantly over-report numbers in throughput mode (of course it depends on the controller implementation, so not all the controllers might be affected). The cbench code sleeps for 100ms to clear out buffers after reading the switch counters (fakeswitch_get_count in fakeswitch.c). There are two problems here:

* Switch input and output buffers are not cleared under throughput mode.
* Having X switches means that the code sleeps for 100X ms instead of a single 100ms for all emulated switches.

These would result in a significant over-estimation of controller performance under throughput mode if one is using more than a few emulated switches. For instance, with 128 switches, cbench would sleep for almost 13 seconds before printing out stats of each round, meanwhile the controller fills the input buffer of all the emulated switches. Since the input buffer is not cleared, the stats of the next round would contain the replies received for requests in previous rounds (which is a potentially large number).

Rob, I will post a patch soon. Meanwhile, a quick fix is to move the sleep to an appropriate place in run_test (cbench.c) and clear the buffers under throughput mode as well in fakeswitch_get_count (fakeswitch.c).

Amin

A problem with cbench might even be of interest to those who wrote it :-) If I could bother you to just send me a diff of what you've changed, it would be much appreciated. I can push it back into the main branch.

Fwiw, cbench is something I wrote very quickly while jetlagged, so it's not surprising that there are bugs in it. I didn't realize that people were actually using it, or I would try to snag some time to make it less crappy :-)

Thanks for the feedback,

- Rob

___
nox-dev mailing list
nox-dev@noxrepo.org
http://noxrepo.org/mailman/listinfo/nox-dev_noxrepo.org
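The sleep arithmetic described above can be checked with a few lines of Python. This is only an illustration of the numbers in the message, not cbench code: the function name `sleep_per_round_ms` and the `sleep_once` flag are hypothetical, and the 100 ms figure is taken from the description of fakeswitch_get_count.

```python
# Illustrative arithmetic only; names here are hypothetical, not from cbench.
SLEEP_PER_CALL_MS = 100  # per-call sleep described in the message above


def sleep_per_round_ms(num_switches: int, sleep_once: bool) -> int:
    """Total time spent sleeping while collecting one round of counters.

    sleep_once=False models the reported behaviour (one sleep per switch);
    sleep_once=True models the suggested fix (one sleep for all switches).
    """
    calls = 1 if sleep_once else num_switches
    return calls * SLEEP_PER_CALL_MS


for n in (1, 16, 64, 128):
    per_switch = sleep_per_round_ms(n, sleep_once=False)
    single = sleep_per_round_ms(n, sleep_once=True)
    print(f"{n:>3} switches: {per_switch / 1000:.1f} s per round vs {single / 1000:.1f} s")
```

With 128 switches the per-switch variant sleeps 12.8 seconds per round, which matches the "almost 13 seconds" quoted above.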
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
Random curiosity: Why would jumbo frames increase replies per sec?

Regards
KK

On 15 December 2010 11:45, Amin Tootoonchian a...@cs.toronto.edu wrote:

I missed that. The single core throughput is ~250k replies/sec, two cores ~450k replies/sec, three cores ~650k replies/sec, four cores ~800k replies/sec. These numbers are higher than what I reported in my previous post. That is most probably because, right now, I am testing with MTU 9000 (jumbo frames) and with more user-space threads.

Cheers,
Amin

On Wed, Dec 15, 2010 at 12:36 AM, Martin Casado cas...@nicira.com wrote:

Also, do you mind posting the single core throughput?

[cross-posting to nox-dev, openflow-discuss, ovs-discuss]

I have prepared a patch based on NOX Zaku that improves its performance by a factor of 10. This implies that a single controller instance can run a large network with nearly a million flow initiations per second. I am writing to open up a discussion and get feedback from the community.

Here are some preliminary results:

- Benchmark configuration:
  * Benchmark: Throughput test of cbench (controller benchmarker) with 64 switches. Cbench is a part of the OFlops package (http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput mode, cbench sends a batch of ofp_packet_in messages to the controller and counts the number of replies it gets back.
  * Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz quad-core Intel Xeon processor (X3210), and 4GB RAM
  * Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz quad-core Intel Xeon processors (E5405), and 4GB RAM
  * Connectivity: 1Gbps
- Benchmark results:
  * NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
  * Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8 available cores). The sustained controller-benchmarker throughput is ~400Mbps.

The patch updates the asynchronous harness of NOX to a standard library (boost asynchronous I/O library) which simplifies the code base. It fixes the code in several areas, including but not limited to:

- Multi-threading: The patch enables having any number of worker threads running on multiple cores.
- Batching: Serving requests individually and sending replies one by one is quite inefficient. The patch tries to batch requests together where possible, as well as replies (which reduces the number of system calls significantly).
- Memory allocation: The standard C++ memory allocator is not robust in multi-threaded environments. Google's Thread-Caching Malloc (TCMalloc) or Hoard memory allocator perform much better for NOX.
- Fully asynchronous operation: The patched version avoids wasting CPU cycles polling sockets, or event/timer dispatchers when not necessary.

I would like to add that the patched version should perform much better than what I reported above (the number reported is with a run on 4 CPU cores). I guess a single NOX instance running on a machine with 8 CPU cores should handle well above 1 million flow initiation requests per second. Also having a more capable machine should help to serve more requests! The code will be made available soon and I will post updates as well.

Cheers,
Amin

___
openflow-discuss mailing list
openflow-disc...@lists.stanford.edu
https://mailman.stanford.edu/mailman/listinfo/openflow-discuss
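To put the per-core figures quoted in this message in perspective, the following sketch computes speedup and parallel efficiency relative to the single-core number. It assumes the four-core figure is meant to be ~800k replies/sec (in line with the other three values); the dictionary and function names are hypothetical, and the inputs are the approximate values from the thread, not new measurements.

```python
# Approximate throughput from the thread, in thousands of replies/sec.
# Assumption: the four-core value is ~800k, following the 250k/450k/650k pattern.
throughput_k = {1: 250, 2: 450, 3: 650, 4: 800}


def scaling(throughput: dict) -> dict:
    """Return {cores: (speedup, efficiency)} relative to the single-core rate."""
    base = throughput[1]
    return {
        cores: (rate / base, rate / (base * cores))
        for cores, rate in sorted(throughput.items())
    }


for cores, (speedup, efficiency) in scaling(throughput_k).items():
    print(f"{cores} core(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions four cores give roughly a 3.2x speedup over one core, so each added core contributes somewhat less than the first.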
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
I double checked. It does slightly improve the performance (on the order of a few thousand replies/sec). Larger MTUs decrease the CPU workload (by decreasing the number of transfers across the bus) and this means that more CPU cycles are available to the controller to process requests. However, I am not suggesting that people should use jumbo frames. Apparently running with more user-space threads does the trick here. Anyway, I should trust a profiler rather than guessing, so I will get back with a definite answer once I have done a more thorough evaluation.

Cheers,
Amin

On Wed, Dec 15, 2010 at 2:51 PM, kk yap yap...@stanford.edu wrote:

Random curiosity: Why would jumbo frames increase replies per sec?

Regards
KK
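A rough way to see why a larger MTU could reduce per-frame work is to count how many Ethernet frames a given amount of TCP payload needs. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes 20-byte IPv4 and 20-byte TCP headers with no options, ignores segmentation offload, and uses a 64 KB payload purely as an example size. The function name `frames_needed` is hypothetical.

```python
import math

# Assumption: IPv4 (20 bytes) + TCP (20 bytes), no header options.
HEADER_BYTES = 40


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Number of frames needed to carry payload_bytes of TCP payload at a given MTU."""
    max_segment = mtu - HEADER_BYTES
    return math.ceil(payload_bytes / max_segment)


payload = 64 * 1024  # example payload size in bytes
for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_needed(payload, mtu)} frames for {payload} bytes")
```

Whether fewer frames translates into a measurable gain depends on the NIC and driver, which is consistent with the caution above about trusting a profiler over guesses.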
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
Hi Amin,

Just to clarify, do your jumbo frames refer to the OpenFlow messages or the frames in the datapath? By OpenFlow messages, I am assuming you use a TCP connection between NOX and the switches, and you are batching the messages into jumbo frames of 9000 bytes before sending them out. By frames in the datapath, I mean jumbo Ethernet frames are being sent in the datapath. The latter does not make any sense to me, because OpenFlow should send 128 bytes to the controller by default.

Thanks.

Regards
KK

On 15 December 2010 12:36, Amin Tootoonchian a...@cs.toronto.edu wrote:

I double checked. It does slightly improve the performance (on the order of a few thousand replies/sec). Larger MTUs decrease the CPU workload (by decreasing the number of transfers across the bus) and this means that more CPU cycles are available to the controller to process requests. However, I am not suggesting that people should use jumbo frames. Apparently running with more user-space threads does the trick here. Anyway, I should trust a profiler rather than guessing, so I will get back with a definite answer once I have done a more thorough evaluation.

Cheers,
Amin
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
Oh.. another point, if you are batching the frames, then what about delay? There seems to be a trade-off between delay and throughput, and we have opted for low delay by disabling Nagle's algorithm.

Regards
KK

On 15 December 2010 12:46, kk yap yap...@stanford.edu wrote:

Hi Amin,

Just to clarify, do your jumbo frames refer to the OpenFlow messages or the frames in the datapath? By OpenFlow messages, I am assuming you use a TCP connection between NOX and the switches, and you are batching the messages into jumbo frames of 9000 bytes before sending them out. By frames in the datapath, I mean jumbo Ethernet frames are being sent in the datapath. The latter does not make any sense to me, because OpenFlow should send 128 bytes to the controller by default.

Thanks.

Regards
KK
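For readers unfamiliar with the option mentioned above: Nagle's algorithm is normally disabled per socket with the standard TCP_NODELAY option. The snippet below is a minimal sketch using Python's socket module; it is not taken from NOX or any OpenFlow switch implementation, and the helper name `make_low_latency_socket` is hypothetical.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY set)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small writes are sent immediately instead of being
    # held back while earlier data is still unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
print("TCP_NODELAY enabled:", bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)))
sock.close()
```

The option trades fewer, larger segments for lower latency on small messages, which is the trade-off raised in the question.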
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
I'll let Amin follow up, but from what I understand, the way he's doing batching doesn't introduce any additional delay. Rather, if he can write to the socket, he writes. However, if the socket is blocked for whatever reason (e.g. waiting for an ACK or send buffer is full) he buffers all of the waiting packets and then sends them in aggregate.
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
I am talking about jumbo Ethernet frames here. By batching, I mean batching outgoing messages together and writing to the underlying layer which would be the TCP write buffer. The TCP buffer is not limited to MTU or anything like that, so in most cases my code flushes more than 64KB to the TCP write buffer. The gain is due to issuing a single system call with a larger buffer rather than many system calls with tiny buffers (e.g., the 128 bytes you mentioned).

I do not sacrifice delay for throughput here. I keep a write buffer and keep appending to it until the underlying socket is ready for writes. Once it is ready for a write operation, buffered replies are flushed to the underlying layer immediately. This is quite different from Nagle's algorithm and will not add any delays.

Amin

On Wed, Dec 15, 2010 at 3:47 PM, kk yap yap...@stanford.edu wrote:

Oh.. another point, if you are batching the frames, then what about delay? There seems to be a trade-off between delay and throughput, and we have opted for low delay by disabling Nagle's algorithm.

Regards
KK
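The write-side batching described in this message (append replies to a buffer, then flush everything in one write once the socket is ready) can be sketched as follows. This is a simplified illustration in Python, not the NOX patch itself, which is C++ on top of the boost asynchronous I/O library: the `BatchingWriter` class, its method names, and the `send_calls` counter are all hypothetical, and the readiness notification that would call `on_writable` is left out.

```python
import socket


class BatchingWriter:
    """Accumulates replies and flushes them with as few send() calls as possible."""

    def __init__(self, sock: socket.socket) -> None:
        self.sock = sock
        self.pending = bytearray()  # replies waiting for the socket to become writable
        self.send_calls = 0         # counts system calls issued, for illustration

    def queue(self, reply: bytes) -> None:
        """Append one reply to the buffer; no system call is made here."""
        self.pending += reply

    def on_writable(self) -> int:
        """Called when the socket is ready for writing; flushes what it can."""
        if not self.pending:
            return 0
        try:
            sent = self.sock.send(self.pending)
        except BlockingIOError:
            return 0  # socket not ready after all; keep the data buffered
        self.send_calls += 1
        del self.pending[:sent]  # keep any unsent tail for the next readiness event
        return sent


# Usage: 100 small replies leave the process in a single send() call.
a, b = socket.socketpair()
a.setblocking(False)
writer = BatchingWriter(a)
for _ in range(100):
    writer.queue(b"\x00" * 128)
flushed = writer.on_writable()
print(f"flushed {flushed} bytes in {writer.send_calls} send() call(s)")
a.close()
b.close()
```

Because nothing waits on a timer, a reply is delayed only while the socket itself cannot accept data, which is the distinction from Nagle's algorithm drawn above.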
Re: [nox-dev] [openflow-discuss] NOX performance improvement by a factor 10
This is awesome Amin, thanks for posting. It is also probably worth mentioning that cbench was broken and over-reporting numbers. Do you mind sending out a few details about that? I presume that will be helpful to those using cbench.