Re: e1000 performance issue in 4 simultaneous links
Denys Fedoryshchenko wrote:
> Sorry that I interfere in this subject. Do you recommend CONFIG_IRQBALANCE to be enabled?

I certainly do not. Manual tweaking and pinning the IRQs to the correct CPU will give the best performance (for specific loads). The userspace irqbalance daemon tries very hard to approximate this behaviour and is what I recommend for most situations: it usually does the right thing, and does so without making your head spin (just start it). The in-kernel one usually does the wrong thing for network loads.

Cheers,

Auke

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e1000 performance issue in 4 simultaneous links
David Miller [EMAIL PROTECTED] writes:
> No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.

Does irqbalanced need to be taught about this? And how about the initial balancing, so that each network card gets assigned to one CPU?

/Benny
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao a écrit :
> On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
>> Breno Leitao wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
>> I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA. In general the behavior of hardware round robin balancing is bad and I'm sure it is causing all sorts of other performance issues that you may not even be aware of.
>
> I've made another test removing the ppc IRQ round robin scheme, bound each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1, CPU2, CPU3 and CPU4), and I also get around 720 * 10^6 bits/s on average. Take a look at the interrupt table this time:
>
> io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> 277:      15 1362450      13      14      13      14      15      18   XICS  Level  eth6
> 278:      12      13 1348681      19      13      15      10      11   XICS  Level  eth7
> 323:      11      18      17 1348426      18      11      11      13   XICS  Level  eth16
> 324:      12      16      11      19 1402709      13      14      11   XICS  Level  eth17
>
> I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using the noirqdistrib boot parameter, and the performance was a little worse.
>
> Rick, the 2-interface test that I showed in my first email was run on two different NICs. Also, I am running netperf with the command "netperf -H hostname -T 0,8" while netserver is running without any argument at all. Running vmstat in parallel shows that there is no bottleneck in the CPU. Take a look:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd    free  buff  cache   si   so   bi   bo    in   cs us sy id wa st
>  2  0      0 6714732 16168 227440    0    0    8    2   203   21  0  1 98  0  0
>  0  0      0 6715120 16176 227440    0    0    0   28 16234  505  0 16 83  0  1
>  0  0      0 6715516 16176 227440    0    0    0    0 16251  518  0 16 83  0  1
>  1  0      0 6715252 16176 227440    0    0    0    1 16316  497  0 15 84  0  1
>  0  0      0 6716092 16176 227440    0    0    0    0 16300  520  0 16 83  0  1
>  0  0      0 6716320 16180 227440    0    0    0    1 16354  486  0 15 84  0  1

If your machine has 8 CPUs, then your vmstat output shows a bottleneck :) (100/8 = 12.5), so I guess one of your CPUs is full.
Re: e1000 performance issue in 4 simultaneous links
Maybe good idea to use sysstat? http://perso.wanadoo.fr/sebastien.godard/

For example:

visp-1 ~ # mpstat -P ALL 1
Linux 2.6.24-rc7-devel (visp-1)  01/11/08

19:27:57  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
19:27:58  all   0.00   0.00  0.00     0.00  0.00   2.51    0.00  97.49  7707.00
19:27:58    0   0.00   0.00  0.00     0.00  0.00   4.00    0.00  96.00  1926.00
19:27:58    1   0.00   0.00  0.00     0.00  0.00   1.01    0.00  98.99  1926.00
19:27:58    2   0.00   0.00  0.00     0.00  0.00   5.00    0.00  95.00  1927.00
19:27:58    3   0.00   0.00  0.00     0.00  0.00   0.99    0.00  99.01  1927.00
19:27:58    4   0.00   0.00  0.00     0.00  0.00   0.00    0.00   0.00     0.00

[quoted text from Breno Leitao's test report and Eric Dumazet's reply snipped]

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
Re: e1000 performance issue in 4 simultaneous links
On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> If your machine has 8 cpus, then your vmstat output shows a bottleneck :) (100/8 = 12.5), so I guess one of your CPUs is full

Well, if I run top while running the test, I see this load distributed among the CPUs, mainly those that have a NIC IRQ bound. Take a look:

Tasks: 133 total, 2 running, 130 sleeping, 0 stopped, 1 zombie
Cpu0 :  0.3%us, 19.5%sy,  0.0%ni,  73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
Cpu1 :  0.0%us,  0.0%sy,  0.0%ni,  75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
Cpu2 :  0.0%us,  0.0%sy,  0.0%ni,  73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
Cpu3 :  0.0%us,  0.0%sy,  0.0%ni,  76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
Cpu4 :  0.0%us,  0.3%sy,  0.0%ni,  70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
Cpu5 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6 :  0.0%us,  0.0%sy,  0.0%ni,  99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Note that this average scenario doesn't change during the entire benchmarking test.

Thanks!

--
Breno Leitao [EMAIL PROTECTED]
Re: e1000 performance issue in 4 simultaneous links
Hello Denys,

I've installed sysstat (good tools!) and the result is very similar to the one which appears in top. Take a look:

13:34:23  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal   %idle    intr/s
13:34:24  all   0.00   0.00   2.72     0.00  0.25  12.13    0.99   83.91  16267.33
13:34:24    0   0.00   0.00  21.78     0.00  0.00   0.00    7.92   70.30     40.59
13:34:24    1   0.00   0.00   0.00     0.00  0.99  24.75    0.00   74.26   4025.74
13:34:24    2   0.00   0.00   0.00     0.00  0.99  24.75    0.00   74.26   4036.63
13:34:24    3   0.00   0.00   0.00     0.00  0.99  21.78    0.00   77.23   4032.67
13:34:24    4   0.00   0.00   0.00     0.00  0.98  24.51    0.00   74.51   4034.65
13:34:24    5   0.00   0.00   0.00     0.00  0.00   0.00    0.00  100.00     30.69
13:34:24    6   0.00   0.00   0.00     0.00  0.00   0.00    0.00  100.00     33.66
13:34:24    7   0.00   0.00   0.00     0.00  0.00   0.00    0.00  100.00     32.67

So we can be sure that the IRQs are not being balanced, and that no processor is overloaded.

Thanks!

On Fri, 2008-01-11 at 19:36 +0200, Denys Fedoryshchenko wrote:
> Maybe good idea to use sysstat? http://perso.wanadoo.fr/sebastien.godard/
RE: e1000 performance issue in 4 simultaneous links
On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
> Breno Leitao wrote:
>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
> I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA. In general the behavior of hardware round robin balancing is bad and I'm sure it is causing all sorts of other performance issues that you may not even be aware of.

I've made another test removing the ppc IRQ round robin scheme, bound each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1, CPU2, CPU3 and CPU4), and I also get around 720 * 10^6 bits/s on average. Take a look at the interrupt table this time:

io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
277:      15 1362450      13      14      13      14      15      18   XICS  Level  eth6
278:      12      13 1348681      19      13      15      10      11   XICS  Level  eth7
323:      11      18      17 1348426      18      11      11      13   XICS  Level  eth16
324:      12      16      11      19 1402709      13      14      11   XICS  Level  eth17

I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using the noirqdistrib boot parameter, and the performance was a little worse.

Rick, the 2-interface test that I showed in my first email was run on two different NICs. Also, I am running netperf with the command "netperf -H hostname -T 0,8" while netserver is running without any argument at all. Running vmstat in parallel shows that there is no bottleneck in the CPU. Take a look:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free  buff  cache   si   so   bi   bo    in   cs us sy id wa st
 2  0      0 6714732 16168 227440    0    0    8    2   203   21  0  1 98  0  0
 0  0      0 6715120 16176 227440    0    0    0   28 16234  505  0 16 83  0  1
 0  0      0 6715516 16176 227440    0    0    0    0 16251  518  0 16 83  0  1
 1  0      0 6715252 16176 227440    0    0    0    1 16316  497  0 15 84  0  1
 0  0      0 6716092 16176 227440    0    0    0    0 16300  520  0 16 83  0  1
 0  0      0 6716320 16180 227440    0    0    0    1 16354  486  0 15 84  0  1

Thanks!

--
Breno Leitao [EMAIL PROTECTED]
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> Well, if I run top while running the test, I see this load distributed among the CPUs, mainly those that have a NIC IRQ bound. Take a look:
> [top output snipped]

If you have IRQs bound to CPUs 1-4, and have four netperfs running, given that the stack ostensibly tries to have applications run on the same CPUs, what is running on CPU0? Is it related to:

> The 2-interface test that I showed in my first email was run on two different NICs. Also, I am running netperf with the following command, "netperf -H hostname -T 0,8", while netserver is running without any argument at all. Also, running vmstat in parallel shows that there is no bottleneck in the CPU. Take a look:

Unless you have a morbid curiosity :) there isn't much point in binding all the netperfs to CPU 0 when the interrupts for the NICs servicing their connections are on CPUs 1-4.
I also assume then that the system(s) on which netserver is running have 8 CPUs in them? (There are multiple destination systems, yes?)

Does anything change if you explicitly bind each netperf to the CPU on which the interrupts for its connection are processed? Or, for that matter, if you remove the -T option entirely?

Does UDP_STREAM show different performance than TCP_STREAM? (I'm ass-u-me-ing, based on the above, that we are looking at the netperf side of a TCP_STREAM test; please correct if otherwise.)

Are the CPUs above single-core CPUs or multi-core CPUs, and if multi-core, are caches shared? How are CPUs numbered if multi-core on that system? Is there any hardware threading involved? I'm wondering if there may be some wrinkles in the system that might lead to reported CPU utilization being low even if a chip is otherwise saturated. Might need some HW counters to check that...

Can you describe the I/O subsystem more completely? I understand that you are using at most two ports of a pair of quad-port cards at any one time, but am still curious to know if those two cards are on separate busses, or if they share any bus/link on the way to memory.

rick jones
Re: e1000 performance issue in 4 simultaneous links
From: Benny Amorsen [EMAIL PROTECTED]
Date: Fri, 11 Jan 2008 12:09:32 +0100

> David Miller [EMAIL PROTECTED] writes:
>> No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.
> Does irqbalanced need to be taught about this?

The userland one already does. It's only the in-kernel IRQ load balancing for these (presumably powerpc) platforms that is broken.
Re: e1000 performance issue in 4 simultaneous links
Sorry that I interfere in this subject. Do you recommend CONFIG_IRQBALANCE to be enabled? If it is enabled, IRQs are not jumping nonstop over processors; softirqd changes this behavior. If it is disabled, IRQs are distributed over each processor, and on loaded systems it seems harmful. I worked a little yesterday with a server with CONFIG_IRQBALANCE=no under 160 kpps load. It was losing packets until I set smp_affinity. Maybe it is useful to put more info in Kconfig, since it is a very important option for performance.

On Fri, 11 Jan 2008 17:41:09 -0800 (PST), David Miller wrote:
> From: Benny Amorsen [EMAIL PROTECTED]
> Date: Fri, 11 Jan 2008 12:09:32 +0100
>>> No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.
>> Does irqbalanced need to be taught about this?
> The userland one already does. It's only the in-kernel IRQ load balancing for these (presumably powerpc) platforms that is broken.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> Hello, I've perceived that there is a performance issue when running netperf against 4 e1000 links connected end-to-end to another machine with 4 e1000 interfaces. I have 2 4-port cards on my machine, but the test is just using 2 ports on each card.
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
<snip>

I take it that's the average for individual interfaces, not the aggregate?

RX processing for multiple gigabits per second can be quite expensive. This can be mitigated by interrupt moderation and NAPI polling, jumbo frames (MTU > 1500) and/or Large Receive Offload (LRO). I don't think e1000 hardware does LRO, but the driver could presumably be changed to use Linux's software LRO.

Even with these optimisations, if all RX processing is done on a single CPU this can become a bottleneck. Does the test system have multiple CPUs? Are IRQs for the multiple NICs balanced across multiple CPUs?

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
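[Editorial note: the mitigations Ben names (interrupt moderation, jumbo frames) have userspace knobs. A minimal sketch, assuming a hypothetical interface eth6 and example coalescing values; whether a given e1000 kernel honors `ethtool -C rx-usecs` varies, so treat the specific knob as an assumption. The script defaults to a dry run that only prints the commands.]

```shell
#!/bin/sh
# Sketch: userspace tuning for the RX-cost mitigations discussed above.
# Interface name and values are placeholders, not from the thread.
IFACE=${IFACE:-eth6}
DRY_RUN=${DRY_RUN:-1}

run() {
    # In dry-run mode just print what would be executed.
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Jumbo frames: raise the MTU (both ends must agree, see the MSS
# discussion later in the thread).
run ip link set dev "$IFACE" mtu 9000

# Interrupt moderation: batch RX work into fewer interrupts.
run ethtool -C "$IFACE" rx-usecs 125
```

On a live system one would rerun with DRY_RUN=0 as root.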
Re: e1000 performance issue in 4 simultaneous links
Ben, I am facing a performance issue when we try to bond multiple interfaces into one virtual interface. It could be related to this thread. My questions are:

*) When we use multiple NICs, will the performance of the overall system be the summation of all the individual links' XX bits/sec?
*) What factors improve the performance if we have multiple interfaces? [kind of tuning the parameters in proc]

Breno, I hope this thread will be helpful for the performance issue which I have with the bonding driver.

Jeba

On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> [quoted text snipped]
Re: e1000 performance issue in 4 simultaneous links
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
> <snip>
> I take it that's the average for individual interfaces, not the aggregate?

Right, each of these results is for an individual interface. Otherwise, we'd have a huge problem. :-)

> This can be mitigated by interrupt moderation and NAPI polling, jumbo frames (MTU > 1500) and/or Large Receive Offload (LRO). I don't think e1000 hardware does LRO, but the driver could presumably be changed to use Linux's software LRO.

Without using these features and keeping the MTU at 1500, do you think we could get better performance than this? I also tried to increase my interface MTU to 9000, but I am afraid that netperf only transmits packets with less than 1500 bytes. Still investigating.

> Even with these optimisations, if all RX processing is done on a single CPU this can become a bottleneck. Does the test system have multiple CPUs? Are IRQs for the multiple NICs balanced across multiple CPUs?

Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced across the CPUs, as I see in /proc/interrupts:

# cat /proc/interrupts
          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
 16:       940     760    1047     904     993     777     975     813  XICS  Level  IPI
 18:         4       3       4       1       3       6       8       3  XICS  Level  hvc_console
 19:         0       0       0       0       0       0       0       0  XICS  Level  RAS_EPOW
273:     10728   10850   10937   10833   10884   10788   10868   10776  XICS  Level  eth4
275:         0       0       0       0       0       0       0       0  XICS  Level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277:    234933  230275  229770  234048  235906  229858  229975  233859  XICS  Level  eth6
278:    266225  267606  262844  265985  268789  266869  263110  267422  XICS  Level  eth7
279:       893     919     857     909     867     917     894     881  XICS  Level  eth0
305:    439246  439117  438495  436072  438053  440111  438973  438951  XICS  Level  eth0 Neterion Xframe II 10GbE network adapter
321:      3268    3088    3143    3113    3305    2982    3326    3084  XICS  Level  ipr
323:    268030  273207  269710  271338  270306  273258  270872  273281  XICS  Level  eth16
324:    215012  221102  219494  216732  216531  220460  219718  218654  XICS  Level  eth17
325:      7103    3580    7246    3475    7132    3394    7258    3435  XICS  Level  pata_pdc2027x
BAD:      4216

Thanks,

--
Breno Leitao [EMAIL PROTECTED]
Re: e1000 performance issue in 4 simultaneous links
Many, many things to check when running netperf :)

*) Are the cards on the same or separate PCI-mumble busses, and what sort of bus?
*) Is the two-interface performance two interfaces on the same four-port card, or an interface from each of the two four-port cards?
*) Is there a dreaded (IMO) irqbalance daemon running? One of the very first things I do when running netperf is terminate the irqbalance daemon with as extreme a prejudice as I can.
*) What is the distribution of interrupts from the interfaces to the CPUs? If you've tried to set that manually, the dreaded irqbalance daemon will come along shortly thereafter and ruin everything.
*) What does netperf say about the overall CPU utilization of the system(s) when the tests are running?
*) What does top say about the utilization of any single CPU in the system(s) when the tests are running?
*) Are you using the global -T option to spread the netperf/netserver processes across the CPUs, or leaving that all up to the stack/scheduler/etc?

I suspect there could be more, but that is what comes to mind thus far as far as things I often check when running netperf.

rick jones

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
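[Editorial note: the "distribution of interrupts" check in Rick's list can be automated with a small script that diffs two snapshots of /proc/interrupts. A sketch only; the `eth` pattern and the /proc/interrupts column layout are assumptions about the reader's system.]

```shell
#!/bin/sh
# Sketch: print per-CPU interrupt-count deltas for NIC IRQ lines between
# two /proc/interrupts snapshots, to see whether each IRQ sticks to one
# CPU or wanders across all of them.
irq_delta() {
    # $1, $2: before/after snapshot files; $3: pattern for lines to keep.
    awk -v pat="$3" '
        $0 !~ pat { next }
        FNR == NR {
            # First file: remember each numeric per-CPU count by IRQ and column.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) a[$1, i] = $i
            next
        }
        {
            # Second file: emit the per-column difference.
            out = $1
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                out = out " " ($i - a[$1, i])
            print out
        }
    ' "$1" "$2"
}

# Typical use on a live system (commented out in this sketch):
# cp /proc/interrupts /tmp/irq.before; sleep 10
# irq_delta /tmp/irq.before /proc/interrupts eth
```

A healthy pinned setup shows each IRQ's delta concentrated in a single CPU column.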
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced across the CPUs, as I see in /proc/interrupts:
> [/proc/interrupts output snipped]

which is wrong and hurts performance. You want your ethernet IRQs to stick to a CPU for long times to prevent cache thrash. Please disable the in-kernel IRQ balancing code and use the userspace `irqbalance` daemon.

Gee, I should put that in my signature, I already wrote that twice today :)

Auke
Re: e1000 performance issue in 4 simultaneous links
> I also tried to increase my interface MTU to 9000, but I am afraid that netperf only transmits packets with less than 1500 bytes. Still investigating.

It may seem like picking a tiny nit, but netperf never transmits packets. It only provides buffers of a specified size to the stack. It is then the stack which transmits and determines the size of the packets on the network.

Drifting a bit more... While there are settings, conditions and known stack behaviours where one can be confident of the packet size on the network based on the options passed to netperf, generally speaking one should not ass-u-me a direct relationship between the options one passes to netperf and the size of the packets on the network.

And for JumboFrames to be effective it must be set on both ends, otherwise the TCP MSS exchange will result in the smaller of the two MTUs winning, as it were.

>> Even with these optimisations, if all RX processing is done on a single CPU this can become a bottleneck. Does the test system have multiple CPUs? Are IRQs for the multiple NICs balanced across multiple CPUs?
> Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced across the CPUs, as I see in /proc/interrupts:

That suggests to me anyway that the dreaded irqbalanced is running, shuffling the interrupts as you go. Not often a happy place for running netperf when one wants consistent results.
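[Editorial note: the MSS exchange described above can be sketched numerically. This assumes IPv4 with no IP or TCP options, i.e. 40 bytes of headers; real connections may subtract more for options like timestamps.]

```shell
#!/bin/sh
# Sketch: each end derives its advertised MSS from its own MTU, and the
# connection uses the smaller advertisement, so a jumbo MTU on only one
# end buys nothing.
mss_for_mtu() {
    # Link MTU minus IPv4 header (20) and TCP header (20).
    echo $(( $1 - 40 ))
}

effective_mss() {
    a=$(mss_for_mtu "$1")
    b=$(mss_for_mtu "$2")
    if [ "$a" -lt "$b" ]; then echo "$a"; else echo "$b"; fi
}

effective_mss 9000 1500   # jumbo on one end only -> 1460
effective_mss 9000 9000   # jumbo on both ends    -> 8960
```

This is why raising the MTU to 9000 on Breno's side alone would still produce ~1500-byte frames on the wire.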
> # cat /proc/interrupts
> [full /proc/interrupts output snipped]

IMO, what you want (in the absence of multi-queue NICs) is one CPU taking the interrupts of one port/interface, and each port/interface's interrupts going to a separate CPU. So, something that looks roughly like this concocted example:

      CPU0  CPU1  CPU2  CPU3
 1:   1234     0     0     0   eth0
 2:      0  1234     0     0   eth1
 3:      0     0  1234     0   eth2
 4:      0     0     0  1234   eth3

which you should be able to achieve via the method I think someone else has already mentioned, of echoing values into /proc/irq/<irq>/smp_affinity - after you have slain the dreaded irqbalance daemon.

rick jones
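[Editorial note: a sketch of that pinning. The IRQ numbers and CPU assignments are taken from Breno's interrupt table but should be treated as placeholders; smp_affinity takes a hex CPU bitmask, and the actual write is shown commented out since it needs root on a live system.]

```shell
#!/bin/sh
# Sketch: pin each NIC IRQ to its own CPU via /proc/irq/<irq>/smp_affinity.
# Stop the irqbalance daemon first, or it will undo this shortly after.
cpu_mask() {
    # CPU number -> one-bit hex affinity mask (e.g. CPU 3 -> 8).
    printf '%x\n' $(( 1 << $1 ))
}

pin_irq() {
    irq=$1; cpu=$2
    mask=$(cpu_mask "$cpu")
    echo "IRQ $irq -> CPU $cpu (mask $mask)"
    # On a live system, as root, the actual write would be:
    #   echo "$mask" > "/proc/irq/$irq/smp_affinity"
}

# eth6/eth7/eth16/eth17 IRQ numbers from the thread, one CPU each.
pin_irq 277 1
pin_irq 278 2
pin_irq 323 3
pin_irq 324 4
```

Rereading /proc/interrupts afterwards should show each eth counter growing on exactly one CPU column, as in Rick's concocted example.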
RE: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.

This is actually a known issue that we have worked on with your company before. It comes down to your system's default behavior of round-robining interrupts (see cat /proc/interrupts while running the test) combined with e1000's way of exiting / rescheduling NAPI.

The default round robin behavior of the interrupts on your system is the root cause of this issue, and here is what happens: 4 interfaces start generating interrupts, and if you're lucky the round robin balancer has them all on different CPUs. As the e1000 driver goes into and out of polling mode, the round robin balancer keeps moving the interrupt to the next CPU. Eventually 2 or more driver instances end up on the same CPU, which causes both driver instances to stay in NAPI polling mode, due to the amount of work being done, and the fact that there are always more than netdev->weight packets to do for each instance. This keeps *hardware* interrupts for each interface *disabled*.

Staying in NAPI polling mode causes higher CPU utilization on that one processor, which guarantees that when the hardware round robin balancer moves any other network interrupt onto that CPU, it too will join the NAPI polling mode chain. So no matter how many processors you have, with this round robin style of hardware interrupts, it is guaranteed that if there is a lot of work to do (more than weight) at each softirq, then all network interfaces will end up on the same CPU eventually (the busiest one). Your performance becomes the same as if you had booted with maxcpus=1.

I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA.
In general the behavior of hardware round robin balancing is bad and I'm sure it is causing all sorts of other performance issues that you may not even be aware of.

I'm sure your problem will go away if you run e1000 in interrupt mode (use make CFLAGS_EXTRA=-DE1000_NO_NAPI).

> If I run the same test against 2 interfaces I get a 940 * 10^6 bits/sec transfer rate also, and if I run it against 3 interfaces I get around 850 * 10^6 bits/sec performance. I got these results using the upstream netdev-2.6 branch kernel plus David Miller's set of 7 NAPI patches[1]. In the kernel 2.6.23.12 the result is a bit worse, and the transfer rate was around 600 * 10^6 bits/sec.

Thank you for testing the latest kernel.org kernel. Hope this helps.
Re: e1000 performance issue in 4 simultaneous links
From: Brandeburg, Jesse [EMAIL PROTECTED]
Date: Thu, 10 Jan 2008 12:52:15 -0800

> I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA.

Absolutely agreed on all counts.

No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.