Re: e1000 performance issue in 4 simultaneous links
Denys Fedoryshchenko wrote:
> Sorry that I interfere in this subject. Do you recommend CONFIG_IRQBALANCE to be enabled?

I certainly do not. Manual tweaking and pinning the IRQs to the correct CPU will give the best performance (for specific loads). The userspace irqbalance daemon tries very hard to approximate this behaviour and is what I recommend for most situations: it usually does the right thing, and does so without making your head spin (just start it). The in-kernel one usually does the wrong thing for network loads.

Cheers,

Auke

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e1000 performance issue in 4 simultaneous links
David Miller [EMAIL PROTECTED] writes:
> No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.

Does irqbalanced need to be taught about this? And how about the initial balancing, so that each network card gets assigned to one CPU?

/Benny
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao a écrit :
> On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
>> Breno Leitao wrote:
>>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
>> I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA. In general the behavior of hardware round robin balancing is bad and I'm sure it is causing all sorts of other performance issues that you may not even be aware of.
>
> I've made another test removing the ppc IRQ round robin scheme, bound each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1, CPU2, CPU3 and CPU4), and I also get around 720 * 10^6 bits/s on average. Take a look at the interrupt table this time:
>
> io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
> 277:      15 1362450      13      14      13      14      15      18   XICS  Level  eth6
> 278:      12      13 1348681      19      13      15      10      11   XICS  Level  eth7
> 323:      11      18      17 1348426      18      11      11      13   XICS  Level  eth16
> 324:      12      16      11      19 1402709      13      14      11   XICS  Level  eth17
>
> I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using the noirqdistrib boot parameter, and the performance was a little worse.
>
> Rick, the 2-interface test that I showed in my first email was run on two different NICs. Also, I am running netperf with the command "netperf -H hostname -T 0,8" while netserver is running without any argument at all. Running vmstat in parallel shows that there is no bottleneck in the CPU. Take a look:
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b   swpd    free  buff  cache   si   so   bi   bo    in   cs us sy id wa st
>  2  0      0 6714732 16168 227440    0    0    8    2   203   21  0  1 98  0  0
>  0  0      0 6715120 16176 227440    0    0    0   28 16234  505  0 16 83  0  1
>  0  0      0 6715516 16176 227440    0    0    0    0 16251  518  0 16 83  0  1
>  1  0      0 6715252 16176 227440    0    0    0    1 16316  497  0 15 84  0  1
>  0  0      0 6716092 16176 227440    0    0    0    0 16300  520  0 16 83  0  1
>  0  0      0 6716320 16180 227440    0    0    0    1 16354  486  0 15 84  0  1

If your machine has 8 CPUs, then your vmstat output shows a bottleneck :) (100/8 = 12.5), so I guess one of your CPUs is full.
Re: e1000 performance issue in 4 simultaneous links
Maybe good idea to use sysstat? http://perso.wanadoo.fr/sebastien.godard/

For example:

visp-1 ~ # mpstat -P ALL 1
Linux 2.6.24-rc7-devel (visp-1)  01/11/08

19:27:57  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle   intr/s
19:27:58  all   0.00   0.00  0.00     0.00  0.00   2.51    0.00  97.49  7707.00
19:27:58    0   0.00   0.00  0.00     0.00  0.00   4.00    0.00  96.00  1926.00
19:27:58    1   0.00   0.00  0.00     0.00  0.00   1.01    0.00  98.99  1926.00
19:27:58    2   0.00   0.00  0.00     0.00  0.00   5.00    0.00  95.00  1927.00
19:27:58    3   0.00   0.00  0.00     0.00  0.00   0.99    0.00  99.01  1927.00
19:27:58    4   0.00   0.00  0.00     0.00  0.00   0.00    0.00   0.00     0.00

[quoted text from Breno Leitao's test report and Eric Dumazet's reply snipped]

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
Re: e1000 performance issue in 4 simultaneous links
On Fri, 2008-01-11 at 17:48 +0100, Eric Dumazet wrote:
> If your machine has 8 cpus, then your vmstat output shows a bottleneck :) (100/8 = 12.5), so I guess one of your CPUs is full

Well, if I run top while running the test, I see this load distributed among the CPUs, mainly those that have a NIC IRQ bound. Take a look:

Tasks: 133 total, 2 running, 130 sleeping, 0 stopped, 1 zombie
Cpu0 :  0.3%us, 19.5%sy,  0.0%ni,  73.5%id,  0.0%wa,  0.0%hi,  0.0%si,  6.6%st
Cpu1 :  0.0%us,  0.0%sy,  0.0%ni,  75.1%id,  0.0%wa,  0.7%hi, 24.3%si,  0.0%st
Cpu2 :  0.0%us,  0.0%sy,  0.0%ni,  73.1%id,  0.0%wa,  0.7%hi, 26.2%si,  0.0%st
Cpu3 :  0.0%us,  0.0%sy,  0.0%ni,  76.1%id,  0.0%wa,  0.7%hi, 23.3%si,  0.0%st
Cpu4 :  0.0%us,  0.3%sy,  0.0%ni,  70.4%id,  0.7%wa,  0.3%hi, 28.2%si,  0.0%st
Cpu5 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6 :  0.0%us,  0.0%sy,  0.0%ni,  99.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu7 :  0.0%us,  0.0%sy,  0.0%ni, 100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Note that this average scenario doesn't change during the entire benchmarking test.

Thanks!

--
Breno Leitao [EMAIL PROTECTED]
Re: e1000 performance issue in 4 simultaneous links
Hello Denys,

I've installed sysstat (good tools!) and the result is very similar to the one which appears in top. Take a look:

13:34:23  CPU  %user  %nice   %sys  %iowait  %irq  %soft  %steal   %idle    intr/s
13:34:24  all   0.00   0.00   2.72     0.00  0.25  12.13    0.99   83.91  16267.33
13:34:24    0   0.00   0.00  21.78     0.00  0.00   0.00    7.92   70.30     40.59
13:34:24    1   0.00   0.00   0.00     0.00  0.99  24.75    0.00   74.26   4025.74
13:34:24    2   0.00   0.00   0.00     0.00  0.99  24.75    0.00   74.26   4036.63
13:34:24    3   0.00   0.00   0.00     0.00  0.99  21.78    0.00   77.23   4032.67
13:34:24    4   0.00   0.00   0.00     0.00  0.98  24.51    0.00   74.51   4034.65
13:34:24    5   0.00   0.00   0.00     0.00  0.00   0.00    0.00  100.00     30.69
13:34:24    6   0.00   0.00   0.00     0.00  0.00   0.00    0.00  100.00     33.66
13:34:24    7   0.00   0.00   0.00     0.00  0.00   0.00    0.00  100.00     32.67

So we can be sure that the IRQs are not being balanced, and that no processor is overloaded.

Thanks!

On Fri, 2008-01-11 at 19:36 +0200, Denys Fedoryshchenko wrote:
> Maybe good idea to use sysstat? http://perso.wanadoo.fr/sebastien.godard/
RE: e1000 performance issue in 4 simultaneous links
On Thu, 2008-01-10 at 12:52 -0800, Brandeburg, Jesse wrote:
> Breno Leitao wrote:
>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
> I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA. In general the behavior of hardware round robin balancing is bad and I'm sure it is causing all sorts of other performance issues that you may not even be aware of.

I've made another test removing the ppc IRQ round robin scheme, bound each interface (eth6, eth7, eth16 and eth17) to a different CPU (CPU1, CPU2, CPU3 and CPU4), and I also get around 720 * 10^6 bits/s on average. Take a look at the interrupt table this time:

io-dolphins:~/leitao # cat /proc/interrupts | grep eth[1]*[67]
277:      15 1362450      13      14      13      14      15      18   XICS  Level  eth6
278:      12      13 1348681      19      13      15      10      11   XICS  Level  eth7
323:      11      18      17 1348426      18      11      11      13   XICS  Level  eth16
324:      12      16      11      19 1402709      13      14      11   XICS  Level  eth17

I also tried to bind all 4 interface IRQs to a single CPU (CPU0) using the noirqdistrib boot parameter, and the performance was a little worse.

Rick, the 2-interface test that I showed in my first email was run on two different NICs. Also, I am running netperf with the command "netperf -H hostname -T 0,8" while netserver is running without any argument at all. Running vmstat in parallel shows that there is no bottleneck in the CPU. Take a look:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd    free  buff  cache   si   so   bi   bo    in   cs us sy id wa st
 2  0      0 6714732 16168 227440    0    0    8    2   203   21  0  1 98  0  0
 0  0      0 6715120 16176 227440    0    0    0   28 16234  505  0 16 83  0  1
 0  0      0 6715516 16176 227440    0    0    0    0 16251  518  0 16 83  0  1
 1  0      0 6715252 16176 227440    0    0    0    1 16316  497  0 15 84  0  1
 0  0      0 6716092 16176 227440    0    0    0    0 16300  520  0 16 83  0  1
 0  0      0 6716320 16180 227440    0    0    0    1 16354  486  0 15 84  0  1

Thanks!

--
Breno Leitao [EMAIL PROTECTED]
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> Well, if I run top while running the test, I see this load distributed among the CPUs, mainly those that have a NIC IRQ bound. Take a look:
> [top output snipped]

If you have IRQs bound to CPUs 1-4, and have four netperfs running, given that the stack ostensibly tries to have applications run on the same CPUs, what is running on CPU0? Is it related to:

> The 2-interface test that I showed in my first email was run on two different NICs. Also, I am running netperf with the following command, "netperf -H hostname -T 0,8", while netserver is running without any argument at all. Also, running vmstat in parallel shows that there is no bottleneck in the CPU. Take a look:

Unless you have a morbid curiosity :) there isn't much point in binding all the netperfs to CPU 0 when the interrupts for the NICs servicing their connections are on CPUs 1-4.
I also assume then that the system(s) on which netserver is running have 8 CPUs in them? (There are multiple destination systems, yes?)

Does anything change if you explicitly bind each netperf to the CPU on which the interrupts for its connection are processed? Or, for that matter, if you remove the -T option entirely?

Does UDP_STREAM show different performance than TCP_STREAM? (I'm ass-u-me-ing, based on the above, that we are looking at the netperf side of a TCP_STREAM test; please correct if otherwise.)

Are the CPUs above single-core CPUs or multi-core CPUs, and if multi-core, are caches shared? How are CPUs numbered if multi-core on that system? Is there any hardware threading involved? I'm wondering if there may be some wrinkles in the system that might lead to reported CPU utilization being low even if a chip is otherwise saturated. Might need some HW counters to check that...

Can you describe the I/O subsystem more completely? I understand that you are using at most two ports of a pair of quad-port cards at any one time, but am still curious to know if those two cards are on separate busses, or if they share any bus/link on the way to memory.

rick jones
Re: e1000 performance issue in 4 simultaneous links
From: Benny Amorsen [EMAIL PROTECTED]
Date: Fri, 11 Jan 2008 12:09:32 +0100

> David Miller [EMAIL PROTECTED] writes:
>> No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.
> Does irqbalanced need to be taught about this?

The userland one already does. It's only the in-kernel IRQ load balancing for these (presumably powerpc) platforms that is broken.
Re: e1000 performance issue in 4 simultaneous links
Sorry that I interfere in this subject. Do you recommend CONFIG_IRQBALANCE to be enabled? If it is enabled, IRQs are not jumping nonstop over processors; softirqd changes this behavior. If it is disabled, IRQs are distributed over each processor, and on loaded systems it seems harmful. I worked a little yesterday with a server with CONFIG_IRQBALANCE=no under 160 kpps load. It was losing packets until I set smp_affinity. Maybe it is useful to put more info in Kconfig, since it is a very important option for performance.

On Fri, 11 Jan 2008 17:41:09 -0800 (PST), David Miller wrote:
> From: Benny Amorsen [EMAIL PROTECTED]
> Date: Fri, 11 Jan 2008 12:09:32 +0100
>>> No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.
>> Does irqbalanced need to be taught about this?
> The userland one already does. It's only the in-kernel IRQ load balancing for these (presumably powerpc) platforms that is broken.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> Hello, I've perceived that there is a performance issue when running netperf against 4 e1000 links connected end-to-end to another machine with 4 e1000 interfaces. I have 2 4-port cards on my machine, but the test is just using 2 ports on each card.
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
<snip>

I take it that's the average for individual interfaces, not the aggregate?

RX processing for multiple gigabits per second can be quite expensive. This can be mitigated by interrupt moderation and NAPI polling, jumbo frames (MTU > 1500) and/or Large Receive Offload (LRO). I don't think e1000 hardware does LRO, but the driver could presumably be changed to use Linux's software LRO.

Even with these optimisations, if all RX processing is done on a single CPU this can become a bottleneck. Does the test system have multiple CPUs? Are IRQs for the multiple NICs balanced across multiple CPUs?

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
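[Editorial note: the mitigations Ben names (interrupt moderation, jumbo frames) have userspace knobs. A minimal sketch, assuming a hypothetical interface eth6 and example coalescing values; whether a given e1000 kernel honors `ethtool -C rx-usecs` varies, so treat the specific knob as an assumption. The script defaults to a dry run that only prints the commands.]

```shell
#!/bin/sh
# Sketch: userspace tuning for the RX-cost mitigations discussed above.
# Interface name and values are placeholders, not from the thread.
IFACE=${IFACE:-eth6}
DRY_RUN=${DRY_RUN:-1}

run() {
    # In dry-run mode just print what would be executed.
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Jumbo frames: raise the MTU (both ends must agree, see the MSS
# discussion later in the thread).
run ip link set dev "$IFACE" mtu 9000

# Interrupt moderation: batch RX work into fewer interrupts.
run ethtool -C "$IFACE" rx-usecs 125
```

On a live system one would rerun with DRY_RUN=0 as root.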
Re: e1000 performance issue in 4 simultaneous links
Ben, I am facing a performance issue when we try to bond multiple interfaces into one virtual interface. It could be related to this thread. My questions are:

*) When we use multiple NICs, will the performance of the overall system be the summation of all the individual links' XX bits/sec?
*) What factors improve the performance if we have multiple interfaces? [kind of tuning the parameters in proc]

Breno, I hope this thread will be helpful for the performance issue which I have with the bonding driver.

Jeba

On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
> [quoted text snipped]
Re: e1000 performance issue in 4 simultaneous links
On Thu, 2008-01-10 at 16:36 +0000, Ben Hutchings wrote:
>> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.
> <snip>
> I take it that's the average for individual interfaces, not the aggregate?

Right, each of these results is for an individual interface. Otherwise, we'd have a huge problem. :-)

> This can be mitigated by interrupt moderation and NAPI polling, jumbo frames (MTU > 1500) and/or Large Receive Offload (LRO). I don't think e1000 hardware does LRO, but the driver could presumably be changed to use Linux's software LRO.

Without using these features and keeping the MTU at 1500, do you think we could get better performance than this? I also tried to increase my interface MTU to 9000, but I am afraid that netperf only transmits packets with less than 1500 bytes. Still investigating.

> Even with these optimisations, if all RX processing is done on a single CPU this can become a bottleneck. Does the test system have multiple CPUs? Are IRQs for the multiple NICs balanced across multiple CPUs?

Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced across the CPUs, as I see in /proc/interrupts:

# cat /proc/interrupts
          CPU0    CPU1    CPU2    CPU3    CPU4    CPU5    CPU6    CPU7
 16:       940     760    1047     904     993     777     975     813  XICS  Level  IPI
 18:         4       3       4       1       3       6       8       3  XICS  Level  hvc_console
 19:         0       0       0       0       0       0       0       0  XICS  Level  RAS_EPOW
273:     10728   10850   10937   10833   10884   10788   10868   10776  XICS  Level  eth4
275:         0       0       0       0       0       0       0       0  XICS  Level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
277:    234933  230275  229770  234048  235906  229858  229975  233859  XICS  Level  eth6
278:    266225  267606  262844  265985  268789  266869  263110  267422  XICS  Level  eth7
279:       893     919     857     909     867     917     894     881  XICS  Level  eth0
305:    439246  439117  438495  436072  438053  440111  438973  438951  XICS  Level  eth0 Neterion Xframe II 10GbE network adapter
321:      3268    3088    3143    3113    3305    2982    3326    3084  XICS  Level  ipr
323:    268030  273207  269710  271338  270306  273258  270872  273281  XICS  Level  eth16
324:    215012  221102  219494  216732  216531  220460  219718  218654  XICS  Level  eth17
325:      7103    3580    7246    3475    7132    3394    7258    3435  XICS  Level  pata_pdc2027x
BAD:      4216

Thanks,

--
Breno Leitao [EMAIL PROTECTED]
Re: e1000 performance issue in 4 simultaneous links
Many, many things to check when running netperf :)

*) Are the cards on the same or separate PCI-mumble busses, and what sort of bus?
*) Is the two-interface performance two interfaces on the same four-port card, or an interface from each of the two four-port cards?
*) Is there a dreaded (IMO) irqbalance daemon running? One of the very first things I do when running netperf is terminate the irqbalance daemon with as extreme a prejudice as I can.
*) What is the distribution of interrupts from the interfaces to the CPUs? If you've tried to set that manually, the dreaded irqbalance daemon will come along shortly thereafter and ruin everything.
*) What does netperf say about the overall CPU utilization of the system(s) when the tests are running?
*) What does top say about the utilization of any single CPU in the system(s) when the tests are running?
*) Are you using the global -T option to spread the netperf/netserver processes across the CPUs, or leaving that all up to the stack/scheduler/etc?

I suspect there could be more, but that is what comes to mind thus far as far as things I often check when running netperf.

rick jones

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
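[Editorial note: the "distribution of interrupts" check in Rick's list can be automated with a small script that diffs two snapshots of /proc/interrupts. A sketch only; the `eth` pattern and the /proc/interrupts column layout are assumptions about the reader's system.]

```shell
#!/bin/sh
# Sketch: print per-CPU interrupt-count deltas for NIC IRQ lines between
# two /proc/interrupts snapshots, to see whether each IRQ sticks to one
# CPU or wanders across all of them.
irq_delta() {
    # $1, $2: before/after snapshot files; $3: pattern for lines to keep.
    awk -v pat="$3" '
        $0 !~ pat { next }
        FNR == NR {
            # First file: remember each numeric per-CPU count by IRQ and column.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) a[$1, i] = $i
            next
        }
        {
            # Second file: emit the per-column difference.
            out = $1
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                out = out " " ($i - a[$1, i])
            print out
        }
    ' "$1" "$2"
}

# Typical use on a live system (commented out in this sketch):
# cp /proc/interrupts /tmp/irq.before; sleep 10
# irq_delta /tmp/irq.before /proc/interrupts eth
```

A healthy pinned setup shows each IRQ's delta concentrated in a single CPU column.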
Re: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced across the CPUs, as I see in /proc/interrupts:
> [/proc/interrupts output snipped]

which is wrong and hurts performance. You want your ethernet IRQs to stick to a CPU for long times to prevent cache thrash. Please disable the in-kernel IRQ balancing code and use the userspace `irqbalance` daemon.

Gee, I should put that in my signature, I already wrote that twice today :)

Auke
Re: e1000 performance issue in 4 simultaneous links
> I also tried to increase my interface MTU to 9000, but I am afraid that netperf only transmits packets with less than 1500 bytes. Still investigating.

It may seem like picking a tiny nit, but netperf never transmits packets. It only provides buffers of a specified size to the stack. It is then the stack which transmits and determines the size of the packets on the network.

Drifting a bit more... While there are settings, conditions and known stack behaviours where one can be confident of the packet size on the network based on the options passed to netperf, generally speaking one should not ass-u-me a direct relationship between the options one passes to netperf and the size of the packets on the network.

And for JumboFrames to be effective it must be set on both ends, otherwise the TCP MSS exchange will result in the smaller of the two MTUs winning, as it were.

>> Even with these optimisations, if all RX processing is done on a single CPU this can become a bottleneck. Does the test system have multiple CPUs? Are IRQs for the multiple NICs balanced across multiple CPUs?
> Yes, this machine has 8 ppc 1.9GHz CPUs, and the IRQs are balanced across the CPUs, as I see in /proc/interrupts:

That suggests to me anyway that the dreaded irqbalanced is running, shuffling the interrupts as you go. Not often a happy place for running netperf when one wants consistent results.
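[Editorial note: the MSS exchange described above can be sketched numerically. This assumes IPv4 with no IP or TCP options, i.e. 40 bytes of headers; real connections may subtract more for options like timestamps.]

```shell
#!/bin/sh
# Sketch: each end derives its advertised MSS from its own MTU, and the
# connection uses the smaller advertisement, so a jumbo MTU on only one
# end buys nothing.
mss_for_mtu() {
    # Link MTU minus IPv4 header (20) and TCP header (20).
    echo $(( $1 - 40 ))
}

effective_mss() {
    a=$(mss_for_mtu "$1")
    b=$(mss_for_mtu "$2")
    if [ "$a" -lt "$b" ]; then echo "$a"; else echo "$b"; fi
}

effective_mss 9000 1500   # jumbo on one end only -> 1460
effective_mss 9000 9000   # jumbo on both ends    -> 8960
```

This is why raising the MTU to 9000 on Breno's side alone would still produce ~1500-byte frames on the wire.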
> # cat /proc/interrupts
> [full /proc/interrupts output snipped]

IMO, what you want (in the absence of multi-queue NICs) is one CPU taking the interrupts of one port/interface, and each port/interface's interrupts going to a separate CPU. So, something that looks roughly like this concocted example:

      CPU0  CPU1  CPU2  CPU3
 1:   1234     0     0     0   eth0
 2:      0  1234     0     0   eth1
 3:      0     0  1234     0   eth2
 4:      0     0     0  1234   eth3

which you should be able to achieve via the method I think someone else has already mentioned, of echoing values into /proc/irq/<irq>/smp_affinity - after you have slain the dreaded irqbalance daemon.

rick jones
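[Editorial note: a sketch of that pinning. The IRQ numbers and CPU assignments are taken from Breno's interrupt table but should be treated as placeholders; smp_affinity takes a hex CPU bitmask, and the actual write is shown commented out since it needs root on a live system.]

```shell
#!/bin/sh
# Sketch: pin each NIC IRQ to its own CPU via /proc/irq/<irq>/smp_affinity.
# Stop the irqbalance daemon first, or it will undo this shortly after.
cpu_mask() {
    # CPU number -> one-bit hex affinity mask (e.g. CPU 3 -> 8).
    printf '%x\n' $(( 1 << $1 ))
}

pin_irq() {
    irq=$1; cpu=$2
    mask=$(cpu_mask "$cpu")
    echo "IRQ $irq -> CPU $cpu (mask $mask)"
    # On a live system, as root, the actual write would be:
    #   echo "$mask" > "/proc/irq/$irq/smp_affinity"
}

# eth6/eth7/eth16/eth17 IRQ numbers from the thread, one CPU each.
pin_irq 277 1
pin_irq 278 2
pin_irq 323 3
pin_irq 324 4
```

Rereading /proc/interrupts afterwards should show each eth counter growing on exactly one CPU column, as in Rick's concocted example.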
RE: e1000 performance issue in 4 simultaneous links
Breno Leitao wrote:
> When I run netperf in just one interface, I get 940.95 * 10^6 bits/sec of transfer rate. If I run 4 netperf against 4 different interfaces, I get around 720 * 10^6 bits/sec.

This is actually a known issue that we have worked on with your company before. It comes down to your system's default behavior of round-robining interrupts (see cat /proc/interrupts while running the test) combined with e1000's way of exiting / rescheduling NAPI.

The default round robin behavior of the interrupts on your system is the root cause of this issue, and here is what happens: 4 interfaces start generating interrupts, and if you're lucky the round robin balancer has them all on different CPUs. As the e1000 driver goes into and out of polling mode, the round robin balancer keeps moving the interrupt to the next CPU. Eventually 2 or more driver instances end up on the same CPU, which causes both driver instances to stay in NAPI polling mode, due to the amount of work being done, and the fact that there are always more than netdev->weight packets to do for each instance. This keeps *hardware* interrupts for each interface *disabled*.

Staying in NAPI polling mode causes higher CPU utilization on that one processor, which guarantees that when the hardware round robin balancer moves any other network interrupt onto that CPU, it too will join the NAPI polling mode chain. So no matter how many processors you have, with this round robin style of hardware interrupts, it is guaranteed that if there is a lot of work to do (more than weight) at each softirq, then all network interfaces will end up on the same CPU eventually (the busiest one). Your performance becomes the same as if you had booted with maxcpus=1.

I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA.
In general the behavior of hardware round robin balancing is bad and I'm sure it is causing all sorts of other performance issues that you may not even be aware of.

I'm sure your problem will go away if you run e1000 in interrupt mode (use make CFLAGS_EXTRA=-DE1000_NO_NAPI).

> If I run the same test against 2 interfaces I get a 940 * 10^6 bits/sec transfer rate also, and if I run it against 3 interfaces I get around 850 * 10^6 bits/sec performance. I got these results using the upstream netdev-2.6 branch kernel plus David Miller's set of 7 NAPI patches[1]. In the kernel 2.6.23.12 the result is a bit worse, and the transfer rate was around 600 * 10^6 bits/sec.

Thank you for testing the latest kernel.org kernel. Hope this helps.
Re: e1000 performance issue in 4 simultaneous links
From: Brandeburg, Jesse [EMAIL PROTECTED]
Date: Thu, 10 Jan 2008 12:52:15 -0800

> I hope this explanation makes sense, but what it comes down to is that combining hardware round robin balancing with NAPI is a BAD IDEA.

Absolutely agreed on all counts.

No IRQ balancing should be done at all for networking device interrupts, with zero exceptions. It destroys performance.