RE: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Jesse,


It's good to be talking directly to one of the e1000 developers and
maintainers.  Although at this point I am starting to think that the
issue may be TCP stack related and nothing to do with the NIC.  Am I
correct that these are quite distinct parts of the kernel?


Yes, quite.


OK.  I hope that there is also someone knowledgeable about the TCP stack 
who is following this thread. (Perhaps you also know this part of the 
kernel, but I am assuming that your expertise is on the e1000/NIC bits.)



Important note: we ARE able to get full duplex wire speed (over 900
Mb/s simultaneously in both directions) using UDP.  The problems occur
only with TCP connections.


That eliminates bus bandwidth issues, probably, but small packets take
up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.


I see.  Your concern is the extra ACK packets associated with TCP.  Even 
though these represent a small volume of data (around 5% with MTU=1500, and 
less at larger MTU) they double the number of packets that must be handled 
by the system compared to UDP transmission at the same data rate. Is that 
correct?
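
(As a back-of-the-envelope check -- assuming TCP timestamps and, pessimistically, 
one ACK per data segment; delayed ACKs would roughly halve the byte overhead:)

  echo '1500+14+4+8+12' | bc        # 1538 bytes on the wire per full-MTU data frame
  echo '20+20+12+14+4+8+12' | bc    # 90 bytes on the wire per pure ACK
  echo '10^9/(1538*8)' | bc         # ~81274 data frames/s at 1 Gb/s
  echo '81274*90*8/10^6' | bc       # ~58 Mb/s of ACK traffic, i.e. a few percent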



I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in
Germany).  So we'll provide this info in ~10 hours.


I would suggest you try TCP_RR with a command line something like this:
netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K

I think you'll have to compile netperf with burst mode support enabled.
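
(For reference, a minimal build-and-run sketch -- assuming a netperf source tree, 
with 'hostname' standing in for the peer that is running netserver:)

  ./configure --enable-burst && make && make install   # enables the -b burst option
  netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K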


I just saw Carsten a few minutes ago.  He has to take part in a 
'Baubesprechung' (construction meeting) this morning, after which he will start answering 
the technical questions and doing additional testing as suggested by you 
and others.  If you are on the US west coast, he should have some answers 
and results posted by Thursday morning Pacific time.



I assume that the interrupt load is distributed among all four cores
-- the default affinity is 0xff, and I also assume that there is some
type of interrupt aggregation taking place in the driver.  If the
CPUs were not able to service the interrupts fast enough, I assume
that we would also see loss of performance with UDP testing.


One other thing you can try with e1000 is disabling the dynamic
interrupt moderation by loading the driver with
InterruptThrottleRate=8000,8000,... (the number of commas depends on
your number of ports) which might help in your particular benchmark.


OK.  Is 'dynamic interrupt moderation' another name for 'interrupt
aggregation'?  Meaning that if more than one interrupt is generated
in a given time interval, then they are replaced by a single
interrupt?


Yes, InterruptThrottleRate=8000 means there will be no more than 8000
ints/second from that adapter, and if interrupts are generated faster
than that they are aggregated.

Interestingly since you are interested in ultra low latency, and may be
willing to give up some cpu for it during bulk transfers you should try
InterruptThrottleRate=1 (can generate up to 70000 ints/s)
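
(Concretely, a reload along these lines -- a sketch only, assuming a two-port 
machine whose network can be taken down for a moment:)

  rmmod e1000
  modprobe e1000 InterruptThrottleRate=8000,8000    # fixed 8000 ints/s per port
  # or, for the dynamic mode discussed above:
  # modprobe e1000 InterruptThrottleRate=1,1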


I'm not sure it's quite right to say that we are interested in ultra low 
latency. Most of our network transfers involve bulk data movement (a few 
MB or more).  We don't care so much about low latency (meaning how long it 
takes the FIRST byte of data to travel from sender to receiver).  We care 
about aggregate bandwidth: once the pipe is full, how fast can data be 
moved through it. So we don't care so much if getting the pipe full takes 
20 us or 50 us.  We just want the data to flow fast once the pipe IS full.



Welcome, it's an interesting discussion.  Hope we can come to a good
conclusion.


Thank you. Carsten will post more info and answers later today.

Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Sangtae,

Thanks for joining this discussion -- it's good to have a CUBIC author and 
expert here!


In our application (cluster computing) we use a very tightly coupled 
high-speed low-latency network.  There is no 'wide area traffic'.  So 
it's hard for me to understand why any networking components or 
software layers should take more than milliseconds to ramp up or back 
off in speed. Perhaps we should be asking for a TCP congestion 
avoidance algorithm which is designed for a data center environment 
where there are very few hops and typical packet delivery times are 
tens or hundreds of microseconds. It's very different than delivering 
data thousands of km across a WAN.


If your network latency is low, then regardless of the type of protocol you 
should get more than 900 Mbps.


Yes, this is also what I had thought.

In the graph that we posted, the two machines are connected by an ethernet 
crossover cable.  The total RTT of the two machines is probably AT MOST a 
couple of hundred microseconds.  Typically it takes 20 or 30 microseconds 
to get the first packet out the NIC.  Travel across the wire is a few 
nanoseconds.  Then getting the packet into the receiving NIC might be 
another 20 or 30 microseconds.  The ACK should fly back in about the same 
time.


I can guess the RTT of the two machines is less than 4 ms in your case, and I 
remember the throughputs of all high-speed protocols (including 
TCP-Reno) were more than 900 Mbps with a 4 ms RTT. So, my question is: which 
kernel version did you use with your Broadcom NIC when you got more than 
900 Mbps?


We are going to double-check this (we did the Broadcom testing about two 
months ago). Carsten is going to re-run the Broadcom experiments later 
today and will then post the results.


You can see results from some testing on crossover-cable wired systems 
with Broadcom NICs, that I did about 2 years ago, here:

http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html
You'll notice that total TCP throughput on the crossover cable was about 
220 MB/sec.  With TCP overhead this is very close to 2Gb/s.


I have two machines connected by a gig switch and I can see what happens 
in my environment. Could you post the parameters you used for your 
netperf testing?


Carsten will post these in the next few hours.  If you want to simplify 
further, you can even take away the gig switch and just use a crossover 
cable.
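
(Something like the following is the kind of full-duplex test we mean -- a sketch 
only, until Carsten posts the exact command lines; 192.168.0.2 stands for the peer 
running netserver:)

  netperf -H 192.168.0.2 -t TCP_STREAM -l 60 &    # our transmit direction
  netperf -H 192.168.0.2 -t TCP_MAERTS -l 60 &    # the peer's transmit direction
  wait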



and also, if you set any parameters for your testing, please post them
here so that I can see whether the same thing happens for me as well.


Carsten will post all the sysctl and ethtool parameters shortly.

Thanks again for chiming in. I am sure that with help from you, Jesse, and 
Rick, we can figure out what is going on here, and get it fixed.


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Andi!


Important note: we ARE able to get full duplex wire speed (over 900
Mb/s simultaneously in both directions) using UDP.  The problems occur
only with TCP connections.


Another issue with full duplex TCP not mentioned yet is that if TSO is used
the output  will be somewhat bursty and might cause problems with the
TCP ACK clock of the other direction because the ACKs would need
to squeeze in between full TSO bursts.

You could try disabling TSO with ethtool.
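
(For example -- assuming the interface is eth0:)

  ethtool -k eth0           # show current offload settings, including TSO
  ethtool -K eth0 tso off   # turn TCP segmentation offload off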


Noted.  We'll try this also.

Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Bill,


I see similar results on my test systems


Thanks for this report and for confirming our observations.  Could you 
please confirm that a single-port bidirectional UDP link runs at wire 
speed?  This helps to localize the problem to the TCP stack or interaction 
of the TCP stack with the e1000 driver and hardware.
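
(Mirroring the TCP invocations used elsewhere in this thread, something like the 
following should do it -- a sketch; -u selects UDP and the address is a placeholder:)

  nuttcp -u -Itx -w2m 192.168.6.79 &    # UDP transmit test
  nuttcp -u -Irx -r -w2m 192.168.6.79   # simultaneous UDP receive test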


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi David,

Could this be an issue with pause frames?  At a previous job I remember 
having issues with a similar configuration using two broadcom sb1250 3 
gigE port devices. If I ran bidirectional tests on a single pair of 
ports connected via cross over, it was slower than when I gave each 
direction its own pair of ports.  The problem turned out to be that 
pause frame generation and handling was not configured correctly.


We had PAUSE frames turned off for our testing.  The idea is to let TCP 
do the flow and congestion control.
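
(For reference, the pause-frame settings can be checked and changed with ethtool -- 
assuming eth0:)

  ethtool -a eth0                             # show current flow-control settings
  ethtool -A eth0 autoneg off rx off tx off   # turn pause-frame flow control off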


The problem with PAUSE+TCP is that it can cause head-of-line blocking, 
where a single oversubscribed output port on a switch can PAUSE a large 
number of flows on other paths.


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Auke,


Important note: we ARE able to get full duplex wire speed (over 900
Mb/s simultaneously in both directions) using UDP.  The problems occur
only with TCP connections.


That eliminates bus bandwidth issues, probably, but small packets take
up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.


I see.  Your concern is the extra ACK packets associated with TCP.  Even
though these represent a small volume of data (around 5% with MTU=1500,
and less at larger MTU) they double the number of packets that must be
handled by the system compared to UDP transmission at the same data
rate. Is that correct?


A lot of people tend to forget that while at first glance the PCI-Express bus 
has enough bandwidth - 2.5 Gbit/sec for 1 Gbit of traffic - apart 
from the data going over it there is significant overhead going on: each 
packet requires transmit, cleanup and buffer transactions, and there are 
many irq register clears per second (slow ioreads/writes). The 
transactions double for TCP ACK processing, and this all accumulates and 
starts to introduce latency, higher CPU utilization, etc...


Based on the discussion in this thread, I am inclined to believe that lack 
of PCI-e bus bandwidth is NOT the issue.  The theory is that the extra 
packet handling associated with TCP acknowledgements is pushing the PCI-e 
x1 bus past its limits.  However the evidence seems to show otherwise:


(1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit 
PCI connection.  That connection can transfer data at 8Gb/s.


(2) If the theory is right, then doubling the MTU from 1500 to 3000 should 
have significantly reduced the problem, since it cuts the number of ACKs 
by a factor of two.  Similarly, going from MTU 1500 to MTU 9000 should reduce 
the number of ACKs by a factor of six, practically eliminating the problem. 
But changing the MTU size does not help.


(3) The interrupt counts are quite reasonable.  Broadcom NICs without 
interrupt aggregation generate an order of magnitude more irq/s and this 
doesn't prevent wire speed performance there.


(4) The CPUs on the system are largely idle.  There are plenty of 
computing resources available.


(5) I don't think that the overhead will increase the bandwidth needed by 
more than a factor of two.  Of course you and the other e1000 developers 
are the experts, but the dominant bus cost should be copying data buffers 
across the bus. Everything else is minimal in comparison.


Intel insiders: isn't there some simple instrumentation available (which 
reads registers or statistics counters on the PCI-e interface chip) to tell 
us statistics such as how many bits have moved over the link in each 
direction? This plus some accurate timing would make it easy to see if the 
TCP case is saturating the PCI-e bus.  Then the theory could be addressed with 
data rather than with opinions.
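
(Short of such counters, lspci at least shows the negotiated PCI-e link width and 
speed for the NIC -- a sketch, with 01:00.0 standing in for the 82573L's PCI address:)

  lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'   # advertised vs. negotiated link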


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Auke,


Based on the discussion in this thread, I am inclined to believe that
lack of PCI-e bus bandwidth is NOT the issue.  The theory is that the
extra packet handling associated with TCP acknowledgements is pushing
the PCI-e x1 bus past its limits.  However the evidence seems to show
otherwise:

(1) Bill Fink has reported the same problem on a NIC with a 133 MHz
64-bit PCI connection.  That connection can transfer data at 8Gb/s.


That was even a PCI-X connection, which is known to have extremely good latency
numbers, IIRC better than PCI-e(?), which could account for a lot of the
latency-induced lower performance...

also, 82573's are _not_ a server part and were not designed for this 
usage. 82546's are, and that really does make a difference.


I'm confused.  It DOESN'T make a difference! Using 'server grade' 82546's 
on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP 
full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1 
bus.


Just like us, when Bill goes from TCP to UDP, he gets wire speed back.

Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Bill,


I see similar results on my test systems


Thanks for this report and for confirming our observations.  Could you
please confirm that a single-port bidirectional UDP link runs at wire
speed?  This helps to localize the problem to the TCP stack or interaction
of the TCP stack with the e1000 driver and hardware.


Yes, a single-port bidirectional UDP test gets full GigE line rate
in both directions with no packet loss.


Thanks for confirming this.  And thanks also for nuttcp!  I just 
recognized you as the author.


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Bill,

I started musing if once one side's transmitter got the upper hand, it 
might somehow defer the processing of received packets, causing the 
resultant ACKs to be delayed and thus further slowing down the other 
end's transmitter.  I began to wonder if the txqueuelen could have an 
effect on the TCP performance behavior.  I normally have the txqueuelen 
set to 1 for 10-GigE testing, so decided to run a test with 
txqueuelen set to 200 (actually settled on this value through some 
experimentation).  Here is a typical result:


[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta 
-Irx -r -w2m 192.168.6.79
tx:  1120.6345 MB /  10.07 sec =  933.4042 Mbps 12 %TX 9 %RX 0 retrans
rx:  1104.3081 MB /  10.09 sec =  917.7365 Mbps 12 %TX 11 %RX 0 retrans

This is significantly better, but there was more variability in the
results.  The above was with TSO enabled.  I also then ran a test
with TSO disabled, with the following typical result:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta 
-Irx -r -w2m 192.168.6.79
tx:  1119.4749 MB /  10.05 sec =  934.2922 Mbps 13 %TX 9 %RX 0 retrans
rx:  1131.7334 MB /  10.05 sec =  944.8437 Mbps 15 %TX 12 %RX 0 retrans

This was a little better yet and getting closer to expected results.


We'll also try changing txqueuelen.  I have not looked, but I suppose that 
this is set to the default value of 1000.  We'd be delighted to see 
full-duplex performance that was consistent and greater than 900 Mb/s x 2.
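
(Checking and changing it is quick -- the new value below is only an example:)

  ip link show eth0                       # reports the current qlen (default 1000)
  ip link set dev eth0 txqueuelen 2000    # example value only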


I do have some other test systems at work that I might be able to try 
with newer kernels and/or drivers or maybe even with other vendor's GigE 
NICs, but I won't be back to work until early next week sometime.


Bill, we'd be happy to give you root access to a couple of our systems 
here if you want to do additional testing.  We can put the latest drivers 
on them (and reboot if/as needed).  If you want to do this, please just 
send an ssh public key to Carsten.


Cheers,
Bruce


e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen
(Pádraig Brady has suggested that I post this to Netdev.  It was 
originally posted to LKML here: http://lkml.org/lkml/2008/1/30/141 )



Dear NetDev,

We've connected a pair of modern high-performance boxes with integrated copper 
Gb/s Intel NICs, with an ethernet crossover cable, and have run some netperf 
full duplex TCP tests.  The transfer rates are well below wire speed.  We're 
reporting this as a kernel bug, because we expect a vanilla kernel with default 
settings to give wire speed (or close to wire speed) performance in this case. 
We DO see wire speed in simplex transfers. The behavior has been verified on 
multiple machines with identical hardware.


Details:
Kernel version: 2.6.23.12
ethernet NIC: Intel 82573L
ethernet driver: e1000 version 7.3.20-k2
motherboard: Supermicro PDSML-LN2+ (one quad core Intel Xeon X3220, Intel 3000 
chipset, 8GB memory)


The test was done with various MTU sizes ranging from 1500 to 9000, with 
ethernet flow control switched on and off, and using Reno and Cubic as the TCP 
congestion control algorithm.
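
(The knobs were changed roughly as follows -- a sketch of typical commands, not 
the literal test script:)

  ifconfig eth0 mtu 9000                            # MTU varied between 1500 and 9000
  ethtool -A eth0 rx on tx on                       # ethernet flow control on or off
  sysctl -w net.ipv4.tcp_congestion_control=cubic   # or reno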


The behavior depends on the setup. In one test we used cubic congestion 
control, flow control off. The transfer rate in one direction was above 0.9Gb/s 
while in the other direction it was 0.6 to 0.8 Gb/s. After 15-20s the rates 
flipped. Perhaps the two streams are fighting for resources. (The performance of 
a full duplex stream should be close to 1Gb/s in both directions.)  A graph of 
the transfer speed as a function of time is here: 
https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg

Red shows transmit and green shows receive (please ignore other plots).

We're happy to do additional testing, if that would help, and very grateful for 
any advice!


Bruce Allen
Carsten Aulbert
Henning Fehrmann

Re: e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen

Hi David,

Thanks for your note.


(The performance of a full duplex stream should be close to 1Gb/s in
both directions.)


This is not a reasonable expectation.

ACKs take up space on the link in the opposite direction of the
transfer.

So the link usage in the opposite direction of the transfer is
very far from zero.


Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900 
Mb/s.


Netperf is transmitting a large buffer in MTU-sized packets (min 1500 
bytes).  Since the acks are only about 60 bytes in size, they should be 
around 4% of the total traffic.  Hence we would not expect to see more 
than 960 Mb/s.


We have run these same tests on older kernels (with Broadcom NICs) and 
gotten above 900 Mb/s full duplex.


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen

Hi Stephen,

Thanks for your helpful reply and especially for the literature pointers.


Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900
Mb/s.

Netperf is transmitting a large buffer in MTU-sized packets (min 1500
bytes).  Since the acks are only about 60 bytes in size, they should be
around 4% of the total traffic.  Hence we would not expect to see more
than 960 Mb/s.



Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
Max TCP Payload data rates over ethernet:
 (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
 (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps


Yes.  If you look further down the page, you will see that with jumbo 
frames (which we have also tried) on Gb/s ethernet the maximum throughput 
is:


  (9000-20-20-12)/(9000+14+4+7+1+12) * 1000 Mb/s = 990.042 Mbps

We are very far from this number -- averaging perhaps 600 or 700 Mbps.


I believe what you are seeing is an effect that occurs when using
cubic on links with no other idle traffic. With two flows at high speed,
the first flow consumes most of the router buffer and backs off gradually,
and the second flow is not very aggressive.  It has been discussed
back and forth between TCP researchers with no agreement, one side
says that it is unfairness and the other side says it is not a problem in
the real world because of the presence of background traffic.


At least in principle, we should have NO congestion here.  We have ports 
on two different machines wired with a crossover cable.  Box A can not 
transmit faster than 1 Gb/s.  Box B should be able to receive that data 
without dropping packets.  It's not doing anything else!



See:
 http://www.hamilton.ie/net/pfldnet2007_cubic_final.pdf
 http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf


This is extremely helpful.  The typical oscillation (startup) period shown 
in the plots in these papers is of order 10 seconds, which is similar to 
the types of oscillation periods that we are seeing.


*However* we have also seen similar behavior with the Reno congestion 
control algorithm.  So this might not be due to cubic, or entirely due to 
cubic.


In our application (cluster computing) we use a very tightly coupled 
high-speed low-latency network.  There is no 'wide area traffic'.  So it's 
hard for me to understand why any networking components or software layers 
should take more than milliseconds to ramp up or back off in speed. 
Perhaps we should be asking for a TCP congestion avoidance algorithm which 
is designed for a data center environment where there are very few hops 
and typical packet delivery times are tens or hundreds of microseconds. 
It's very different than delivering data thousands of km across a WAN.
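
(At least it is easy to experiment with the algorithms a stock kernel already 
ships -- a sketch; which modules exist depends on the kernel configuration:)

  sysctl net.ipv4.tcp_available_congestion_control                      # currently loaded algorithms
  modprobe tcp_htcp && sysctl -w net.ipv4.tcp_congestion_control=htcp   # e.g. try H-TCP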


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen

Hi Ben,

Thank you for the suggestions and questions.

We've connected a pair of modern high-performance boxes with integrated 
copper Gb/s Intel NICs, with an ethernet crossover cable, and have run some 
netperf full duplex TCP tests.  The transfer rates are well below wire 
speed.  We're reporting this as a kernel bug, because we expect a vanilla 
kernel with default settings to give wire speed (or close to wire speed) 
performance in this case. We DO see wire speed in simplex transfers. The 
behavior has been verified on multiple machines with identical hardware.


Try using NICs in the pci-e slots.  We have better luck there, as you 
usually have more lanes and/or higher quality NIC chipsets available in 
this case.


It's a good idea.  We can try this, though it will take a little time to 
organize.



Try a UDP test to make sure the NIC can actually handle the throughput.


I should have mentioned this in my original post -- we already did this.

We can run UDP wire speed full duplex (over 900 Mb/s in each direction, at 
the same time). So the problem stems from TCP or is aggravated by TCP. 
It's not a hardware limitation.


Look at the actual link usage as reported by the ethernet driver so that 
you take all of the ACKs and other overhead into account.


OK.  We'll report on this as soon as possible.
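
(Probably by sampling the interface counters before and after a run, something 
like the following -- assuming eth0:)

  ethtool -S eth0 | grep -E 'bytes|packets'   # driver/MAC statistics counters
  cat /proc/net/dev                           # kernel per-interface byte/packet totals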

Try the same test using 10G hardware (CX4 NICs are quite affordable 
these days, and we drove a 2-port 10G NIC based on the Intel ixgbe 
chipset at around 4Gbps on two ports, full duplex, using pktgen). As in 
around 16Gbps throughput across the busses.  That may also give you an 
idea if the bottleneck is hardware or software related.


OK.  That will take more time to organize.

Cheers,
Bruce





RE: e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen

Hi Jesse,

It's good to be talking directly to one of the e1000 developers and 
maintainers.  Although at this point I am starting to think that the 
issue may be TCP stack related and nothing to do with the NIC.  Am I 
correct that these are quite distinct parts of the kernel?



The 82573L (a client NIC, regardless of the class of machine it is in)
only has a x1 connection which does introduce some latency since the
slot is only capable of about 2Gb/s data total, which includes overhead
of descriptors and other transactions.  As you approach the maximum of
the slot it gets more and more difficult to get wire speed in a
bidirectional test.


According to the Intel datasheet, the PCI-e x1 connection is 2Gb/s in each 
direction.  So we only need to get up to 50% of peak to saturate a 
full-duplex wire-speed link.  I hope that the overhead is not a factor of 
two.


Important note: we ARE able to get full duplex wire speed (over 900 Mb/s 
simultaneously in both directions) using UDP.  The problems occur only with 
TCP connections.



The test was done with various mtu sizes ranging from 1500 to 9000,
with ethernet flow control switched on and off, and using reno and
cubic as a TCP congestion control.


As asked in LKML thread, please post the exact netperf command used to
start the client/server, whether or not you're using irqbalanced (aka
irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
right?)


I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in 
Germany).  So we'll provide this info in ~10 hours.


I assume that the interrupt load is distributed among all four cores -- 
the default affinity is 0xff, and I also assume that there is some type of 
interrupt aggregation taking place in the driver.  If the CPUs were not 
able to service the interrupts fast enough, I assume that we would also 
see loss of performance with UDP testing.
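
(We will confirm this with something like the following -- NN stands for whatever 
IRQ number /proc/interrupts lists for the port:)

  grep eth /proc/interrupts       # per-CPU interrupt counts; also shows whether MSI is in use
  cat /proc/irq/NN/smp_affinity   # affinity mask for that IRQ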



I've recently discovered that particularly with the most recent kernels
if you specify any socket options (-- -SX -sY) to netperf it does worse
than if it just lets the kernel auto-tune.


I am pretty sure that no socket options were specified, but again need to 
wait until Carsten or Henning come back on-line.



The behavior depends on the setup. In one test we used cubic
congestion control, flow control off. The transfer rate in one
direction was above 0.9Gb/s while in the other direction it was 0.6
to 0.8 Gb/s. After 15-20s the rates flipped. Perhaps the two streams
are fighting for resources. (The performance of a full duplex stream
should be close to 1Gb/s in both directions.)  A graph of the
transfer speed as a function of time is here:
https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg
Red shows transmit and green shows receive (please ignore other
plots).



One other thing you can try with e1000 is disabling the dynamic
interrupt moderation by loading the driver with
InterruptThrottleRate=8000,8000,... (the number of commas depends on
your number of ports) which might help in your particular benchmark.


OK.  Is 'dynamic interrupt moderation' another name for 'interrupt 
aggregation'?  Meaning that if more than one interrupt is generated in a 
given time interval, then they are replaced by a single interrupt?



just for completeness can you post the dump of ethtool -e eth0 and lspci
-vvv?


Yup, we'll give that info also.

Thanks again!

Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen

Hi Rick,

First off, thanks for netperf. I've used it a lot and find it an extremely 
useful tool.



As asked in LKML thread, please post the exact netperf command used to
start the client/server, whether or not you're using irqbalanced (aka
irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
right?)


In particular, it would be good to know if you are doing two concurrent 
streams, or if you are using the burst-mode TCP_RR method with large 
request/response sizes, which then uses only one connection.


I'm not sure -- must wait for Henning and Carsten to respond tomorrow.

Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-30 Thread Bruce Allen

Hi Stephen,


Indeed, we are not asking to see 1000 Mb/s.  We'd be happy to see 900
Mb/s.

Netperf is transmitting a large buffer in MTU-sized packets (min 1500
bytes).  Since the acks are only about 60 bytes in size, they should be
around 4% of the total traffic.  Hence we would not expect to see more
than 960 Mb/s.



Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
Max TCP Payload data rates over ethernet:
 (1500-40)/(38+1500) = 94.9285 %  IPv4, minimal headers
 (1500-52)/(38+1500) = 94.1482 %  IPv4, TCP timestamps


Yes.  If you look further down the page, you will see that with jumbo
frames (which we have also tried) on Gb/s ethernet the maximum throughput
is:

   (9000-20-20-12)/(9000+14+4+7+1+12) * 1000 Mb/s = 990.042 Mbps

We are very far from this number -- averaging perhaps 600 or 700 Mbps.



That is the upper bound of performance on a standard PCI bus (32 bit).
To go higher you need PCI-X or PCI-Express. Also make sure you are really
getting 64-bit PCI, because I have seen some e1000 PCI-X boards that
are only 32bit.


The motherboard NIC is in a PCI-e x1 slot.  This has a maximum speed of 
250 MB/s (2 Gb/s) in each direction.  It should be a factor of 2 more 
interface speed than is needed.


Cheers,
Bruce