Re: e1000 full-duplex TCP performance well below wire speed

2008-02-01 Thread Carsten Aulbert

Hi all

Rick Jones wrote:
2) use the aforementioned burst TCP_RR test.  This is then a single 
netperf with data flowing both ways on a single connection so no issue 
of skew, but perhaps an issue of being one connection and so one process 
on each end.


Since our main goal is to establish a reliable way to test full-duplex 
connections, this looks like a very good choice. Right now we only run 
back-to-back tests (a cable directly connecting two hosts), but we want to 
move to a high-performance network with up to three switches between 
hosts, and for that we need a stable test.
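
If I understand the suggestion correctly, that boils down to a single 
invocation roughly like this (a sketch; it assumes a netperf binary built 
with burst-mode support, the 64K request/response size follows Jesse's 
suggestion elsewhere in the thread, and <host> and the 60 s duration are 
placeholders of mine):

# one connection, traffic flowing both ways at once; -b needs a burst-enabled netperf build
netperf -t TCP_RR -H <host> -l 60 -C -c -- -b 4 -r 64K,64K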


I doubt that I will be able to finish the tests tonight, but I'll post a 
follow-up by Monday at the latest.


Have a nice week-end and thanks a lot for all the suggestions so far!

Cheers

Carsten


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Good morning (my TZ),

I'll try to answer all questions; however, if I miss something big, 
please point me to it again.


Rick Jones wrote:

As asked in LKML thread, please post the exact netperf command used to
start the client/server, whether or not you're using irqbalanced (aka
irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
right?)


netperf was used without any special tuning parameters. Usually we start 
two processes on two hosts (almost) simultaneously, let them run for 
20-60 seconds, and simply use UDP_STREAM (which works well) and TCP_STREAM, i.e.


on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20
on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20

192.168.0.20[23] here is on eth0, which cannot do jumbo frames, so we use 
the 192.168.2.x addresses on eth1 for the tests across a range of MTUs.
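
(The different MTUs are set on eth1 before each run, with something along 
these lines; the value 6000 is just one of the sizes we sweep over:)

# example: switch the test interface to one of the MTUs under test
ip link set dev eth1 mtu 6000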


The netperf server (netserver) is started on both nodes via 
start-stop-daemon with no special parameters that I'm aware of.
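
(In other words, roughly the following; the binary path is an assumption, 
and 12865 is netperf's default control port:)

# what the init script effectively does
start-stop-daemon --start --exec /usr/bin/netserver -- -p 12865

# or started by hand
netserver -p 12865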


/proc/interrupts shows PCI-MSI-edge, so yes, I think we are using MSI.

In particular, it would be good to know if you are doing two concurrent 
streams, or if you are using the burst-mode TCP_RR method with large 
request/response sizes, which then uses only one connection.


As outlined above: two concurrent streams right now. If you think TCP_RR 
would be better, I'm happy to rerun some tests.


More in other emails.

I'll wade through them slowly.

Carsten


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Hi all, slowly crawling through the mails.

Brandeburg, Jesse wrote:


The test was done with various mtu sizes ranging from 1500 to 9000,
with ethernet flow control switched on and off, and using reno and
cubic as a TCP congestion control.

As asked in LKML thread, please post the exact netperf command used
to start the client/server, whether or not you're using irqbalanced
(aka irqbalance) and what cat /proc/interrupts looks like (you ARE
using MSI, right?)


We are using MSI, /proc/interrupts looks like:
n0003:~# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
  0:     6536963          0          0          0   IO-APIC-edge      timer
  1:           2          0          0          0   IO-APIC-edge      i8042
  3:           1          0          0          0   IO-APIC-edge      serial
  8:           0          0          0          0   IO-APIC-edge      rtc
  9:           0          0          0          0   IO-APIC-fasteoi   acpi
 14:       32321          0          0          0   IO-APIC-edge      libata
 15:           0          0          0          0   IO-APIC-edge      libata
 16:           0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
 18:           0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:           0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:           0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
378:    17234866          0          0          0   PCI-MSI-edge      eth1
379:      129826          0          0          0   PCI-MSI-edge      eth0
NMI:           0          0          0          0
LOC:     6537181    6537326    6537149    6537052
ERR:           0

What we don't understand is why only core0 gets the interrupts, since 
the affinity is set to f:

# cat /proc/irq/378/smp_affinity
f

Right now, irqbalance is not running, though I can give it a shot if 
people think it will make a difference.
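
(If it helps with debugging, this is how we could pin the eth1 interrupt 
by hand; IRQ 378 is taken from the table above, and the mask value is 
just an example:)

# show the current mask (f = CPUs 0-3 allowed)
cat /proc/irq/378/smp_affinity

# pin eth1's interrupt to CPU1 only (mask 2); 4 would mean CPU2, etc.
echo 2 > /proc/irq/378/smp_affinity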



I would suggest you try TCP_RR with a command line something like this:
netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K


I did that and the results can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest

The results with netperf running like
netperf -t TCP_STREAM -H host -l 20
can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf1

I reran the tests with
netperf -t <test> -H <host> -l 20 -c -C
or, in the case of TCP_RR, with the suggested burst settings -b 4 -r 64k appended.



Yes, InterruptThrottleRate=8000 means there will be no more than 8000
ints/second from that adapter, and if interrupts are generated faster
than that they are aggregated.

Interestingly, since you are interested in ultra-low latency and may be
willing to give up some CPU for it during bulk transfers, you should try
InterruptThrottleRate=1 (can generate up to 7 ints/s).



On the web page you'll see that there are about 4000 interrupts/s for 
most tests and up to 20,000/s for the TCP_RR test. Shall I change the 
throttle rate?
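
(If we do, I'd set it as a module parameter along these lines; this 
assumes both ports are driven by the e1000 module and that the interfaces 
can be taken down briefly:)

# confirm the parameter exists for the loaded driver
modinfo e1000 | grep -i InterruptThrottleRate

# reload with the suggested value, one entry per port
rmmod e1000
modprobe e1000 InterruptThrottleRate=1,1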



Just for completeness, can you post the dump of ethtool -e eth0 and
lspci -vvv?

Yup, here is that info as well.


n0002:~# ethtool -e eth1
Offset  Values
--  --
0x  00 30 48 93 94 2d 20 0d 46 f7 57 00 ff ff ff ff
0x0010  ff ff ff ff 6b 02 9a 10 d9 15 9a 10 86 80 df 80
0x0020  00 00 00 20 54 7e 00 00 00 10 da 00 04 00 00 27
0x0030  c9 6c 50 31 32 07 0b 04 84 29 00 00 00 c0 06 07
0x0040  08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050  14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00
0x0060  00 01 00 40 1e 12 ff ff ff ff ff ff ff ff ff ff
0x0070  ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 2f

lspci -vvv for this card:
0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
        Subsystem: Super Micro Computer Inc Unknown device 109a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 378
        Region 0: Memory at ee20 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5000 [size=32]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: fee0f00c  Data: 41b9
        Capabilities: [e0] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s 512ns, L1 64us
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal-

Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Brief question I forgot to ask:

Right now we are using the old e1000 driver version 7.3.20-k2. To save 
some effort on your end, shall we upgrade to 7.6.15, or should our 
version be good enough?
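
(The version number above is what ethtool reports; for completeness, 
assuming eth1 is the interface under test:)

# driver name, version and firmware of the NIC under test
ethtool -i eth1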


Thanks

Carsten


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Hi Andi,

Andi Kleen wrote:
Another issue with full-duplex TCP not mentioned yet is that if TSO is used, 
the output will be somewhat bursty and might cause problems with the 
TCP ACK clock of the other direction, because the ACKs would need 
to squeeze in between full TSO bursts.


You could try disabling TSO with ethtool.


I just tried that:

https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3

It seems that the numbers do get better (the sweet spot seems to be MTU 6000 
with 914 MBit/s and 927 MBit/s); however, for other settings the results 
vary a lot, so I'm not sure how large the statistical fluctuations are.
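
(Concretely, TSO was toggled with ethtool, roughly as follows; eth1 is 
the test interface:)

# check current offload settings, turn TSO off, verify
ethtool -k eth1
ethtool -K eth1 tso off
ethtool -k eth1 | grep -i segmentation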


As a next test, I'll check whether it makes sense to enlarge the ring buffers.
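
(Probably via ethtool's ring parameters, something like this; the sizes 
below are example values and must stay within the maximums the driver 
reports:)

# show current and maximum RX/TX ring sizes
ethtool -g eth1

# example: enlarge both rings
ethtool -G eth1 rx 4096 tx 4096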

Thanks

Carsten


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Hi all,

Brandeburg, Jesse wrote:

I would suggest you try TCP_RR with a command line something like
this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K

I did that and the results can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest


Seems something went wrong and all you ran was the 1-byte tests, where
it should have been 64K in both directions (request/response).
 


Yes, shell quoting got me there. I'll re-run the tests, so please don't 
look at the TCP_RR results too closely. I think I'll be able to run 
maybe one or two more tests today; the rest will follow tomorrow.
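
(For the re-run I'll spell out both sizes in plain bytes so nothing gets 
mangled by our wrapper scripts; this is a guess at a safer invocation, 
since I haven't pinned down the exact quoting failure yet:)

# 64 KiB request and response, given explicitly in bytes
netperf -t TCP_RR -H <host> -l 20 -C -c -- -b 4 -r 65536,65536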


Thanks for bearing with me

Carsten

PS: Am I right that the TCP_RR tests should only be run on a single node 
at a time, not on both ends simultaneously?
