Re: e1000 full-duplex TCP performance well below wire speed
Hi all Rick Jones wrote: 2) use the aforementioned burst TCP_RR test. This is then a single netperf with data flowing both ways on a single connection, so no issue of skew, but perhaps an issue of being one connection and so one process on each end. Since our major goal is to establish a reliable way to test duplex connections, this looks like a very good choice. Right now we just run this on a back-to-back test (cable connecting two hosts), but want to move to a high-performance network with up to three switches between hosts. For this we want to have a stable test. I doubt that I will be able to finish the tests tonight, but I'll post a follow-up by Monday at the latest. Have a nice weekend and thanks a lot for all the suggestions so far! Cheers Carsten -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: e1000 full-duplex TCP performance well below wire speed
Hi Jesse, It's good to be talking directly to one of the e1000 developers and maintainers. Although at this point I am starting to think that the issue may be TCP stack related and nothing to do with the NIC. Am I correct that these are quite distinct parts of the kernel? Yes, quite. OK. I hope that there is also someone knowledgeable about the TCP stack who is following this thread. (Perhaps you also know this part of the kernel, but I am assuming that your expertise is on the e1000/NIC bits.) Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections. That eliminates bus bandwidth issues, probably, but small packets take up a lot of extra descriptors, bus bandwidth, CPU, and cache resources. I see. Your concern is the extra ACK packets associated with TCP. Even though these represent a small volume of data (around 5% with MTU=1500, and less at larger MTU), they double the number of packets that must be handled by the system compared to UDP transmission at the same data rate. Is that correct? I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in Germany). So we'll provide this info in ~10 hours. I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I think you'll have to compile netperf with burst mode support enabled. I just saw Carsten a few minutes ago. He has to take part in a 'Baubesprechung' (construction meeting) this morning, after which he will start answering the technical questions and doing additional testing as suggested by you and others. If you are on the US west coast, he should have some answers and results posted by Thursday morning Pacific time. I assume that the interrupt load is distributed among all four cores -- the default affinity is 0xff, and I also assume that there is some type of interrupt aggregation taking place in the driver.
If the CPUs were not able to service the interrupts fast enough, I assume that we would also see loss of performance with UDP testing. One other thing you can try with e1000 is disabling the dynamic interrupt moderation by loading the driver with InterruptThrottleRate=8000,8000,... (the number of commas depends on your number of ports), which might help in your particular benchmark. OK. Is 'dynamic interrupt moderation' another name for 'interrupt aggregation'? Meaning that if more than one interrupt is generated in a given time interval, then they are replaced by a single interrupt? Yes, InterruptThrottleRate=8000 means there will be no more than 8000 ints/second from that adapter, and if interrupts are generated faster than that they are aggregated. Interestingly, since you are interested in ultra-low latency, and may be willing to give up some CPU for it during bulk transfers, you should try InterruptThrottleRate=1 (can generate up to 7 ints/s). I'm not sure it's quite right to say that we are interested in ultra-low latency. Most of our network transfers involve bulk data movement (a few MB or more). We don't care so much about low latency (meaning how long it takes the FIRST byte of data to travel from sender to receiver). We care about aggregate bandwidth: once the pipe is full, how fast can data be moved through it. So we don't care so much if getting the pipe full takes 20 us or 50 us. We just want the data to flow fast once the pipe IS full. Welcome, it's an interesting discussion. Hope we can come to a good conclusion. Thank you. Carsten will post more info and answers later today. Cheers, Bruce
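The ACK accounting discussed above can be sanity-checked with a quick calculation. This is only a rough sketch; the 66-byte on-wire ACK size and the one-ACK-per-data-segment assumption are mine, not from the thread:

```python
# Rough cost of TCP ACK traffic vs. UDP at the same data rate.
# Assumes one ACK per data segment (no delayed ACKs) and a 66-byte
# ACK frame on the wire (Ethernet + IP + TCP headers, no payload).

def ack_overhead(mtu, ack_bytes=66, acks_per_segment=1.0):
    """Return (ack_byte_fraction, packet_multiplier) for a one-way stream."""
    byte_frac = acks_per_segment * ack_bytes / mtu
    pkt_mult = 1 + acks_per_segment   # data packets plus returning ACKs
    return byte_frac, pkt_mult

for mtu in (1500, 9000):
    frac, mult = ack_overhead(mtu)
    print(f"MTU {mtu}: ACK bytes ~{frac:.1%} of data volume, "
          f"{mult:.1f}x packet count vs. UDP")
```

At MTU 1500 this reproduces the roughly 5% byte overhead and doubled packet count mentioned above; with delayed ACKs (one ACK per two segments) both figures halve.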
Re: e1000 full-duplex TCP performance well below wire speed
Hi Sangtae, Thanks for joining this discussion -- it's good to have a CUBIC author and expert here! In our application (cluster computing) we use a very tightly coupled high-speed low-latency network. There is no 'wide area traffic'. So it's hard for me to understand why any networking components or software layers should take more than milliseconds to ramp up or back off in speed. Perhaps we should be asking for a TCP congestion avoidance algorithm which is designed for a data center environment where there are very few hops and typical packet delivery times are tens or hundreds of microseconds. It's very different from delivering data thousands of km across a WAN. If your network latency is low, any type of protocol should give you more than 900 Mbps. Yes, this is also what I had thought. In the graph that we posted, the two machines are connected by an ethernet crossover cable. The total RTT of the two machines is probably AT MOST a couple of hundred microseconds. Typically it takes 20 or 30 microseconds to get the first packet out the NIC. Travel across the wire is a few nanoseconds. Then getting the packet into the receiving NIC might be another 20 or 30 microseconds. The ACK should fly back in about the same time. I can guess the RTT of the two machines is less than 4 ms in your case, and I remember the throughputs of all high-speed protocols (including TCP Reno) were more than 900 Mbps with 4 ms RTT. So, my question is: which kernel version did you use with your Broadcom NIC when you got more than 900 Mbps? We are going to double-check this (we did the Broadcom testing about two months ago). Carsten is going to re-run the Broadcom experiments later today and will then post the results.
You can see results from some testing on crossover-cable wired systems with Broadcom NICs, that I did about 2 years ago, here: http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html You'll notice that total TCP throughput on the crossover cable was about 220 MB/sec. With TCP overhead this is very close to 2 Gb/s. I have two machines connected by a gig switch and I can see what happens in my environment. Could you post what parameters you used for netperf testing? Carsten will post these in the next few hours. If you want to simplify further, you can even take away the gig switch and just use a crossover cable. And also, if you set any parameters for your testing, please post them here so that I can see what happens on my side as well. Carsten will post all the sysctl and ethtool parameters shortly. Thanks again for chiming in. I am sure that with help from you, Jesse, and Rick, we can figure out what is going on here, and get it fixed. Cheers, Bruce
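The RTT estimates above can be turned into a concrete window requirement via the bandwidth-delay product, i.e. the amount of unacknowledged data TCP must keep in flight to fill the pipe (a sketch; the RTT values are the estimates from the discussion, not measurements):

```python
# Window (bandwidth * delay product) needed to sustain a given rate.

def bdp_bytes(rate_bps, rtt_seconds):
    return rate_bps * rtt_seconds / 8  # bits -> bytes

for rtt_us in (100, 200, 4000):        # crossover-cable estimates vs. 4 ms
    kib = bdp_bytes(1e9, rtt_us * 1e-6) / 1024
    print(f"RTT {rtt_us:>4} us -> window needed: {kib:7.1f} KiB")
```

Even at 4 ms the needed window is under 500 KiB, so a correctly autotuned TCP window should not be the bottleneck at these RTTs.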
Re: e1000 full-duplex TCP performance well below wire speed
Bruce Allen [EMAIL PROTECTED] writes: Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections. Another issue with full duplex TCP not mentioned yet is that if TSO is used the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction, because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool. -Andi
Re: e1000 full-duplex TCP performance well below wire speed
Hi Andi! Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections. Another issue with full duplex TCP not mentioned yet is that if TSO is used the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction, because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool. Noted. We'll try this also. Cheers, Bruce
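To get a feel for the burstiness Andi describes, here is a small sketch (the numbers are mine: a maximal 64 KiB TSO burst on a 1 Gb/s wire, and a 1448-byte MSS):

```python
# Wire occupancy of a single TSO burst: while the NIC streams the burst
# back-to-back, the ACKs for it arrive in bunches rather than smoothly,
# which perturbs the ACK clock of the reverse-direction flow.

def burst_wire_time_us(burst_bytes=65536, rate_bps=1e9):
    return burst_bytes * 8 / rate_bps * 1e6

def bursts_per_second(burst_bytes=65536, rate_bps=1e9):
    return rate_bps / (burst_bytes * 8)

print(f"64 KiB TSO burst occupies the wire for ~{burst_wire_time_us():.0f} us,")
print(f"covering {65536 // 1448} MSS-sized segments' worth of ACKs at once")
```

So at line rate the sender emits roughly 1900 half-millisecond bursts per second, and the reverse flow's ACKs must interleave with them.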
Re: e1000 full-duplex TCP performance well below wire speed
On Wed, 30 Jan 2008, SANGTAE HA wrote: On Jan 30, 2008 5:25 PM, Bruce Allen [EMAIL PROTECTED] wrote: In our application (cluster computing) we use a very tightly coupled high-speed low-latency network. There is no 'wide area traffic'. So it's hard for me to understand why any networking components or software layers should take more than milliseconds to ramp up or back off in speed. Perhaps we should be asking for a TCP congestion avoidance algorithm which is designed for a data center environment where there are very few hops and typical packet delivery times are tens or hundreds of microseconds. It's very different from delivering data thousands of km across a WAN. If your network latency is low, any type of protocol should give you more than 900 Mbps. I can guess the RTT of the two machines is less than 4 ms in your case, and I remember the throughputs of all high-speed protocols (including TCP Reno) were more than 900 Mbps with 4 ms RTT. So, my question is: which kernel version did you use with your Broadcom NIC when you got more than 900 Mbps? I have two machines connected by a gig switch and I can see what happens in my environment. Could you post what parameters you used for netperf testing? And also, if you set any parameters for your testing, please post them here so that I can see what happens on my side as well. I see similar results on my test systems, using a Tyan Thunder K8WE (S2895) motherboard with dual Intel Xeon 3.06 GHz CPUs and 1 GB memory, running a 2.6.15.4 kernel. The GigE NICs are Intel PRO/1000 82546EB_QUAD_COPPER, on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000 driver, and running with 9000-byte jumbo frames. The TCP congestion control is BIC.
Unidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
tx: 1186.5649 MB / 10.05 sec = 990.2741 Mbps 11 %TX 9 %RX 0 retrans

and:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Irx -r -w2m 192.168.6.79
rx: 1186.8281 MB / 10.05 sec = 990.5634 Mbps 14 %TX 9 %RX 0 retrans

Each direction gets full GigE line rate.

Bidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
    nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx:  898.9934 MB / 10.05 sec = 750.1634 Mbps 10 %TX  8 %RX 0 retrans
rx: 1167.3750 MB / 10.06 sec = 973.8617 Mbps 14 %TX 11 %RX 0 retrans

While one direction gets close to line rate, the other only got 750 Mbps. Note there were no TCP retransmitted segments for either data stream, so that doesn't appear to be the cause of the slower transfer rate in one direction. If the receive direction uses a different GigE NIC that's part of the same quad-GigE, all is fine:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
    nuttcp -f-beta -Irx -r -w2m 192.168.5.79
tx: 1186.5051 MB / 10.05 sec = 990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx: 1186.7656 MB / 10.05 sec = 990.5204 Mbps 15 %TX 14 %RX 0 retrans

Here's a test using the same GigE NIC for both directions with 1-second interval reports:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -i1 -w2m 192.168.6.79
    nuttcp -f-beta -Irx -r -i1 -w2m 192.168.6.79
tx:  92.3750 MB / 1.01 sec = 767.2277 Mbps 0 retrans
rx: 104.5625 MB / 1.01 sec = 872.4757 Mbps 0 retrans
tx:  83.3125 MB / 1.00 sec = 700.1845 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5541 Mbps 0 retrans
tx:  83.8125 MB / 1.00 sec = 703.0322 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5502 Mbps 0 retrans
tx:  83.     MB / 1.00 sec = 696.1779 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5522 Mbps 0 retrans
tx:  83.7500 MB / 1.00 sec = 702.4989 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5512 Mbps 0 retrans
tx:  83.1250 MB / 1.00 sec = 697.2270 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5512 Mbps 0 retrans
tx:  84.1875 MB / 1.00 sec = 706.1665 Mbps 0 retrans
rx: 117.5625 MB / 1.00 sec = 985.5510 Mbps 0 retrans
tx:  83.0625 MB / 1.00 sec = 696.7167 Mbps 0 retrans
rx: 117.6875 MB / 1.00 sec = 987.5543 Mbps 0 retrans
tx:  84.1875 MB / 1.00 sec = 706.1545 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5472 Mbps 0 retrans
rx: 117.6875 MB / 1.00 sec = 987.0724 Mbps 0 retrans
tx:  83.3125 MB / 1.00 sec = 698.8137 Mbps 0 retrans

tx:  844.9375 MB / 10.07 sec = 703.7699 Mbps 11 %TX  6 %RX 0 retrans
rx: 1167.4414 MB / 10.05 sec = 973.9980 Mbps 14 %TX 11 %RX 0 retrans

In this test case, the receiver ramped up to nearly full GigE line rate, while the transmitter was stuck at about 700 Mbps. I ran one longer 60-second test and didn't see the oscillating behavior between receiver and transmitter, but maybe that's because I have the GigE NIC interrupts and nuttcp client/server applications both locked to CPU 0. So in my tests, once one direction gets the upper hand, it seems to stay that way. Could this be because the slower side
Re: e1000 full-duplex TCP performance well below wire speed
Good morning (my TZ), I'll try to answer all questions, however if I miss something big, please point my nose to it again. Rick Jones wrote: As asked in LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) netperf was used without any special tuning parameters. Usually we start two processes on two hosts which start (almost) simultaneously, last for 20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e. on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20 on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20 192.168.0.20[23] here is on eth0, which cannot do jumbo frames, thus we use the .2. part for eth1 for a range of MTUs. The server is started on both nodes with the start-stop-daemon and no special parameters I'm aware of. /proc/interrupts shows me PCI-MSI-edge, thus I think YES. In particular, it would be good to know if you are doing two concurrent streams, or if you are using the burst mode TCP_RR with large request/response sizes method, which then is only using one connection. As outlined above: two concurrent streams right now. If you think TCP_RR would be better I'm happy to rerun some tests. More in other emails. I'll wade through them slowly. Carsten
Re: e1000 full-duplex TCP performance well below wire speed
Bill Fink wrote: If the receive direction uses a different GigE NIC that's part of the same quad-GigE, all is fine: [EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 nuttcp -f-beta -Irx -r -w2m 192.168.5.79 tx: 1186.5051 MB / 10.05 sec = 990.2250 Mbps 12 %TX 13 %RX 0 retrans rx: 1186.7656 MB / 10.05 sec = 990.5204 Mbps 15 %TX 14 %RX 0 retrans Could this be an issue with pause frames? At a previous job I remember having issues with a similar configuration using two Broadcom SB1250 three-port GigE devices. If I ran bidirectional tests on a single pair of ports connected via crossover, it was slower than when I gave each direction its own pair of ports. The problem turned out to be that pause frame generation and handling was not configured correctly. -Ack
Re: e1000 full-duplex TCP performance well below wire speed
Hi all, slowly crawling through the mails. Brandeburg, Jesse wrote: The test was done with various mtu sizes ranging from 1500 to 9000, with ethernet flow control switched on and off, and using reno and cubic as a TCP congestion control. As asked in LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) We are using MSI; /proc/interrupts looks like:

n0003:~# cat /proc/interrupts
            CPU0     CPU1     CPU2     CPU3
  0:     6536963        0        0        0   IO-APIC-edge      timer
  1:           2        0        0        0   IO-APIC-edge      i8042
  3:           1        0        0        0   IO-APIC-edge      serial
  8:           0        0        0        0   IO-APIC-edge      rtc
  9:           0        0        0        0   IO-APIC-fasteoi   acpi
 14:       32321        0        0        0   IO-APIC-edge      libata
 15:           0        0        0        0   IO-APIC-edge      libata
 16:           0        0        0        0   IO-APIC-fasteoi   uhci_hcd:usb5
 18:           0        0        0        0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:           0        0        0        0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:           0        0        0        0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
378:    17234866        0        0        0   PCI-MSI-edge      eth1
379:      129826        0        0        0   PCI-MSI-edge      eth0
NMI:           0        0        0        0
LOC:     6537181  6537326  6537149  6537052
ERR:           0

What we don't understand is why only core0 gets the interrupts, since the affinity is set to f: # cat /proc/irq/378/smp_affinity f Right now, irqbalance is not running, though I can give it a shot if people think this will make a difference.
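As a side note on the affinity mask, here is an illustrative sketch (the helper function is mine, not from the thread) of how the hex value in /proc/irq/N/smp_affinity decodes. The mask only says which CPUs are *allowed* to take the IRQ; with MSI and no irqbalance running, delivery commonly ends up pinned to the lowest-numbered allowed CPU, which would match the core0-only counts:

```python
# Decode a /proc/irq/N/smp_affinity hex mask into the list of CPUs
# permitted to receive that IRQ (illustrative helper).

def allowed_cpus(mask_hex: str):
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

print(allowed_cpus("f"))   # [0, 1, 2, 3] -> all four cores permitted
print(allowed_cpus("2"))   # [1]          -> a mask that pins the IRQ to CPU1
```

Writing a single-bit mask (e.g. echo 2 > /proc/irq/378/smp_affinity) is the usual way to steer an IRQ to a specific core by hand.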
I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I did that and the results can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest The results with netperf running like netperf -t TCP_STREAM -H host -l 20 can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf1 I reran the tests with netperf -t test -H host -l 20 -c -C, or in the case of TCP_RR with the suggested burst settings -b 4 -r 64K. Yes, InterruptThrottleRate=8000 means there will be no more than 8000 ints/second from that adapter, and if interrupts are generated faster than that they are aggregated. Interestingly, since you are interested in ultra-low latency, and may be willing to give up some CPU for it during bulk transfers, you should try InterruptThrottleRate=1 (can generate up to 7 ints/s). On the web page you'll see that there are about 4000 interrupts/s for most tests and up to 20,000/s for the TCP_RR test. Shall I change the throttle rate? Just for completeness, can you post the dump of ethtool -e eth0 and lspci -vvv? Yup, we'll give that info also.
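For scale, here is a sketch of what those interrupt rates mean at GigE wire speed. The 38 bytes of per-frame overhead (preamble+SFD, Ethernet header, FCS, inter-frame gap) is standard Ethernet framing; the framing of the calculation itself is mine:

```python
# Packets/s at GigE line rate for a given MTU, and how many packets each
# interrupt must cover under an InterruptThrottleRate (ITR) cap.

LINE_RATE_BPS = 1e9
PER_FRAME_OVERHEAD = 38   # preamble+SFD, Ethernet header, FCS, inter-frame gap

def wire_pps(mtu):
    return LINE_RATE_BPS / ((mtu + PER_FRAME_OVERHEAD) * 8)

for mtu in (1500, 9000):
    for itr in (4000, 8000, 20000):
        print(f"MTU {mtu}, ITR {itr}: {wire_pps(mtu):8.0f} pkts/s, "
              f"~{wire_pps(mtu) / itr:5.1f} pkts per interrupt")
```

At the observed ~4000 ints/s, each interrupt services on the order of 20 full-size frames at MTU 1500, so the interrupt rate itself is unlikely to be the bottleneck here.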
n0002:~# ethtool -e eth1
Offset          Values
------          ------
0x0000          00 30 48 93 94 2d 20 0d 46 f7 57 00 ff ff ff ff
0x0010          ff ff ff ff 6b 02 9a 10 d9 15 9a 10 86 80 df 80
0x0020          00 00 00 20 54 7e 00 00 00 10 da 00 04 00 00 27
0x0030          c9 6c 50 31 32 07 0b 04 84 29 00 00 00 c0 06 07
0x0040          08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050          14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00
0x0060          00 01 00 40 1e 12 ff ff ff ff ff ff ff ff ff ff
0x0070          ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 2f

lspci -vvv for this card:

0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
        Subsystem: Super Micro Computer Inc Unknown device 109a
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 378
        Region 0: Memory at ee20 (32-bit, non-prefetchable) [size=128K]
        Region 2: I/O ports at 5000 [size=32]
        Capabilities: [c8] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: fee0f00c  Data: 41b9
        Capabilities: [e0] Express Endpoint IRQ 0
                Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag-
                Device: Latency L0s 512ns, L1 64us
                Device: AtnBtn- AtnInd- PwrInd-
                Device: Errors: Correctable- Non-Fatal- Fatal-
Re: e1000 full-duplex TCP performance well below wire speed
Brief question I forgot to ask: right now we are using the old version 7.3.20-k2. To save some effort on your end, shall we upgrade this to 7.6.15, or should our version be good enough? Thanks Carsten
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bill, I see similar results on my test systems Thanks for this report and for confirming our observations. Could you please confirm that a single-port bidirectional UDP link runs at wire speed? This helps to localize the problem to the TCP stack, or to the interaction of the TCP stack with the e1000 driver and hardware. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
Hi David, Could this be an issue with pause frames? At a previous job I remember having issues with a similar configuration using two Broadcom SB1250 three-port GigE devices. If I ran bidirectional tests on a single pair of ports connected via crossover, it was slower than when I gave each direction its own pair of ports. The problem turned out to be that pause frame generation and handling was not configured correctly. We had PAUSE frames turned off for our testing. The idea is to let TCP do the flow and congestion control. The problem with PAUSE+TCP is that it can cause head-of-line blocking, where a single oversubscribed output port on a switch can PAUSE a large number of flows on other paths. Cheers, Bruce
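For reference, 802.3x PAUSE durations are expressed in quanta of 512 bit times, so their real-time effect depends on link speed; a quick sketch of what that means at gigabit rates:

```python
# Convert an 802.3x PAUSE frame's pause_time (in 512-bit-time quanta)
# into real time at a given link speed.

def pause_duration_us(quanta, rate_bps=1e9):
    return quanta * 512 / rate_bps * 1e6

print(f"one quantum at 1 Gb/s: {pause_duration_us(1):.3f} us")
print(f"max pause (0xFFFF quanta) at 1 Gb/s: "
      f"{pause_duration_us(0xFFFF) / 1000:.1f} ms")
```

A single maximal PAUSE can stall a port for tens of milliseconds, which is why the head-of-line blocking described above is a real concern when PAUSE and TCP congestion control interact.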
Re: e1000 full-duplex TCP performance well below wire speed
Hi Andi, Andi Kleen wrote: Another issue with full duplex TCP not mentioned yet is that if TSO is used the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction, because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool. I just tried that: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3 It seems that the numbers do get better (the sweet spot seems to be MTU 6000 with 914 MBit/s and 927 MBit/s), however for other settings the results vary a lot, so I'm not sure how large the statistical fluctuations are. Next I'll test whether it makes sense to enlarge the ring buffers. Thanks Carsten
Re: e1000 full-duplex TCP performance well below wire speed
Hi all, Brandeburg, Jesse wrote: I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I did that and the results can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest Seems something went wrong and all you ran was the 1-byte tests, where it should have been 64K in both directions (request/response). Yes, shell-quoting got me there. I'll re-run the tests, so please don't look at the TCP_RR results too closely. I think I'll be able to run maybe one or two more tests today; the rest will follow tomorrow. Thanks for bearing with me Carsten PS: Am I right that the TCP_RR tests should only be run on a single node at a time, not on both ends simultaneously?
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bruce, On Thu, 31 Jan 2008, Bruce Allen wrote: I see similar results on my test systems Thanks for this report and for confirming our observations. Could you please confirm that a single-port bidirectional UDP link runs at wire speed? This helps to localize the problem to the TCP stack or interaction of the TCP stack with the e1000 driver and hardware. Yes, a single-port bidirectional UDP test gets full GigE line rate in both directions with no packet loss.

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -u -Ru -w2m 192.168.6.79
    nuttcp -f-beta -Irx -r -u -Ru -w2m 192.168.6.79
tx: 1187.0078 MB / 10.04 sec = 992.0550 Mbps 19 %TX 7 %RX 0 / 151937 drop/pkt 0.00 %loss
rx: 1187.1016 MB / 10.03 sec = 992.3408 Mbps 19 %TX 7 %RX 0 / 151949 drop/pkt 0.00 %loss

-Bill
RE: e1000 full-duplex TCP performance well below wire speed
Carsten Aulbert wrote: PS: Am I right that the TCP_RR tests should only be run on a single node at a time, not on both ends simultaneously? Yes, they are a request/response test, so perform the bidirectional test with a single node starting the test.
Re: e1000 full-duplex TCP performance well below wire speed
netperf was used without any special tuning parameters. Usually we start two processes on two hosts which start (almost) simultaneously, last for 20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e. on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20 on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20 192.168.0.20[23] here is on eth0, which cannot do jumbo frames, thus we use the .2. part for eth1 for a range of MTUs. The server is started on both nodes with the start-stop-daemon and no special parameters I'm aware of. So long as you are relying on external (netperf-relative) means to report the throughput, those command lines would be fine. I wouldn't be comfortable relying on the sum of the netperf-reported throughputs with those command lines, though. Netperf2 has no test synchronization, so two separate commands, particularly those initiated on different systems, are subject to skew errors. 99 times out of ten they might be epsilon, but I get a _little_ paranoid there. There are three alternatives: 1) use netperf4. Not as convenient for quick testing at present, but it has explicit test synchronization, so you know that the numbers presented are from when all connections were actively transferring data. 2) use the aforementioned burst TCP_RR test. This is then a single netperf with data flowing both ways on a single connection, so no issue of skew, but perhaps an issue of being one connection and so one process on each end. 3) start both tests from the same system and follow the suggestions contained in: http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html particularly: http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance and use a combination of TCP_STREAM and TCP_MAERTS (STREAM backwards) tests.
happy benchmarking, rick jones
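Rick's skew concern can be illustrated numerically. This is a sketch with made-up rates (940 Mb/s uncontended, 700 Mb/s contended): if two 20-second runs overlap for only 18 seconds, each side's average includes a brief solo period at full line rate, and the summed result overstates the true simultaneous aggregate:

```python
# Apparent aggregate throughput of two unsynchronized streams whose
# runs overlap only partially (rates in Mb/s, times in seconds).

def apparent_aggregate(duration, overlap, solo_rate, contended_rate):
    per_stream = ((duration - overlap) * solo_rate
                  + overlap * contended_rate) / duration
    return 2 * per_stream

true_simultaneous = 2 * 700.0
skewed = apparent_aggregate(duration=20, overlap=18,
                            solo_rate=940.0, contended_rate=700.0)
print(f"true {true_simultaneous:.0f} Mb/s vs reported {skewed:.0f} Mb/s "
      f"with 2 s of skew")
```

Even a couple of seconds of skew inflates the reported aggregate by several percent, which is exactly why a single synchronized burst TCP_RR run is preferable.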
Re: e1000 full-duplex TCP performance well below wire speed
Carsten Aulbert wrote: Hi Andi, Andi Kleen wrote: Another issue with full duplex TCP not mentioned yet is that if TSO is used the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction, because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool. I just tried that: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3 It seems that the numbers do get better (the sweet spot seems to be MTU 6000 with 914 MBit/s and 927 MBit/s), however for other settings the results vary a lot, so I'm not sure how large the statistical fluctuations are. Next test I'll try if it makes sense to enlarge the ring buffers. Sometimes it may help if the system (CPU) is laggy or very busy, so that the card has more buffers available (and thus can go longer without being serviced). Usually (if your system responds quickly) it's better to use *smaller* ring sizes, as this reduces cache usage; hence the small default value. So, unless the ethtool -S ethX output indicates that your system is too busy (rx_no_buffer_count increases), I would not recommend increasing the ring size. Auke
Re: e1000 full-duplex TCP performance well below wire speed
Carsten Aulbert wrote: Hi all, slowly crawling through the mails. [quoted /proc/interrupts output and IRQ affinity discussion snipped -- see above] I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I did that and the results can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest For convenience, 2.4.4 (perhaps earlier, I can never remember when I've added things :) allows the output format for a TCP_RR test to be set to the same as a _STREAM or _MAERTS test.
And if you add a -v 2 to it you will get the each-way values and the average round-trip latency:

[EMAIL PROTECTED]:~/netperf2_trunk$ src/netperf -t TCP_RR -H oslowest.cup -f m -v 2 -- -r 64K -b 4
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.cup.hp.com (16.89.84.17) port 0 AF_INET : first burst 4
Local /Remote
Socket Size   Request  Resp.    Elapsed
Send   Recv   Size     Size     Time     Throughput
bytes  Bytes  bytes    bytes    secs.    10^6bits/sec

16384  87380  65536    65536    10.01    105.63
16384  87380

Alignment      Offset         RoundTrip  Trans    Throughput 10^6bits/s
Local  Remote  Local  Remote  Latency    Rate     Outbound   Inbound
Send   Recv    Send   Recv    usec/Tran  per sec
8      0       0      0       49635.583  100.734  52.814     52.814

[EMAIL PROTECTED]:~/netperf2_trunk$

(this was a WAN test :) rick jones One of these days I may tweak netperf further so that if the CPU utilization method for either end doesn't require calibration, CPU utilization will always be done on that end. People's thoughts on that tweak would be most welcome...
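The figures in Rick's example are internally consistent, which is a handy way to sanity-check any burst-mode TCP_RR run: each transaction moves the 64 KiB request one way and the 64 KiB response the other, so per-direction throughput is just the transaction rate times 64 KiB:

```python
# Cross-check burst TCP_RR output: per-direction throughput (10^6 bits/s)
# equals transaction rate * bytes per transaction * 8.

def rr_throughput_mbps(trans_per_sec, bytes_per_txn=65536):
    return trans_per_sec * bytes_per_txn * 8 / 1e6

print(f"{rr_throughput_mbps(100.734):.3f} Mb/s each way")
# matches the 52.814 Outbound/Inbound figures reported above
```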
RE: e1000 full-duplex TCP performance well below wire speed
Bill Fink wrote: a 2.6.15.4 kernel. The GigE NICs are Intel PRO/1000 82546EB_QUAD_COPPER, on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000 driver, and running with 9000-byte jumbo frames. The TCP congestion control is BIC. Bill, FYI, there was a known issue with e1000 (fixed in 7.0.38-k2) and socket charge due to truesize that kept one end or the other from opening its window. The result is not so great performance, and you must upgrade the driver at both ends to fix it. It was fixed in commit 9e2feace1acd38d7a3b1275f7f9f8a397d09040e That commit itself needed a couple of follow-on bug fixes, but the point is that you could download 7.3.20 from sourceforge (which would compile on your kernel) and compare the performance with it if you were interested in a further experiment. Jesse
running aggregate netperf TCP_RR Re: e1000 full-duplex TCP performance well below wire speed
PS: Am I right that the TCP_RR tests should only be run on a single node at a time, not on both ends simultaneously? It depends on what you want to measure. In this specific case since the goal is to saturate the link in both directions it is unlikely you should need a second instance running, and if you do, going to a TCP_STREAM+TCP_MAERTS pair might be indicated. If one is measuring aggregate small transaction (perhaps packet) performance, then there can be times when running multiple, concurrent, aggregate TCP_RR tests is indicated. Also, from time to time you may want to experiment with the value you use with -b - the value necessary to get to saturation may not always be the same - particularly as you switch from link to link and from LAN to WAN and all those familiar bandwidthXdelay considerations. happy benchmarking, rick jones
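Rick's TCP_STREAM+TCP_MAERTS alternative can be sketched as follows (hostname is illustrative; assumes netperf/netserver are installed on both ends). Unlike the single-connection burst TCP_RR test, this loads both directions with two independent TCP connections:

```shell
# Load both link directions at once with two separate connections:
# TCP_STREAM sends local -> remote, TCP_MAERTS sends remote -> local.
HOST=n0020                      # hypothetical peer hostname
netperf -t TCP_STREAM -H "$HOST" -l 30 &
netperf -t TCP_MAERTS -H "$HOST" -l 30 &
wait                            # both tests run for the same 30 s window
```

Because the two netperf instances start within milliseconds of each other and use the same test length, skew between the two directions stays small.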
Re: e1000 full-duplex TCP performance well below wire speed
A lot of people tend to forget that the pci-express bus has enough bandwidth on first glance - 2.5gbit/sec for 1gbit of traffic - but apart from data going over it there is significant overhead going on: each packet requires transmit, cleanup and buffer transactions, and there are many irq register clears per second (slow ioread/writes). The transactions double for TCP ack processing, and this all accumulates and starts to introduce latency, higher cpu utilization etc... Sounds like tools to show PCI* bus utilization would be helpful... rick jones
Re: e1000 full-duplex TCP performance well below wire speed
Sounds like tools to show PCI* bus utilization would be helpful... that would be a hardware profiling thing and highly dependent on the part sticking out of the slot, vendor bus implementation etc... Perhaps Intel has some tools for this already but I personally do not know of any :/ Small matter of getting specs for the various LBA's (is that the correct term? - lower bus adaptors) and then abstracting them a la the CPU perf counters as done by say perfmon and then used by papi :) rick jones
Re: e1000 full-duplex TCP performance well below wire speed
Hi Auke, Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections. That eliminates bus bandwidth issues, probably, but small packets take up a lot of extra descriptors, bus bandwidth, CPU, and cache resources. I see. Your concern is the extra ACK packets associated with TCP. Even though these represent a small volume of data (around 5% with MTU=1500, and less at larger MTU) they double the number of packets that must be handled by the system compared to UDP transmission at the same data rate. Is that correct? A lot of people tend to forget that the pci-express bus has enough bandwidth on first glance - 2.5gbit/sec for 1gbit of traffic - but apart from data going over it there is significant overhead going on: each packet requires transmit, cleanup and buffer transactions, and there are many irq register clears per second (slow ioread/writes). The transactions double for TCP ack processing, and this all accumulates and starts to introduce latency, higher cpu utilization etc... Based on the discussion in this thread, I am inclined to believe that lack of PCI-e bus bandwidth is NOT the issue. The theory is that the extra packet handling associated with TCP acknowledgements is pushing the PCI-e x1 bus past its limits. However, the evidence seems to show otherwise: (1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit PCI connection. That connection can transfer data at 8 Gb/s. (2) If the theory were right, then doubling the MTU from 1500 to 3000 should have significantly reduced the problem, since it drops the number of ACKs by a factor of two. Similarly, going from MTU 1500 to MTU 9000 should reduce the number of ACKs by a factor of six, practically eliminating the problem. But changing the MTU size does not help. (3) The interrupt counts are quite reasonable. Broadcom NICs without interrupt aggregation generate an order of magnitude more irq/s and this doesn't prevent wire speed performance there. (4) The CPUs on the system are largely idle. There are plenty of computing resources available. (5) I don't think that the overhead will increase the bandwidth needed by more than a factor of two. Of course you and the other e1000 developers are the experts, but the dominant bus cost should be copying data buffers across the bus. Everything else is minimal in comparison. Intel insiders: isn't there some simple instrumentation available (which reads registers or statistics counters on the PCI-e interface chip) to tell us statistics such as how many bits have moved over the link in each direction? This plus some accurate timing would make it easy to see whether the TCP case is saturating the PCI-e bus. Then the theory could be addressed with data rather than with opinions. Cheers, Bruce
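The ACK-volume figures quoted in this exchange are easy to check with back-of-envelope arithmetic. A sketch, assuming one 60-byte ACK frame per MTU-sized data segment (the worst case; delayed ACKs halve these numbers):

```shell
# ACK bytes as a fraction of data volume for the MTUs tested in this
# thread, assuming one 60-byte ACK per data segment (no delayed ACKs).
# The byte overhead shrinks with MTU; the packet-count doubling does not.
for mtu in 1500 3000 9000; do
  awk -v mtu="$mtu" 'BEGIN {
    printf "MTU %4d: ACK volume ~ %.1f%% of data\n", mtu, 60 / mtu * 100
  }'
done
```

For MTU 1500 this gives about 4%, consistent with the 4-5% figure cited in the thread, and it makes the point in (2) concrete: if ACK bus traffic were the bottleneck, jumbo frames should have nearly eliminated it.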
Re: e1000 full-duplex TCP performance well below wire speed
Hi Auke, Based on the discussion in this thread, I am inclined to believe that lack of PCI-e bus bandwidth is NOT the issue. The theory is that the extra packet handling associated with TCP acknowledgements is pushing the PCI-e x1 bus past its limits. However, the evidence seems to show otherwise: (1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit PCI connection. That connection can transfer data at 8 Gb/s. That was even a PCI-X connection, which is known to have extremely good latency numbers, IIRC better than PCI-e? (?) which could account for a lot of the latency-induced lower performance... also, 82573's are _not_ a server part and were not designed for this usage. 82546's are, and that really does make a difference. I'm confused. It DOESN'T make a difference! Using 'server grade' 82546's on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1 bus. Just like us, when Bill goes from TCP to UDP, he gets wire speed back. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
Bruce Allen wrote: Hi Auke, Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simulaneously in both directions) using UDP. The problems occur only with TCP connections. That eliminates bus bandwidth issues, probably, but small packets take up a lot of extra descriptors, bus bandwidth, CPU, and cache resources. I see. Your concern is the extra ACK packets associated with TCP. Even those these represent a small volume of data (around 5% with MTU=1500, and less at larger MTU) they double the number of packets that must be handled by the system compared to UDP transmission at the same data rate. Is that correct? A lot of people tend to forget that the pci-express bus has enough bandwidth on first glance - 2.5gbit/sec for 1gbit of traffix, but apart from data going over it there is significant overhead going on: each packet requires transmit, cleanup and buffer transactions, and there are many irq register clears per second (slow ioread/writes). The transactions double for TCP ack processing, and this all accumulates and starts to introduce latency, higher cpu utilization etc... Based on the discussion in this thread, I am inclined to believe that lack of PCI-e bus bandwidth is NOT the issue. The theory is that the extra packet handling associated with TCP acknowledgements are pushing the PCI-e x1 bus past its limits. However the evidence seems to show otherwise: (1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit PCI connection. That connection can transfer data at 8Gb/s. That was even a PCI-X connection, which is known to have extremely good latency numbers, IIRC better than PCI-e? (?) which could account for a lot of the latency-induced lower performance... also, 82573's are _not_ a serverpart and were not designed for this usage. 82546's are and that really does make a difference. 82573's are full of power savings features and all that does make a difference even with some of them turned off. 
It's not for nothing that these 82573's are used in a ton of laptops from Toshiba, Lenovo, etc. A lot of this has to do with the card's internal clock timings as usual. So, you'd really have to compare the 82546 to an 82571 card to be fair. You get what you pay for, so to speak. (2) If the theory is right, then doubling the MTU from 1500 to 3000 should have significantly reduced the problem, since it drops the number of ACKs by a factor of two. Similarly, going from MTU 1500 to MTU 9000 should reduce the number of ACKs by a factor of six, practically eliminating the problem. But changing the MTU size does not help. (3) The interrupt counts are quite reasonable. Broadcom NICs without interrupt aggregation generate an order of magnitude more irq/s and this doesn't prevent wire speed performance there. (4) The CPUs on the system are largely idle. There are plenty of computing resources available. (5) I don't think that the overhead will increase the bandwidth needed by more than a factor of two. Of course you and the other e1000 developers are the experts, but the dominant bus cost should be copying data buffers across the bus. Everything else is minimal in comparison. Intel insiders: isn't there some simple instrumentation available (which reads registers or statistics counters on the PCI-e interface chip) to tell us statistics such as how many bits have moved over the link in each direction? This plus some accurate timing would make it easy to see if the TCP case is saturating the PCI-e bus. Then the theory could be addressed with data rather than with opinions. the only tools we have are expensive bus analyzers. As said in the thread with Rick Jones, I think there might be some tools available from Intel for this but I have never seen these. Auke
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bill, I see similar results on my test systems. Thanks for this report and for confirming our observations. Could you please confirm that a single-port bidirectional UDP link runs at wire speed? This helps to localize the problem to the TCP stack or the interaction of the TCP stack with the e1000 driver and hardware. Yes, a single-port bidirectional UDP test gets full GigE line rate in both directions with no packet loss. Thanks for confirming this. And thanks also for nuttcp! I just recognized you as the author. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
On Thu, 31 Jan 2008, Bruce Allen wrote: Based on the discussion in this thread, I am inclined to believe that lack of PCI-e bus bandwidth is NOT the issue. The theory is that the extra packet handling associated with TCP acknowledgements are pushing the PCI-e x1 bus past its limits. However the evidence seems to show otherwise: (1) Bill Fink has reported the same problem on a NIC with a 133 MHz 64-bit PCI connection. That connection can transfer data at 8Gb/s. That was even a PCI-X connection, which is known to have extremely good latency numbers, IIRC better than PCI-e? (?) which could account for a lot of the latency-induced lower performance... also, 82573's are _not_ a server part and were not designed for this usage. 82546's are and that really does make a difference. I'm confused. It DOESN'T make a difference! Using 'server grade' 82546's on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1 bus. Just like us, when Bill goes from TCP to UDP, he gets wire speed back. Good. I thought it was just me who was confused by Auke's reply. :-) Yes, I get the same type of reduced TCP performance behavior on a bidirectional test that Bruce has seen, even though I'm using the better 82546 GigE NIC on a faster 64-bit/133-MHz PCI-X bus. I also don't think bus bandwidth is an issue, but I am curious whether there are any known papers on typical PCI-X/PCI-E bus overhead on network transfers, either bulk data transfers with large packets or more transaction or video based applications using smaller packets. I started musing whether, once one side's transmitter got the upper hand, it might somehow defer the processing of received packets, causing the resultant ACKs to be delayed and thus further slowing down the other end's transmitter. I began to wonder if the txqueuelen could have an effect on the TCP performance behavior. 
I normally have the txqueuelen set to 1 for 10-GigE testing, so decided to run a test with txqueuelen set to 200 (actually settled on this value through some experimentation). Here is a typical result: [EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 nuttcp -f-beta -Irx -r -w2m 192.168.6.79 tx: 1120.6345 MB / 10.07 sec = 933.4042 Mbps 12 %TX 9 %RX 0 retrans rx: 1104.3081 MB / 10.09 sec = 917.7365 Mbps 12 %TX 11 %RX 0 retrans This is significantly better, but there was more variability in the results. The above was with TSO enabled. I also then ran a test with TSO disabled, with the following typical result: [EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 nuttcp -f-beta -Irx -r -w2m 192.168.6.79 tx: 1119.4749 MB / 10.05 sec = 934.2922 Mbps 13 %TX 9 %RX 0 retrans rx: 1131.7334 MB / 10.05 sec = 944.8437 Mbps 15 %TX 12 %RX 0 retrans This was a little better yet and getting closer to expected results. Jesse Brandeburg mentioned in another post that there were known performance issues with the version of the e1000 driver I'm using. I recognized that the kernel/driver versions I was using were rather old, but it was what I had available to do a quick test with. Those particular systems are in a remote location so I have to be careful with messing with their network drivers. I do have some other test systems at work that I might be able to try with newer kernels and/or drivers or maybe even with other vendor's GigE NICs, but I won't be back to work until early next week sometime. -Bill
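Bill's txqueuelen experiment can be reproduced with standard tools. A sketch (interface name is illustrative; needs root; the usual default on these kernels is 1000):

```shell
# Show the current transmit queue length (reported as "qlen" or
# "txqueuelen" in the first line of output):
ip link show dev eth0

# Set it to the value Bill settled on through experimentation:
ip link set dev eth0 txqueuelen 200

# Equivalent with the older tool:
# ifconfig eth0 txqueuelen 200
```

A shorter queue bounds how much the sending side can buffer ahead of the wire, which may be why it reduced the ACK-starvation effect Bill describes.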
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bill, I started musing whether, once one side's transmitter got the upper hand, it might somehow defer the processing of received packets, causing the resultant ACKs to be delayed and thus further slowing down the other end's transmitter. I began to wonder if the txqueuelen could have an effect on the TCP performance behavior. I normally have the txqueuelen set to 1 for 10-GigE testing, so decided to run a test with txqueuelen set to 200 (actually settled on this value through some experimentation). Here is a typical result: [EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 nuttcp -f-beta -Irx -r -w2m 192.168.6.79 tx: 1120.6345 MB / 10.07 sec = 933.4042 Mbps 12 %TX 9 %RX 0 retrans rx: 1104.3081 MB / 10.09 sec = 917.7365 Mbps 12 %TX 11 %RX 0 retrans This is significantly better, but there was more variability in the results. The above was with TSO enabled. I also then ran a test with TSO disabled, with the following typical result: [EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 nuttcp -f-beta -Irx -r -w2m 192.168.6.79 tx: 1119.4749 MB / 10.05 sec = 934.2922 Mbps 13 %TX 9 %RX 0 retrans rx: 1131.7334 MB / 10.05 sec = 944.8437 Mbps 15 %TX 12 %RX 0 retrans This was a little better yet and getting closer to expected results. We'll also try changing txqueuelen. I have not looked, but I suppose that this is set to the default value of 1000. We'd be delighted to see full-duplex performance that was consistent and greater than 900 Mb/s x 2. I do have some other test systems at work that I might be able to try with newer kernels and/or drivers or maybe even with other vendor's GigE NICs, but I won't be back to work until early next week sometime. Bill, we'd be happy to give you root access to a couple of our systems here if you want to do additional testing. We can put the latest drivers on them (and reboot if/as needed). If you want to do this, please just send an ssh public key to Carsten. 
Cheers, Bruce
e1000 full-duplex TCP performance well below wire speed
(Pádraig Brady has suggested that I post this to Netdev. It was originally posted to LKML here: http://lkml.org/lkml/2008/1/30/141 ) Dear NetDev, We've connected a pair of modern high-performance boxes with integrated copper Gb/s Intel NICs, with an ethernet crossover cable, and have run some netperf full duplex TCP tests. The transfer rates are well below wire speed. We're reporting this as a kernel bug, because we expect a vanilla kernel with default settings to give wire speed (or close to wire speed) performance in this case. We DO see wire speed in simplex transfers. The behavior has been verified on multiple machines with identical hardware. Details: Kernel version: 2.6.23.12 ethernet NIC: Intel 82573L ethernet driver: e1000 version 7.3.20-k2 motherboard: Supermicro PDSML-LN2+ (one quad core Intel Xeon X3220, Intel 3000 chipset, 8GB memory) The test was done with various mtu sizes ranging from 1500 to 9000, with ethernet flow control switched on and off, and using reno and cubic as a TCP congestion control. The behavior depends on the setup. In one test we used cubic congestion control, flow control off. The transfer rate in one direction was above 0.9Gb/s while in the other direction it was 0.6 to 0.8 Gb/s. After 15-20s the rates flipped. Perhaps the two streams are fighting for resources. (The performance of a full duplex stream should be close to 1Gb/s in both directions.) A graph of the transfer speed as a function of time is here: https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg Red shows transmit and green shows receive (please ignore other plots). We're happy to do additional testing, if that would help, and very grateful for any advice! Bruce Allen Carsten Aulbert Henning Fehrmann
Re: e1000 full-duplex TCP performance well below wire speed
Hi David, Thanks for your note. (The performance of a full duplex stream should be close to 1Gb/s in both directions.) This is not a reasonable expectation. ACKs take up space on the link in the opposite direction of the transfer. So the link usage in the opposite direction of the transfer is very far from zero. Indeed, we are not asking to see 1000 Mb/s. We'd be happy to see 900 Mb/s. Netperf is transmitting a large buffer in MTU-sized packets (min 1500 bytes). Since the acks are only about 60 bytes in size, they should be around 4% of the total traffic. Hence we would not expect to see more than 960 Mb/s. We have run these same tests on older kernels (with Broadcom NICs) and gotten above 900 Mb/s full duplex. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
From: Bruce Allen [EMAIL PROTECTED] Date: Wed, 30 Jan 2008 03:51:51 -0600 (CST) [ netdev@vger.kernel.org added to CC: list, that is where kernel networking issues are discussed. ] (The performance of a full duplex stream should be close to 1Gb/s in both directions.) This is not a reasonable expectation. ACKs take up space on the link in the opposite direction of the transfer. So the link usage in the opposite direction of the transfer is very far from zero.
Re: e1000 full-duplex TCP performance well below wire speed
On Wed, 30 Jan 2008 08:01:46 -0600 (CST) Bruce Allen [EMAIL PROTECTED] wrote: Hi David, Thanks for your note. (The performance of a full duplex stream should be close to 1Gb/s in both directions.) This is not a reasonable expectation. ACKs take up space on the link in the opposite direction of the transfer. So the link usage in the opposite direction of the transfer is very far from zero. Indeed, we are not asking to see 1000 Mb/s. We'd be happy to see 900 Mb/s. Netperf is transmitting a large buffer in MTU-sized packets (min 1500 bytes). Since the acks are only about 60 bytes in size, they should be around 4% of the total traffic. Hence we would not expect to see more than 960 Mb/s. We have run these same tests on older kernels (with Broadcom NICs) and gotten above 900 Mb/s full duplex. Cheers, Bruce Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/ Max TCP Payload data rates over ethernet: (1500-40)/(38+1500) = 94.9285 % IPv4, minimal headers (1500-52)/(38+1500) = 94.1482 % IPv4, TCP timestamps I believe what you are seeing is an effect that occurs when using cubic on links with no other idle traffic. With two flows at high speed, the first flow consumes most of the router buffer and backs off gradually, and the second flow is not very aggressive. It has been discussed back and forth between TCP researchers with no agreement; one side says that it is unfairness and the other side says it is not a problem in the real world because of the presence of background traffic. See: http://www.hamilton.ie/net/pfldnet2007_cubic_final.pdf http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf -- Stephen Hemminger [EMAIL PROTECTED]
RE: e1000 full-duplex TCP performance well below wire speed
Bruce Allen wrote: Details: Kernel version: 2.6.23.12 ethernet NIC: Intel 82573L ethernet driver: e1000 version 7.3.20-k2 motherboard: Supermicro PDSML-LN2+ (one quad core Intel Xeon X3220, Intel 3000 chipset, 8GB memory) Hi Bruce, The 82573L (a client NIC, regardless of the class of machine it is in) only has a x1 connection which does introduce some latency since the slot is only capable of about 2Gb/s data total, which includes overhead of descriptors and other transactions. As you approach the maximum of the slot it gets more and more difficult to get wire speed in a bidirectional test. The test was done with various mtu sizes ranging from 1500 to 9000, with ethernet flow control switched on and off, and using reno and cubic as a TCP congestion control. As asked in the LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) I've recently discovered that particularly with the most recent kernels if you specify any socket options (-- -SX -sY) to netperf it does worse than if it just lets the kernel auto-tune. The behavior depends on the setup. In one test we used cubic congestion control, flow control off. The transfer rate in one direction was above 0.9Gb/s while in the other direction it was 0.6 to 0.8 Gb/s. After 15-20s the rates flipped. Perhaps the two streams are fighting for resources. (The performance of a full duplex stream should be close to 1Gb/s in both directions.) A graph of the transfer speed as a function of time is here: https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg Red shows transmit and green shows receive (please ignore other plots). One other thing you can try with e1000 is disabling the dynamic interrupt moderation by loading the driver with InterruptThrottleRate=8000,8000,... 
(the number of commas depends on your number of ports) which might help in your particular benchmark. Just for completeness can you post the dump of ethtool -e eth0 and lspci -vvv? Thanks, Jesse
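Jesse's suggestion, as a sketch (needs root; reloading the module briefly drops all e1000 links, so don't do it over the interface you're logged in on; one comma-separated value per port):

```shell
# Fix the e1000 interrupt rate at 8000 irq/s per port instead of the
# dynamic moderation, then collect the diagnostics requested above.
modprobe -r e1000
modprobe e1000 InterruptThrottleRate=8000,8000

ethtool -e eth0 > eeprom-dump.txt    # NVM/EEPROM contents
lspci -vvv      > lspci-dump.txt     # slot width, MSI, latency settings
```

A fixed throttle rate makes the per-interrupt batching deterministic, which removes one source of run-to-run variability in the benchmark.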
Re: e1000 full-duplex TCP performance well below wire speed
As asked in the LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) In particular, it would be good to know if you are doing two concurrent streams, or if you are using the burst mode TCP_RR with large request/response sizes method which then is only using one connection. I've recently discovered that particularly with the most recent kernels if you specify any socket options (-- -SX -sY) to netperf it does worse than if it just lets the kernel auto-tune. That is the bit where explicit setsockopts are capped by core [rw]mem sysctls but the autotuning is not correct? rick jones BTW, a bit of netperf news - the omni (two routines to measure it all) tests seem to be more or less working now in top of trunk netperf. It of course still needs work/polish, but if folks would like to play with them, I'd love the feedback. Output is a bit different from classic netperf, and includes an option to emit the results as csv (test-specific -o presently) rather than human readable (test-specific -O). You get the omni stuff via ./configure --enable-omni and use omni as the test name. No docs yet; for options and their effects, you need to look at scan_omni_args in src/nettest_omni.c One other addition in the omni tests is retrieving not just the initial SO_*BUF sizes, but also the final SO_*BUF sizes so one can see where autotuning took things just based on netperf output. If the general consensus is that the overhead of the omni stuff isn't too dear (there are more conditionals in the mainline than with classic netperf), I will convert the classic netperf tests to use the omni code. 
BTW, don't have a heart attack when you see the quantity of current csv output - I do plan on being able to let the user specify what values should be included :)
Re: e1000 full-duplex TCP performance well below wire speed
Bruce Allen wrote: (Pádraig Brady has suggested that I post this to Netdev. It was originally posted to LKML here: http://lkml.org/lkml/2008/1/30/141 ) Dear NetDev, We've connected a pair of modern high-performance boxes with integrated copper Gb/s Intel NICs, with an ethernet crossover cable, and have run some netperf full duplex TCP tests. The transfer rates are well below wire speed. We're reporting this as a kernel bug, because we expect a vanilla kernel with default settings to give wire speed (or close to wire speed) performance in this case. We DO see wire speed in simplex transfers. The behavior has been verified on multiple machines with identical hardware. Try using NICs in the pci-e slots. We have better luck there, as you usually have more lanes and/or higher quality NIC chipsets available in this case. Try a UDP test to make sure the NIC can actually handle the throughput. Look at the actual link usage as reported by the ethernet driver so that you take all of the ACKs and other overhead into account. Try the same test using 10G hardware (CX4 NICs are quite affordable these days, and we drove a 2-port 10G NIC based on the Intel ixgbe chipset at around 4Gbps on two ports, full duplex, using pktgen). As in around 16Gbps throughput across the busses. That may also give you an idea if the bottleneck is hardware or software related. Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
Re: e1000 full-duplex TCP performance well below wire speed
Hi Stephen, Thanks for your helpful reply and especially for the literature pointers. Indeed, we are not asking to see 1000 Mb/s. We'd be happy to see 900 Mb/s. Netperf is transmitting a large buffer in MTU-sized packets (min 1500 bytes). Since the acks are only about 60 bytes in size, they should be around 4% of the total traffic. Hence we would not expect to see more than 960 Mb/s. Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/ Max TCP Payload data rates over ethernet: (1500-40)/(38+1500) = 94.9285 % IPv4, minimal headers (1500-52)/(38+1500) = 94.1482 % IPv4, TCP timestamps Yes. If you look further down the page, you will see that with jumbo frames (which we have also tried) on Gb/s ethernet the maximum throughput is: (9000-20-20-12)/(9000+14+4+7+1+12)*1000 = 990.042 Mbps We are very far from this number -- averaging perhaps 600 or 700 Mbps. I believe what you are seeing is an effect that occurs when using cubic on links with no other idle traffic. With two flows at high speed, the first flow consumes most of the router buffer and backs off gradually, and the second flow is not very aggressive. It has been discussed back and forth between TCP researchers with no agreement, one side says that it is unfairness and the other side says it is not a problem in the real world because of the presence of background traffic. At least in principle, we should have NO congestion here. We have ports on two different machines wired with a crossover cable. Box A can not transmit faster than 1 Gb/s. Box B should be able to receive that data without dropping packets. It's not doing anything else! See: http://www.hamilton.ie/net/pfldnet2007_cubic_final.pdf http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf This is extremely helpful. The typical oscillation (startup) period shown in the plots in these papers is of order 10 seconds, which is similar to the types of oscillation periods that we are seeing. 
*However* we have also seen similar behavior with the Reno congestion control algorithm. So this might not be due to cubic, or entirely due to cubic. In our application (cluster computing) we use a very tightly coupled high-speed low-latency network. There is no 'wide area traffic'. So it's hard for me to understand why any networking components or software layers should take more than milliseconds to ramp up or back off in speed. Perhaps we should be asking for a TCP congestion avoidance algorithm which is designed for a data center environment where there are very few hops and typical packet delivery times are tens or hundreds of microseconds. It's very different than delivering data thousands of km across a WAN. Cheers, Bruce -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
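The multi-second time scales are consistent with CUBIC's design: its window grows as a function of wall-clock time, not of RTTs, so its recovery constant is the same on a crossover cable as on a WAN. A rough sketch, assuming the standard constants C=0.4 and beta=0.2 from the CUBIC paper cited above (the window sizes in packets are hypothetical illustrations, not measurements from this setup):

```python
def cubic_recovery_time(w_max, beta=0.2, c=0.4):
    """Seconds for CUBIC to grow back to w_max after a loss.

    CUBIC's window is W(t) = C*(t - K)**3 + w_max, where
    K = cbrt(w_max * beta / C).  Note K does not depend on RTT.
    """
    return (w_max * beta / c) ** (1.0 / 3.0)


# Even modest windows take seconds of wall-clock time to recover:
print(cubic_recovery_time(100))    # ~3.7 s
print(cubic_recovery_time(1000))   # ~7.9 s
```

This is one plausible reason a LAN with microsecond delivery times can still show ramp-up/back-off periods of order 10 seconds, as in the plots mentioned above.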
Re: e1000 full-duplex TCP performance well below wire speed
Hi Ben, Thank you for the suggestions and questions. We've connected a pair of modern high-performance boxes with integrated copper Gb/s Intel NICs, with an ethernet crossover cable, and have run some netperf full duplex TCP tests. The transfer rates are well below wire speed. We're reporting this as a kernel bug, because we expect a vanilla kernel with default settings to give wire speed (or close to wire speed) performance in this case. We DO see wire speed in simplex transfers. The behavior has been verified on multiple machines with identical hardware. Try using NICs in the pci-e slots. We have better luck there, as you usually have more lanes and/or higher quality NIC chipsets available in this case. It's a good idea. We can try this, though it will take a little time to organize. Try a UDP test to make sure the NIC can actually handle the throughput. I should have mentioned this in my original post -- we already did this. We can run UDP wire speed full duplex (over 900 Mb/s in each direction, at the same time). So the problem stems from TCP or is aggravated by TCP. It's not a hardware limitation. Look at the actual link usage as reported by the ethernet driver so that you take all of the ACKs and other overhead into account. OK. We'll report on this as soon as possible. Try the same test using 10G hardware (CX4 NICs are quite affordable these days, and we drove a 2-port 10G NIC based on the Intel ixgbe chipset at around 4Gbps on two ports, full duplex, using pktgen). As in around 16Gbps throughput across the busses. That may also give you an idea if the bottleneck is hardware or software related. OK. That will take more time to organize. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
On Wed, 30 Jan 2008 16:25:12 -0600 (CST) Bruce Allen [EMAIL PROTECTED] wrote: Hi Stephen, Thanks for your helpful reply and especially for the literature pointers. Indeed, we are not asking to see 1000 Mb/s. We'd be happy to see 900 Mb/s. Netperf is transmitting a large buffer in MTU-sized packets (min 1500 bytes). Since the acks are only about 60 bytes in size, they should be around 4% of the total traffic. Hence we would not expect to see more than 960 Mb/s. Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/ Max TCP payload data rates over ethernet: (1500-40)/(38+1500) = 94.9285 % IPv4, minimal headers (1500-52)/(38+1500) = 94.1482 % IPv4, TCP timestamps Yes. If you look further down the page, you will see that with jumbo frames (which we have also tried) on Gb/s ethernet the maximum throughput is: (9000-20-20-12)/(9000+14+4+7+1+12)*1000 = 990.042 Mbps We are very far from this number -- averaging perhaps 600 or 700 Mbps. That is the upper bound of performance on a standard PCI bus (32 bit). To go higher you need PCI-X or PCI-Express. Also make sure you are really getting 64-bit PCI, because I have seen some e1000 PCI-X boards that are only 32-bit.
RE: e1000 full-duplex TCP performance well below wire speed
Hi Jesse, It's good to be talking directly to one of the e1000 developers and maintainers. Although at this point I am starting to think that the issue may be TCP stack related and nothing to do with the NIC. Am I correct that these are quite distinct parts of the kernel? The 82573L (a client NIC, regardless of the class of machine it is in) only has an x1 connection, which does introduce some latency since the slot is only capable of about 2Gb/s data total, which includes overhead of descriptors and other transactions. As you approach the maximum of the slot it gets more and more difficult to get wire speed in a bidirectional test. According to the Intel datasheet, the PCI-e x1 connection is 2Gb/s in each direction. So we only need to get up to 50% of peak to saturate a full-duplex wire-speed link. I hope that the overhead is not a factor of two. Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections. The test was done with various MTU sizes ranging from 1500 to 9000, with ethernet flow control switched on and off, and using reno and cubic as a TCP congestion control. As asked in the LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in Germany). So we'll provide this info in ~10 hours. I assume that the interrupt load is distributed among all four cores -- the default affinity is 0xff, and I also assume that there is some type of interrupt aggregation taking place in the driver. If the CPUs were not able to service the interrupts fast enough, I assume that we would also see loss of performance with UDP testing. 
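The disagreement about the x1 slot can be made concrete. A back-of-the-envelope sketch, assuming PCIe 1.x signalling (2.5 GT/s per lane with 8b/10b encoding) and deliberately ignoring TLP/descriptor overhead, which is exactly the part under dispute:

```python
# PCIe 1.x: 2.5 GT/s per lane, 8b/10b encoding -> 2.0 Gb/s of data per direction
PCIE1_LANE_GBPS = 2.5 * 8 / 10

GIGE_WIRE_GBPS = 1.0  # one direction of gigabit ethernet

# PCIe lanes are full duplex, so each direction of a bidirectional
# wire-speed GigE test has its own 2 Gb/s budget on an x1 link.
utilization = GIGE_WIRE_GBPS / PCIE1_LANE_GBPS
print(f"{utilization:.0%} of the x1 link per direction, "
      "before descriptor and TLP overhead")
```

On these numbers Bruce's point holds (50% nominal headroom per direction); Jesse's caveat is that descriptor fetches, write-backs, and TLP headers eat into that margin as the slot approaches saturation.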
I've recently discovered that particularly with the most recent kernels if you specify any socket options (-- -SX -sY) to netperf it does worse than if it just lets the kernel auto-tune. I am pretty sure that no socket options were specified, but again need to wait until Carsten or Henning come back on-line. The behavior depends on the setup. In one test we used cubic congestion control, flow control off. The transfer rate in one direction was above 0.9Gb/s while in the other direction it was 0.6 to 0.8 Gb/s. After 15-20s the rates flipped. Perhaps the two streams are fighting for resources. (The performance of a full duplex stream should be close to 1Gb/s in both directions.) A graph of the transfer speed as a function of time is here: https://n0.aei.uni-hannover.de/networktest/node19-new20-noflow.jpg Red shows transmit and green shows receive (please ignore other plots). One other thing you can try with e1000 is disabling the dynamic interrupt moderation by loading the driver with InterruptThrottleRate=8000,8000,... (the number of commas depends on your number of ports) which might help in your particular benchmark. OK. Is 'dynamic interrupt moderation' another name for 'interrupt aggregation'? Meaning that if more than one interrupt is generated in a given time interval, then they are replaced by a single interrupt? just for completeness can you post the dump of ethtool -e eth0 and lspci -vvv? Yup, we'll give that info also. Thanks again! Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
Hi Rick, First off, thanks for netperf. I've used it a lot and find it an extremely useful tool. As asked in LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) In particular, it would be good to know if you are doing two concurrent streams, or if you are using the burst mode TCP_RR with large request/response sizes method which then is only using one connection. I'm not sure -- must wait for Henning and Carsten to respond tomorrow. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
Hi Stephen, Indeed, we are not asking to see 1000 Mb/s. We'd be happy to see 900 Mb/s. Netperf is transmitting a large buffer in MTU-sized packets (min 1500 bytes). Since the acks are only about 60 bytes in size, they should be around 4% of the total traffic. Hence we would not expect to see more than 960 Mb/s. Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/ Max TCP payload data rates over ethernet: (1500-40)/(38+1500) = 94.9285 % IPv4, minimal headers (1500-52)/(38+1500) = 94.1482 % IPv4, TCP timestamps Yes. If you look further down the page, you will see that with jumbo frames (which we have also tried) on Gb/s ethernet the maximum throughput is: (9000-20-20-12)/(9000+14+4+7+1+12)*1000 = 990.042 Mbps We are very far from this number -- averaging perhaps 600 or 700 Mbps. That is the upper bound of performance on a standard PCI bus (32 bit). To go higher you need PCI-X or PCI-Express. Also make sure you are really getting 64-bit PCI, because I have seen some e1000 PCI-X boards that are only 32bit. The motherboard NIC is in a PCI-e x1 slot. This has a maximum speed of 250 MB/s (2 Gb/s) in each direction. It should be a factor of 2 more interface speed than is needed. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bruce, On Jan 30, 2008 5:25 PM, Bruce Allen [EMAIL PROTECTED] wrote: In our application (cluster computing) we use a very tightly coupled high-speed low-latency network. There is no 'wide area traffic'. So it's hard for me to understand why any networking components or software layers should take more than milliseconds to ramp up or back off in speed. Perhaps we should be asking for a TCP congestion avoidance algorithm which is designed for a data center environment where there are very few hops and typical packet delivery times are tens or hundreds of microseconds. It's very different than delivering data thousands of km across a WAN. If your network latency is low, any of these protocols should give you more than 900Mbps. I would guess the RTT between your two machines is less than 4ms, and I remember that the throughputs of all the high-speed protocols (including tcp-reno) were more than 900Mbps with a 4ms RTT. So, my question is: which kernel version did you use with your Broadcom NIC when you got more than 900Mbps? I have two machines connected by a gig switch and I can see what happens in my environment. Could you post which parameters you used for netperf testing? Also, if you set any other parameters for your testing, please post them here so that I can see whether the same happens to me as well. Regards, Sangtae
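For reference, the window a single TCP connection needs to fill the pipe is just the bandwidth-delay product. A quick sketch for the 4 ms figure Sangtae mentions, and for a back-to-back crossover cable (the ~100 us RTT is an illustrative assumption for the LAN case):

```python
def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return rate_bps * rtt_s / 8


# 1 Gb/s at 4 ms RTT: the window must reach ~500 kB
print(bdp_bytes(1e9, 0.004))
# Back-to-back cable, ~100 us RTT: only ~12.5 kB in flight
print(bdp_bytes(1e9, 100e-6))
```

The crossover-cable BDP is tiny, which is why the kernel's socket auto-tuning should have no trouble with window size here; the open question in the thread is the congestion algorithm's behavior, not buffer sizing.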
RE: e1000 full-duplex TCP performance well below wire speed
Bruce Allen wrote: Hi Jesse, It's good to be talking directly to one of the e1000 developers and maintainers. Although at this point I am starting to think that the issue may be TCP stack related and nothing to do with the NIC. Am I correct that these are quite distinct parts of the kernel? Yes, quite. Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections. That eliminates bus bandwidth issues, probably, but small packets take up a lot of extra descriptors, bus bandwidth, CPU, and cache resources. The test was done with various MTU sizes ranging from 1500 to 9000, with ethernet flow control switched on and off, and using reno and cubic as a TCP congestion control. As asked in LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in Germany). So we'll provide this info in ~10 hours. I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I think you'll have to compile netperf with burst mode support enabled. I assume that the interrupt load is distributed among all four cores -- the default affinity is 0xff, and I also assume that there is some type of interrupt aggregation taking place in the driver. If the CPUs were not able to service the interrupts fast enough, I assume that we would also see loss of performance with UDP testing. One other thing you can try with e1000 is disabling the dynamic interrupt moderation by loading the driver with InterruptThrottleRate=8000,8000,... (the number of commas depends on your number of ports) which might help in your particular benchmark. OK. Is 'dynamic interrupt moderation' another name for 'interrupt aggregation'? 
Meaning that if more than one interrupt is generated in a given time interval, then they are replaced by a single interrupt? Yes, InterruptThrottleRate=8000 means there will be no more than 8000 ints/second from that adapter, and if interrupts are generated faster than that they are aggregated. Interestingly since you are interested in ultra low latency, and may be willing to give up some cpu for it during bulk transfers, you should try InterruptThrottleRate=1 (can generate up to 70000 ints/s) just for completeness can you post the dump of ethtool -e eth0 and lspci -vvv? Yup, we'll give that info also. Thanks again! Welcome, it's an interesting discussion. Hope we can come to a good conclusion. Jesse
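To put the 8000 ints/s figure in context, here is a quick sketch of what that moderation rate means at gigabit wire speed with a 1500-byte MTU (1538 bytes per frame on the wire, using the same 38-byte overhead figure quoted earlier in the thread):

```python
WIRE_BYTES_PER_FRAME = 1538   # 1500 MTU + 38 bytes of ethernet framing overhead
LINK_BPS = 1_000_000_000      # gigabit ethernet, one direction

frames_per_sec = LINK_BPS / (WIRE_BYTES_PER_FRAME * 8)  # ~81,274 frames/s
frames_per_irq = frames_per_sec / 8000                  # ~10 frames batched per interrupt
worst_case_added_latency_us = 1e6 / 8000                # up to 125 us before the IRQ fires

print(round(frames_per_sec), round(frames_per_irq), worst_case_added_latency_us)
```

So at this setting each interrupt services roughly ten full-size frames, trading up to 125 us of added latency per batch for far lower per-packet CPU cost, which is the trade-off Jesse describes.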