Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-03-14 Thread Bill Fink
On Mon, 14 Mar 2016, Yuchung Cheng wrote:

> On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
>  wrote:
> >
> > Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> > the latency for applications sending time-dependent data.
...
> > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > index 6a92b15..8f3f3bf 100644
> > --- a/Documentation/networking/ip-sysctl.txt
> > +++ b/Documentation/networking/ip-sysctl.txt
> > @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
> > calculated, which is used to classify whether a stream is thin.
> > Default: 1
> >
> > +tcp_rdb - BOOLEAN
> > +   Enable RDB for all new TCP connections.
>   Please describe RDB briefly, perhaps with a pointer to your paper.
>I suggest having three levels of control:
>0: disable RDB completely
>1: enable indiv. thin-stream conn. to use RDB via the TCP_RDB
>   socket option
>2: enable RDB on all thin-stream conn. by default
> 
>Currently it only provides modes 1 and 2, but there may be cases where
>the administrator wants to disallow it entirely (e.g., broken middle-boxes).
> 
> > +   Default: 0

A per route setting to enable or disable tcp_rdb, overriding
the global setting, could also be useful to the administrator.
Just a suggestion for potential followup work.
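
For reference, per-socket enabling under mode 1 would presumably look
something like the following from userspace (a sketch only; TCP_RDB is
the socket option proposed in this series, but the option number used
below is purely illustrative, not the value assigned by the patch):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	#ifndef TCP_RDB
	#define TCP_RDB 35	/* illustrative value only; use the one from
				 * the patched kernel's headers */
	#endif

	static int enable_rdb(int sock)
	{
		int on = 1;

		/* ask the kernel to bundle previously sent but unacked
		 * data with new segments on this thin-stream connection
		 */
		return setsockopt(sock, IPPROTO_TCP, TCP_RDB, &on, sizeof(on));
	}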

-Bill


Re: [IPV6]: Fix IPsec datagram fragmentation

2008-02-14 Thread Bill Fink
Hi Herbert,

On Fri, 15 Feb 2008, Herbert Xu wrote:

 On Tue, Feb 12, 2008 at 06:08:28PM -0800, David Miller wrote:
  
   [IPV6]: Fix IPsec datagram fragmentation
 
  Applied, and I'll queue this up to -stable as well.
 
 Sorry, David Stevens just told me that it doesn't work as intended.
 
 [IPV6]: Fix reversed local_df test in ip6_fragment
 
 I managed to reverse the local_df test when forward-porting this
 patch so it actually makes things worse by never fragmenting at
 all.
 
 Thanks to David Stevens for testing and reporting this bug.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]
 
 --
 diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
 index 4e9a2fe..35ba693 100644
 --- a/net/ipv6/ip6_output.c
 +++ b/net/ipv6/ip6_output.c
 @@ -621,7 +621,7 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	 * or if the skb it not generated by a local socket.  (This last
 	 * check should be redundant, but it's free.)
 	 */
 -	if (skb->local_df) {
 +	if (!skb->local_df) {
 		skb->dev = skb->dst->dev;
 		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu, skb->dev);
 		IP6_INC_STATS(ip6_dst_idev(skb->dst), IPSTATS_MIB_FRAGFAILS);

I think the setting of skb-local_def is still backwards in your
original patch:

 diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
 index 9ac6ca2..4e9a2fe 100644
 --- a/net/ipv6/ip6_output.c
 +++ b/net/ipv6/ip6_output.c

...

 @@ -1420,6 +1420,10 @@ int ip6_push_pending_frames(struct sock *sk)
 		tmp_skb->sk = NULL;
 	}
  
 +	/* Allow local fragmentation. */
 +	if (np->pmtudisc >= IPV6_PMTUDISC_DO)
 +		skb->local_df = 1;
 +
 	ipv6_addr_copy(final_dst, &fl->fl6_dst);
 	__skb_pull(skb, skb_network_header_len(skb));
 	if (opt && opt->opt_flen)

I think the test should be:

if (np->pmtudisc < IPV6_PMTUDISC_DO)

as it is in net/ipv4/ip_output.c:

/* Unless user demanded real pmtu discovery (IP_PMTUDISC_DO), we allow
 * to fragment the frame generated here. No matter, what transforms
 * how transforms change size of the packet, it will come out.
 */
if (inet->pmtudisc < IP_PMTUDISC_DO)
	skb->local_df = 1;
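
In other words, I'd expect the IPv6 hunk to mirror that logic (a sketch
of the presumed fix, not a tested patch):

	/* Allow local fragmentation. */
	if (np->pmtudisc < IPV6_PMTUDISC_DO)
		skb->local_df = 1;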

Or perhaps I'm just missing something obvious.

-Bill


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bill Fink
On Wed, 30 Jan 2008, SANGTAE HA wrote:

 On Jan 30, 2008 5:25 PM, Bruce Allen [EMAIL PROTECTED] wrote:
 
  In our application (cluster computing) we use a very tightly coupled
  high-speed low-latency network.  There is no 'wide area traffic'.  So it's
  hard for me to understand why any networking components or software layers
  should take more than milliseconds to ramp up or back off in speed.
  Perhaps we should be asking for a TCP congestion avoidance algorithm which
  is designed for a data center environment where there are very few hops
  and typical packet delivery times are tens or hundreds of microseconds.
  It's very different than delivering data thousands of km across a WAN.
 
 
 If your network latency is low, any type of protocol should
 give you more than 900Mbps. I can guess the RTT of the two machines is
 less than 4ms in your case, and I remember the throughputs of all
 high-speed protocols (including tcp-reno) were more than 900Mbps with
 4ms RTT. So, my question is: which kernel version did you use with your
 Broadcom NIC to get more than 900Mbps?
 
 I have two machines connected by a gig switch and I can see what
 happens in my environment. Could you post the parameters you used
 for your netperf testing?
 Also, if you set any parameters for your testing, please post them
 here so that I can see whether the same happens for me as well.

I see similar results on my test systems, using Tyan Thunder K8WE (S2895)
motherboard with dual Intel Xeon 3.06 GHZ CPUs and 1 GB memory, running
a 2.6.15.4 kernel.  The GigE NICs are Intel PRO/1000 82546EB_QUAD_COPPER,
on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000
driver, and running with 9000-byte jumbo frames.  The TCP congestion
control is BIC.

Unidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
tx:  1186.5649 MB /  10.05 sec =  990.2741 Mbps 11 %TX 9 %RX 0 retrans

and:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Irx -r -w2m 192.168.6.79
rx:  1186.8281 MB /  10.05 sec =  990.5634 Mbps 14 %TX 9 %RX 0 retrans

Each direction gets full GigE line rate.

Bidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx:   898.9934 MB /  10.05 sec =  750.1634 Mbps 10 %TX 8 %RX 0 retrans
rx:  1167.3750 MB /  10.06 sec =  973.8617 Mbps 14 %TX 11 %RX 0 retrans

While one direction gets close to line rate, the other only got 750 Mbps.
Note there were no TCP retransmitted segments for either data stream, so
that doesn't appear to be the cause of the slower transfer rate in one
direction.

If the receive direction uses a different GigE NIC that's part of the
same quad-GigE, all is fine:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.5.79
tx:  1186.5051 MB /  10.05 sec =  990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx:  1186.7656 MB /  10.05 sec =  990.5204 Mbps 15 %TX 14 %RX 0 retrans

Here's a test using the same GigE NIC for both directions with 1-second
interval reports:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -i1 -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -i1 -w2m 192.168.6.79
tx:    92.3750 MB /   1.01 sec =  767.2277 Mbps 0 retrans
rx:   104.5625 MB /   1.01 sec =  872.4757 Mbps 0 retrans
tx:    83.3125 MB /   1.00 sec =  700.1845 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5541 Mbps 0 retrans
tx:    83.8125 MB /   1.00 sec =  703.0322 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5502 Mbps 0 retrans
tx:    83.0000 MB /   1.00 sec =  696.1779 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5522 Mbps 0 retrans
tx:    83.7500 MB /   1.00 sec =  702.4989 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5512 Mbps 0 retrans
tx:    83.1250 MB /   1.00 sec =  697.2270 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5512 Mbps 0 retrans
tx:    84.1875 MB /   1.00 sec =  706.1665 Mbps 0 retrans
rx:   117.5625 MB /   1.00 sec =  985.5510 Mbps 0 retrans
tx:    83.0625 MB /   1.00 sec =  696.7167 Mbps 0 retrans
rx:   117.6875 MB /   1.00 sec =  987.5543 Mbps 0 retrans
tx:    84.1875 MB /   1.00 sec =  706.1545 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5472 Mbps 0 retrans
rx:   117.6875 MB /   1.00 sec =  987.0724 Mbps 0 retrans
tx:    83.3125 MB /   1.00 sec =  698.8137 Mbps 0 retrans

tx:   844.9375 MB /  10.07 sec =  703.7699 Mbps 11 %TX 6 %RX 0 retrans
rx:  1167.4414 MB /  10.05 sec =  973.9980 Mbps 14 %TX 11 %RX 0 retrans

In this test case, the receiver ramped up to nearly full GigE line rate,
while the transmitter was stuck at about 700 Mbps.  I ran one longer
60-second test and didn't see the oscillating behavior between receiver
and transmitter, but maybe that's because I have the GigE NIC interrupts
and nuttcp client/server applications both locked to CPU 0.

So in my tests, once one direction gets the upper hand, it seems to
stay that way.  Could this be because the slower side 

Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bill Fink
Hi Bruce,

On Thu, 31 Jan 2008, Bruce Allen wrote:

  I see similar results on my test systems
 
 Thanks for this report and for confirming our observations.  Could you 
 please confirm that a single-port bidirectional UDP link runs at wire 
 speed?  This helps to localize the problem to the TCP stack or interaction 
 of the TCP stack with the e1000 driver and hardware.

Yes, a single-port bidirectional UDP test gets full GigE line rate
in both directions with no packet loss.

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -u -Ru -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -u -Ru -w2m 192.168.6.79
tx:  1187.0078 MB /  10.04 sec =  992.0550 Mbps 19 %TX 7 %RX 0 / 151937 drop/pkt 0.00 %loss
rx:  1187.1016 MB /  10.03 sec =  992.3408 Mbps 19 %TX 7 %RX 0 / 151949 drop/pkt 0.00 %loss

-Bill


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bill Fink
On Thu, 31 Jan 2008, Bruce Allen wrote:

  Based on the discussion in this thread, I am inclined to believe that
  lack of PCI-e bus bandwidth is NOT the issue.  The theory is that the
  extra packet handling associated with TCP acknowledgements is pushing
  the PCI-e x1 bus past its limits.  However the evidence seems to show
  otherwise:
 
  (1) Bill Fink has reported the same problem on a NIC with a 133 MHz
  64-bit PCI connection.  That connection can transfer data at 8Gb/s.
 
  That was even a PCI-X connection, which is known to have extremely good
  latency numbers, IIRC better than PCI-e (?), which could account for a lot of the
  latency-induced lower performance...
  
  also, 82573's are _not_ a server part and were not designed for this
  usage. 82546's are, and that really does make a difference.
 
 I'm confused.  It DOESN'T make a difference! Using 'server grade' 82546's 
 on a PCI-X bus, Bill Fink reports the SAME loss of throughput with TCP 
 full duplex that we see on a 'consumer grade' 82573 attached to a PCI-e x1 
 bus.
 
 Just like us, when Bill goes from TCP to UDP, he gets wire speed back.

Good.  I thought it was just me who was confused by Auke's reply.  :-)

Yes, I get the same type of reduced TCP performance behavior on a
bidirectional test that Bruce has seen, even though I'm using the
better 82546 GigE NIC on a faster 64-bit/133-MHz PCI-X bus.  I also
don't think bus bandwidth is an issue, but I am curious if there
are any known papers on typical PCI-X/PCI-E bus overhead on network
transfers, either bulk data transfers with large packets or more
transaction or video based applications using smaller packets.

I started musing if once one side's transmitter got the upper hand,
it might somehow defer the processing of received packets, causing
the resultant ACKs to be delayed and thus further slowing down the
other end's transmitter.  I began to wonder if the txqueuelen could
have an effect on the TCP performance behavior.  I normally have
the txqueuelen set to 10000 for 10-GigE testing, so decided to run
a test with txqueuelen set to 200 (actually settled on this value
through some experimentation).  Here is a typical result:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx:  1120.6345 MB /  10.07 sec =  933.4042 Mbps 12 %TX 9 %RX 0 retrans
rx:  1104.3081 MB /  10.09 sec =  917.7365 Mbps 12 %TX 11 %RX 0 retrans

This is significantly better, but there was more variability in the
results.  The above was with TSO enabled.  I also then ran a test
with TSO disabled, with the following typical result:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx:  1119.4749 MB /  10.05 sec =  934.2922 Mbps 13 %TX 9 %RX 0 retrans
rx:  1131.7334 MB /  10.05 sec =  944.8437 Mbps 15 %TX 12 %RX 0 retrans

This was a little better yet and getting closer to expected results.

Jesse Brandeburg mentioned in another post that there were known
performance issues with the version of the e1000 driver I'm using.
I recognized that the kernel/driver versions I was using were rather
old, but it was what I had available to do a quick test with.  Those
particular systems are in a remote location so I have to be careful
with messing with their network drivers.  I do have some other test
systems at work that I might be able to try with newer kernels
and/or drivers or maybe even with other vendor's GigE NICs, but
I won't be back to work until early next week sometime.

-Bill


Re: SO_RCVBUF doesn't change receiver advertised window

2008-01-16 Thread Bill Fink
On Tue, 15 Jan 2008, Ritesh Kumar wrote:

 Hi,
 I am using linux 2.6.20 and am trying to limit the receiver window
 size for a TCP connection. However, it seems that auto tuning is not
 turning itself off even after I use the syscall
 
 rwin=65536
 setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, sizeof(rwin));
 
 and verify using
 
 getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rwin, &rwin_size);
 
 that RCVBUF indeed is getting set (the value returned from getsockopt
 is double that, 131072).

Linux doubles what you requested, and then uses (by default) 1/4
of the socket space for overhead, so you effectively get 1.5 times
what you requested as an actual advertised receiver window, which
means since you specified 64 KB, you actually get 96 KB.
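
Roughly, the bookkeeping is:

    requested:   setsockopt(SO_RCVBUF) = 65536
    sk_rcvbuf:   2 * 65536             = 131072  (what getsockopt reports)
    overhead:    131072 / 4            =  32768
    max window:  131072 - 32768        =  98304  = 96 KB = 1.5 * requested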

 The above calls are made before connect() on the client side and
 before bind(), accept() on the server side. Bulk data is being sent
 from the client to the server. The client and the server machines also
 have tcp_moderate_rcvbuf set to 0 (though I don't think that's really
 needed; setting a value via SO_RCVBUF should automatically turn off auto
 tuning).
 
 However the tcp trace shows the SYN, SYN/ACK and the first few packets as:
 14:34:18.831703 IP 192.168.1.153.45038 > 192.168.2.204.: S
 3947298186:3947298186(0) win 5840 <mss 1460,sackOK,timestamp 2842625
 0,nop,wscale 5>
 14:34:18.836000 IP 192.168.2.204. > 192.168.1.153.45038: S
 3955381015:3955381015(0) ack 3947298187 win 5792 <mss
 1460,sackOK,timestamp 2843649 2842625,nop,wscale 2>
 14:34:18.837654 IP 192.168.1.153.45038 > 192.168.2.204.: . ack 1
 win 183 <nop,nop,timestamp 2842634 2843649>
 14:34:18.837849 IP 192.168.1.153.45038 > 192.168.2.204.: .
 1:1449(1448) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
 14:34:18.837851 IP 192.168.1.153.45038 > 192.168.2.204.: P
 1449:1461(12) ack 1 win 183 <nop,nop,timestamp 2842634 2843649>
 14:34:18.839001 IP 192.168.2.204. > 192.168.1.153.45038: . ack
 1449 win 2172 <nop,nop,timestamp 2843652 2842634>
 14:34:18.839011 IP 192.168.2.204. > 192.168.1.153.45038: . ack
 1461 win 2172 <nop,nop,timestamp 2843652 2842634>
 14:34:18.840875 IP 192.168.1.153.45038 > 192.168.2.204.: .
 1461:2909(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
 14:34:18.840997 IP 192.168.1.153.45038 > 192.168.2.204.: .
 2909:4357(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
 14:34:18.841120 IP 192.168.1.153.45038 > 192.168.2.204.: .
 4357:5805(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
 14:34:18.841244 IP 192.168.1.153.45038 > 192.168.2.204.: .
 5805:7253(1448) ack 1 win 183 <nop,nop,timestamp 2842637 2843652>
 14:34:18.841388 IP 192.168.2.204. > 192.168.1.153.45038: . ack
 2909 win 2896 <nop,nop,timestamp 2843655 2842637>
 14:34:18.841399 IP 192.168.2.204. > 192.168.1.153.45038: . ack
 4357 win 3620 <nop,nop,timestamp 2843655 2842637>
 14:34:18.841413 IP 192.168.2.204. > 192.168.1.153.45038: . ack
 5805 win 4344 <nop,nop,timestamp 2843655 2842637>
 
 As you can see, the syn and syn ack show rcv windows to be 5840 and
 5792, and the receiver's advertised window automatically increases from
 2172 up to 4344, and later in the trace up to 24214.

Since the window scale was 2, the final advertised receiver window
you indicate of 24214 gives 2^2 * 24214 = 96856 bytes, or right around
96 KB, which is what is expected given the way Linux works.

-Bill



 The values for the tcp sysctl variables are given below:
 /proc/sys/net/ipv4/tcp_moderate_rcvbuf  0
 /proc/sys/net/ipv4/tcp_mem              32768   43690   65536
 /proc/sys/net/ipv4/tcp_rmem             4096    87380   1398080
 /proc/sys/net/ipv4/tcp_wmem             4096    16384   1398080
 /proc/sys/net/core/rmem_max             131071
 /proc/sys/net/core/wmem_max             131071
 /proc/sys/net/core/wmem_default         109568
 /proc/sys/net/core/rmem_default         109568
 
 I will really appreciate your help,
 
 Ritesh


Re: TSO trimming question

2007-12-21 Thread Bill Fink
On Thu, 20 Dec 2007, David Miller wrote:

 From: Ilpo Järvinen [EMAIL PROTECTED]
 Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET)
 
  [PATCH] [TCP]: Fix TSO deferring
  
  I'd say that most of what tcp_tso_should_defer had in between
  there was dead code because of this.
  
  Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
 
 Yikes!
 
 John, we've been living a lie for more than a year. :-/
 
 On the bright side this explains a lot of small TSO frames I've been
 seeing in traces over the past year but never got a chance to
 investigate.
 
  diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
  index 8dafda9..693b9f6 100644
  --- a/net/ipv4/tcp_output.c
  +++ b/net/ipv4/tcp_output.c
  @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
  		goto send_now;
   
  	/* Defer for less than two clock ticks. */
  -	if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1)
  +	if (tp->tso_deferred &&
  +	    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)
  		goto send_now;
   
  in_flight = tcp_packets_in_flight(tp);

I meant to ask about this a while back but then got distracted by
other things.  But now since the subject has come up, I had a couple
of more questions about this code.

What's with all the shifting back and forth?  Here with:

((jiffies<<1)>>1) - (tp->tso_deferred>>1)

and later with:

/* Ok, it looks like it is advisable to defer.  */
tp->tso_deferred = 1 | (jiffies<<1);

Is this just done to try and avoid the special case of jiffies==0 
when the jiffies wrap?  If so it seems like a lot of unnecessary
work just to avoid a 1 in 4 billion event, since it's my understanding
that the whole tcp_tso_should_defer function is just an optimization
and not a criticality to the proper functioning of TCP, especially
considering it hasn't even been executing at all up to now.

My second question is more basic and if I'm not mistaken actually
relates to a remaining bug in the (corrected) test:

/* Defer for less than two clock ticks. */
if (tp->tso_deferred &&
    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)

Since jiffies is an unsigned long, which is 64-bits on a 64-bit system,
whereas tp->tso_deferred is a u32, once jiffies exceeds 31 bits, which
will happen in about 25 days if HZ=1000, won't the second part of the
test always be true after that?  Or am I missing something obvious?
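
To make that concrete, here's a minimal userspace sketch of the concern
(assuming a 64-bit unsigned long, and that tso_deferred was set in the
very same jiffy, so the delta should logically be 0):

	#include <stdio.h>

	int main(void)
	{
		unsigned long jiffies = (1UL << 31) + 5;	/* ~25 days at HZ=1000 */
		unsigned int tso_deferred = 1 | (jiffies << 1);	/* truncated to 32 bits */

		/* the (corrected) deferral test from above */
		if (tso_deferred &&
		    ((jiffies << 1) >> 1) - (tso_deferred >> 1) > 1)
			printf("would always send now: delta = %lu\n",
			       ((jiffies << 1) >> 1) - (tso_deferred >> 1));
		return 0;
	}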

-Bill


Re: TSO trimming question

2007-12-21 Thread Bill Fink
On Fri, 21 Dec 2007, David Miller wrote:

 From: Herbert Xu [EMAIL PROTECTED]
 Date: Fri, 21 Dec 2007 17:29:27 +0800
 
  On Fri, Dec 21, 2007 at 01:27:20AM -0800, David Miller wrote:
  
   It's two shifts, and this gets scheduled along with the other
   instructions on many cpus so it's effectively free.
   
   I don't see why this is even worth mentioning and discussing.
  
  I totally agree.  Two shifts are way better than a branch.
 
 We take probably a thousand+ 100+ cycle cache misses in the TCP stack
 on big window opening ACKs.
 
 Instead of discussing ways to solve that huge performance killer we're
 wanking about two friggin' integer shifts.
 
 It's hilarious isn't it? :-)

I don't think obfuscated code is hilarious.  Instead of the convoluted
and dense code:

/* Defer for less than two clock ticks. */
if (tp->tso_deferred &&
    ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1)

You can have the much simpler and more easily understandable:

/* Defer for less than two clock ticks. */
if (tp->tso_deferred && (jiffies - tp->tso_deferred) > 1)

And instead of:

/* Ok, it looks like it is advisable to defer.  */
tp->tso_deferred = 1 | (jiffies<<1);

return 1;

You could do as Ilpo suggested:

/* Ok, it looks like it is advisable to defer.  */
tp->tso_deferred = max_t(u32, jiffies, 1);

return 1;

Or perhaps more efficiently:

/* Ok, it looks like it is advisable to defer.  */
tp->tso_deferred = jiffies;
if (unlikely(jiffies == 0))
	tp->tso_deferred = 1;

return 1;

Or perhaps even:

/* Ok, it looks like it is advisable to defer.  */
tp-tso_deferred = jiffies;

/* need to return a non-zero value to defer, which means won't
 * defer if jiffies == 0 but it's only a 1 in 4 billion event
 * (and avoids a compare/branch by not checking jiffies)
 */
return jiffies;

Since it really only needs a non-zero return value to defer.

See, no branches needed and much clearer code.  That seems worthwhile
to me from a code maintenance standpoint, even if it isn't any speed
improvement.

And what about the 64-bit jiffies versus 32-bit tp->tso_deferred issue?
Should tso_deferred be made unsigned long to match jiffies?

-Bill


Re: [PATCH] [IPROUTE]: A workaround to make larger rto_min printed correctly

2007-12-21 Thread Bill Fink
On Fri, 21 Dec 2007, YOSHIFUJI Hideaki wrote:

 In article [EMAIL PROTECTED] (at Fri, 21 Dec 2007 11:24:54 +0900), Satoru 
 SATOH [EMAIL PROTECTED] says:
 
  2007/12/21, Jarek Poplawski [EMAIL PROTECTED]:
   Jarek Poplawski wrote, On 12/20/2007 09:24 PM:
   ...
  
but since it's your patch, I hope you do some additional checking
if it's always like this...
  
  
   ...or maybe only changing this all a little bit will make it look safer!
  
   Jarek P.
  
  
  OK, how about this?
  
  Signed-off-by: Satoru SATOH [EMAIL PROTECTED]
  
 ip/iproute.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)
  
  diff --git a/ip/iproute.c b/ip/iproute.c
  index f4200ae..c771b34 100644
  --- a/ip/iproute.c
  +++ b/ip/iproute.c
  @@ -510,16 +510,20 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
  			fprintf(fp, " %u", *(unsigned*)RTA_DATA(mxrta[i]));
  		else {
  			unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
  +			unsigned hz1 = hz;
  +			if (hz1 < 1000)
 
 Why don't you simply use unsigned long long (or maybe uint64_t) here?

I was wondering that too.  And maybe change the (float) cast
to (double) in the fprintf.
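
For reference, the overflow being worked around is simply:

    UINT_MAX / 1000 = 4294967295 / 1000 ~= 4294967

so the 32-bit 'val *= 1000' wraps for any stored value above roughly
4.29 million; both approaches avoid that, one by rescaling hz and the
other by widening val and deferring the divide.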

-Bill



 Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED]
 
 --- 
 diff --git a/ip/iproute.c b/ip/iproute.c
 index f4200ae..db9a3b6 100644
 --- a/ip/iproute.c
 +++ b/ip/iproute.c
 @@ -509,16 +509,21 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 		    i != RTAX_RTO_MIN)
 			fprintf(fp, " %u", *(unsigned*)RTA_DATA(mxrta[i]));
 		else {
 -			unsigned val = *(unsigned*)RTA_DATA(mxrta[i]);
 +			unsigned long long val = *(unsigned*)RTA_DATA(mxrta[i]);
 +			unsigned div = 1;
  
 -			val *= 1000;
 			if (i == RTAX_RTT)
 -				val /= 8;
 +				div = 8;
 			else if (i == RTAX_RTTVAR)
 -				val /= 4;
 -			if (val >= hz)
 -				fprintf(fp, " %ums", val/hz);
 +				div = 4;
 			else
 +				div = 1;
 +
 +			val = val * 1000ULL / div;
 +
 +			if (val >= hz) {
 +				fprintf(fp, " %llums", val/hz);
 +			} else
 				fprintf(fp, " %.2fms", (float)val/hz);
 		}
 	}


Re: TSO trimming question

2007-12-21 Thread Bill Fink
On Fri, 21 Dec 2007, Bill Fink wrote:

 Or perhaps even:
 
   /* Ok, it looks like it is advisable to defer.  */
   tp->tso_deferred = jiffies;
 
   /* need to return a non-zero value to defer, which means won't
    * defer if jiffies == 0 but it's only a 1 in 4 billion event
    * (and avoids a compare/branch by not checking jiffies)
    */
   return jiffies;

Ack.  I introduced my own 64-bit to 32-bit issue (too late at night).
How about:

/* Ok, it looks like it is advisable to defer.  */
tp->tso_deferred = jiffies;

/* this won't defer if jiffies == 0 but it's only a 1 in
 * 4 billion event (and avoids a branch)
 */
return (jiffies != 0);

-Bill


Re: TSO trimming question

2007-12-21 Thread Bill Fink
On Fri, 21 Dec 2007, Ilpo Järvinen wrote:

 On Fri, 21 Dec 2007, Bill Fink wrote:
 
  On Fri, 21 Dec 2007, Bill Fink wrote:
  
   Or perhaps even:
   
  /* Ok, it looks like it is advisable to defer.  */
  tp->tso_deferred = jiffies;
    
  /* need to return a non-zero value to defer, which means won't
   * defer if jiffies == 0 but it's only a 1 in 4 billion event
   * (and avoids a compare/branch by not checking jiffies)
   */
  return jiffies;
  
  Ack.  I introduced my own 64-bit to 32-bit issue (too late at night).
  How about:
  
  /* Ok, it looks like it is advisable to defer.  */
 tp->tso_deferred = jiffies;
  
  /* this won't defer if jiffies == 0 but it's only a 1 in
   * 4 billion event (and avoids a branch)
   */
  return (jiffies != 0);
 
 I'm not sure how the jiffies work but is this racy as well?
 
 Simple return tp->tso_deferred; should work, shouldn't it? :-)

As long as tp->tso_deferred remains u32, pending the other issue.

-Bill


Re: [patch 01/10] e1000e: make E1000E default to the same kconfig setting as E1000

2007-12-14 Thread Bill Fink
On Fri, 14 Dec 2007, Andrew Morton wrote:

 On Fri, 14 Dec 2007 15:39:26 -0500
 Jeff Garzik [EMAIL PROTECTED] wrote:
 
  [EMAIL PROTECTED] wrote:
   From: Randy Dunlap [EMAIL PROTECTED]
   
    Make E1000E default to the same kconfig setting as E1000.  So people's
    machines don't stop working when they use oldconfig.
   
  I am not inclined to apply this one.  This practice, applied over time, 
  will tend to accumulate weird 'default' and 'select' statements.
  
  So I think the breakage that occurs is mitigated by two factors:
  1) kernel hackers that do their own configs are expected to be able to 
   figure this stuff out.
  2) kernel builders (read: distros, mainly) are expected to have put 
  thought into the Kconfig selection and driver migration strategies.
  
   PCI IDs move across drivers from time to time, and we don't want to apply these 
   sorts of changes:  Viewed in the long term, the suggested patch is merely a 
   temporary change to allow kernel experts to more easily deal with the 
   PCI ID migration across drivers.
  
   I would prefer simply to communicate to kernel experts and builders 
   about a Kconfig issue that could potentially break their booting/networking, 
   because this patch is only needed if the kernel experts do not already 
   know about a necessary config update.
 
 You can take it out again later on - most people's .configs will then have
 E1000E set.   People who still do `cp ancientconfig .config ; make oldconfig'
 remain screwed.

I was thinking the same thing.  Leave it in for 2 or 3 major versions
and then remove it (something analogous to the timeframe for a feature
removal).

And during the interim period, add something like the following
to the Kconfig help text:

Note some hardware that was previously supported by the
e1000 driver is now only handled by the e1000e driver.
If unsure and you previously used the e1000 driver,
say Y or M here.

 I dunno.  I guess I'm not into causing people pain in an attempt to train
 them to do what we want.  This is a popular driver and a *lot* of people
 are going to:
 
 - build new kernel
 
 - install new kernel
 
 - find it doesn't work, go through quite large amounts of hassle trying
   to work out why it stopped working.  Eventually work out that e1000
   stopped working.  Eventually work out that it stopped working because we
   forcibly switched them to a new driver which they didn't know about.
 
 - reconfigure kernel
 
 - rebuild, reinstall

Having been there, done that, it's definitely a pain.  It's especially
painful when you're doing it remotely, and since the network no longer
works, you can't get into the system anymore.

 Multiply that by 100s of people (at least).  All because Jeff wouldn't
 apply a one-liner?

-Bill


Re: [PATCH net-2.6.25] qdisc: new rate limiter

2007-12-08 Thread Bill Fink
On Sat, 08 Dec 2007, Patrick McHardy wrote:

 Patrick McHardy wrote:
  Stephen Hemminger wrote:
 
  +struct tc_rlim_qopt
  +{
  +	__u32	limit;	/* fifo limit (packets) */
  +	__u32	rate;	/* bits per sec */

 
  This seems a bit small, 512mbit is the maximum rate.
 
 It's 4gbit of course, so I guess it's enough :)

Actually, since we already have 10-Gbps nets, with 40-Gbps and
100-Gbps nets upcoming, it maybe should be __u64.
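
(For the arithmetic: 2^32 bits/sec is about 4.29 Gbps, so a __u32 rate
field can't even describe a single 10-GigE link, let alone 40G or 100G.)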

-Bill


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Bill Fink
On Tue, 20 Nov 2007, Andrew Gallatin wrote:

 David Miller wrote:
   From: Andrew Gallatin [EMAIL PROTECTED]
   Date: Tue, 20 Nov 2007 06:47:57 -0500
  
   David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 14:09:18 +0800

 David Miller [EMAIL PROTECTED] wrote:
 Fundamentally, I really don't like this change, it batches to the
 point where it begins to erode the natural ACK clocking of TCP, 
 and I
 therefore am very likely to revert it before merging to Linus.

I have mixed feelings about this topic.  In general I agree with the
importance of maintaining the natural ACK clocking of TCP for normal
usage.  But there may also be some special cases that could benefit
significantly from such a new LRO pure ACK aggregation feature.  The
rest of my comments are in support of such a new feature, although
I haven't completely made up my own mind yet about the tradeoffs
involved in implementing such a new capability (good arguments are
being made on both sides).

 Perhaps make it a tunable that defaults to off?

 That's one idea.
  
   I'd certainly prefer the option to have a tunable to having our
   customers see performance regressions when they switch to
   the kernel's LRO.
  
   Please qualify this because by itself it's an inaccurate statement.
  
    It would cause a performance regression in situations where there is
    nearly no packet loss, no packet reordering, and the receiver has
    strong enough cpu power.

You are basically describing the HPC universe, which while not the
multitudes of the general Internet, is a very real and valid special
community of interest where maximum performance is critical.

For example, we're starting to see dynamic provisioning of dedicated
10-GigE lambda paths to meet various HPC requirements, just for the
purpose of insuring nearly no packet loss, no packet reordering.
See for example Internet2's Dynamic Circuit Network (DCN).

In the general Internet case, many smaller flows tend to be aggregated
together up to perhaps a 10-GigE interface, while in the HPC universe,
there tend to be fewer, but much higher individual bandwidth flows.
But both are totally valid usage scenarios.  So a tunable that defaults
to off for the general case makes sense to me.

 Yes, a regression of nearly 1Gb/s in some cases as I mentioned
 when I submitted the patch.

Which is a significant performance penalty.  But the CPU savings may
be an even more important benefit.

 
 
    Show me something over real backbones, talking to hundreds or thousands
    of clients scattered all over the world.  That's what people will be
    using these high end NICs for, front facing services, and that's where
    loss happens and stretch ACKs hurt performance.

The HPC universe uses real backbones, just not the general Internet
backbones.  Their backbones are engineered to have the characteristics
required for enabling very high performance applications.

And if performance would take a hit in the general Internet 10-GigE
server case, and that's clearly documented and understood, I don't
see what incentive the distros would have to enable the tunable for
their normal users, since why would they want to cause poorer
performance relative to other distros that stuck with the recommended
default.  The special HPC users could easily enable the option if it
was desired and proven beneficial in their environment.

 I can't.  I think most 10GbE on endstations is used either in the
 server room, or on dedicated links.  My experience with 10GbE users is
 limited to my interactions with people using our NICs who contact our
 support.  Of those, I can recall only a tiny handful who were using
 10GbE on a normal internet facing connection (and the ones I dealt
 with were actually running a different OS).  The vast majority were in
 a well controlled, lossless environment.  It is quite ironic.  The
 very fact that I cannot provide you with examples of internet facing
 people using LRO (w/ack aggr) in more normal applications tends to
 support my point that most 10GbE users seem to be in lossless
 environments.

Most use of 10-GigE that I'm familiar with is related to the HPC
universe, but then that's the environment I work in.  I'm sure that
over time the use of 10-GigE in general Internet facing servers
will predominate, since that's where the great mass of users is.
But I would argue that that doesn't make it the sole usage arena
that matters.

   ACK stretching is bad bad bad for everything outside of some well
   controlled test network bubble.

It's not just for network bubbles.  That's where the technology tends
to first be shaken out, but the real goal is use in real-world,
production HPC environments.

 I just want those in the bubble to continue have the best performance
 possible in their situation.  If it is a tunable the defaults to off,
 that is great.

I totally agree, and think that the tunable (defaulting to off),
allows both the general 

Re: [PATCH 1/2] [IPV4] UDP: Always checksum even if without socket filter

2007-11-19 Thread Bill Fink
On Mon, 19 Nov 2007, David Miller wrote:

 From: Andi Kleen [EMAIL PROTECTED]
 Date: Mon, 19 Nov 2007 16:29:33 +0100
 
 
 All of our options suck, we just have to choose the least sucking one
 and right now to me that's decrementing the counter as much as I
 empathize with the SNMP application overflow detection issue.

If the SNMP monitor detects a false overflow, the error it reports 
will be much worse than a single missing packet. So you would replace 
one error with a worse error.
   
   This can be fixed, the above cannot.
  
  I don't see how, short of breaking the interface
  (e.g. reporting 64bit or separate overflow counts)
 
 As someone who just spent an entire weekend working on
 cpu performance counter code, I know it's possible.
 
 When you overflow, the new value is a lot less than
 the last sampled one.  When the value backtracks like
 we're discussing it could here, it only decreases
 a very little bit.

While I agree with your analysis that it could be worked around,
who knows how all the various SNMP monitoring applications out there
would interpret such an unusual event.  I liked Stephen's suggestion
of a deferred decrement that would insure the counter didn't ever
run backwards.  But the best approach seems to be just not to count
it in the first place until the application has actually received
the packet, since as Herbert pointed out, that's what the RFC
actually specifies for the meaning of the udpInDatagrams counter.

-Bill


Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)

2007-11-19 Thread Bill Fink
On Mon, 19 Nov 2007, Alexey Kuznetsov wrote:

 Hello!
 
  Is there a reason that the target hardware address isn't the target
  hardware address?
 
 It is bound only to the fact that linux uses protocol address
 of the machine, which responds. It would be highly confusing
 (more than confusing :-)), if we used our protocol address and hardware
 address of requestor.
 
 But if you use zero protocol address as source, you really can use
 any hw address.
 
  The dhcp clients I examined, and the implementation of the arpcheck
  that I use will compare the target hardware field of the arp-reply and
  match it against its own mac, to verify the reply. And this fails with
  the current implementation in the kernel.
 
 1. Do not do this. Mainly, because you already know that this does not work
   with linux. :-) Logically, target hw address in arp reply is just
   a nonsensical redundancy, it should not be checked and even looked at.

Repeating what I posted earlier from the ARP RFC 826:

The target hardware address is included for completeness and
network monitoring.  It has no meaning in the request form,
since it is this number that the machine is requesting.  Its
meaning in the reply form is the address of the machine making
the request.  In some implementations (which do not get to look
at the 14.byte ethernet header, for example) this may save some
register shuffling or stack space by sending this field to the
hardware driver as the hardware destination address of the
packet.

Unless there is some other RFC that supersedes this, which doesn't appear
to be the case since it's also STD37, it appears to me that the current
Linux behavior is wrong.  It clearly states that for the ARP reply, the
target hardware address is the address of the machine making the request,
and not the address of the machine making the reply as Linux is apparently
doing.

 2. As for your suggestion, I thought about this and I am going to agree.
 
Arguments, which convinced me are:
 
- arping still works.
- any piece of reasonable software should work.
- if Windows understands DaD (is it really true? I cannot believe it)
  and it is unhappy about our response and does not block use
  of the duplicate address only due to this, we _must_ accommodate ASAP.
- if we do, we have to use 0 protocol address, no choice.

I agree the target protocol address should be 0 in this case.

-Bill


Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0

2007-11-16 Thread Bill Fink
On Fri, 16 Nov 2007, David Miller wrote:

 From: Jonas Danielsson [EMAIL PROTECTED]
 Date: Fri, 16 Nov 2007 09:30:11 +0100
 
  2007/11/16, David Miller [EMAIL PROTECTED]:
   From: Jonas Danielsson [EMAIL PROTECTED]
   Date: Thu, 15 Nov 2007 22:40:13 +0100
  
Is there a reason that the target hardware address isn't the target
hardware address?
  
   Because of this, in cases where a choice can be made Linux will
   advertise what is most likely to result in successful communication.
  
   This is likely why we are changing that target address to the one of
   the interface actually sending back the reply rather than the zero
   value you used.
  
   In fact I think this information can be useful to the sender of
   the DAD request.
  
  
  There seems to be some confusion about what my patch really does. It
  does not set the hardware address to a zero value.
 
 I knew you were talking about the IP address not the hardware
 address.
 
  The reply from the Linux kernel in computer A, before the patch would look 
  like:
  
  Reply:
  Opcode: reply (0x0002)
   Sender HW: 00:AA:00:AA:00:AA
  Sender IP:   192.168.0.1
  Target HW:  00:AA:00:AA:00:AA
  Target IP:192.168.0.1
 
 And this is exactly a sensible response in my opinion.

I don't see how you can say that, since it appears to be in violation
of RFC 826:

The target hardware address is included for completeness and
network monitoring.  It has no meaning in the request form,
since it is this number that the machine is requesting.  Its
meaning in the reply form is the address of the machine making
the request.  In some implementations (which do not get to look
at the 14.byte ethernet header, for example) this may save some
register shuffling or stack space by sending this field to the
hardware driver as the hardware destination address of the
packet.

Since the MAC address of the machine making the request is
00:BB:00:BB:00:BB, and not 00:AA:00:AA:00:AA, Linux appears to
be in violation of the ARP RFC.

Regarding the Target IP, RFC 826 says:

The target protocol address is necessary in the request form
of the packet so that a machine can determine whether or not
to enter the sender information in a table or to send a reply.
It is not necessarily needed in the reply form if one assumes
a reply is only provoked by a request.  It is included for
completeness, network monitoring, and to simplify the suggested
processing algorithm described above (which does not look at
the opcode until AFTER putting the sender information in a
table).

So it's ambiguous about the target IP address in an ARP reply packet,
but a value of 0.0.0.0 makes more logical sense to me than using
192.168.0.1 in this example case, since it should reflect the requestor
IP address, which is unknown in this case.
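
Putting the two points together, my reading of RFC 826 is that the
reply in this example should instead look like:

Opcode: reply (0x0002)
Sender HW: 00:AA:00:AA:00:AA
Sender IP:   192.168.0.1
Target HW:  00:BB:00:BB:00:BB   (the MAC of the machine making the request)
Target IP:    0.0.0.0           (the requestor's IP, unknown during DAD)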

-Bill


Re: [PATCH 5/5] introduce udp_rmem and udp_wmem

2007-10-29 Thread Bill Fink
On Mon, 29 Oct 2007, Hideo AOKI wrote:

 This patch added /proc/sys/net/udp_rmem and /proc/sys/net/udp_rmem.
 Each UDP packet is dropped when the number of pages for socket buffer
 is beyond the limit and the socket already consumes the minimum buffer.

I think you meant /proc/sys/net/ipv4/udp_{r,w}mem above.

Patch not in-lined, making replying more difficult.

Cutting and pasting:

 diff -pruN 
 linux-2.6.24-rc1-mem003-ipv4-dev-p4/Documentation/networking/ip-sysctl.txt 
 linux-2.6.24-rc1-mem003-ipv4-dev-p5/Documentation/networking/ip-sysctl.txt
 --- 
 linux-2.6.24-rc1-mem003-ipv4-dev-p4/Documentation/networking/ip-sysctl.txt
 2007-10-26 20:35:52.0 -0400
 +++ 
 linux-2.6.24-rc1-mem003-ipv4-dev-p5/Documentation/networking/ip-sysctl.txt
 2007-10-29 09:44:05.0 -0400
 @@ -452,6 +452,18 @@ udp_mem - INTEGER
   Number of pages allowed for queueing by all UDP sockets.
   Default is calculated at boot time from amount of available memory.
  
 +udp_rmem - INTEGER
 + Minimal size of receive buffer used by UDP sockets. Each UDP socket
 + is able to use the size for receiving data, even if total pages of UDP
 + sockets exceed udp_mem. The unit is byte.
 + Default: 4096
 +
 +udp_wmem - INTEGER
 + Minimal size of send buffer used by UDP sockets. Each UDP socket is
 + able to use the size for sending data, even if total pages of UDP
 + sockets exceed udp_mem. The unit is byte.
 + Default: 4096
 +
  CIPSOv4 Variables:
  
  cipso_cache_enable - BOOLEAN

I think either the above should be renamed to udp_{r,w}mem_min, or
they should be changed to a 3-tuple like tcp_{r,w}mem, and the code
refactored accordingly (but then what to do about
/proc/sys/net/core/{r,w}mem_max).

-Bill


Re: Throughput Bug?

2007-10-18 Thread Bill Fink
On Thu, 18 Oct 2007, Matthew Faulkner wrote:

 Hey all
 
 I'm using netperf to perform TCP throughput tests via the localhost
 interface. This is being done on a SMP machine. I'm forcing the
 netperf server and client to run on the same core. However, for any
 message size of 523 bytes or below, the throughput is much lower compared
 to the throughput when the message size is 524 bytes or greater.
 
 Recv   Send    Send                          Utilization       Service Demand
 Socket Socket  Message  Elapsed               Send     Recv     Send    Recv
 Size   Size    Size     Time     Throughput   local    remote   local   remote
 bytes  bytes   bytes    secs.    MBytes  /s   % S      % S      us/KB   us/KB
  65536  65536    523    30.01      81.49      50.00    50.00    11.984  11.984
  65536  65536    524    30.01     460.61      49.99    49.99    2.120   2.120
 
 The chances are i'm being stupid and there is an obvious reason for
 this, but when i put  the server and client on different cores i don't
 see this effect.
 
 Any help explaining this will be greatly appreciated.
 
 Machine details:
 
 Linux 2.6.22-2-amd64 #1 SMP Thu Aug 30 23:43:59 UTC 2007 x86_64 GNU/Linux
 
 sched_affinity is used by netperf internally to set the core affinity.

I don't know if it's relevant, but note that 524 bytes + 52 bytes
of IP(20)/TCP(20)/TimeStamp(12) overhead gives a 576 byte packet,
which is the specified size that all IP routers must handle (and
the smallest value possible during PMTU discovery I believe).  A
message size of 523 bytes would be 1 less than that.  Could this
possibly have to do with ABC (possibly try disabling it if set)?
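
(Spelling out the arithmetic: 524 + 20 (IP) + 20 (TCP) + 12 (timestamp
option) = 576 bytes, while a 523-byte message comes out one byte short
of that boundary.)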

-Bill


Re: Bonding support for eth1394?

2007-10-14 Thread Bill Fink
On Sat, 13 Oct 2007, Stefan Richter wrote:

 Roland Dreier wrote:
  There are a few changes to the bonding driver pending that will add
  support for bonding IP-over-InfiniBand interfaces.  IPoIB also cannot
  change its HW address, so the patches address that issue.
  
  Once those patches land, bonding eth1394 interfaces may just work.
 
 Sounds promising.  I will keep an eye on it.

While that might allow multiple eth1394 interfaces to be bonded,
I believe the user wanted to bond an eth1394 interface with a normal
Ethernet interface, and I don't think that will work even with the
IPoIB bonding changes.  Bonding of different fundamental types
of network interfaces still won't be supported, and I'm pretty sure
eth1394 is not considered a standard Ethernet interface (it has a
different MAC address format, for one thing).

-Bill


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Bill Fink
On Tue, 09 Oct 2007, David Miller wrote:

 From: jamal [EMAIL PROTECTED]
 Date: Tue, 09 Oct 2007 17:56:46 -0400
 
  if the h/ware queues are full because of link pressure etc, you drop. We
  drop today when the s/ware queues are full. The driver txmit lock takes
  place of the qdisc queue lock etc. I am assuming there is still need for
  that locking. The filter/classification scheme still works as is and
  select classes which map to rings. tc still works as is etc.
 
 I understand your suggestion.
 
 We have to keep in mind, however, that the sw queue right now is 1000
 packets.  I heavily discourage any driver author from trying to use any
 single TX queue of that size.  Which means that just dropping on back
 pressure might not work so well.
 
 Or it might be perfect and signal TCP to backoff, who knows! :-)

I can't remember the details anymore, but for 10-GigE, I have encountered
cases where I was able to significantly increase TCP performance by
increasing the txqueuelen to 10000, which is the setting I now use for
any 10-GigE testing.

-Bill


Re: tcp bw in 2.6

2007-10-03 Thread Bill Fink
Tangential aside:

On Tue, 02 Oct 2007, Rick Jones wrote:

 *) depending on the quantity of CPU around, and the type of test one is 
 running, 
 results can be better/worse depending on the CPU to which you bind the 
 application.  Latency tends to be best when running on the same core as takes 
 interrupts from the NIC, bulk transfer can be better when running on a 
 different 
 core, although generally better when a different core on the same chip.  These
 days the throughput stuff is more easily seen on 10G, but the netperf service 
 demand changes are still visible on 1G.

Interesting.  I was going to say that I've generally had the opposite
experience when it comes to bulk data transfers, which is what I would
expect due to CPU caching effects, but that perhaps it's motherboard/NIC/
driver dependent.  But in testing I just did I discovered it's even
MTU dependent (most of my normal testing is always with 9000-byte
jumbo frames).

With Myricom 10-GigE NICs, NIC interrupts on CPU 0 and nuttcp app
running on CPU 1 (both transmit and receive sides), and using 9000-byte
jumbo frames:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
10078.5000 MB /  10.02 sec = 8437.5396 Mbps 100 %TX 99 %RX

With Myricom 10-GigE NICs, and both NIC interrupts and nuttcp app
on CPU 0 (both transmit and receive sides), again using 9000-byte
jumbo frames:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11817.8750 MB /  10.00 sec = 9909.7537 Mbps 100 %TX 74 %RX

Same tests repeated with standard 1500-byte Ethernet MTU:

With Myricom 10-GigE NICs, NIC interrupts on CPU 0 and nuttcp app
running on CPU 1 (both transmit and receive sides), and using
standard 1500-byte Ethernet MTU:

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5685.9375 MB /  10.00 sec = 4768.0951 Mbps 99 %TX 98 %RX

With Myricom 10-GigE NICs, and both NIC interrupts and nuttcp app
on CPU 0 (both transmit and receive sides), again using standard
1500-byte Ethernet MTU:

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4974.0625 MB /  10.03 sec = 4161.6015 Mbps 100 %TX 100 %RX

Now back to your regularly scheduled programming.  :-)

-Bill


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-02 Thread Bill Fink
On Tue, 02 Oct 2007, jamal wrote:

 On Tue, 2007-02-10 at 00:25 -0400, Bill Fink wrote:
 
  One reason I ask, is that on an earlier set of alternative batching
  xmit patches by Krishna Kumar, his performance testing showed a 30 %
  performance hit for TCP for a single process and a size of 4 KB, and
  a performance hit of 5 % for a single process and a size of 16 KB
  (a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
  time to inquire further about it, but it would be a major potential
  concern for me in my 10-GigE network testing with 9000-byte jumbo
  frames.  Of course the single process and 4 KB or larger size was
  the only case that showed a significant performance hit in Krishna
  Kumar's latest reported test results, so it might be acceptable to
  just have a switch to disable the batching feature for that specific
  usage scenario.  So it would be useful to know if your xmit batching
  changes would have similar issues.
 
 There were many times while testing that i noticed inconsistencies and
 in each case when i analysed[1], i found it to be due to some variable
 other than batching which needed some resolving, always via some
 parametrization or other. I suspect what KK posted is in the same class.
 To give you an example, with UDP, batching was giving worse results at
 around 256B compared to 64B or 512B; investigating i found that the
 receiver just wasnt able to keep up and the udp layer dropped a lot of
 packets so both iperf and netperf reported bad numbers. Fixing the
 receiver ended up with consistency coming back. On why 256B was the one
 that overwhelmed the receiver more than 64B(which sent more pps)? On
 some limited investigation, it seemed to me to be the effect of the
 choice of the tg3 driver's default tx mitigation parameters as well as tx
 ring size; which is something i plan to revisit (but neutralizing it
 helps me focus on just batching). In the end i dropped both netperf and
 iperf for similar reasons and wrote my own app. What i am trying to
 achieve is demonstrate if batching is a GoodThing. In experimentation
 like this, it is extremely valuable to reduce the variables. Batching
 may expose other orthogonal issues - those need to be resolved or fixed
 as they are found. I hope that sounds sensible.

It does sound sensible.  My own decidedly non-expert speculation
was that the big 30 % performance hit right at 4 KB may be related
to memory allocation issues or having to split the skb across
multiple 4 KB pages.  And perhaps it only affected the single
process case because with multiple processes lock contention may
be a bigger issue and the xmit batching changes would presumably
help with that.  I am admittedly a novice when it comes to the
detailed internals of TCP/skb processing, although I have been
slowly slogging my way through parts of the TCP kernel code to
try and get a better understanding, so I don't know if these
thoughts have any merit.

BTW does anyone know of a good book they would recommend that has
substantial coverage of the Linux kernel TCP code, that's fairly
up-to-date and gives both an overall view of the code and packet
flow as well as details on individual functions and algorithms,
and hopefully covers basic issues like locking and synchronization,
concurrency of different parts of the stack, and memory allocation.
I have several books already on Linux kernel and networking internals,
but they seem to only cover the IP (and perhaps UDP) portions of the
network stack, and none have more than a cursory reference to TCP.  
The most useful documentation on the Linux TCP stack that I have
found thus far is some of Dave Miller's excellent web pages and
a few other web references, but overall it seems fairly skimpy
for such an important part of the Linux network code.

 Back to the >=9K packet size you raise above:
 I dont have a 10Gige card so iam theorizing. Given that theres an
 observed benefit to batching for a saturated link with smaller packets
 (in my results small is anything below 256B which maps to about
 380Kpps; anything above that seems to approach wire speed and the link is
 the bottleneck); then i theorize that 10Gige with 9K jumbo frames if
 already achieving wire rate, should continue to do so. And sizes below
 that will see improvements if they were not already hitting wire rate.
 So i would say that with 10G NICS, there will be more observed
 improvements with batching with apps that do bulk transfers (assuming
 those apps are not seeing wire speed already). Note that this hasnt been
 quite the case even with TSO given the bottlenecks in the Linux
 receivers that J Heffner put nicely in a response to some results you
 posted - but that exposes an issue with Linux receivers rather than TSO.

It would be good to see some empirical evidence that there aren't
any unforeseen gotchas for larger packet sizes, that at least the
same level of performance can be obtained with no greater CPU
utilization.

  Also for your xmit

Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread Bill Fink
On Mon, 01 Oct 2007, jamal wrote:

 On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:
 
  Have you done performance comparisons for the case of using 9000-byte
  jumbo frames?
 
 I haven't, but will try if any of the GigE cards I have support it.
 
 As a side note: I have not seen any useful gains or losses as the packet
 size approaches even 1500B MTU. For example, past about 256B neither the
 batching nor the non-batching give much difference in either throughput
 or cpu use. Below 256B, there's a noticeable gain for batching.
 Note, in the cases of my tests all 4 CPUs are in full-throttle UDP and
 so the occupancy of both the qdisc queue(s) and ethernet ring is
 constantly high. For example at 512B, the app is 80% idle on all 4 CPUs
 and we are hitting in the range of wire speed. We are at 90% idle at
 1024B. This is the case with or without batching.  So my suspicion is
 that with that trend a 9000B packet will just follow the same pattern.

One reason I ask, is that on an earlier set of alternative batching
xmit patches by Krishna Kumar, his performance testing showed a 30 %
performance hit for TCP for a single process and a size of 4 KB, and
a performance hit of 5 % for a single process and a size of 16 KB
(a size of 8 KB wasn't tested).  Unfortunately I was too busy at the
time to inquire further about it, but it would be a major potential
concern for me in my 10-GigE network testing with 9000-byte jumbo
frames.  Of course the single process and 4 KB or larger size was
the only case that showed a significant performance hit in Krishna
Kumar's latest reported test results, so it might be acceptable to
just have a switch to disable the batching feature for that specific
usage scenario.  So it would be useful to know if your xmit batching
changes would have similar issues.

Also for your xmit batching changes, I think it would be good to see
performance comparisons for TCP and IP forwarding in addition to your
UDP pktgen tests, including various packet sizes up to and including
9000-byte jumbo frames.

-Bill


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-09-30 Thread Bill Fink
On Sun, 30 Sep 2007, jamal wrote:

 This patch adds the usage of batching within the core.
 
 cheers,
 jamal



 [NET_BATCH] net core use batching
 
 This patch adds the usage of batching within the core.
 The same test methodology used in introducing txlock is used, with
 the following results on different kernels:
 
          +--------+----------+---------+--------+--------+
          |   64B  |  128B    | 256B    | 512B   |1024B   |
 ---------+--------+----------+---------+--------+--------+
 Original | 467482 | 463061   | 388267  | 216308 | 114704 |
 ---------+--------+----------+---------+--------+--------+
 txlock   | 468922 | 464060   | 388298  | 216316 | 114709 |
 ---------+--------+----------+---------+--------+--------+
 tg3nobtx | 468012 | 464079   | 388293  | 216314 | 114704 |
 ---------+--------+----------+---------+--------+--------+
 tg3btxdr | 480794 | 475102   | 388298  | 216316 | 114705 |
 ---------+--------+----------+---------+--------+--------+
 tg3btxco | 481059 | 475423   | 388285  | 216308 | 114706 |
 ---------+--------+----------+---------+--------+--------+
 
 The first two columns, Original and txlock, were introduced in an earlier
 patch and demonstrate a slight increase in performance with txlock.
 tg3nobtx shows the tg3 driver with no changes to support batching.
 The purpose of this test is to demonstrate the effect of introducing
 the core changes to a driver that doesn't support them.
 Although this patch brings down performance slightly compared to txlock
 for such netdevices, it is still better compared to just the original
 kernel.
 tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the tg3
 driver. tg3btxco demonstrates the effect of letting the core do all the
 work. As can be seen, the last two are not very different in performance.
 The difference is that ->hard_batch_xmit() introduces a new method, which
 is intrusive.

Have you done performance comparisons for the case of using 9000-byte
jumbo frames?

-Bill


Re: e1000 driver and samba

2007-09-19 Thread Bill Fink
On Wed, 19 Sep 2007, L F wrote:

 I have one further question: what should I be doing with the TSO and
 flow control? As of now, TSO is on but flow control is off.
 I'd like to thank everyone who helped and I'll be trying to see if the
 realtek integrated NIC works next.

Just my personal opinion, but unless you want to do more testing,
since you now seem to have a working setup, I would tend to leave
it the way it is.

-Bill


Re: e1000 driver and samba

2007-09-19 Thread Bill Fink
On Wed, 19 Sep 2007, L F wrote:

 Well,
 the issue seems to have gone away as of this morning, but I am
 somewhat unsure as to why.
 Placement of some things was modified so as to allow shorter cables.
 Now there are 3' CAT6 cables everywhere except for the 15' cable
 between the two switches. All the cables are new, high quality
 'tested' cables from a company nearby.
 The server is now running 2.6.22.6 with the 7.6.5 e1000 driver from
 intel.com and samba 3.0.26-1 ... and it seems to work. Samba will not
 disconnect, even with all 8 clients running unreasonable read/write
 loads and CRC and MD5 checksums of the transferred files all match.
 The issue therefore seems to have gone away, but the reason why still
 escapes me. I cannot believe that CAT5 cables under 10' in length were
 causing it, because if that were the case
 1) it would've shown itself, I presume, from the beginning
 2) I could name dozens of different locations which would be having
 the same problems
 Samba 3.0.25 was definitely part of the problem and I sent a nice
 nastygram to the debian maintainers, because -testing is not
 -unstable, last I checked.
 As to samba having any sort of data integrity capability, to the best
 of my knowledge that has never been the case.
 To answer further questions: I checked for file integrity with
 CRC/CRC32/MD5 checksum utilities. They used to fail fairly
 consistently, they have been fine all this morning.

By any chance did you happen to power cycle some equipment in this
process that you didn't previously power cycle during earlier testing
and debugging?  If so, perhaps that hardware had somehow gotten into
a funky state, and the power cycling might have cleared it up.

Just a thought.

-Bill


Re: e1000 driver and samba

2007-09-18 Thread Bill Fink
On Mon, 17 Sep 2007, Brandeburg, Jesse wrote:

 L F wrote:
  On 9/17/07, Kok, Auke [EMAIL PROTECTED] wrote:
  The statistic we were looking at _will_ increase when running in
  half duplex, but if it increases when running in full duplex might
  indicate a hardware failure. Probably you have fixed the issue with
  the CAT6 cable. 
  Uhm, 'fixed' may be premature: I restarted the machine and with 22
  hours uptime I am getting:
  tx_deferred_ok: 36254
  
  Can you run this new configuration with the old cable? that would
  eliminate the cable (or not)
  I most certainly can. This seems to have gotten worse by a factor of
  100 or more... so am I to suspect the new cable?
  
  A single port failure on a switch can also happen, and samba is
  definitely a good test for defective hardware. I cannot rule out
  anything from the information we have gotten yet.
  True, but I tried changing the switch ports with little change.
  Putting a client on the same switch port yielded no errors on the
  client, although unfortunately I don't have ethtool statistics on XP.
  The switch, btw, is a fairly generic GS108 from Netgear (there
  actually are two).
 
 it may not be well documented, but the hardware has several states that
 it can get into that can cause the tx_deferred counter to increment.  None
 of them are fatal to traffic, it is mainly an informational statistic.
 
 in this case it is in the "due to receiving flow control; tx is paused"
 state...
 
 he has 488 rx flow control xoff/xon, which means the switch is being
 overloaded and sending flow control, or the switch is passing through
 flow control packets (which it should not since they are multicast) and
 (some) client is overloaded.
 
 can you turn off flow control at the server?  ethtool -A ethX rx off tx
 off, or load the driver with parameter FlowControl=0.  With the 7.6.5
 driver at least you'll get confirmation of the flow control change on
 the Link Up: line.

It may also be a useful test to disable hardware TSO support
via ethtool -K ethX tso off.

-Bill


Re: [PATCH 7/7] CAN: Add documentation

2007-09-18 Thread Bill Fink
On 17 Sep 2007, Urs Thuermann wrote:

 Thomas Gleixner [EMAIL PROTECTED] writes:
 
  Please do, having the patch in mail makes it easier to review and to
  comment.
 
 OK, here it is:

One more typo.

 This patch adds documentation for the PF_CAN protocol family.
 
 Signed-off-by: Oliver Hartkopp [EMAIL PROTECTED]
 Signed-off-by: Urs Thuermann [EMAIL PROTECTED]
 
 ---
  Documentation/networking/00-INDEX |2 
  Documentation/networking/can.txt  |  635 
 ++
  2 files changed, 637 insertions(+)
 
 Index: net-2.6.24/Documentation/networking/can.txt
 ===
 --- /dev/null 1970-01-01 00:00:00.0 +
 +++ net-2.6.24/Documentation/networking/can.txt   2007-09-17 
 21:57:29.0 +0200
 @@ -0,0 +1,635 @@
 +
 +
 +can.txt
 +
 +Readme file for the Controller Area Network Protocol Family (aka Socket CAN)
 +
 +This file contains
 +
 +  1 Overview / What is Socket CAN
 +
 +  2 Motivation / Why using the socket API
 +

...

 +
 +
 +1. Overview / What is Socket CAN
 +
 +
 +The socketcan package is an implementation of CAN protocols
 +(Controller Area Network) for Linux.  CAN is a networking technology
 +which has widespread use in automation, embedded devices, and
 +automotive fields.  While there have been other CAN implementations
 +for Linux based on character devices, Socket CAN uses the Berkeley
 +socket API, the Linux network stack and implements the CAN device
 +drivers as network interfaces.  The CAN socket API has been designed
 +as similar as possible to the TCP/IP protocols to allow programmers,
 +familiar with network programming, to easily learn how to use CAN
 +sockets.
 +
 +2. Motivation / Why using the socket API
 +
 +
 +There have been CAN implementations for Linux before Socket CAN so the
 +question arises, why we have started another project.  Most existing
 +implementations come as a device driver for some CAN hardware, they
 +are based on character devices and provide comparatively little
 +functionality.  Usually, there is only a hardware-specific device
 +driver which provides a character device interface to send and
 +receive raw CAN frames, directly to/from the controller hardware.
 +Queueing of frames and higher-level transport protocols like ISO-TP
 +have to be implemented in user space applications.  Also, most
 +character-device implementations support only one single process to
 +open the device at a time, similar to a serial interface.  Exchanging
 +the CAN controller requires employment of another device driver and
 +often the need for adaption of large parts of the application to the
 +new driver's API.
 +
 +Socket CAN was designed to overcome all of these limitations.  A new
 +protocol family has been implemented which provides a socket interface
 +to user space applications and which builds upon the Linux network
 +layer, so to use all of the provided queueing functionality.  A device
 +driver for CAN controller hardware registers itself with the Linux
 +network layer as a network device, so that CAN frames from the
 +controller can be passed up to the network layer and on to the CAN
 +protocol family module and also vice-versa.  Also, the protocol family
 +module provides an API for transport protocol modules to register, so
 +that any number of transport protocols can be loaded or unloaded
 +dynamically.  In fact, the can core module alone does not provide any
 +protocol and cannot be used without loading at least one additional
 +protocol module.  Multiple sockets can be opened at the same time,
 +on different or the same protocol module and they can listen/send
 +frames on different or the same CAN IDs.  Several sockets listening on
 +the same interface for frames with the same CAN ID are all passed the
 +same received matching CAN frames.  An application wishing to
 +communicate using a specific transport protocol, e.g. ISO-TP, just
 +selects that protocol when opening the socket, and then can read and
 +write application data byte streams, without having to deal with
 +CAN-IDs, frames, etc.
 +
 +Similar functionality visible from user-space could be provided by a
 +character decive, too, but this would lead to a technically inelegant
 +solution for a couple of reasons:

decive - device above.

-Bill


Re: e1000 driver and samba

2007-09-18 Thread Bill Fink
On 18 Sep 2007, Urs Thuermann wrote:

 Bill Fink [EMAIL PROTECTED] writes:
 
  It may also be a useful test to disable hardware TSO support
  via ethtool -K ethX tso off.
 
 All suggestions here on the list, i.e. checking for flow control,
 duplex, cable problems, etc. don't explain (at least to me) why LF
 sees file corruption.  How can a corrupted frame pass the TCP checksum
 check?  Does TCP use the hardware checksum of the NIC if available?
 AFAICS, this would be the only way for a corrupt frame to make it into
 the file.  But Bill already suggested this and LF reported that it
 didn't make a difference.
 
 A few months ago I had hardware problems with an embedded device, where
 transmission from the NIC via the PCI bus to the CPU had some bits
 flipped.  But tcpdump clearly showed the TCP checksum errors and also
 TCP recognized the errors and the connection was stalled.  And, BTW,
 we also observed an increasing percentage of corrupted frames with
 increasing traffic on that interface, i.e. increasing load on the PCI
 bus.
 
 So I would run tcpdump -s0 and watch for incorrect checksum messages.

I agree TSO is an unlikely candidate since it should only affect
transmits and the problem as I understand it is with receives.
But still one of the first things I try doing when dealing with
weird problems is disabling all hardware assists.

But I also agree with you that network errors should normally be
detected by the TCP checksum (unless hardware checksumming was
messed up), and from what I recall there were no receive checksum
errors being seen.  That and the fact that the problem was seen
with two different NICs would lead me to believe that the problem
is elsewhere in the system.

That leaves many possibilities.  It could be a memory problem,
although it was indicated that memory testing was successfully
performed (but we don't know how thorough the memory checking
enabled via the BIOS is).  It could be the PCI bus writes back
to the disk, or a problem with the disk/controller/fs writes
themselves (some kind of disk stress test might be useful).
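
For such a stress test, even a simple write-and-verify loop would be
informative (just a sketch, with hypothetical paths):

# dd if=/dev/urandom of=/tmp/ref.dat bs=1M count=1024
# md5sum /tmp/ref.dat
# for i in 1 2 3 4 5; do cp /tmp/ref.dat /data/copy.$i; md5sum /data/copy.$i; done

If the checksums of the copies ever differ from the reference, the
corruption is happening on the local write path, with no network involved.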

-Bill


Re: e1000 driver and samba

2007-09-18 Thread Bill Fink
On Tue, 18 Sep 2007, Florian Weimer wrote:

 * Urs Thuermann:
 
  How can a corrupted frame pass the TCP checksum check?
 
 The TCP/IP checksums are extremely weak.  If the corruption is due to
 defective SRAM or something like that, it's likely that it causes an
 error pattern which is 16-bit-aligned.  And an even number of
 16-bit-aligned bit flips is not detected by the TCP checksum. 8-(
 
 Actually, nobody should use TCP without application-level checksums
 for that reason.  But of course, there is HTTP.

But in this specific case, IIRC there were _no_ receive checksum
errors seen, and it would seem odd that any bit corruption was
_always_ an even number of 16-bit-aligned bit flips.

Also, I don't know anything at all about the SAMBA fs/protocol, but
I would expect it would have some kind of stronger data integrity
capability that should catch such errors.  Which would be another
reason implying the data corruption problem is above the network
layer, and perhaps a hardware error of some kind on the write path
to the disk (also could possibly be a software bug of some kind
in that path).
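
To illustrate Florian's point, here is a minimal sketch (my own example,
not from the thread) showing that two offsetting flips in the same bit
position of different 16-bit words leave the Internet checksum unchanged:

#include <stdio.h>
#include <stdint.h>

/* RFC 1071 style 16-bit ones'-complement checksum */
static uint16_t csum(const uint16_t *buf, int nwords)
{
	uint32_t sum = 0;

	while (nwords--)
		sum += *buf++;
	while (sum >> 16)	/* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return ~sum;
}

int main(void)
{
	uint16_t data[4] = { 0x1234, 0x0f0f, 0xabcd, 0x8001 };

	printf("before: %04x\n", csum(data, 4));
	data[0] ^= 0x0800;	/* bit 11 flips 0 -> 1 in one word */
	data[2] ^= 0x0800;	/* bit 11 flips 1 -> 0 in another word */
	printf("after:  %04x\n", csum(data, 4));	/* same value */
	return 0;
}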

-Bill


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-09-14 Thread Bill Fink
On Mon, 27 Aug 2007, jamal wrote:

 On Sun, 2007-26-08 at 19:04 -0700, David Miller wrote:
 
  The transfer is much better behaved if we ACK every two full sized
  frames we copy into the receiver, and therefore don't stretch ACK, but
  at the cost of cpu utilization.
 
 The rx coalescing in theory should help by accumulating more ACKs on the
 rx side of the sender. But it doesn't seem to do that, i.e. for the 9K MTU,
 you are better off turning off the coalescing if you want higher
 numbers. Also some of the TOE vendors (chelsio?) claim to have fixed
 this by reducing bursts on outgoing packets.
  
 Bill:
 who suggested (as per your email) the 75usec value and what was it based
 on measurement-wise? 

Belatedly getting back to this thread.  There was a recent myri10ge
patch that changed the default value for tx/rx interrupt coalescing
to 75 usec claiming it was an optimum value for maximum throughput
(and is also mentioned in their external README documentation).

I also did some empirical testing to determine the effect of different
values of TX/RX interrupt coalescing on 10-GigE network performance,
both with TSO enabled and with TSO disabled.  The actual test runs
are attached at the end of this message, but the results are summarized
in the following table (network performance in Mbps).

             TX/RX interrupt coalescing in usec (both sides)
                  0      15      30      45      60      75      90     105

TSO enabled    8909    9682    9716    9725    9739    9745    9688    9648
TSO disabled   9113    9910    9910    9910    9910    9910    9910    9910

TSO disabled performance is always better than equivalent TSO enabled
performance.  With TSO enabled, the optimum performance is indeed at
a TX/RX interrupt coalescing value of 75 usec.  With TSO disabled,
performance is the full 10-GigE line rate of 9910 Mbps for any value
of TX/RX interrupt coalescing from 15 usec to 105 usec.
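
For reference, these coalescing values are set through the standard
ethtool interface; a sketch of the likely invocation, run on both ends
(whether tx coalescing is a separate knob or simply tracks the rx value
is driver dependent):

# ethtool -C eth2 rx-usecs 75 tx-usecs 75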

 BTW, thanks for finding the energy to run those tests and a very
 refreshing perspective. I don't mean to add more work, but I had some
 queries;
 On your earlier tests, I think that Reno showed some significant
 differences on the lower MTU case over BIC. I wonder if this is
 consistent? 

Here's a retest (5 tests each):

TSO enabled:

TCP Cubic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5007.6295 MB /  10.06 sec = 4176.1807 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4950.9279 MB /  10.06 sec = 4130.2528 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4917.1742 MB /  10.05 sec = 4102.5772 Mbps 35 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4948.7920 MB /  10.05 sec = 4128.7990 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4937.5765 MB /  10.05 sec = 4120.6460 Mbps 35 %TX 99 %RX

TCP Bic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5005.5335 MB /  10.06 sec = 4172.9571 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.0625 MB /  10.06 sec = 4169.2960 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.7500 MB /  10.06 sec = 4135.7355 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4957.3777 MB /  10.06 sec = 4135.6252 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5059.1815 MB /  10.05 sec = 4221.3546 Mbps 37 %TX 99 %RX

TCP Reno:

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4973.3532 MB /  10.06 sec = 4147.3589 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4984.4375 MB /  10.06 sec = 4155.2131 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4995.6841 MB /  10.06 sec = 4166.2734 Mbps 36 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4982.2500 MB /  10.05 sec = 4156.7586 Mbps 36 %TX 99 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4989.9796 MB /  10.05 sec = 4163.0949 Mbps 36 %TX 99 %RX

TSO disabled:

TCP Cubic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5075.8125 MB /  10.02 sec = 4247.3408 Mbps 99 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5056.0000 MB /  10.03 sec = 4229.9621 Mbps 100 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5047.4375 MB /  10.03 sec = 4223.1203 Mbps 99 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5066.1875 MB /  10.03 sec = 4239.1659 Mbps 100 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4986.3750 MB /  10.03 sec = 4171.9906 Mbps 99 %TX 100 %RX

TCP Bic (initial_ssthresh set to 0):

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5040.5625 MB /  10.03 sec = 4217.3521 Mbps 100 %TX 100 %RX
[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5049.7500 MB /  10.03 sec = 4225.4585 Mbps 99 %TX 100 %RX

Re: e1000 driver and samba

2007-09-14 Thread Bill Fink
On Fri, 14 Sep 2007, L F wrote:

  can you describe your setup a bit more in detail? you're writing from a
  linux client to a windows smb server? or even to a linux server? which
  end sees the connection drop? the samba server? the samba linux client?
 Certainly.
 I have a LAN, with two switches in a stack. There currently are 7
 WinXP clients and one linux machine. The linux machine acts as a samba
 server and as a firewall/gateway.
 The two ports of the PRO/1000 in the linux box are connected to the
 LAN (eth4) and to a Comcast modem (eth3) respectively. Shorewall 3.4.5
 is running on the linux machine, with a strong firewall + NAT setup.
 Further, the linux machine currently has a tap device bridged into the
 LAN side, for virtualbox.
 Therefore, eth3 is a plain ethernet interface. br0, on the lan side,
 is tap0 + eth4.
 If I get any client on the LAN side, I can read from the linux box
 without a problem. However, if I attempt to write to the linux box
 from a LANside client, it will fail. If traffic is low, the failures
 are sporadic. If traffic is high (large file and/or multiple incoming
 files) the failure is guaranteed, either in 'delayed write fail' mode
 on the client or in silent corruption of the file (much worse). If
 read/write activity is combined, for instance when I unzip a zip
 archive to its own directory, failure is guaranteed and rapid, with a
 'delayed write fail' on the client after 50MB or so.
 I can post .config and anything else you may want if you require it. I
 tried changing cable as you suggested with little success. I'll try
 changing switch port, just to cover all bases.

Would it be worth a shot to try disabling the receiver hardware
checksumming (ethtool -K ethX rx off)?

-Bill


Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates

2007-09-12 Thread Bill Fink
On Fri, 07 Sep 2007, jamal wrote:

 On Fri, 2007-07-09 at 10:31 +0100, James Chapman wrote:
  Not really. I used 3-year-old, single CPU x86 boxes with e100 
  interfaces. 
  The idle poll change keeps them in polled mode. Without idle 
  poll, I get twice as many interrupts as packets, one for txdone and one 
  for rx. NAPI is continuously scheduled in/out.
 
 Certainly faster than the machine in the paper (which was about 2 years
 old in 2005).
 I could never get ping -f to do that for me - so things must be getting
 worse with newer machines then.
 
  No. Since I did a flood ping from the machine under test, the improved 
  latency meant that the ping response was handled more quickly, causing 
  the next packet to be sent sooner. So more packets were transmitted in 
  the allotted time (10 seconds).
 
 ok.
 
  With current NAPI:
  rtt min/avg/max/mdev = 0.902/1.843/101.727/4.659 ms, pipe 9, ipg/ewma 1.611/1.421 ms
  
  With idle poll changes:
  rtt min/avg/max/mdev = 0.898/1.117/28.371/0.689 ms, pipe 3, ipg/ewma 1.175/1.236 ms
 
 Not bad in terms of latency. The deviation certainly looks better.
 
  But the CPU has done more work. 
 
 I am going to be the devil's advocate[1]:

So let me be the angel's advocate.  :-)

 If the problem I am trying to solve is to reduce cpu use at lower rates,
 then this is not the right answer because your cpu use has gone up.
 Your latency numbers have not improved that much (looking at the avg)
 and your throughput is not that much higher. Will I be willing to pay
 more cpu (of an already piggish cpu use by NAPI at that rate with 2
 interrupts per packet)?

I view his results much more favorably.  With current NAPI, the average
RTT is 104% higher than the minimum, the deviation is 4.659 ms, and the
maximum RTT is 101.727 ms.  With his patch, the average RTT is only 24%
higher than the minimum, the deviation is only 0.689 ms, and the maximum
RTT is 28.371 ms.  The average RTT improved by 39%, the deviation was
6.8 times smaller, and the maximum RTT was 3.6 times smaller.  So in
every respect the latency was significantly better.

The throughput increased from 6200 packets to 8510 packets or an increase
of 37%.  The only negative is that the CPU utilization increased from
62% to 100% or an increase of 61%, so the CPU increase was greater than
the increase in the amount of work performed (scaling 62% by the 37%
increase in work predicts about 85% CPU, so the measured 100% is 17.6%
greater than what one would expect purely from the increased amount
of work).

You can't always improve on all metrics of a workload.  Sometimes there
are tradeoffs to be made to be decided by the user based on what's most
important to that user and his specific workload.  And the suggested
ethtool option (defaulting to current behavior) would enable the user
to make that decision.

-Bill

P.S.  I agree that some tests run in parallel with some CPU hogs also
  running might be beneficial and enlightening.


Re: [PATCH 1/2]: [NET_SCHED]: Make all rate based scheduler work with TSO.

2007-09-04 Thread Bill Fink
On Tue, 04 Sep 2007, Patrick McHardy wrote:

 Bill Fink wrote:
  On Sat, 1 Sep 2007, Jesper Dangaard Brouer wrote:
 
 On Sat, 1 Sep 2007, Patrick McHardy wrote:
 
  It still won't work properly with TSO (TBF for example already drops
  oversized packets during ->enqueue), but it's a good cleanup anyway.
 
 Then let's call it a cleanup of the L2T macros.  In the next step we will 
 fix the different schedulers, to use the ability to lookup larger sized 
 packets. (I did notice the TBF scheduler would drop oversized packets).
  
  Hmmm.  I guess this is also why TBF doesn't seem to work with 9000 byte
  jumbo frames.
  
  [EMAIL PROTECTED] ~]# tc qdisc add dev eth2 root tbf rate 2gbit buffer 
  500 limit 18000
 
 Yes, you need to specify the MTU on the command line for
 jumbo frames.

Thanks!  Works much better now, although it does slightly exceed
the specified rate.

[EMAIL PROTECTED] ~]# tc qdisc add dev eth2 root tbf rate 2gbit buffer 500 
limit 18000 mtu 9000

[EMAIL PROTECTED] ~]# ./nuttcp-5.5.5 -w10m 192.168.88.14
 2465.6729 MB /  10.08 sec = 2051.8241 Mbps 19 %TX 13 %RX

[EMAIL PROTECTED] ~]# ./nuttcp-5.5.5 -M1460 -w10m 192.168.88.14
 2785.5000 MB /  10.00 sec = 2335.6569 Mbps 100 %TX 26 %RX

-Bill


Re: [PATCH 1/2]: [NET_SCHED]: Make all rate based scheduler work with TSO.

2007-09-03 Thread Bill Fink
On Sat, 1 Sep 2007, Jesper Dangaard Brouer wrote:

 On Sat, 1 Sep 2007, Patrick McHardy wrote:
 
  Jesper Dangaard Brouer wrote:
  commit 6fdc0f061be94f5e297650961360fb7a9d1cc85d
  Author: Jesper Dangaard Brouer [EMAIL PROTECTED]
  Date:   Thu Aug 30 17:53:42 2007 +0200
  
  [NET_SCHED]: Make all rate based scheduler work with TSO.
  
  Change L2T (length to time) macros, in all rate based schedulers, to
  call a common function qdisc_l2t() that does the rate table lookup.
  This function handles the case where the packet size lookup is larger
  than the rate table, which often occurs with TSO enabled.
 
 
  It still won't work properly with TSO (TBF for example already drops
  oversized packets during ->enqueue), but it's a good cleanup anyway.
 
 Then let's call it a cleanup of the L2T macros.  In the next step we will 
 fix the different schedulers, to use the ability to lookup larger sized 
 packets. (I did notice the TBF scheduler would drop oversized packets).

Hmmm.  I guess this is also why TBF doesn't seem to work with 9000 byte
jumbo frames.

[EMAIL PROTECTED] ~]# tc qdisc add dev eth2 root tbf rate 2gbit buffer 500 
limit 18000
[EMAIL PROTECTED] ~]# tc qdisc show 
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc tbf 8002: dev eth2 rate 2000Mbit burst 500b lat 4.3s

With 9000 byte jumbo frames:

[EMAIL PROTECTED] ~]# ./nuttcp-5.5.5 -w10m 192.168.88.14
    0.0000 MB /   5.00 sec =    0.0000 Mbps 0 %TX 0 %RX

But reducing the MSS to 1460 to emulate a standard 1500 byte Ethernet MTU:

[EMAIL PROTECTED] ~]# ./nuttcp-5.5.5 -M1460 -w10m 192.168.88.14
 2335.7048 MB /  10.05 sec = 1950.3419 Mbps 62 %TX 22 %RX

This is on a 2.6.20.7 kernel.
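
For reference, my understanding of the qdisc_l2t() helper discussed
above is roughly the following sketch (not the verbatim patch): when the
computed slot runs past the 256-entry rate table, as it does for TSO or
jumbo sized packets, it approximates the transmission time from the
largest table entry instead of reading past the end:

static inline u32 qdisc_l2t(struct qdisc_rate_table *rtab,
			    unsigned int pktlen)
{
	int slot = pktlen + rtab->rate.cell_align + rtab->rate.overhead;

	if (slot < 0)
		slot = 0;
	slot >>= rtab->rate.cell_log;
	if (slot > 255)		/* oversized: scale the top entry */
		return rtab->data[255] * (slot >> 8) + rtab->data[slot & 0xFF];
	return rtab->data[slot];
}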

-Bill


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-26 Thread Bill Fink
On Fri, 24 Aug 2007, John Heffner wrote:

 Bill Fink wrote:
  Here you can see there is a major difference in the TX CPU utilization
  (99 % with TSO disabled versus only 39 % with TSO enabled), although
  the TSO disabled case was able to squeeze out a little extra performance
  from its extra CPU utilization.  Interestingly, with TSO enabled, the
  receiver actually consumed more CPU than with TSO disabled, so I guess
  the receiver CPU saturation in that case (99 %) was what restricted
  its performance somewhat (this was consistent across a few test runs).
 
 One possibility is that I think the receive-side processing tends to do 
 better when receiving into an empty queue.  When the (non-TSO) sender is 
 the flow's bottleneck, this is going to be the case.  But when you 
 switch to TSO, the receiver becomes the bottleneck and you're always 
 going to have to put the packets at the back of the receive queue.  This 
 might help account for the reason why you have both lower throughput and 
 higher CPU utilization -- there's a point of instability right where the 
 receiver becomes the bottleneck and you end up pushing it over to the 
 bad side. :)
 
 Just a theory.  I'm honestly surprised this effect would be so 
 significant.  What do the numbers from netstat -s look like in the two 
 cases?

Well, I was going to check this out, but I happened to reboot the
system and now I get somewhat different results.

Here are the new results, which should hopefully be more accurate
since they are on a freshly booted system.

TSO enabled and GSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11610.6875 MB /  10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5029.6875 MB /  10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX

TSO disabled and GSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11817.9375 MB /  10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5823.3125 MB /  10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX

The TSO disabled case got a little better performance even for
9000 byte jumbo frames.  For the -M1460 case eumalating a
standard 1500 byte Ethernet MTU, the performance was significantly
better and used less CPU on the receiver (82 % versus 100 %)
although it did use significantly more CPU on the transmitter
(100 % versus 36 %).

TSO disabled and GSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11609.5625 MB /  10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.4375 MB /  10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX

The GSO enabled case is very similar to the TSO enabled case,
except that for the -M1460 test the transmitter used more
CPU (52 % versus 36 %), which is to be expected since TSO has
hardware assist.

Here's the beforeafter delta of the receiver's netstat -s
statistics for the TSO enabled case:

Ip:
3659898 total packets received
3659898 incoming packets delivered
80050 requests sent out
Tcp:
2 passive connection openings
3659897 segments received
80050 segments send out
TcpExt:
33 packets directly queued to recvmsg prequeue.
104956 packets directly received from backlog
705528 packets directly received from prequeue
3654842 packets header predicted
193 packets header predicted and directly queued to user
4 acknowledgments not containing data received
6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
4107083 total packets received
4107083 incoming packets delivered
1401376 requests sent out
Tcp:
2 passive connection openings
4107083 segments received
1401376 segments send out
TcpExt:
2 TCP sockets finished time wait in fast timer
48486 packets directly queued to recvmsg prequeue.
1056111048 packets directly received from backlog
2273357712 packets directly received from prequeue
1819317 packets header predicted
2287497 packets header predicted and directly queued to user
4 acknowledgments not containing data received
10 predicted acknowledgments

For the TSO disabled case, there are far more TCP segments
sent out (1401376 versus 80050), which I assume are ACKs, and which
could possibly contribute to the higher throughput for the TSO disabled
case due to faster feedback, but not explain the lower CPU utilization.
There are many more packets directly queued to recvmsg prequeue
(48486 versus 33).  The numbers for packets directly received from
backlog and prequeue in the TSO disabled case seem bogus to me so
I don't know how to interpret that.  There are only about half as
many packets header predicted (1819317 versus 3654842), but there
are many more packets header predicted and directly queued to user
(2287497 versus 193).  I'll leave the analysis of all this to those
who might actually know what it all means.

I also ran another set of tests that may

Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-25 Thread Bill Fink
On Sat, 25 Aug 2007, Herbert Xu wrote:

 On Fri, Aug 24, 2007 at 02:25:03PM -0700, David Miller wrote:
 
  My hunch is that even if in the non-TSO case the TX packets were all
  back to back in the cards TX ring, TSO still spits them out faster on
  the wire.
 
 If this is the case then we should see an improvement by
 disabling TSO and enabling GSO.

TSO disabled and GSO enabled:

[EMAIL PROTECTED] redhat]# nuttcp -w10m 192.168.88.16
11806.7500 MB /  10.00 sec = 9900.6278 Mbps 100 %TX 84 %RX

[EMAIL PROTECTED] redhat]# nuttcp -M1460 -w10m 192.168.88.16
 4872.0625 MB /  10.00 sec = 4085.5690 Mbps 100 %TX 64 %RX

In the -M1460 case, there was generally less receiver CPU utilization,
but the transmitter utilization was generally pegged at 100 %, even
though there wasn't any improvement in throughput compared to the
TSO enabled case (in fact the throughput generally seemed to be somewhat
less than the TSO enabled case).  Note there was a fair degree of
variability across runs for the receiver CPU utilization (the one
shown I considered to be representative of the average behavior).
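
For reference, this offload combination would presumably have been set
with something like the following (the tso toggle appears verbatim in
tests quoted below; the gso flag here is my assumption):

# ethtool -K eth2 tso off gso on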

Repeat of previous test results:

TSO enabled and GSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB /  10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5102.8503 MB /  10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled and GSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5399.5625 MB /  10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

-Bill


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-24 Thread Bill Fink
On Fri, 24 Aug 2007, jamal wrote:

 On Thu, 2007-23-08 at 23:18 -0400, Bill Fink wrote:
 
 [..]
  Here you can see there is a major difference in the TX CPU utilization
  (99 % with TSO disabled versus only 39 % with TSO enabled), although
  the TSO disabled case was able to squeeze out a little extra performance
  from its extra CPU utilization.  
 
 Good stuff. What kind of machine? SMP?

Tyan Thunder K8WE S2895ANRF motherboard with Nvidia nForce
Professional 2200+2050 chipset, 2 AMD Opteron 254 2.8 GHz CPUs,
4 GB PC3200 ECC REG-DDR 400 memory, and 2 PCI-Express x16 slots
(2 buses).

It is SMP but both the NIC interrupts and nuttcp are bound to
CPU 0.  And all other non-kernel system processes are bound to
CPU 1.
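
For anyone reproducing this kind of pinning, a sketch of the usual
commands (the IRQ number and pid here are hypothetical):

# echo 1 > /proc/irq/90/smp_affinity        # steer the NIC IRQ to CPU 0 (mask 0x1)
# taskset -c 0 nuttcp -w10m 192.168.88.16   # run the test bound to CPU 0
# taskset -p -c 1 12345                     # move another process to CPU 1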

 Seems the receive side of the sender is also consuming a lot more cpu
 i suspect because receiver is generating a lot more ACKs with TSO.

Odd.  I just reran the TCP CUBIC -M1460 tests, and with TSO enabled
on the transmitter, there were about 153709 eth2 interrupts on the
receiver, while with TSO disabled there was actually a somewhat higher
number (164988) of receiver side eth2 interrupts, although the receive
side CPU utilization was actually lower in that case.

On the transmit side (different test run), the TSO enabled case had
about 161773 eth2 interrupts whereas the TSO disabled case had about
165179 eth2 interrupts.

 Does the choice of the tcp congestion control algorithm affect results?
 it would be interesting to see both MTUs with either TCP BIC vs good old
 reno on sender (probably without changing what the receiver does). BIC
 seems to be the default lately.

These tests were with the default TCP CUBIC (with initial_ssthresh
set to 0).

With TCP BIC (and initial_ssthresh set to 0):

TSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11751.3750 MB /  10.00 sec = 9853.9839 Mbps 100 %TX 83 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 4999.3321 MB /  10.06 sec = 4167.7872 Mbps 38 %TX 100 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB /  10.00 sec = 9910.0682 Mbps 99 %TX 81 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5502.6250 MB /  10.00 sec = 4614.3297 Mbps 100 %TX 84 %RX

And with TCP Reno:

TSO enabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11782.6250 MB /  10.00 sec = 9880.2613 Mbps 100 %TX 77 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5024.6649 MB /  10.06 sec = 4191.6574 Mbps 38 %TX 99 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0860 Mbps 99 %TX 77 %RX

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5284.0000 MB /  10.00 sec = 4430.9604 Mbps 99 %TX 79 %RX

Very similar results to the original TCP CUBIC tests.

  Interestingly, with TSO enabled, the
  receiver actually consumed more CPU than with TSO disabled, 
 
 I would suspect the fact that a lot more packets making it into the
 receiver for TSO contributes.
 
  so I guess
  the receiver CPU saturation in that case (99 %) was what restricted
  its performance somewhat (this was consistent across a few test runs).
 
 Unfortunately the receiver plays a big role in such tests - if it is
 bottlenecked then you are not really testing the limits of the
 transmitter. 

It might be interesting to see what affect the LRO changes would have
on this.  Once they are in a stable released kernel, I might try that
out, or maybe even before if I get some spare time (but that's in very
short supply right now).

-Thanks

-Bill


Re: [PATCH 0/9 Rev3] Implement batching skb API and support in IPoIB

2007-08-23 Thread Bill Fink
On Thu, 23 Aug 2007, Rick Jones wrote:

 jamal wrote:
  [TSO already passed - iirc, it has been
  demonstrated to really not add much to throughput (can't improve much
  over closeness to wire speed) but improve CPU utilization].
 
 In the one gig space sure, but in the 10 Gig space, TSO on/off does make a 
 difference for throughput.

Not too much.

TSO enabled:

[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11813.4375 MB /  10.00 sec = 9906.1644 Mbps 99 %TX 80 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# ethtool -K eth2 tso off
[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[EMAIL PROTECTED] ~]# nuttcp -w10m 192.168.88.16
11818.2500 MB /  10.00 sec = 9910.0176 Mbps 100 %TX 78 %RX

Pretty negligible difference it seems.

This is with a 2.6.20.7 kernel, Myricom 10-GigE NICs, and 9000 byte
jumbo frames, in a LAN environment.

For grins, I also did a couple of tests with an MSS of 1460 to
emulate a standard 1500 byte Ethernet MTU.

TSO enabled:

[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5102.8503 MB /  10.06 sec = 4253.9124 Mbps 39 %TX 99 %RX

TSO disabled:

[EMAIL PROTECTED] ~]# ethtool -K eth2 tso off
[EMAIL PROTECTED] ~]# ethtool -k eth2
Offload parameters for eth2:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[EMAIL PROTECTED] ~]# nuttcp -M1460 -w10m 192.168.88.16
 5399.5625 MB /  10.00 sec = 4527.9070 Mbps 99 %TX 76 %RX

Here you can see there is a major difference in the TX CPU utilization
(99 % with TSO disabled versus only 39 % with TSO enabled), although
the TSO disabled case was able to squeeze out a little extra performance
from its extra CPU utilization.  Interestingly, with TSO enabled, the
receiver actually consumed more CPU than with TSO disabled, so I guess
the receiver CPU saturation in that case (99 %) was what restricted
its performance somewhat (this was consistent across a few test runs).

-Bill


Re: [PATCH 0/24] make atomic_read() behave consistently across all architectures

2007-08-15 Thread Bill Fink
On Wed, 15 Aug 2007, Satyam Sharma wrote:

 (C)
 $ cat tp3.c
 int a;
 
 void func(void)
 {
   *(volatile int *)&a = 10;
   *(volatile int *)&a = 20;
 }
 $ gcc -Os -S tp3.c
 $ cat tp3.s
 ...
 movl$10, a
 movl$20, a
 ...

I'm curious about one minor tangential point.  Why, instead of:

b = *(volatile int *)&a;

why can't this just be expressed as:

b = (volatile int)a;

Isn't it the contents of a that's volatile, i.e. its value can change
invisibly to the compiler, and that's why you want to force a read from
memory?  Why do you need the *(volatile int *) construct?
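
My best guess at the answer (a sketch, not authoritative): volatile only
has an effect when it qualifies the lvalue used for the access, so the
cast-of-value form would not force a load:

int a, b;

void f(void)
{
	b = *(volatile int *)&a;	/* access through a volatile lvalue:
					   the load from a must be emitted */
	b = (volatile int)a;		/* volatile qualifies only the value
					   of the cast; the compiler may
					   still optimize the load away */
}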

-Bill


Re: [PATCH] ixgbe: New driver for Pci-Express 10GbE 82598 support

2007-08-04 Thread Bill Fink
On Fri, 3 Aug 2007, Auke Kok wrote:

 This patch adds support for the Intel 82598 PCI-Express 10GbE
 chipset. Devices will be available on the market soon.
 
 This version of the driver is largely the same as the last release:
 
 * Driver uses a single RX and single TX queue, each using 1 MSI-X
   irq vector.
 * Driver runs in NAPI mode only
 * Driver is largely multiqueue-ready (TM)

...

 diff --git a/Documentation/networking/ixgbe.txt 
 b/Documentation/networking/ixgbe.txt
 new file mode 100644
 index 000..823d69c
 --- /dev/null
 +++ b/Documentation/networking/ixgbe.txt
 @@ -0,0 +1,72 @@
 +Linux* Base Driver for the 10 Gigabit Family of Adapters
 +
 +
 +July 09, 2007
 +
 +
 +Contents
 +
 +
 +- In This Release
 +- Identifying Your Adapter
 +- Command Line Parameters

There is no section "Command Line Parameters" in the document.

-Bill



 +- Support
 +
 +In This Release
 +===
 +
 +This file describes the Linux* Base Driver for the 10 Gigabit PCI Express
 +Family of Adapters.  This driver supports the 2.6.x kernel. This driver
 +includes support for Itanium(R)2-based systems.
 +
 +The following features are now available in supported kernels:
 + - Native VLANs
 + - Channel Bonding (teaming)
 + - SNMP
 +
 +Channel Bonding documentation can be found in the Linux kernel source:
 +/Documentation/networking/bonding.txt
 +
 +Instructions on updating ethtool can be found in the section Additional
 +Configurations later in this document.
 +
 +
 +Identifying Your Adapter
 +
 +
 +The following Intel network adapters are compatible with the drivers in this
 +release:
 +
 +Controller  Adapter Name              Physical Layer
 +----------  ------------              --------------
 +82598       Intel(R) 10GbE-LR/LRM/SR
 +            Server Adapters           10G Base -SR (850 nm optical fiber)
 +                                      10G Base -LRM (850 nm optical fiber)
 +                                      10G Base -LR (1310 nm optical fiber)
 +
 +For more information on how to identify your adapter, go to the Adapter 
 +Driver ID Guide at:
 +
 +http://support.intel.com/support/network/sb/CS-012904.htm
 +
 +For the latest Intel network drivers for Linux, refer to the following
 +website.  In the search field, enter your adapter name or type, or use the
 +networking link on the left to search for your adapter:
 +
 +http://downloadfinder.intel.com/scripts-df/support_intel.asp
 +
 +
 +Support
 +===
 +
 +For general information, go to the Intel support website at:
 +
 +http://support.intel.com
 +
 +or the Intel Wired Networking project hosted by Sourceforge at:
 +
 +http://sourceforge.net/projects/e1000
 +
 +If an issue is identified with the released source code on the supported
 +kernel with a supported adapter, email the specific information related
 +to the issue to [EMAIL PROTECTED]


Re: specifying scopid's for link-local IPv6 addrs

2007-07-25 Thread Bill Fink
On Tue, 24 Jul 2007, Sridhar Samudrala wrote:

 On Tue, 2007-07-24 at 10:13 -0700, Rick Jones wrote:
   Rick,
   
   I don't see any way around this.  For example, on one of my test
   systems, I have the following link local routes:
   
   chance% netstat -A inet6 -rn | grep fe80::/64
   fe80::/64      ::      U    256 0     0 eth0
   fe80::/64      ::      U    256 0     0 eth2
   fe80::/64      ::      U    256 0     0 eth3
   fe80::/64      ::      U    256 0     0 eth4
   fe80::/64      ::      U    256 0     0 eth5
   fe80::/64      ::      U    256 0     0 eth6
   
   So if I want to run a link local test to fe80::202:b3ff:fed4:cd1,
   the system has no way to choose which is the correct interface to
   use for the test, and will give an error if the interface isn't
   specified. 
  
  Yeah, I was wondering about that.  I'm not sure if the attempts on those 
  other 
  OSes happened to involve multiple interfaces or not.  Even so, it feels 
  unpleasant for an application to deal with and I wonder if there is a way 
  for a 
  stack to deal with it on the application's behalf.  I guess that might 
  involve 
  some sort of layer violation between neighbor discovery and routing 
  (typing 
  while I think about things I know little about...)
  
  Is there RFC chapter and verse I might read about routing with multiple 
  link-local's on a system?
  
   You must explicitly specify the desired interface.  For example,
   on my test system, the correct interface is eth6 which is interface 8
   (lo eth0 eth1 eth2 ... eth5 eth6).  Here is an example nuttcp test
   specifying interface 8:
   
   chance% nuttcp -P5100 fe80::202:b3ff:fed4:cd1%8
1178.5809 MB /  10.02 sec =  986.2728 Mbps 12 %TX 15 %RX
   
   nuttcp uses getaddrinfo() which parses the %ifindex field,
   and then copies the sin6_scope_id from the res structure to the
   server's sockaddr_in6 structure before initiating the connect().
  
  OK, I'll give that a quick try with netperf:
  
  [EMAIL PROTECTED] ~]# netperf -H 192.168.2.107 -c -C -i 30,3 -- -s 1M -S 1M 
  -m 64K 
  -H fe80::207:43ff:fe05:9d%2
 
 We can even specify the interface name instead of the interface index:
<link-local-addr>%ethX
 
 getaddrinfo() uses if_nametoindex() internally to get the index.
 
 Thanks
 Sridhar

Cool!  That's much easier and works great.  :-)

chance% nuttcp -P5100 fe80::202:b3ff:fed4:cd1%eth6
 1178.5468 MB /  10.02 sec =  986.3239 Mbps 13 %TX 15 %RX

Still learn something new every day.  Now if I just could remember
it all when I needed it later.  :-)

-Thanks

-Bill


Re: specifying scopid's for link-local IPv6 addrs

2007-07-24 Thread Bill Fink
On Mon, 23 Jul 2007, Rick Jones wrote:

 Folks -
 
 People running netperf have reported that they have trouble with IPv6 under 
 Linux.  Specifically, whereas the use of link-local IPv6 addresses just 
 works 
 in netperf under a number of other OSes they do not under Linux.  I'm 
 ass-u-me-ing 2.6 here, but not sure exactly which ones - I've seen it on a 
 2.6.18-based RHEL5.
 
 Some poking about and conversation has suggested that one has to set a 
 sin6_scope_id in the sockaddr_in6.  This needs to be an index of one of the 
 interfaces in the system, which I presume means walking some additional 
 structures.
 
 Is this a requirement which might be expected to remain in the future, or is 
 it 
 something which might just go away?  That will have an effect on netperf 
 future 
 development.
 
 thanks,
 
 rick jones

Rick,

I don't see any way around this.  For example, on one of my test
systems, I have the following link local routes:

chance% netstat -A inet6 -rn | grep fe80::/64
fe80::/64      ::      U    256 0     0 eth0
fe80::/64      ::      U    256 0     0 eth2
fe80::/64      ::      U    256 0     0 eth3
fe80::/64      ::      U    256 0     0 eth4
fe80::/64      ::      U    256 0     0 eth5
fe80::/64      ::      U    256 0     0 eth6

So if I want to run a link local test to fe80::202:b3ff:fed4:cd1,
the system has no way to choose which is the correct interface to
use for the test, and will give an error if the interface isn't
specified.  Here's an example of this with nuttcp:

chance% nuttcp -P5100 fe80::202:b3ff:fed4:cd1
nuttcp-t: Info: attempting to switch to deprecated classic mode
nuttcp-t: Info: will use less reliable transmitter side statistics
nuttcp-t: v5.5.5: Error: connect: Invalid argument
errno=22

You must explicitly specify the desired interface.  For example,
on my test system, the correct interface is eth6 which is interface 8
(lo eth0 eth1 eth2 ... eth5 eth6).  Here is an example nuttcp test
specifying interface 8:

chance% nuttcp -P5100 fe80::202:b3ff:fed4:cd1%8
 1178.5809 MB /  10.02 sec =  986.2728 Mbps 12 %TX 15 %RX

nuttcp uses getaddrinfo() which parses the %ifindex field,
and then copies the sin6_scope_id from the res structure to the
server's sockaddr_in6 structure before initiating the connect().
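
In code, that pattern looks roughly like the following simplified sketch
(error handling trimmed; not the verbatim nuttcp source):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>

int connect_ll(const char *host, const char *port)
{
	struct addrinfo hints, *res;
	struct sockaddr_in6 server;
	int fd;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_INET6;
	hints.ai_socktype = SOCK_STREAM;

	/* host may be e.g. "fe80::202:b3ff:fed4:cd1%8" or "...%eth6";
	 * getaddrinfo() parses the suffix and fills in sin6_scope_id */
	if (getaddrinfo(host, port, &hints, &res) != 0)
		return -1;

	memcpy(&server, res->ai_addr, sizeof(server));
	freeaddrinfo(res);

	/* the sin6_scope_id carried over from res is what makes the
	 * link-local connect() unambiguous */
	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd >= 0 &&
	    connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
		close(fd);
		fd = -1;
	}
	return fd;
}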

-Bill


Re: Realtek RTL8111B serious performance issues

2007-07-18 Thread Bill Fink
Hi John,

On Wed, 18 Jul 2007, [EMAIL PROTECTED] wrote:

 On Wed, 18 Jul 2007, Francois Romieu wrote:
 
  [EMAIL PROTECTED] [EMAIL PROTECTED] :
  [...]
  Anyone have any suggestions for solving this problem?
 
  Try 2.6.23-rc1 when it is published or apply against 2.6.22 one of:
  http://www.fr.zoreil.com/people/francois/misc/20070628-2.6.22-rc6-r8169-test.patch
 
 Unfortunately, the 20070628 patch did not make any difference.
 
 
  http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.22-rc6/r8169-20070628/
 
 
 I tried various patches from that directory (aren't most or all of them
 included in the 20070628 patch?), but none of them helped either.
 
 
 This problem could be very difficult to track down.  Like I said, it
 definitely affects emacs and firefox being drawn on a remote computer.
 Ping times, however, are not that bad:
 
 PING 192.168.26.150: 56 data bytes
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=0. 
 time=0.287 ms
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=1. 
 time=0.279 ms
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=2. 
 time=0.196 ms
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=3. 
 time=0.201 ms
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=4. 
 time=0.159 ms
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=5. 
 time=0.148 ms
 64 bytes from dyn26-1.blueskytours.com (192.168.26.150): icmp_seq=6. 
 time=0.150 ms
 
 Also, wget gets good throughput when retrieving files.
 
 It just seems to be X traffic which is extremely slow.  Using the old
 Linksys 10/100 PCI NIC, emacs comes up virtually instantaneously.  Using the
 integrated Realtek 8111B, emacs takes 10 seconds to draw.
 
 Thank you very much for trying to help.

Any chance that the Realtek 8111B is sharing interrupts with another
device (cat /proc/interrupts)?  Perhaps it is, and the Linksys isn't,
which could explain the difference in behavior.  Just something simple
to check and either rule in or out.

-Bill


Re: [PATCH] TCP: remove initial_ssthresh from Cubic

2007-06-13 Thread Bill Fink
On Wed, 13 Jun 2007, David Miller wrote:

 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 13 Jun 2007 11:31:49 -0700
 
  Maybe it is time to remove BIC?
 
 I don't see any compelling reason, the same could be said
 of the other experimental protocols we include in the tree.

I agree bic should be kept.  As I pointed out, if someone did want
to set the bic/cubic initial_ssthresh to 100 globally, my tests
showed bic's performance during the initial slow start phase was
far superior to cubic's.  I don't know if this is a bug or a
feature with cubic.

-Bill


Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?

2007-06-12 Thread Bill Fink
On Tue, 12 Jun 2007, Stephen Hemminger wrote:

 On Tue, 12 Jun 2007 15:12:58 -0700 (PDT)
 David Miller [EMAIL PROTECTED] wrote:
 
  From: Bill Fink [EMAIL PROTECTED]
  Date: Wed, 16 May 2007 02:44:09 -0400
  
   [EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
   25446 segments retransmited
   20936 fast retransmits
   4503 retransmits in slow start
   4 sack retransmits failed
   
   It then only took 2.14 seconds to transfer 1 GB of data.
   
   That's all for now.
  
  Thanks for all of your testing and numbers Bill.
  
  Inhong et al., we have to do something about this, the issue
  has been known and sitting around for weeks if not months.
  
  How safely can we set the default initial_ssthresh to zero in
  Cubic and BIC?
 
 Yes. set it to zero. The module parameter could even go, and just
 leave the route metric as a way to set/remember it.

Actually, after thinking about this some more, I had second
thoughts about the matter.  For my scenario of an uncongested 10-GigE
path, an initial_ssthresh=0 is definitely what is desired.

But perhaps on a congested link with lots of connections, the
initial_ssthresh=100 setting might have some benefit.  I don't
have an easy way of testing that so I was hoping Injong or someone
else might do that and report back.  If there was a benefit, perhaps
it would be useful to have a per-route option for setting the
initial_ssthresh.  That would leave the question of what to make
the default.  There was also the mystery of why cubic's slow start
performance was so much worse than bic's.  If a real benefit could
be demonstrated for the congested case, and if bic's slow start
behavior could be grafted onto cubic, then bic's current slow start
performance (with initial_ssthresh=100) might serve as an adequate
compromise between performance and not being overly aggressive for
the default behavior.

OTOH just setting it to zero as a default should also be fine as
that's the standard Reno behavior.  I'm leaning in that direction
personally, but I'm possibly biased because of my environment,
where I'm trying to get maximum performance out of 10-GigE WAN
networks that aren't particularly congested normally.
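
For anyone wondering what the knob actually does, the mechanism is
tiny.  Roughly (a sketch from memory of the 2.6.20-era tcp_cubic.c,
not the verbatim source):

    static int initial_ssthresh = 100;      /* module parameter */
    module_param(initial_ssthresh, int, 0644);

    static void bictcp_init(struct sock *sk)
    {
        bictcp_reset(inet_csk_ca(sk));
        /* A nonzero value caps the initial slow start threshold:
         * exponential growth stops at cwnd = initial_ssthresh and
         * the much gentler bic/cubic growth curve takes over.  On a
         * path that needs ~11111 packets in flight, stopping the
         * doubling at 100 packets is what produces the 25 second
         * ramp-up seen in my tests. */
        if (initial_ssthresh)
            tcp_sk(sk)->snd_ssthresh = initial_ssthresh;
    }

With initial_ssthresh=0 the threshold is left effectively infinite and
you get the standard Reno slow start.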

-Bill


Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?

2007-05-16 Thread Bill Fink
 /sys/module/tcp_cubic/parameters/initial_ssthresh
0

[EMAIL PROTECTED] ~]# nuttcp -n1g -i1 -w60m 192.168.89.15
   34.5728 MB /   1.00 sec =  288.7865 Mbps
  108.0847 MB /   1.00 sec =  906.6994 Mbps
  160.3540 MB /   1.00 sec = 1345.0124 Mbps
  180.6226 MB /   1.00 sec = 1515.3385 Mbps
  195.5276 MB /   1.00 sec = 1640.2125 Mbps
  199.6750 MB /   1.00 sec = 1675.0192 Mbps

 1024.0000 MB /   6.70 sec = 1282.1900 Mbps 17 %TX 31 %RX

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
25446 segments retransmited
20936 fast retransmits
4503 retransmits in slow start
4 sack retransmits failed

It only took 6.70 seconds to transfer 1 GB of data.  Note all the
retransmits were fast retransmits.

And finally with the standard aggressive Reno slow start behavior,
with no congestion experienced (I increased the amount of buffering
in the netem delay emulator):

[EMAIL PROTECTED] ~]# nuttcp -n1g -i1 -w60m 192.168.89.15
   69.9829 MB /   1.01 sec =  583.0183 Mbps
  837.8787 MB /   1.00 sec = 7028.6427 Mbps

 1024.0000 MB /   2.14 sec = 4005.2066 Mbps 52 %TX 32 %RX

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
25446 segments retransmited
20936 fast retransmits
4503 retransmits in slow start
4 sack retransmits failed

It then only took 2.14 seconds to transfer 1 GB of data.

That's all for now.

-Bill



 Thanks,
 Sangtae
 
 
 On 5/12/07, Bill Fink [EMAIL PROTECTED] wrote:
  On Thu, 10 May 2007, Injong Rhee wrote:
 
   Oops. I thought Bill was using 2.6.20 instead of 2.6.22 which should 
   contain
   our latest update.
 
  I am using 2.6.20.7.
 
   Regarding slow start behavior, the latest version should not change 
   though.
   I think it would be ok to change the slow start of bic and cubic to the
   default slow start. But what we observed is that when BDP is large,
   increasing cwnd by two times is really an overkill. consider increasing 
   from
   1024 into 2048 packets..maybe the target is somewhere between them. We 
   have
   potentially a large number of packets flushed into the network. That was 
   the
   original motivation to change slow start from the default into a more 
   gentle
   version. But I see the point that Bill is raising. We are working on
   improving this behavior in our lab. We will get back to this topic in a
   couple of weeks after we finish our testing and produce a patch.
 
  Is it feasible to replace the version of cubic in 2.6.20.7 with the
  new 2.1 version of cubic without changing the rest of the kernel, or
  are there kernel changes/dependencies that would prevent that?
 
  I've tried building and running a 2.6.21-git13 kernel, but am having
  some difficulties.  I will be away the rest of the weekend so won't be
  able to get back to this until Monday.
 
  -Bill
 
  P.S.  When getting into the 10 Gbps range, I'm not sure there's
any way to avoid the types of large increases during slow start
that you mention, if you want to achieve those kinds of data
rates.
 
 
 
   - Original Message -
   From: Stephen Hemminger [EMAIL PROTECTED]
   To: David Miller [EMAIL PROTECTED]
   Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED];
   netdev@vger.kernel.org
   Sent: Thursday, May 10, 2007 4:45 PM
   Subject: Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
  
  
On Thu, 10 May 2007 13:35:22 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:
   
From: [EMAIL PROTECTED]
Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
   

 Bill,
 Could you test with the latest version of CUBIC? This is not the
 latest
 version of it you tested.
   
Rhee-sangsang-nim, it might be a lot easier for people if you provide
a patch against the current tree for users to test instead of
constantly pointing them to your web site.
-
   
The 2.6.22 version should have the latest version, that I know of.
 There was a small patch from 2.6.21 that went in.
 
 


Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?

2007-05-12 Thread Bill Fink
On Thu, 10 May 2007, Injong Rhee wrote:

 Oops. I thought Bill was using 2.6.20 instead of 2.6.22 which should contain 
 our latest update.

I am using 2.6.20.7.

 Regarding slow start behavior, the latest version should not change though. 
 I think it would be ok to change the slow start of bic and cubic to the 
 default slow start. But what we observed is that when BDP is large, 
 increasing cwnd by two times is really an overkill. consider increasing from 
 1024 into 2048 packets..maybe the target is somewhere between them. We have 
 potentially a large number of packets flushed into the network. That was the 
 original motivation to change slow start from the default into a more gentle 
 version. But I see the point that Bill is raising. We are working on 
 improving this behavior in our lab. We will get back to this topic in a 
 couple of weeks after we finish our testing and produce a patch.

Is it feasible to replace the version of cubic in 2.6.20.7 with the
new 2.1 version of cubic without changing the rest of the kernel, or
are there kernel changes/dependencies that would prevent that?

I've tried building and running a 2.6.21-git13 kernel, but am having
some difficulties.  I will be away the rest of the weekend so won't be
able to get back to this until Monday.

-Bill

P.S.  When getting into the 10 Gbps range, I'm not sure there's
  any way to avoid the types of large increases during slow start
  that you mention, if you want to achieve those kinds of data
  rates.



 - Original Message - 
 From: Stephen Hemminger [EMAIL PROTECTED]
 To: David Miller [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; 
 netdev@vger.kernel.org
 Sent: Thursday, May 10, 2007 4:45 PM
 Subject: Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?
 
 
  On Thu, 10 May 2007 13:35:22 -0700 (PDT)
  David Miller [EMAIL PROTECTED] wrote:
 
  From: [EMAIL PROTECTED]
  Date: Thu, 10 May 2007 14:39:25 -0400 (EDT)
 
  
   Bill,
   Could you test with the latest version of CUBIC? This is not the
   latest
   version of it you tested.
 
  Rhee-sangsang-nim, it might be a lot easier for people if you provide
  a patch against the current tree for users to test instead of
  constantly pointing them to your web site.
  -
 
  The 2.6.22 version should have the latest version, that I know of.
  There was a small patch from 2.6.21 that went in.


Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?

2007-05-10 Thread Bill Fink

This reinforces my belief that it's best to marry the standard Reno
aggressive initial slow start behavior with the better performance
of bic or cubic during the subsequent steady state portion of the
TCP session.

I can of course achieve that objective by setting initial_ssthresh
to 0, but perhaps that should be made the default behavior.

-Bill



On Wed, 9 May 2007, I wrote:

 Hi Sangtae,
 
 On Tue, 8 May 2007, SANGTAE HA wrote:
 
  Hi Bill,
  
  At this time, BIC and CUBIC use a less aggressive slow start than
  other protocols. Because we observed slow start is somewhat
  aggressive and introduced a lot of packet losses. This may be changed
  to standard slow start in later version of BIC and CUBIC, but, at
  this time, we are still using a modified slow start.
 
 "slow start" is somewhat of a misnomer.  However, I'd argue in favor
 of using the standard slow start for BIC and CUBIC as the default.
 Is the rationale for using a less aggressive slow start to be gentler
 to certain receivers, which possibly can't handle a rapidly increasing
 initial burst of packets (and the resultant necessary allocation of
 system resources)?  Or is it related to encountering actual network
 congestion during the initial slow start period, and how well that
 is responded to?
 
  So, as you observed, this modified slow start behavior may slow for
  10G testing. You can alleviate this for your 10G testing by changing
  BIC and CUBIC to use a standard slow start by loading these modules
  with initial_ssthresh=0.
 
 I saw the initial_ssthresh parameter, but didn't know what it did or
 even what its units were.  I saw the default value was 100 and tried
 increasing it, but I didn't think to try setting it to 0.
 
 [EMAIL PROTECTED] ~]# grep -r initial_ssthresh 
 /usr/src/kernels/linux-2.6.20.7/Documentation/
 [EMAIL PROTECTED] ~]#
 
 It would be good to have some documentation for these bic and cubic
 parameters similar to the documentation in ip-sysctl.txt for the
 /proc/sys/net/ipv[46]/* variables (I know, I know, I should just
 use the source).
 
 Is it expected that the cubic slow start is that much less aggressive
 than the bic slow start (from 10 secs to max rate for bic in my test
 to 25 secs to max rate for cubic)?  This could be considered a performance
 regression since the default TCP was changed from bic to cubic.
 
 In any event, I'm now happy as setting initial_ssthresh to 0 works
 well for me.
 
 [EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
 0 segments retransmited
 
 [EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
 cubic
 
 [EMAIL PROTECTED] ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
 0
 
 [EMAIL PROTECTED] ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
69.9829 MB /   1.00 sec =  584.2065 Mbps
   843.1467 MB /   1.00 sec = 7072.9052 Mbps
   844.3655 MB /   1.00 sec = 7082.6544 Mbps
   842.2671 MB /   1.00 sec = 7065.7169 Mbps
   839.9204 MB /   1.00 sec = 7045.8335 Mbps
   840.1780 MB /   1.00 sec = 7048.3114 Mbps
   834.1475 MB /   1.00 sec = 6997.4270 Mbps
   835.5972 MB /   1.00 sec = 7009.3148 Mbps
   835.8152 MB /   1.00 sec = 7011.7537 Mbps
   830.9333 MB /   1.00 sec = 6969.9281 Mbps
 
  7617.1875 MB /  10.01 sec = 6386.2622 Mbps 90 %TX 46 %RX
 
 [EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
 0 segments retransmited
 
   -Thanks a lot!
 
   -Bill
 
 
 
  Regards,
  Sangtae
  
  
  On 5/6/07, Bill Fink [EMAIL PROTECTED] wrote:
   The initial TCP slow start on 2.6.20.7 cubic (and to a lesser
   extent bic) seems to be way too slow.  With an ~80 ms RTT, this
   is what cubic delivers (thirty second test with one second interval
   reporting and specifying a socket buffer size of 60 MB):
  
   [EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
   0 segments retransmited
  
   [EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
   cubic
  
   [EMAIL PROTECTED] ~]# nuttcp -T30 -i1 -w60m 192.168.89.15
   6.8188 MB /   1.00 sec =   57.0365 Mbps
  16.2097 MB /   1.00 sec =  135.9824 Mbps
  25.4553 MB /   1.00 sec =  213.5420 Mbps
  35.5127 MB /   1.00 sec =  297.9119 Mbps
  43.0066 MB /   1.00 sec =  360.7770 Mbps
  50.3210 MB /   1.00 sec =  422.1370 Mbps
  59.0796 MB /   1.00 sec =  495.6124 Mbps
  69.1284 MB /   1.00 sec =  579.9098 Mbps
  76.6479 MB /   1.00 sec =  642.9130 Mbps
  90.6189 MB /   1.00 sec =  760.2835 Mbps
 109.4348 MB /   1.00 sec =  918.0361 Mbps
 128.3105 MB /   1.00 sec = 1076.3813 Mbps
 150.4932 MB /   1.00 sec = 1262.4686 Mbps
 175.9229 MB /   1.00 sec = 1475.7965 Mbps
 205.9412 MB /   1.00 sec = 1727.6150 Mbps
 240.8130 MB /   1.00 sec = 2020.1504 Mbps
 282.1790 MB /   1.00 sec = 2367.1644 Mbps
 318.1841 MB /   1.00 sec = 2669.1349 Mbps
 372.6814 MB /   1.00 sec = 3126.1687 Mbps
 440.8411 MB /   1.00 sec = 3698.5200 Mbps

Re: 2.6.20.7 TCP cubic (and bic) initial slow start way too slow?

2007-05-09 Thread Bill Fink
Hi Sangtae,

On Tue, 8 May 2007, SANGTAE HA wrote:

 Hi Bill,
 
 At this time, BIC and CUBIC use a less aggressive slow start than
 other protocols. Because we observed slow start is somewhat
 aggressive and introduced a lot of packet losses. This may be changed
 to standard slow start in later version of BIC and CUBIC, but, at
 this time, we are still using a modified slow start.

"slow start" is somewhat of a misnomer.  However, I'd argue in favor
of using the standard slow start for BIC and CUBIC as the default.
Is the rationale for using a less aggressive slow start to be gentler
to certain receivers, which possibly can't handle a rapidly increasing
initial burst of packets (and the resultant necessary allocation of
system resources)?  Or is it related to encountering actual network
congestion during the initial slow start period, and how well that
is responded to?

 So, as you observed, this modified slow start behavior may slow for
 10G testing. You can alleviate this for your 10G testing by changing
 BIC and CUBIC to use a standard slow start by loading these modules
 with initial_ssthresh=0.

I saw the initial_ssthresh parameter, but didn't know what it did or
even what its units were.  I saw the default value was 100 and tried
increasing it, but I didn't think to try setting it to 0.

[EMAIL PROTECTED] ~]# grep -r initial_ssthresh 
/usr/src/kernels/linux-2.6.20.7/Documentation/
[EMAIL PROTECTED] ~]#

It would be good to have some documentation for these bic and cubic
parameters similar to the documentation in ip-sysctl.txt for the
/proc/sys/net/ipv[46]/* variables (I know, I know, I should just
use the source).

Is it expected that the cubic slow start is that much less aggressive
than the bic slow start (from 10 secs to max rate for bic in my test
to 25 secs to max rate for cubic)?  This could be considered a performance
regression since the default TCP was changed from bic to cubic.

In any event, I'm now happy as setting initial_ssthresh to 0 works
well for me.

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
0 segments retransmited

[EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic

[EMAIL PROTECTED] ~]# cat /sys/module/tcp_cubic/parameters/initial_ssthresh
0

[EMAIL PROTECTED] ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
   69.9829 MB /   1.00 sec =  584.2065 Mbps
  843.1467 MB /   1.00 sec = 7072.9052 Mbps
  844.3655 MB /   1.00 sec = 7082.6544 Mbps
  842.2671 MB /   1.00 sec = 7065.7169 Mbps
  839.9204 MB /   1.00 sec = 7045.8335 Mbps
  840.1780 MB /   1.00 sec = 7048.3114 Mbps
  834.1475 MB /   1.00 sec = 6997.4270 Mbps
  835.5972 MB /   1.00 sec = 7009.3148 Mbps
  835.8152 MB /   1.00 sec = 7011.7537 Mbps
  830.9333 MB /   1.00 sec = 6969.9281 Mbps

 7617.1875 MB /  10.01 sec = 6386.2622 Mbps 90 %TX 46 %RX

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
0 segments retransmited

-Thanks a lot!

-Bill



 Regards,
 Sangtae
 
 
 On 5/6/07, Bill Fink [EMAIL PROTECTED] wrote:
  The initial TCP slow start on 2.6.20.7 cubic (and to a lesser
  extent bic) seems to be way too slow.  With an ~80 ms RTT, this
  is what cubic delivers (thirty second test with one second interval
  reporting and specifying a socket buffer size of 60 MB):
 
  [EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
  0 segments retransmited
 
  [EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
  cubic
 
  [EMAIL PROTECTED] ~]# nuttcp -T30 -i1 -w60m 192.168.89.15
  6.8188 MB /   1.00 sec =   57.0365 Mbps
 16.2097 MB /   1.00 sec =  135.9824 Mbps
 25.4553 MB /   1.00 sec =  213.5420 Mbps
 35.5127 MB /   1.00 sec =  297.9119 Mbps
 43.0066 MB /   1.00 sec =  360.7770 Mbps
 50.3210 MB /   1.00 sec =  422.1370 Mbps
 59.0796 MB /   1.00 sec =  495.6124 Mbps
 69.1284 MB /   1.00 sec =  579.9098 Mbps
 76.6479 MB /   1.00 sec =  642.9130 Mbps
 90.6189 MB /   1.00 sec =  760.2835 Mbps
109.4348 MB /   1.00 sec =  918.0361 Mbps
128.3105 MB /   1.00 sec = 1076.3813 Mbps
150.4932 MB /   1.00 sec = 1262.4686 Mbps
175.9229 MB /   1.00 sec = 1475.7965 Mbps
205.9412 MB /   1.00 sec = 1727.6150 Mbps
240.8130 MB /   1.00 sec = 2020.1504 Mbps
282.1790 MB /   1.00 sec = 2367.1644 Mbps
318.1841 MB /   1.00 sec = 2669.1349 Mbps
372.6814 MB /   1.00 sec = 3126.1687 Mbps
440.8411 MB /   1.00 sec = 3698.5200 Mbps
524.8633 MB /   1.00 sec = 4403.0220 Mbps
614.3542 MB /   1.00 sec = 5153.7367 Mbps
718.9917 MB /   1.00 sec = 6031.5386 Mbps
829.0474 MB /   1.00 sec = 6954.6438 Mbps
867.3289 MB /   1.00 sec = 7275.9510 Mbps
865.7759 MB /   1.00 sec = 7262.9813 Mbps
864.4795 MB /   1.00 sec = 7251.7071 Mbps
864.5425 MB /   1.00 sec = 7252.8519 Mbps
867.3372 MB /   1.00 sec = 7246.9232 Mbps
 
  10773.6875 MB /  30.00 sec = 3012.3936 Mbps 38 %TX 25 %RX
 
  [EMAIL PROTECTED] ~]# netstat -s | grep -i

2.6.20.7 TCP cubic (and bic) initial slow start way too slow?

2007-05-06 Thread Bill Fink
The initial TCP slow start on 2.6.20.7 cubic (and to a lesser
extent bic) seems to be way too slow.  With an ~80 ms RTT, this
is what cubic delivers (thirty second test with one second interval
reporting and specifying a socket buffer size of 60 MB):

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
0 segments retransmited

[EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
cubic

[EMAIL PROTECTED] ~]# nuttcp -T30 -i1 -w60m 192.168.89.15
6.8188 MB /   1.00 sec =   57.0365 Mbps
   16.2097 MB /   1.00 sec =  135.9824 Mbps
   25.4553 MB /   1.00 sec =  213.5420 Mbps
   35.5127 MB /   1.00 sec =  297.9119 Mbps
   43.0066 MB /   1.00 sec =  360.7770 Mbps
   50.3210 MB /   1.00 sec =  422.1370 Mbps
   59.0796 MB /   1.00 sec =  495.6124 Mbps
   69.1284 MB /   1.00 sec =  579.9098 Mbps
   76.6479 MB /   1.00 sec =  642.9130 Mbps
   90.6189 MB /   1.00 sec =  760.2835 Mbps
  109.4348 MB /   1.00 sec =  918.0361 Mbps
  128.3105 MB /   1.00 sec = 1076.3813 Mbps
  150.4932 MB /   1.00 sec = 1262.4686 Mbps
  175.9229 MB /   1.00 sec = 1475.7965 Mbps
  205.9412 MB /   1.00 sec = 1727.6150 Mbps
  240.8130 MB /   1.00 sec = 2020.1504 Mbps
  282.1790 MB /   1.00 sec = 2367.1644 Mbps
  318.1841 MB /   1.00 sec = 2669.1349 Mbps
  372.6814 MB /   1.00 sec = 3126.1687 Mbps
  440.8411 MB /   1.00 sec = 3698.5200 Mbps
  524.8633 MB /   1.00 sec = 4403.0220 Mbps
  614.3542 MB /   1.00 sec = 5153.7367 Mbps
  718.9917 MB /   1.00 sec = 6031.5386 Mbps
  829.0474 MB /   1.00 sec = 6954.6438 Mbps
  867.3289 MB /   1.00 sec = 7275.9510 Mbps
  865.7759 MB /   1.00 sec = 7262.9813 Mbps
  864.4795 MB /   1.00 sec = 7251.7071 Mbps
  864.5425 MB /   1.00 sec = 7252.8519 Mbps
  867.3372 MB /   1.00 sec = 7246.9232 Mbps

10773.6875 MB /  30.00 sec = 3012.3936 Mbps 38 %TX 25 %RX

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
0 segments retransmited

It takes 25 seconds for cubic TCP to reach its maximal rate.
Note that there were no TCP retransmissions (no congestion
experienced).

Now with bic (only a 20 second test this time):

[EMAIL PROTECTED] ~]# echo bic > /proc/sys/net/ipv4/tcp_congestion_control
[EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
bic

[EMAIL PROTECTED] ~]# nuttcp -T20 -i1 -w60m 192.168.89.15
9.9548 MB /   1.00 sec =   83.1497 Mbps
   47.2021 MB /   1.00 sec =  395.9762 Mbps
   92.4304 MB /   1.00 sec =  775.3889 Mbps
  134.3774 MB /   1.00 sec = 1127.2758 Mbps
  194.3286 MB /   1.00 sec = 1630.1987 Mbps
  280.0598 MB /   1.00 sec = 2349.3613 Mbps
  404.3201 MB /   1.00 sec = 3391.8250 Mbps
  559.1594 MB /   1.00 sec = 4690.6677 Mbps
  792.7100 MB /   1.00 sec = 6650.0257 Mbps
  857.2241 MB /   1.00 sec = 7190.6942 Mbps
  852.6912 MB /   1.00 sec = 7153.3283 Mbps
  852.6968 MB /   1.00 sec = 7153.2538 Mbps
  851.3162 MB /   1.00 sec = 7141.7575 Mbps
  851.4927 MB /   1.00 sec = 7143.0240 Mbps
  850.8782 MB /   1.00 sec = 7137.8762 Mbps
  852.7119 MB /   1.00 sec = 7153.2949 Mbps
  852.3879 MB /   1.00 sec = 7150.2982 Mbps
  850.2163 MB /   1.00 sec = 7132.5165 Mbps
  849.8340 MB /   1.00 sec = 7129.0026 Mbps

11882.7500 MB /  20.00 sec = 4984.0068 Mbps 67 %TX 41 %RX

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
0 segments retransmited

bic does better but still takes 10 seconds to achieve its maximal
rate.

Surprisingly, venerable reno does the best (only a 10 second test):

[EMAIL PROTECTED] ~]# echo reno > /proc/sys/net/ipv4/tcp_congestion_control
[EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/tcp_congestion_control
reno

[EMAIL PROTECTED] ~]# nuttcp -T10 -i1 -w60m 192.168.89.15
   69.9829 MB /   1.01 sec =  583.5822 Mbps
  844.3870 MB /   1.00 sec = 7083.2808 Mbps
  862.7568 MB /   1.00 sec = 7237.7342 Mbps
  859.5725 MB /   1.00 sec = 7210.8981 Mbps
  860.1365 MB /   1.00 sec = 7215.4487 Mbps
  865.3940 MB /   1.00 sec = 7259.8434 Mbps
  863.9678 MB /   1.00 sec = 7247.4942 Mbps
  864.7493 MB /   1.00 sec = 7254.4634 Mbps
  864.6660 MB /   1.00 sec = 7253.5183 Mbps

 7816.9375 MB /  10.00 sec = 6554.4883 Mbps 90 %TX 53 %RX

[EMAIL PROTECTED] ~]# netstat -s | grep -i retrans
0 segments retransmited

reno achieves its maximal rate in about 2 seconds.  This is what I
would expect from the exponential increase during TCP's initial
slow start.  To achieve 10 Gbps on an 80 ms RTT with 9000 byte
jumbo frame packets would require:

[EMAIL PROTECTED] ~]# bc -l
scale=10
10^10*0.080/9000/8
11111.1111111111

So 11111 packets would have to be in flight during one RTT.
It should take log2(11111)+1 round trips to achieve 10 Gbps
(note bc's l() function is logE):

[EMAIL PROTECTED] ~]# bc -l
scale=10
l(11111)/l(2)+1
14.4397010470

And 15 round trips at 80 ms each gives a total time of:

[EMAIL PROTECTED] ~]# bc -l
scale=10
15*0.080
1.200

So if there is no packet loss (which there wasn't), it should only
take about 1.2 seconds to achieve 10 Gbps.  Only TCP reno is in
this 
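
For reference, the same back-of-the-envelope arithmetic wrapped up as
a small C helper (a sketch under the same no-loss assumption; the
function name is mine):

    #include <math.h>
    #include <stdio.h>

    /* Seconds of classic (Reno) slow start needed to reach rate_bps
     * on a path with an rtt-second round trip and mss-byte segments,
     * assuming no loss and an initial cwnd of 1 segment. */
    static double slowstart_time(double rate_bps, double rtt, double mss)
    {
        double pkts = rate_bps * rtt / (mss * 8);   /* cwnd needed */
        double rounds = log2(pkts) + 1;             /* doublings from 1 */
        return ceil(rounds) * rtt;
    }

    int main(void)
    {
        /* 10 Gbps, 80 ms RTT, 9000-byte jumbo frames: ~11111 segments,
         * 15 round trips, 1.2 seconds, matching the bc figures above. */
        printf("%.1f s\n", slowstart_time(1e10, 0.080, 9000));
        return 0;
    }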

Re: [PATCH 5/5 2.6.21] L2TP: Add PPPoL2TP in-kernel documentation

2007-05-01 Thread Bill Fink
On Mon, 30 Apr 2007, James Chapman wrote:

 Signed-off-by: James Chapman [EMAIL PROTECTED]
 
 Index: linux-2.6.21/Documentation/networking/l2tp.txt
 ===================================================================
 --- /dev/null
 +++ linux-2.6.21/Documentation/networking/l2tp.txt
 @@ -0,0 +1,167 @@
 +This brief document describes how to use the kernel's PPPoL2TP driver
 +to provide L2TP functionality. L2TP is a protocol that tunnels one or
 +more PPP sessions over a UDP tunnel. It is commonly used for VPNs
 +(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
 +network infrastructure.
 +
 +Design
 +==
 +
 +The PPPoL2TP driver, drivers/net/pppol2tp.c, provides a mechanism by
 +which PPP frames carried through an L2TP session are passed through
 +the kernel's PPP subsystem. The standard PPP daemon, pppd, handles all
 +PPP interaction with the peer. PPP network interfaces are created for
 +each local PPP endpoint.

...

 +There are a number of requirements on the userspace L2TP daemon in
 +order to use the pppol2tp driver.
 +
 +1. Use a UDP socket per tunnel.
 +
 +2. Create a single PPPoL2TP socket per tunnel. This is used only for
 +   for communicating with the driver but must remain open while the

"for for" above.

 +   tunnel is active. The driver marks the tunnel socket as an L2TP UDP
 +   encapsulation socket, which hooks up the UDP receive path via
 +   usp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed

Typo.  usp_encap_rcv() -> udp_encap_rcv().

 +   in this special PPPoX socket.
 +
 +3. Create a PPPoL2TP socket per L2TP session. This is typically done
 +   by starting pppd with the pppol2tp plugin and appropriate
 +   arguments. A PPPoL2TP tunnel management socket (Step 2) must be
 +   created before the first PPPoL2TP session socket is created.

-Bill


Re: [ofa-general] Re: IPoIB forwarding

2007-04-30 Thread Bill Fink
On Mon, 30 Apr 2007, Rick Jones wrote:

  What version of the myri10ge driver is this?  With the 1.2.0 version
  that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module
  parameter.
  
  [EMAIL PROTECTED] ~]# modinfo myri10ge | grep -i lro
  [EMAIL PROTECTED] ~]# 
  
  And I've been testing IP forwarding using two Myricom 10-GigE NICs
  without setting any special modprobe parameters.
 
 
 Ethtool -i on the interface reports 1.2.0 as the driver version.

Perhaps it would be useful to have different version strings for
the in-kernel Linux version and the Myricom externally provided
version.  Just a thought.

-Bill


Re: [ofa-general] Re: IPoIB forwarding

2007-04-28 Thread Bill Fink
On Fri, 27 Apr 2007, Rick Jones wrote:

 Bryan Lawver wrote:
  I had so much debugging turned on that it was not the slowing of the 
  traffic but the non-coalescing that was the remedy.  The NIC is a 
  MyriCom NIC and these are easy options to set.
 
 As chance would have it, I've played with some Myricom myri10ge NICs 
 recently, 
 and even disabled large receive offload during some netperf tests :)  It is a 
 modprobe option.  Going back now to the driver source and the README I see :-)
 
 
 <excerpt>
 Troubleshooting
 ===============
 
 Large Receive Offload (LRO) is enabled by default.  This will
 interfere with forwarding TCP traffic.  If you plan to forward TCP
 traffic (using the host with the Myri10GE NIC as a router or bridge),
 you must disable LRO.  To disable LRO, load the myri10ge driver
 with myri10ge_lro set to 0:
 
   # modprobe myri10ge myri10ge_lro=0
 
 Alternatively, you can disable LRO at runtime by disabling
 receive checksum offloading via ethtool:
 
 # ethtool -K eth2 rx off
 
 </excerpt>
 
 rick jones

What version of the myri10ge driver is this?  With the 1.2.0 version
that comes with the 2.6.20.7 kernel, there is no myri10ge_lro module
parameter.

[EMAIL PROTECTED] ~]# modinfo myri10ge | grep -i lro
[EMAIL PROTECTED] ~]# 

And I've been testing IP forwarding using two Myricom 10-GigE NICs
without setting any special modprobe parameters.

-Bill


netdev file size restrictions??? Was: Re: [PATCH 04/14] AF_RXRPC: Provide secure RxRPC sockets for ...

2007-04-27 Thread Bill Fink
On Thu, 26 Apr 2007 20:54:36 +0100, David Howells wrote:

 Provide AF_RXRPC sockets that can be used to talk to AFS servers, or serve
 answers to AFS clients.  KerberosIV security is fully supported.  The patches
 and some example test programs can be found in:
 
   http://people.redhat.com/~dhowells/rxrpc/
 
 This will eventually replace the old implementation of kernel-only RxRPC
 currently resident in net/rxrpc/.
 
 The following documentation is from Documentation/networking/rxrpc.txt:
 
   ======================
   RxRPC NETWORK PROTOCOL
   ======================
...

Did the file size restrictions for netdev somehow get lifted?
I just received this e-mail that my mail client says is 339.3KB
(and a few others that are over 100KB (some well over)).

-Bill


Re: ppp and routing table rules.

2007-03-02 Thread Bill Fink
On Thu, 01 Mar 2007, Ben Greear wrote:

 Ben Greear wrote:
 
 I am sending udp packets through ppp400, and I see them appear on ppp401 
 as expected.
 
 The thing that is bothering me is that all I see on rddVR4 (172.1.2.1) 
 is arps for 172.1.2.2, but the 'tell' IP is that of the
 originating ppp400 link, not the IP of rddVR4, as I expected:
 
 21:47:16.119640 arp who-has 172.1.2.2 tell 11.1.1.3
 21:47:17.119371 arp who-has 172.1.2.2 tell 11.1.1.3
 21:47:18.119254 arp who-has 172.1.2.2 tell 11.1.1.3
 21:47:19.273118 arp who-has 172.1.2.2 tell 11.1.1.3
 
 Unless I'm missing something dumb, a similar setup with all ethernet-ish 
 network devices
 works fine.
 
 I have also enabled arp filtering:
 # Only answer ARPs if it is for the IP on our own interface.
 echo 2 > /proc/sys/net/ipv4/conf/all/arp_ignore
 and for every device used in these routing tables:
 echo 1 > /proc/sys/net/ipv4/conf/[dev]/arp_filter
 
 Any idea what I need to do in order to make  the source IP for the ARP 
 packet correct?

Wouldn't that be controlled by arp_announce?

arp_announce - INTEGER
Define different restriction levels for announcing the local
source IP address from IP packets in ARP requests sent on
interface:
0 - (default) Use any local address, configured on any interface
1 - Try to avoid local addresses that are not in the target's
subnet for this interface. This mode is useful when target
hosts reachable via this interface require the source IP
address in ARP requests to be part of their logical network
configured on the receiving interface. When we generate the
request we will check all our subnets that include the
target IP and will preserve the source address if it is from
such subnet. If there is no such subnet we select source
address according to the rules for level 2.
2 - Always use the best local address for this target.
In this mode we ignore the source address in the IP packet
and try to select local address that we prefer for talks with
the target host. Such local address is selected by looking
for primary IP addresses on all our subnets on the outgoing
interface that include the target IP address. If no suitable
local address is found we select the first local address
we have on the outgoing interface or on all other interfaces,
with the hope we will receive reply for our request and
even sometimes no matter the source IP address we announce.

The max value from conf/{all,interface}/arp_announce is used.

Increasing the restriction level gives more chance for
receiving answer from the resolved target while decreasing
the level announces more valid sender's information.
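
In your case that would mean raising arp_announce to 2 on ppp400, i.e.
echo 2 > /proc/sys/net/ipv4/conf/ppp400/arp_announce, or from a test
program something like this sketch (the helper name is made up):

    #include <stdio.h>

    /* Make ARP requests sent out a device use the best local source
     * address for the target (arp_announce level 2). */
    static int set_arp_announce(const char *dev, int level)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/proc/sys/net/ipv4/conf/%s/arp_announce", dev);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", level);
        return fclose(f);
    }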

-Bill


Re: why do we mangle checksums for v6 ICMP?

2006-11-11 Thread Bill Fink
On Thu, 09 Nov 2006, David Miller wrote:

 From: Brian Haley [EMAIL PROTECTED]
 Date: Thu, 09 Nov 2006 12:32:18 -0500
 
  Al Viro wrote:
 AFAICS, the rules are:
   
   (1) checksum is 16-bit one's complement of the one's complement sum of
   relevant 16bit words.
   
   (2) for v4 UDP all-zeroes has special meaning - no checksum; if you get
   it from (1), send all-ones instead.
   
   (3) for v6 UDP we have the same remapping as in (2), but all-zeroes has
   different meaning - not ignore checksum as in v4, but reject the
   packet.
   
   (4) there is no (4).
   
  IOW, nobody except UDP has any business doing that 0->0xffff
   replacement.  However, we have
  if (icmp6h->icmp6_cksum == 0)
  icmp6h->icmp6_cksum = -1;
  
  This doesn't look necessary, RFCs 4443/2463 don't mention it being 
  necessary, and BSD doesn't do it either.  I'll cook-up a patch to remove 
  that since I was doing some other mods in that codepath.
 
 This is how things look to me too.
 
   and similar in net/ipv6/raw.c
  
  Maybe here it only needs to be done if (fl->proto == IPPROTO_UDP)?
 
 Yes, I believe that is what is needed.

On a raw IPv6 socket, shouldn't the IP checksum just be left
unchanged, so you can test transmission of IPv6 packets with
an invalid zero IP checksum?  Or is raw not fully raw?
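
For what it's worth, userspace does get some say here.  As I
understand the RFC 3542 API (a sketch, not tested against this exact
kernel), a raw IPv6 socket only gets kernel checksumming when the
IPV6_CHECKSUM option is set, and ICMPv6 sockets are the exception
where it is always on, which is the net/ipv6/raw.c path in question:

    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET6, SOCK_RAW, 200);  /* arbitrary proto */
        int offset = -1;    /* -1: kernel leaves the checksum alone */

        /* A non-negative offset would instead tell the kernel where
         * in the payload to fill in the upper-layer checksum. */
        setsockopt(fd, IPPROTO_IPV6, IPV6_CHECKSUM, &offset,
                   sizeof(offset));
        return 0;
    }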

-Bill


Re: [PATCH 9/14] [TIPC] Name publication events now delivered in chronological order

2006-10-13 Thread Bill Fink
FYI,

At least here, I received two copies of patch 9/14 and no copy
of patch 10/14.

-Bill



On Fri, 13 Oct 2006 13:37:50 +0200, Per Liden wrote:

 From: Allan Stephens [EMAIL PROTECTED]
 
 This patch trivially re-orders the entries in TIPC's list of local
 publications so that applications will receive publication events
 in the order they were published.
 
 Signed-off-by: Allan Stephens [EMAIL PROTECTED]
 Signed-off-by: Per Liden [EMAIL PROTECTED]
 ---
  net/tipc/name_distr.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
 index f0b063b..03bd659 100644
 --- a/net/tipc/name_distr.c
 +++ b/net/tipc/name_distr.c
 @@ -122,7 +122,7 @@ void tipc_named_publish(struct publicati
   struct sk_buff *buf;
   struct distr_item *item;
  
 - list_add(&publ->local_list, &publ_root);
 + list_add_tail(&publ->local_list, &publ_root);
   publ_cnt++;
  
   buf = named_prepare_buf(PUBLICATION, ITEM_SIZE, 0);
 -- 
 1.4.1


Re: [PATCH 01/02] net/ipv6: seperate sit driver to extra module

2006-10-06 Thread Bill Fink
On Fri, 6 Oct 2006 17:15:56 +0200, Joerg Roedel wrote:

 +config IPV6_SIT
 + tristate "IPv6: IPv6-in-IPv4 tunnel (SIT driver)"
 + depends on IPV6
 + default y
 + ---help---
 +   Tunneling means encapsulating data of one protocol type within
 +   another protocol and sending it over a channel that understands the
 +   encapsulating protocol. This driver implements encapsulation of IPv6
 +   into IPv4 packets. This is useful if you want to connect two IPv6
 +   networks over an IPv4-only path.
 +
 +   Saying M here will produce a module called sit.ko. If unsure, say N.

From a user perspective, I believe it should say "If unsure, say Y."
The unsure case for the unsure user should be the case that works for
the broadest possible usage spectrum, which would be the 'Y' case.
To put it another way, if you pick 'Y' and don't really need it, the
only downside is wasting some memory.  But if you pick 'N' and actually
did need it, previously working IPv6 networking would no longer work.
I believe the default setting should match the unsure recommendation.

-Bill


Re: [PATCH] ethtool v4: add autoneg advertise feature

2006-08-25 Thread Bill Fink
On Thu, 24 Aug 2006, Michael Chan wrote:

 Jeff Kirsher wrote:
 
  The old way of setting autonegotiation was using the 
  following command:
  ethtool -s ethx speed 100 duplex full auto on
  now the command would be
  ethtool -s ethx auto on advertise 0x08
  both commands would result in only advertising 100 FULL.
  
  There still needs to be a change made to the man file to reflect the
  change in the behavior of ethtool, which I have not done.  But this
  patch will allow for greater flexibility in setting autonegotiation
  speeds.
 
 It is more flexible, but less intuitive.  The user now has to
 remember hex values instead of the more intuitive speed and
 duplex.  Perhaps we can keep the old method of using speed and
 duplex, while adding the new method of specifying hex values? 

I agree.  Something like:

ethtool -s ethx auto on advertise mode1+mode2+...+moden

For example:

ethtool -s ethx auto on advertise 100-half+100-full

to set speed 100 either half or full duplex.

Maybe have some abbreviations such as 100-all (same as above) or
all-half (for all supported half duplex) or just all (for all supported
modes), which I suppose is the default.

Just an idea.
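
The parsing wouldn't have to be ugly, either.  A sketch of what I have
in mind (the hex values are the ADVERTISED_* bits from
linux/ethtool.h; parse_advertise() is a made-up name):

    #include <string.h>

    /* OR together a '+'-separated mode list like "100-half+100-full".
     * Returns 0 if any mode name is unknown. */
    static unsigned int parse_advertise(char *arg)
    {
        static const struct { const char *name; unsigned int bit; } tab[] = {
            { "10-half",   0x01 }, { "10-full",   0x02 },
            { "100-half",  0x04 }, { "100-full",  0x08 },
            { "1000-half", 0x10 }, { "1000-full", 0x20 },
        };
        unsigned int mask = 0;
        char *tok;

        for (tok = strtok(arg, "+"); tok; tok = strtok(NULL, "+")) {
            unsigned int i, found = 0;

            for (i = 0; i < sizeof(tab) / sizeof(tab[0]); i++)
                if (!strcmp(tok, tab[i].name)) {
                    mask |= tab[i].bit;
                    found = 1;
                }
            if (!found)
                return 0;
        }
        return mask;
    }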

-Bill


Re: [PATCH] ethtool v4: add autoneg advertise feature

2006-08-25 Thread Bill Fink
On Fri, 25 Aug 2006, Jeff Kirsher wrote:

 On 8/25/06, Bill Fink [EMAIL PROTECTED] wrote:
 
  I agree.  Something like:
 
  ethtool -s ethx auto on advertise mode1+mode2+...+moden
 
  For example:
 
  ethtool -s ethx auto on advertise 100-half+100-full
 
  to set speed 100 either half or full duplex.
 
  Maybe have some abbreviations such as 100-all (same as above) or
  all-half (for all supported half duplex) or just all (for all supported
  modes), which I suppose is the default.
 
  Just an idea.
 
  -Bill
 
 
 I agree that using a hex value is less intuitive, but with proper
 documentation in the man file it would be easily understood.  It is
 also easier to state
 ethtool -s ethx autoneg on advertise 0x0F
 than it would be to do:
 ethtool -s ethx autoneg on advertise 100-half+100-full+10-half+10-full

This could be abbreviated to:

ethtool -s ethx autoneg on advertise 100-all+10-all

 Not that it is impossible to do, but the code to do the parsing would
 not be as clean as it is to use a hex value.  Currently ethtool
 already uses numeric values for messagelevel, phyad and sopass.  So I
 am not suggesting something completely new.  I have already submitted
 a patch to keep the old functionality while adding the new.  Only
 thing left for this is to create the manual documentation so that
 users can easily understand how to use the functionality.
 
 10-half   = 0x01
 10-full   = 0x02
 100-half  = 0x04
 100-full  = 0x08
 1000-half = 0x10 (actually not supported by IEEE standards)

I thought the above wasn't a supported option.

 1000-full = 0x20
 auto      = 0x00 or 0x3F
 
 In addition, the code already tests the value that the user enters against
 what is supported and only displays the supported values.

I agree that with decent documentation, use of the hex values shouldn't
be that difficult for most users, although using hex arithmetic might
be Greek to some.  I was just suggesting a possible alternative, but
I admit it's a fairly minor issue one way or the other.

-Bill


Re: [PATCH 03/18] d80211: pointers as extended booleans

2006-08-22 Thread Bill Fink
On Mon, 21 Aug 2006, Johannes Berg wrote:

 Please review carefully, the task was so boring that I might have made
 stupid mistakes.
 ---
 This huge patch changes d80211 to treat pointers as extended booleans,
 using if (!ptr) and if (ptr) instead of comparisons with NULL.
 
 Signed-off-by: Johannes Berg [EMAIL PROTECTED]
 
 --- wireless-dev.orig/net/d80211/ieee80211_scan.c 2006-08-20 
 14:56:09.738192788 +0200
 +++ wireless-dev/net/d80211/ieee80211_scan.c  2006-08-20 14:56:17.398192788 
 +0200
[...]
 @@ -1105,8 +1105,8 @@ __ieee80211_tx_prepare(struct ieee80211_
   tx->fragmented = local->fragmentation_threshold <
   IEEE80211_MAX_FRAG_THRESHOLD && tx->u.tx.unicast &&
   skb->len + 4 /* FCS */ > local->fragmentation_threshold &&
 - (local->hw->set_frag_threshold == NULL);
 - if (tx->sta == NULL)
 + (!local->hw->set_frag_threshold);
 + if (!tx->sta)
   control->clear_dst_mask = 1;
   else if (tx->sta->clear_dst_mask) {
   control->clear_dst_mask = 1;
[...]

Just a minor nit.  I don't believe the () on the first new line
are needed.

-Bill


Re: [PATCH 1/1] network memory allocator.

2006-08-15 Thread Bill Fink
On Tue, 15 Aug 2006, Evgeniy Polyakov wrote:

 On Tue, Aug 15, 2006 at 03:49:28PM +0200, Peter Zijlstra ([EMAIL PROTECTED]) 
 wrote:
 
  It could if you can provide adequate detection of memory pressure and
  fallback to a degraded mode within the same allocator/stack and can
  guarantee limited service to critical parts.
 
 It is not needed, since network allocations are separated from main
 system ones.
 I think I need to show an example here.
 
 Let's say the main system works only with TCP for simplicity.
 Let's say the maximum allowed memory is limited to 1mb (it is 768k on a machine
 with 1gb of ram).

The maximum amount of memory available for TCP on a system with 1 GB
of memory is 768 MB (not 768 KB).

[EMAIL PROTECTED] ~]$ cat /proc/meminfo
MemTotal:  1034924 kB
...

[EMAIL PROTECTED] ~]$ cat /proc/sys/net/ipv4/tcp_mem
98304   131072  196608

Since tcp_mem is in pages (4K in this case), maximum TCP memory
is 196608*4K or 768 MB.

Or am I missing something obvious.

-Bill


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-06-25 Thread Bill Fink
On Sun, 25 Jun 2006, Harry Edmon wrote:

 I understand the saying beggars can't be choosers, but I have heard nothing 
 on 
 this issue since June 19th.  Does anyone have any ideas on what is going on?  
 Is 
 there more information I can collect that would help diagnose this problem?  
 And 
 again, thanks for any and all help!

Harry,

I'd suggest checking all the ethtool configuration settings
(ethtool -a, -c, -g, -k) and statistics (ethtool -S) for both
the working and problematic kernels, and then comparing them
to see if anything jumps out at you.  Also compare ifconfig
settings and dmesg output.  Check /proc/interrupts to see if
there is any difference with the interrupt routing.  Check
sysctl.conf and rc.local for any special system configuration
or device settings that might differ between the systems.

The one thing that has caused me a lot of network performance
issues on e1000 is having TSO enabled, so if that is enabled
(check with ethtool -k), then I'd try disabling it to see if
that helps.
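
For scripting that, disabling TSO is just the ETHTOOL_STSO ioctl that
ethtool -K issues under the hood; roughly (a sketch, where fd is any
open socket, e.g. socket(AF_INET, SOCK_DGRAM, 0)):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>

    /* Equivalent of "ethtool -K dev tso off". */
    static int disable_tso(int fd, const char *dev)
    {
        struct ethtool_value ev = { .cmd = ETHTOOL_STSO, .data = 0 };
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, dev, IFNAMSIZ - 1);
        ifr.ifr_data = (char *)&ev;
        return ioctl(fd, SIOCETHTOOL, &ifr);
    }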

-Hope this helps

-Bill


Re: reminder, 2.6.18 window...

2006-05-25 Thread Bill Fink
On Wed, 24 May 2006, Jeff Garzik wrote:

 Brent Cook wrote:
  Note that this is just clearing the hardware statistics on the interface, 
  and 
  would not require any kind of atomic_increment addition for interfaces that 
  support that. It would be kind-of awkward to implement this on drivers that 
   
  increment stats in hardware though (lo, vlan, br, etc.) This also brings up 
  the question of resetting the stats for 'netstat -s'
 
 If you don't atomically clear the statistics, then you are leaving open 
 a window where the stats could easily be corrupted, if the network 
 interface is under load.
 
 This 'clearing' operation has implications on the rest of the statistics 
 usage.
 
 More complexity, and breaking of apps, when we could just use the 
 existing, working system?  I'll take the do nothing, break nothing, 
 everything still works route any day.

I'll admit to not knowing all the intricacies of the kernel coding involved,
but I don't offhand see how zeroing the stats would be significantly more
complex than updating the stats during normal usage.  But I'll have to
leave that argument to the experts.

To me the main argument is that such a stat zeroing feature would be
extremely useful.  When trying to track down nasty networking problems
that traverse a multitude of devices, it is often highly desirable to
zero the interface statistics on all the interfaces in the path (which
is available on all networking switches and routers I have worked with),
run some kind of stress test across the path, and then examine the packet
and error counters on all the involved interfaces.  This makes it easy to
pinpoint where packets are getting lost or errors are being introduced,
especially when there are scores of stats per device and you may not even
know a priori exactly what you are looking for.  Using such a scheme, the
human mind can quickly discern patterns in the data and focus in on any
likely problem areas.  The human mind (at least speaking for myself) is
not nearly as adept when having to deal with deltas.  Yes, you can record
the initial state of all the devices, run the stress test, record the new
state of all the devices, and then spend a large amount of time devising
a script to calculate all the deltas for all the scores of variables on
all the involved devices, and then finally try and figure out what is
wrong.  But it would be so much better, easier, and more efficient, if
the kernel simply provided such a feature that almost all other networking
devices provide.

I also think the SNMP/mgt apps argument is specious.  A) SNMP isn't even
an issue with all networks.  B) As has been pointed out by others, there
is no requirement to have to use such a new stats zeroing feature.  It
would simply be a tool in the network engineer's toolbelt, just like
possibly taking an interface down and back up to see if it corrects a
problem.  The network engineer has to balance the potential benefit/harm
of any action he chooses to take, but let him have that choice.  And C)
I don't think any decent SNMP/mgt app will be particularly bothered by
zeroing interface stats.  I believe they are fairly decent about dealing
with such events (I don't recall our MRTG graphs getting any giant spikes
when I've zeroed interface stats on our GigE/10-GigE switches).  I think
the main harm in such a case would be the loss of a sampling interval.

-Bill


Re: reminder, 2.6.18 window...

2006-05-25 Thread Bill Fink
On Wed, 24 May 2006, Phil Dibowitz wrote:

 Right. I think the point here is that it does _NOT_ inherently break
 things. If you don't like the behavior, don't run ethtool -z eth0,
 it's that simple.
 
 A co-worker suggested today that maybe it'd appease people if the final
 ethtool patch made it a capital option that you can only run by itself.
 I.e. if you can't call it with anything else, it's more difficult to
 call by accident.  I'd be willing to do this.

I think that's a good idea.  Since it is changing (zeroing) the stats,
it probably should be a capital option.

-Bill


Re: reminder, 2.6.18 window...

2006-05-25 Thread Bill Fink
On Thu, 25 May 2006, Brent Cook wrote:

 On Thursday 25 May 2006 02:23, Bill Fink wrote:
  On Wed, 24 May 2006, Jeff Garzik wrote:
   Brent Cook wrote:
Note that this is just clearing the hardware statistics on the
interface, and would not require any kind of atomic_increment addition
for interfaces that support that. It would be kind-of awkward to
implement this on drivers that increment stats in hardware though (lo,
vlan, br, etc.) This also brings up the question of resetting the stats
for 'netstat -s'
  
   If you don't atomically clear the statistics, then you are leaving open
   a window where the stats could easily be corrupted, if the network
   interface is under load.
  
   This 'clearing' operation has implications on the rest of the statistics
   usage.
  
   More complexity, and breaking of apps, when we could just use the
   existing, working system?  I'll take the do nothing, break nothing,
   everything still works route any day.
 
  I'll admit to not knowing all the intricacies of the kernel coding
  involved, but I don't offhand see how zeroing the stats would be
  significantly more complex than updating the stats during normal usage. 
  But I'll have to leave that argument to the experts.
 
 What it boils down to is that currently, only a single CPU or thread ever touches 
 the stats concurrently, so it doesn't have to lock them or do anything 
 special to ensure that they continue incrementing. If you want to make sure 
 that the statistics actually reset when you want them to, you have to account 
 for this case:
 
   CPU0 reads current value from memory (increment)
   CPU1 writes 0 to current value in memory (reset)
   CPU0 writes incremented value to memory (increment complete)
 
 Check out do_add_counters() in net/ipv4/netfilter/ip_tables.c
 to see what's required to do this reliably in the kernel.

Thanks for the info.  I have a possibly naive question.  Would it
increase the reliability of clearing the stats using lazy zeroing
(no locking), if the zeroing app (ethtool) bound itself to the same
CPU that was handling interrupts for the device (assuming no sharing
of interrupts across CPUs)?

 The current patch is fine if your hardware implements the required atomicity 
 itself. Otherwise, you need a locking infrastructure to extend it to all 
 network devices if you want zeroing to always work. What I'm seeing here in 
 response to this is that it doesn't matter if zeroing just _mostly_ works, 
 which is what you would get if you didn't lock. Eh, I'm OK with that too, but 
 I think people are worried about the bugs that would get filed by admins when 
 just zeroing the stats on cheap NIC x only works 90% of the time, less under 
 load. Or not at all (not implemented in driver.) Then you're back to the 
 userspace solution or actually implement stat locking / atomic ops.

I would be fine with the lazy clearing of the stats (with a note
describing the limitations in the ethtool man page).  Being somewhat
anal, I would always check that the stats had in fact been zeroed
successfully before proceeding.  BTW I am in 100% agreement not to
do anything that would affect performance of the fast path, as I
understand proper locking would necessitate.
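
One lock-free compromise that wouldn't touch the fast path at all
would be to never actually zero anything: keep the counters monotonic
and have "reset" record a baseline that reads subtract out.  A sketch
(assuming an atomic64_t as later kernels provide; the names are mine):

    struct resettable_stat {
        atomic64_t count;   /* only ever incremented */
        atomic64_t base;    /* snapshot taken at reset time */
    };

    static inline void stat_inc(struct resettable_stat *s)
    {
        atomic64_inc(&s->count);    /* hot path: plain atomic add */
    }

    static inline void stat_reset(struct resettable_stat *s)
    {
        /* "Zeroing" is just remembering where we were. */
        atomic64_set(&s->base, atomic64_read(&s->count));
    }

    static inline u64 stat_read(struct resettable_stat *s)
    {
        return atomic64_read(&s->count) - atomic64_read(&s->base);
    }

A racing increment can still slip in between the read and the set in
stat_reset(), but the worst case is a reading off by a packet or two,
never a corrupted counter.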

I will also look into the beforeafter utility that has been suggested,
to see how easy it is to use and how much extra work would be required
over just a direct visual examination of the interface statistics.

-Bill


Re: [Bugme-new] [Bug 6309] New: SO_RCVBUF doubled on set not halved on get

2006-03-31 Thread Bill Fink
On Thu, 30 Mar 2006, David S. Miller wrote:

 From: Bill Fink [EMAIL PROTECTED]
 Date: Fri, 31 Mar 2006 01:58:35 -0500
 
  I don't think it makes perfect sense.  If there's overhead, fine, go
  ahead and add the overhead, but do it under the covers and invisible
  to the user.
 
 How in the world would you ever be able to figure out what
 value the kernel is using for the receive buffer?

I guess it depends on what exactly SO_SNDBUF/SO_RCVBUF actually
defines, whether it should include the kernel overhead or not.
I would argue that it should only specify the actual amount of
data space for buffering user application data in the kernel,
and the kernel just allocates whatever additional memory is required
for overhead to ensure the user gets the amount of data buffering
requested for the application.  There could be a separate mechanism
(perhaps /proc) for monitoring the actual total amount of kernel
memory being used for the user socket.

 It also isn't an exact science, doubling the value is best
 effort.

You've convinced me from your following example that the overhead
can be substantial.  But to me the variability of the kernel
overhead, from fairly small proportionally for high performance
bulk data TCP transfers to quite large proportionally for a high
volume stream of small UDP packets, is another argument that
the SO_SNDBUF/SO_RCVBUF arguments to the {set,get}sockopt() calls
shouldn't include the kernel overhead.

 For example, if you, for example, receive a lot of tiny UDP packets,
 wherein the struct sk_buff overhead far exceeds the amount of data
 in the packet, it still might not work out.  You could specify 100K
 and only be able to receive say 60K of receive data in the socket
 at once.

Another reason the SO_SNDBUF/SO_RCVBUF values shouldn't include
the kernel overhead.  If the user requests 100K of kernel data
buffering, then they should get 100K.  It shouldn't matter to the
user that the kernel would actually be using a total of 167K of
memory to satisfy the request for 100K of data buffering.
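
The behavior is easy to demonstrate from userspace (a sketch; the
doubled number is what a kernel of this era reports back):

    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int req = 100 * 1024, got;
        socklen_t len = sizeof(got);

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
        /* Prints "requested 102400, kernel reports 204800": the
         * doubled value, not the 100K of data space requested. */
        printf("requested %d, kernel reports %d\n", req, got);
        return 0;
    }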

-Bill


Re: [Bugme-new] [Bug 6309] New: SO_RCVBUF doubled on set not halved on get

2006-03-30 Thread Bill Fink
On Thu, 30 Mar 2006, Mark Butler wrote:

 David S. Miller wrote:
 
 This has been this way for centuries and it's the correct behavior.
 
 We double it on the way in to account for struct sk_buff etc.
 overhead, applications assume that the SO_RCVBUF setting they make
 will allow that much actual data to be received on that socket.
 Applications are unaware that struct sk_buff and other overheads
 allocate from the receive buffer during socket buffer allocation.
 
 And after considering the possible alternatives, returning the value
 we actually used on get is the most desirable behavior.
 
 Doubling the value passed via setsockopt(..., SO_RCVBUF,...) makes 
 perfect sense.

I don't think it makes perfect sense.  If there's overhead, fine, go
ahead and add the overhead, but do it under the covers and invisible
to the user.  And doubling definitely doesn't make sense.  For example,
on a 10-Gbps transcontinental link with a 90 ms RTT, the sender
SO_SNDBUF and receiver SO_RCVBUF should be the BW*RTT product, which
in MB is 0.090*10^10/1024/1024/8 = 107 MB.  Doubling that gives
107 MB for overhead, which seems a mite excessive (and there are paths
in active use with double or more that RTT).

 But what is the rationale for returning the doubled 
 value back in getsockopt(..., SO_RCVBUF, )?
 
 All it appears to do is make applications believe / report they have 
 more buffer space than is actually available.

I definitely agree with this part.  The user only cares that their
application actually obtained the amount of buffer space they requested
for real user data, and not how much kernel overhead was required for
managing that buffer space.

Further complicating matters is that you don't actually even get what
you requested when it comes to the receive window that's actually
advertised on the network wire.  Earlier kernels would only give you
a receive window that was 3/4 the requested SO_RCVBUF, so to get the
desired optimum network performance you would have to multiply your
desired SO_RCVBUF by 4/3.

The 2.6.15.4 kernel I am currently running is even funkier.  It advertises
a fixed value for the receive window (scaled by the window scale factor)
regardless of the requested SO_RCVBUF.

Here's a test with an 80 MB requested receiver SO_RCVBUF
(and also an 80 MB sender SO_SNDBUF):

chance4 (192.168.88.8) - chance5 (192.168.88.9):

[EMAIL PROTECTED] nuttcp -w80m 192.168.88.9
 6069.0625 MB /  10.01 sec = 5086.1838 Mbps 100 %TX 74 %RX

tcpdump of beginning of transfer showing wscale is 12:

tcpdump: listening on eth0
01:01:20.490078 192.168.88.8.44379 > 192.168.88.9.5001: S [tcp sum ok] 
2540322474:2540322474(0) win 17920 <mss 8960,sackOK,timestamp 410221719 
0,nop,wscale 12> (DF) (ttl 64, id 16957, len 60)
01:01:20.492120 192.168.88.9.5001 > 192.168.88.8.44379: S [tcp sum ok] 
2569611102:2569611102(0) ack 2540322475 win 17896 <mss 8960,sackOK,timestamp 
410302705 410221719,nop,wscale 12> (DF) (ttl 64, id 0, len 60)
...

tcpdump near the end of transfer showing advertised receive window:

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:01:24.563081 192.168.88.9.5001 > 192.168.88.8.44379: . [tcp sum ok] 1:1(0)
ack 4294005300 win 19203 <nop,nop,timestamp 410303112 410222126> (DF) (ttl 64,
id 48880, len 52)
...

So the advertised receive window is 19203*2^12/1024^2 = 75 MB.

Now here's a test with a 100 MB requested SO_RCVBUF:

[EMAIL PROTECTED] nuttcp -w100m 192.168.88.9
 5996.7500 MB /  10.02 sec = 5020.6207 Mbps 100 %TX 75 %RX

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
01:10:49.202097 192.168.88.8.53177 > 192.168.88.9.5001: S [tcp sum ok]
3122099198:3122099198(0) win 17920 <mss 8960,sackOK,timestamp 410278583
0,nop,wscale 12> (DF) (ttl 64, id 10569, len 60)
01:10:49.204184 192.168.88.9.5001 > 192.168.88.8.53177: S [tcp sum ok]
3164733525:3164733525(0) ack 3122099199 win 17896 <mss 8960,sackOK,timestamp
410359569 410278583,nop,wscale 12> (DF) (ttl 64, id 0, len 60)
...

Still a wscale of 12.

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
...
01:10:54.835437 192.168.88.9.5001 > 192.168.88.8.53177: . [tcp sum ok] 1:1(0)
ack 4294041092 win 19203 <nop,nop,timestamp 410360132 410279146> (DF) (ttl 64,
id 34999, len 52)
...

Hmmm, that same win 19203, giving a 75 MB advertised window,
compared with the requested 100 MB.
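
Converting the on-the-wire win field to bytes is just a shift by
the negotiated window scale, as this little program shows:

#include <stdio.h>

int main(void)
{
	unsigned int win = 19203;	/* win field from the trace  */
	unsigned int wscale = 12;	/* negotiated window scale   */
	unsigned long bytes = (unsigned long)win << wscale;

	printf("%lu bytes = %.0f MB\n", bytes, bytes / 1048576.0);
	/* 19203 << 12 = 78655488 bytes, i.e. 75 MB */
	return 0;
}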

And here's a test with a 60 MB requested SO_RCVBUF:

[EMAIL PROTECTED] nuttcp -w60m 192.168.88.9
 6229.3750 MB /  10.02 sec = 5215.3106 Mbps 100 %TX 77 %RX

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -c 5 port 5001
tcpdump: listening on eth0
01:13:58.522721 192.168.88.8.40883 > 192.168.88.9.5001: S [tcp sum ok]
3319987801:3319987801(0) win 17920 <mss 8960,sackOK,timestamp 410297513
0,nop,wscale 12> (DF) (ttl 64, id 23280, len 60)
01:13:58.524777 192.168.88.9.5001 > 192.168.88.8.40883: S [tcp sum ok]
3367196353:3367196353(0) ack 3319987802 win 17896 <mss 8960,sackOK,timestamp

Re: tg3 breakage this morning

2006-03-24 Thread Bill Fink
On Fri, 24 Mar 2006, walt wrote:

 Michael Chan wrote:
  Walt wrote:
  
  Nope, it was the second one:  Skip phy power down...
 
  It doesn't make sense. This code should have no effect on your
  5702. With or without this patch, the 5702 will be powered down
  the same with tg3_writephy(tp, MII_BMCR, BMCR_PDOWN) if WOL
  is not enabled when you ifdown.
  
  Also, for this code to have any effect, you must do ifdown or
  suspend. So presumably the driver loaded fine at least once and
  you get the failure during subsequent modprode...
 
 I confess I'm a bit confused by your question.  I have no idea
 why an ifdown would be executed during boot, but the startup
 scripts are so complicated that I can't understand what they do.
 
 The network script does print a message that the eth0 interface
 doesn't exist, so I assume that the script tried to use ifconfig
 to do something and failed (wouldn't that most likely be an ifup
 rather than ifdown, during bootup?).

The eth0 interface not existing does sound somewhat like the tg3
problem I had.  Did you check if you had PCI_MMCONFIG enabled in
your config?

-Bill


Re: tg3 breakage this morning

2006-03-23 Thread Bill Fink
On Thu, 23 Mar 2006, walt wrote:

 Hi,
 I emailed Dave Miller, who re-directed me to this mail list.
 
 I update from Linus's git repository every morning, and today
 I got this error:
 
 tg3.c:v3.53 (Mar 22, 2006)
 PCI: Enabling device 0000:00:09.0 (0014 -> 0016)
 ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 18 (level, low) -> IRQ 17
 tg3: Could not obtain valid ethernet address, aborting.
 ACPI: PCI interrupt for device 0000:00:09.0 disabled
 tg3: probe of 0000:00:09.0 failed with error -22
 
 Here is the dmesg from yesterday:
 tg3.c:v3.52 (Mar 06, 2006)
 PCI: Enabling device 0000:00:09.0 (0014 -> 0016)
 ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 18 (level, low) -> IRQ 17
 eth0: Tigon3 [partno(BCM95702A20) rev 1002 PHY(5703)] (PCI:33MHz:32-bit)
 10/100/1000BaseT Ethernet 00:e0:18:d2:a6:c1
 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1]
 TSOcap[1]
 eth0: dma_rwctrl[763f] dma_mask[64-bit]
 
 I'd be happy to try patches or whatever if I can help with debugging.

Hi Walt,

I don't know if it's at all related, but I just tried building a 2.6.15.6
kernel, and all my built-in tg3 interfaces disappeared (on an AMD x86-64
system).  In my case I only got the messages:

Mar 20 14:45:22 vela ifup: tg3 device eth0 does not seem to be present, delaying
 initialization.
Mar 20 14:45:22 vela network: Bringing up interface eth0:  failed
Mar 20 14:45:22 vela ifup: tg3 device eth1 does not seem to be present, delaying
 initialization.
Mar 20 14:45:22 vela network: Bringing up interface eth1:  failed

In my case I made 2 changes to my config to get working tg3 interfaces
once more.  I enabled ACPI and then I enabled PCI_MMCONFIG (which was
dependent on ACPI being enabled).  Perhaps enabling PCI_MMCONFIG is
the critical step, but because of the dependency on ACPI, I had to
enable it first.  I've been meaning to test what would happen with
ACPI enabled but PCI_MMCONFIG disabled, but I haven't had time yet.

I believe the default for PCI_MMCONFIG is disabled, so in case it is
important, you might want to try enabling it if it isn't already enabled
in your config.

-Bill


Re: 2.6.12.6 to 2.6.14.3 Major 10-GigE TCP Network Performance Degradation

2006-01-04 Thread Bill Fink
On Tue, 3 Jan 2006, Stephen Hemminger wrote:

 On Wed, 28 Dec 2005 22:35:50 -0500
 Bill Fink [EMAIL PROTECTED] wrote:
 
  Would the following patch be at all useful for the 2.6.14.x stable
  series, since enabling TSO there causes a 40% or greater TCP performance
  penalty, or is 2.6.15 final so imminent that it wouldn't be
  considered useful?
  
  Signed-off-by: Bill Fink [EMAIL PROTECTED]
  
  --- linux-2.6.14.3.orig/drivers/net/ixgb/ixgb_main.c 2005-11-24 
  17:10:21.0 -0500
  +++ linux-2.6.14.3/drivers/net/ixgb/ixgb_main.c 2005-12-28 
  01:06:05.0 -0500
  @@ -445,7 +445,8 @@
 NETIF_F_HW_VLAN_RX |
 NETIF_F_HW_VLAN_FILTER;
   #ifdef NETIF_F_TSO
  -   netdev->features |= NETIF_F_TSO;
  +   /* TSO not performant at present - disable by default */
  +   netdev->features &= ~NETIF_F_TSO;
   #endif
   
  if(pci_using_dac)
 
 doesn't make sense to patch just one driver.  It would make more sense
 to backport the TSO cwnd patch from 2.6.15

I agree but I don't currently have the knowledge or time to do that,
and it's probably moot since I believe 2.6.15 final is now out.

Also, I suspect it's really only an issue for 10-GigE NICs.  Testing
with an Intel quad-GigE NIC didn't show any problem:

[EMAIL PROTECTED] ifconfig eth5
eth5  Link encap:Ethernet  HWaddr 00:04:23:08:8A:46
  inet addr:192.168.5.75  Bcast:192.168.5.255  Mask:255.255.255.0
  inet6 addr: fe80::204:23ff:fe08:8a46/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
...

[EMAIL PROTECTED] ethtool -k eth5
Offload parameters for eth5:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

(and likewise on chance4 (192.168.5.78))

TSO enabled:

chance% nuttcp -w1m 192.168.5.78
 1183.7128 MB /  10.03 sec =  989.7925 Mbps 11 %TX 8 %RX
chance% nuttcp -r -w1m 192.168.5.78
 1183.7982 MB /  10.03 sec =  989.7653 Mbps 11 %TX 8 %RX

[EMAIL PROTECTED] ethtool -K eth5 tso off
[EMAIL PROTECTED] ethtool -k eth5
Offload parameters for eth5:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

(and likewise on chance4 (192.168.5.78))

TSO disabled:

chance% nuttcp -w1m 192.168.5.78
 1182.5833 MB /  10.02 sec =  989.7864 Mbps 13 %TX 8 %RX
chance% nuttcp -r -w1m 192.168.5.78
 1182.1543 MB /  10.02 sec =  989.5802 Mbps 12 %TX 8 %RX

-Bill

P.S.  I can't currently get to www.kernel.org, getting the following error:

Forbidden

You don't have permission to access / on this server.

Additionally, a 404 Not Found error was encountered while trying to use an 
ErrorDocument to handle the request.


Re: 2.6.12.6 to 2.6.14.3 Major 10-GigE TCP Network Performance Degradation

2005-12-28 Thread Bill Fink
On Thu, 15 Dec 2005, Stephen Hemminger wrote:

 On Fri, 16 Dec 2005 03:26:31 +0100
 Andi Kleen [EMAIL PROTECTED] wrote:
 
  On Thu, Dec 15, 2005 at 08:35:32PM -0500, Bill Fink wrote:
   On Fri, 16 Dec 2005, Andi Kleen wrote:
   
 It appears that it is getting CPU starved for some reason (note the
 43%/40% transmitter CPU usage versus the 99%/99% CPU usage for the
 2.6.12.6 case).

What happens when you turn off tso in ethtool?
   
   Thanks!!!  That did the trick.
  
  TSO is still a bit of work in progress. The old 2.6.12 TSO code
  actually ignored the congestion window and was illegal in benchmarks etc
  (and might have even been dangerous to the internet). That was fixed,
  but performance still didn't fully recover. It's a tricky problem.
 
 And it wasn't till 2.6.15 that we got the fix in to correctly
 increase cwnd with TSO.

Update:

I just now tested with 2.6.15-rc7 and it seems to work fine with
TSO enabled.

chance% cat /proc/version
Linux version 2.6.15-rc7-bf-smp ([EMAIL PROTECTED]) (gcc version 3.2 20020903 
(Red Hat Linux 8.0 3.2-7)) #1 SMP Wed Dec 28 19:35:55 EST 2005

[EMAIL PROTECTED] ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on

chance% nuttcp -w2m 192.168.88.8
 6054.1250 MB /  10.01 sec = 5073.1546 Mbps 100 %TX 72 %RX
chance% nuttcp -r -w2m 192.168.88.8
 6090.4375 MB /  10.01 sec = 5103.6174 Mbps 100 %TX 70 %RX

That's expected TCP performance levels of slightly over 5 Gbps,
although I thought I might get some CPU back with TSO enabled
(note the transmitter is still pegged at 100%).

Would the following patch be at all useful for the 2.6.14.x stable
series, since enabling TSO there causes a 40% or greater TCP performance
penalty, or is 2.6.15 final so imminent that it wouldn't be
considered useful?

Signed-off-by: Bill Fink [EMAIL PROTECTED]

--- linux-2.6.14.3.orig/drivers/net/ixgb/ixgb_main.c 2005-11-24 
17:10:21.0 -0500
+++ linux-2.6.14.3/drivers/net/ixgb/ixgb_main.c 2005-12-28 01:06:05.0 
-0500
@@ -445,7 +445,8 @@
   NETIF_F_HW_VLAN_RX |
   NETIF_F_HW_VLAN_FILTER;
 #ifdef NETIF_F_TSO
-   netdev->features |= NETIF_F_TSO;
+   /* TSO not performant at present - disable by default */
+   netdev->features &= ~NETIF_F_TSO;
 #endif
 
if(pci_using_dac)


Bad ARP Behavior???

2005-12-16 Thread Bill Fink
Sometimes when doing my 10-GigE testing, I would get results like
the following:

chance% nuttcp -w2m 192.168.88.8
 1184.3614 MB /  10.04 sec =  989.8235 Mbps 12 %TX 9 %RX

This seemed to indicate it was using one of the GigE interfaces
rather than the 10-GigE interface.  Both chance and chance4 have
multiple GigEs attached to the same VLAN as the 10-GigE interface,
but using different network addresses.

Immediate subsequent tests would still get GigE performance, but if
I waited about 5 minutes and retested, it would be back to 10-GigE
performance levels.

After thinking about it a bit more, this seemed like it was probably
an ARP issue, which was verified by:

chance% arp -n 192.168.88.8
Address  HWtype  HWaddress   Flags MaskIface
192.168.88.8 ether   00:02:B3:D4:0C:D8   C eth0

eth0 is the 10-GigE interface on chance4 with an IP address of 192.168.88.8.

[EMAIL PROTECTED] ifconfig eth0
eth0  Link encap:Ethernet  HWaddr 00:07:E9:11:6A:61
  inet addr:192.168.88.8  Bcast:192.168.88.15  Mask:255.255.255.240
  inet6 addr: fe80::207:e9ff:fe11:6a61/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
  RX packets:593 errors:0 dropped:0 overruns:0 frame:0
  TX packets:801861 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1
  RX bytes:37952 (37.0 Kb)  TX bytes:2254470945 (2150.0 Mb)
  Base address:0xb800 Memory:fe8f8000-fe90

eth3 is a GigE interface on chance4 with an IP address of 192.168.3.78.

[EMAIL PROTECTED] ifconfig eth3
eth3  Link encap:Ethernet  HWaddr 00:02:B3:D4:0C:D8
  inet addr:192.168.3.78  Bcast:192.168.3.255  Mask:255.255.255.0
  inet6 addr: fe80::202:b3ff:fed4:cd8/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
  RX packets:3023004 errors:0 dropped:0 overruns:0 frame:0
  TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1
  RX bytes:2944400295 (2807.9 Mb)  TX bytes:1884 (1.8 Kb)
  Base address:0x6800 Memory:fe4c-fe4e

So the ARP request for 192.168.88.8 got resolved to the eth3 GigE
interface on chance4 instead of the eth0 10-GigE interface, even
though eth3 is in a completely different network than 192.168.88.8.

A tcpdump on chance verified that all the GigE interfaces on chance4
were doing ARP replies for the ARP request for 192.168.88.8, in addition
to the desired ARP reply from the 10-GigE interface.

[EMAIL PROTECTED] tcpdump -n -vv -s 1500 -e arp
tcpdump: listening on eth0
22:33:15.737712 0:7:e9:11:6a:26 > Broadcast arp 42: arp who-has 192.168.88.8
tell 192.168.88.10
22:33:15.738647 0:7:e9:11:6a:61 > 0:7:e9:11:6a:26 arp 60: arp reply 192.168.88.8
is-at 0:7:e9:11:6a:61
22:33:15.738648 0:2:b3:d4:c:d8 > 0:7:e9:11:6a:26 arp 60: arp reply 192.168.88.8
is-at 0:2:b3:d4:c:d8
22:33:15.738649 0:4:23:8:52:5d > 0:7:e9:11:6a:26 arp 60: arp reply 192.168.88.8
is-at 0:4:23:8:52:5d
22:33:15.738658 0:4:23:8:52:5e > 0:7:e9:11:6a:26 arp 60: arp reply 192.168.88.8
is-at 0:4:23:8:52:5e
22:33:15.738658 0:4:23:8:52:5c > 0:7:e9:11:6a:26 arp 60: arp reply 192.168.88.8
is-at 0:4:23:8:52:5c
22:33:15.738751 0:4:23:8:52:5f > 0:7:e9:11:6a:26 arp 60: arp reply 192.168.88.8
is-at 0:4:23:8:52:5f

Is it expected behavior that ARP replies would be generated for interfaces
on a different network than the IP address in the ARP request (note I
don't have Proxy ARP enabled), or is this a bug?  To me it would seem
to be a bug.
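
For anyone hitting the same thing, the per-device arp_filter
sysctl appears to be the standard knob for suppressing these
extra replies.  A minimal sketch of enabling it programmatically,
assuming the usual /proc paths (equivalent to
sysctl -w net.ipv4.conf.all.arp_filter=1):

#include <stdio.h>

/* With arp_filter enabled, an interface only answers ARP
 * requests for addresses the kernel would route out that
 * same interface. */
int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/conf/all/arp_filter", "w");

	if (!f) {
		perror("arp_filter");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);
	return 0;
}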

-Bill


2.6.12.6 to 2.6.14.3 Major 10-GigE TCP Network Performance Degradation

2005-12-15 Thread Bill Fink
Hi,

We use dual 3.06 GHz Xeon PC servers, with 1 GB memory, 133-MHz/64-bit
PCI-X bus, and Intel PRO/10GbE 10-GigE NIC, as 10-GigE network performance
measurement and troubleshooting systems.  With the 2.6.12.6 kernel we
get consistently excellent network performance, both TCP and UDP.

Here's a sample of the UDP performance, first transmitting from our
system chance (192.168.88.10) to our system chance4 (192.168.88.8),
followed by a transfer in the opposite direction.

chance% nuttcp -u -w5m 192.168.88.8
 6348.3594 MB /  10.00 sec = 5322.9641 Mbps 99 %TX 66 %RX 0 / 812590 drop/pkt 0.00 %loss
chance% nuttcp -u -r -w5m 192.168.88.8
 6509.0312 MB /  10.00 sec = 5457.7234 Mbps 99 %TX 62 %RX 0 / 833156 drop/pkt 0.00 %loss

As you can see, we get over 5 Gbps with zero packet drops which demonstrates
the network path is clean.  The TCP performance is also excellent:

chance% nuttcp -w2m 192.168.88.8
 6489.5625 MB /  10.00 sec = 5442.5464 Mbps 99 %TX 76 %RX
chance% nuttcp -r -w2m 192.168.88.8
 6114.1250 MB /  10.00 sec = 5127.3559 Mbps 99 %TX 70 %RX

If we do the same tests on a 2.6.14.3 kernel, the UDP performance is still
excellent:

chance% nuttcp -u -w5m 192.168.88.8
 6743.2656 MB /  10.02 sec = 5644.6505 Mbps 100 %TX 69 %RX 0 / 863138 drop/pkt 0.00 %loss
chance% nuttcp -u -r -w5m 192.168.88.8
 6692.6094 MB /  10.02 sec = 5602.4222 Mbps 100 %TX 69 %RX 0 / 856654 drop/pkt 0.00 %loss

But the TCP performance is consistently 40% or more less than the
performance with the 2.6.12.6 kernel:

chance% nuttcp -w2m 192.168.88.8
 3680.4890 MB /  10.02 sec = 3082.0133 Mbps 43 %TX 43 %RX
chance% nuttcp -r -w2m 192.168.88.8
 3495.4405 MB /  10.02 sec = 2925.3573 Mbps 40 %TX 40 %RX

It appears that it is getting CPU starved for some reason (note the
43%/40% transmitter CPU usage versus the 99%/99% CPU usage for the
2.6.12.6 case).

If we use multiple streams, we can then get up to the maximum performance
level, first a sample with 2 streams:

chance% nuttcp -Is1 -w2m 192.168.88.8 & nuttcp -Is2 -w2m -p5002 192.168.88.8
s1:  2996.8977 MB /  10.02 sec = 2508.4744 Mbps 32 %TX 35 %RX
s2:  1795. MB /  10.02 sec = 1502.0434 Mbps 26 %TX 24 %RX

That's an aggregate of 4010.5178 Mbps.  And with 3 streams:

chance% nuttcp -Is1 -w2m 192.168.88.8 & nuttcp -Is2 -w2m -p5002 192.168.88.8 & nuttcp -Is3 -w2m -p5003 192.168.88.8
s1:  3183.1493 MB /  10.02 sec = 2665.1879 Mbps 67 %TX 44 %RX
s2:  1583.1875 MB /  10.04 sec = 1322.9457 Mbps 27 %TX 26 %RX
s3:  1581.6250 MB /  10.04 sec = 1321.7448 Mbps 29 %TX 26 %RX

That's an aggregate of 5309.8784 Mbps which is comparable to the TCP
performance of the single stream 2.6.12.6 case.

I also tried testing with a 2.6.13.4 kernel.  It gives inconsistent
results, sometimes slightly less than the 2.6.12.6 kernel such as:

chance% nuttcp -w2m 192.168.88.8
 5848. MB /  10.01 sec = 4900.5697 Mbps 96 %TX 67 %RX
chance% nuttcp -r -w2m 192.168.88.8
 5817.9375 MB /  10.01 sec = 4875.8281 Mbps 91 %TX 71 %RX

And sometimes as bad as the 2.6.14.3 kernel:

chance% nuttcp -w2m 192.168.88.8
 3627.4375 MB /  10.02 sec = 3037.6242 Mbps 44 %TX 47 %RX
chance% nuttcp -r -w2m 192.168.88.8
 4149.6250 MB /  10.01 sec = 3477.6491 Mbps 54 %TX 52 %RX

The full network performance tests are attached below.  They were
run from a script shortly after chance and chance4 were rebooted.
There was a 5 second sleep between each pair of tests, and there
were 10 pairs of TCP tests plus a UDP pair in each run.

The 2.6.13.4 kernel config was generated from the 2.6.12.6 config
by doing a make oldconfig.  Likewise the 2.6.14.3 config was
generated from the 2.6.13.4 config by doing a make oldconfig.  Here's
the diff between the different kernel versions (chance3 is another
system where the kernels were built):

[EMAIL PROTECTED] grep ^CONFIG /usr/src/linux-2.6.12.6/.config | sort > /tmp/config-2.6.12
[EMAIL PROTECTED] grep ^CONFIG /usr/src/linux-2.6.13.4/.config | sort > /tmp/config-2.6.13
[EMAIL PROTECTED] grep ^CONFIG /usr/src/linux-2.6.14.3/.config | sort > /tmp/config-2.6.14

[EMAIL PROTECTED] diff /tmp/config-2.6.12 /tmp/config-2.6.13
37a38
> CONFIG_ASK_IP_FIB_HASH=y
274a276,278
> CONFIG_FLATMEM_MANUAL=y
> CONFIG_FLATMEM=y
> CONFIG_FLAT_NODE_MEM_MAP=y
279,282d282
< CONFIG_FUSION_CTL=m
< CONFIG_FUSION_LAN=m
< CONFIG_FUSION=m
< CONFIG_FUSION_MAX_SGE=40
285d284
< CONFIG_GAMEPORT_CS461X=m
291d289
< CONFIG_GAMEPORT_VORTEX=m
315a314,316
> CONFIG_HWMON=y
> CONFIG_HZ=100
> CONFIG_HZ_100=y
381a383
> CONFIG_INOTIFY=y
422a425
> CONFIG_IP_FIB_HASH=y
632a636
> CONFIG_NET_EMATCH_TEXT=m
668a673
> CONFIG_NFS_COMMON=y
763a769
> CONFIG_PHYSICAL_START=0x10
775a782
> CONFIG_PREEMPT_NONE=y
863a871
> CONFIG_SELECT_MEMORY_MODEL=y
866d873
< CONFIG_SERIAL_8250_MULTIPORT=y
958a966
> CONFIG_TCP_CONG_BIC=y
959a968,970
> CONFIG_TEXTSEARCH_FSM=m
> CONFIG_TEXTSEARCH_KMP=m
> CONFIG_TEXTSEARCH=y
1011c1022
< CONFIG_USB_MON=m
---
> CONFIG_USB_MON=y

[EMAIL PROTECTED] diff /tmp/config-2.6.13 /tmp/config-2.6.14
36a37
> CONFIG_ARCH_MAY_HAVE_PC_FDC=y
155a157
> CONFIG_CHELSIO_T1=m
190a193
 

Re: 2.6.12.6 to 2.6.14.3 Major 10-GigE TCP Network Performance Degradation

2005-12-15 Thread Bill Fink
Oops.  I forgot to attach my 2.6.12.6 kernel config.

-Bill


config-2.6.12.bz2
Description: BZip2 compressed data


Re: 2.6.12.6 to 2.6.14.3 Major 10-GigE TCP Network Performance Degradation

2005-12-15 Thread Bill Fink
On Fri, 16 Dec 2005, Andi Kleen wrote:

  It appears that it is getting CPU starved for some reason (note the
  43%/40% transmitter CPU usage versus the 99%/99% CPU usage for the
  2.6.12.6 case).
 
 What happens when you turn off tso in ethtool?

Thanks!!!  That did the trick.

[EMAIL PROTECTED] ethtool -K eth0 tso off
[EMAIL PROTECTED] ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

[EMAIL PROTECTED] ethtool -K eth0 tso off
[EMAIL PROTECTED] ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off

chance% nuttcp -w2m 192.168.88.8
 6299.0625 MB /  10.01 sec = 5278.6065 Mbps 100 %TX 74 %RX
chance% nuttcp -r -w2m 192.168.88.8
 6221.3125 MB /  10.01 sec = 5213.2026 Mbps 100 %TX 71 %RX

And a full test I just did consistently got over 5 Gbps.
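
For completeness, the same knob can be flipped programmatically
through the SIOCETHTOOL ioctl; a minimal sketch using the legacy
ETHTOOL_STSO command, assuming interface eth0 (the programmatic
equivalent of "ethtool -K eth0 tso off"):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_value ev = { .cmd = ETHTOOL_STSO, .data = 0 };
	struct ifreq ifr;
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (char *)&ev;	/* ethtool command buffer */

	if (ioctl(s, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_STSO");
	return 0;
}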

-Thanks again

-Bill