RE: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Jesse,


It's good to be talking directly to one of the e1000 developers and
maintainers.  Although at this point I am starting to think that the
issue may be TCP stack related and nothing to do with the NIC.  Am I
correct that these are quite distinct parts of the kernel?


Yes, quite.


OK.  I hope that there is also someone knowledgeable about the TCP stack 
who is following this thread. (Perhaps you also know this part of the 
kernel, but I am assuming that your expertise is on the e1000/NIC bits.)



Important note: we ARE able to get full duplex wire speed (over 900
Mb/s simultaneously in both directions) using UDP.  The problems occur
only with TCP connections.


That eliminates bus bandwidth issues, probably, but small packets take
up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.


I see.  Your concern is the extra ACK packets associated with TCP.  Even 
though these represent a small volume of data (around 5% with MTU=1500, and 
less at larger MTU), they double the number of packets that must be handled 
by the system compared to UDP transmission at the same data rate. Is that 
correct?
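
(Rough numbers behind the ~5% figure, as a sanity check rather than a 
measurement: a bare TCP ACK is about 64 bytes on the wire, versus roughly 
1518 bytes for a full-MTU data frame, so the ACK stream adds only about 
4-5% to the byte volume even though it adds one packet for every one or 
two data segments.)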



I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in
Germany).  So we'll provide this info in ~10 hours.


I would suggest you try TCP_RR with a command line something like this:
netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K

I think you'll have to compile netperf with burst mode support enabled.
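
(A note for anyone rebuilding netperf for this: burst mode is a 
configure-time option, so something along the lines of

    ./configure --enable-burst && make

should do it; the exact flag name is from memory, so please check your 
netperf version's configure --help output.)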


I just saw Carsten a few minutes ago.  He has to take part in a 
'Baubesprechung' (construction meeting) this morning, after which he will start answering 
the technical questions and doing additional testing as suggested by you 
and others.  If you are on the US west coast, he should have some answers 
and results posted by Thursday morning Pacific time.



I assume that the interrupt load is distributed among all four cores
-- the default affinity is 0xff, and I also assume that there is some
type of interrupt aggregation taking place in the driver.  If the
CPUs were not able to service the interrupts fast enough, I assume
that we would also see loss of performance with UDP testing.
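
(A quick way to verify the interrupt distribution, assuming the interface 
is eth0 and NN stands for its IRQ number as listed in /proc/interrupts:

    grep eth0 /proc/interrupts
    cat /proc/irq/NN/smp_affinity

The first command shows the per-CPU interrupt counts actually accumulated, 
the second shows the affinity mask currently in effect.)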


One other thing you can try with e1000 is disabling the dynamic
interrupt moderation by loading the driver with
InterruptThrottleRate=8000,8000,... (the number of commas depends on
your number of ports) which might help in your particular benchmark.


OK.  Is 'dynamic interrupt moderation' another name for 'interrupt
aggregation'?  Meaning that if more than one interrupt is generated
in a given time interval, then they are replaced by a single
interrupt?


Yes, InterruptThrottleRate=8000 means there will be no more than 8000
ints/second from that adapter, and if interrupts are generated faster
than that they are aggregated.

Interestingly, since you are interested in ultra low latency, and may be
willing to give up some CPU for it during bulk transfers, you should try
InterruptThrottleRate=1 (can generate up to 70000 ints/s)
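
(A minimal sketch of how those module options would be applied, assuming a
dual-port adapter and that the driver can be reloaded; adjust the port
count and values to your setup:

    modprobe -r e1000
    modprobe e1000 InterruptThrottleRate=8000,8000

The resulting interrupt rate can then be watched with something like
'watch -n1 "grep eth /proc/interrupts"'.)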


I'm not sure it's quite right to say that we are interested in ultra low 
latency. Most of our network transfers involve bulk data movement (a few 
MB or more).  We don't care so much about low latency (meaning how long it 
takes the FIRST byte of data to travel from sender to receiver).  We care 
about aggregate bandwidth: once the pipe is full, how fast can data be 
moved through it. So we don't care so much if getting the pipe full takes 
20 us or 50 us.  We just want the data to flow fast once the pipe IS full.



Welcome, it's an interesting discussion.  Hope we can come to a good
conclusion.


Thank you. Carsten will post more info and answers later today.

Cheers,
Bruce


Re: ipcomp regression in 2.6.24

2008-01-31 Thread Marco Berizzi
Herbert Xu wrote:

 On Wed, Jan 30, 2008 at 10:14:46AM +0100, Marco Berizzi wrote:
 
  Sorry to bother you again.
  I have applied it to 2.6.24, but ipcomp still doesn't work.
  I have patched a clean 2.6.24 tree and I did a complete
  rebuild.
  With tcpdump I see both the esp packets going in/out but
  I don't see the clear packets on the interface.

 After testing it here it looks like there is this little typo
 which means that you can't actually use IPComp for anything
 that's not compressible :)

Applied and tested on 2.6.24: ipcomp is working now.
As always, thanks a lot Herbert for fixing this.




xfrm_lookup() and XFRM_POLICY_ICMP

2008-01-31 Thread Andy Johnson
Hello,

A question about XFRM_POLICY_ICMP:

I have tried to understand this check in the __xfrm_lookup() method in
net/xfrm/xfrm_policy.c (the recent 2.6 git Dave Miller tree):
...
...
	if ((flags & XFRM_LOOKUP_ICMP) &&
	    !(policy->flags & XFRM_POLICY_ICMP))
		goto error;
...
...

Why is the check for XFRM_POLICY_ICMP? I grepped the kernel tree,
and the only place where XFRM_POLICY_ICMP appears is here (apart from its
definition in xfrm.h).

I also grepped the openswan tree, and could not find XFRM_POLICY_ICMP.
(The struct xfrm_userpolicy_info in openswan includes XFRM_POLICY_ALLOW,
XFRM_POLICY_BLOCK and XFRM_POLICY_LOCALOK, but not XFRM_POLICY_ICMP.)

I also grepped the iproute2 tree (from git) and there is no XFRM_POLICY_ICMP.

So is there a way at all to set XFRM_POLICY_ICMP? And if not, maybe this
check is not needed at all?

Regards,
Andy


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Sangtae,

Thanks for joining this discussion -- it's good to have a CUBIC author and 
expert here!


In our application (cluster computing) we use a very tightly coupled 
high-speed low-latency network.  There is no 'wide area traffic'.  So 
it's hard for me to understand why any networking components or 
software layers should take more than milliseconds to ramp up or back 
off in speed. Perhaps we should be asking for a TCP congestion 
avoidance algorithm which is designed for a data center environment 
where there are very few hops and typical packet delivery times are 
tens or hundreds of microseconds. It's very different than delivering 
data thousands of km across a WAN.


If your network latency is low, any type of protocol should 
give you more than 900Mbps.


Yes, this is also what I had thought.

In the graph that we posted, the two machines are connected by an ethernet 
crossover cable.  The total RTT of the two machines is probably AT MOST a 
couple of hundred microseconds.  Typically it takes 20 or 30 microseconds 
to get the first packet out the NIC.  Travel across the wire is a few 
nanoseconds.  Then getting the packet into the receiving NIC might be 
another 20 or 30 microseconds.  The ACK should fly back in about the same 
time.
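
(For scale: even with a generous 200 us RTT, the bandwidth-delay product at 
1 Gb/s is only about 1 Gb/s x 200 us = 25 kB, far below typical TCP window 
sizes, so the window size itself should not be the limiting factor here.)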


I can guess the RTT of your two machines is less than 4ms and I 
remember the throughputs of all high-speed protocols (including 
tcp-reno) were more than 900Mbps with a 4ms RTT. So, my question is: which 
kernel version did you use with your Broadcom NIC to get more than 
900Mbps?


We are going to double-check this (we did the Broadcom testing about two 
months ago). Carsten is going to re-run the Broadcom experiments later 
today and will then post the results.


You can see results from some testing on crossover-cable wired systems 
with Broadcom NICs that I did about 2 years ago, here:

http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html
You'll notice that total TCP throughput on the crossover cable was about 
220 MB/sec.  With TCP overhead this is very close to 2Gb/s.
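
(As a rough check: 220 MB/s is about 1.8 Gb/s of payload, and adding 
roughly 5-8% of TCP/IP and Ethernet framing overhead brings that close to 
the 2 Gb/s combined wire rate of the two directions.)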


I have two machines connected by a gig switch and I can see what happens 
in my environment. Could you post what parameters you used for 
netperf testing?


Carsten will post these in the next few hours.  If you want to simplify 
further, you can even take away the gig switch and just use a crossover 
cable.



and also, if you set any parameters for your testing, please post them
here so that I can check whether the same happens for me as well.


Carsten will post all the sysctl and ethtool parameters shortly.
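
(For anyone who wants to compare settings in the meantime, the relevant 
ones can be dumped with something like

    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max
    ethtool -k eth0
    ethtool -c eth0

assuming the interface is eth0; 'ethtool -k' shows the offload settings and 
'ethtool -c' the interrupt coalescing settings.)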

Thanks again for chiming in. I am sure that with help from you, Jesse, and 
Rick, we can figure out what is going on here, and get it fixed.


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Andi Kleen
Bruce Allen [EMAIL PROTECTED] writes:

 Important note: we ARE able to get full duplex wire speed (over 900
 Mb/s simultaneously in both directions) using UDP.  The problems occur
 only with TCP connections.

Another issue with full duplex TCP not mentioned yet is that if TSO is used 
the output  will be somewhat bursty and might cause problems with the 
TCP ACK clock of the other direction because the ACKs would need 
to squeeze in between full TSO bursts.

You could try disabling TSO with ethtool.

-Andi


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Andi!


Important note: we ARE able to get full duplex wire speed (over 900
Mb/s simultaneously in both directions) using UDP.  The problems occur
only with TCP connections.


Another issue with full duplex TCP not mentioned yet is that if TSO is used
the output  will be somewhat bursty and might cause problems with the
TCP ACK clock of the other direction because the ACKs would need
to squeeze in between full TSO bursts.

You could try disabling TSO with ethtool.


Noted.  We'll try this also.
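
(Presumably something like 'ethtool -K eth0 tso off', assuming the 
interface is eth0, with 'ethtool -k eth0' to confirm the offload state 
afterwards.)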

Cheers,
Bruce


Re: [2.6 patch] make net/802/tr.c:sysctl_tr_rif_timeout static

2008-01-31 Thread Pavel Emelyanov
Adrian Bunk wrote:
 sysctl_tr_rif_timeout can now become static.
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-by: Pavel Emelyanov [EMAIL PROTECTED]

 ---
 e5accd81b924224d40a3f38204c08cf6896ed79b 
 diff --git a/net/802/tr.c b/net/802/tr.c
 index 3f16b17..18c6647 100644
 --- a/net/802/tr.c
 +++ b/net/802/tr.c
 @@ -76,7 +76,7 @@ static DEFINE_SPINLOCK(rif_lock);
  
  static struct timer_list rif_timer;
  
 -int sysctl_tr_rif_timeout = 60*10*HZ;
 +static int sysctl_tr_rif_timeout = 60*10*HZ;
  
  static inline unsigned long rif_hash(const unsigned char *addr)
  {
 
 



Re: [2.6 patch] make struct ipv4_devconf static

2008-01-31 Thread Pavel Emelyanov
Adrian Bunk wrote:
 struct ipv4_devconf can now become static.
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-by: Pavel Emelyanov [EMAIL PROTECTED]

 ---
 
  include/linux/inetdevice.h |2 --
  net/ipv4/devinet.c |2 +-
  2 files changed, 1 insertion(+), 3 deletions(-)
 
 20262a3317069b1bdbf2b37f4002fa5322445914 
 diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
 index 8d9eaae..fc4e3db 100644
 --- a/include/linux/inetdevice.h
 +++ b/include/linux/inetdevice.h
 @@ -17,8 +17,6 @@ struct ipv4_devconf
   DECLARE_BITMAP(state, __NET_IPV4_CONF_MAX - 1);
  };
  
 -extern struct ipv4_devconf ipv4_devconf;
 -
  struct in_device
  {
   struct net_device   *dev;
 diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
 index 21f71bf..5ab5acc 100644
 --- a/net/ipv4/devinet.c
 +++ b/net/ipv4/devinet.c
 @@ -64,7 +64,7 @@
  #include net/rtnetlink.h
  #include net/net_namespace.h
  
 -struct ipv4_devconf ipv4_devconf = {
 +static struct ipv4_devconf ipv4_devconf = {
   .data = {
   [NET_IPV4_CONF_ACCEPT_REDIRECTS - 1] = 1,
   [NET_IPV4_CONF_SEND_REDIRECTS - 1] = 1,
 
 



Re: [2.6 patch] make nf_ct_path[] static

2008-01-31 Thread Pavel Emelyanov
Adrian Bunk wrote:
 This patch makes the needlessly global nf_ct_path[] static.
 
 Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-by: Pavel Emelyanov [EMAIL PROTECTED]

Thanks, Adrian!

 ---
 6396fbcebe3eb61f7e6eb1a671920a515912b005 
 diff --git a/net/netfilter/nf_conntrack_standalone.c 
 b/net/netfilter/nf_conntrack_standalone.c
 index 696074a..5bd38a6 100644
 --- a/net/netfilter/nf_conntrack_standalone.c
 +++ b/net/netfilter/nf_conntrack_standalone.c
 @@ -380,7 +380,7 @@ static ctl_table nf_ct_netfilter_table[] = {
   { .ctl_name = 0 }
  };
  
 -struct ctl_path nf_ct_path[] = {
 +static struct ctl_path nf_ct_path[] = {
   { .procname = net, .ctl_name = CTL_NET, },
   { }
  };
 
 



Re: Strange commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5

2008-01-31 Thread Pavel Emelyanov
Adrian Bunk wrote:
 Commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
 ([RAW]: Consolidate proc interface.) did not only change raw6_seq_ops
 (including adding 3 EXPORT_SYMBOL_GPL's to net/ipv4/raw.c for accessing 
 functions from there), it also removed the only user of raw6_seq_ops...

The commit is not strange, it is wrong :( Sorry David, when I checked
the corresponding proc files, I saw that both files show sockets, but
overlooked that the raw6 one shows the IPv4 part of the IPv6 socket.

Denis noticed that this morning and has already prepared a fix.
So please do not revert the commit; the fix will be in your mailbox today.

Thanks, Adrian.

 cu
 Adrian
 



Re: [PATCH net-2.6.25][NETNS]: Fix race between put_net() and netlink_kernel_create().

2008-01-31 Thread Pavel Emelyanov
David Miller wrote:
 From: Pavel Emelyanov [EMAIL PROTECTED]
 Date: Thu, 24 Jan 2008 16:15:13 +0300
 
 The comment about a race-free view of the set of network 
 namespaces was a bit hasty. Look (this can even happen with only 
 one CPU, as discovered by Alexey Dobriyan and Denis Lunev):
  ...
 Instead, I propose to create the socket inside the init_net
 namespace and then re-attach it to the desired one right
 after the socket is created.

 After doing this, we also have to be careful on error paths
 not to drop a reference on the namespace that we did not take
 one on.

 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
 Acked-by: Denis Lunev [EMAIL PROTECTED]
 
 Applied, thanks.
 

Thanks, David.

I have one more patch pending in netdev@ and some more to be sent
(cleanups, small fixes and net namespaces). Do I have to wait till
net-2.6.26, or can I start (re-)sending them while 2.6.25 merge
window is open?

Thanks,
Pavel


[PATCH 0/3] [RAW]: proc output cleanups.

2008-01-31 Thread Denis V. Lunev
Yesterday Adrian Bunk noticed that the commit

commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date:   Mon Nov 19 22:38:33 2007 -0800

is somewhat strange. Basically, the commit is obviously wrong, as the
content of /proc/net/raw6 is incorrect because of it.

This series of patches fixes the original problem and slightly cleans up
the surrounding code.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]



[PATCH 2/3] [RAW]: Cleanup IPv4 raw_seq_show.

2008-01-31 Thread Denis V. Lunev
There is no need to use 128 bytes on the stack at all. Clean up the code in
the IPv6 style.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/raw.c |   24 +++-
 1 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 507cbfe..830f19e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -927,7 +927,7 @@ void raw_seq_stop(struct seq_file *seq, void *v)
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);
 
-static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i)
+static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
 {
struct inet_sock *inet = inet_sk(sp);
__be32 dest = inet-daddr,
@@ -935,33 +935,23 @@ static __inline__ char *get_raw_sock(struct sock *sp, 
char *tmpbuf, int i)
__u16 destp = 0,
  srcp  = inet-num;
 
-   sprintf(tmpbuf, %4d: %08X:%04X %08X:%04X
+   seq_printf(seq, %4d: %08X:%04X %08X:%04X
 %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d,
i, src, srcp, dest, destp, sp-sk_state,
atomic_read(sp-sk_wmem_alloc),
atomic_read(sp-sk_rmem_alloc),
0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
atomic_read(sp-sk_refcnt), sp, atomic_read(sp-sk_drops));
-   return tmpbuf;
 }
 
-#define TMPSZ 128
-
 static int raw_seq_show(struct seq_file *seq, void *v)
 {
-   char tmpbuf[TMPSZ+1];
-
if (v == SEQ_START_TOKEN)
-   seq_printf(seq, %-*s\n, TMPSZ-1,
-sl  local_address rem_address   st tx_queue 
-  rx_queue tr tm-when retrnsmt   uid  timeout 
-  inode  drops);
-   else {
-   struct raw_iter_state *state = raw_seq_private(seq);
-
-   seq_printf(seq, %-*s\n, TMPSZ-1,
-  get_raw_sock(v, tmpbuf, state-bucket));
-   }
+   seq_printf(seq,   sl  local_address rem_address   st tx_queue 
+   rx_queue tr tm-when retrnsmt   uid  timeout 
+   inode  drops\n);
+   else
+   raw_sock_seq_show(seq, v, raw_seq_private(seq)-bucket);
return 0;
 }
 
-- 
1.5.3.rc5



[PATCH 3/3] [RAW]: Wrong content of the /proc/net/raw6.

2008-01-31 Thread Denis V. Lunev
The addresses of IPv6 raw sockets were shown in the wrong format, namely the IPv4 one.
The problem was introduced by
commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date:   Mon Nov 19 22:38:33 2007 -0800

Thanks to Adrian Bunk, who originally noticed the problem.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/raw.h |3 ++-
 net/ipv4/raw.c|8 
 net/ipv6/raw.c|2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index c7ea7a2..1828f81 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -48,7 +48,8 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
+int raw_seq_open(struct inode *ino, struct file *file,
+struct raw_hashinfo *h, const struct seq_operations *ops);
 
 #endif
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 830f19e..a3002fe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -962,13 +962,13 @@ static const struct seq_operations raw_seq_ops = {
.show  = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
+int raw_seq_open(struct inode *ino, struct file *file,
+struct raw_hashinfo *h, const struct seq_operations *ops)
 {
int err;
struct raw_iter_state *i;
 
-   err = seq_open_net(ino, file, raw_seq_ops,
-   sizeof(struct raw_iter_state));
+   err = seq_open_net(ino, file, ops, sizeof(struct raw_iter_state));
if (err  0)
return err;
 
@@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v4_hashinfo);
+   return raw_seq_open(inode, file, raw_v4_hashinfo, raw_seq_ops);
 }
 
 static const struct file_operations raw_seq_fops = {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a2cf499..8897ccf 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v6_hashinfo);
+   return raw_seq_open(inode, file, raw_v6_hashinfo, raw6_seq_ops);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5



[PATCH 1/3] [RAW]: Family check in the /proc/net/raw[6] is extra.

2008-01-31 Thread Denis V. Lunev
Different hashtables are used for IPv6 and IPv4 raw sockets, so there is no
need to check the socket family in the iterator over the hashtables. Clean
this out.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/raw.h |4 +---
 net/ipv4/raw.c|   12 
 net/ipv6/raw.c|2 +-
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index cca81d8..c7ea7a2 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -41,7 +41,6 @@ extern void raw_proc_exit(void);
 struct raw_iter_state {
struct seq_net_private p;
int bucket;
-   unsigned short family;
struct raw_hashinfo *h;
 };
 
@@ -49,8 +48,7 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-   unsigned short family);
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
 
 #endif
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f863c3d..507cbfe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -862,8 +862,7 @@ static struct sock *raw_get_first(struct seq_file *seq)
struct hlist_node *node;
 
sk_for_each(sk, node, state-h-ht[state-bucket])
-   if (sk-sk_net == state-p.net 
-   sk-sk_family == state-family)
+   if (sk-sk_net == state-p.net)
goto found;
}
sk = NULL;
@@ -879,8 +878,7 @@ static struct sock *raw_get_next(struct seq_file *seq, 
struct sock *sk)
sk = sk_next(sk);
 try_again:
;
-   } while (sk  sk-sk_net != state-p.net 
-   sk-sk_family != state-family);
+   } while (sk  sk-sk_net != state-p.net);
 
if (!sk  ++state-bucket  RAW_HTABLE_SIZE) {
sk = sk_head(state-h-ht[state-bucket]);
@@ -974,8 +972,7 @@ static const struct seq_operations raw_seq_ops = {
.show  = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-   unsigned short family)
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
 {
int err;
struct raw_iter_state *i;
@@ -987,14 +984,13 @@ int raw_seq_open(struct inode *ino, struct file *file, 
struct raw_hashinfo *h,
 
i = raw_seq_private((struct seq_file *)file-private_data);
i-h = h;
-   i-family = family;
return 0;
 }
 EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v4_hashinfo, PF_INET);
+   return raw_seq_open(inode, file, raw_v4_hashinfo);
 }
 
 static const struct file_operations raw_seq_fops = {
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index d61c63d..a2cf499 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-   return raw_seq_open(inode, file, raw_v6_hashinfo, PF_INET6);
+   return raw_seq_open(inode, file, raw_v6_hashinfo);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5



Re: [PATCH net-2.6.25][NETNS]: Fix race between put_net() and netlink_kernel_create().

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 14:05:57 +0300

 I have one more patch pending in netdev@ and some more to be sent
 (cleanups, small fixes and net namespaces). Do I have to wait till
 net-2.6.26, or can I start (re-)sending them while 2.6.25 merge
 window is open?

Send it, I'll take a look at it.


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bill Fink
On Wed, 30 Jan 2008, SANGTAE HA wrote:

 On Jan 30, 2008 5:25 PM, Bruce Allen [EMAIL PROTECTED] wrote:
 
  In our application (cluster computing) we use a very tightly coupled
  high-speed low-latency network.  There is no 'wide area traffic'.  So it's
  hard for me to understand why any networking components or software layers
  should take more than milliseconds to ramp up or back off in speed.
  Perhaps we should be asking for a TCP congestion avoidance algorithm which
  is designed for a data center environment where there are very few hops
  and typical packet delivery times are tens or hundreds of microseconds.
  It's very different than delivering data thousands of km across a WAN.
 
 
 If your network latency is low, any type of protocol should
 give you more than 900Mbps. I can guess the RTT of your two machines is
 less than 4ms and I remember the throughputs of all
 high-speed protocols (including tcp-reno) were more than 900Mbps with
 a 4ms RTT. So, my question is: which kernel version did you use with your
 Broadcom NIC to get more than 900Mbps?
 
 I have two machines connected by a gig switch and I can see what
 happens in my environment. Could you post what parameters you used
 for netperf testing?
 and also, if you set any parameters for your testing, please post them
 here so that I can check whether the same happens for me as well.

I see similar results on my test systems, using Tyan Thunder K8WE (S2895)
motherboard with dual Intel Xeon 3.06 GHz CPUs and 1 GB memory, running
a 2.6.15.4 kernel.  The GigE NICs are Intel PRO/1000 82546EB_QUAD_COPPER,
on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000
driver, and running with 9000-byte jumbo frames.  The TCP congestion
control is BIC.

Unidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
tx:  1186.5649 MB /  10.05 sec =  990.2741 Mbps 11 %TX 9 %RX 0 retrans

and:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Irx -r -w2m 192.168.6.79
rx:  1186.8281 MB /  10.05 sec =  990.5634 Mbps 14 %TX 9 %RX 0 retrans

Each direction gets full GigE line rate.

Bidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta 
-Irx -r -w2m 192.168.6.79
tx:   898.9934 MB /  10.05 sec =  750.1634 Mbps 10 %TX 8 %RX 0 retrans
rx:  1167.3750 MB /  10.06 sec =  973.8617 Mbps 14 %TX 11 %RX 0 retrans

While one direction gets close to line rate, the other only got 750 Mbps.
Note there were no TCP retransmitted segments for either data stream, so
that doesn't appear to be the cause of the slower transfer rate in one
direction.

If the receive direction uses a different GigE NIC that's part of the
same quad-GigE, all is fine:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta 
-Irx -r -w2m 192.168.5.79
tx:  1186.5051 MB /  10.05 sec =  990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx:  1186.7656 MB /  10.05 sec =  990.5204 Mbps 15 %TX 14 %RX 0 retrans

Here's a test using the same GigE NIC for both directions with 1-second
interval reports:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -i1 -w2m 192.168.6.79 & nuttcp 
-f-beta -Irx -r -i1 -w2m 192.168.6.79
tx:92.3750 MB /   1.01 sec =  767.2277 Mbps 0 retrans
rx:   104.5625 MB /   1.01 sec =  872.4757 Mbps 0 retrans
tx:83.3125 MB /   1.00 sec =  700.1845 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5541 Mbps 0 retrans
tx:83.8125 MB /   1.00 sec =  703.0322 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5502 Mbps 0 retrans
tx:83.0000 MB /   1.00 sec =  696.1779 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5522 Mbps 0 retrans
tx:83.7500 MB /   1.00 sec =  702.4989 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5512 Mbps 0 retrans
tx:83.1250 MB /   1.00 sec =  697.2270 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5512 Mbps 0 retrans
tx:84.1875 MB /   1.00 sec =  706.1665 Mbps 0 retrans
rx:   117.5625 MB /   1.00 sec =  985.5510 Mbps 0 retrans
tx:83.0625 MB /   1.00 sec =  696.7167 Mbps 0 retrans
rx:   117.6875 MB /   1.00 sec =  987.5543 Mbps 0 retrans
tx:84.1875 MB /   1.00 sec =  706.1545 Mbps 0 retrans
rx:   117.6250 MB /   1.00 sec =  986.5472 Mbps 0 retrans
rx:   117.6875 MB /   1.00 sec =  987.0724 Mbps 0 retrans
tx:83.3125 MB /   1.00 sec =  698.8137 Mbps 0 retrans

tx:   844.9375 MB /  10.07 sec =  703.7699 Mbps 11 %TX 6 %RX 0 retrans
rx:  1167.4414 MB /  10.05 sec =  973.9980 Mbps 14 %TX 11 %RX 0 retrans

In this test case, the receiver ramped up to nearly full GigE line rate,
while the transmitter was stuck at about 700 Mbps.  I ran one longer
60-second test and didn't see the oscillating behavior between receiver
and transmitter, but maybe that's because I have the GigE NIC interrupts
and nuttcp client/server applications both locked to CPU 0.
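
(That pinning is presumably done in the usual way, e.g. 
'echo 1 > /proc/irq/NN/smp_affinity' for the NIC's IRQ, with NN taken from 
/proc/interrupts, and 'taskset -c 0 nuttcp ...' for the applications; the 
exact IRQ number is of course system specific.)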

So in my tests, once one direction gets the upper hand, it seems to
stay that way.  Could this be because the slower side 

Re: [PATCH 0/3] [RAW]: proc output cleanups.

2008-01-31 Thread David Miller
From: Denis V. Lunev [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 14:32:52 +0300

 Yesterday Adrian Bunk noticed that the commit
 
 commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
 Author: Pavel Emelyanov [EMAIL PROTECTED]
 Date:   Mon Nov 19 22:38:33 2007 -0800
 
 is somewhat strange. Basically, the commit is obviously wrong, as the
 content of /proc/net/raw6 is incorrect because of it.
 
 This series of patches fixes the original problem and slightly cleans up
 the surrounding code.
 
 Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

All applied, thanks a lot!


hard hang through qdisc

2008-01-31 Thread Andi Kleen


I just managed to hang a 2.6.24 (+ some non-network patches) kernel 
with the following (nonsensical) command

tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100

No oops or anything, it just hangs. While I understand root can
do bad things, just hanging like this seems a little extreme.

-Andi


[PATCH 0/6][IPV6]: Introduce the INET6_TW_MATCH macro.

2008-01-31 Thread Pavel Emelyanov
We have INET_MATCH, INET_TW_MATCH and INET6_MATCH to test
sockets and twbuckets for matching, but ipv6 twbuckets are
tested manually.

Here's the INET6_TW_MATCH to help with it.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/linux/ipv6.h|8 
 net/ipv6/inet6_hashtables.c |   21 +++--
 2 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 5d35a4c..c347860 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -465,6 +465,14 @@ static inline struct raw6_sock *raw6_sk(const struct sock 
*sk)
     ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))		&& \
     (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 
+#define INET6_TW_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif) \
+	(((__sk)->sk_hash == (__hash))					&& \
+	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports))	&& \
+	 ((__sk)->sk_family		== PF_INET6)			&& \
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr)))	&& \
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) && \
+	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
+
 #endif /* __KERNEL__ */
 
 #endif /* _IPV6_H */
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index a66a7d8..06b01be 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -80,17 +80,8 @@ struct sock *__inet6_lookup_established(struct inet_hashinfo 
*hashinfo,
}
/* Must check for a TIME_WAIT'er before going to listener hash. */
sk_for_each(sk, node, head-twchain) {
-   const struct inet_timewait_sock *tw = inet_twsk(sk);
-
-   if(*((__portpair *)(tw-tw_dport)) == ports
-  sk-sk_family== PF_INET6) {
-   const struct inet6_timewait_sock *tw6 = inet6_twsk(sk);
-
-   if (ipv6_addr_equal(tw6-tw_v6_daddr, saddr)   
-   ipv6_addr_equal(tw6-tw_v6_rcv_saddr, daddr)   

-   (!sk-sk_bound_dev_if || sk-sk_bound_dev_if == 
dif))
-   goto hit;
-   }
+   if (INET6_TW_MATCH(sk, hash, saddr, daddr, ports, dif))
+   goto hit;
}
read_unlock(lock);
return NULL;
@@ -185,15 +176,9 @@ static int __inet6_check_established(struct 
inet_timewait_death_row *death_row,
 
/* Check TIME-WAIT sockets first. */
sk_for_each(sk2, node, head-twchain) {
-   const struct inet6_timewait_sock *tw6 = inet6_twsk(sk2);
-
tw = inet_twsk(sk2);
 
-   if(*((__portpair *)(tw-tw_dport)) == ports 
-  sk2-sk_family  == PF_INET6   
-  ipv6_addr_equal(tw6-tw_v6_daddr, saddr) 
-  ipv6_addr_equal(tw6-tw_v6_rcv_saddr, daddr) 
-  (!sk2-sk_bound_dev_if || sk2-sk_bound_dev_if == dif)) {
+   if (INET6_TW_MATCH(sk2, hash, saddr, daddr, ports, dif)) {
if (twsk_unique(sk, sk2, twp))
goto unique;
else
-- 
1.5.3.4



[PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread Pavel Emelyanov
These two functions are the same except for what they call
as check_established and hash for a socket.

This saves about half a kilobyte of code for ipv4 and ipv6 combined.

 add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546)
 function old new   delta
 __inet_hash_connect- 577+577
 arp_ignore   108 113  +5
 static.hint8   4  -4
 rt_worker_func   376 372  -4
 inet6_hash_connect   584  25-559
 inet_hash_connect586  25-561

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/net/inet_hashtables.h |5 ++
 net/ipv4/inet_hashtables.c|   32 +-
 net/ipv6/inet6_hashtables.c   |   93 +
 3 files changed, 28 insertions(+), 102 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 761bdc0..a34a8f2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -413,6 +413,11 @@ static inline struct sock *inet_lookup(struct 
inet_hashinfo *hashinfo,
return sk;
 }
 
+extern int __inet_hash_connect(struct inet_timewait_death_row *death_row,
+   struct sock *sk,
+   int (*check_established)(struct inet_timewait_death_row *,
+   struct sock *, __u16, struct inet_timewait_sock **),
+   void (*hash)(struct inet_hashinfo *, struct sock *));
 extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
 struct sock *sk);
 #endif /* _INET_HASHTABLES_H */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 619c63c..b93d40f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -348,11 +348,11 @@ void __inet_hash(struct inet_hashinfo *hashinfo, struct 
sock *sk)
 }
 EXPORT_SYMBOL_GPL(__inet_hash);
 
-/*
- * Bind a port for a connect operation and hash it.
- */
-int inet_hash_connect(struct inet_timewait_death_row *death_row,
- struct sock *sk)
+int __inet_hash_connect(struct inet_timewait_death_row *death_row,
+   struct sock *sk,
+   int (*check_established)(struct inet_timewait_death_row *,
+   struct sock *, __u16, struct inet_timewait_sock **),
+   void (*hash)(struct inet_hashinfo *, struct sock *))
 {
struct inet_hashinfo *hinfo = death_row-hashinfo;
const unsigned short snum = inet_sk(sk)-num;
@@ -385,9 +385,8 @@ int inet_hash_connect(struct inet_timewait_death_row 
*death_row,
BUG_TRAP(!hlist_empty(tb-owners));
if (tb-fastreuse = 0)
goto next_port;
-   if (!__inet_check_established(death_row,
- sk, port,
- tw))
+   if (!check_established(death_row, sk,
+   port, tw))
goto ok;
goto next_port;
}
@@ -415,7 +414,7 @@ ok:
inet_bind_hash(sk, tb, port);
if (sk_unhashed(sk)) {
inet_sk(sk)-sport = htons(port);
-   __inet_hash_nolisten(hinfo, sk);
+   hash(hinfo, sk);
}
spin_unlock(head-lock);
 
@@ -432,17 +431,28 @@ ok:
tb  = inet_csk(sk)-icsk_bind_hash;
spin_lock_bh(head-lock);
if (sk_head(tb-owners) == sk  !sk-sk_bind_node.next) {
-   __inet_hash_nolisten(hinfo, sk);
+   hash(hinfo, sk);
spin_unlock_bh(head-lock);
return 0;
} else {
spin_unlock(head-lock);
/* No definite answer... Walk to established hash table */
-   ret = __inet_check_established(death_row, sk, snum, NULL);
+   ret = check_established(death_row, sk, snum, NULL);
 out:
local_bh_enable();
return ret;
}
 }
+EXPORT_SYMBOL_GPL(__inet_hash_connect);
+
+/*
+ * Bind a port for a connect operation and hash it.
+ */
+int inet_hash_connect(struct inet_timewait_death_row *death_row,
+ struct sock *sk)
+{
+   return __inet_hash_connect(death_row, sk,
+   __inet_check_established, __inet_hash_nolisten);
+}
 
 EXPORT_SYMBOL_GPL(inet_hash_connect);
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 06b01be..ece6d0e 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ 

NET: AX88796 use dev_dbg() instead of printk()

2008-01-31 Thread Ben Dooks
Change to using dev_dbg() and the other dev_xxx()
macros instead of printk, and update to use the
print_mac() helper.

Signed-off-by: Ben Dooks [EMAIL PROTECTED]

Index: linux-2.6.24-quilt1/drivers/net/ax88796.c
===
--- linux-2.6.24-quilt1.orig/drivers/net/ax88796.c
+++ linux-2.6.24-quilt1/drivers/net/ax88796.c
@@ -137,11 +137,12 @@ static int ax_initial_check(struct net_d
 static void ax_reset_8390(struct net_device *dev)
 {
struct ei_device *ei_local = netdev_priv(dev);
+   struct ax_device  *ax = to_ax_dev(dev);
unsigned long reset_start_time = jiffies;
void __iomem *addr = (void __iomem *)dev-base_addr;
 
if (ei_debug  1)
-   printk(KERN_DEBUG resetting the 8390 t=%ld..., jiffies);
+   dev_dbg(ax-dev-dev, resetting the 8390 t=%ld\n, jiffies);
 
ei_outb(ei_inb(addr + NE_RESET), addr + NE_RESET);
 
@@ -151,7 +152,7 @@ static void ax_reset_8390(struct net_dev
/* This check _should_not_ be necessary, omit eventually. */
while ((ei_inb(addr + EN0_ISR)  ENISR_RESET) == 0) {
if (jiffies - reset_start_time  2*HZ/100) {
-   printk(KERN_WARNING %s: %s did not complete.\n,
+   dev_warn(ax-dev-dev, %s: %s did not complete.\n,
   __FUNCTION__, dev-name);
break;
}
@@ -165,13 +166,15 @@ static void ax_get_8390_hdr(struct net_d
int ring_page)
 {
struct ei_device *ei_local = netdev_priv(dev);
+   struct ax_device  *ax = to_ax_dev(dev);
void __iomem *nic_base = ei_local-mem;
 
/* This *shouldn't* happen. If it does, it's the last thing you'll see 
*/
if (ei_status.dmaing) {
-   printk(KERN_EMERG %s: DMAing conflict in %s 
[DMAstat:%d][irqlock:%d].\n,
+   dev_err(ax-dev-dev, %s: DMAing conflict in %s 
+   [DMAstat:%d][irqlock:%d].\n,
dev-name, __FUNCTION__,
-  ei_status.dmaing, ei_status.irqlock);
+   ei_status.dmaing, ei_status.irqlock);
return;
}
 
@@ -204,13 +207,16 @@ static void ax_block_input(struct net_de
   struct sk_buff *skb, int ring_offset)
 {
struct ei_device *ei_local = netdev_priv(dev);
+   struct ax_device  *ax = to_ax_dev(dev);
void __iomem *nic_base = ei_local-mem;
char *buf = skb-data;
 
if (ei_status.dmaing) {
-   printk(KERN_EMERG %s: DMAing conflict in ax_block_input 
+   dev_err(ax-dev-dev,
+   %s: DMAing conflict in %s 
[DMAstat:%d][irqlock:%d].\n,
-   dev-name, ei_status.dmaing, ei_status.irqlock);
+   dev-name, __FUNCTION__,
+   ei_status.dmaing, ei_status.irqlock);
return;
}
 
@@ -239,6 +245,7 @@ static void ax_block_output(struct net_d
const unsigned char *buf, const int start_page)
 {
struct ei_device *ei_local = netdev_priv(dev);
+   struct ax_device  *ax = to_ax_dev(dev);
void __iomem *nic_base = ei_local-mem;
unsigned long dma_start;
 
@@ -251,7 +258,7 @@ static void ax_block_output(struct net_d
 
/* This *shouldn't* happen. If it does, it's the last thing you'll see 
*/
if (ei_status.dmaing) {
-   printk(KERN_EMERG %s: DMAing conflict in %s.
+   dev_err(ax-dev-dev, %s: DMAing conflict in %s.
[DMAstat:%d][irqlock:%d]\n,
dev-name, __FUNCTION__,
   ei_status.dmaing, ei_status.irqlock);
@@ -281,7 +288,8 @@ static void ax_block_output(struct net_d
 
while ((ei_inb(nic_base + EN0_ISR)  ENISR_RDC) == 0) {
if (jiffies - dma_start  2*HZ/100) {   /* 20ms */
-   printk(KERN_WARNING %s: timeout waiting for Tx 
RDC.\n, dev-name);
+   dev_warn(ax-dev-dev,
+%s: timeout waiting for Tx RDC.\n, 
dev-name);
ax_reset_8390(dev);
ax_NS8390_init(dev,1);
break;
@@ -424,10 +432,11 @@ static void
 ax_phy_write(struct net_device *dev, int phy_addr, int reg, int value)
 {
struct ei_device *ei = (struct ei_device *) netdev_priv(dev);
+   struct ax_device  *ax = to_ax_dev(dev);
unsigned long flags;
 
-   printk(KERN_DEBUG %s: %p, %04x, %04x %04x\n,
-  __FUNCTION__, dev, phy_addr, reg, value);
+   dev_dbg(ax-dev-dev, %s: %p, %04x, %04x %04x\n,
+   __FUNCTION__, dev, phy_addr, reg, value);
 
spin_lock_irqsave(ei-page_lock, flags);
 
@@ -750,14 +759,11 @@ static int ax_init_dev(struct net_device
ax_NS8390_init(dev, 0);
 
if (first_init) {
-  

[PATCH 0/6] preparations to enable netdevice notifiers inside a namespace (resend)

2008-01-31 Thread Denis V. Lunev
Here are some preparations and cleanups to enable network device/inet
address notifiers inside a namespace.

This set of patches was originally sent last Friday. One cleanup
patch from the original series has been dropped as wrong, thanks to Daniel
Lezcano.


Re: ipcomp regression in 2.6.24

2008-01-31 Thread Beschorner Daniel
 Applied and tested on 2.6.24: ipcomp is working now.
 As always, thanks a lot Herbert for fixing this.

Thank you too, I applied the 2 patches and it works.

Daniel




[PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.

2008-01-31 Thread Denis V. Lunev
net->ipv4.fib_table_hash is not freed when fib4_rules_init fails. The problem
was introduced by the following commit:
commit c8050bf6d84785a7edd2e81591e8f833231477e8
Author: Denis V. Lunev [EMAIL PROTECTED]
Date:   Thu Jan 10 03:28:24 2008 -0800

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/fib_frontend.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d282618..d0507f4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -975,6 +975,7 @@ static struct notifier_block fib_netdev_notifier = {
 
 static int __net_init ip_fib_net_init(struct net *net)
 {
+   int err;
unsigned int i;
 
net-ipv4.fib_table_hash = kzalloc(
@@ -985,7 +986,14 @@ static int __net_init ip_fib_net_init(struct net *net)
for (i = 0; i  FIB_TABLE_HASHSZ; i++)
INIT_HLIST_HEAD(net-ipv4.fib_table_hash[i]);
 
-   return fib4_rules_init(net);
+   err = fib4_rules_init(net);
+   if (err  0)
+   goto fail;
+   return 0;
+
+fail:
+   kfree(net-ipv4.fib_table_hash);
+   return err;
 }
 
 static void __net_exit ip_fib_net_exit(struct net *net)
-- 
1.5.3.rc5



[PATCH 2/6] [IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.

2008-01-31 Thread Denis V. Lunev
Remove the error code assignment inside the brackets on the failure paths.
The code looks better if the error is assigned before the condition check,
and the compiler handles it better, too.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/devinet.c |   21 -
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 21f71bf..9da4c68 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -492,39 +492,34 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr 
*nlh)
struct ifaddrmsg *ifm;
struct net_device *dev;
struct in_device *in_dev;
-   int err = -EINVAL;
+   int err;
 
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy);
if (err  0)
goto errout;
 
ifm = nlmsg_data(nlh);
-   if (ifm-ifa_prefixlen  32 || tb[IFA_LOCAL] == NULL) {
-   err = -EINVAL;
+   err = -EINVAL;
+   if (ifm-ifa_prefixlen  32 || tb[IFA_LOCAL] == NULL)
goto errout;
-   }
 
dev = __dev_get_by_index(init_net, ifm-ifa_index);
-   if (dev == NULL) {
-   err = -ENODEV;
+   err = -ENODEV;
+   if (dev == NULL)
goto errout;
-   }
 
in_dev = __in_dev_get_rtnl(dev);
-   if (in_dev == NULL) {
-   err = -ENOBUFS;
+   err = -ENOBUFS;
+   if (in_dev == NULL)
goto errout;
-   }
 
ifa = inet_alloc_ifa();
-   if (ifa == NULL) {
+   if (ifa == NULL)
/*
 * A potential indev allocation can be left alive, it stays
 * assigned to its device and is destroy with it.
 */
-   err = -ENOBUFS;
goto errout;
-   }
 
ipv4_devconf_setall(in_dev);
in_dev_hold(in_dev);
-- 
1.5.3.rc5



[PATCH 5/6] [NETNS]: Add a namespace mark to fib_info.

2008-01-31 Thread Denis V. Lunev
This is required to make fib_info lookups namespace aware. Otherwise,
devices of the initial namespace are marked as dead in the local routing
table when another namespace is stopped.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |1 +
 net/ipv4/fib_semantics.c |8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1b2f008..cb0df37 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -69,6 +69,7 @@ struct fib_nh {
 struct fib_info {
struct hlist_node   fib_hash;
struct hlist_node   fib_lhash;
+   struct net  *fib_net;
int fib_treeref;
atomic_tfib_clntref;
int fib_dead;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 5beff2e..97cc494 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -687,6 +687,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
struct fib_info *fi = NULL;
struct fib_info *ofi;
int nhs = 1;
+   struct net *net = cfg-fc_nlinfo.nl_net;
 
/* Fast check to catch the most weird cases */
if (fib_props[cfg-fc_type].scope  cfg-fc_scope)
@@ -727,6 +728,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
goto failure;
fib_info_cnt++;
 
+   fi-fib_net = net;
fi-fib_protocol = cfg-fc_protocol;
fi-fib_flags = cfg-fc_flags;
fi-fib_priority = cfg-fc_priority;
@@ -798,8 +800,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (nhs != 1 || nh-nh_gw)
goto err_inval;
nh-nh_scope = RT_SCOPE_NOWHERE;
-   nh-nh_dev = dev_get_by_index(cfg-fc_nlinfo.nl_net,
- fi-fib_nh-nh_oif);
+   nh-nh_dev = dev_get_by_index(net, fi-fib_nh-nh_oif);
err = -ENODEV;
if (nh-nh_dev == NULL)
goto failure;
@@ -813,8 +814,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
if (fi-fib_prefsrc) {
if (cfg-fc_type != RTN_LOCAL || !cfg-fc_dst ||
fi-fib_prefsrc != cfg-fc_dst)
-   if (inet_addr_type(cfg-fc_nlinfo.nl_net,
-  fi-fib_prefsrc) != RTN_LOCAL)
+   if (inet_addr_type(net, fi-fib_prefsrc) != RTN_LOCAL)
goto err_inval;
}
 
-- 
1.5.3.rc5



[PATCH 3/6] [NETNS]: Process interface address manipulation routines in the namespace.

2008-01-31 Thread Denis V. Lunev
The namespace is available where required, except in rtm_to_ifaddr. Add a
namespace argument to it.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/devinet.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index e55c85e..6a6e92e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -485,7 +485,7 @@ errout:
return err;
 }
 
-static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
+static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh)
 {
struct nlattr *tb[IFA_MAX+1];
struct in_ifaddr *ifa;
@@ -503,7 +503,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
if (ifm-ifa_prefixlen  32 || tb[IFA_LOCAL] == NULL)
goto errout;
 
-   dev = __dev_get_by_index(init_net, ifm-ifa_index);
+   dev = __dev_get_by_index(net, ifm-ifa_index);
err = -ENODEV;
if (dev == NULL)
goto errout;
@@ -571,7 +571,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct 
nlmsghdr *nlh, void *arg
if (net != init_net)
return -EINVAL;
 
-   ifa = rtm_to_ifaddr(nlh);
+   ifa = rtm_to_ifaddr(net, nlh);
if (IS_ERR(ifa))
return PTR_ERR(ifa);
 
@@ -1189,7 +1189,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct 
netlink_callback *cb)
 
s_ip_idx = ip_idx = cb-args[1];
idx = 0;
-   for_each_netdev(init_net, dev) {
+   for_each_netdev(net, dev) {
if (idx  s_idx)
goto cont;
if (idx  s_idx)
@@ -1223,7 +1223,9 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, 
struct nlmsghdr *nlh,
struct sk_buff *skb;
u32 seq = nlh ? nlh-nlmsg_seq : 0;
int err = -ENOBUFS;
+   struct net *net;
 
+   net = ifa-ifa_dev-dev-nd_net;
skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL);
if (skb == NULL)
goto errout;
@@ -1235,10 +1237,10 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, 
struct nlmsghdr *nlh,
kfree_skb(skb);
goto errout;
}
-   err = rtnl_notify(skb, init_net, pid, RTNLGRP_IPV4_IFADDR, nlh, 
GFP_KERNEL);
+   err = rtnl_notify(skb, net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL);
 errout:
if (err  0)
-   rtnl_set_sk_err(init_net, RTNLGRP_IPV4_IFADDR, err);
+   rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err);
 }
 
 #ifdef CONFIG_SYSCTL
-- 
1.5.3.rc5



[PATCH 4/6] [IPV4]: fib_sync_down rework.

2008-01-31 Thread Denis V. Lunev
fib_sync_down can be called with an address and with a device. In reality
it is called either with an address OR with a device. The codepath inside is
completely different, so let's separate it into two calls for these two
cases.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |3 +-
 net/ipv4/fib_frontend.c  |4 +-
 net/ipv4/fib_semantics.c |  104 +++--
 3 files changed, 57 insertions(+), 54 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 9daa60b..1b2f008 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -218,7 +218,8 @@ extern void fib_select_default(struct net *net, const 
struct flowi *flp,
 
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
-extern int fib_sync_down(__be32 local, struct net_device *dev, int force);
+extern int fib_sync_down_dev(struct net_device *dev, int force);
+extern int fib_sync_down_addr(__be32 local);
 extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result 
*res);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d0507f4..d69ffa2 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa)
   First of all, we scan fib_info list searching
   for stray nexthop entries, then ignite fib_flush.
*/
-   if (fib_sync_down(ifa-ifa_local, NULL, 0))
+   if (fib_sync_down_addr(ifa-ifa_local))
fib_flush(dev-nd_net);
}
}
@@ -898,7 +898,7 @@ static void nl_fib_lookup_exit(struct net *net)
 
 static void fib_disable_ip(struct net_device *dev, int force)
 {
-   if (fib_sync_down(0, dev, force))
+   if (fib_sync_down_dev(dev, force))
fib_flush(dev-nd_net);
rt_cache_flush(0);
arp_ifdown(dev);
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c791286..5beff2e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1031,70 +1031,72 @@ nla_put_failure:
  referring to it.
- device went down - we must shutdown all nexthops going via it.
  */
-
-int fib_sync_down(__be32 local, struct net_device *dev, int force)
+int fib_sync_down_addr(__be32 local)
 {
int ret = 0;
-   int scope = RT_SCOPE_NOWHERE;
-
-   if (force)
-   scope = -1;
+   unsigned int hash = fib_laddr_hashfn(local);
+   struct hlist_head *head = fib_info_laddrhash[hash];
+   struct hlist_node *node;
+   struct fib_info *fi;
 
-   if (local  fib_info_laddrhash) {
-   unsigned int hash = fib_laddr_hashfn(local);
-   struct hlist_head *head = fib_info_laddrhash[hash];
-   struct hlist_node *node;
-   struct fib_info *fi;
+   if (fib_info_laddrhash == NULL || local == 0)
+   return 0;
 
-   hlist_for_each_entry(fi, node, head, fib_lhash) {
-   if (fi-fib_prefsrc == local) {
-   fi-fib_flags |= RTNH_F_DEAD;
-   ret++;
-   }
+   hlist_for_each_entry(fi, node, head, fib_lhash) {
+   if (fi-fib_prefsrc == local) {
+   fi-fib_flags |= RTNH_F_DEAD;
+   ret++;
}
}
+   return ret;
+}
 
-   if (dev) {
-   struct fib_info *prev_fi = NULL;
-   unsigned int hash = fib_devindex_hashfn(dev-ifindex);
-   struct hlist_head *head = fib_info_devhash[hash];
-   struct hlist_node *node;
-   struct fib_nh *nh;
+int fib_sync_down_dev(struct net_device *dev, int force)
+{
+   int ret = 0;
+   int scope = RT_SCOPE_NOWHERE;
+   struct fib_info *prev_fi = NULL;
+   unsigned int hash = fib_devindex_hashfn(dev-ifindex);
+   struct hlist_head *head = fib_info_devhash[hash];
+   struct hlist_node *node;
+   struct fib_nh *nh;
 
-   hlist_for_each_entry(nh, node, head, nh_hash) {
-   struct fib_info *fi = nh-nh_parent;
-   int dead;
+   if (force)
+   scope = -1;
 
-   BUG_ON(!fi-fib_nhs);
-   if (nh-nh_dev != dev || fi == prev_fi)
-   continue;
-   prev_fi = fi;
-   dead = 0;
-   change_nexthops(fi) {
-   if (nh-nh_flagsRTNH_F_DEAD)
-   dead++;
-   else if (nh-nh_dev == dev 
-nh-nh_scope != scope) {
-

[PATCH 6/6] [NETNS]: Lookup in FIB semantic hashes taking into account the namespace.

2008-01-31 Thread Denis V. Lunev
The namespace is not available in fib_sync_down_addr, so add it
as a parameter.

Looking up a device by a pointer to it is OK. Looking up using a result
from a fib_trie/fib_hash table lookup is also safe. No need to fix that at all.
So, just fix the lookup by address and the insertion into the hash table.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |2 +-
 net/ipv4/fib_frontend.c  |2 +-
 net/ipv4/fib_semantics.c |6 +-
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index cb0df37..90d1175 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -220,7 +220,7 @@ extern void fib_select_default(struct net *net, const 
struct flowi *flp,
 /* Exported by fib_semantics.c */
 extern int ip_fib_check_default(__be32 gw, struct net_device *dev);
 extern int fib_sync_down_dev(struct net_device *dev, int force);
-extern int fib_sync_down_addr(__be32 local);
+extern int fib_sync_down_addr(struct net *net, __be32 local);
 extern int fib_sync_up(struct net_device *dev);
 extern __be32  __fib_res_prefsrc(struct fib_result *res);
 extern void fib_select_multipath(const struct flowi *flp, struct fib_result 
*res);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d69ffa2..86ff271 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa)
   First of all, we scan fib_info list searching
   for stray nexthop entries, then ignite fib_flush.
*/
-   if (fib_sync_down_addr(ifa->ifa_local))
+   if (fib_sync_down_addr(dev->nd_net, ifa->ifa_local))
fib_flush(dev->nd_net);
}
}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 97cc494..a13c847 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -229,6 +229,8 @@ static struct fib_info *fib_find_info(const struct fib_info 
*nfi)
head = fib_info_hash[hash];
 
hlist_for_each_entry(fi, node, head, fib_hash) {
+   if (fi->fib_net != nfi->fib_net)
+   continue;
if (fi->fib_nhs != nfi->fib_nhs)
continue;
if (nfi->fib_protocol == fi->fib_protocol &&
@@ -1031,7 +1033,7 @@ nla_put_failure:
  referring to it.
   - device went down -> we must shutdown all nexthops going via it.
  */
-int fib_sync_down_addr(__be32 local)
+int fib_sync_down_addr(struct net *net, __be32 local)
 {
int ret = 0;
unsigned int hash = fib_laddr_hashfn(local);
@@ -1043,6 +1045,8 @@ int fib_sync_down_addr(__be32 local)
return 0;
 
hlist_for_each_entry(fi, node, head, fib_lhash) {
+   if (fi->fib_net != net)
+   continue;
if (fi->fib_prefsrc == local) {
fi->fib_flags |= RTNH_F_DEAD;
ret++;
-- 
1.5.3.rc5



[PATCH] macb: Fix section mismatch and shrink runtime footprint

2008-01-31 Thread Haavard Skinnemoen
macb devices are only found integrated on SoCs, so they can't be
hotplugged. Thus, the probe() and exit() functions can be __init and
__exit, respectively. By using platform_driver_probe() instead of
platform_driver_register(), there won't be any references to the
discarded probe() function after the driver has loaded.

This also fixes a section mismatch due to macb_probe(), defined as
__devinit, calling macb_get_hwaddr, defined as __init.
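
(For illustration: a minimal, hypothetical platform driver sketch of the
pattern the description above refers to. Driver and function names are
invented, not taken from macb; the point is only that the probe routine can
stay __init because it is passed to platform_driver_probe() rather than
stored in the driver struct.)

#include <linux/init.h>
#include <linux/module.h>
#include <linux/platform_device.h>

/* Hypothetical SoC-only device: probed once at boot, never hotplugged. */
static int __init foo_probe(struct platform_device *pdev)
{
	/* ioremap registers, allocate and register the net_device, ... */
	return 0;
}

static int __exit foo_remove(struct platform_device *pdev)
{
	/* unregister and free everything foo_probe() set up */
	return 0;
}

static struct platform_driver foo_driver = {
	/* no .probe here: that would keep a pointer to discarded __init code */
	.remove	= __exit_p(foo_remove),
	.driver	= {
		.name	= "foo",
	},
};

static int __init foo_init(void)
{
	/* the probe function is passed here instead, so it may live in .init.text */
	return platform_driver_probe(&foo_driver, foo_probe);
}
module_init(foo_init);

static void __exit foo_exit(void)
{
	platform_driver_unregister(&foo_driver);
}
module_exit(foo_exit);

MODULE_LICENSE("GPL");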

Signed-off-by: Haavard Skinnemoen [EMAIL PROTECTED]
---
 drivers/net/macb.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/macb.c b/drivers/net/macb.c
index e10528e..81bf005 100644
--- a/drivers/net/macb.c
+++ b/drivers/net/macb.c
@@ -1084,7 +1084,7 @@ static int macb_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
return phy_mii_ioctl(phydev, if_mii(rq), cmd);
 }
 
-static int __devinit macb_probe(struct platform_device *pdev)
+static int __init macb_probe(struct platform_device *pdev)
 {
struct eth_platform_data *pdata;
struct resource *regs;
@@ -1248,7 +1248,7 @@ err_out:
return err;
 }
 
-static int __devexit macb_remove(struct platform_device *pdev)
+static int __exit macb_remove(struct platform_device *pdev)
 {
struct net_device *dev;
struct macb *bp;
@@ -1276,8 +1276,7 @@ static int __devexit macb_remove(struct platform_device 
*pdev)
 }
 
 static struct platform_driver macb_driver = {
-   .probe  = macb_probe,
-   .remove = __devexit_p(macb_remove),
+   .remove = __exit_p(macb_remove),
.driver = {
.name   = "macb",
},
@@ -1285,7 +1284,7 @@ static struct platform_driver macb_driver = {
 
 static int __init macb_init(void)
 {
-   return platform_driver_register(macb_driver);
+   return platform_driver_probe(macb_driver, macb_probe);
 }
 
 static void __exit macb_exit(void)
-- 
1.5.3.8



Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Good morning (my TZ),

I'll try to answer all questions; however, if I miss something big, 
please point my nose at it again.


Rick Jones wrote:

As asked in LKML thread, please post the exact netperf command used to
start the client/server, whether or not you're using irqbalanced (aka
irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI,
right?)


netperf was used without any special tuning parameters. Usually we start 
two processes on two hosts which start (almost) simultaneously, last for 
20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e.


on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20
on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20

192.168.0.20[23] here is on eth0, which cannot do jumbo frames, so we 
use the 192.168.2.x addresses on eth1 for a range of MTUs.


The server is started on both nodes with the start-stop-daemon and no 
special parameters I'm aware of.


/proc/interrupts shows PCI-MSI-edge, so yes, I think we are using MSI.

In particular, it would be good to know if you are doing two concurrent 
streams, or if you are using the burst mode TCP_RR with large 
request/response sizes method which then is only using one connection.




As outlined above: Two concurrent streams right now. If you think TCP_RR 
should be better I'm happy to rerun some tests.


More in other emails.

I'll wade through them slowly.

Carsten


[PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen

TSO interacts badly with many queueing disciplines because they rely on 
reordering packets from different streams and the large TSO packets can 
make this difficult. This patch disables TSO for sockets that send over 
devices with non standard queueing disciplines. That's anything but noop 
or pfifo_fast and pfifo right now.

Longer term other queueing disciplines could be checked if they
are also ok with TSO. If yes they can set the TCQ_F_GSO_OK flag too.

It is still enabled for the standard pfifo_fast because that will never
reorder packets with the same type-of-service. This means 99+% of all users
will still be able to use TSO just fine.

The status is only set up at socket creation, so a route change
will not re-enable TSO on an existing socket. I don't think that's a 
problem though.
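
(For illustration only, not part of this patch: a qdisc author who knows the
discipline never reorders packets within a flow could opt in from the init
callback, roughly like this hypothetical example against 2.6.24. The only
real symbol used is the TCQ_F_GSO_OK flag introduced below.)

static int example_qdisc_init(struct Qdisc *sch, struct rtattr *opt)
{
	/* ... parse options, set up the internal queues ... */

	/* Declare that large GSO/TSO packets are acceptable because this
	 * discipline never reorders packets belonging to one flow. */
	sch->flags |= TCQ_F_GSO_OK;

	return 0;
}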

Signed-off-by: Andi Kleen [EMAIL PROTECTED]

---
 include/net/sch_generic.h |1 +
 net/core/sock.c   |3 +++
 net/sched/sch_generic.c   |5 +++--
 3 files changed, 7 insertions(+), 2 deletions(-)

Index: linux/include/net/sch_generic.h
===
--- linux.orig/include/net/sch_generic.h
+++ linux/include/net/sch_generic.h
@@ -31,6 +31,7 @@ struct Qdisc
 #define TCQ_F_BUILTIN  1
 #define TCQ_F_THROTTLED2
 #define TCQ_F_INGRESS  4
+#define TCQ_F_GSO_OK   8
int padded;
struct Qdisc_ops*ops;
u32 handle;
Index: linux/net/sched/sch_generic.c
===
--- linux.orig/net/sched/sch_generic.c
+++ linux/net/sched/sch_generic.c
@@ -307,7 +307,7 @@ struct Qdisc_ops noop_qdisc_ops __read_m
 struct Qdisc noop_qdisc = {
.enqueue=   noop_enqueue,
.dequeue=   noop_dequeue,
-   .flags  =   TCQ_F_BUILTIN,
+   .flags  =   TCQ_F_BUILTIN | TCQ_F_GSO_OK,
.ops=   noop_qdisc_ops,
.list   =   LIST_HEAD_INIT(noop_qdisc.list),
 };
@@ -325,7 +325,7 @@ static struct Qdisc_ops noqueue_qdisc_op
 static struct Qdisc noqueue_qdisc = {
.enqueue=   NULL,
.dequeue=   noop_dequeue,
-   .flags  =   TCQ_F_BUILTIN,
+   .flags  =   TCQ_F_BUILTIN | TCQ_F_GSO_OK,
.ops=   noqueue_qdisc_ops,
.list   =   LIST_HEAD_INIT(noqueue_qdisc.list),
 };
@@ -538,6 +538,7 @@ void dev_activate(struct net_device *dev
printk(KERN_INFO "%s: activation failed\n", dev->name);
return;
}
+   qdisc->flags |= TCQ_F_GSO_OK;
list_add_tail(&qdisc->list, &dev->qdisc_list);
} else {
qdisc =  noqueue_qdisc;
Index: linux/net/core/sock.c
===
--- linux.orig/net/core/sock.c
+++ linux/net/core/sock.c
@@ -112,6 +112,7 @@
 #include <linux/tcp.h>
 #include <linux/init.h>
 #include <linux/highmem.h>
+#include <net/sch_generic.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -1062,6 +1063,8 @@ void sk_setup_caps(struct sock *sk, stru
 {
__sk_dst_set(sk, dst);
sk->sk_route_caps = dst->dev->features;
+   if (!(dst->dev->qdisc->flags & TCQ_F_GSO_OK))
+   sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
if (sk_can_gso(sk)) {
Index: linux/net/sched/sch_fifo.c
===
--- linux.orig/net/sched/sch_fifo.c
+++ linux/net/sched/sch_fifo.c
@@ -62,6 +62,7 @@ static int fifo_init(struct Qdisc *sch, 
 
q->limit = ctl->limit;
}
+   sch->flags |= TCQ_F_GSO_OK;
 
return 0;
 }


Re: hard hang through qdisc II

2008-01-31 Thread Andi Kleen
On Thursday 31 January 2008 13:21:00 Andi Kleen wrote:
 
 I just managed to hang a 2.6.24 (+ some non network patches) kernel 
 with the following (non sensical) command

Correction: the kernel was actually a git linus kernel with David's 
recent merge included.

I found it's pretty easy to hang the kernel with various tbf parameters.

-Andi

 tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
 
 No oops or anything just hangs. While I understand root can
 do bad things just hanging like this seems a little extreme.
 
 -Andi
 




[PATCH 6/6][NETNS]: Udp sockets per-net lookup.

2008-01-31 Thread Pavel Emelyanov
Add the net parameter to udp_get_port family of calls and 
udp_lookup one and use it to filter sockets.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 net/ipv4/udp.c |   25 ++---
 net/ipv6/udp.c |   10 ++
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2fb8d73..7ea1b67 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -130,14 +130,14 @@ EXPORT_SYMBOL(sysctl_udp_wmem_min);
 atomic_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
-static inline int __udp_lib_lport_inuse(__u16 num,
+static inline int __udp_lib_lport_inuse(struct net *net, __u16 num,
const struct hlist_head udptable[])
 {
struct sock *sk;
struct hlist_node *node;
 
sk_for_each(sk, node, &udptable[num & (UDP_HTABLE_SIZE - 1)])
-   if (sk->sk_hash == num)
+   if (sk->sk_net == net && sk->sk_hash == num)
return 1;
return 0;
 }
@@ -159,6 +159,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
struct hlist_head *head;
struct sock *sk2;
interror = 1;
+   struct net *net = sk->sk_net;
 
write_lock_bh(&udp_hash_lock);
 
@@ -198,7 +199,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
/* 2nd pass: find hole in shortest hash chain */
rover = best;
for (i = 0; i < (1 << 16) / UDP_HTABLE_SIZE; i++) {
-   if (! __udp_lib_lport_inuse(rover, udptable))
+   if (! __udp_lib_lport_inuse(net, rover, udptable))
goto gotit;
rover += UDP_HTABLE_SIZE;
if (rover > high)
@@ -218,6 +219,7 @@ gotit:
sk_for_each(sk2, node, head)
if (sk2->sk_hash == snum &&
sk2 != sk &&
+   sk2->sk_net == net &&
(!sk2->sk_reuse || !sk->sk_reuse) &&
(!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if
 || sk2->sk_bound_dev_if == sk->sk_bound_dev_if) &&
@@ -261,9 +263,9 @@ static inline int udp_v4_get_port(struct sock *sk, unsigned 
short snum)
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
  * harder than this. -DaveM
  */
-static struct sock *__udp4_lib_lookup(__be32 saddr, __be16 sport,
- __be32 daddr, __be16 dport,
- int dif, struct hlist_head udptable[])
+static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
+   __be16 sport, __be32 daddr, __be16 dport,
+   int dif, struct hlist_head udptable[])
 {
struct sock *sk, *result = NULL;
struct hlist_node *node;
@@ -274,7 +276,8 @@ static struct sock *__udp4_lib_lookup(__be32 saddr, __be16 
sport,
sk_for_each(sk, node, &udptable[hnum & (UDP_HTABLE_SIZE - 1)]) {
struct inet_sock *inet = inet_sk(sk);
 
-   if (sk->sk_hash == hnum && !ipv6_only_sock(sk)) {
+   if (sk->sk_net == net && sk->sk_hash == hnum &&
+   !ipv6_only_sock(sk)) {
int score = (sk->sk_family == PF_INET ? 1 : 0);
if (inet->rcv_saddr) {
if (inet->rcv_saddr != daddr)
@@ -361,8 +364,8 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct 
hlist_head udptable[])
int harderr;
int err;
 
-   sk = __udp4_lib_lookup(iph->daddr, uh->dest, iph->saddr, uh->source,
-  skb->dev->ifindex, udptable  );
+   sk = __udp4_lib_lookup(skb->dev->nd_net, iph->daddr, uh->dest,
+   iph->saddr, uh->source, skb->dev->ifindex, udptable);
if (sk == NULL) {
ICMP_INC_STATS_BH(ICMP_MIB_INERRORS);
return; /* No socket for error */
@@ -1185,8 +1188,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct hlist_head 
udptable[],
if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
return __udp4_lib_mcast_deliver(skb, uh, saddr, daddr, 
udptable);
 
-   sk = __udp4_lib_lookup(saddr, uh->source, daddr, uh->dest,
-  inet_iif(skb), udptable);
+   sk = __udp4_lib_lookup(skb->dev->nd_net, saddr, uh->source, daddr,
+   uh->dest, inet_iif(skb), udptable);
 
if (sk != NULL) {
int ret = 0;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index bd4b9df..53739de 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -56,7 +56,8 @@ static inline int udp_v6_get_port(struct sock *sk, unsigned 
short snum)
return udp_get_port(sk, snum, ipv6_rcv_saddr_equal);
 }
 
-static struct sock *__udp6_lib_lookup(struct in6_addr 

[PATCH 4/6][NETNS]: Tcp-v4 sockets per-net lookup.

2008-01-31 Thread Pavel Emelyanov
Add a net argument to inet_lookup and propagate it further
into lookup calls. Plus tune the __inet_check_established.

The dccp and inet_diag, which use that lookup functions
pass the init_net into them.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/net/inet_hashtables.h |   48 +++--
 net/dccp/ipv4.c   |6 ++--
 net/ipv4/inet_diag.c  |2 +-
 net/ipv4/inet_hashtables.c|   29 
 net/ipv4/tcp_ipv4.c   |   15 ++--
 5 files changed, 58 insertions(+), 42 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 55532b9..c23c4ed 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -302,15 +302,17 @@ out:
wake_up(&hashinfo->lhash_wait);
 }
 
-extern struct sock *__inet_lookup_listener(struct inet_hashinfo *hashinfo,
+extern struct sock *__inet_lookup_listener(struct net *net,
+  struct inet_hashinfo *hashinfo,
   const __be32 daddr,
   const unsigned short hnum,
   const int dif);
 
-static inline struct sock *inet_lookup_listener(struct inet_hashinfo *hashinfo,
-   __be32 daddr, __be16 dport, int 
dif)
+static inline struct sock *inet_lookup_listener(struct net *net,
+   struct inet_hashinfo *hashinfo,
+   __be32 daddr, __be16 dport, int dif)
 {
-   return __inet_lookup_listener(hashinfo, daddr, ntohs(dport), dif);
+   return __inet_lookup_listener(net, hashinfo, daddr, ntohs(dport), dif);
 }
 
 /* Socket demux engine toys. */
@@ -344,26 +346,26 @@ typedef __u64 __bitwise __addrpair;
 				   (((__force __u64)(__be32)(__daddr)) << 32) | \
 				   ((__force __u64)(__be32)(__saddr)));
 #endif /* __BIG_ENDIAN */
-#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash))				&&	\
+#define INET_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 ((*((__addrpair *)&(inet_sk(__sk)->daddr))) == (__cookie))	&&	\
 	 ((*((__portpair *)&(inet_sk(__sk)->dport))) == (__ports))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-#define INET_TW_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash))				&&	\
+#define INET_TW_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 ((*((__addrpair *)&(inet_twsk(__sk)->tw_daddr))) == (__cookie))	&&	\
 	 ((*((__portpair *)&(inet_twsk(__sk)->tw_dport))) == (__ports))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr)
-#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))				&&	\
+#define INET_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)	\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 (inet_sk(__sk)->daddr		== (__saddr))		&&	\
 	 (inet_sk(__sk)->rcv_saddr	== (__daddr))		&&	\
 	 ((*((__portpair *)&(inet_sk(__sk)->dport))) == (__ports))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-#define INET_TW_MATCH(__sk, __hash,__cookie, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))				&&	\
+#define INET_TW_MATCH(__sk, __net, __hash,__cookie, __saddr, __daddr, __ports, __dif)	\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 (inet_twsk(__sk)->tw_daddr	== (__saddr))		&&	\
 	 (inet_twsk(__sk)->tw_rcv_saddr	== (__daddr))		&&	\
 	 ((*((__portpair *)&(inet_twsk(__sk)->tw_dport))) == (__ports))	&&	\
@@ -376,32 +378,36 @@ typedef __u64 __bitwise __addrpair;
  *
  * Local BH must be disabled here.
  */
-extern struct sock * __inet_lookup_established(struct inet_hashinfo *hashinfo,
+extern struct sock * __inet_lookup_established(struct net *net,
+   struct inet_hashinfo *hashinfo,
const __be32 saddr, const __be16 sport,
const __be32 daddr, const u16 hnum, const int dif);
 
 static inline struct sock *
-   inet_lookup_established(struct inet_hashinfo *hashinfo,
+   inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
const 

[PATCH 3/6][NETNS]: Make bind buckets live in net namespaces.

2008-01-31 Thread Pavel Emelyanov
This tags the inet_bind_bucket struct with net pointer,
initializes it during creation and makes a filtering
during lookup.

A better hashfn that takes the net into account is still to be
done in the future; currently all bind buckets with the same
port will be in one hash chain.
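
(Purely as an illustration of that future work -- the helper below is
invented for this note and is not part of the patch: a net-aware bind
hash could simply mix the namespace pointer into the bucket index.)

static inline int inet_bhashfn_net(const struct net *net, const __u16 lport,
				   const int bhash_size)
{
	/* XOR the namespace pointer into the port so that buckets from
	 * different namespaces spread over different chains. */
	return ((unsigned long)net ^ lport) & (bhash_size - 1);
}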

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/net/inet_hashtables.h   |2 ++
 net/ipv4/inet_connection_sock.c |8 +---
 net/ipv4/inet_hashtables.c  |8 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a34a8f2..55532b9 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -74,6 +74,7 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-)  -DaveM
  */
 struct inet_bind_bucket {
+   struct net  *ib_net;
unsigned short  port;
signed shortfastreuse;
struct hlist_node   node;
@@ -194,6 +195,7 @@ static inline void inet_ehash_locks_free(struct 
inet_hashinfo *hashinfo)
 
 extern struct inet_bind_bucket *
inet_bind_bucket_create(struct kmem_cache *cachep,
+   struct net *net,
struct inet_bind_hashbucket *head,
const unsigned short snum);
 extern void inet_bind_bucket_destroy(struct kmem_cache *cachep,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 7801cce..de5a41d 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -87,6 +87,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
struct hlist_node *node;
struct inet_bind_bucket *tb;
int ret;
+   struct net *net = sk->sk_net;
 
local_bh_disable();
if (!snum) {
@@ -100,7 +101,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
head = &hashinfo->bhash[inet_bhashfn(rover, 
hashinfo->bhash_size)];
spin_lock(&head->lock);
inet_bind_bucket_for_each(tb, node, &head->chain)
-   if (tb->port == rover)
+   if (tb->ib_net == net && tb->port == rover)
goto next;
break;
next:
@@ -127,7 +128,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
head = &hashinfo->bhash[inet_bhashfn(snum, 
hashinfo->bhash_size)];
spin_lock(&head->lock);
inet_bind_bucket_for_each(tb, node, &head->chain)
-   if (tb->port == snum)
+   if (tb->ib_net == net && tb->port == snum)
goto tb_found;
}
tb = NULL;
@@ -147,7 +148,8 @@ tb_found:
}
 tb_not_found:
ret = 1;
-   if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep, 
head, snum)) == NULL)
+   if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
+   net, head, snum)) == NULL)
goto fail_unlock;
if (hlist_empty(&tb->owners)) {
if (sk->sk_reuse && sk->sk_state != TCP_LISTEN)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b93d40f..db1e53a 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -28,12 +28,14 @@
  * The bindhash mutex for snum's hash chain must be held here.
  */
 struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
+struct net *net,
 struct inet_bind_hashbucket 
*head,
 const unsigned short snum)
 {
struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
if (tb != NULL) {
+   tb->ib_net   = net;
tb->port  = snum;
tb->fastreuse = 0;
INIT_HLIST_HEAD(&tb->owners);
@@ -359,6 +361,7 @@ int __inet_hash_connect(struct inet_timewait_death_row 
*death_row,
struct inet_bind_hashbucket *head;
struct inet_bind_bucket *tb;
int ret;
+   struct net *net = sk->sk_net;
 
if (!snum) {
int i, remaining, low, high, port;
@@ -381,7 +384,7 @@ int __inet_hash_connect(struct inet_timewait_death_row 
*death_row,
 * unique enough.
 */
inet_bind_bucket_for_each(tb, node, &head->chain) {
-   if (tb->port == port) {
+   if (tb->ib_net == net && tb->port == port) {
BUG_TRAP(!hlist_empty(&tb->owners));
if (tb->fastreuse >= 0)
goto next_port;
@@ -392,7 

[PATCH 5/6][NETNS]: Tcp-v6 sockets per-net lookup.

2008-01-31 Thread Pavel Emelyanov
Add a net argument to inet6_lookup and propagate it further. 
Actually, this is tcp-v6 implementation of what was done for 
tcp-v4 sockets in a previous patch.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/linux/ipv6.h   |8 
 include/net/inet6_hashtables.h |   17 ++---
 net/dccp/ipv6.c|8 
 net/ipv4/inet_diag.c   |2 +-
 net/ipv6/inet6_hashtables.c|   25 ++---
 net/ipv6/tcp_ipv6.c|   19 ++-
 6 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index c347860..4aaefc3 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -457,16 +457,16 @@ static inline struct raw6_sock *raw6_sk(const struct sock 
*sk)
 #define inet_v6_ipv6only(__sk) 0
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
 
-#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))				&&	\
+#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 ((*((__portpair *)&(inet_sk(__sk)->dport))) == (__ports))	&&	\
 	 ((__sk)->sk_family		== AF_INET6)		&&	\
 	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))	&&	\
 	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 
-#define INET6_TW_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))				&&	\
+#define INET6_TW_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif) \
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports))	&&	\
 	 ((__sk)->sk_family	       == PF_INET6)			&&	\
 	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr)))	&&	\
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 668056b..fdff630 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -57,34 +57,37 @@ extern void __inet6_hash(struct inet_hashinfo *hashinfo, 
struct sock *sk);
  *
  * The sockhash lock must be held as a reader here.
  */
-extern struct sock *__inet6_lookup_established(struct inet_hashinfo *hashinfo,
+extern struct sock *__inet6_lookup_established(struct net *net,
+  struct inet_hashinfo *hashinfo,
   const struct in6_addr *saddr,
   const __be16 sport,
   const struct in6_addr *daddr,
   const u16 hnum,
   const int dif);
 
-extern struct sock *inet6_lookup_listener(struct inet_hashinfo *hashinfo,
+extern struct sock *inet6_lookup_listener(struct net *net,
+ struct inet_hashinfo *hashinfo,
  const struct in6_addr *daddr,
  const unsigned short hnum,
  const int dif);
 
-static inline struct sock *__inet6_lookup(struct inet_hashinfo *hashinfo,
+static inline struct sock *__inet6_lookup(struct net *net,
+ struct inet_hashinfo *hashinfo,
  const struct in6_addr *saddr,
  const __be16 sport,
  const struct in6_addr *daddr,
  const u16 hnum,
  const int dif)
 {
-   struct sock *sk = __inet6_lookup_established(hashinfo, saddr, sport,
-daddr, hnum, dif);
+   struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
+   sport, daddr, hnum, dif);
if (sk)
return sk;
 
-   return inet6_lookup_listener(hashinfo, daddr, hnum, dif);
+   return inet6_lookup_listener(net, hashinfo, daddr, hnum, dif);
 }
 
-extern struct sock *inet6_lookup(struct inet_hashinfo *hashinfo,
+extern struct sock *inet6_lookup(struct net *net, struct inet_hashinfo 
*hashinfo,
 const struct in6_addr *saddr, const __be16 
sport,
 const struct in6_addr *daddr, const __be16 
dport,
 const int dif);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index f42b75c..ed0a005 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -101,8 +101,8 @@ static void dccp_v6_err(struct sk_buff *skb, struct 
inet6_skb_parm *opt,
int err;

Re: [PATCH 5/6][NETNS]: Tcp-v6 sockets per-net lookup.

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:40:16 +0300

 Add a net argument to inet6_lookup and propagate it further. 
 Actually, this is tcp-v6 implementation of what was done for 
 tcp-v4 sockets in a previous patch.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied.


Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:41:58 +0300

 Add the net parameter to udp_get_port family of calls and 
 udp_lookup one and use it to filter sockets.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied.


Re: [PATCH retry] bluetooth : add conn add/del workqueues to avoid connection fail

2008-01-31 Thread Jens Axboe
On Wed, Jan 30 2008, Dave Young wrote:
 
 The bluetooth hci_conn sysfs add/del executed in the default workqueue.
 If the del_conn is executed after the new add_conn with same target,
 add_conn will fail with a warning about the same kobject name.
 
 Here add btaddconn  btdelconn workqueues,
 flush the btdelconn workqueue in the add_conn function to avoid the issue.
 
 Signed-off-by: Dave Young [EMAIL PROTECTED] 
 
 ---
 diff -upr a/net/bluetooth/hci_sysfs.c b/net/bluetooth/hci_sysfs.c
 --- a/net/bluetooth/hci_sysfs.c   2008-01-30 10:14:27.0 +0800
 +++ b/net/bluetooth/hci_sysfs.c   2008-01-30 10:14:14.0 +0800
 @@ -12,6 +12,8 @@
  #undef  BT_DBG
  #define BT_DBG(D...)
  #endif
 +static struct workqueue_struct *btaddconn;
 +static struct workqueue_struct *btdelconn;
  
  static inline char *typetostr(int type)
  {
 @@ -279,6 +281,7 @@ static void add_conn(struct work_struct 
   struct hci_conn *conn = container_of(work, struct hci_conn, work);
   int i;
  
 + flush_workqueue(btdelconn);
   if (device_add(&conn->dev) < 0) {
   BT_ERR("Failed to register connection device");
   return;
  @@ -313,6 +316,7 @@ void hci_conn_add_sysfs(struct hci_conn 
  
   INIT_WORK(&conn->work, add_conn);
  
  + queue_work(btaddconn, &conn->work);
   schedule_work(&conn->work);
  }

So you queue conn-work on both btaddconn and keventd_wq?

-- 
Jens Axboe



Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.

2008-01-31 Thread YOSHIFUJI Hideaki / 吉藤英明
In article [EMAIL PROTECTED] (at Thu, 31 Jan 2008 15:41:58 +0300), Pavel 
Emelyanov [EMAIL PROTECTED] says:

 Add the net parameter to udp_get_port family of calls and 
 udp_lookup one and use it to filter sockets.

I may miss something, but I'm afraid that I have to disagree.
Port is identified only by family, address, protocol and port,
and should not be split by name space.

--yoshfuji


Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.

2008-01-31 Thread David Miller
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED]
Date: Fri, 01 Feb 2008 00:11:38 +1100 (EST)

 In article [EMAIL PROTECTED] (at Thu, 31 Jan 2008 15:41:58 +0300), Pavel 
 Emelyanov [EMAIL PROTECTED] says:
 
  Add the net parameter to udp_get_port family of calls and 
  udp_lookup one and use it to filter sockets.
 
 I may miss something, but I'm afraid that I have to disagree.
 Port is identified only by family, address, protocol and port,
 and should not be split by name space.

It is like being on a totally different system.

Without sockets in namespaces, there is no point.

The networking devices are even per-namespace already,
so you can even say that each namespace is
physically different.
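
(To make that concrete, here is a small illustrative userspace sketch, not
taken from the thread: with CONFIG_NET_NS, two sockets can bind the same UDP
port as long as one of them lives in its own network namespace. Needs root;
the function name and the port number are arbitrary.)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

static int bind_udp_port(unsigned short port)
{
	struct sockaddr_in a;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&a, 0, sizeof(a));
	a.sin_family = AF_INET;
	a.sin_port = htons(port);
	a.sin_addr.s_addr = htonl(INADDR_ANY);
	return bind(fd, (struct sockaddr *)&a, sizeof(a));
}

int main(void)
{
	printf("first bind:  %d\n", bind_udp_port(5000));	/* 0: success */

	if (unshare(CLONE_NEWNET) < 0) {	/* move into a fresh netns */
		perror("unshare");
		return 1;
	}
	/* Same port number, different namespace: this bind succeeds too. */
	printf("second bind: %d\n", bind_udp_port(5000));	/* 0: success */
	return 0;
}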


Re: [PATCH 4/6][NETNS]: Tcp-v4 sockets per-net lookup.

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:38:15 +0300

 Add a net argument to inet_lookup and propagate it further
 into lookup calls. Plus tune the __inet_check_established.
 
 The dccp and inet_diag, which use that lookup functions
 pass the init_net into them.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied.


Re: [PATCH 0/6][IPV6]: Introduce the INET6_TW_MATCH macro.

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:29:20 +0300

0/6? :-)

 We have INET_MATCH, INET_TW_MATCH and INET6_MATCH to test
 sockets and twbuckets for matching, but ipv6 twbuckets are
 tested manually.
 
 Here's the INET6_TW_MATCH to help with it.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied, thanks.


Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread Arnaldo Carvalho de Melo
Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu:
 These two functions are the same except for what they call
 to check_established and hash for a socket.
 
 This saves half-a-kilo for ipv4 and ipv6.

Good stuff!

Yesterday I was perusing tcp_hash and I think we could have the hashinfo
pointer stored perhaps in sk->sk_prot.

That way we would be able to kill tcp_hash(), inet_put_port() could
receive just sk, etc.

What do you think?
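
(A rough, self-contained model of the idea being floated here; the struct
layouts and helper bodies are illustrative, though Arnaldo's follow-up patch
later in this thread does add a 'hashinfo' pointer along these lines.)

#include <stdio.h>

struct inet_hashinfo { const char *name; };

struct proto {
	const char		*name;
	struct inet_hashinfo	*hashinfo;	/* e.g. &tcp_hashinfo */
};

struct sock {
	struct proto *sk_prot;
};

static void __inet_put_port(struct inet_hashinfo *h, struct sock *sk)
{
	(void)sk;
	printf("releasing port via the %s hash tables\n", h->name);
}

/* Callers no longer have to pass the hashinfo around explicitly. */
static void inet_put_port(struct sock *sk)
{
	__inet_put_port(sk->sk_prot->hashinfo, sk);
}

int main(void)
{
	struct inet_hashinfo tcp_hashinfo = { "tcp" };
	struct proto tcp_prot = { "tcp", &tcp_hashinfo };
	struct sock sk = { &tcp_prot };

	inet_put_port(&sk);
	return 0;
}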

- Arnaldo


Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:32:09 +0300

 These two functions are the same except for what they call
 to check_established and hash for a socket.
 
 This saves half-a-kilo for ipv4 and ipv6.
 
  add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546)
  function old new   delta
  __inet_hash_connect- 577+577
  arp_ignore   108 113  +5
  static.hint8   4  -4
  rt_worker_func   376 372  -4
  inet6_hash_connect   584  25-559
  inet_hash_connect586  25-561
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied.


Re: [PATCH 3/6][NETNS]: Make bind buckets live in net namespaces.

2008-01-31 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:35:39 +0300

 This tags the inet_bind_bucket struct with net pointer,
 initializes it during creation and makes a filtering
 during lookup.
 
 A better hashfn, that takes the net into account is to
 be done in the future, but currently all bind buckets
 with similar port will be in one hash chain.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied.


Re: [PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.

2008-01-31 Thread David Miller
From: Denis V. Lunev [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 15:00:45 +0300

 commit c8050bf6d84785a7edd2e81591e8f833231477e8
 Author: Denis V. Lunev [EMAIL PROTECTED]
 Date:   Thu Jan 10 03:28:24 2008 -0800

I am fixing it up for you this time, but please do not
reference the commit this way.

Say something like:

  blah blah blah in commit $(SHA1_HASH) (commit head line).

The author and date give no real useful information in
this context, the important part is giving the reader
enough information to find the commit should they wish
to gain more information.

If they have the commit hash they can usually find the
commit, but if that fails they can search the commit
messages for the head line text string.

I feel like I've had to explain this 10 times in the past week...
:-/



Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread David Miller
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 11:01:53 -0200

 Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu:
  These two functions are the same except for what they call
  to check_established and hash for a socket.
  
  This saves half-a-kilo for ipv4 and ipv6.
 
 Good stuff!
 
 Yesterday I was perusing tcp_hash and I think we could have the hashinfo
 pointer stored perhaps in sk-sk_prot.
 
 That way we would be able to kill tcp_hash(), inet_put_port() could
 receive just sk, etc.
 
 What do you think?

Sounds good to me.


Re: Null pointer dereference when bringing up bonding device on kernel-2.6.24-2.fc9.i686

2008-01-31 Thread Siim Põder
Yo!

Jay Vosburgh wrote:
 Benny Amorsen [EMAIL PROTECTED] wrote:
 
 https://bugzilla.redhat.com/show_bug.cgi?id=430391
 
   I know what this is, I'll fix it.

Do you know when this happened, so we know which kernel is OK to
use (rather than trying blindly)?

Siim


Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread Pavel Emelyanov
Arnaldo Carvalho de Melo wrote:
 Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu:
 These two functions are the same except for what they call
 to check_established and hash for a socket.

 This saves half-a-kilo for ipv4 and ipv6.
 
 Good stuff!
 
 Yesterday I was perusing tcp_hash and I think we could have the hashinfo
 pointer stored perhaps in sk-sk_prot.
 
 That way we would be able to kill tcp_hash(), inet_put_port() could
 receive just sk, etc.

But each proto will still have its own hashfn, so proto's 
callbacks will be called to hash/unhash sockets, so this will 
give us just one extra dereference. No?

 What do you think?

Hmmm... Even raw_hash, etc. may become simpler. On the other hand
maybe this is a good idea, but I'm not yet familiar enough with this
code to foresee such things in advance... I think that we should
try to prepare a patch and look, but if you have something ready, then
it's better to review your stuff first.

 - Arnaldo
 

Thanks,
Pavel


Re: hard hang through qdisc

2008-01-31 Thread jamal
On Thu, 2008-31-01 at 13:21 +0100, Andi Kleen wrote:
 
 I just managed to hang a 2.6.24 (+ some non network patches) kernel 
 with the following (non sensical) command
 
 tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
 
 No oops or anything just hangs. While I understand root can
 do bad things just hanging like this seems a little extreme.
 

-
lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
lilsol:~# uname -a
Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686
GNU/Linux
lilsol:~# tc qdisc ls dev eth0
qdisc tbf 8001: root rate 1000bit burst 10b lat 737.3ms
lilsol:~#
---

What do your patches do?

cheers,
jamal




Re: hard hang through qdisc

2008-01-31 Thread Andi Kleen

 -
 lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
 lilsol:~# uname -a
 Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686

Can you try it again with current git mainline?

 GNU/Linux
 lilsol:~# tc qdisc ls dev eth0
 qdisc tbf 8001: root rate 1000bit burst 10b lat 737.3ms
 lilsol:~#
 ---
 
 What do your patches do?

Nothing really related to qdiscs.  I suspect it came from the git mainline patch
I had (but forgot to mention in the first email) 

-Andi



Re: hard hang through qdisc

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

-
lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
lilsol:~# uname -a
Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686


Can you try it again with current git mainline?



I'll look into it.


Re: [PATCH] cls_u32 u32_classify() +

2008-01-31 Thread jamal
On Wed, 2008-30-01 at 11:31 -0200, Dzianis Kahanovich wrote:
 Currently the u32 "hashkey ... at ..." option does not work with relative offsets.
 This is the simplest fix, using eat.
 (sorry, v2)
 

Hi, 
Please send me the commands you are trying to run that motivated this
patch.

cheers,
jamal



Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.

2008-01-31 Thread YOSHIFUJI Hideaki / 吉藤英明
In article [EMAIL PROTECTED] (at Thu, 31 Jan 2008 05:20:07 -0800 (PST)), 
David Miller [EMAIL PROTECTED] says:

 The networking devices are even per-namespace already,
 so you can even say that each namespace is even
 physically different.

Ah, okay, we are splitting weak domains...

--yoshfuji


Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread Arnaldo Carvalho de Melo
Em Thu, Jan 31, 2008 at 04:18:51PM +0300, Pavel Emelyanov escreveu:
 Arnaldo Carvalho de Melo wrote:
  Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu:
  These two functions are the same except for what they call
  to check_established and hash for a socket.
 
  This saves half-a-kilo for ipv4 and ipv6.
  
  Good stuff!
  
  Yesterday I was perusing tcp_hash and I think we could have the hashinfo
  pointer stored perhaps in sk-sk_prot.
  
  That way we would be able to kill tcp_hash(), inet_put_port() could
  receive just sk, etc.
 
 But each proto will still have its own hashfn, so proto's 
 callbacks will be called to hash/unhash sockets, so this will 
 give us just one extra dereference. No?
 
  What do you think?
 
 Hmmm... Even raw_hash, etc may become simpler. On the other hand
 maybe this is a good idea, but I'm not very common with this code
 yet to foresee such things in advance... I think that we should
 try to prepare a patch and look, but if you have smth ready, then
 it's better to review your stuff first.

gimme some minutes

- Arnaldo


Re: hard hang through qdisc

2008-01-31 Thread Patrick McHardy

Patrick McHardy wrote:

Andi Kleen wrote:

-
lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
lilsol:~# uname -a
Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686


Can you try it again with current git mainline?



I'll look into it.



Works for me:

qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
 Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

Packets are dropped as expected.


Re: [PATCH] [VLAN] vlan_dev: Initialize dev pointer only when it is being used

2008-01-31 Thread Patrick McHardy

Benjamin Li wrote:

Signed-off-by: Benjamin Li [EMAIL PROTECTED]
---
 net/8021q/vlan_dev.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 8059fa4..2fa5d68 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -49,7 +49,7 @@
  */
 static int vlan_dev_rebuild_header(struct sk_buff *skb)
 {
-   struct net_device *dev = skb->dev;
+   struct net_device *dev;
struct vlan_ethhdr *veth = (struct vlan_ethhdr *)(skb->data);
 
 	switch (veth->h_vlan_encapsulated_proto) {

@@ -60,6 +60,7 @@ static int vlan_dev_rebuild_header(struct sk_buff *skb)
return arp_find(veth->h_dest, skb);
 #endif
default:
+   dev = skb->dev;
pr_debug("%s: unable to resolve type %X addresses.\n",
 dev->name, ntohs(veth->h_vlan_encapsulated_proto));



This seems pretty pointless to me.


Re: NET: AX88796 use dev_dbg() instead of printk()

2008-01-31 Thread Jeff Garzik
On Thu, Jan 31, 2008 at 11:25:31AM +, Ben Dooks wrote:
 Change to using dev_dbg() and the other dev_xxx()
 macros instead of printk, and update to use the
 print_mac() helper.
 
 Signed-off-by: Ben Dooks [EMAIL PROTECTED]

Please send to [EMAIL PROTECTED] or [EMAIL PROTECTED], the email addresses
I've always used for communication.

The redhat.com address is only for legal sign-offs, not actual
communication.

Thanks,

Jeff





Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread David Acker

Bill Fink wrote:

If the receive direction uses a different GigE NIC that's part of the
same quad-GigE, all is fine:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79  nuttcp -f-beta 
-Irx -r -w2m 192.168.5.79
tx:  1186.5051 MB /  10.05 sec =  990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx:  1186.7656 MB /  10.05 sec =  990.5204 Mbps 15 %TX 14 %RX 0 retrans
Could this be an issue with pause frames?  At a previous job I remember 
having issues with a similar configuration using two broadcom sb1250 3 
gigE port devices. If I ran bidirectional tests on a single pair of 
ports connected via cross over, it was slower than when I gave each 
direction its own pair of ports.  The problem turned out to be that 
pause frame generation and handling was not configured correctly.

-Ack


Re: hard hang through qdisc

2008-01-31 Thread Andi Kleen

 Works for me:

 qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
   Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
   rate 0bit 0pps backlog 0b 0p requeues 0

 Packets are dropped as expected.

I can still reproduce it on 64bit with http://halobates.de/config-qdisc
(all qdiscs etc. compiled in for testing) 
with latest git tip (8af03e782cae1e0a0f530ddd22301cdd12cf9dc0)

The command line above causes an instant hang. Also tried it with
newer iproute2 (the original one was quite old), but it didn't make
a difference.

Perhaps it's related to what qdiscs are enabled? Can you please
try with the above config?

If everything fails I can do a bisect later.

-Andi


[PATCH][NETFILTER]: Ipv6-related xt_hashlimit compilation fix.

2008-01-31 Thread Pavel Emelyanov
The hashlimit_ipv6_mask() is called from under IP6_NF_IPTABLES
config option, but is not under it by itself.

gcc warns us about it :) :
net/netfilter/xt_hashlimit.c:473: warning: ‘hashlimit_ipv6_mask’ defined but 
not used

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c
index 54aaf5b..744c7f2 100644
--- a/net/netfilter/xt_hashlimit.c
+++ b/net/netfilter/xt_hashlimit.c
@@ -469,6 +469,7 @@ static inline __be32 maskl(__be32 a, unsigned int l)
return htonl(ntohl(a) & ~(~(u_int32_t)0 >> l));
 }
 
+#if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE)
 static void hashlimit_ipv6_mask(__be32 *i, unsigned int p)
 {
switch (p) {
@@ -503,6 +504,7 @@ static void hashlimit_ipv6_mask(__be32 *i, unsigned int p)
break;
}
 }
+#endif
 
 static int
 hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo,


Re: hard hang through qdisc

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

Works for me:

qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
  Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
  rate 0bit 0pps backlog 0b 0p requeues 0

Packets are dropped as expected.


I can still reproduce it on 64bit with http://halobates.de/config-qdisc
(all qdiscs etc. compiled in for testing) 
with latest git tip (8af03e782cae1e0a0f530ddd22301cdd12cf9dc0)


The command line above causes an instant hang. Also tried it with
newer iproute2 (the original one was quite old), but it didn't make
a difference.

Perhaps it's related to what qdiscs are enabled?



I'm also testing on 64 bit, with all qdiscs enabled as modules.


Can you please try with the above config?



I'll give it a try later.


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Hi all, slowly crawling through the mails.

Brandeburg, Jesse wrote:


The test was done with various mtu sizes ranging from 1500 to 9000,
with ethernet flow control switched on and off, and using reno and
cubic as a TCP congestion control.

As asked in LKML thread, please post the exact netperf command used
to start the client/server, whether or not you're using irqbalanced
(aka irqbalance) and what cat /proc/interrupts looks like (you ARE
using MSI, right?)


We are using MSI, /proc/interrupts look like:
n0003:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:    6536963          0          0          0   IO-APIC-edge      timer
  1:          2          0          0          0   IO-APIC-edge      i8042
  3:          1          0          0          0   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 14:      32321          0          0          0   IO-APIC-edge      libata
 15:          0          0          0          0   IO-APIC-edge      libata
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
 18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2

378:   17234866          0          0          0   PCI-MSI-edge      eth1
379:     129826          0          0          0   PCI-MSI-edge      eth0
NMI:          0          0          0          0
LOC:    6537181    6537326    6537149    6537052
ERR:          0

(sorry for the line break).

What we don't understand is why only core0 gets the interrupts, since 
the affinity is set to f:

# cat /proc/irq/378/smp_affinity
f

Right now, irqbalance is not running, though I can give it a shot if 
people think this will make a difference.



I would suggest you try TCP_RR with a command line something like this:
netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K


I did that and the results can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest

The results with netperf running like
netperf -t TCP_STREAM -H host -l 20
can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf1

I reran the tests with
netperf -t test -H host -l 20 -c -C
or in the case of TCP_RR with the suggested burst settings -b 4 -r 64k



Yes, InterruptThrottleRate=8000 means there will be no more than 8000
ints/second from that adapter, and if interrupts are generated faster
than that they are aggregated.

Interestingly since you are interested in ultra low latency, and may be
willing to give up some cpu for it during bulk transfers you should try
InterruptThrottleRate=1 (can generate up to 7 ints/s)



On the web page you'll see that there are about 4000 interrupts/s for 
most tests and up to 20,000/s for the TCP_RR test. Shall I change the 
throttle rate?



just for completeness can you post the dump of ethtool -e eth0 and
lspci -vvv?

Yup, we'll give that info also.


n0002:~# ethtool -e eth1
Offset  Values
--  --
0x  00 30 48 93 94 2d 20 0d 46 f7 57 00 ff ff ff ff
0x0010  ff ff ff ff 6b 02 9a 10 d9 15 9a 10 86 80 df 80
0x0020  00 00 00 20 54 7e 00 00 00 10 da 00 04 00 00 27
0x0030  c9 6c 50 31 32 07 0b 04 84 29 00 00 00 c0 06 07
0x0040  08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff
0x0050  14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00
0x0060  00 01 00 40 1e 12 ff ff ff ff ff ff ff ff ff ff
0x0070  ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 2f

lspci -vvv for this card:
0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet 
Controller

Subsystem: Super Micro Computer Inc Unknown device 109a
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- 
TAbort- MAbort- SERR- PERR-

Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 378
Region 0: Memory at ee20 (32-bit, non-prefetchable) [size=128K]
Region 2: I/O ports at 5000 [size=32]
Capabilities: [c8] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0+,D1-,D2-,D3hot+,D3cold+)

Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ 
Queue=0/0 Enable+

Address: fee0f00c  Data: 41b9
Capabilities: [e0] Express Endpoint IRQ 0
Device: Supported: MaxPayload 256 bytes, PhantFunc 0, 
ExtTag-

Device: Latency L0s 512ns, L1 64us
Device: AtnBtn- AtnInd- PwrInd-
Device: Errors: Correctable- Non-Fatal- Fatal- 

Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Brief question I forgot to ask:

Right now we are using the old version 7.3.20-k2. To save some effort 
on your end, shall we upgrade this to 7.6.15 or should our version be 
good enough?


Thanks

Carsten


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi Bill,


I see similar results on my test systems


Thanks for this report and for confirming our observations.  Could you 
please confirm that a single-port bidirectional UDP link runs at wire 
speed?  This helps to localize the problem to the TCP stack or interaction 
of the TCP stack with the e1000 driver and hardware.


Cheers,
Bruce


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bruce Allen

Hi David,

Could this be an issue with pause frames?  At a previous job I remember 
having issues with a similar configuration using two broadcom sb1250 3 
gigE port devices. If I ran bidirectional tests on a single pair of 
ports connected via cross over, it was slower than when I gave each 
direction its own pair of ports.  The problem turned out to be that 
pause frame generation and handling was not configured correctly.


We had PAUSE frames turned off for our testing.  The idea is to let TCP 
do the flow and congestion control.


The problem with PAUSE+TCP is that it can cause head-of-line blocking, 
where a single oversubscribed output port on a switch can PAUSE a large 
number of flows on other paths.


Cheers,
Bruce


Re: rtl8150: use default MTU of 1500

2008-01-31 Thread Petko Manolov

On Wed, 30 Jan 2008, Lennert Buytenhek wrote:


The RTL8150 driver uses an MTU of 1540 by default, which causes a
bunch of problems -- it prevents booting from NFS root, for one.


Agreed, although it is a bit strange how this particular bug has sneaked 
up for so long...



cheers,
Petko




Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED]
Cc: Petko Manolov [EMAIL PROTECTED]

--- linux-2.6.24-git7.orig/drivers/net/usb/rtl8150.c2008-01-24 
23:58:37.0 +0100
+++ linux-2.6.24-git7/drivers/net/usb/rtl8150.c 2008-01-30 20:29:00.0 
+0100
@@ -925,9 +925,8 @@
netdev->hard_start_xmit = rtl8150_start_xmit;
netdev->set_multicast_list = rtl8150_set_multicast;
netdev->set_mac_address = rtl8150_set_mac_address;
netdev->get_stats = rtl8150_netdev_stats;
-   netdev->mtu = RTL8150_MTU;
SET_ETHTOOL_OPS(netdev, &ops);
dev->intr_interval = 100;	/* 100ms */

if (!alloc_all_urbs(dev)) {




Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.

2008-01-31 Thread Arnaldo Carvalho de Melo
Em Thu, Jan 31, 2008 at 11:39:55AM -0200, Arnaldo Carvalho de Melo escreveu:
 Em Thu, Jan 31, 2008 at 04:18:51PM +0300, Pavel Emelyanov escreveu:
  Arnaldo Carvalho de Melo wrote:
   Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu:
   These two functions are the same except for what they call
   to check_established and hash for a socket.
  
   This saves half-a-kilo for ipv4 and ipv6.
   
   Good stuff!
   
   Yesterday I was perusing tcp_hash and I think we could have the hashinfo
   pointer stored perhaps in sk-sk_prot.
   
   That way we would be able to kill tcp_hash(), inet_put_port() could
   receive just sk, etc.
  
  But each proto will still have its own hashfn, so proto's 
  callbacks will be called to hash/unhash sockets, so this will 
  give us just one extra dereference. No?
  
   What do you think?
  
  Hmmm... Even raw_hash, etc may become simpler. On the other hand
  maybe this is a good idea, but I'm not very common with this code
  yet to foresee such things in advance... I think that we should
  try to prepare a patch and look, but if you have smth ready, then
  it's better to review your stuff first.
 
 gimme some minutes

A bit more than minutes tho, but here it is, I'm testing it now.

Take a look and if testing is ok I'll submit it with a proper
description.

- Arnaldo

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index fdff630..62a5b69 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -49,7 +49,7 @@ static inline int inet6_sk_ehashfn(const struct sock *sk)
return inet6_ehashfn(laddr, lport, faddr, fport);
 }
 
-extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
+extern void __inet6_hash(struct sock *sk);
 
 /*
  * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so
diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 133cf30..f00f057 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -29,7 +29,6 @@
 #undef INET_CSK_CLEAR_TIMERS
 
 struct inet_bind_bucket;
-struct inet_hashinfo;
 struct tcp_congestion_ops;
 
 /*
@@ -59,6 +58,8 @@ struct inet_connection_sock_af_ops {
int level, int optname,
char __user *optval, int __user *optlen);
void(*addr2sockaddr)(struct sock *sk, struct sockaddr *);
+   int (*bind_conflict)(const struct sock *sk,
+const struct inet_bind_bucket *tb);
 };
 
 /** inet_connection_sock - INET connection oriented sock
@@ -244,10 +245,7 @@ extern struct request_sock *inet_csk_search_req(const 
struct sock *sk,
const __be32 laddr);
 extern int inet_csk_bind_conflict(const struct sock *sk,
  const struct inet_bind_bucket *tb);
-extern int inet_csk_get_port(struct inet_hashinfo *hashinfo,
-struct sock *sk, unsigned short snum,
-int (*bind_conflict)(const struct sock *sk,
- const struct inet_bind_bucket 
*tb));
+extern int inet_csk_get_port(struct sock *sk, unsigned short snum);
 
 extern struct dst_entry* inet_csk_route_req(struct sock *sk,
const struct request_sock *req);
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index c23c4ed..48ac620 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -221,9 +221,9 @@ static inline int inet_sk_listen_hashfn(const struct sock 
*sk)
 }
 
 /* Caller must disable local BH processing. */
-static inline void __inet_inherit_port(struct inet_hashinfo *table,
-  struct sock *sk, struct sock *child)
+static inline void __inet_inherit_port(struct sock *sk, struct sock *child)
 {
+	struct inet_hashinfo *table = sk->sk_prot->hashinfo;
 	const int bhash = inet_bhashfn(inet_sk(child)->num, table->bhash_size);
 	struct inet_bind_hashbucket *head = table->bhash[bhash];
struct inet_bind_bucket *tb;
@@ -235,15 +235,14 @@ static inline void __inet_inherit_port(struct 
inet_hashinfo *table,
 	spin_unlock(&head->lock);
 }
 
-static inline void inet_inherit_port(struct inet_hashinfo *table,
-struct sock *sk, struct sock *child)
+static inline void inet_inherit_port(struct sock *sk, struct sock *child)
 {
local_bh_disable();
-   __inet_inherit_port(table, sk, child);
+   __inet_inherit_port(sk, child);
local_bh_enable();
 }
 
-extern void inet_put_port(struct inet_hashinfo *table, struct sock *sk);
+extern void inet_put_port(struct sock *sk);
 
 extern void inet_listen_wlock(struct inet_hashinfo *hashinfo);
 
@@ -266,41 +265,9 @@ static inline void inet_listen_unlock(struct inet_hashinfo 
*hashinfo)

Re: rtl8150: use default MTU of 1500

2008-01-31 Thread Lennert Buytenhek
On Thu, Jan 31, 2008 at 05:42:34PM +0200, Petko Manolov wrote:

  The RTL8150 driver uses an MTU of 1540 by default, which causes a
  bunch of problems -- it prevents booting from NFS root, for one.
 
 Agreed, although it is a bit strange that this particular bug has
 gone unnoticed for so long...

I posted this patch sometime in 2006, and you asked me a question
about it then (why we don't just set RTL8150_MTU to 1500 -- the
answer would be that RTL8150_MTU is used in a couple more places
in the driver, including for allocating skbuffs), but I failed to
follow up to that question at the time, which is why I assume it got
dropped.

I have been carrying the patch in my own tree since then, and only
noticed recently that the patch never made it upstream.


cheers,
Lennert


 Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED]
 Cc: Petko Manolov [EMAIL PROTECTED]
 
 --- linux-2.6.24-git7.orig/drivers/net/usb/rtl8150.c 2008-01-24 
 23:58:37.0 +0100
 +++ linux-2.6.24-git7/drivers/net/usb/rtl8150.c  2008-01-30 
 20:29:00.0 +0100
 @@ -925,9 +925,8 @@
  	netdev->hard_start_xmit = rtl8150_start_xmit;
  	netdev->set_multicast_list = rtl8150_set_multicast;
  	netdev->set_mac_address = rtl8150_set_mac_address;
  	netdev->get_stats = rtl8150_netdev_stats;
 -	netdev->mtu = RTL8150_MTU;
  	SET_ETHTOOL_OPS(netdev, &ops);
  	dev->intr_interval = 100;	/* 100ms */
 
  if (!alloc_all_urbs(dev)) {
 


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Hi Andi,

Andi Kleen wrote:
Another issue with full duplex TCP not mentioned yet is that if TSO is used 
the output  will be somewhat bursty and might cause problems with the 
TCP ACK clock of the other direction because the ACKs would need 
to squeeze in between full TSO bursts.


You could try disabling TSO with ethtool.


I just tried that:

https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3

It seems that the numbers do get better (the sweet spot seems to be MTU 6000 
with 914 MBit/s and 927 MBit/s), however for other settings the results 
vary a lot, so I'm not sure how large the statistical fluctuations are.


Next I'll test whether it makes sense to enlarge the ring buffers.

Thanks

Carsten


Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}

2008-01-31 Thread Ben Greear

Andi Kleen wrote:

I believe the problem was that all of my ports were used up with
TIME_WAIT sockets and so it couldn't create more.  My test
case was similar to this:



Ah, that's simple to solve then: use more IP addresses and bind 
to them in RR in your user program.


Arguably the Linux TCP code should be able to do this by itself
when enough IP addresses are available, but it's not very hard
to do in user space using bind(2)

BTW it's also a very unusual case -- in most cases there are more
remote IP addresses
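
As a minimal user-space sketch of the bind(2) round-robin described above
(the helper name and the local address pool are invented for illustration;
each configured local IP gets its own 64k ephemeral port space):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Placeholder pool of locally configured source addresses. */
static const char *local_ips[] = { "192.168.1.10", "192.168.1.11" };
static unsigned int next_ip;

int connect_rr(const struct sockaddr_in *dst)
{
	struct sockaddr_in src;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&src, 0, sizeof(src));
	src.sin_family = AF_INET;
	src.sin_port = 0;	/* let the kernel pick the local port */
	inet_pton(AF_INET,
		  local_ips[next_ip++ % (sizeof(local_ips) / sizeof(local_ips[0]))],
		  &src.sin_addr);

	/* Bind the source IP before connect() so TIME_WAIT entries are
	 * spread over all configured addresses. */
	if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
	    connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}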
  
This could be done, but it does decrease our options for testing certain 
scenarios.
So, is there a better way to max out the connections per second without 
having to use tcp_tw_recycle?



Well did you profile where the bottle necks were?

Perhaps also just increase the memory allowed for TCP sockets.
  
I may be missing something, but I believe the issue is that the sockets 
wait around a while (maybe 30 seconds or so) in TIME_WAIT state.  So, 
even if we use all 64k of the local port range, that will limit us to 
about 2000 new sockets per second, as we have to wait for old ones to 
transition out of TIME_WAIT.

I guess I could probably decrease TIME_WAIT, but then all of my 
connections would be affected, not just the ones on the ports creating 
very large numbers of connections per second.  From 'man tcp', it does 
not seem I can set the TIME_WAIT on a per-socket basis.

I don't know exactly how tcp_tw_recycle works, but it seems like it 
could be made to only take effect when all local ports are used up in 
TIME_WAIT.  It could then recycle the oldest one as a new socket is 
requested.  For any normal program, it would be very unlikely to ever 
need to recycle in this case because there would be enough free IP/port 
pairs available.  But, for weird things like my own, at least it could 
be made to work w/out hacking the global TIME_WAIT.
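
One per-socket knob that does exist (an aside, not something raised in the
thread): an abortive close via SO_LINGER with a zero timeout makes close()
send a RST and skip TIME_WAIT entirely for that one socket, which is only
really appropriate for test tools that deliberately churn connections.
A minimal sketch:

#include <sys/socket.h>
#include <unistd.h>

/* Abortive close: RST instead of FIN, so this socket never enters
 * TIME_WAIT.  Use only where losing the graceful shutdown is acceptable. */
static int close_abortive(int fd)
{
	struct linger lg = { .l_onoff = 1, .l_linger = 0 };

	setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
	return close(fd);
}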


Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED] 
Candela Technologies Inc  http://www.candelatech.com





Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 08:41:38AM -0800, Ben Greear wrote:
 I don't know exactly how the tcp_tw_recycle works, but it seems like it 
 could be made to only
 take effect when all local ports are used up in TIME_WAIT.  

TIME-WAIT does not actually use up local ports; it uses up remote ports
because it is done on the LISTEN socket, which always has a fixed
local port. And it has no idea how many ports the other end has left.

-Andi


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Carsten Aulbert

Hi all,

Brandeburg, Jesse wrote:

I would suggest you try TCP_RR with a command line something like
this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K

I did that and the results can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest


seems something went wrong and all you ran was the 1 byte tests, where
it should have been 64K both directions (request/response).
 


Yes, shell-quoting got me there. I'll re-run the tests, so please don't 
look at the TCP_RR results too closely. I think I'll be able to run 
maybe one or two more tests today, rest will follow tomorrow.


Thanks for bearing with me

Carsten

PS: Am I right that the TCP_RR tests should only be run on a single node 
at a time, not on both ends simultaneously?



Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Stephen Hemminger
On Thu, 31 Jan 2008 13:46:32 +0100
Andi Kleen [EMAIL PROTECTED] wrote:

 
 TSO interacts badly with many queueing disciplines because they rely on 
 reordering packets from different streams and the large TSO packets can 
 make this difficult. This patch disables TSO for sockets that send over 
 devices with non standard queueing disciplines. That's anything but noop 
 or pfifo_fast and pfifo right now.
 
 Longer term other queueing disciplines could be checked if they
 are also ok with TSO. If yes they can set the TCQ_F_GSO_OK flag too.
 
 It is still enabled for the standard pfifo_fast because that will never
 reorder packets with the same type-of-service. This means 99+% of all users
 will still be able to use TSO just fine.
 
 The status is only set up at socket creation, so a shifted route
 will not re-enable TSO on an existing socket. I don't think that's a 
 problem though.
 
 Signed-off-by: Andi Kleen [EMAIL PROTECTED]
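
A rough sketch of the check the description above implies (editorial, not
the patch itself: qdisc_tso_ok() and sk_setup_caps_sketch() are invented
names, TCQ_F_GSO_OK is the flag proposed in the description, and the fields
follow 2.6.24-era structures):

static inline int qdisc_tso_ok(const struct net_device *dev)
{
	/* Proposed flag: only noop/pfifo_fast/pfifo would set it at first. */
	return dev->qdisc->flags & TCQ_F_GSO_OK;
}

static void sk_setup_caps_sketch(struct sock *sk, struct net_device *dev)
{
	sk->sk_route_caps = dev->features;
	/* Strip the GSO/TSO capability bits when the root qdisc has not
	 * declared itself safe for large TSO packets. */
	if (!qdisc_tso_ok(dev))
		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
}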
 


Fix the broken qdisc instead.

-- 
Stephen Hemminger [EMAIL PROTECTED]


[PATCH 1/1]: Add support for aes-ctr to ipsec

2008-01-31 Thread Joy Latten
Very sorry, re-posting as first patch was incomplete.

The patch below allows IPsec to use CTR mode with the
AES encryption algorithm. Tested this using setkey
in ipsec-tools.

regards,
Joy


Signed-off-by: Joy Latten [EMAIL PROTECTED]

--

diff -urpN net-2.6.25/include/linux/pfkeyv2.h 
net-2.6.25.patch/include/linux/pfkeyv2.h
--- net-2.6.25/include/linux/pfkeyv2.h  2008-01-29 11:48:00.0 -0600
+++ net-2.6.25.patch/include/linux/pfkeyv2.h2008-01-29 13:43:59.0 
-0600
@@ -298,6 +298,7 @@ struct sadb_x_sec_ctx {
 #define SADB_X_EALG_BLOWFISHCBC		7
 #define SADB_EALG_NULL			11
 #define SADB_X_EALG_AESCBC		12
+#define SADB_X_EALG_AESCTR		13
 #define SADB_X_EALG_CAMELLIACBC		22
 #define SADB_EALG_MAX   253 /* last EALG */
 /* private allocations should use 249-255 (RFC2407) */
diff -urpN net-2.6.25/net/xfrm/xfrm_algo.c net-2.6.25.patch/net/xfrm/xfrm_algo.c
--- net-2.6.25/net/xfrm/xfrm_algo.c 2008-01-29 11:48:03.0 -0600
+++ net-2.6.25.patch/net/xfrm/xfrm_algo.c   2008-01-29 13:42:43.0 
-0600
@@ -300,6 +300,23 @@ static struct xfrm_algo_desc ealg_list[]
.sadb_alg_maxbits = 256
}
 },
+{
+	.name = "rfc3686(ctr(aes))",
+
+   .uinfo = {
+   .encr = {
+   .blockbits = 128,
+   .defkeybits = 160, /* 128-bit key + 32-bit nonce */
+   }
+   },
+
+   .desc = {
+   .sadb_alg_id = SADB_X_EALG_AESCTR,
+   .sadb_alg_ivlen = 8,
+   .sadb_alg_minbits = 128,
+   .sadb_alg_maxbits = 256
+   }
+},
 };
 
 static struct xfrm_algo_desc calg_list[] = {


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Bill Fink
Hi Bruce,

On Thu, 31 Jan 2008, Bruce Allen wrote:

  I see similar results on my test systems
 
 Thanks for this report and for confirming our observations.  Could you 
 please confirm that a single-port bidirectional UDP link runs at wire 
 speed?  This helps to localize the problem to the TCP stack or interaction 
 of the TCP stack with the e1000 driver and hardware.

Yes, a single-port bidirectional UDP test gets full GigE line rate
in both directions with no packet loss.

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -u -Ru -w2m 192.168.6.79 & nuttcp 
-f-beta -Irx -r -u -Ru -w2m 192.168.6.79
tx:  1187.0078 MB /  10.04 sec =  992.0550 Mbps 19 %TX 7 %RX 0 / 151937 
drop/pkt 0.00 %loss
rx:  1187.1016 MB /  10.03 sec =  992.3408 Mbps 19 %TX 7 %RX 0 / 151949 
drop/pkt 0.00 %loss

-Bill


RE: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Brandeburg, Jesse
Carsten Aulbert wrote:
 PS: Am I right that the TCP_RR tests should only be run on a single
 node at a time, not on both ends simultaneously?

yes, they are a request/response test, and so perform the bidirectional
test with a single node starting the test.


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Rick Jones
netperf was used without any special tuning parameters. Usually we start 
two processes on two hosts which start (almost) simultaneously, last for 
20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e.


on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20
on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20

192.168.0.20[23] here is on eth0 which cannot do jumbo frames, thus we 
use the .2. part for eth1 for a range of mtus.


The server is started on both nodes with the start-stop-daemon and no 
special parameters I'm aware of.



So long as you are relying on external (netperf relative) means to 
report the throughput, those command lines would be fine.  I wouldn't be 
comfortable relying on the sum of the netperf-reported throughputs with 
those command lines though.  Netperf2 has no test synchronization, so two 
separate commands, particularly those initiated on different systems, 
are subject to skew errors.  99 times out of ten they might be epsilon, 
but I get a _little_ paranoid there.


There are three alternatives:

1) use netperf4.  not as convenient for quick testing at present, but 
it has explicit test synchronization, so  you know that the numbers 
presented are from when all connections were actively transferring data


2) use the aforementioned burst TCP_RR test.  This is then a single 
netperf with data flowing both ways on a single connection so no issue 
of skew, but perhaps an issue of being one connection and so one process 
on each end.


3) start both tests from the same system and follow the suggestions 
contained in :


http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html

particularly:

http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

and use a combination of TCP_STREAM and TCP_MAERTS (STREAM backwards) tests.

happy benchmarking,

rick jones


Re: [PATCH 0/6] preparations to enable netdevice notifiers inside a namespace (resend)

2008-01-31 Thread Daniel Lezcano

Benjamin Thery wrote:

On Jan 31, 2008 3:58 PM, Daniel Lezcano [EMAIL PROTECTED] wrote:


Denis V. Lunev wrote:

Here are some preparations and cleanups to enable network device/inet
address notifiers inside a namespace.

This set of patches has been originally sent last Friday. One cleanup
patch from the original series is dropped as wrong, thanks to Daniel
Lezcano.

Can you explain please.



I think Denis refers to the patch called 3/7 "Prohibit assignment of
0.0.0.0 as interface address", which
he dropped because it was inappropriate, no?


Yes, you are right, Denis explained it to me in a private email. I think 
I really need to sleep a little more :)



[NET_SCHED 00/04]: External SFQ classifiers/flow classifier

2008-01-31 Thread Patrick McHardy
These patches add support for external classifiers to SFQ and add a
new flow classifier, which can do hashing based on user-specified
keys or deterministic mapping of keys to classes. Additionally there
is a patch to make the SFQ queues visible as classes to verify that
the hash is indeed doing something useful and a patch to constify
struct tcf_ext_map, which I had queued in the same tree.

Please apply, thanks.


 include/linux/pkt_cls.h   |   50 
 include/linux/pkt_sched.h |5 +
 include/net/pkt_cls.h |6 +-
 net/sched/Kconfig |   11 +
 net/sched/Makefile|1 +
 net/sched/cls_api.c   |6 +-
 net/sched/cls_basic.c |2 +-
 net/sched/cls_flow.c  |  660 +
 net/sched/cls_fw.c|2 +-
 net/sched/cls_route.c |2 +-
 net/sched/cls_tcindex.c   |2 +-
 net/sched/cls_u32.c   |2 +-
 net/sched/sch_sfq.c   |  134 +-
 13 files changed, 868 insertions(+), 15 deletions(-)
 create mode 100644 net/sched/cls_flow.c

Patrick McHardy (4):
  [NET_SCHED]: Constify struct tcf_ext_map
  [NET_SCHED]: sch_sfq: add support for external classifiers
  [NET_SCHED]: sch_sfq: make internal queues visible as classes
  [NET_SCHED]: Add flow classifier


[NET_SCHED 01/04]: Constify struct tcf_ext_map

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: Constify struct tcf_ext_map

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 12e33ddf57910b685501df10bd92223ea9b98fd6
tree 1ce47c7b6b6b968940f3dc28f9d7839e78c85089
parent 8af03e782cae1e0a0f530ddd22301cdd12cf9dc0
author Patrick McHardy [EMAIL PROTECTED] Wed, 30 Jan 2008 21:59:26 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:55 +0100

 include/net/pkt_cls.h   |6 +++---
 net/sched/cls_api.c |6 +++---
 net/sched/cls_basic.c   |2 +-
 net/sched/cls_fw.c  |2 +-
 net/sched/cls_route.c   |2 +-
 net/sched/cls_tcindex.c |2 +-
 net/sched/cls_u32.c |2 +-
 7 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 8716eb7..d349c66 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -131,14 +131,14 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
 
 extern int tcf_exts_validate(struct tcf_proto *tp, struct nlattr **tb,
 struct nlattr *rate_tlv, struct tcf_exts *exts,
-struct tcf_ext_map *map);
+const struct tcf_ext_map *map);
 extern void tcf_exts_destroy(struct tcf_proto *tp, struct tcf_exts *exts);
 extern void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
 struct tcf_exts *src);
 extern int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts,
-struct tcf_ext_map *map);
+const struct tcf_ext_map *map);
 extern int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts,
-  struct tcf_ext_map *map);
+  const struct tcf_ext_map *map);
 
 /**
  * struct tcf_pkt_info - packet information
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3377ca0..0fbedca 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -482,7 +482,7 @@ EXPORT_SYMBOL(tcf_exts_destroy);
 
 int tcf_exts_validate(struct tcf_proto *tp, struct nlattr **tb,
  struct nlattr *rate_tlv, struct tcf_exts *exts,
- struct tcf_ext_map *map)
+ const struct tcf_ext_map *map)
 {
memset(exts, 0, sizeof(*exts));
 
@@ -535,7 +535,7 @@ void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts 
*dst,
 EXPORT_SYMBOL(tcf_exts_change);
 
 int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts,
- struct tcf_ext_map *map)
+ const struct tcf_ext_map *map)
 {
 #ifdef CONFIG_NET_CLS_ACT
	if (map->action && exts->action) {
@@ -571,7 +571,7 @@ EXPORT_SYMBOL(tcf_exts_dump);
 
 
 int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts,
-   struct tcf_ext_map *map)
+   const struct tcf_ext_map *map)
 {
 #ifdef CONFIG_NET_CLS_ACT
	if (exts->action)
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index bfb4342..956915c 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -35,7 +35,7 @@ struct basic_filter
struct list_headlink;
 };
 
-static struct tcf_ext_map basic_ext_map = {
+static const struct tcf_ext_map basic_ext_map = {
.action = TCA_BASIC_ACT,
.police = TCA_BASIC_POLICE
 };
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 436a6e7..b0f90e5 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -47,7 +47,7 @@ struct fw_filter
struct tcf_exts exts;
 };
 
-static struct tcf_ext_map fw_ext_map = {
+static const struct tcf_ext_map fw_ext_map = {
.action = TCA_FW_ACT,
.police = TCA_FW_POLICE
 };
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index f7e7d39..784dcb8 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -62,7 +62,7 @@ struct route4_filter
 
 #define ROUTE4_FAILURE ((struct route4_filter*)(-1L))
 
-static struct tcf_ext_map route_ext_map = {
+static const struct tcf_ext_map route_ext_map = {
.police = TCA_ROUTE4_POLICE,
.action = TCA_ROUTE4_ACT
 };
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index ee60b2d..7a7bff5 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -55,7 +55,7 @@ struct tcindex_data {
int fall_through;   /* 0: only classify if explicit match */
 };
 
-static struct tcf_ext_map tcindex_ext_map = {
+static const struct tcf_ext_map tcindex_ext_map = {
.police = TCA_TCINDEX_POLICE,
.action = TCA_TCINDEX_ACT
 };
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index e8a7756..b18fa95 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -82,7 +82,7 @@ struct tc_u_common
u32 hgenerator;
 };
 
-static struct tcf_ext_map u32_ext_map = {
+static const struct tcf_ext_map u32_ext_map = {
.action = TCA_U32_ACT,
.police = TCA_U32_POLICE
 };

[NET_SCHED 02/04]: sch_sfq: add support for external classifiers

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: sch_sfq: add support for external classifiers

Add support for external classifiers to allow using different flow hash
functions similar to ESFQ. When no classifier is attached the built-in
hash is used as before.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 6049892cc4acca9af393e134e4cdaf6b3e1ccad9
tree 9a8347d45808de2aef14486e5792fcab58baf3fe
parent 12e33ddf57910b685501df10bd92223ea9b98fd6
author Patrick McHardy [EMAIL PROTECTED] Wed, 30 Jan 2008 21:59:27 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:55 +0100

 net/sched/sch_sfq.c |   95 +--
 1 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 91af539..d818d19 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -95,6 +95,7 @@ struct sfq_sched_data
int limit;
 
 /* Variables */
+   struct tcf_proto *filter_list;
struct timer_list perturb_timer;
u32 perturbation;
sfq_index   tail;   /* Index of current slot in round */
@@ -155,6 +156,39 @@ static unsigned sfq_hash(struct sfq_sched_data *q, struct 
sk_buff *skb)
return sfq_fold_hash(q, h, h2);
 }
 
+static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
+int *qerr)
+{
+   struct sfq_sched_data *q = qdisc_priv(sch);
+   struct tcf_result res;
+   int result;
+
+	if (TC_H_MAJ(skb->priority) == sch->handle &&
+	    TC_H_MIN(skb->priority) > 0 &&
+	    TC_H_MIN(skb->priority) <= SFQ_HASH_DIVISOR)
+		return TC_H_MIN(skb->priority);
+
+	if (!q->filter_list)
+		return sfq_hash(q, skb) + 1;
+
+	*qerr = NET_XMIT_BYPASS;
+	result = tc_classify(skb, q->filter_list, &res);
+	if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+			*qerr = NET_XMIT_SUCCESS;
+		case TC_ACT_SHOT:
+			return 0;
+		}
+#endif
+		if (TC_H_MIN(res.classid) <= SFQ_HASH_DIVISOR)
+			return TC_H_MIN(res.classid);
+	}
+	return 0;
+}
+
 static inline void sfq_link(struct sfq_sched_data *q, sfq_index x)
 {
sfq_index p, n;
@@ -245,8 +279,18 @@ static int
 sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
struct sfq_sched_data *q = qdisc_priv(sch);
-   unsigned hash = sfq_hash(q, skb);
+   unsigned int hash;
sfq_index x;
+   int ret;
+
+	hash = sfq_classify(skb, sch, &ret);
+	if (hash == 0) {
+		if (ret == NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return ret;
+	}
+	hash--;
 
 	x = q->ht[hash];
if (x == SFQ_DEPTH) {
@@ -289,8 +333,18 @@ static int
 sfq_requeue(struct sk_buff *skb, struct Qdisc *sch)
 {
struct sfq_sched_data *q = qdisc_priv(sch);
-   unsigned hash = sfq_hash(q, skb);
+   unsigned int hash;
sfq_index x;
+   int ret;
+
+	hash = sfq_classify(skb, sch, &ret);
+	if (hash == 0) {
+		if (ret == NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return ret;
+	}
+	hash--;
 
 	x = q->ht[hash];
if (x == SFQ_DEPTH) {
@@ -465,6 +519,8 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 static void sfq_destroy(struct Qdisc *sch)
 {
struct sfq_sched_data *q = qdisc_priv(sch);
+
+	tcf_destroy_chain(q->filter_list);
 	del_timer(&q->perturb_timer);
 }
 
@@ -490,9 +546,40 @@ nla_put_failure:
return -1;
 }
 
+static int sfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+   struct nlattr **tca, unsigned long *arg)
+{
+   return -EOPNOTSUPP;
+}
+
+static unsigned long sfq_get(struct Qdisc *sch, u32 classid)
+{
+   return 0;
+}
+
+static struct tcf_proto **sfq_find_tcf(struct Qdisc *sch, unsigned long cl)
+{
+   struct sfq_sched_data *q = qdisc_priv(sch);
+
+   if (cl)
+   return NULL;
+	return &q->filter_list;
+}
+
+static void sfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+   return;
+}
+
+static const struct Qdisc_class_ops sfq_class_ops = {
+   .get=   sfq_get,
+   .change =   sfq_change_class,
+   .tcf_chain  =   sfq_find_tcf,
+   .walk   =   sfq_walk,
+};
+
 static struct Qdisc_ops sfq_qdisc_ops __read_mostly = {
-   .next   =   NULL,
-   .cl_ops =   NULL,
+	.cl_ops		=	&sfq_class_ops,
 	.id		=	"sfq",
.priv_size  =   sizeof(struct sfq_sched_data),
.enqueue=   sfq_enqueue,

[NET_SCHED 04/04]: Add flow classifier

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: Add flow classifier

Add new flow classifier, which is meant to extend the SFQ hashing
capabilities without hard-coding new hash functions and also allows
deterministic mappings of keys to classes, replacing some out of tree
iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
IPs to marks, with fw filters to classes), ...

Some examples:

- Classic SFQ hash:

  tc filter add ... flow hash \
keys src,dst,proto,proto-src,proto-dst divisor 1024

- Classic SFQ hash, but using information from conntrack to work properly in
  combination with NAT:

  tc filter add ... flow hash \
keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024

- Map destination IPs of 192.168.0.0/24 to classids 1-257:

  tc filter add ... flow map \
key dst addend -192.168.0.0 divisor 256

- alternatively:

  tc filter add ... flow map \
key dst and 0xff

- similar, but reverse ordered:

  tc filter add ... flow map \
key dst and 0xff xor 0xff
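
As a quick sanity check of the map example above, a stand-alone user-space
toy (not the kernel code; the default baseclass minor of 1 is assumed here)
that mimics the key -> addend -> divisor arithmetic:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t base    = (192u << 24) | (168u << 16);	/* 192.168.0.0 */
	uint32_t dst     = base | 42;			/* 192.168.0.42 */
	uint32_t divisor = 256;

	uint32_t key     = dst - base;			/* "addend -192.168.0.0" */
	uint32_t classid = (key % divisor) + 1;		/* assumed baseclass minor 1 */

	printf("192.168.0.42 -> class minor %u\n", classid);	/* prints 43 */
	return 0;
}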

Perturbation is currently not supported because we can't reliably kill the
timer on destruction.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 91a3a09ce63cba8df30ac42133a40dd64c0a7259
tree 2572feb8ffd88e6abf9270d2137af2a4cf7f542a
parent 7a281f8ef334a35d699682315e9f80a3e006376c
author Patrick McHardy [EMAIL PROTECTED] Wed, 30 Jan 2008 21:59:31 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:56 +0100

 include/linux/pkt_cls.h |   50 
 net/sched/Kconfig   |   11 +
 net/sched/Makefile  |1 
 net/sched/cls_flow.c|  660 +++
 4 files changed, 722 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 30b8571..1c1dba9 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -328,6 +328,56 @@ enum
 
 #define TCA_TCINDEX_MAX (__TCA_TCINDEX_MAX - 1)
 
+/* Flow filter */
+
+enum
+{
+   FLOW_KEY_SRC,
+   FLOW_KEY_DST,
+   FLOW_KEY_PROTO,
+   FLOW_KEY_PROTO_SRC,
+   FLOW_KEY_PROTO_DST,
+   FLOW_KEY_IIF,
+   FLOW_KEY_PRIORITY,
+   FLOW_KEY_MARK,
+   FLOW_KEY_NFCT,
+   FLOW_KEY_NFCT_SRC,
+   FLOW_KEY_NFCT_DST,
+   FLOW_KEY_NFCT_PROTO_SRC,
+   FLOW_KEY_NFCT_PROTO_DST,
+   FLOW_KEY_RTCLASSID,
+   FLOW_KEY_SKUID,
+   FLOW_KEY_SKGID,
+   __FLOW_KEY_MAX,
+};
+
+#define FLOW_KEY_MAX   (__FLOW_KEY_MAX - 1)
+
+enum
+{
+   FLOW_MODE_MAP,
+   FLOW_MODE_HASH,
+};
+
+enum
+{
+   TCA_FLOW_UNSPEC,
+   TCA_FLOW_KEYS,
+   TCA_FLOW_MODE,
+   TCA_FLOW_BASECLASS,
+   TCA_FLOW_RSHIFT,
+   TCA_FLOW_ADDEND,
+   TCA_FLOW_MASK,
+   TCA_FLOW_XOR,
+   TCA_FLOW_DIVISOR,
+   TCA_FLOW_ACT,
+   TCA_FLOW_POLICE,
+   TCA_FLOW_EMATCHES,
+   __TCA_FLOW_MAX
+};
+
+#define TCA_FLOW_MAX   (__TCA_FLOW_MAX - 1)
+
 /* Basic filter */
 
 enum
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87af7c9..bccf42b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -307,6 +307,17 @@ config NET_CLS_RSVP6
  To compile this code as a module, choose M here: the
  module will be called cls_rsvp6.
 
+config NET_CLS_FLOW
+	tristate "Flow classifier"
+   select NET_CLS
+   ---help---
+ If you say Y here, you will be able to classify packets based on
+ a configurable combination of packet keys. This is mostly useful
+ in combination with SFQ.
+
+ To compile this code as a module, choose M here: the
+ module will be called cls_flow.
+
 config NET_EMATCH
 	bool "Extended Matches"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 81ecbe8..1d2b0f7 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -35,6 +35,7 @@ obj-$(CONFIG_NET_CLS_RSVP)+= cls_rsvp.o
 obj-$(CONFIG_NET_CLS_TCINDEX)  += cls_tcindex.o
 obj-$(CONFIG_NET_CLS_RSVP6)+= cls_rsvp6.o
 obj-$(CONFIG_NET_CLS_BASIC)+= cls_basic.o
+obj-$(CONFIG_NET_CLS_FLOW) += cls_flow.o
 obj-$(CONFIG_NET_EMATCH)   += ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)   += em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE) += em_nbyte.o
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
new file mode 100644
index 000..5a7f6a3
--- /dev/null
+++ b/net/sched/cls_flow.c
@@ -0,0 +1,660 @@
+/*
+ * net/sched/cls_flow.c		Generic flow classifier
+ *
+ * Copyright (c) 2007, 2008 Patrick McHardy [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+#include <linux/random.h>
+#include <linux/pkt_cls.h>
+#include <linux/skbuff.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include 

Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
 Fix the broken qdisc instead.

What do you mean? I don't think the qdiscs are broken.
I cannot think of any way how e.g. TBF can do anything useful
with large TSO packets.

-Andi


[IPROUTE 01/02]: Add support for SFQ xstats

2008-01-31 Thread Patrick McHardy
 [IPROUTE]: Add support for SFQ xstats

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 196870f762ee393438c42115425f4af69e5b5186
tree 5650c1f93cc58886f8f97a0e55e374c157b96e2e
parent 54bb35c69cec6c730a4ac95530a1d2ca6670f73b
author Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 15:10:07 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 15:10:07 +0100

 include/linux/pkt_sched.h |5 +
 tc/q_sfq.c|   17 +
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 3276135..4ccd684 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -150,6 +150,11 @@ struct tc_sfq_qopt
 	unsigned	flows;		/* Maximal number of flows  */
 };
 
+struct tc_sfq_xstats
+{
+	__u32		allot;
+};
+
 /*
  *  NOTE: limit, divisor and flows are hardwired to code at the moment.
  *
diff --git a/tc/q_sfq.c b/tc/q_sfq.c
index 05385cf..ce4dade 100644
--- a/tc/q_sfq.c
+++ b/tc/q_sfq.c
@@ -100,8 +100,25 @@ static int sfq_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	return 0;
 }
 
+static int sfq_print_xstats(struct qdisc_util *qu, FILE *f,
+			struct rtattr *xstats)
+{
+	struct tc_sfq_xstats *st;
+
+	if (xstats == NULL)
+		return 0;
+	if (RTA_PAYLOAD(xstats) < sizeof(*st))
+		return -1;
+	st = RTA_DATA(xstats);
+
+	fprintf(f, " allot %d ", st->allot);
+	fprintf(f, "\n");
+	return 0;
+}
+
 struct qdisc_util sfq_qdisc_util = {
 	.id		= "sfq",
 	.parse_qopt	= sfq_parse_opt,
 	.print_qopt	= sfq_print_opt,
+	.print_xstats	= sfq_print_xstats,
 };


[IPROUTE 02/02]: Add flow classifier support

2008-01-31 Thread Patrick McHardy
 [IPROUTE]: Add flow classifier support

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit ac3df2d7e37826b06cc9093f50d829a9da1873a4
tree b33a2b29abdcea0267fe7a357d282a4c2f67124b
parent 196870f762ee393438c42115425f4af69e5b5186
author Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:47 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:47 +0100

 include/linux/pkt_cls.h |   50 +++
 tc/Makefile |1 
 tc/f_flow.c |  347 +++
 3 files changed, 398 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index afb79d0..16869c2 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -328,6 +328,56 @@ enum
 
 #define TCA_TCINDEX_MAX (__TCA_TCINDEX_MAX - 1)
 
+/* Flow filter */
+
+enum
+{
+	FLOW_KEY_SRC,
+	FLOW_KEY_DST,
+	FLOW_KEY_PROTO,
+	FLOW_KEY_PROTO_SRC,
+	FLOW_KEY_PROTO_DST,
+	FLOW_KEY_IIF,
+	FLOW_KEY_PRIORITY,
+	FLOW_KEY_MARK,
+	FLOW_KEY_NFCT,
+	FLOW_KEY_NFCT_SRC,
+	FLOW_KEY_NFCT_DST,
+	FLOW_KEY_NFCT_PROTO_SRC,
+	FLOW_KEY_NFCT_PROTO_DST,
+	FLOW_KEY_RTCLASSID,
+	FLOW_KEY_SKUID,
+	FLOW_KEY_SKGID,
+	__FLOW_KEY_MAX,
+};
+
+#define FLOW_KEY_MAX	(__FLOW_KEY_MAX - 1)
+
+enum
+{
+	FLOW_MODE_MAP,
+	FLOW_MODE_HASH,
+};
+
+enum
+{
+	TCA_FLOW_UNSPEC,
+	TCA_FLOW_KEYS,
+	TCA_FLOW_MODE,
+	TCA_FLOW_BASECLASS,
+	TCA_FLOW_RSHIFT,
+	TCA_FLOW_ADDEND,
+	TCA_FLOW_MASK,
+	TCA_FLOW_XOR,
+	TCA_FLOW_DIVISOR,
+	TCA_FLOW_ACT,
+	TCA_FLOW_POLICE,
+	TCA_FLOW_EMATCHES,
+	__TCA_FLOW_MAX
+};
+
+#define TCA_FLOW_MAX	(__TCA_FLOW_MAX - 1)
+
 /* Basic filter */
 
 enum
diff --git a/tc/Makefile b/tc/Makefile
index 0facc88..7ece958 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -18,6 +18,7 @@ TCMODULES += f_u32.o
 TCMODULES += f_route.o
 TCMODULES += f_fw.o
 TCMODULES += f_basic.o
+TCMODULES += f_flow.o
 TCMODULES += q_dsmark.o
 TCMODULES += q_gred.o
 TCMODULES += f_tcindex.o
diff --git a/tc/f_flow.c b/tc/f_flow.c
new file mode 100644
index 000..eca05cd
--- /dev/null
+++ b/tc/f_flow.c
@@ -0,0 +1,347 @@
+/*
+ * f_flow.c		Flow filter
+ *
+ * 		This program is free software; you can redistribute it and/or
+ * 		modify it under the terms of the GNU General Public License
+ * 		as published by the Free Software Foundation; either version
+ * 		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Patrick McHardy [EMAIL PROTECTED]
+ */
+#include stdio.h
+#include stdlib.h
+#include unistd.h
+#include string.h
+#include errno.h
+
+#include utils.h
+#include tc_util.h
+#include m_ematch.h
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... flow ...\n"
+"\n"
+" [mapping mode]: map key KEY [ OPS ] ...\n"
+" [hashing mode]: hash keys KEY-LIST ...\n"
+"\n"
+" [ divisor NUM ] [ baseclass ID ] [ match EMATCH_TREE ]\n"
+" [ police POLICE_SPEC ] [ action ACTION_SPEC ]\n"
+"\n"
+"KEY-LIST := [ KEY-LIST , ] KEY\n"
+"KEY      := [ src | dst | proto | proto-src | proto-dst | iif | priority |\n"
+"              mark | nfct | nfct-src | nfct-dst | nfct-proto-src |\n"
+"              nfct-proto-dst | rt-classid | sk-uid | sk-gid ]\n"
+"OPS      := [ or NUM | and NUM | xor NUM | rshift NUM | addend NUM ]\n"
+"ID       := X:Y\n"
+	);
+}
+
+static const char *flow_keys[FLOW_KEY_MAX+1] = {
+	[FLOW_KEY_SRC]			= "src",
+	[FLOW_KEY_DST]			= "dst",
+	[FLOW_KEY_PROTO]		= "proto",
+	[FLOW_KEY_PROTO_SRC]		= "proto-src",
+	[FLOW_KEY_PROTO_DST]		= "proto-dst",
+	[FLOW_KEY_IIF]			= "iif",
+	[FLOW_KEY_PRIORITY]		= "priority",
+	[FLOW_KEY_MARK]			= "mark",
+	[FLOW_KEY_NFCT]			= "nfct",
+	[FLOW_KEY_NFCT_SRC]		= "nfct-src",
+	[FLOW_KEY_NFCT_DST]		= "nfct-dst",
+	[FLOW_KEY_NFCT_PROTO_SRC]	= "nfct-proto-src",
+	[FLOW_KEY_NFCT_PROTO_DST]	= "nfct-proto-dst",
+	[FLOW_KEY_RTCLASSID]		= "rt-classid",
+	[FLOW_KEY_SKUID]		= "sk-uid",
+	[FLOW_KEY_SKGID]		= "sk-gid",
+};
+
+static int flow_parse_keys(__u32 *keys, __u32 *nkeys, char *argv)
+{
+	char *s, *sep;
+	unsigned int i;
+
+	*keys = 0;
+	*nkeys = 0;
+	s = argv;
+	while (s != NULL) {
+		sep = strchr(s, ',');
+		if (sep)
+			*sep = '\0';
+
+		for (i = 0; i <= FLOW_KEY_MAX; i++) {
+			if (matches(s, flow_keys[i]) == 0) {
+				*keys |= 1 << i;
+				(*nkeys)++;
+				break;
+			}
+		}
+		if (i > FLOW_KEY_MAX) {
+			fprintf(stderr, "Unknown flow key \"%s\"\n", s);
+			return -1;
+		}
+		s = sep ? sep + 1 : NULL;
+	}
+	return 0;
+}
+
+static void transfer_bitop(__u32 *mask, __u32 *xor, __u32 m, __u32 x)
+{
+	*xor = x ^ (*xor & m);
+	*mask &= m;
+}
+
+static int get_addend(__u32 *addend, char *argv, __u32 keys)
+{
+	inet_prefix addr;
+	int sign = 0;
+	__u32 tmp;
+
+	if (*argv == '-') {
+		sign = 1;
+		argv++;
+	}
+
+	if (get_u32(&tmp, argv, 0) == 0)
+		goto out;
+
+	if (keys & (FLOW_KEY_SRC | FLOW_KEY_DST |
+		    FLOW_KEY_NFCT_SRC | FLOW_KEY_NFCT_DST) &&
+	    get_addr(&addr, argv, AF_UNSPEC) == 0) {
+		switch (addr.family) {
+		case AF_INET:
+			tmp = ntohl(addr.data[0]);
+			goto out;
+		case AF_INET6:
+			tmp = ntohl(addr.data[3]);
+			goto out;
+		}
+	}
+
+	return -1;
+out:
+	if (sign)

Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

Fix the broken qdisc instead.


What do you mean? I don't think the qdiscs are broken.
I cannot think of any way how e.g. TBF can do anything useful
with large TSO packets.



Someone posted a patch some time ago to calculate the amount
of tokens needed in max_size portions and use that, but IMO
people should just configure TBF with the proper MTU for TSO.



Re: Still oopsing in nf_nat_move_storage()

2008-01-31 Thread Chuck Ebbert
On 01/29/2008 12:18 PM, Patrick McHardy wrote:
 Chuck Ebbert wrote:
 nf_nat_move_storage():
 /usr/src/debug/kernel-2.6.23/linux-2.6.23.i686/net/ipv4/netfilter/nf_nat_core.c:612

   87:   f7 47 64 80 01 00 00testl  $0x180,0x64(%edi)
   8e:   74 39   je c9
 nf_nat_move_storage+0x65

 line 612:
 if (!(ct->status & IPS_NAT_DONE_MASK))
 return;

 ct is NULL
 
 The current kernel (and 2.6.23-stable) have:
 
  if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
 return;
 
 so it seems you're using an old version.

Sorry, I re-used the analysis from before that change went in. I now
have an oops report from 2.6.23.14 on x86_64.

It is oopsing there, and only on x86_64 now, because x86_64 refuses to
use a non-canonical address. ct contains what appears to be ASCII data.
i386 might be dereferencing some random address instead of oopsing...


   0:   48 f7 45 78 80 01 00testq  $0x180,0x78(%rbp)
   7:   00
   8:   74 4c   je 0x56
   a:   48 c7 c7 e0 18 28 88mov$0x882818e0,%rdi

%rbp has a bogus (non-canonical) address. On i386 there is no such test possible
so it will just dereference the address if it is mapped.

%rbp contains 8 valid ASCII chars: salcf x\



Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Kok, Auke
Carsten Aulbert wrote:
 Hi Andi,
 
 Andi Kleen wrote:
 Another issue with full duplex TCP not mentioned yet is that if TSO is
 used the output  will be somewhat bursty and might cause problems with
 the TCP ACK clock of the other direction because the ACKs would need
 to squeeze in between full TSO bursts.

 You could try disabling TSO with ethtool.
 
 I just tried that:
 
 https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3
 
 It seems that the numbers do get better (sweet-spot seems to be MTU6000
 with 914 MBit/s and 927 MBit/s), however for other settings the results
 vary a lot so I'm not sure how large the statistical fluctuations are.
 
 Next test I'll try if it makes sense to enlarge the ring buffers.

Sometimes it may help if the system (CPU) is laggy or busy a lot, so that the
card has more buffers available (and thus can go longer without being serviced).

Usually (if your system responds quickly) it's better to use *smaller* ring
sizes, as this reduces cache usage. Hence the small default value.

So, unless the ethtool -S ethX output indicates that your system is too busy
(rx_no_buffer_count increases), I would not recommend increasing the ring size.

Auke


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
 Then change TBF to use skb_gso_segment?  Be careful, the fact that

That doesn't help because it wants to interleave packets
from different streams to get everything fair and smooth. The only 
good way to handle that is to split it up and the simplest way to do 
this is to just tell TCP to not do GSO in the first place.

-Andi



Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

Then change TBF to use skb_gso_segment?  Be careful, the fact that


That doesn't help because it wants to interleave packets
from different streams to get everything fair and smooth. The only 
good way to handle that is to split it up and the simplest way to do 
this is to just tell TCP to not do GSO in the first place.



That's not correct, TBF keeps packets strictly ordered unless
an inner qdisc does reordering. But even then (let's say you use
SFQ) packets of a single flow will stay ordered. Segmenting
TSO packets is no different than having them arrive independently
for other reasons.


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Rick Jones

Andi Kleen wrote:
TSO interacts badly with many queueing disciplines because they rely on 
reordering packets from different streams and the large TSO packets can 
make this difficult. This patch disables TSO for sockets that send over 
devices with non standard queueing disciplines. That's anything but noop 
or pfifo_fast and pfifo right now.


Does this also imply that JumboFrames interacts badly with these qdiscs? 
 Or IPoIB with its 65000ish byte MTU?


rick jones


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Stephen Hemminger wrote:

On Thu, 31 Jan 2008 19:37:35 +0100
Andi Kleen [EMAIL PROTECTED] wrote:


On Thu, Jan 31, 2008 at 07:01:00PM +0100, Patrick McHardy wrote:

Andi Kleen wrote:

Fix the broken qdisc instead.

What do you mean? I don't think the qdiscs are broken.
I cannot think of any way how e.g. TBF can do anything useful
with large TSO packets.


Someone posted a patch some time ago to calculate the amount
of tokens needed in max_size portions and use that, but IMO
people should just configure TBF with the proper MTU for TSO.

TBF with 64k atomic units will always be chunky and uneven. I don't
think that's a useful goal. 


-Andi


Then change TBF to use skb_gso_segment?  Be careful, the fact that
one skb ends up queueing multiple skb's would cause issues to parent
qdisc (ie work generating qdisc).



How about keeping the TSO-capable flag on qdiscs, propagating
the non-capability up the tree and perform segmentation before
queueing to the root?
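
A rough sketch of that idea (editorial; the helper is invented, TCQ_F_GSO_OK
is the flag from Andi's patch, and error handling is simplified): segment the
GSO skb with skb_gso_segment() and feed the segments to the root qdisc one by
one when the qdisc has not advertised TSO capability.

static int qdisc_enqueue_maybe_segment(struct sk_buff *skb, struct Qdisc *q)
{
	struct sk_buff *segs, *next;
	int rc = NET_XMIT_SUCCESS;

	if (!skb_is_gso(skb) || (q->flags & TCQ_F_GSO_OK))
		return q->enqueue(skb, q);

	segs = skb_gso_segment(skb, 0);
	if (IS_ERR(segs))
		return NET_XMIT_DROP;

	/* The segments carry all the data; the original GSO skb can go. */
	kfree_skb(skb);
	while (segs) {
		next = segs->next;
		segs->next = NULL;
		rc = q->enqueue(segs, q);
		segs = next;
	}
	return rc;
}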


Re: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Rick Jones

Carsten Aulbert wrote:

Hi all, slowly crawling through the mails.

Brandeburg, Jesse wrote:


The test was done with various mtu sizes ranging from 1500 to 9000,
with ethernet flow control switched on and off, and using reno and
cubic as a TCP congestion control.


As asked in LKML thread, please post the exact netperf command used
to start the client/server, whether or not you're using irqbalanced
(aka irqbalance) and what cat /proc/interrupts looks like (you ARE
using MSI, right?)



We are using MSI, /proc/interrupts look like:
n0003:~# cat /proc/interrupts
   CPU0   CPU1   CPU2   CPU3
  0:6536963  0  0  0   IO-APIC-edge  timer
  1:  2  0  0  0   IO-APIC-edge  i8042
  3:  1  0  0  0   IO-APIC-edge  serial
  8:  0  0  0  0   IO-APIC-edge  rtc
  9:  0  0  0  0   IO-APIC-fasteoi   acpi
 14:  32321  0  0  0   IO-APIC-edge  libata
 15:  0  0  0  0   IO-APIC-edge  libata
 16:  0  0  0  0   IO-APIC-fasteoi 
uhci_hcd:usb5
 18:  0  0  0  0   IO-APIC-fasteoi 
uhci_hcd:usb4
 19:  0  0  0  0   IO-APIC-fasteoi 
uhci_hcd:usb3
 23:  0  0  0  0   IO-APIC-fasteoi 
ehci_hcd:usb1, uhci_hcd:usb2

378:   17234866  0  0  0   PCI-MSI-edge  eth1
379: 129826  0  0  0   PCI-MSI-edge  eth0
NMI:  0  0  0  0
LOC:    6537181    6537326    6537149    6537052
ERR:  0

(sorry for the line break).

What we don't understand is why only core0 gets the interrupts, since 
the affinity is set to f:

# cat /proc/irq/378/smp_affinity
f

Right now, irqbalance is not running, though I can give it shot if 
people think this will make a difference.



I would suggest you try TCP_RR with a command line something like this:
netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K



I did that and the results can be found here:
https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest


For convenience, 2.4.4 (perhaps earlier I can never remember when I've 
added things :) allows the output format for a TCP_RR test to be set to 
the same as a _STREAM or _MAERTS test.  And if you add a -v 2 to it you 
will get the each way values and the average round-trip latency:


[EMAIL PROTECTED]:~/netperf2_trunk$ src/netperf -t TCP_RR -H oslowest.cup -f m 
-v 2 -- -r 64K -b 4
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
oslowest.cup.hp.com (16.89.84.17) port 0 AF_INET : first burst 4

Local /Remote
Socket Size   Request  Resp.   Elapsed
Send   Recv   Size     Size    Time     Throughput
bytes  Bytes  bytes    bytes   secs.    10^6bits/sec

16384  87380  65536    65536   10.01       105.63
16384  87380
Alignment      Offset         RoundTrip  Trans     Throughput
Local  Remote  Local  Remote  Latency    Rate      10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec   Outbound   Inbound
8      0       0      0       49635.583  100.734   52.814     52.814
[EMAIL PROTECTED]:~/netperf2_trunk$

(this was a WAN test :)

rick jones

one of these days I may tweak netperf further so if the CPU utilization 
method for either end doesn't require calibration, CPU utilization will 
always be done on that end.  people's thoughts on that tweak would be 
most welcome...



Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 07:21:20PM +0100, Patrick McHardy wrote:
 Andi Kleen wrote:
 Then change TBF to use skb_gso_segment?  Be careful, the fact that
 
 That doesn't help because it wants to interleave packets
 from different streams to get everything fair and smooth. The only 
 good way to handle that is to split it up and the simplest way to do 
 this is to just tell TCP to not do GSO in the first place.
 
 
 Thats not correct, TBF keeps packets strictly ordered unless

My point was that without TSO different submitters will interleave
their streams (because they compete for qdisc submission) 
and then you end up with a smooth rate over time for all of them.

If you submit in large chunks only (as TSO does) it will always 
be more bursty and that works against the TBF goal.

For a single submitter you would be correct.

-Andi


RE: e1000 full-duplex TCP performance well below wire speed

2008-01-31 Thread Brandeburg, Jesse
Bill Fink wrote:
 a 2.6.15.4 kernel.  The GigE NICs are Intel PRO/1000
 82546EB_QUAD_COPPER, 
 on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000
 driver, and running with 9000-byte jumbo frames.  The TCP congestion
 control is BIC.

Bill, FYI, there was a known issue with e1000 (fixed in 7.0.38-k2) and
socket charge due to truesize that kept one end or the other from
opening its window.  The result is not so great performance, and you
must upgrade the driver at both ends to fix it.

it was fixed in commit
9e2feace1acd38d7a3b1275f7f9f8a397d09040e

That commit itself needed a couple of follow-on bug fixes, but the point
is that you could download 7.3.20 from sourceforge (which would compile
on your kernel) and compare the performance with it if you were
interested in a further experiment.

Jesse


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Andi Kleen
On Thu, Jan 31, 2008 at 10:26:19AM -0800, Rick Jones wrote:
 Andi Kleen wrote:
 TSO interacts badly with many queueing disciplines because they rely on 
 reordering packets from different streams and the large TSO packets can 
 make this difficult. This patch disables TSO for sockets that send over 
 devices with non standard queueing disciplines. That's anything but noop 
 or pfifo_fast and pfifo right now.
 
 Does this also imply that JumboFrames interacts badly with these qdiscs? 
  Or IPoIB with its 65000ish byte MTU?

Correct. Of course it is always relative to the link speed. So if your
link is 10x faster and your packets 10x bigger you can get similarly
smooth shaping.

-Andi

