RE: e1000 full-duplex TCP performance well below wire speed
Hi Jesse,

It's good to be talking directly to one of the e1000 developers and maintainers, although at this point I am starting to think that the issue may be TCP stack related and nothing to do with the NIC. Am I correct that these are quite distinct parts of the kernel?

Yes, quite.

OK. I hope that there is also someone knowledgeable about the TCP stack who is following this thread. (Perhaps you also know this part of the kernel, but I am assuming that your expertise is on the e1000/NIC bits.)

Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections.

That eliminates bus bandwidth issues, probably, but small packets take up a lot of extra descriptors, bus bandwidth, CPU, and cache resources.

I see. Your concern is the extra ACK packets associated with TCP. Even though these represent a small volume of data (around 5% with MTU=1500, and less at larger MTU), they double the number of packets that must be handled by the system compared to UDP transmission at the same data rate. Is that correct? I have to wait until Carsten or Henning wake up tomorrow (now 23:38 in Germany), so we'll provide this info in ~10 hours.

I would suggest you try TCP_RR with a command line something like this:

	netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K

I think you'll have to compile netperf with burst mode support enabled.

I just saw Carsten a few minutes ago. He has to take part in a 'Baubesprechung' meeting this morning, after which he will start answering the technical questions and doing additional testing as suggested by you and others. If you are on the US west coast, he should have some answers and results posted by Thursday morning Pacific time.

I assume that the interrupt load is distributed among all four cores -- the default affinity is 0xff -- and I also assume that there is some type of interrupt aggregation taking place in the driver.
If the CPUs were not able to service the interrupts fast enough, I assume that we would also see loss of performance with UDP testing.

One other thing you can try with e1000 is disabling the dynamic interrupt moderation by loading the driver with InterruptThrottleRate=8000,8000,... (the number of commas depends on your number of ports), which might help in your particular benchmark.

OK. Is 'dynamic interrupt moderation' another name for 'interrupt aggregation'? Meaning that if more than one interrupt is generated in a given time interval, then they are replaced by a single interrupt?

Yes, InterruptThrottleRate=8000 means there will be no more than 8000 ints/second from that adapter, and if interrupts are generated faster than that they are aggregated. Interestingly, since you are interested in ultra low latency, and may be willing to give up some cpu for it during bulk transfers, you should try InterruptThrottleRate=1 (can generate up to 7 ints/s).

I'm not sure it's quite right to say that we are interested in ultra low latency. Most of our network transfers involve bulk data movement (a few MB or more). We don't care so much about low latency (meaning how long it takes the FIRST byte of data to travel from sender to receiver). We care about aggregate bandwidth: once the pipe is full, how fast can data be moved through it. So we don't care so much if getting the pipe full takes 20 us or 50 us. We just want the data to flow fast once the pipe IS full.

Welcome, it's an interesting discussion. Hope we can come to a good conclusion.

Thank you. Carsten will post more info and answers later today.

Cheers,
Bruce
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
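As an illustration of the InterruptThrottleRate experiment suggested above, the module reload might look like the sketch below. This is a hedged sketch only: the two-value list assumes a hypothetical two-port adapter, it requires root, and reloading the driver briefly takes the links down.

```shell
# Sketch (assumes a hypothetical two-port e1000 adapter; run as root).
# Cap the adapter at a fixed 8000 interrupts/second per port:
modprobe -r e1000
modprobe e1000 InterruptThrottleRate=8000,8000

# While a netperf run is active, watch the interrupt counters to
# confirm the moderation is in effect:
grep eth /proc/interrupts
```

Re-running the same command with `InterruptThrottleRate=1,1` would exercise the low-latency setting discussed above.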
Re: ipcomp regression in 2.6.24
Herbert Xu wrote:
On Wed, Jan 30, 2008 at 10:14:46AM +0100, Marco Berizzi wrote:

Sorry to bother you again. I have applied it to 2.6.24, but ipcomp doesn't work anyway. I have patched a clean 2.6.24 tree and I did a complete rebuild. With tcpdump I see both the esp packets going in/out, but I don't see the clear packets on the interface.

After testing it here, it looks like there is this little typo, which means that you can't actually use IPComp for anything that's not compressible :)

Applied and tested on 2.6.24: ipcomp is working now. As always, thanks a lot Herbert for fixing this.
xfrm_lookup() and XFRM_POLICY_ICMP
Hello,

A question about XFRM_POLICY_ICMP: I had tried to understand this check in the __xfrm_lookup() method in net/xfrm/xfrm_policy.c (the recent 2.6 git Dave Miller tree):

	...
	if ((flags & XFRM_LOOKUP_ICMP) && !(policy->flags & XFRM_POLICY_ICMP))
		goto error;
	...

Why is the check for XFRM_POLICY_ICMP? I grepped under the kernel tree, and the only place where XFRM_POLICY_ICMP appears is here (apart from its definition in xfrm.h). I also grepped under the openswan tree and could not find XFRM_POLICY_ICMP (the struct xfrm_userpolicy_info in openswan includes XFRM_POLICY_ALLOW, XFRM_POLICY_BLOCK and XFRM_POLICY_LOCALOK, but not XFRM_POLICY_ICMP). I also grepped under the iproute2 tree (from git), and there is no XFRM_POLICY_ICMP there either. So is there a way at all to set XFRM_POLICY_ICMP? And if not, maybe this check is not needed at all?

Regards,
Andy
Re: e1000 full-duplex TCP performance well below wire speed
Hi Sangtae,

Thanks for joining this discussion -- it's good to have a CUBIC author and expert here!

In our application (cluster computing) we use a very tightly coupled high-speed low-latency network. There is no 'wide area traffic'. So it's hard for me to understand why any networking components or software layers should take more than milliseconds to ramp up or back off in speed. Perhaps we should be asking for a TCP congestion avoidance algorithm which is designed for a data center environment where there are very few hops and typical packet delivery times are tens or hundreds of microseconds. It's very different from delivering data thousands of km across a WAN.

If your network latency is low, regardless of the type of protocol it should give you more than 900 Mb/s.

Yes, this is also what I had thought. In the graph that we posted, the two machines are connected by an ethernet crossover cable. The total RTT of the two machines is probably AT MOST a couple of hundred microseconds. Typically it takes 20 or 30 microseconds to get the first packet out the NIC. Travel across the wire is a few nanoseconds. Then getting the packet into the receiving NIC might be another 20 or 30 microseconds. The ACK should fly back in about the same time.

I can guess the RTT of the two machines is less than 4 ms in your case, and I remember the throughputs of all high-speed protocols (including tcp-reno) were more than 900 Mb/s with a 4 ms RTT. So, my question is: which kernel version did you use with your Broadcom NIC when you got more than 900 Mb/s?

We are going to double-check this (we did the Broadcom testing about two months ago). Carsten is going to re-run the Broadcom experiments later today and will then post the results.
You can see results from some testing on crossover-cable wired systems with Broadcom NICs, that I did about 2 years ago, here: http://www.lsc-group.phys.uwm.edu/beowulf/nemo/design/SMC_8508T_Performance.html You'll notice that total TCP throughput on the crossover cable was about 220 MB/sec. With TCP overhead this is very close to 2 Gb/s.

I have two machines connected by a gig switch and I can see what happens in my environment. Could you post which parameters you used for netperf testing?

Carsten will post these in the next few hours.

If you want to simplify further, you can even take away the gig switch and just use a crossover cable. And also, if you set any parameters for your testing, please post them here so that I can see whether the same happens for me as well.

Carsten will post all the sysctl and ethtool parameters shortly.

Thanks again for chiming in. I am sure that with help from you, Jesse, and Rick, we can figure out what is going on here and get it fixed.

Cheers,
Bruce
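For readers who want to reproduce this kind of bidirectional measurement before the exact parameters are posted, a generic netperf invocation might look like the sketch below. The peer address and the 10-second duration are placeholders, not the values actually used in this thread:

```shell
# Hypothetical bidirectional TCP test against a host running netserver.
# TCP_STREAM sends, TCP_MAERTS receives; -C/-c report CPU utilization.
PEER=192.168.6.79
netperf -H "$PEER" -t TCP_STREAM -l 10 -C -c &
netperf -H "$PEER" -t TCP_MAERTS -l 10 -C -c &
wait
```

Running both directions concurrently (via the trailing `&` and `wait`) is what makes the test full duplex; running them one after the other would only show unidirectional throughput.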
Re: e1000 full-duplex TCP performance well below wire speed
Bruce Allen [EMAIL PROTECTED] writes:

Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections.

Another issue with full duplex TCP not mentioned yet is that if TSO is used, the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction, because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool.

-Andi
Re: e1000 full-duplex TCP performance well below wire speed
Hi Andi!

Important note: we ARE able to get full duplex wire speed (over 900 Mb/s simultaneously in both directions) using UDP. The problems occur only with TCP connections.

Another issue with full duplex TCP not mentioned yet is that if TSO is used, the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction, because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool.

Noted. We'll try this also.

Cheers,
Bruce
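Andi's suggestion can be tried with a couple of ethtool commands; a sketch, where eth0 stands in for the e1000 interface under test:

```shell
# Check whether TSO is currently enabled:
ethtool -k eth0 | grep tcp-segmentation-offload

# Disable it for the experiment, re-run the bidirectional test,
# then restore the original setting:
ethtool -K eth0 tso off
# ... run the bidirectional TCP test here ...
ethtool -K eth0 tso on
```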
Re: [2.6 patch] make net/802/tr.c:sysctl_tr_rif_timeout static
Adrian Bunk wrote:

sysctl_tr_rif_timeout can now become static.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-by: Pavel Emelyanov [EMAIL PROTECTED]

---
e5accd81b924224d40a3f38204c08cf6896ed79b
diff --git a/net/802/tr.c b/net/802/tr.c
index 3f16b17..18c6647 100644
--- a/net/802/tr.c
+++ b/net/802/tr.c
@@ -76,7 +76,7 @@
 static DEFINE_SPINLOCK(rif_lock);
 static struct timer_list rif_timer;
 
-int sysctl_tr_rif_timeout = 60*10*HZ;
+static int sysctl_tr_rif_timeout = 60*10*HZ;
 
 static inline unsigned long rif_hash(const unsigned char *addr)
 {
Re: [2.6 patch] make struct ipv4_devconf static
Adrian Bunk wrote:

struct ipv4_devconf can now become static.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/linux/inetdevice.h | 2 --
 net/ipv4/devinet.c         | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

20262a3317069b1bdbf2b37f4002fa5322445914
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index 8d9eaae..fc4e3db 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -17,8 +17,6 @@ struct ipv4_devconf
 	DECLARE_BITMAP(state, __NET_IPV4_CONF_MAX - 1);
 };
 
-extern struct ipv4_devconf ipv4_devconf;
-
 struct in_device {
 	struct net_device *dev;

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 21f71bf..5ab5acc 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -64,7 +64,7 @@
 #include <net/rtnetlink.h>
 #include <net/net_namespace.h>
 
-struct ipv4_devconf ipv4_devconf = {
+static struct ipv4_devconf ipv4_devconf = {
 	.data = {
 		[NET_IPV4_CONF_ACCEPT_REDIRECTS - 1] = 1,
 		[NET_IPV4_CONF_SEND_REDIRECTS - 1] = 1,
Re: [2.6 patch] make nf_ct_path[] static
Adrian Bunk wrote:

This patch makes the needlessly global nf_ct_path[] static.

Signed-off-by: Adrian Bunk [EMAIL PROTECTED]

Acked-by: Pavel Emelyanov [EMAIL PROTECTED]

Thanks, Adrian!

---
6396fbcebe3eb61f7e6eb1a671920a515912b005
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 696074a..5bd38a6 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -380,7 +380,7 @@ static ctl_table nf_ct_netfilter_table[] = {
 	{ .ctl_name = 0 }
 };
 
-struct ctl_path nf_ct_path[] = {
+static struct ctl_path nf_ct_path[] = {
 	{ .procname = "net", .ctl_name = CTL_NET, },
 	{ }
 };
Re: Strange commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Adrian Bunk wrote:

Commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 ([RAW]: Consolidate proc interface.) did not only change raw6_seq_ops (including adding 3 EXPORT_SYMBOL_GPL's to net/ipv4/raw.c for accessing functions from there), it also removed the only user of raw6_seq_ops...

The commit is not strange, it is wrong :( Sorry David, when I checked the corresponding proc files, I saw that both files show sockets, but overlooked that the raw6 one shows the ipv4 part of the ipv6 socket. Denis noticed that this morning and has already prepared a fix. So please, do not revert the commit, the fix will be in your mailbox today.

Thanks, Adrian.

cu
Adrian
Re: [PATCH net-2.6.25][NETNS]: Fix race between put_net() and netlink_kernel_create().
David Miller wrote:

From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 24 Jan 2008 16:15:13 +0300

The comment about a race-free view of the set of network namespaces was a bit hasty. Look (there even can be only one CPU, as discovered by Alexey Dobriyan and Denis Lunev): ... Instead, I propose to create the socket inside the init_net namespace and then re-attach it to the desired one right after the socket is created. After doing this, we also have to be careful on error paths not to drop the reference on a namespace that we didn't take one on.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Acked-by: Denis Lunev [EMAIL PROTECTED]

Applied, thanks.

Thanks, David. I have one more patch pending in netdev@ and some more to be sent (cleanups, small fixes and net namespaces). Do I have to wait till net-2.6.26, or can I start (re-)sending them while the 2.6.25 merge window is open?

Thanks,
Pavel
[PATCH 0/3] [RAW]: proc output cleanups.
Yesterday Adrian Bunk noticed that the commit

commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date: Mon Nov 19 22:38:33 2007 -0800

is somewhat strange. Basically, the commit is obviously wrong, as the content of /proc/net/raw6 is incorrect due to it. This series of patches fixes the original problem and slightly cleans up the code around it.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
[PATCH 2/3] [RAW]: Cleanup IPv4 raw_seq_show.
There is no need to use 128 bytes on the stack at all. Clean the code in the IPv6 style.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/raw.c | 24 +++-----------------
 1 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 507cbfe..830f19e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -927,7 +927,7 @@ void raw_seq_stop(struct seq_file *seq, void *v)
 }
 EXPORT_SYMBOL_GPL(raw_seq_stop);
 
-static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i)
+static void raw_sock_seq_show(struct seq_file *seq, struct sock *sp, int i)
 {
 	struct inet_sock *inet = inet_sk(sp);
 	__be32 dest = inet->daddr,
@@ -935,33 +935,23 @@ static __inline__ char *get_raw_sock(struct sock *sp, char *tmpbuf, int i)
 	       src = inet->rcv_saddr;
 	__u16 destp = 0,
 	      srcp = inet->num;
 
-	sprintf(tmpbuf, "%4d: %08X:%04X %08X:%04X"
+	seq_printf(seq, "%4d: %08X:%04X %08X:%04X"
 		" %02X %08X:%08X %02X:%08lX %08X %5d %8d %lu %d %p %d",
 		i, src, srcp, dest, destp, sp->sk_state,
 		atomic_read(&sp->sk_wmem_alloc),
 		atomic_read(&sp->sk_rmem_alloc),
 		0, 0L, 0, sock_i_uid(sp), 0, sock_i_ino(sp),
 		atomic_read(&sp->sk_refcnt), sp, atomic_read(&sp->sk_drops));
-	return tmpbuf;
 }
 
-#define TMPSZ 128
-
 static int raw_seq_show(struct seq_file *seq, void *v)
 {
-	char tmpbuf[TMPSZ+1];
-
 	if (v == SEQ_START_TOKEN)
-		seq_printf(seq, "%-*s\n", TMPSZ-1,
-			   "  sl  local_address rem_address   st tx_queue "
-			   "rx_queue tr tm->when retrnsmt   uid  timeout "
-			   "inode  drops");
-	else {
-		struct raw_iter_state *state = raw_seq_private(seq);
-
-		seq_printf(seq, "%-*s\n", TMPSZ-1,
-			   get_raw_sock(v, tmpbuf, state->bucket));
-	}
+		seq_printf(seq, "  sl  local_address rem_address   st tx_queue "
+			   "rx_queue tr tm->when retrnsmt   uid  timeout "
+			   "inode  drops\n");
+	else
+		raw_sock_seq_show(seq, v, raw_seq_private(seq)->bucket);
 	return 0;
 }
-- 
1.5.3.rc5
[PATCH 3/3] [RAW]: Wrong content of the /proc/net/raw6.
The address of IPv6 raw sockets was shown in the wrong format, as for IPv4 ones. The problem was introduced by the commit

commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date: Mon Nov 19 22:38:33 2007 -0800

Thanks to Adrian Bunk who originally noticed the problem.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/raw.h | 3 ++-
 net/ipv4/raw.c    | 8 ++++----
 net/ipv6/raw.c    | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index c7ea7a2..1828f81 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -48,7 +48,8 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
+int raw_seq_open(struct inode *ino, struct file *file,
+		 struct raw_hashinfo *h, const struct seq_operations *ops);
 
 #endif

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 830f19e..a3002fe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -962,13 +962,13 @@ static const struct seq_operations raw_seq_ops = {
 	.show = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
+int raw_seq_open(struct inode *ino, struct file *file,
+		 struct raw_hashinfo *h, const struct seq_operations *ops)
 {
 	int err;
 	struct raw_iter_state *i;
 
-	err = seq_open_net(ino, file, &raw_seq_ops,
-			sizeof(struct raw_iter_state));
+	err = seq_open_net(ino, file, ops, sizeof(struct raw_iter_state));
 	if (err < 0)
 		return err;
 
@@ -980,7 +980,7 @@ EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-	return raw_seq_open(inode, file, &raw_v4_hashinfo);
+	return raw_seq_open(inode, file, &raw_v4_hashinfo, &raw_seq_ops);
 }
 
 static const struct file_operations raw_seq_fops = {

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index a2cf499..8897ccf 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-	return raw_seq_open(inode, file, &raw_v6_hashinfo);
+	return raw_seq_open(inode, file, &raw_v6_hashinfo, &raw6_seq_ops);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5
[PATCH 1/3] [RAW]: Family check in the /proc/net/raw[6] is extra.
Different hashtables are used for IPv6 and IPv4 raw sockets, so there is no need to check the socket family in the iterator over the hashtables. Clean this out.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/raw.h |  4 +---
 net/ipv4/raw.c    | 12 ++++--------
 net/ipv6/raw.c    |  2 +-
 3 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/include/net/raw.h b/include/net/raw.h
index cca81d8..c7ea7a2 100644
--- a/include/net/raw.h
+++ b/include/net/raw.h
@@ -41,7 +41,6 @@ extern void raw_proc_exit(void);
 struct raw_iter_state {
 	struct seq_net_private p;
 	int bucket;
-	unsigned short family;
 	struct raw_hashinfo *h;
 };
 
@@ -49,8 +48,7 @@ struct raw_iter_state {
 void *raw_seq_start(struct seq_file *seq, loff_t *pos);
 void *raw_seq_next(struct seq_file *seq, void *v, loff_t *pos);
 void raw_seq_stop(struct seq_file *seq, void *v);
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-		 unsigned short family);
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h);
 
 #endif

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index f863c3d..507cbfe 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -862,8 +862,7 @@ static struct sock *raw_get_first(struct seq_file *seq)
 		struct hlist_node *node;
 
 		sk_for_each(sk, node, &state->h->ht[state->bucket])
-			if (sk->sk_net == state->p.net &&
-			    sk->sk_family == state->family)
+			if (sk->sk_net == state->p.net)
 				goto found;
 	}
 	sk = NULL;
@@ -879,8 +878,7 @@ static struct sock *raw_get_next(struct seq_file *seq, struct sock *sk)
 		sk = sk_next(sk);
try_again:
 		;
-	} while (sk && sk->sk_net != state->p.net &&
-		 sk->sk_family != state->family);
+	} while (sk && sk->sk_net != state->p.net);
 
 	if (!sk && ++state->bucket < RAW_HTABLE_SIZE) {
 		sk = sk_head(&state->h->ht[state->bucket]);
@@ -974,8 +972,7 @@ static const struct seq_operations raw_seq_ops = {
 	.show = raw_seq_show,
 };
 
-int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
-		 unsigned short family)
+int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h)
{
 	int err;
 	struct raw_iter_state *i;
 
@@ -987,14 +984,13 @@ int raw_seq_open(struct inode *ino, struct file *file, struct raw_hashinfo *h,
 
 	i = raw_seq_private((struct seq_file *)file->private_data);
 	i->h = h;
-	i->family = family;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(raw_seq_open);
 
 static int raw_v4_seq_open(struct inode *inode, struct file *file)
 {
-	return raw_seq_open(inode, file, &raw_v4_hashinfo, PF_INET);
+	return raw_seq_open(inode, file, &raw_v4_hashinfo);
 }
 
 static const struct file_operations raw_seq_fops = {

diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index d61c63d..a2cf499 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -1262,7 +1262,7 @@ static const struct seq_operations raw6_seq_ops = {
 
 static int raw6_seq_open(struct inode *inode, struct file *file)
 {
-	return raw_seq_open(inode, file, &raw_v6_hashinfo, PF_INET6);
+	return raw_seq_open(inode, file, &raw_v6_hashinfo);
 }
 
 static const struct file_operations raw6_seq_fops = {
-- 
1.5.3.rc5
Re: [PATCH net-2.6.25][NETNS]: Fix race between put_net() and netlink_kernel_create().
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 14:05:57 +0300

I have one more patch pending in netdev@ and some more to be sent (cleanups, small fixes and net namespaces). Do I have to wait till net-2.6.26, or can I start (re-)sending them while the 2.6.25 merge window is open?

Send it, I'll take a look at it.
Re: e1000 full-duplex TCP performance well below wire speed
On Wed, 30 Jan 2008, SANGTAE HA wrote:

On Jan 30, 2008 5:25 PM, Bruce Allen [EMAIL PROTECTED] wrote:

In our application (cluster computing) we use a very tightly coupled high-speed low-latency network. There is no 'wide area traffic'. So it's hard for me to understand why any networking components or software layers should take more than milliseconds to ramp up or back off in speed. Perhaps we should be asking for a TCP congestion avoidance algorithm which is designed for a data center environment where there are very few hops and typical packet delivery times are tens or hundreds of microseconds. It's very different from delivering data thousands of km across a WAN.

If your network latency is low, regardless of the type of protocol it should give you more than 900 Mb/s. I can guess the RTT of the two machines is less than 4 ms in your case, and I remember the throughputs of all high-speed protocols (including tcp-reno) were more than 900 Mb/s with a 4 ms RTT. So, my question is: which kernel version did you use with your Broadcom NIC when you got more than 900 Mb/s? I have two machines connected by a gig switch and I can see what happens in my environment. Could you post which parameters you used for netperf testing? And also, if you set any parameters for your testing, please post them here so that I can see whether the same happens for me as well.

I see similar results on my test systems, using a Tyan Thunder K8WE (S2895) motherboard with dual Intel Xeon 3.06 GHz CPUs and 1 GB memory, running a 2.6.15.4 kernel. The GigE NICs are Intel PRO/1000 82546EB_QUAD_COPPER, on a 64-bit/133-MHz PCI-X bus, using version 6.1.16-k2 of the e1000 driver, and running with 9000-byte jumbo frames. The TCP congestion control is BIC.
Unidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79
tx: 1186.5649 MB / 10.05 sec = 990.2741 Mbps 11 %TX 9 %RX 0 retrans

and:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Irx -r -w2m 192.168.6.79
rx: 1186.8281 MB / 10.05 sec = 990.5634 Mbps 14 %TX 9 %RX 0 retrans

Each direction gets full GigE line rate.

Bidirectional TCP test:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.6.79
tx:  898.9934 MB / 10.05 sec = 750.1634 Mbps 10 %TX  8 %RX 0 retrans
rx: 1167.3750 MB / 10.06 sec = 973.8617 Mbps 14 %TX 11 %RX 0 retrans

While one direction gets close to line rate, the other only got 750 Mbps. Note there were no TCP retransmitted segments for either data stream, so that doesn't appear to be the cause of the slower transfer rate in one direction.

If the receive direction uses a different GigE NIC that's part of the same quad-GigE, all is fine:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.5.79
tx: 1186.5051 MB / 10.05 sec = 990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx: 1186.7656 MB / 10.05 sec = 990.5204 Mbps 15 %TX 14 %RX 0 retrans

Here's a test using the same GigE NIC for both directions with 1-second interval reports:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -i1 -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -i1 -w2m 192.168.6.79
tx:  92.3750 MB / 1.01 sec = 767.2277 Mbps 0 retrans
rx: 104.5625 MB / 1.01 sec = 872.4757 Mbps 0 retrans
tx:  83.3125 MB / 1.00 sec = 700.1845 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5541 Mbps 0 retrans
tx:  83.8125 MB / 1.00 sec = 703.0322 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5502 Mbps 0 retrans
tx:  83.     MB / 1.00 sec = 696.1779 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5522 Mbps 0 retrans
tx:  83.7500 MB / 1.00 sec = 702.4989 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5512 Mbps 0 retrans
tx:  83.1250 MB / 1.00 sec = 697.2270 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5512 Mbps 0 retrans
tx:  84.1875 MB / 1.00 sec = 706.1665 Mbps 0 retrans
rx: 117.5625 MB / 1.00 sec = 985.5510 Mbps 0 retrans
tx:  83.0625 MB / 1.00 sec = 696.7167 Mbps 0 retrans
rx: 117.6875 MB / 1.00 sec = 987.5543 Mbps 0 retrans
tx:  84.1875 MB / 1.00 sec = 706.1545 Mbps 0 retrans
rx: 117.6250 MB / 1.00 sec = 986.5472 Mbps 0 retrans
rx: 117.6875 MB / 1.00 sec = 987.0724 Mbps 0 retrans
tx:  83.3125 MB / 1.00 sec = 698.8137 Mbps 0 retrans

tx:  844.9375 MB / 10.07 sec = 703.7699 Mbps 11 %TX  6 %RX 0 retrans
rx: 1167.4414 MB / 10.05 sec = 973.9980 Mbps 14 %TX 11 %RX 0 retrans

In this test case, the receiver ramped up to nearly full GigE line rate, while the transmitter was stuck at about 700 Mbps. I ran one longer 60-second test and didn't see the oscillating behavior between receiver and transmitter, but maybe that's because I have the GigE NIC interrupts and nuttcp client/server applications both locked to CPU 0. So in my tests, once one direction gets the upper hand, it seems to stay that way. Could this be because the slower side
Re: [PATCH 0/3] [RAW]: proc output cleanups.
From: Denis V. Lunev [EMAIL PROTECTED]
Date: Thu, 31 Jan 2008 14:32:52 +0300

Yesterday Adrian Bunk noticed that the commit

commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5
Author: Pavel Emelyanov [EMAIL PROTECTED]
Date: Mon Nov 19 22:38:33 2007 -0800

is somewhat strange. Basically, the commit is obviously wrong, as the content of /proc/net/raw6 is incorrect due to it. This series of patches fixes the original problem and slightly cleans up the code around it.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

All applied, thanks a lot!
hard hang through qdisc
I just managed to hang a 2.6.24 (+ some non-network patches) kernel with the following (nonsensical) command:

	tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100

No oops or anything, it just hangs. While I understand that root can do bad things, just hanging like this seems a little extreme.

-Andi
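For contrast, tbf parameters with explicit units behave sanely; the triggering command appears to ask for a rate of 1000 bit/s with a 10-byte bucket, which is degenerate. A more conventional configuration (values purely illustrative) would be:

```shell
# Replace any existing root qdisc with a reasonable token bucket filter:
# 1 Mbit/s rate, 10 kB burst bucket, backlog capped at 100 kB.
tc qdisc del dev eth0 root 2>/dev/null
tc qdisc add dev eth0 root tbf rate 1mbit burst 10kb limit 100kb
tc -s qdisc show dev eth0
```

Of course, even with the degenerate numbers the kernel should reject or tolerate the configuration rather than hang, which is the point of the report.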
[PATCH 0/6][IPV6]: Introduce the INET6_TW_MATCH macro.
We have INET_MATCH, INET_TW_MATCH and INET6_MATCH to test sockets and twbuckets for matching, but ipv6 twbuckets are tested manually. Here is the INET6_TW_MATCH to help with it.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
---
 include/linux/ipv6.h        |  8 ++++++++
 net/ipv6/inet6_hashtables.c | 21 +++++----------------
 2 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 5d35a4c..c347860 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -465,6 +465,14 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
 	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr)) &&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 
+#define INET6_TW_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)	\
+	(((__sk)->sk_hash == (__hash)) &&				\
+	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports)) &&	\
+	 ((__sk)->sk_family == PF_INET6) &&				\
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr))) &&	\
+	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_rcv_saddr, (__daddr))) && \
+	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
+
 #endif /* __KERNEL__ */
 
 #endif /* _IPV6_H */

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index a66a7d8..06b01be 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -80,17 +80,8 @@ struct sock *__inet6_lookup_established(struct inet_hashinfo *hashinfo,
 	}
 	/* Must check for a TIME_WAIT'er before going to listener hash.
	 */
 	sk_for_each(sk, node, &head->twchain) {
-		const struct inet_timewait_sock *tw = inet_twsk(sk);
-
-		if (*((__portpair *)&(tw->tw_dport)) == ports &&
-		    sk->sk_family == PF_INET6) {
-			const struct inet6_timewait_sock *tw6 = inet6_twsk(sk);
-
-			if (ipv6_addr_equal(&tw6->tw_v6_daddr, saddr) &&
-			    ipv6_addr_equal(&tw6->tw_v6_rcv_saddr, daddr) &&
-			    (!sk->sk_bound_dev_if || sk->sk_bound_dev_if == dif))
-				goto hit;
-		}
+		if (INET6_TW_MATCH(sk, hash, saddr, daddr, ports, dif))
+			goto hit;
 	}
 	read_unlock(lock);
 	return NULL;
@@ -185,15 +176,9 @@ static int __inet6_check_established(struct inet_timewait_death_row *death_row,
 
 	/* Check TIME-WAIT sockets first. */
 	sk_for_each(sk2, node, &head->twchain) {
-		const struct inet6_timewait_sock *tw6 = inet6_twsk(sk2);
-
 		tw = inet_twsk(sk2);
 
-		if (*((__portpair *)&(tw->tw_dport)) == ports &&
-		    sk2->sk_family == PF_INET6 &&
-		    ipv6_addr_equal(&tw6->tw_v6_daddr, saddr) &&
-		    ipv6_addr_equal(&tw6->tw_v6_rcv_saddr, daddr) &&
-		    (!sk2->sk_bound_dev_if || sk2->sk_bound_dev_if == dif)) {
+		if (INET6_TW_MATCH(sk2, hash, saddr, daddr, ports, dif)) {
 			if (twsk_unique(sk, sk2, twp))
 				goto unique;
 			else
-- 
1.5.3.4
[PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6. add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546) function old new delta __inet_hash_connect- 577+577 arp_ignore 108 113 +5 static.hint8 4 -4 rt_worker_func 376 372 -4 inet6_hash_connect 584 25-559 inet_hash_connect586 25-561 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- include/net/inet_hashtables.h |5 ++ net/ipv4/inet_hashtables.c| 32 +- net/ipv6/inet6_hashtables.c | 93 + 3 files changed, 28 insertions(+), 102 deletions(-) diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index 761bdc0..a34a8f2 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -413,6 +413,11 @@ static inline struct sock *inet_lookup(struct inet_hashinfo *hashinfo, return sk; } +extern int __inet_hash_connect(struct inet_timewait_death_row *death_row, + struct sock *sk, + int (*check_established)(struct inet_timewait_death_row *, + struct sock *, __u16, struct inet_timewait_sock **), + void (*hash)(struct inet_hashinfo *, struct sock *)); extern int inet_hash_connect(struct inet_timewait_death_row *death_row, struct sock *sk); #endif /* _INET_HASHTABLES_H */ diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 619c63c..b93d40f 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -348,11 +348,11 @@ void __inet_hash(struct inet_hashinfo *hashinfo, struct sock *sk) } EXPORT_SYMBOL_GPL(__inet_hash); -/* - * Bind a port for a connect operation and hash it. 
- */ -int inet_hash_connect(struct inet_timewait_death_row *death_row, - struct sock *sk) +int __inet_hash_connect(struct inet_timewait_death_row *death_row, + struct sock *sk, + int (*check_established)(struct inet_timewait_death_row *, + struct sock *, __u16, struct inet_timewait_sock **), + void (*hash)(struct inet_hashinfo *, struct sock *)) { struct inet_hashinfo *hinfo = death_row-hashinfo; const unsigned short snum = inet_sk(sk)-num; @@ -385,9 +385,8 @@ int inet_hash_connect(struct inet_timewait_death_row *death_row, BUG_TRAP(!hlist_empty(tb-owners)); if (tb-fastreuse = 0) goto next_port; - if (!__inet_check_established(death_row, - sk, port, - tw)) + if (!check_established(death_row, sk, + port, tw)) goto ok; goto next_port; } @@ -415,7 +414,7 @@ ok: inet_bind_hash(sk, tb, port); if (sk_unhashed(sk)) { inet_sk(sk)-sport = htons(port); - __inet_hash_nolisten(hinfo, sk); + hash(hinfo, sk); } spin_unlock(head-lock); @@ -432,17 +431,28 @@ ok: tb = inet_csk(sk)-icsk_bind_hash; spin_lock_bh(head-lock); if (sk_head(tb-owners) == sk !sk-sk_bind_node.next) { - __inet_hash_nolisten(hinfo, sk); + hash(hinfo, sk); spin_unlock_bh(head-lock); return 0; } else { spin_unlock(head-lock); /* No definite answer... Walk to established hash table */ - ret = __inet_check_established(death_row, sk, snum, NULL); + ret = check_established(death_row, sk, snum, NULL); out: local_bh_enable(); return ret; } } +EXPORT_SYMBOL_GPL(__inet_hash_connect); + +/* + * Bind a port for a connect operation and hash it. + */ +int inet_hash_connect(struct inet_timewait_death_row *death_row, + struct sock *sk) +{ + return __inet_hash_connect(death_row, sk, + __inet_check_established, __inet_hash_nolisten); +} EXPORT_SYMBOL_GPL(inet_hash_connect); diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c index 06b01be..ece6d0e 100644 --- a/net/ipv6/inet6_hashtables.c +++ b/net/ipv6/inet6_hashtables.c @@
NET: AX88796 use dev_dbg() instead of printk()
Change to using dev_dbg() and the other dev_xxx() macros instead of printk, and update to use the print_mac() helper. Signed-off-by: Ben Dooks [EMAIL PROTECTED] Index: linux-2.6.24-quilt1/drivers/net/ax88796.c === --- linux-2.6.24-quilt1.orig/drivers/net/ax88796.c +++ linux-2.6.24-quilt1/drivers/net/ax88796.c @@ -137,11 +137,12 @@ static int ax_initial_check(struct net_d static void ax_reset_8390(struct net_device *dev) { struct ei_device *ei_local = netdev_priv(dev); + struct ax_device *ax = to_ax_dev(dev); unsigned long reset_start_time = jiffies; void __iomem *addr = (void __iomem *)dev-base_addr; if (ei_debug 1) - printk(KERN_DEBUG resetting the 8390 t=%ld..., jiffies); + dev_dbg(ax-dev-dev, resetting the 8390 t=%ld\n, jiffies); ei_outb(ei_inb(addr + NE_RESET), addr + NE_RESET); @@ -151,7 +152,7 @@ static void ax_reset_8390(struct net_dev /* This check _should_not_ be necessary, omit eventually. */ while ((ei_inb(addr + EN0_ISR) ENISR_RESET) == 0) { if (jiffies - reset_start_time 2*HZ/100) { - printk(KERN_WARNING %s: %s did not complete.\n, + dev_warn(ax-dev-dev, %s: %s did not complete.\n, __FUNCTION__, dev-name); break; } @@ -165,13 +166,15 @@ static void ax_get_8390_hdr(struct net_d int ring_page) { struct ei_device *ei_local = netdev_priv(dev); + struct ax_device *ax = to_ax_dev(dev); void __iomem *nic_base = ei_local-mem; /* This *shouldn't* happen. 
If it does, it's the last thing you'll see */ if (ei_status.dmaing) { - printk(KERN_EMERG %s: DMAing conflict in %s [DMAstat:%d][irqlock:%d].\n, + dev_err(ax-dev-dev, %s: DMAing conflict in %s + [DMAstat:%d][irqlock:%d].\n, dev-name, __FUNCTION__, - ei_status.dmaing, ei_status.irqlock); + ei_status.dmaing, ei_status.irqlock); return; } @@ -204,13 +207,16 @@ static void ax_block_input(struct net_de struct sk_buff *skb, int ring_offset) { struct ei_device *ei_local = netdev_priv(dev); + struct ax_device *ax = to_ax_dev(dev); void __iomem *nic_base = ei_local-mem; char *buf = skb-data; if (ei_status.dmaing) { - printk(KERN_EMERG %s: DMAing conflict in ax_block_input + dev_err(ax-dev-dev, + %s: DMAing conflict in %s [DMAstat:%d][irqlock:%d].\n, - dev-name, ei_status.dmaing, ei_status.irqlock); + dev-name, __FUNCTION__, + ei_status.dmaing, ei_status.irqlock); return; } @@ -239,6 +245,7 @@ static void ax_block_output(struct net_d const unsigned char *buf, const int start_page) { struct ei_device *ei_local = netdev_priv(dev); + struct ax_device *ax = to_ax_dev(dev); void __iomem *nic_base = ei_local-mem; unsigned long dma_start; @@ -251,7 +258,7 @@ static void ax_block_output(struct net_d /* This *shouldn't* happen. If it does, it's the last thing you'll see */ if (ei_status.dmaing) { - printk(KERN_EMERG %s: DMAing conflict in %s. + dev_err(ax-dev-dev, %s: DMAing conflict in %s. 
[DMAstat:%d][irqlock:%d]\n, dev-name, __FUNCTION__, ei_status.dmaing, ei_status.irqlock); @@ -281,7 +288,8 @@ static void ax_block_output(struct net_d while ((ei_inb(nic_base + EN0_ISR) ENISR_RDC) == 0) { if (jiffies - dma_start 2*HZ/100) { /* 20ms */ - printk(KERN_WARNING %s: timeout waiting for Tx RDC.\n, dev-name); + dev_warn(ax-dev-dev, +%s: timeout waiting for Tx RDC.\n, dev-name); ax_reset_8390(dev); ax_NS8390_init(dev,1); break; @@ -424,10 +432,11 @@ static void ax_phy_write(struct net_device *dev, int phy_addr, int reg, int value) { struct ei_device *ei = (struct ei_device *) netdev_priv(dev); + struct ax_device *ax = to_ax_dev(dev); unsigned long flags; - printk(KERN_DEBUG %s: %p, %04x, %04x %04x\n, - __FUNCTION__, dev, phy_addr, reg, value); + dev_dbg(ax-dev-dev, %s: %p, %04x, %04x %04x\n, + __FUNCTION__, dev, phy_addr, reg, value); spin_lock_irqsave(ei-page_lock, flags); @@ -750,14 +759,11 @@ static int ax_init_dev(struct net_device ax_NS8390_init(dev, 0); if (first_init) { -
[PATCH 0/6] preparations to enable netdevice notifiers inside a namespace (resend)
Here are some preparations and cleanups to enable network device/inet address notifiers inside a namespace. This set of patches was originally sent last Friday. One cleanup patch from the original series is dropped as wrong, thanks to Daniel Lezcano.
Re: ipcomp regression in 2.6.24
Applied and tested on 2.6.24: ipcomp is working now. As always, thanks a lot, Herbert, for fixing this.

Thank you too; I applied the 2 patches and it works.

Daniel
[PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.
net->ipv4.fib_table_hash is not freed when fib4_rules_init fails. The problem has been introduced by the following commit.

commit c8050bf6d84785a7edd2e81591e8f833231477e8
Author: Denis V. Lunev [EMAIL PROTECTED]
Date:   Thu Jan 10 03:28:24 2008 -0800

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/fib_frontend.c | 10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index d282618..d0507f4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -975,6 +975,7 @@ static struct notifier_block fib_netdev_notifier = {

 static int __net_init ip_fib_net_init(struct net *net)
 {
+	int err;
 	unsigned int i;

 	net->ipv4.fib_table_hash = kzalloc(
@@ -985,7 +986,14 @@ static int __net_init ip_fib_net_init(struct net *net)
 	for (i = 0; i < FIB_TABLE_HASHSZ; i++)
 		INIT_HLIST_HEAD(&net->ipv4.fib_table_hash[i]);

-	return fib4_rules_init(net);
+	err = fib4_rules_init(net);
+	if (err < 0)
+		goto fail;
+	return 0;
+
+fail:
+	kfree(net->ipv4.fib_table_hash);
+	return err;
 }

 static void __net_exit ip_fib_net_exit(struct net *net)
--
1.5.3.rc5
[PATCH 2/6] [IPV4]: Small style cleanup of the error path in rtm_to_ifaddr.
Remove error code assignment inside brackets on failure. The code looks better if the error is assigned before the condition check. Also, the compiler treats this better.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/devinet.c | 21 -
 1 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 21f71bf..9da4c68 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -492,39 +492,34 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh)
 	struct ifaddrmsg *ifm;
 	struct net_device *dev;
 	struct in_device *in_dev;
-	int err = -EINVAL;
+	int err;

 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFA_MAX, ifa_ipv4_policy);
 	if (err < 0)
 		goto errout;

 	ifm = nlmsg_data(nlh);
-	if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL) {
-		err = -EINVAL;
+	err = -EINVAL;
+	if (ifm->ifa_prefixlen > 32 || tb[IFA_LOCAL] == NULL)
 		goto errout;
-	}

 	dev = __dev_get_by_index(&init_net, ifm->ifa_index);
-	if (dev == NULL) {
-		err = -ENODEV;
+	err = -ENODEV;
+	if (dev == NULL)
 		goto errout;
-	}

 	in_dev = __in_dev_get_rtnl(dev);
-	if (in_dev == NULL) {
-		err = -ENOBUFS;
+	err = -ENOBUFS;
+	if (in_dev == NULL)
 		goto errout;
-	}

 	ifa = inet_alloc_ifa();
-	if (ifa == NULL) {
+	if (ifa == NULL)
 		/*
 		 * A potential indev allocation can be left alive, it stays
 		 * assigned to its device and is destroy with it.
 		 */
-		err = -ENOBUFS;
 		goto errout;
-	}

 	ipv4_devconf_setall(in_dev);
 	in_dev_hold(in_dev);
--
1.5.3.rc5
[PATCH 5/6] [NETNS]: Add a namespace mark to fib_info.
This is required to make fib_info lookups namespace aware. In the other case initial namespace devices are marked as dead in the local routing table during other namespace stop. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/ip_fib.h |1 + net/ipv4/fib_semantics.c |8 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 1b2f008..cb0df37 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -69,6 +69,7 @@ struct fib_nh { struct fib_info { struct hlist_node fib_hash; struct hlist_node fib_lhash; + struct net *fib_net; int fib_treeref; atomic_tfib_clntref; int fib_dead; diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 5beff2e..97cc494 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -687,6 +687,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) struct fib_info *fi = NULL; struct fib_info *ofi; int nhs = 1; + struct net *net = cfg-fc_nlinfo.nl_net; /* Fast check to catch the most weird cases */ if (fib_props[cfg-fc_type].scope cfg-fc_scope) @@ -727,6 +728,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) goto failure; fib_info_cnt++; + fi-fib_net = net; fi-fib_protocol = cfg-fc_protocol; fi-fib_flags = cfg-fc_flags; fi-fib_priority = cfg-fc_priority; @@ -798,8 +800,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) if (nhs != 1 || nh-nh_gw) goto err_inval; nh-nh_scope = RT_SCOPE_NOWHERE; - nh-nh_dev = dev_get_by_index(cfg-fc_nlinfo.nl_net, - fi-fib_nh-nh_oif); + nh-nh_dev = dev_get_by_index(net, fi-fib_nh-nh_oif); err = -ENODEV; if (nh-nh_dev == NULL) goto failure; @@ -813,8 +814,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg) if (fi-fib_prefsrc) { if (cfg-fc_type != RTN_LOCAL || !cfg-fc_dst || fi-fib_prefsrc != cfg-fc_dst) - if (inet_addr_type(cfg-fc_nlinfo.nl_net, - fi-fib_prefsrc) != RTN_LOCAL) + if (inet_addr_type(net, fi-fib_prefsrc) != RTN_LOCAL) goto err_inval; } -- 1.5.3.rc5 -- To 
unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] [NETNS]: Process interface address manipulation routines in the namespace.
The namespace is available when required except rtm_to_ifaddr. Add namespace argument to it. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- net/ipv4/devinet.c | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index e55c85e..6a6e92e 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -485,7 +485,7 @@ errout: return err; } -static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) +static struct in_ifaddr *rtm_to_ifaddr(struct net *net, struct nlmsghdr *nlh) { struct nlattr *tb[IFA_MAX+1]; struct in_ifaddr *ifa; @@ -503,7 +503,7 @@ static struct in_ifaddr *rtm_to_ifaddr(struct nlmsghdr *nlh) if (ifm-ifa_prefixlen 32 || tb[IFA_LOCAL] == NULL) goto errout; - dev = __dev_get_by_index(init_net, ifm-ifa_index); + dev = __dev_get_by_index(net, ifm-ifa_index); err = -ENODEV; if (dev == NULL) goto errout; @@ -571,7 +571,7 @@ static int inet_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg if (net != init_net) return -EINVAL; - ifa = rtm_to_ifaddr(nlh); + ifa = rtm_to_ifaddr(net, nlh); if (IS_ERR(ifa)) return PTR_ERR(ifa); @@ -1189,7 +1189,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) s_ip_idx = ip_idx = cb-args[1]; idx = 0; - for_each_netdev(init_net, dev) { + for_each_netdev(net, dev) { if (idx s_idx) goto cont; if (idx s_idx) @@ -1223,7 +1223,9 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, struct nlmsghdr *nlh, struct sk_buff *skb; u32 seq = nlh ? 
nlh-nlmsg_seq : 0; int err = -ENOBUFS; + struct net *net; + net = ifa-ifa_dev-dev-nd_net; skb = nlmsg_new(inet_nlmsg_size(), GFP_KERNEL); if (skb == NULL) goto errout; @@ -1235,10 +1237,10 @@ static void rtmsg_ifa(int event, struct in_ifaddr* ifa, struct nlmsghdr *nlh, kfree_skb(skb); goto errout; } - err = rtnl_notify(skb, init_net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL); + err = rtnl_notify(skb, net, pid, RTNLGRP_IPV4_IFADDR, nlh, GFP_KERNEL); errout: if (err 0) - rtnl_set_sk_err(init_net, RTNLGRP_IPV4_IFADDR, err); + rtnl_set_sk_err(net, RTNLGRP_IPV4_IFADDR, err); } #ifdef CONFIG_SYSCTL -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] [IPV4]: fib_sync_down rework.
fib_sync_down can be called with an address and with a device. In reality it is called either with address OR with a device. The codepath inside is completely different, so lets separate it into two calls for these two cases. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/ip_fib.h |3 +- net/ipv4/fib_frontend.c |4 +- net/ipv4/fib_semantics.c | 104 +++-- 3 files changed, 57 insertions(+), 54 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 9daa60b..1b2f008 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -218,7 +218,8 @@ extern void fib_select_default(struct net *net, const struct flowi *flp, /* Exported by fib_semantics.c */ extern int ip_fib_check_default(__be32 gw, struct net_device *dev); -extern int fib_sync_down(__be32 local, struct net_device *dev, int force); +extern int fib_sync_down_dev(struct net_device *dev, int force); +extern int fib_sync_down_addr(__be32 local); extern int fib_sync_up(struct net_device *dev); extern __be32 __fib_res_prefsrc(struct fib_result *res); extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d0507f4..d69ffa2 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) First of all, we scan fib_info list searching for stray nexthop entries, then ignite fib_flush. 
*/ - if (fib_sync_down(ifa-ifa_local, NULL, 0)) + if (fib_sync_down_addr(ifa-ifa_local)) fib_flush(dev-nd_net); } } @@ -898,7 +898,7 @@ static void nl_fib_lookup_exit(struct net *net) static void fib_disable_ip(struct net_device *dev, int force) { - if (fib_sync_down(0, dev, force)) + if (fib_sync_down_dev(dev, force)) fib_flush(dev-nd_net); rt_cache_flush(0); arp_ifdown(dev); diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index c791286..5beff2e 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -1031,70 +1031,72 @@ nla_put_failure: referring to it. - device went down - we must shutdown all nexthops going via it. */ - -int fib_sync_down(__be32 local, struct net_device *dev, int force) +int fib_sync_down_addr(__be32 local) { int ret = 0; - int scope = RT_SCOPE_NOWHERE; - - if (force) - scope = -1; + unsigned int hash = fib_laddr_hashfn(local); + struct hlist_head *head = fib_info_laddrhash[hash]; + struct hlist_node *node; + struct fib_info *fi; - if (local fib_info_laddrhash) { - unsigned int hash = fib_laddr_hashfn(local); - struct hlist_head *head = fib_info_laddrhash[hash]; - struct hlist_node *node; - struct fib_info *fi; + if (fib_info_laddrhash == NULL || local == 0) + return 0; - hlist_for_each_entry(fi, node, head, fib_lhash) { - if (fi-fib_prefsrc == local) { - fi-fib_flags |= RTNH_F_DEAD; - ret++; - } + hlist_for_each_entry(fi, node, head, fib_lhash) { + if (fi-fib_prefsrc == local) { + fi-fib_flags |= RTNH_F_DEAD; + ret++; } } + return ret; +} - if (dev) { - struct fib_info *prev_fi = NULL; - unsigned int hash = fib_devindex_hashfn(dev-ifindex); - struct hlist_head *head = fib_info_devhash[hash]; - struct hlist_node *node; - struct fib_nh *nh; +int fib_sync_down_dev(struct net_device *dev, int force) +{ + int ret = 0; + int scope = RT_SCOPE_NOWHERE; + struct fib_info *prev_fi = NULL; + unsigned int hash = fib_devindex_hashfn(dev-ifindex); + struct hlist_head *head = fib_info_devhash[hash]; + struct hlist_node *node; 
+ struct fib_nh *nh; - hlist_for_each_entry(nh, node, head, nh_hash) { - struct fib_info *fi = nh-nh_parent; - int dead; + if (force) + scope = -1; - BUG_ON(!fi-fib_nhs); - if (nh-nh_dev != dev || fi == prev_fi) - continue; - prev_fi = fi; - dead = 0; - change_nexthops(fi) { - if (nh-nh_flagsRTNH_F_DEAD) - dead++; - else if (nh-nh_dev == dev -nh-nh_scope != scope) { -
[PATCH 6/6] [NETNS]: Lookup in FIB semantic hashes taking into account the namespace.
The namespace is not available in the fib_sync_down_addr, add it as a parameter. Looking up a device by the pointer to it is OK. Looking up using a result from fib_trie/fib_hash table lookup is also safe. No need to fix that at all. So, just fix lookup by address and insertion to the hash table path. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- include/net/ip_fib.h |2 +- net/ipv4/fib_frontend.c |2 +- net/ipv4/fib_semantics.c |6 +- 3 files changed, 7 insertions(+), 3 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index cb0df37..90d1175 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -220,7 +220,7 @@ extern void fib_select_default(struct net *net, const struct flowi *flp, /* Exported by fib_semantics.c */ extern int ip_fib_check_default(__be32 gw, struct net_device *dev); extern int fib_sync_down_dev(struct net_device *dev, int force); -extern int fib_sync_down_addr(__be32 local); +extern int fib_sync_down_addr(struct net *net, __be32 local); extern int fib_sync_up(struct net_device *dev); extern __be32 __fib_res_prefsrc(struct fib_result *res); extern void fib_select_multipath(const struct flowi *flp, struct fib_result *res); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index d69ffa2..86ff271 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -808,7 +808,7 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) First of all, we scan fib_info list searching for stray nexthop entries, then ignite fib_flush. 
*/ - if (fib_sync_down_addr(ifa-ifa_local)) + if (fib_sync_down_addr(dev-nd_net, ifa-ifa_local)) fib_flush(dev-nd_net); } } diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index 97cc494..a13c847 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -229,6 +229,8 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi) head = fib_info_hash[hash]; hlist_for_each_entry(fi, node, head, fib_hash) { + if (fi-fib_net != nfi-fib_net) + continue; if (fi-fib_nhs != nfi-fib_nhs) continue; if (nfi-fib_protocol == fi-fib_protocol @@ -1031,7 +1033,7 @@ nla_put_failure: referring to it. - device went down - we must shutdown all nexthops going via it. */ -int fib_sync_down_addr(__be32 local) +int fib_sync_down_addr(struct net *net, __be32 local) { int ret = 0; unsigned int hash = fib_laddr_hashfn(local); @@ -1043,6 +1045,8 @@ int fib_sync_down_addr(__be32 local) return 0; hlist_for_each_entry(fi, node, head, fib_lhash) { + if (fi-fib_net != net) + continue; if (fi-fib_prefsrc == local) { fi-fib_flags |= RTNH_F_DEAD; ret++; -- 1.5.3.rc5 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] macb: Fix section mismatch and shrink runtime footprint
macb devices are only found integrated on SoCs, so they can't be hotplugged. Thus, the probe() and exit() functions can be __init and __exit, respectively. By using platform_driver_probe() instead of platform_driver_register(), there won't be any references to the discarded probe() function after the driver has loaded. This also fixes a section mismatch due to macb_probe(), defined as __devinit, calling macb_get_hwaddr, defined as __init. Signed-off-by: Haavard Skinnemoen [EMAIL PROTECTED] --- drivers/net/macb.c |9 - 1 files changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/net/macb.c b/drivers/net/macb.c index e10528e..81bf005 100644 --- a/drivers/net/macb.c +++ b/drivers/net/macb.c @@ -1084,7 +1084,7 @@ static int macb_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) return phy_mii_ioctl(phydev, if_mii(rq), cmd); } -static int __devinit macb_probe(struct platform_device *pdev) +static int __init macb_probe(struct platform_device *pdev) { struct eth_platform_data *pdata; struct resource *regs; @@ -1248,7 +1248,7 @@ err_out: return err; } -static int __devexit macb_remove(struct platform_device *pdev) +static int __exit macb_remove(struct platform_device *pdev) { struct net_device *dev; struct macb *bp; @@ -1276,8 +1276,7 @@ static int __devexit macb_remove(struct platform_device *pdev) } static struct platform_driver macb_driver = { - .probe = macb_probe, - .remove = __devexit_p(macb_remove), + .remove = __exit_p(macb_remove), .driver = { .name = macb, }, @@ -1285,7 +1284,7 @@ static struct platform_driver macb_driver = { static int __init macb_init(void) { - return platform_driver_register(macb_driver); + return platform_driver_probe(macb_driver, macb_probe); } static void __exit macb_exit(void) -- 1.5.3.8 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e1000 full-duplex TCP performance well below wire speed
Good morning (my TZ). I'll try to answer all questions; however, if I miss something big, please point my nose to it again.

Rick Jones wrote:
> As asked in the LKML thread, please post the exact netperf command used
> to start the client/server, whether or not you're using irqbalanced
> (aka irqbalance), and what cat /proc/interrupts looks like (you ARE
> using MSI, right?)

netperf was used without any special tuning parameters. Usually we start two processes on two hosts which start (almost) simultaneously, last for 20-60 seconds, and simply use UDP_STREAM (works well) and TCP_STREAM, i.e.:

on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20
on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20

192.168.0.20[23] here is on eth0, which cannot do jumbo frames; thus we use the .2. part for eth1 for a range of MTUs. The server is started on both nodes with the start-stop-daemon and no special parameters I'm aware of. /proc/interrupts shows me PCI-MSI-edge; thus, I think YES.

> In particular, it would be good to know if you are doing two concurrent
> streams, or if you are using the "burst mode TCP_RR with large
> request/response sizes" method which then is only using one connection.

As outlined above: two concurrent streams right now. If you think TCP_RR should be better, I'm happy to rerun some tests.

More in other emails. I'll wade through them slowly.

Carsten
[PATCH] Disable TSO for non standard qdiscs
TSO interacts badly with many queueing disciplines because they rely on reordering packets from different streams and the large TSO packets can make this difficult. This patch disables TSO for sockets that send over devices with non standard queueing disciplines. That's anything but noop or pfifo_fast and pfifo right now. Longer term other queueing disciplines could be checked if they are also ok with TSO. If yes they can set the TCQ_F_GSO_OK flag too. It is still enabled for the standard pfifo_fast because that will never reorder packets with the same type-of-service. This means 99+% of all users will still be able to use TSO just fine. The status is only set up at socket creation so a shifted route will not reenable TSO on a existing socket. I don't think that's a problem though. Signed-off-by: Andi Kleen [EMAIL PROTECTED] --- include/net/sch_generic.h |1 + net/core/sock.c |3 +++ net/sched/sch_generic.c |5 +++-- 3 files changed, 7 insertions(+), 2 deletions(-) Index: linux/include/net/sch_generic.h === --- linux.orig/include/net/sch_generic.h +++ linux/include/net/sch_generic.h @@ -31,6 +31,7 @@ struct Qdisc #define TCQ_F_BUILTIN 1 #define TCQ_F_THROTTLED2 #define TCQ_F_INGRESS 4 +#define TCQ_F_GSO_OK 8 int padded; struct Qdisc_ops*ops; u32 handle; Index: linux/net/sched/sch_generic.c === --- linux.orig/net/sched/sch_generic.c +++ linux/net/sched/sch_generic.c @@ -307,7 +307,7 @@ struct Qdisc_ops noop_qdisc_ops __read_m struct Qdisc noop_qdisc = { .enqueue= noop_enqueue, .dequeue= noop_dequeue, - .flags = TCQ_F_BUILTIN, + .flags = TCQ_F_BUILTIN | TCQ_F_GSO_OK, .ops= noop_qdisc_ops, .list = LIST_HEAD_INIT(noop_qdisc.list), }; @@ -325,7 +325,7 @@ static struct Qdisc_ops noqueue_qdisc_op static struct Qdisc noqueue_qdisc = { .enqueue= NULL, .dequeue= noop_dequeue, - .flags = TCQ_F_BUILTIN, + .flags = TCQ_F_BUILTIN | TCQ_F_GSO_OK, .ops= noqueue_qdisc_ops, .list = LIST_HEAD_INIT(noqueue_qdisc.list), }; @@ -538,6 +538,7 @@ void dev_activate(struct net_device *dev 
printk(KERN_INFO %s: activation failed\n, dev-name); return; } + qdisc-flags |= TCQ_F_GSO_OK; list_add_tail(qdisc-list, dev-qdisc_list); } else { qdisc = noqueue_qdisc; Index: linux/net/core/sock.c === --- linux.orig/net/core/sock.c +++ linux/net/core/sock.c @@ -112,6 +112,7 @@ #include linux/tcp.h #include linux/init.h #include linux/highmem.h +#include net/sch_generic.h #include asm/uaccess.h #include asm/system.h @@ -1062,6 +1063,8 @@ void sk_setup_caps(struct sock *sk, stru { __sk_dst_set(sk, dst); sk-sk_route_caps = dst-dev-features; + if (!(dst-dev-qdisc-flags TCQ_F_GSO_OK)) + sk-sk_route_caps = ~NETIF_F_GSO_MASK; if (sk-sk_route_caps NETIF_F_GSO) sk-sk_route_caps |= NETIF_F_GSO_SOFTWARE; if (sk_can_gso(sk)) { Index: linux/net/sched/sch_fifo.c === --- linux.orig/net/sched/sch_fifo.c +++ linux/net/sched/sch_fifo.c @@ -62,6 +62,7 @@ static int fifo_init(struct Qdisc *sch, q-limit = ctl-limit; } + sch-flags |= TCQ_F_GSO_OK; return 0; } -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hard hang through qdisc II
On Thursday 31 January 2008 13:21:00 Andi Kleen wrote:
> I just managed to hang a 2.6.24 (+ some non network patches) kernel
> with the following (non sensical) command:
>
>   tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
>
> No oops or anything, it just hangs. While I understand root can do bad
> things, hanging like this seems a little extreme.
>
> -Andi

Correction: the kernel was actually a git Linus kernel with David's recent merge included. I found it's pretty easy to hang the kernel with various tbf parameters.

-Andi
[PATCH 6/6][NETNS]: Udp sockets per-net lookup.
Add the net parameter to udp_get_port family of calls and udp_lookup one and use it to filter sockets.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 net/ipv4/udp.c | 25 ++---
 net/ipv6/udp.c | 10 ++
 2 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2fb8d73..7ea1b67 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -130,14 +130,14 @@ EXPORT_SYMBOL(sysctl_udp_wmem_min);
 atomic_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
-static inline int __udp_lib_lport_inuse(__u16 num,
+static inline int __udp_lib_lport_inuse(struct net *net, __u16 num,
 					const struct hlist_head udptable[])
 {
 	struct sock *sk;
 	struct hlist_node *node;
 
 	sk_for_each(sk, node, &udptable[num & (UDP_HTABLE_SIZE - 1)])
-		if (sk->sk_hash == num)
+		if (sk->sk_net == net && sk->sk_hash == num)
 			return 1;
 	return 0;
 }
@@ -159,6 +159,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 	struct hlist_head *head;
 	struct sock *sk2;
 	int    error = 1;
+	struct net *net = sk->sk_net;
 
 	write_lock_bh(&udp_hash_lock);
@@ -198,7 +199,7 @@ int __udp_lib_get_port(struct sock *sk, unsigned short snum,
 		/* 2nd pass: find hole in shortest hash chain */
 		rover = best;
 		for (i = 0; i < (1 << 16) / UDP_HTABLE_SIZE; i++) {
-			if (! __udp_lib_lport_inuse(rover, udptable))
+			if (! __udp_lib_lport_inuse(net, rover, udptable))
 				goto gotit;
 			rover += UDP_HTABLE_SIZE;
 			if (rover > high)
@@ -218,6 +219,7 @@ gotit:
 		sk_for_each(sk2, node, head)
 			if (sk2->sk_hash == snum                             &&
 			    sk2 != sk                                        &&
+			    sk2->sk_net == net                               &&
 			    (!sk2->sk_reuse        || !sk->sk_reuse)         &&
 			    (!sk2->sk_bound_dev_if || !sk->sk_bound_dev_if ||
 			     sk2->sk_bound_dev_if == sk->sk_bound_dev_if)
@@ -261,9 +263,9 @@ static inline int udp_v4_get_port(struct sock *sk, unsigned short snum)
 /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
  * harder than this. -DaveM
  */
-static struct sock *__udp4_lib_lookup(__be32 saddr, __be16 sport,
-				      __be32 daddr, __be16 dport,
-				      int dif, struct hlist_head udptable[])
+static struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
+		__be16 sport, __be32 daddr, __be16 dport,
+		int dif, struct hlist_head udptable[])
 {
 	struct sock *sk, *result = NULL;
 	struct hlist_node *node;
@@ -274,7 +276,8 @@ static struct sock *__udp4_lib_lookup(__be32 saddr, __be16 sport,
 	sk_for_each(sk, node, &udptable[hnum & (UDP_HTABLE_SIZE - 1)]) {
 		struct inet_sock *inet = inet_sk(sk);
 
-		if (sk->sk_hash == hnum && !ipv6_only_sock(sk)) {
+		if (sk->sk_net == net && sk->sk_hash == hnum &&
+				!ipv6_only_sock(sk)) {
 			int score = (sk->sk_family == PF_INET ? 1 : 0);
 			if (inet->rcv_saddr) {
 				if (inet->rcv_saddr != daddr)
@@ -361,8 +364,8 @@ void __udp4_lib_err(struct sk_buff *skb, u32 info, struct hlist_head udptable[])
 	int harderr;
 	int err;
 
-	sk = __udp4_lib_lookup(iph->daddr, uh->dest, iph->saddr, uh->source,
-			       skb->dev->ifindex, udptable);
+	sk = __udp4_lib_lookup(skb->dev->nd_net, iph->daddr, uh->dest,
+			iph->saddr, uh->source, skb->dev->ifindex, udptable);
 	if (sk == NULL) {
 		ICMP_INC_STATS_BH(ICMP_MIB_INERRORS);
 		return;	/* No socket for error */
@@ -1185,8 +1188,8 @@ int __udp4_lib_rcv(struct sk_buff *skb, struct hlist_head udptable[],
 	if (rt->rt_flags & (RTCF_BROADCAST|RTCF_MULTICAST))
 		return __udp4_lib_mcast_deliver(skb, uh, saddr, daddr, udptable);
 
-	sk = __udp4_lib_lookup(saddr, uh->source, daddr, uh->dest,
-			       inet_iif(skb), udptable);
+	sk = __udp4_lib_lookup(skb->dev->nd_net, saddr, uh->source, daddr,
+			uh->dest, inet_iif(skb), udptable);
 
 	if (sk != NULL) {
 		int ret = 0;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index bd4b9df..53739de 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -56,7 +56,8 @@ static inline int udp_v6_get_port(struct sock *sk, unsigned short snum)
 	return udp_get_port(sk, snum, ipv6_rcv_saddr_equal);
 }
 
-static struct sock *__udp6_lib_lookup(struct in6_addr
[PATCH 4/6][NETNS]: Tcp-v4 sockets per-net lookup.
Add a net argument to inet_lookup and propagate it further into lookup calls. Plus tune the __inet_check_established. The dccp and inet_diag, which use those lookup functions, pass init_net into them.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/net/inet_hashtables.h | 48 +++--
 net/dccp/ipv4.c | 6 ++--
 net/ipv4/inet_diag.c | 2 +-
 net/ipv4/inet_hashtables.c | 29
 net/ipv4/tcp_ipv4.c | 15 ++--
 5 files changed, 58 insertions(+), 42 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 55532b9..c23c4ed 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -302,15 +302,17 @@ out:
 		wake_up(&hashinfo->lhash_wait);
 }
 
-extern struct sock *__inet_lookup_listener(struct inet_hashinfo *hashinfo,
+extern struct sock *__inet_lookup_listener(struct net *net,
+					   struct inet_hashinfo *hashinfo,
 					   const __be32 daddr,
 					   const unsigned short hnum,
 					   const int dif);
 
-static inline struct sock *inet_lookup_listener(struct inet_hashinfo *hashinfo,
-		__be32 daddr, __be16 dport, int dif)
+static inline struct sock *inet_lookup_listener(struct net *net,
+		struct inet_hashinfo *hashinfo,
+		__be32 daddr, __be16 dport, int dif)
 {
-	return __inet_lookup_listener(hashinfo, daddr, ntohs(dport), dif);
+	return __inet_lookup_listener(net, hashinfo, daddr, ntohs(dport), dif);
 }
 
 /* Socket demux engine toys. */
@@ -344,26 +346,26 @@ typedef __u64 __bitwise __addrpair;
 			   (((__force __u64)(__be32)(__daddr)) << 32) | \
 			   ((__force __u64)(__be32)(__saddr)));
 #endif /* __BIG_ENDIAN */
-#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash))					&&	\
+#define INET_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 ((*((__addrpair *)&(inet_sk(__sk)->daddr))) == (__cookie))	&&	\
 	 ((*((__portpair *)&(inet_sk(__sk)->dport))) == (__ports))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-#define INET_TW_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash))					&&	\
+#define INET_TW_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 ((*((__addrpair *)&(inet_twsk(__sk)->tw_daddr))) == (__cookie)) &&	\
 	 ((*((__portpair *)&(inet_twsk(__sk)->tw_dport))) == (__ports))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr)
-#define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))					&&	\
+#define INET_MATCH(__sk, __net, __hash, __cookie, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 (inet_sk(__sk)->daddr		== (__saddr))			&&	\
 	 (inet_sk(__sk)->rcv_saddr	== (__daddr))			&&	\
 	 ((*((__portpair *)&(inet_sk(__sk)->dport))) == (__ports))	&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
-#define INET_TW_MATCH(__sk, __hash,__cookie, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))					&&	\
+#define INET_TW_MATCH(__sk, __net, __hash,__cookie, __saddr, __daddr, __ports, __dif)	\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 (inet_twsk(__sk)->tw_daddr	== (__saddr))			&&	\
 	 (inet_twsk(__sk)->tw_rcv_saddr	== (__daddr))			&&	\
 	 ((*((__portpair *)&(inet_twsk(__sk)->tw_dport))) == (__ports))	&&	\
@@ -376,32 +378,36 @@ typedef __u64 __bitwise __addrpair;
  *
  * Local BH must be disabled here.
  */
-extern struct sock * __inet_lookup_established(struct inet_hashinfo *hashinfo,
+extern struct sock * __inet_lookup_established(struct net *net,
+	struct inet_hashinfo *hashinfo,
 	const __be32 saddr, const __be16 sport,
 	const __be32 daddr, const u16 hnum, const int dif);
 
 static inline struct sock *
-	inet_lookup_established(struct inet_hashinfo *hashinfo,
+	inet_lookup_established(struct net *net, struct inet_hashinfo *hashinfo,
				const
[PATCH 3/6][NETNS]: Make bind buckets live in net namespaces.
This tags the inet_bind_bucket struct with a net pointer, initializes it during creation and filters on it during lookup. A better hashfn that takes the net into account is to be done in the future, but currently all bind buckets with the same port will be in one hash chain.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/net/inet_hashtables.h | 2 ++
 net/ipv4/inet_connection_sock.c | 8 +---
 net/ipv4/inet_hashtables.c | 8 ++--
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index a34a8f2..55532b9 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -74,6 +74,7 @@ struct inet_ehash_bucket {
  * ports are created in O(1) time?  I thought so. ;-) -DaveM
  */
 struct inet_bind_bucket {
+	struct net		*ib_net;
 	unsigned short		port;
 	signed short		fastreuse;
 	struct hlist_node	node;
@@ -194,6 +195,7 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 extern struct inet_bind_bucket *
 		    inet_bind_bucket_create(struct kmem_cache *cachep,
+					    struct net *net,
 					    struct inet_bind_hashbucket *head,
 					    const unsigned short snum);
 extern void inet_bind_bucket_destroy(struct kmem_cache *cachep,
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 7801cce..de5a41d 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -87,6 +87,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
 	struct hlist_node *node;
 	struct inet_bind_bucket *tb;
 	int ret;
+	struct net *net = sk->sk_net;
 
 	local_bh_disable();
 	if (!snum) {
@@ -100,7 +101,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
 			head = &hashinfo->bhash[inet_bhashfn(rover, hashinfo->bhash_size)];
 			spin_lock(&head->lock);
 			inet_bind_bucket_for_each(tb, node, &head->chain)
-				if (tb->port == rover)
+				if (tb->ib_net == net && tb->port == rover)
 					goto next;
 			break;
 		next:
@@ -127,7 +128,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
 		head = &hashinfo->bhash[inet_bhashfn(snum, hashinfo->bhash_size)];
 		spin_lock(&head->lock);
 		inet_bind_bucket_for_each(tb, node, &head->chain)
-			if (tb->port == snum)
+			if (tb->ib_net == net && tb->port == snum)
 				goto tb_found;
 	}
 	tb = NULL;
@@ -147,7 +148,8 @@ tb_found:
 	}
 tb_not_found:
 	ret = 1;
-	if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep, head, snum)) == NULL)
+	if (!tb && (tb = inet_bind_bucket_create(hashinfo->bind_bucket_cachep,
+					net, head, snum)) == NULL)
 		goto fail_unlock;
 	if (hlist_empty(&tb->owners)) {
 		if (sk->sk_reuse && sk->sk_state != TCP_LISTEN)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index b93d40f..db1e53a 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -28,12 +28,14 @@
  * The bindhash mutex for snum's hash chain must be held here.
  */
 struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
+						 struct net *net,
						 struct inet_bind_hashbucket *head,
						 const unsigned short snum)
 {
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb != NULL) {
+		tb->ib_net    = net;
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		INIT_HLIST_HEAD(&tb->owners);
@@ -359,6 +361,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	struct inet_bind_hashbucket *head;
 	struct inet_bind_bucket *tb;
 	int ret;
+	struct net *net = sk->sk_net;
 
 	if (!snum) {
 		int i, remaining, low, high, port;
@@ -381,7 +384,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			 * unique enough.
 			 */
			inet_bind_bucket_for_each(tb, node, &head->chain) {
-				if (tb->port == port) {
+				if (tb->ib_net == net && tb->port == port) {
					BUG_TRAP(!hlist_empty(&tb->owners));
					if (tb->fastreuse >= 0)
						goto next_port;
@@ -392,7
[PATCH 5/6][NETNS]: Tcp-v6 sockets per-net lookup.
Add a net argument to inet6_lookup and propagate it further. Actually, this is the tcp-v6 implementation of what was done for tcp-v4 sockets in a previous patch.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---
 include/linux/ipv6.h | 8
 include/net/inet6_hashtables.h | 17 ++---
 net/dccp/ipv6.c | 8
 net/ipv4/inet_diag.c | 2 +-
 net/ipv6/inet6_hashtables.c | 25 ++---
 net/ipv6/tcp_ipv6.c | 19 ++-
 6 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index c347860..4aaefc3 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -457,16 +457,16 @@ static inline struct raw6_sock *raw6_sk(const struct sock *sk)
 #define inet_v6_ipv6only(__sk)		0
 #endif /* defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) */
 
-#define INET6_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)\
-	(((__sk)->sk_hash == (__hash))					&&	\
+#define INET6_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 ((*((__portpair *)&(inet_sk(__sk)->dport))) == (__ports))	&&	\
 	 ((__sk)->sk_family		== AF_INET6)			&&	\
 	 ipv6_addr_equal(&inet6_sk(__sk)->daddr, (__saddr))		&&	\
 	 ipv6_addr_equal(&inet6_sk(__sk)->rcv_saddr, (__daddr))		&&	\
 	 (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif))))
 
-#define INET6_TW_MATCH(__sk, __hash, __saddr, __daddr, __ports, __dif)	\
-	(((__sk)->sk_hash == (__hash))					&&	\
+#define INET6_TW_MATCH(__sk, __net, __hash, __saddr, __daddr, __ports, __dif)	\
+	(((__sk)->sk_hash == (__hash)) && ((__sk)->sk_net == (__net))	&&	\
 	 (*((__portpair *)&(inet_twsk(__sk)->tw_dport)) == (__ports))	&&	\
 	 ((__sk)->sk_family		== PF_INET6)			&&	\
 	 (ipv6_addr_equal(&inet6_twsk(__sk)->tw_v6_daddr, (__saddr)))	&&	\
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 668056b..fdff630 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -57,34 +57,37 @@ extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk);
  *
  * The sockhash lock must be held as a reader here.
  */
-extern struct sock *__inet6_lookup_established(struct inet_hashinfo *hashinfo,
+extern struct sock *__inet6_lookup_established(struct net *net,
+					       struct inet_hashinfo *hashinfo,
					       const struct in6_addr *saddr,
					       const __be16 sport,
					       const struct in6_addr *daddr,
					       const u16 hnum,
					       const int dif);
 
-extern struct sock *inet6_lookup_listener(struct inet_hashinfo *hashinfo,
+extern struct sock *inet6_lookup_listener(struct net *net,
+					  struct inet_hashinfo *hashinfo,
					  const struct in6_addr *daddr,
					  const unsigned short hnum,
					  const int dif);
 
-static inline struct sock *__inet6_lookup(struct inet_hashinfo *hashinfo,
+static inline struct sock *__inet6_lookup(struct net *net,
+					  struct inet_hashinfo *hashinfo,
					  const struct in6_addr *saddr,
					  const __be16 sport,
					  const struct in6_addr *daddr,
					  const u16 hnum,
					  const int dif)
 {
-	struct sock *sk = __inet6_lookup_established(hashinfo, saddr, sport,
-						     daddr, hnum, dif);
+	struct sock *sk = __inet6_lookup_established(net, hashinfo, saddr,
+						sport, daddr, hnum, dif);
	if (sk)
		return sk;
 
-	return inet6_lookup_listener(hashinfo, daddr, hnum, dif);
+	return inet6_lookup_listener(net, hashinfo, daddr, hnum, dif);
 }
 
-extern struct sock *inet6_lookup(struct inet_hashinfo *hashinfo,
+extern struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
				 const struct in6_addr *saddr,
				 const __be16 sport,
				 const struct in6_addr *daddr,
				 const __be16 dport,
				 const int dif);
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index f42b75c..ed0a005 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -101,8 +101,8 @@ static void dccp_v6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
 	int err;
Re: [PATCH 5/6][NETNS]: Tcp-v6 sockets per-net lookup.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:40:16 +0300 Add a net argument to inet6_lookup and propagate it further. Actually, this is tcp-v6 implementation of what was done for tcp-v4 sockets in a previous patch. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied.
Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:41:58 +0300 Add the net parameter to udp_get_port family of calls and udp_lookup one and use it to filter sockets. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied.
Re: [PATCH retry] bluetooth : add conn add/del workqueues to avoid connection fail
On Wed, Jan 30 2008, Dave Young wrote: The bluetooth hci_conn sysfs add/del is executed in the default workqueue. If del_conn is executed after a new add_conn with the same target, add_conn will fail with a warning about a duplicate kobject name. Here add btaddconn and btdelconn workqueues, and flush the btdelconn workqueue in the add_conn function to avoid the issue.

Signed-off-by: Dave Young [EMAIL PROTECTED]
---
diff -upr a/net/bluetooth/hci_sysfs.c b/net/bluetooth/hci_sysfs.c
--- a/net/bluetooth/hci_sysfs.c	2008-01-30 10:14:27.0 +0800
+++ b/net/bluetooth/hci_sysfs.c	2008-01-30 10:14:14.0 +0800
@@ -12,6 +12,8 @@
 #undef  BT_DBG
 #define BT_DBG(D...)
 #endif
 
+static struct workqueue_struct *btaddconn;
+static struct workqueue_struct *btdelconn;
 
 static inline char *typetostr(int type)
 {
@@ -279,6 +281,7 @@ static void add_conn(struct work_struct *work)
 	struct hci_conn *conn = container_of(work, struct hci_conn, work);
 	int i;
 
+	flush_workqueue(btdelconn);
 	if (device_add(&conn->dev) < 0) {
 		BT_ERR("Failed to register connection device");
 		return;
@@ -313,6 +316,7 @@ void hci_conn_add_sysfs(struct hci_conn *conn)
 	INIT_WORK(&conn->work, add_conn);
 
+	queue_work(btaddconn, &conn->work);
 	schedule_work(&conn->work);
 }

So you queue &conn->work on both btaddconn and keventd_wq?

-- Jens Axboe
Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.
In article [EMAIL PROTECTED] (at Thu, 31 Jan 2008 15:41:58 +0300), Pavel Emelyanov [EMAIL PROTECTED] says: Add the net parameter to udp_get_port family of calls and udp_lookup one and use it to filter sockets. I may miss something, but I'm afraid that I have to disagree. Port is identified only by family, address, protocol and port, and should not be split by name space. --yoshfuji
Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.
From: YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] Date: Fri, 01 Feb 2008 00:11:38 +1100 (EST) In article [EMAIL PROTECTED] (at Thu, 31 Jan 2008 15:41:58 +0300), Pavel Emelyanov [EMAIL PROTECTED] says: Add the net parameter to udp_get_port family of calls and udp_lookup one and use it to filter sockets. I may miss something, but I'm afraid that I have to disagree. Port is identified only by family, address, protocol and port, and should not be split by name space. It is like being on a totally different system. Without sockets in namespaces, there is no point. The networking devices are even per-namespace already, so you can even say that each namespace is even physically different.
Re: [PATCH 4/6][NETNS]: Tcp-v4 sockets per-net lookup.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:38:15 +0300 Add a net argument to inet_lookup and propagate it further into lookup calls. Plus tune the __inet_check_established. The dccp and inet_diag, which use those lookup functions, pass init_net into them. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied.
Re: [PATCH 0/6][IPV6]: Introduce the INET6_TW_MATCH macro.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:29:20 +0300 0/6? :-) We have INET_MATCH, INET_TW_MATCH and INET6_MATCH to test sockets and twbuckets for matching, but ipv6 twbuckets are tested manually. Here's the INET6_TW_MATCH to help with it. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied, thanks.
Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu: These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6. Good stuff! Yesterday I was perusing tcp_hash and I think we could have the hashinfo pointer stored perhaps in sk->sk_prot. That way we would be able to kill tcp_hash(), inet_put_port() could receive just sk, etc. What do you think? - Arnaldo
Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:32:09 +0300 These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6:

add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546)
function               old  new  delta
__inet_hash_connect      -  577   +577
arp_ignore             108  113     +5
static.hint              8    4     -4
rt_worker_func         376  372     -4
inet6_hash_connect     584   25   -559
inet_hash_connect      586   25   -561

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied.
Re: [PATCH 3/6][NETNS]: Make bind buckets live in net namespaces.
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:35:39 +0300 This tags the inet_bind_bucket struct with a net pointer, initializes it during creation and filters on it during lookup. A better hashfn that takes the net into account is to be done in the future, but currently all bind buckets with the same port will be in one hash chain. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied.
Re: [PATCH 1/6] [IPV4]: Fix memory leak on error path during FIB initialization.
From: Denis V. Lunev [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 15:00:45 +0300 commit c8050bf6d84785a7edd2e81591e8f833231477e8 Author: Denis V. Lunev [EMAIL PROTECTED] Date: Thu Jan 10 03:28:24 2008 -0800 I am fixing it up for you this time, but please do not reference the commit this way. Say something like: blah blah blah in commit $(SHA1_HASH) (commit head line). The author and date give no real useful information in this context; the important part is giving the reader enough information to find the commit should they wish to gain more information. If they have the commit hash they can usually find the commit, but if that fails they can search the commit messages for the head line text string. I feel like I've had to explain this 10 times in the past week... :-/
Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED] Date: Thu, 31 Jan 2008 11:01:53 -0200 Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu: These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6. Good stuff! Yesterday I was perusing tcp_hash and I think we could have the hashinfo pointer stored perhaps in sk->sk_prot. That way we would be able to kill tcp_hash(), inet_put_port() could receive just sk, etc. What do you think? Sounds good to me.
Re: Null pointer dereference when bringing up bonding device on kernel-2.6.24-2.fc9.i686
Yo! Jay Vosburgh wrote: Benny Amorsen [EMAIL PROTECTED] wrote: https://bugzilla.redhat.com/show_bug.cgi?id=430391 I know what this is, I'll fix it. Do you know when this happened, so we would know which kernel is OK to use (and not have to start trying blindly)? Siim
Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
Arnaldo Carvalho de Melo wrote: Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu: These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6. Good stuff! Yesterday I was perusing tcp_hash and I think we could have the hashinfo pointer stored perhaps in sk->sk_prot. That way we would be able to kill tcp_hash(), inet_put_port() could receive just sk, etc. But each proto will still have its own hashfn, so proto's callbacks will be called to hash/unhash sockets, so this will give us just one extra dereference. No? What do you think? Hmmm... Even raw_hash, etc. may become simpler. On the other hand, maybe this is a good idea, but I'm not familiar enough with this code yet to foresee such things in advance... I think that we should try to prepare a patch and look, but if you have something ready, then it's better to review your stuff first. - Arnaldo Thanks, Pavel
Re: hard hang through qdisc
On Thu, 2008-31-01 at 13:21 +0100, Andi Kleen wrote: I just managed to hang a 2.6.24 (+ some non network patches) kernel with the following (non sensical) command tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100 No oops or anything just hangs. While I understand root can do bad things just hanging like this seems a little extreme.

---
lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
lilsol:~# uname -a
Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686 GNU/Linux
lilsol:~# tc qdisc ls dev eth0
qdisc tbf 8001: root rate 1000bit burst 10b lat 737.3ms
lilsol:~#
---

What do your patches do? cheers, jamal
Re: hard hang through qdisc
- lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100
lilsol:~# uname -a
Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686 GNU/Linux
lilsol:~# tc qdisc ls dev eth0
qdisc tbf 8001: root rate 1000bit burst 10b lat 737.3ms
lilsol:~#
---

Can you try it again with current git mainline? What do your patches do? Nothing really related to qdiscs. I suspect it came from the git mainline patch I had (but forgot to mention in the first email) -Andi
Re: hard hang through qdisc
Andi Kleen wrote: - lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100 lilsol:~# uname -a Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686 Can you try it again with current git mainline? I'll look into it.
Re: [PATCH] cls_u32 u32_classify() +
On Wed, 2008-30-01 at 11:31 -0200, Dzianis Kahanovich wrote: Currently fine u32 hashkey ... at ... not work with relative offsets. There are simpliest fix to use eat. (sorry, v2) Hi, Please send me the commands you are trying to run that motivated this patch. cheers, jamal
Re: [PATCH 6/6][NETNS]: Udp sockets per-net lookup.
In article [EMAIL PROTECTED] (at Thu, 31 Jan 2008 05:20:07 -0800 (PST)), David Miller [EMAIL PROTECTED] says: The networking devices are even per-namespace already, so you can even say that each namespace is even physically different. Ah, okay, we are splitting weak domains... --yoshfuji
Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
Em Thu, Jan 31, 2008 at 04:18:51PM +0300, Pavel Emelyanov escreveu: Arnaldo Carvalho de Melo wrote: Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu: These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6. Good stuff! Yesterday I was perusing tcp_hash and I think we could have the hashinfo pointer stored perhaps in sk->sk_prot. That way we would be able to kill tcp_hash(), inet_put_port() could receive just sk, etc. But each proto will still have its own hashfn, so proto's callbacks will be called to hash/unhash sockets, so this will give us just one extra dereference. No? What do you think? Hmmm... Even raw_hash, etc. may become simpler. On the other hand, maybe this is a good idea, but I'm not familiar enough with this code yet to foresee such things in advance... I think that we should try to prepare a patch and look, but if you have something ready, then it's better to review your stuff first. gimme some minutes - Arnaldo
Re: hard hang through qdisc
Patrick McHardy wrote: Andi Kleen wrote: - lilsol:~# tc qdisc add dev eth0 root tbf rate 1000 burst 10 limit 100 lilsol:~# uname -a Linux lilsol 2.6.24 #1 PREEMPT Sun Jan 27 09:22:00 EST 2008 i686 Can you try it again with current git mainline? I'll look into it. Works for me:

qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
 Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

Packets are dropped as expected.
Re: [PATCH] [VLAN] vlan_dev: Initialize dev pointer only when it is being used
Benjamin Li wrote: Signed-off-by: Benjamin Li [EMAIL PROTECTED]
---
 net/8021q/vlan_dev.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 8059fa4..2fa5d68 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -49,7 +49,7 @@
  */
 static int vlan_dev_rebuild_header(struct sk_buff *skb)
 {
-	struct net_device *dev = skb->dev;
+	struct net_device *dev;
 	struct vlan_ethhdr *veth = (struct vlan_ethhdr *)(skb->data);
 
 	switch (veth->h_vlan_encapsulated_proto) {
@@ -60,6 +60,7 @@ static int vlan_dev_rebuild_header(struct sk_buff *skb)
 		return arp_find(veth->h_dest, skb);
 #endif
 	default:
+		dev = skb->dev;
 		pr_debug("%s: unable to resolve type %X addresses.\n",
 			 dev->name, ntohs(veth->h_vlan_encapsulated_proto));

This seems pretty pointless to me.
Re: NET: AX88796 use dev_dbg() instead of printk()
On Thu, Jan 31, 2008 at 11:25:31AM +, Ben Dooks wrote: Change to using dev_dbg() and the other dev_xxx() macros instead of printk, and update to use the print_mac() helper. Signed-off-by: Ben Dooks [EMAIL PROTECTED] Please send to [EMAIL PROTECTED] or [EMAIL PROTECTED], the email addresses I've always used for communication. The redhat.com address is only for legal sign-offs, not actual communication. Thanks, Jeff
Re: e1000 full-duplex TCP performance well below wire speed
Bill Fink wrote: If the receive direction uses a different GigE NIC that's part of the same quad-GigE, all is fine:

[EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -w2m 192.168.6.79 & nuttcp -f-beta -Irx -r -w2m 192.168.5.79
tx: 1186.5051 MB / 10.05 sec = 990.2250 Mbps 12 %TX 13 %RX 0 retrans
rx: 1186.7656 MB / 10.05 sec = 990.5204 Mbps 15 %TX 14 %RX 0 retrans

Could this be an issue with pause frames? At a previous job I remember having issues with a similar configuration using two broadcom sb1250 3 gigE port devices. If I ran bidirectional tests on a single pair of ports connected via cross over, it was slower than when I gave each direction its own pair of ports. The problem turned out to be that pause frame generation and handling was not configured correctly. -Ack
Re: hard hang through qdisc
Works for me:

qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms
 Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

Packets are dropped as expected.

I can still reproduce it on 64bit with http://halobates.de/config-qdisc (all qdiscs etc. compiled in for testing) with latest git tip (8af03e782cae1e0a0f530ddd22301cdd12cf9dc0). The command line above causes an instant hang. Also tried it with newer iproute2 (the original one was quite old), but it didn't make a difference. Perhaps it's related to what qdiscs are enabled? Can you please try with the above config? If everything fails I can do a bisect later. -Andi
[PATCH][NETFILTER]: Ipv6-related xt_hashlimit compilation fix.
The hashlimit_ipv6_mask() is called from under the IP6_NF_IPTABLES config option, but is not under it by itself. gcc warns us about it :) : net/netfilter/xt_hashlimit.c:473: warning: ‘hashlimit_ipv6_mask’ defined but not used Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 54aaf5b..744c7f2 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -469,6 +469,7 @@ static inline __be32 maskl(__be32 a, unsigned int l) return htonl(ntohl(a) & ~(~(u_int32_t)0 >> l)); } +#if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) static void hashlimit_ipv6_mask(__be32 *i, unsigned int p) { switch (p) { @@ -503,6 +504,7 @@ static void hashlimit_ipv6_mask(__be32 *i, unsigned int p) break; } } +#endif static int hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo,
Re: hard hang through qdisc
Andi Kleen wrote: Works for me: qdisc tbf 8001: root rate 1000bit burst 10b/8 mpu 0b lat 720.0ms Sent 0 bytes 0 pkt (dropped 9, overlimits 0 requeues 0) rate 0bit 0pps backlog 0b 0p requeues 0 Packets are dropped as expected. I can still reproduce it on 64bit with http://halobates.de/config-qdisc (all qdiscs etc. compiled in for testing) with latest git tip (8af03e782cae1e0a0f530ddd22301cdd12cf9dc0) The command line above causes an instant hang. Also tried it with newer iproute2 (the original one was quite old), but it didn't make a difference. Perhaps it's related to what qdiscs are enabled? I'm also testing on 64 bit, with all qdiscs enabled as modules. Can you please try with the above config? I'll give it a try later.
Re: e1000 full-duplex TCP performance well below wire speed
Hi all, slowly crawling through the mails. Brandeburg, Jesse wrote: The test was done with various mtu sizes ranging from 1500 to 9000, with ethernet flow control switched on and off, and using reno and cubic as a TCP congestion control. As asked in LKML thread, please post the exact netperf command used to start the client/server, whether or not you're using irqbalanced (aka irqbalance) and what cat /proc/interrupts looks like (you ARE using MSI, right?) We are using MSI, /proc/interrupts looks like: n0003:~# cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 0: 6536963 0 0 0 IO-APIC-edge timer 1: 2 0 0 0 IO-APIC-edge i8042 3: 1 0 0 0 IO-APIC-edge serial 8: 0 0 0 0 IO-APIC-edge rtc 9: 0 0 0 0 IO-APIC-fasteoi acpi 14: 32321 0 0 0 IO-APIC-edge libata 15: 0 0 0 0 IO-APIC-edge libata 16: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb5 18: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb4 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3 23: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2 378: 17234866 0 0 0 PCI-MSI-edge eth1 379: 129826 0 0 0 PCI-MSI-edge eth0 NMI: 0 0 0 0 LOC: 6537181 6537326 6537149 6537052 ERR: 0 (sorry for the line break). What we don't understand is why only core0 gets the interrupts, since the affinity is set to f: # cat /proc/irq/378/smp_affinity f Right now, irqbalance is not running, though I can give it a shot if people think this will make a difference.
I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I did that and the results can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest The results with netperf running like netperf -t TCP_STREAM -H host -l 20 can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf1 I reran the tests with netperf -t test -H host -l 20 -c -C or in the case of TCP_RR with the suggested burst settings -b 4 -r 64k Yes, InterruptThrottleRate=8000 means there will be no more than 8000 ints/second from that adapter, and if interrupts are generated faster than that they are aggregated. Interestingly since you are interested in ultra low latency, and may be willing to give up some cpu for it during bulk transfers you should try InterruptThrottleRate=1 (can generate up to 7 ints/s) On the web page you'll see that there are about 4000 interrupts/s for most tests and up to 20,000/s for the TCP_RR test. Shall I change the throttle rate? just for completeness can you post the dump of ethtool -e eth0 and lspci -vvv? Yup, we'll give that info also. 
n0002:~# ethtool -e eth1 Offset Values -- -- 0x 00 30 48 93 94 2d 20 0d 46 f7 57 00 ff ff ff ff 0x0010 ff ff ff ff 6b 02 9a 10 d9 15 9a 10 86 80 df 80 0x0020 00 00 00 20 54 7e 00 00 00 10 da 00 04 00 00 27 0x0030 c9 6c 50 31 32 07 0b 04 84 29 00 00 00 c0 06 07 0x0040 08 10 00 00 04 0f ff 7f 01 4d ff ff ff ff ff ff 0x0050 14 00 1d 00 14 00 1d 00 af aa 1e 00 00 00 1d 00 0x0060 00 01 00 40 1e 12 ff ff ff ff ff ff ff ff ff ff 0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff cf 2f lspci -vvv for this card: 0e:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller Subsystem: Super Micro Computer Inc Unknown device 109a Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort- SERR- PERR- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 378 Region 0: Memory at ee20 (32-bit, non-prefetchable) [size=128K] Region 2: I/O ports at 5000 [size=32] Capabilities: [c8] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=1 PME- Capabilities: [d0] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+ Address: fee0f00c Data: 41b9 Capabilities: [e0] Express Endpoint IRQ 0 Device: Supported: MaxPayload 256 bytes, PhantFunc 0, ExtTag- Device: Latency L0s 512ns, L1 64us Device: AtnBtn- AtnInd- PwrInd- Device: Errors: Correctable- Non-Fatal- Fatal-
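Carsten's /proc/interrupts dump above shows every eth1 interrupt landing on CPU0 even though the affinity mask is f. A quick way to spot that kind of skew is to compare the per-CPU columns for one IRQ line. The sketch below is a hypothetical helper, not something from the thread; it only assumes the /proc/interrupts column layout shown above (IRQ label, one count per CPU, then chip/driver names).

```python
# Hypothetical helper: extract the per-CPU counts for one IRQ from
# /proc/interrupts-style text, to spot skewed distributions like the
# eth1 case above (all interrupts on CPU0).

def irq_distribution(proc_interrupts, irq):
    """Return the list of per-CPU counts for the IRQ labeled `irq`."""
    for line in proc_interrupts.splitlines():
        fields = line.split()
        if fields and fields[0].rstrip(":") == irq:
            counts = []
            for f in fields[1:]:
                if f.isdigit():
                    counts.append(int(f))
                else:
                    break  # reached the interrupt-chip/driver name columns
            return counts
    raise KeyError(irq)

# Sample data copied from the dump in the mail above:
sample = """\
           CPU0       CPU1       CPU2       CPU3
378:   17234866          0          0          0   PCI-MSI-edge  eth1
379:     129826          0          0          0   PCI-MSI-edge  eth0
"""

print(irq_distribution(sample, "378"))  # [17234866, 0, 0, 0]
```

With all the weight in the first column, either irqbalance or an explicit write to /proc/irq/378/smp_affinity would be needed to spread the load.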
Re: e1000 full-duplex TCP performance well below wire speed
Brief question I forgot to ask: Right now we are using the old version 7.3.20-k2. To save some effort on your end, shall we upgrade this to 7.6.15 or should our version be good enough? Thanks Carsten
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bill, I see similar results on my test systems Thanks for this report and for confirming our observations. Could you please confirm that a single-port bidirectional UDP link runs at wire speed? This helps to localize the problem to the TCP stack or interaction of the TCP stack with the e1000 driver and hardware. Cheers, Bruce
Re: e1000 full-duplex TCP performance well below wire speed
Hi David, Could this be an issue with pause frames? At a previous job I remember having issues with a similar configuration using two broadcom sb1250 3 gigE port devices. If I ran bidirectional tests on a single pair of ports connected via cross over, it was slower than when I gave each direction its own pair of ports. The problem turned out to be that pause frame generation and handling was not configured correctly. We had PAUSE frames turned off for our testing. The idea is to let TCP do the flow and congestion control. The problem with PAUSE+TCP is that it can cause head-of-line blocking, where a single oversubscribed output port on a switch can PAUSE a large number of flows on other paths. Cheers, Bruce
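Bruce's head-of-line blocking point can be made concrete with a toy model: a PAUSE frame stops the whole ingress link, so every flow sharing that link stalls, not just the flow headed for the congested port. This is purely illustrative; the flow names, link names, and topology are invented.

```python
# Toy model of PAUSE-induced head-of-line blocking: pausing one link
# stalls every flow whose path traverses it, regardless of destination.

def paused_flows(flows, paused_link):
    """Return the flows stalled when `paused_link` is paused."""
    return [name for name, path in flows.items() if paused_link in path]

flows = {
    "A->X": ["uplink1", "portX"],  # portX is the oversubscribed output
    "B->Y": ["uplink1", "portY"],  # innocent flow sharing uplink1
    "C->Z": ["uplink2", "portZ"],  # unrelated path
}

# portX's congestion makes the switch PAUSE uplink1:
print(paused_flows(flows, "uplink1"))  # ['A->X', 'B->Y'] -- B->Y stalls too
```

TCP's per-flow congestion control avoids this coupling, which is the rationale for disabling PAUSE in these tests.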
Re: rtl8150: use default MTU of 1500
On Wed, 30 Jan 2008, Lennert Buytenhek wrote: The RTL8150 driver uses an MTU of 1540 by default, which causes a bunch of problems -- it prevents booting from NFS root, for one. Agreed, although it is a bit strange how this particular bug has sneaked up for so long... cheers, Petko Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED] Cc: Petko Manolov [EMAIL PROTECTED] --- linux-2.6.24-git7.orig/drivers/net/usb/rtl8150.c 2008-01-24 23:58:37.0 +0100 +++ linux-2.6.24-git7/drivers/net/usb/rtl8150.c 2008-01-30 20:29:00.0 +0100 @@ -925,9 +925,8 @@ netdev->hard_start_xmit = rtl8150_start_xmit; netdev->set_multicast_list = rtl8150_set_multicast; netdev->set_mac_address = rtl8150_set_mac_address; netdev->get_stats = rtl8150_netdev_stats; - netdev->mtu = RTL8150_MTU; SET_ETHTOOL_OPS(netdev, &ops); dev->intr_interval = 100;/* 100ms */ if (!alloc_all_urbs(dev)) {
Re: [PATCH 2/6][INET]: Consolidate inet(6)_hash_connect.
Em Thu, Jan 31, 2008 at 11:39:55AM -0200, Arnaldo Carvalho de Melo escreveu: Em Thu, Jan 31, 2008 at 04:18:51PM +0300, Pavel Emelyanov escreveu: Arnaldo Carvalho de Melo wrote: Em Thu, Jan 31, 2008 at 03:32:09PM +0300, Pavel Emelyanov escreveu: These two functions are the same except for what they call to check_established and hash for a socket. This saves half-a-kilo for ipv4 and ipv6. Good stuff! Yesterday I was perusing tcp_hash and I think we could have the hashinfo pointer stored perhaps in sk->sk_prot. That way we would be able to kill tcp_hash(), inet_put_port() could receive just sk, etc. But each proto will still have its own hashfn, so proto's callbacks will be called to hash/unhash sockets, so this will give us just one extra dereference. No? What do you think? Hmmm... Even raw_hash, etc may become simpler. On the other hand maybe this is a good idea, but I'm not familiar enough with this code yet to foresee such things in advance... I think that we should try to prepare a patch and look, but if you have smth ready, then it's better to review your stuff first. gimme some minutes A bit more than minutes tho, but here it is, I'm testing it now. Take a look and if testing is ok I'll submit it with a proper description.
- Arnaldo diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h index fdff630..62a5b69 100644 --- a/include/net/inet6_hashtables.h +++ b/include/net/inet6_hashtables.h @@ -49,7 +49,7 @@ static inline int inet6_sk_ehashfn(const struct sock *sk) return inet6_ehashfn(laddr, lport, faddr, fport); } -extern void __inet6_hash(struct inet_hashinfo *hashinfo, struct sock *sk); +extern void __inet6_hash(struct sock *sk); /* * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 133cf30..f00f057 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -29,7 +29,6 @@ #undef INET_CSK_CLEAR_TIMERS struct inet_bind_bucket; -struct inet_hashinfo; struct tcp_congestion_ops; /* @@ -59,6 +58,8 @@ struct inet_connection_sock_af_ops { int level, int optname, char __user *optval, int __user *optlen); void(*addr2sockaddr)(struct sock *sk, struct sockaddr *); + int (*bind_conflict)(const struct sock *sk, +const struct inet_bind_bucket *tb); }; /** inet_connection_sock - INET connection oriented sock @@ -244,10 +245,7 @@ extern struct request_sock *inet_csk_search_req(const struct sock *sk, const __be32 laddr); extern int inet_csk_bind_conflict(const struct sock *sk, const struct inet_bind_bucket *tb); -extern int inet_csk_get_port(struct inet_hashinfo *hashinfo, -struct sock *sk, unsigned short snum, -int (*bind_conflict)(const struct sock *sk, - const struct inet_bind_bucket *tb)); +extern int inet_csk_get_port(struct sock *sk, unsigned short snum); extern struct dst_entry* inet_csk_route_req(struct sock *sk, const struct request_sock *req); diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h index c23c4ed..48ac620 100644 --- a/include/net/inet_hashtables.h +++ b/include/net/inet_hashtables.h @@ -221,9 +221,9 @@ static inline int inet_sk_listen_hashfn(const struct sock *sk) } /* Caller must disable local 
BH processing. */ -static inline void __inet_inherit_port(struct inet_hashinfo *table, - struct sock *sk, struct sock *child) +static inline void __inet_inherit_port(struct sock *sk, struct sock *child) { + struct inet_hashinfo *table = sk->sk_prot->hashinfo; const int bhash = inet_bhashfn(inet_sk(child)->num, table->bhash_size); struct inet_bind_hashbucket *head = &table->bhash[bhash]; struct inet_bind_bucket *tb; @@ -235,15 +235,14 @@ static inline void __inet_inherit_port(struct inet_hashinfo *table, spin_unlock(&head->lock); } -static inline void inet_inherit_port(struct inet_hashinfo *table, -struct sock *sk, struct sock *child) +static inline void inet_inherit_port(struct sock *sk, struct sock *child) { local_bh_disable(); - __inet_inherit_port(table, sk, child); + __inet_inherit_port(sk, child); local_bh_enable(); } -extern void inet_put_port(struct inet_hashinfo *table, struct sock *sk); +extern void inet_put_port(struct sock *sk); extern void inet_listen_wlock(struct inet_hashinfo *hashinfo); @@ -266,41 +265,9 @@ static inline void inet_listen_unlock(struct inet_hashinfo *hashinfo)
Re: rtl8150: use default MTU of 1500
On Thu, Jan 31, 2008 at 05:42:34PM +0200, Petko Manolov wrote: The RTL8150 driver uses an MTU of 1540 by default, which causes a bunch of problems -- it prevents booting from NFS root, for one. Agreed, although it is a bit strange how this particular bug has sneaked up for so long... I posted this patch sometime in 2006, and you asked me a question about it then (why we don't just set RTL8150_MTU to 1500 -- the answer would be that RTL8150_MTU is used in a couple more places in the driver, including for allocing skbuffs), but I failed to follow up to that question at the time, which is why I assume it got dropped. I have been carrying the patch in my own tree since then, and only noticed recently that the patch never made it upstream. cheers, Lennert Signed-off-by: Lennert Buytenhek [EMAIL PROTECTED] Cc: Petko Manolov [EMAIL PROTECTED] --- linux-2.6.24-git7.orig/drivers/net/usb/rtl8150.c 2008-01-24 23:58:37.0 +0100 +++ linux-2.6.24-git7/drivers/net/usb/rtl8150.c 2008-01-30 20:29:00.0 +0100 @@ -925,9 +925,8 @@ netdev->hard_start_xmit = rtl8150_start_xmit; netdev->set_multicast_list = rtl8150_set_multicast; netdev->set_mac_address = rtl8150_set_mac_address; netdev->get_stats = rtl8150_netdev_stats; -netdev->mtu = RTL8150_MTU; SET_ETHTOOL_OPS(netdev, &ops); dev->intr_interval = 100; /* 100ms */ if (!alloc_all_urbs(dev)) {
Re: e1000 full-duplex TCP performance well below wire speed
Hi Andi, Andi Kleen wrote: Another issue with full duplex TCP not mentioned yet is that if TSO is used the output will be somewhat bursty and might cause problems with the TCP ACK clock of the other direction because the ACKs would need to squeeze in between full TSO bursts. You could try disabling TSO with ethtool. I just tried that: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3 It seems that the numbers do get better (the sweet spot seems to be MTU 6000 with 914 MBit/s and 927 MBit/s), however for other settings the results vary a lot so I'm not sure how large the statistical fluctuations are. Next I'll test whether it makes sense to enlarge the ring buffers. Thanks Carsten
Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}
Andi Kleen wrote: I believe the problem was that all of my ports were used up with TIME_WAIT sockets and so it couldn't create more. My test case was similar to this: Ah that's simple to solve then :- use more IP addresses and bind to them in RR in your user program. Arguably the Linux TCP code should be able to do this by itself when enough IP addresses are available, but it's not very hard to do in user space using bind(2) BTW it's also a very unusual case -- in most cases there are more remote IP addresses This could be done, but it does decrease our options for testing certain scenarios. So, is there a better way to max out the connections per second without having to use tcp_tw_recycle? Well did you profile where the bottle necks were? Perhaps also just increase the memory allowed for TCP sockets. I may be missing something, but I believe the issue is that the sockets wait around a while (maybe 30 seconds or so) in TIME_WAIT state. So, even if we use all 64k of the local port range, that will limit us to about 2000 new sockets per second, as we have to wait for old ones to transition out of TIME_WAIT. I guess I could probably decrease TIME_WAIT, but then all of my connections would be affected, not just the ones on the ports creating very large numbers of connections per second. From 'man tcp', it does not seem I can set the TIME_WAIT on a per-socket basis. I don't know exactly how the tcp_tw_recycle works, but it seems like it could be made to only take effect when all local ports are used up in TIME_WAIT. It could then recycle the oldest one as a new socket is requested. For any normal program, it would be very unlikely to ever need to recycle in this case because there would be enough free IP/port pairs available. But, for weird things like my own, at least it could be made to work w/out hacking the global TIME_WAIT.
Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
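Ben's connection-rate ceiling is easy to sanity-check. Using his figures (roughly 64K local ports, each held in TIME_WAIT for about 30 seconds) the arithmetic does land near the "about 2000 per second" he quotes; both inputs are taken from his description, not measured, and real systems are further constrained by net.ipv4.ip_local_port_range and the actual 2*MSL timer.

```python
# Back-of-the-envelope ceiling on new connections/s when every connection
# burns a fresh local port that then sits in TIME_WAIT.
# Both inputs are assumptions quoted from the discussion, not measurements.

LOCAL_PORTS = 64 * 1024   # "all 64k of the local port range"
TIME_WAIT_SECS = 30       # "maybe 30 seconds or so"

max_rate = LOCAL_PORTS / TIME_WAIT_SECS
print(f"{max_rate:.0f} connections/s")  # ~2185, i.e. "about 2000"
```

This is why adding local IP addresses (Andi's suggestion) helps: each address brings its own port space, multiplying the ceiling.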
Re: [PATCH] [1/1] Deprecate tcp_tw_{reuse,recycle}
On Thu, Jan 31, 2008 at 08:41:38AM -0800, Ben Greear wrote: I don't know exactly how the tcp_tw_recycle works, but it seems like it could be made to only take effect when all local ports are used up in TIME_WAIT. TIME-WAIT does not actually use up local ports; it uses up remote ports because it is done on the LISTEN socket which always has a fixed local port. And it has no idea how many ports the other end has left. -Andi
Re: e1000 full-duplex TCP performance well below wire speed
Hi all, Brandeburg, Jesse wrote: I would suggest you try TCP_RR with a command line something like this: netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K I did that and the results can be found here: https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest seems something went wrong and all you ran was the 1 byte tests, where it should have been 64K both directions (request/response). Yes, shell-quoting got me there. I'll re-run the tests, so please don't look at the TCP_RR results too closely. I think I'll be able to run maybe one or two more tests today, rest will follow tomorrow. Thanks for bearing with me Carsten PS: Am I right that the TCP_RR tests should only be run on a single node at a time, not on both ends simultaneously?
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, 31 Jan 2008 13:46:32 +0100 Andi Kleen [EMAIL PROTECTED] wrote: TSO interacts badly with many queueing disciplines because they rely on reordering packets from different streams and the large TSO packets can make this difficult. This patch disables TSO for sockets that send over devices with non standard queueing disciplines. That's anything but noop or pfifo_fast and pfifo right now. Longer term other queueing disciplines could be checked if they are also ok with TSO. If yes they can set the TCQ_F_GSO_OK flag too. It is still enabled for the standard pfifo_fast because that will never reorder packets with the same type-of-service. This means 99+% of all users will still be able to use TSO just fine. The status is only set up at socket creation so a shifted route will not reenable TSO on an existing socket. I don't think that's a problem though. Signed-off-by: Andi Kleen [EMAIL PROTECTED] Fix the broken qdisc instead. -- Stephen Hemminger [EMAIL PROTECTED]
[PATCH 1/1]: Add support for aes-ctr to ipsec
Very sorry, re-posting as first patch was incomplete. The below patch allows IPsec to use CTR mode with AES encryption algorithm. Tested this using setkey in ipsec-tools. regards, Joy Signed-off-by: Joy Latten [EMAIL PROTECTED] -- diff -urpN net-2.6.25/include/linux/pfkeyv2.h net-2.6.25.patch/include/linux/pfkeyv2.h --- net-2.6.25/include/linux/pfkeyv2.h 2008-01-29 11:48:00.0 -0600 +++ net-2.6.25.patch/include/linux/pfkeyv2.h 2008-01-29 13:43:59.0 -0600 @@ -298,6 +298,7 @@ struct sadb_x_sec_ctx { #define SADB_X_EALG_BLOWFISHCBC 7 #define SADB_EALG_NULL 11 #define SADB_X_EALG_AESCBC 12 +#define SADB_X_EALG_AESCTR 13 #define SADB_X_EALG_CAMELLIACBC 22 #define SADB_EALG_MAX 253 /* last EALG */ /* private allocations should use 249-255 (RFC2407) */ diff -urpN net-2.6.25/net/xfrm/xfrm_algo.c net-2.6.25.patch/net/xfrm/xfrm_algo.c --- net-2.6.25/net/xfrm/xfrm_algo.c 2008-01-29 11:48:03.0 -0600 +++ net-2.6.25.patch/net/xfrm/xfrm_algo.c 2008-01-29 13:42:43.0 -0600 @@ -300,6 +300,23 @@ static struct xfrm_algo_desc ealg_list[] .sadb_alg_maxbits = 256 } }, +{ + .name = "rfc3686(ctr(aes))", + + .uinfo = { + .encr = { + .blockbits = 128, + .defkeybits = 160, /* 128-bit key + 32-bit nonce */ + } + }, + + .desc = { + .sadb_alg_id = SADB_X_EALG_AESCTR, + .sadb_alg_ivlen = 8, + .sadb_alg_minbits = 128, + .sadb_alg_maxbits = 256 + } +}, }; static struct xfrm_algo_desc calg_list[] = {
Re: e1000 full-duplex TCP performance well below wire speed
Hi Bruce, On Thu, 31 Jan 2008, Bruce Allen wrote: I see similar results on my test systems Thanks for this report and for confirming our observations. Could you please confirm that a single-port bidrectional UDP link runs at wire speed? This helps to localize the problem to the TCP stack or interaction of the TCP stack with the e1000 driver and hardware. Yes, a single-port bidirectional UDP test gets full GigE line rate in both directions with no packet loss. [EMAIL PROTECTED] ~]$ nuttcp -f-beta -Itx -u -Ru -w2m 192.168.6.79 nuttcp -f-beta -Irx -r -u -Ru -w2m 192.168.6.79 tx: 1187.0078 MB / 10.04 sec = 992.0550 Mbps 19 %TX 7 %RX 0 / 151937 drop/pkt 0.00 %loss rx: 1187.1016 MB / 10.03 sec = 992.3408 Mbps 19 %TX 7 %RX 0 / 151949 drop/pkt 0.00 %loss -Bill
RE: e1000 full-duplex TCP performance well below wire speed
Carsten Aulbert wrote: PS: Am I right that the TCP_RR tests should only be run on a single node at a time, not on both ends simultaneously? yes, they are a request/response test, and so perform the bidirectional test with a single node starting the test.
Re: e1000 full-duplex TCP performance well below wire speed
netperf was used without any special tuning parameters. Usually we start two processes on two hosts which start (almost) simultaneously, last for 20-60 seconds and simply use UDP_STREAM (works well) and TCP_STREAM, i.e. on 192.168.0.202: netperf -H 192.168.2.203 -t TCP_STREAM -l 20 on 192.168.0.203: netperf -H 192.168.2.202 -t TCP_STREAM -l 20 192.168.0.20[23] here is on eth0 which cannot do jumbo frames, thus we use the .2. part for eth1 for a range of mtus. The server is started on both nodes with the start-stop-daemon and no special parameters I'm aware of. So long as you are relying on external (netperf relative) means to report the throughput, those command lines would be fine. I wouldn't be comfortable relying on the sum of the netperf-reported throughputs with those command lines though. Netperf2 has no test synchronization, so two separate commands, particularly those initiated on different systems, are subject to skew errors. 99 times out of ten they might be epsilon, but I get a _little_ paranoid there. There are three alternatives: 1) use netperf4. not as convenient for quick testing at present, but it has explicit test synchronization, so you know that the numbers presented are from when all connections were actively transferring data 2) use the aforementioned burst TCP_RR test. This is then a single netperf with data flowing both ways on a single connection so no issue of skew, but perhaps an issue of being one connection and so one process on each end. 3) start both tests from the same system and follow the suggestions contained in: http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html particularly: http://www.netperf.org/svn/netperf2/tags/netperf-2.4.4/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance and use a combination of TCP_STREAM and TCP_MAERTS (STREAM backwards) tests.
happy benchmarking, rick jones
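Rick's skew-error concern can be quantified: if two fixed-length tests don't start at the same instant, the head and tail of each run carry unopposed traffic, so summing the two netperf-reported averages overstates the truly concurrent aggregate. The numbers below are invented for illustration, not taken from the thread.

```python
# Skew error when summing two unsynchronized fixed-length tests:
# two runs of `duration` seconds whose starts differ by `skew` seconds
# overlap for only (duration - skew) seconds.

def overlap_fraction(duration, skew):
    """Fraction of each run spent actually transferring concurrently."""
    return max(duration - skew, 0.0) / duration

# A 20 s pair of runs started 0.5 s apart:
print(overlap_fraction(20.0, 0.5))  # 0.975 -> 2.5% of each run is unopposed
```

For sub-second skews on 20-60 s runs the error is indeed "epsilon", which is why the netperf4 or single-connection TCP_RR alternatives matter mainly for short or tightly compared runs.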
Re: [PATCH 0/6] preparations to enable netdevice notifiers inside a namespace (resend)
Benjamin Thery wrote: On Jan 31, 2008 3:58 PM, Daniel Lezcano [EMAIL PROTECTED] wrote: Denis V. Lunev wrote: Here are some preparations and cleanups to enable network device/inet address notifiers inside a namespace. This set of patches has been originally sent last Friday. One cleanup patch from the original series is dropped as wrong, thanks to Daniel Lezcano. Can you explain please. I think Denis refers to the patch called 3/7 Prohibit assignment of 0.0.0.0 as interface address, which he dropped because it was inappropriate, no? Yes, you are right, Denis explained it to me in a private email. I think I really need to sleep a little more :)
[NET_SCHED 00/04]: External SFQ classifiers/flow classifier
These patches add support for external classifiers to SFQ and add a new flow classifier, which can do hashing based on user-specified keys or deterministic mapping of keys to classes. Additionally there is a patch to make the SFQ queues visible as classes to verify that the hash is indeed doing something useful and a patch to constify struct tcf_ext_map, which I had queued in the same tree. Please apply, thanks. include/linux/pkt_cls.h | 50 include/linux/pkt_sched.h |5 + include/net/pkt_cls.h |6 +- net/sched/Kconfig | 11 + net/sched/Makefile|1 + net/sched/cls_api.c |6 +- net/sched/cls_basic.c |2 +- net/sched/cls_flow.c | 660 + net/sched/cls_fw.c|2 +- net/sched/cls_route.c |2 +- net/sched/cls_tcindex.c |2 +- net/sched/cls_u32.c |2 +- net/sched/sch_sfq.c | 134 +- 13 files changed, 868 insertions(+), 15 deletions(-) create mode 100644 net/sched/cls_flow.c Patrick McHardy (4): [NET_SCHED]: Constify struct tcf_ext_map [NET_SCHED]: sch_sfq: add support for external classifiers [NET_SCHED]: sch_sfq: make internal queues visible as classes [NET_SCHED]: Add flow classifier
[NET_SCHED 01/04]: Constify struct tcf_ext_map
[NET_SCHED]: Constify struct tcf_ext_map

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 12e33ddf57910b685501df10bd92223ea9b98fd6
tree 1ce47c7b6b6b968940f3dc28f9d7839e78c85089
parent 8af03e782cae1e0a0f530ddd22301cdd12cf9dc0
author Patrick McHardy [EMAIL PROTECTED] Wed, 30 Jan 2008 21:59:26 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:55 +0100

 include/net/pkt_cls.h   | 6 +++---
 net/sched/cls_api.c     | 6 +++---
 net/sched/cls_basic.c   | 2 +-
 net/sched/cls_fw.c      | 2 +-
 net/sched/cls_route.c   | 2 +-
 net/sched/cls_tcindex.c | 2 +-
 net/sched/cls_u32.c     | 2 +-
 7 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 8716eb7..d349c66 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -131,14 +131,14 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
 extern int tcf_exts_validate(struct tcf_proto *tp, struct nlattr **tb,
 			     struct nlattr *rate_tlv, struct tcf_exts *exts,
-			     struct tcf_ext_map *map);
+			     const struct tcf_ext_map *map);
 extern void tcf_exts_destroy(struct tcf_proto *tp, struct tcf_exts *exts);
 extern void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
 			    struct tcf_exts *src);
 extern int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts,
-			 struct tcf_ext_map *map);
+			 const struct tcf_ext_map *map);
 extern int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts,
-			       struct tcf_ext_map *map);
+			       const struct tcf_ext_map *map);
 
 /**
  * struct tcf_pkt_info - packet information
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3377ca0..0fbedca 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -482,7 +482,7 @@ EXPORT_SYMBOL(tcf_exts_destroy);
 
 int tcf_exts_validate(struct tcf_proto *tp, struct nlattr **tb,
 		      struct nlattr *rate_tlv, struct tcf_exts *exts,
-		      struct tcf_ext_map *map)
+		      const struct tcf_ext_map *map)
 {
 	memset(exts, 0, sizeof(*exts));
 
@@ -535,7 +535,7 @@ void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
 EXPORT_SYMBOL(tcf_exts_change);
 
 int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts,
-		  struct tcf_ext_map *map)
+		  const struct tcf_ext_map *map)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	if (map->action && exts->action) {
@@ -571,7 +571,7 @@ EXPORT_SYMBOL(tcf_exts_dump);
 
 int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts,
-			struct tcf_ext_map *map)
+			const struct tcf_ext_map *map)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	if (exts->action)
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index bfb4342..956915c 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -35,7 +35,7 @@ struct basic_filter
 	struct list_head	link;
 };
 
-static struct tcf_ext_map basic_ext_map = {
+static const struct tcf_ext_map basic_ext_map = {
 	.action = TCA_BASIC_ACT,
 	.police = TCA_BASIC_POLICE
 };
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 436a6e7..b0f90e5 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -47,7 +47,7 @@ struct fw_filter
 	struct tcf_exts		exts;
 };
 
-static struct tcf_ext_map fw_ext_map = {
+static const struct tcf_ext_map fw_ext_map = {
 	.action = TCA_FW_ACT,
 	.police = TCA_FW_POLICE
 };
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index f7e7d39..784dcb8 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -62,7 +62,7 @@ struct route4_filter
 
 #define ROUTE4_FAILURE ((struct route4_filter*)(-1L))
 
-static struct tcf_ext_map route_ext_map = {
+static const struct tcf_ext_map route_ext_map = {
 	.police = TCA_ROUTE4_POLICE,
 	.action = TCA_ROUTE4_ACT
 };
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index ee60b2d..7a7bff5 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -55,7 +55,7 @@ struct tcindex_data {
 	int fall_through;	/* 0: only classify if explicit match */
 };
 
-static struct tcf_ext_map tcindex_ext_map = {
+static const struct tcf_ext_map tcindex_ext_map = {
 	.police = TCA_TCINDEX_POLICE,
 	.action = TCA_TCINDEX_ACT
 };
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index e8a7756..b18fa95 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -82,7 +82,7 @@ struct tc_u_common
 	u32			hgenerator;
 };
 
-static struct tcf_ext_map u32_ext_map = {
+static const struct tcf_ext_map u32_ext_map = {
 	.action = TCA_U32_ACT,
 	.police = TCA_U32_POLICE
 };
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to
[NET_SCHED 02/04]: sch_sfq: add support for external classifiers
[NET_SCHED]: sch_sfq: add support for external classifiers

Add support for external classifiers to allow using different flow
hash functions similar to ESFQ. When no classifier is attached the
built-in hash is used as before.

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 6049892cc4acca9af393e134e4cdaf6b3e1ccad9
tree 9a8347d45808de2aef14486e5792fcab58baf3fe
parent 12e33ddf57910b685501df10bd92223ea9b98fd6
author Patrick McHardy [EMAIL PROTECTED] Wed, 30 Jan 2008 21:59:27 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:55 +0100

 net/sched/sch_sfq.c | 95 +++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 91af539..d818d19 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -95,6 +95,7 @@ struct sfq_sched_data
 	int		limit;
 
 /* Variables */
+	struct tcf_proto *filter_list;
 	struct timer_list perturb_timer;
 	u32		perturbation;
 	sfq_index	tail;		/* Index of current slot in round */
@@ -155,6 +156,39 @@ static unsigned sfq_hash(struct sfq_sched_data *q, struct sk_buff *skb)
 	return sfq_fold_hash(q, h, h2);
 }
 
+static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
+				 int *qerr)
+{
+	struct sfq_sched_data *q = qdisc_priv(sch);
+	struct tcf_result res;
+	int result;
+
+	if (TC_H_MAJ(skb->priority) == sch->handle &&
+	    TC_H_MIN(skb->priority) > 0 &&
+	    TC_H_MIN(skb->priority) <= SFQ_HASH_DIVISOR)
+		return TC_H_MIN(skb->priority);
+
+	if (!q->filter_list)
+		return sfq_hash(q, skb) + 1;
+
+	*qerr = NET_XMIT_BYPASS;
+	result = tc_classify(skb, q->filter_list, &res);
+	if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+		switch (result) {
+		case TC_ACT_STOLEN:
+		case TC_ACT_QUEUED:
+			*qerr = NET_XMIT_SUCCESS;
+		case TC_ACT_SHOT:
+			return 0;
+		}
+#endif
+		if (TC_H_MIN(res.classid) <= SFQ_HASH_DIVISOR)
+			return TC_H_MIN(res.classid);
+	}
+	return 0;
+}
+
 static inline void sfq_link(struct sfq_sched_data *q, sfq_index x)
 {
 	sfq_index p, n;
@@ -245,8 +279,18 @@ static int sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
-	unsigned hash = sfq_hash(q, skb);
+	unsigned int hash;
 	sfq_index x;
+	int ret;
+
+	hash = sfq_classify(skb, sch, &ret);
+	if (hash == 0) {
+		if (ret == NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return ret;
+	}
+	hash--;
 
 	x = q->ht[hash];
 	if (x == SFQ_DEPTH) {
@@ -289,8 +333,18 @@ static int sfq_requeue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
-	unsigned hash = sfq_hash(q, skb);
+	unsigned int hash;
 	sfq_index x;
+	int ret;
+
+	hash = sfq_classify(skb, sch, &ret);
+	if (hash == 0) {
+		if (ret == NET_XMIT_BYPASS)
+			sch->qstats.drops++;
+		kfree_skb(skb);
+		return ret;
+	}
+	hash--;
 
 	x = q->ht[hash];
 	if (x == SFQ_DEPTH) {
@@ -465,6 +519,8 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 static void sfq_destroy(struct Qdisc *sch)
 {
 	struct sfq_sched_data *q = qdisc_priv(sch);
+
+	tcf_destroy_chain(&q->filter_list);
 	del_timer(&q->perturb_timer);
 }
 
@@ -490,9 +546,40 @@ nla_put_failure:
 	return -1;
 }
 
+static int sfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+			    struct nlattr **tca, unsigned long *arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static unsigned long sfq_get(struct Qdisc *sch, u32 classid)
+{
+	return 0;
+}
+
+static struct tcf_proto **sfq_find_tcf(struct Qdisc *sch, unsigned long cl)
+{
+	struct sfq_sched_data *q = qdisc_priv(sch);
+
+	if (cl)
+		return NULL;
+	return &q->filter_list;
+}
+
+static void sfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	return;
+}
+
+static const struct Qdisc_class_ops sfq_class_ops = {
+	.get		=	sfq_get,
+	.change		=	sfq_change_class,
+	.tcf_chain	=	sfq_find_tcf,
+	.walk		=	sfq_walk,
+};
+
 static struct Qdisc_ops sfq_qdisc_ops __read_mostly = {
-	.next		=	NULL,
-	.cl_ops		=	NULL,
+	.cl_ops		=	&sfq_class_ops,
 	.id		=	"sfq",
 	.priv_size	=	sizeof(struct sfq_sched_data),
 	.enqueue	=	sfq_enqueue,
[NET_SCHED 04/04]: Add flow classifier
[NET_SCHED]: Add flow classifier

Add new flow classifier, which is meant to extend the SFQ hashing
capabilities without hard-coding new hash functions and also allows
deterministic mappings of keys to classes, replacing some out of tree
iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
IPs to marks, with fw filters to classes), ...

Some examples:

- Classic SFQ hash:

  tc filter add ... flow hash \
  	keys src,dst,proto,proto-src,proto-dst divisor 1024

- Classic SFQ hash, but using information from conntrack to work
  properly in combination with NAT:

  tc filter add ... flow hash \
  	keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024

- Map destination IPs of 192.168.0.0/24 to classids 1-257:

  tc filter add ... flow map \
  	key dst addend -192.168.0.0 divisor 256

- alternatively:

  tc filter add ... flow map \
  	key dst and 0xff

- similar, but reverse ordered:

  tc filter add ... flow map \
  	key dst and 0xff xor 0xff

Perturbation is currently not supported because we can't reliably kill
the timer on destruction.
Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 91a3a09ce63cba8df30ac42133a40dd64c0a7259
tree 2572feb8ffd88e6abf9270d2137af2a4cf7f542a
parent 7a281f8ef334a35d699682315e9f80a3e006376c
author Patrick McHardy [EMAIL PROTECTED] Wed, 30 Jan 2008 21:59:31 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:56 +0100

 include/linux/pkt_cls.h |  50 ++++
 net/sched/Kconfig       |  11 +
 net/sched/Makefile      |   1 
 net/sched/cls_flow.c    | 660 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 722 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 30b8571..1c1dba9 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -328,6 +328,56 @@ enum
 
 #define TCA_TCINDEX_MAX        (__TCA_TCINDEX_MAX - 1)
 
+/* Flow filter */
+
+enum
+{
+	FLOW_KEY_SRC,
+	FLOW_KEY_DST,
+	FLOW_KEY_PROTO,
+	FLOW_KEY_PROTO_SRC,
+	FLOW_KEY_PROTO_DST,
+	FLOW_KEY_IIF,
+	FLOW_KEY_PRIORITY,
+	FLOW_KEY_MARK,
+	FLOW_KEY_NFCT,
+	FLOW_KEY_NFCT_SRC,
+	FLOW_KEY_NFCT_DST,
+	FLOW_KEY_NFCT_PROTO_SRC,
+	FLOW_KEY_NFCT_PROTO_DST,
+	FLOW_KEY_RTCLASSID,
+	FLOW_KEY_SKUID,
+	FLOW_KEY_SKGID,
+	__FLOW_KEY_MAX,
+};
+
+#define FLOW_KEY_MAX	(__FLOW_KEY_MAX - 1)
+
+enum
+{
+	FLOW_MODE_MAP,
+	FLOW_MODE_HASH,
+};
+
+enum
+{
+	TCA_FLOW_UNSPEC,
+	TCA_FLOW_KEYS,
+	TCA_FLOW_MODE,
+	TCA_FLOW_BASECLASS,
+	TCA_FLOW_RSHIFT,
+	TCA_FLOW_ADDEND,
+	TCA_FLOW_MASK,
+	TCA_FLOW_XOR,
+	TCA_FLOW_DIVISOR,
+	TCA_FLOW_ACT,
+	TCA_FLOW_POLICE,
+	TCA_FLOW_EMATCHES,
+	__TCA_FLOW_MAX
+};
+
+#define TCA_FLOW_MAX	(__TCA_FLOW_MAX - 1)
+
 /* Basic filter */
 
 enum
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87af7c9..bccf42b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -307,6 +307,17 @@ config NET_CLS_RSVP6
 	  To compile this code as a module, choose M here: the
 	  module will be called cls_rsvp6.
 
+config NET_CLS_FLOW
+	tristate "Flow classifier"
+	select NET_CLS
+	---help---
+	  If you say Y here, you will be able to classify packets based on
+	  a configurable combination of packet keys. This is mostly useful
+	  in combination with SFQ.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called cls_flow.
+
 config NET_EMATCH
 	bool "Extended Matches"
 	select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 81ecbe8..1d2b0f7 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -35,6 +35,7 @@ obj-$(CONFIG_NET_CLS_RSVP)	+= cls_rsvp.o
 obj-$(CONFIG_NET_CLS_TCINDEX)	+= cls_tcindex.o
 obj-$(CONFIG_NET_CLS_RSVP6)	+= cls_rsvp6.o
 obj-$(CONFIG_NET_CLS_BASIC)	+= cls_basic.o
+obj-$(CONFIG_NET_CLS_FLOW)	+= cls_flow.o
 obj-$(CONFIG_NET_EMATCH)	+= ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)	+= em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE)	+= em_nbyte.o
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
new file mode 100644
index 000..5a7f6a3
--- /dev/null
+++ b/net/sched/cls_flow.c
@@ -0,0 +1,660 @@
+/*
+ * net/sched/cls_flow.c		Generic flow classifier
+ *
+ * Copyright (c) 2007, 2008 Patrick McHardy [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/jhash.h>
+#include <linux/random.h>
+#include <linux/pkt_cls.h>
+#include <linux/skbuff.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include
Re: [PATCH] Disable TSO for non standard qdiscs
> Fix the broken qdisc instead.

What do you mean? I don't think the qdiscs are broken. I cannot think
of any way how e.g. TBF can do anything useful with large TSO packets.

-Andi
[IPROUTE 01/02]: Add support for SFQ xstats
[IPROUTE]: Add support for SFQ xstats

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 196870f762ee393438c42115425f4af69e5b5186
tree 5650c1f93cc58886f8f97a0e55e374c157b96e2e
parent 54bb35c69cec6c730a4ac95530a1d2ca6670f73b
author Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 15:10:07 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 15:10:07 +0100

 include/linux/pkt_sched.h |  5 +
 tc/q_sfq.c                | 17 +
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 3276135..4ccd684 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -150,6 +150,11 @@ struct tc_sfq_qopt
 	unsigned	flows;		/* Maximal number of flows */
 };
 
+struct tc_sfq_xstats
+{
+	__u32	allot;
+};
+
 /*
  * NOTE: limit, divisor and flows are hardwired to code at the moment.
  *
diff --git a/tc/q_sfq.c b/tc/q_sfq.c
index 05385cf..ce4dade 100644
--- a/tc/q_sfq.c
+++ b/tc/q_sfq.c
@@ -100,8 +100,25 @@ static int sfq_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	return 0;
 }
 
+static int sfq_print_xstats(struct qdisc_util *qu, FILE *f,
+			    struct rtattr *xstats)
+{
+	struct tc_sfq_xstats *st;
+
+	if (xstats == NULL)
+		return 0;
+	if (RTA_PAYLOAD(xstats) < sizeof(*st))
+		return -1;
+	st = RTA_DATA(xstats);
+
+	fprintf(f, "allot %d ", st->allot);
+	fprintf(f, "\n");
+	return 0;
+}
+
 struct qdisc_util sfq_qdisc_util = {
 	.id		= "sfq",
 	.parse_qopt	= sfq_parse_opt,
 	.print_qopt	= sfq_print_opt,
+	.print_xstats	= sfq_print_xstats,
 };
[IPROUTE 02/02]: Add flow classifier support
[IPROUTE]: Add flow classifier support

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit ac3df2d7e37826b06cc9093f50d829a9da1873a4
tree b33a2b29abdcea0267fe7a357d282a4c2f67124b
parent 196870f762ee393438c42115425f4af69e5b5186
author Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:47 +0100
committer Patrick McHardy [EMAIL PROTECTED] Thu, 31 Jan 2008 18:52:47 +0100

 include/linux/pkt_cls.h |  50 +++
 tc/Makefile             |   1 
 tc/f_flow.c             | 347 +++
 3 files changed, 398 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index afb79d0..16869c2 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -328,6 +328,56 @@ enum
 
 #define TCA_TCINDEX_MAX        (__TCA_TCINDEX_MAX - 1)
 
+/* Flow filter */
+
+enum
+{
+	FLOW_KEY_SRC,
+	FLOW_KEY_DST,
+	FLOW_KEY_PROTO,
+	FLOW_KEY_PROTO_SRC,
+	FLOW_KEY_PROTO_DST,
+	FLOW_KEY_IIF,
+	FLOW_KEY_PRIORITY,
+	FLOW_KEY_MARK,
+	FLOW_KEY_NFCT,
+	FLOW_KEY_NFCT_SRC,
+	FLOW_KEY_NFCT_DST,
+	FLOW_KEY_NFCT_PROTO_SRC,
+	FLOW_KEY_NFCT_PROTO_DST,
+	FLOW_KEY_RTCLASSID,
+	FLOW_KEY_SKUID,
+	FLOW_KEY_SKGID,
+	__FLOW_KEY_MAX,
+};
+
+#define FLOW_KEY_MAX	(__FLOW_KEY_MAX - 1)
+
+enum
+{
+	FLOW_MODE_MAP,
+	FLOW_MODE_HASH,
+};
+
+enum
+{
+	TCA_FLOW_UNSPEC,
+	TCA_FLOW_KEYS,
+	TCA_FLOW_MODE,
+	TCA_FLOW_BASECLASS,
+	TCA_FLOW_RSHIFT,
+	TCA_FLOW_ADDEND,
+	TCA_FLOW_MASK,
+	TCA_FLOW_XOR,
+	TCA_FLOW_DIVISOR,
+	TCA_FLOW_ACT,
+	TCA_FLOW_POLICE,
+	TCA_FLOW_EMATCHES,
+	__TCA_FLOW_MAX
+};
+
+#define TCA_FLOW_MAX	(__TCA_FLOW_MAX - 1)
+
 /* Basic filter */
 
 enum
diff --git a/tc/Makefile b/tc/Makefile
index 0facc88..7ece958 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -18,6 +18,7 @@ TCMODULES += f_u32.o
 TCMODULES += f_route.o
 TCMODULES += f_fw.o
 TCMODULES += f_basic.o
+TCMODULES += f_flow.o
 TCMODULES += q_dsmark.o
 TCMODULES += q_gred.o
 TCMODULES += f_tcindex.o
diff --git a/tc/f_flow.c b/tc/f_flow.c
new file mode 100644
index 000..eca05cd
--- /dev/null
+++ b/tc/f_flow.c
@@ -0,0 +1,347 @@
+/*
+ * f_flow.c		Flow filter
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Patrick McHardy [EMAIL PROTECTED]
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+#include "utils.h"
+#include "tc_util.h"
+#include "m_ematch.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... flow ...\n"
+"\n"
+" [mapping mode]: map key KEY [ OPS ] ...\n"
+" [hashing mode]: hash keys KEY-LIST ...\n"
+"\n"
+"                 [ divisor NUM ] [ baseclass ID ] [ match EMATCH_TREE ]\n"
+"                 [ police POLICE_SPEC ] [ action ACTION_SPEC ]\n"
+"\n"
+"KEY-LIST := [ KEY-LIST , ] KEY\n"
+"KEY      := [ src | dst | proto | proto-src | proto-dst | iif | priority | \n"
+"              mark | nfct | nfct-src | nfct-dst | nfct-proto-src | \n"
+"              nfct-proto-dst | rt-classid | sk-uid | sk-gid ]\n"
+"OPS      := [ or NUM | and NUM | xor NUM | rshift NUM | addend NUM ]\n"
+"ID       := X:Y\n"
+	);
+}
+
+static const char *flow_keys[FLOW_KEY_MAX+1] = {
+	[FLOW_KEY_SRC]			= "src",
+	[FLOW_KEY_DST]			= "dst",
+	[FLOW_KEY_PROTO]		= "proto",
+	[FLOW_KEY_PROTO_SRC]		= "proto-src",
+	[FLOW_KEY_PROTO_DST]		= "proto-dst",
+	[FLOW_KEY_IIF]			= "iif",
+	[FLOW_KEY_PRIORITY]		= "priority",
+	[FLOW_KEY_MARK]			= "mark",
+	[FLOW_KEY_NFCT]			= "nfct",
+	[FLOW_KEY_NFCT_SRC]		= "nfct-src",
+	[FLOW_KEY_NFCT_DST]		= "nfct-dst",
+	[FLOW_KEY_NFCT_PROTO_SRC]	= "nfct-proto-src",
+	[FLOW_KEY_NFCT_PROTO_DST]	= "nfct-proto-dst",
+	[FLOW_KEY_RTCLASSID]		= "rt-classid",
+	[FLOW_KEY_SKUID]		= "sk-uid",
+	[FLOW_KEY_SKGID]		= "sk-gid",
+};
+
+static int flow_parse_keys(__u32 *keys, __u32 *nkeys, char *argv)
+{
+	char *s, *sep;
+	unsigned int i;
+
+	*keys = 0;
+	*nkeys = 0;
+	s = argv;
+	while (s != NULL) {
+		sep = strchr(s, ',');
+		if (sep)
+			*sep = '\0';
+
+		for (i = 0; i <= FLOW_KEY_MAX; i++) {
+			if (matches(s, flow_keys[i]) == 0) {
+				*keys |= 1 << i;
+				(*nkeys)++;
+				break;
+			}
+		}
+		if (i > FLOW_KEY_MAX) {
+			fprintf(stderr, "Unknown flow key \"%s\"\n", s);
+			return -1;
+		}
+		s = sep ? sep + 1 : NULL;
+	}
+	return 0;
+}
+
+static void transfer_bitop(__u32 *mask, __u32 *xor, __u32 m, __u32 x)
+{
+	*xor = x ^ (*xor & m);
+	*mask &= m;
+}
+
+static int get_addend(__u32 *addend, char *argv, __u32 keys)
+{
+	inet_prefix addr;
+	int sign = 0;
+	__u32 tmp;
+
+	if (*argv == '-') {
+		sign = 1;
+		argv++;
+	}
+
+	if (get_u32(&tmp, argv, 0) == 0)
+		goto out;
+
+	if (keys & (FLOW_KEY_SRC | FLOW_KEY_DST |
+		    FLOW_KEY_NFCT_SRC | FLOW_KEY_NFCT_DST) &&
+	    get_addr(&addr, argv, AF_UNSPEC) == 0) {
+		switch (addr.family) {
+		case AF_INET:
+			tmp = ntohl(addr.data[0]);
+			goto out;
+		case AF_INET6:
+			tmp = ntohl(addr.data[3]);
+			goto out;
+		}
+	}
+
+	return -1;
+out:
+	if (sign)
Re: [PATCH] Disable TSO for non standard qdiscs
Andi Kleen wrote:
>> Fix the broken qdisc instead.
>
> What do you mean? I don't think the qdiscs are broken. I cannot think
> of any way how e.g. TBF can do anything useful with large TSO packets.

Someone posted a patch some time ago to calculate the amount of tokens
needed in max_size portions and use that, but IMO people should just
configure TBF with the proper MTU for TSO.
Re: Still oopsing in nf_nat_move_storage()
On 01/29/2008 12:18 PM, Patrick McHardy wrote:
> Chuck Ebbert wrote:
>> nf_nat_move_storage():
>> /usr/src/debug/kernel-2.6.23/linux-2.6.23.i686/net/ipv4/netfilter/nf_nat_core.c:612
>>
>>   87: f7 47 64 80 01 00 00    testl $0x180,0x64(%edi)
>>   8e: 74 39                   je c9 <nf_nat_move_storage+0x65>
>>
>> line 612:
>>         if (!(ct->status & IPS_NAT_DONE_MASK))
>>                 return;
>>
>> ct is NULL
>
> The current kernel (and 2.6.23-stable) have:
>
>         if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
>                 return;
>
> so it seems you're using an old version.

Sorry, I re-used the analysis from before that change went in. I now
have an oops report from 2.6.23.14 on x86_64. It is oopsing there, and
only on x86_64 now, because x86_64 refuses to use a non-canonical
address. ct contains what appears to be ASCII data. i386 might be
dereferencing some random address instead of oopsing...

   0: 48 f7 45 78 80 01 00    testq $0x180,0x78(%rbp)
   7: 00
   8: 74 4c                   je 0x56
   a: 48 c7 c7 e0 18 28 88    mov $0x882818e0,%rdi

%rbp has a bogus (non-canonical) address. On i386 there is no such test
possible, so it will just dereference the address if it is mapped.
%rbp contains 8 valid ASCII chars: salcf x\
Re: e1000 full-duplex TCP performance well below wire speed
Carsten Aulbert wrote:
> Hi Andi,
>
> Andi Kleen wrote:
>> Another issue with full duplex TCP not mentioned yet is that if TSO is
>> used the output will be somewhat bursty and might cause problems with
>> the TCP ACK clock of the other direction because the ACKs would need
>> to squeeze in between full TSO bursts.
>>
>> You could try disabling TSO with ethtool.
>
> I just tried that:
> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTestNetperf3
>
> It seems that the numbers do get better (the sweet spot seems to be
> MTU 6000 with 914 MBit/s and 927 MBit/s), however for other settings
> the results vary a lot, so I'm not sure how large the statistical
> fluctuations are.
>
> Next I'll test whether it makes sense to enlarge the ring buffers.

Sometimes that may help if the system (CPU) is laggy or busy a lot, so
that the card has more buffers available (and can thus go longer
without servicing). Usually (if your system responds quickly) it's
better to use *smaller* ring sizes, as this reduces cache pressure;
hence the small default value. So unless the ethtool -S ethX output
indicates that your system is too busy (rx_no_buffer_count increases),
I would not recommend increasing the ring size.

Auke
Re: [PATCH] Disable TSO for non standard qdiscs
> Then change TBF to use skb_gso_segment? Be careful, the fact that

That doesn't help because it wants to interleave packets from different
streams to get everything fair and smooth. The only good way to handle
that is to split it up, and the simplest way to do this is to just tell
TCP to not do GSO in the first place.

-Andi
Re: [PATCH] Disable TSO for non standard qdiscs
Andi Kleen wrote:
>> Then change TBF to use skb_gso_segment? Be careful, the fact that
>
> That doesn't help because it wants to interleave packets from
> different streams to get everything fair and smooth. The only good way
> to handle that is to split it up and the simplest way to do this is to
> just tell TCP to not do GSO in the first place.

That's not correct: TBF keeps packets strictly ordered unless an inner
qdisc does reordering. But even then (let's say you use SFQ) packets of
a single flow will stay ordered. Segmenting TSO packets is no different
from having them arrive independently for other reasons.
Re: [PATCH] Disable TSO for non standard qdiscs
Andi Kleen wrote:
> TSO interacts badly with many queueing disciplines because they rely
> on reordering packets from different streams and the large TSO packets
> can make this difficult. This patch disables TSO for sockets that send
> over devices with non-standard queueing disciplines. That's anything
> but noop, pfifo_fast and pfifo right now.

Does this also imply that jumbo frames interact badly with these
qdiscs? Or IPoIB with its 65000-ish byte MTU?

rick jones
Re: [PATCH] Disable TSO for non standard qdiscs
Stephen Hemminger wrote:
> On Thu, 31 Jan 2008 19:37:35 +0100
> Andi Kleen [EMAIL PROTECTED] wrote:
>
>> On Thu, Jan 31, 2008 at 07:01:00PM +0100, Patrick McHardy wrote:
>>> Andi Kleen wrote:
>>>>> Fix the broken qdisc instead.
>>>>
>>>> What do you mean? I don't think the qdiscs are broken. I cannot
>>>> think of any way how e.g. TBF can do anything useful with large
>>>> TSO packets.
>>>
>>> Someone posted a patch some time ago to calculate the amount of
>>> tokens needed in max_size portions and use that, but IMO people
>>> should just configure TBF with the proper MTU for TSO.
>>
>> TBF with 64k atomic units will always be chunky and uneven. I don't
>> think that's a useful goal.
>>
>> -Andi
>
> Then change TBF to use skb_gso_segment?

Be careful: the fact that one skb ends up queueing multiple skbs would
cause issues for the parent qdisc (i.e. a work-generating qdisc).

How about keeping the TSO-capable flag on qdiscs, propagating the
non-capability up the tree, and performing segmentation before queueing
to the root?
Re: e1000 full-duplex TCP performance well below wire speed
Carsten Aulbert wrote:
> Hi all,
>
> slowly crawling through the mails.
>
> Brandeburg, Jesse wrote:
>>> The test was done with various mtu sizes ranging from 1500 to 9000,
>>> with ethernet flow control switched on and off, and using reno and
>>> cubic as a TCP congestion control.
>>
>> As asked in LKML thread, please post the exact netperf command used
>> to start the client/server, whether or not you're using irqbalanced
>> (aka irqbalance) and what cat /proc/interrupts looks like (you ARE
>> using MSI, right?)
>
> We are using MSI; /proc/interrupts looks like:
>
> n0003:~# cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3
>   0:    6536963          0          0          0   IO-APIC-edge      timer
>   1:          2          0          0          0   IO-APIC-edge      i8042
>   3:          1          0          0          0   IO-APIC-edge      serial
>   8:          0          0          0          0   IO-APIC-edge      rtc
>   9:          0          0          0          0   IO-APIC-fasteoi   acpi
>  14:      32321          0          0          0   IO-APIC-edge      libata
>  15:          0          0          0          0   IO-APIC-edge      libata
>  16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5
>  18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
>  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
>  23:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb2
> 378:   17234866          0          0          0   PCI-MSI-edge      eth1
> 379:     129826          0          0          0   PCI-MSI-edge      eth0
> NMI:          0          0          0          0
> LOC:    6537181    6537326    6537149    6537052
> ERR:          0
>
> What we don't understand is why only core0 gets the interrupts, since
> the affinity is set to f:
>
> # cat /proc/irq/378/smp_affinity
> f
>
> Right now, irqbalance is not running, though I can give it a shot if
> people think this will make a difference.
>
>> I would suggest you try TCP_RR with a command line something like
>> this:
>> netperf -t TCP_RR -H hostname -C -c -- -b 4 -r 64K
>
> I did that and the results can be found here:
> https://n0.aei.uni-hannover.de/wiki/index.php/NetworkTest

For convenience, 2.4.4 (perhaps earlier, I can never remember when I've
added things :) allows the output format for a TCP_RR test to be set to
the same as a _STREAM or _MAERTS test. And if you add a -v 2 to it you
will get the each-way values and the average round-trip latency:

[EMAIL PROTECTED]:~/netperf2_trunk$ src/netperf -t TCP_RR -H oslowest.cup -f m -v 2 -- -r 64K -b 4
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to oslowest.cup.hp.com (16.89.84.17) port 0 AF_INET : first burst 4
Local /Remote
Socket Size   Request  Resp.   Elapsed
Send   Recv   Size     Size    Time     Throughput
bytes  Bytes  bytes    bytes   secs.    10^6bits/sec

16384  87380  65536    65536   10.01    105.63
16384  87380

Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote Local  Remote  Latency    Rate     10^6bits/s
Send   Recv   Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0      0      0   49635.583  100.734  52.814     52.814

[EMAIL PROTECTED]:~/netperf2_trunk$

(this was a WAN test :)

rick jones

One of these days I may tweak netperf further so that if the CPU
utilization method for either end doesn't require calibration, CPU
utilization will always be done on that end. People's thoughts on that
tweak would be most welcome...
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 07:21:20PM +0100, Patrick McHardy wrote:
> Andi Kleen wrote:
>>> Then change TBF to use skb_gso_segment? Be careful, the fact that
>>
>> That doesn't help because it wants to interleave packets from
>> different streams to get everything fair and smooth. The only good
>> way to handle that is to split it up and the simplest way to do this
>> is to just tell TCP to not do GSO in the first place.
>
> Thats not correct, TBF keeps packets strictly ordered unless

My point was that without TSO different submitters will interleave
their streams (because they compete for the qdisc submission) and then
you end up with a smooth rate over time for all of them. If you submit
in large chunks only (as TSO does) it will always be more bursty, and
that works against the TBF goal. For a single submitter you would be
correct.

-Andi
RE: e1000 full-duplex TCP performance well below wire speed
Bill Fink wrote:
> a 2.6.15.4 kernel. The GigE NICs are Intel PRO/1000
> 82546EB_QUAD_COPPER, on a 64-bit/133-MHz PCI-X bus, using version
> 6.1.16-k2 of the e1000 driver, and running with 9000-byte jumbo
> frames. The TCP congestion control is BIC.

Bill, FYI, there was a known issue with e1000 (fixed in 7.0.38-k2) and
socket charge due to truesize that kept one end or the other from
opening its window. The result is not-so-great performance, and you
must upgrade the driver at both ends to fix it. It was fixed in commit
9e2feace1acd38d7a3b1275f7f9f8a397d09040e.

That commit itself needed a couple of follow-on bug fixes, but the
point is that you could download 7.3.20 from SourceForge (which would
compile on your kernel) and compare the performance with it if you were
interested in a further experiment.

Jesse
Re: [PATCH] Disable TSO for non standard qdiscs
On Thu, Jan 31, 2008 at 10:26:19AM -0800, Rick Jones wrote:
> Andi Kleen wrote:
>> TSO interacts badly with many queueing disciplines because they rely
>> on reordering packets from different streams and the large TSO
>> packets can make this difficult. This patch disables TSO for sockets
>> that send over devices with non standard queueing disciplines. That's
>> anything but noop or pfifo_fast and pfifo right now.
>
> Does this also imply that JumboFrames interacts badly with these
> qdiscs? Or IPoIB with its 65000ish byte MTU?

Correct. Of course it is always relative to the link speed. So if your
link is 10x faster and your packets 10x bigger, you can get similarly
smooth shaping.

-Andi