RE: raw PF_PACKET protocol selection
-Original Message- From: Herbert Xu [mailto:[EMAIL PROTECTED] Sent: den 9 oktober 2007 05:17 To: [EMAIL PROTECTED] Cc: netdev@vger.kernel.org Subject: Re: raw PF_PACKET protocol selection Joakim Tjernlund [EMAIL PROTECTED] wrote: I trying to open my own raw PF_PACKET socket to receive pkgs sent to this socket. I can only make ETH_P_ALL protocol work, but then I receive all pkgs and I want pkgs with a specific protocol type. I have tried lots of ETH_P types and none of them work. Naturally I make sure the sender is using the same protocol as my test program below. I guess I must be doing something wrong??? Your program works fine here. You did run it as root, right? Yes and ETH_P_ALL is the only protocol that prints anything I am on 2.6.22 Did you try stracing it? Just did and now it works, it didn't yesterday :( But if I change protocol to ETH_P_MOBITEX, I don't get any pkgs(I did change protocol on sending side too) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection
From: Ilpo Järvinen [EMAIL PROTECTED] Subject: Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection Date: Mon, 8 Oct 2007 14:11:55 +0300 (EEST) On Sun, 7 Oct 2007, TAKANO Ryousei wrote: From: Ilpo Järvinen [EMAIL PROTECTED] Subject: Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection Date: Fri, 5 Oct 2007 13:02:07 +0300 (EEST) On Thu, 4 Oct 2007, TAKANO Ryousei wrote: In case sacktag uses fastpath, this code won't be executed for the skb's that we would like to check (those with SACKED_RETRANS set, that are below the fastpath_skb_hint). We will eventually deal with the whole queue when fastpath_skb_hint gets set to NULL, with the next cumulative ACK that fully ACKs an skb at the latest. Maybe there's a need for a larger surgery than this to fix it. I think we need additional field to tcp_sock to avoid doing a full-walk per ACK: I think the problem occurs in slowpath. For example, in case when the receiver detects and sends back a new SACK block, the sender may fail to detect loss of a retransmitted packet. ...No, the slow path is very likely to _correct_ the problem! The problem occurs because fastpath _skips_ processing of skbs below fastpath_skb_hint whereas slowpath starts from tcp_write_queue_head. The retransmitted skbs we're interested in reside exactly on that skipped range. Sorry, my explanaton is insufficient. This problem occurs even if we are in the slowpath. For example, consider the case when the state of the retransmission queue is as shown in Fig.1 (http://projects.gtrc.aist.go.jp/gnet/meeting/sack-bug.html). There are 13 packets in the retransmission queue, P(0) and P(7) are retransmitted, and P(1) to P(9) except P(7) are already SACKed. If only a SACK block SACK(8,11) is received, P(0) is re-transmitted, because the ack_seq of P(0) is smaller than the end sequence number (11) of the SACK block. However, if the ACK packet has SACK(8,11) and SACK(1,3), these SACK blocks are sorted, and SACK(1,3) is process from P(0) to P(2) first. In this case, P(0) is not detected as lost because the ack_seq of P(0) is not smaller than the end sequence number (3) of this SACK block. Next, SACK(8,11) is scanned from P(3) to P(10). P(0) is not checked here, and the loss is not detected. Is it corrent? Ryousei Takano - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.24][trivial] fix inconsistency of terms
Fix inconsistency of terms: 1) D-SACK 2) F-RTO Signed-off-by: Ryousei Takano [EMAIL PROTECTED] --- Documentation/networking/ip-sysctl.txt |6 +++--- net/ipv4/tcp_input.c | 16 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 6ae2fef..829a4d8 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -184,14 +184,14 @@ tcp_frto - INTEGER F-RTO is an enhanced recovery algorithm for TCP retransmission timeouts. It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference - rather than intermediate router congestion. FRTO is sender-side + rather than intermediate router congestion. F-RTO is sender-side only modification. Therefore it does not require any support from the peer, but in a typical case, however, where wireless link is the local access link and most of the data flows downlink, the - faraway servers should have FRTO enabled to take advantage of it. + faraway servers should have F-RTO enabled to take advantage of it. If set to 1, basic version is enabled. 2 enables SACK enhanced F-RTO if flow uses SACK. The basic version can be used also when - SACK is in use though scenario(s) with it exists where FRTO + SACK is in use though scenario(s) with it exists where F-RTO interacts badly with the packet counting of the SACK enabled TCP flow. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e8c3948..372ff6f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -103,7 +103,7 @@ int sysctl_tcp_abc __read_mostly; #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ #define FLAG_ONLY_ORIG_SACKED 0x200 /* SACKs only non-rexmit sent before RTO */ #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ -#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained DSACK info */ +#define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */ #define FLAG_NONHEAD_RETRANS_ACKED 0x1000 /* Non-head rexmitted data was ACKed */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) @@ -866,7 +866,7 @@ static void tcp_disable_fack(struct tcp_sock *tp) tp-rx_opt.sack_ok = ~2; } -/* Take a notice that peer is sending DSACKs */ +/* Take a notice that peer is sending D-SACKs */ static void tcp_dsack_seen(struct tcp_sock *tp) { tp-rx_opt.sack_ok |= 4; @@ -1058,7 +1058,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric, * * With D-SACK the lower bound is extended to cover sequence space below * SND.UNA down to undo_marker, which is the last point of interest. Yet - * again, DSACK block must not to go across snd_una (for the same reason as + * again, D-SACK block must not to go across snd_una (for the same reason as * for the normal SACK blocks, explained above). But there all simplicity * ends, TCP might receive valid D-SACKs below that. As long as they reside * fully below undo_marker they do not affect behavior in anyway and can @@ -1080,7 +1080,7 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, if (!before(start_seq, tp-snd_nxt)) return 0; - /* In outstanding window? ...This is valid exit for DSACKs too. + /* In outstanding window? ...This is valid exit for D-SACKs too. * start_seq == snd_una is non-sensical (see comments above) */ if (after(start_seq, tp-snd_una)) @@ -1581,7 +1581,7 @@ void tcp_enter_frto(struct sock *sk) !icsk-icsk_retransmits)) { tp-prior_ssthresh = tcp_current_ssthresh(sk); /* Our state is too optimistic in ssthresh() call because cwnd -* is not reduced until tcp_enter_frto_loss() when previous FRTO +* is not reduced until tcp_enter_frto_loss() when previous F-RTO * recovery has not yet completed. Pattern would be this: RTO, * Cumulative ACK, RTO (2xRTO for the same segment does not end * up here twice). @@ -1760,7 +1760,7 @@ void tcp_enter_loss(struct sock *sk, int how) tcp_set_ca_state(sk, TCP_CA_Loss); tp-high_seq = tp-snd_nxt; TCP_ECN_queue_cwr(tp); - /* Abort FRTO algorithm if one is in progress */ + /* Abort F-RTO algorithm if one is in progress */ tp-frto_counter = 0; } @@ -1905,7 +1905,7 @@ static int tcp_time_to_recover(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); __u32 packets_out; - /* Do not perform any recovery during FRTO algorithm */ + /* Do not perform any recovery during F-RTO algorithm */ if (tp-frto_counter) return 0; @@ -2921,7
Re: [PATCH 4/8][BNX2X] resubmit as attachments: bnx2x_hsi.h
On Mon, Oct 08, 2007 at 06:16:17PM +0200, Eliezer Tamir wrote: bnx2x_hsi.h - machine generated hardware and microcode definitions over 7000 lines... how much of this is really needed? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][TCP] tcp_vegas.c: (tcp_vegas_cong_avoid) fix a bug in disabling slow start by gamma parameter
TCP Vegas implementation has a bug in the process of disabling slow-start with gamma parameter. The bug may lead to extreme unfairness in the presence of early packet loss. See details in: http://www.cs.caltech.edu/~weixl/technical/ns2linux/known_linux/index.html#vegas Switch the order of if (tp-snd_cwnd = tp-snd_ssthresh) statement and if (diff gamma) statement to eliminate the problem. Signed-off-by: Xiaoliang (David) Wei http://www.davidwei.org --- linux-2.6.22.9-fixedvegas/net/ipv4/tcp_vegas.c | 35 - 1 files changed, 17 insertions(+), 18 deletions(-) diff -uprN linux-2.6.22.9/net/ipv4/tcp_vegas.c linux-2.6.22.9-fixedvegas/net/ipv4/tcp_vegas.c --- linux-2.6.22.9/net/ipv4/tcp_vegas.c 2007-09-26 11:03:01.0 -0700 +++ linux-2.6.22.9-fixedvegas/net/ipv4/tcp_vegas.c 2007-10-08 22:44:46.0 -0700 @@ -266,26 +266,25 @@ static void tcp_vegas_cong_avoid(struct */ diff = (old_wnd V_PARAM_SHIFT) - target_cwnd; - if (tp-snd_cwnd = tp-snd_ssthresh) { - /* Slow start. */ - if (diff gamma) { - /* Going too fast. Time to slow down -* and switch to congestion avoidance. -*/ - tp-snd_ssthresh = 2; + if (diff gamma tp-snd_ssthresh 2 ) { + /* Going too fast. Time to slow down +* and switch to congestion avoidance. +*/ + tp-snd_ssthresh = 2; - /* Set cwnd to match the actual rate -* exactly: -* cwnd = (actual rate) * baseRTT -* Then we add 1 because the integer -* truncation robs us of full link -* utilization. -*/ - tp-snd_cwnd = min(tp-snd_cwnd, - (target_cwnd - V_PARAM_SHIFT)+1); + /* Set cwnd to match the actual rate +* exactly: +* cwnd = (actual rate) * baseRTT +* Then we add 1 because the integer +* truncation robs us of full link +* utilization. +*/ + tp-snd_cwnd = min(tp-snd_cwnd, + (target_cwnd + V_PARAM_SHIFT)+1); - } + } else if (tp-snd_cwnd = tp-snd_ssthresh) { + /* Slow start. */ tcp_slow_start(tp); } else { /* Congestion avoidance. */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, Oct 09, 2007 at 08:08:22AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: Your program works fine here. You did run it as root, right? Yes and ETH_P_ALL is the only protocol that prints anything I am on 2.6.22 ETH_P_ARP works too. Did you try stracing it? Just did and now it works, it didn't yesterday :( But if I change protocol to ETH_P_MOBITEX, I don't get any pkgs(I did change protocol on sending side too) Did you change eth_type_trans() to catch your proto? -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [IPV6] Defer IPv6 device initialization until a valid qdisc is specified
To judge the timing for DAD, netif_carrier_ok() is used. However, there is a possibility that dev-qdisc stays noop_qdisc even if netif_carrier_ok() returns true. In that case, DAD NS is not sent out. We need to defer the IPv6 device initialization until a valid qdisc is specified. Signed-off-by: Mitsuru Chinen [EMAIL PROTECTED] Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED] --- net/ipv6/addrconf.c | 13 ++--- 1 files changed, 10 insertions(+), 3 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 6d5c3c2..da8c79c 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -74,6 +74,7 @@ #include net/tcp.h #include net/ip.h #include net/netlink.h +#include net/pkt_sched.h #include linux/if_tunnel.h #include linux/rtnetlink.h @@ -213,6 +214,12 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT; const struct in6_addr in6addr_loopback = IN6ADDR_LOOPBACK_INIT; +/* Check if a valid qdisc is available */ +static inline int addrconf_qdisc_ok(struct net_device *dev) +{ + return (dev-qdisc != noop_qdisc); +} + static void addrconf_del_timer(struct inet6_ifaddr *ifp) { if (del_timer(ifp-timer)) @@ -384,7 +391,7 @@ static struct inet6_dev * ipv6_add_dev(struct net_device *dev) } #endif - if (netif_running(dev) netif_carrier_ok(dev)) + if (netif_running(dev) addrconf_qdisc_ok(dev)) ndev-if_flags |= IF_READY; ipv6_mc_init_dev(ndev); @@ -2283,7 +2290,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, break; if (event == NETDEV_UP) { - if (!netif_carrier_ok(dev)) { + if (!addrconf_qdisc_ok(dev)) { /* device is not ready yet. */ printk(KERN_INFO ADDRCONF(NETDEV_UP): %s: @@ -2295,7 +2302,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event, if (idev) idev-if_flags |= IF_READY; } else { - if (!netif_carrier_ok(dev)) { + if (!addrconf_qdisc_ok(dev)) { /* device is still not ready. */ break; } -- 1.5.2.2 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver
Jay Vosburgh wrote: Jeff Garzik [EMAIL PROTECTED] wrote: Moni Shoua wrote: Jay Vosburgh wrote: ACK patches 3 - 9. Roland, are you comfortable with the IB changes in patches 1 and 2? Jeff, when Roland acks patches 1 and 2, please apply all 9. -J Hi Jeff, Roland acked the IPoIB patches. If you haven't done so already can you please apply them? I'm not sure when 2.6.24 is going to open and I'm afraid to miss it. hrm, I don't see them in my inbox for some reason. can someone bounce them to me? or give me a git tree to pull from? Moni, can you repost the patch series to Jeff, and put the appropriate Acked-by lines in for myself (patches 3 - 8) and Roland (patches 1 and 2)? You can probably leave off the netdev and openfabrics lists, but cc me. -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED] Hi Jeff, I don't commits of the patches in http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=summary (I hope that I'm looking in the right place). Did you get them? thanks MoniS - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, 2007-10-09 at 11:13 +0400, Evgeniy Polyakov wrote: On Tue, Oct 09, 2007 at 08:08:22AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: Your program works fine here. You did run it as root, right? Yes and ETH_P_ALL is the only protocol that prints anything I am on 2.6.22 ETH_P_ARP works too. Did you try stracing it? Just did and now it works, it didn't yesterday :( But if I change protocol to ETH_P_MOBITEX, I don't get any pkgs(I did change protocol on sending side too) Did you change eth_type_trans() to catch your proto? Just fond out something: if I redirect my prog like so: ./sniff log and press Ctrl-C after a packet has been sent to it, it does NOT work. I don't get ANY output in my log file, not even the printf(-\n) appears. But if I run whithout redirect it works(at least with ETH_P_BPQ) Anyone else see this too? Jocke - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: Did you change eth_type_trans() to catch your proto? Just fond out something: if I redirect my prog like so: ./sniff log and press Ctrl-C after a packet has been sent to it, it does NOT work. I don't get ANY output in my log file, not even the printf(-\n) appears. But if I run whithout redirect it works(at least with ETH_P_BPQ) Anyone else see this too? I only tested with IP and ARP packets - I can not say when packet was actually received and written to log, but it does start filling up, but maybe not immediately - it can be output buffering in shell though. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, 2007-10-09 at 11:34 +0400, Evgeniy Polyakov wrote: On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: Did you change eth_type_trans() to catch your proto? Just fond out something: if I redirect my prog like so: ./sniff log and press Ctrl-C after a packet has been sent to it, it does NOT work. I don't get ANY output in my log file, not even the printf(-\n) appears. But if I run whithout redirect it works(at least with ETH_P_BPQ) Anyone else see this too? I only tested with IP and ARP packets - I can not say when packet was actually received and written to log, but it does start filling up, but maybe not immediately - it can be output buffering in shell though. Did you receive many packets? Seems like when I receive just 1 or 2 pkgs I get the empty log. If I strace ./sniff log I see that recvfrom gets pkgs, but there are no trace of writes. I guess this is a bash(3.2_p17) or glibc(2.5.-r4) bug? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund wrote: Just fond out something: if I redirect my prog like so: ./sniff log and press Ctrl-C after a packet has been sent to it, it does NOT work. I don't get ANY output in my log file, not even the printf(-\n) appears. But if I run whithout redirect it works(at least with ETH_P_BPQ) Anyone else see this too? Um, this is what we call buffering. You either need to turn buffering off with setbuf(3) or you should install a SIGINT handler to flush the output before exiting. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHES] TX batching
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 07:35:20 PM: I dont see something from Krishna's approach that i can take and reuse. This maybe because my old approaches have evolved from the same path. There is a long list but as a sample: i used to do a lot more work while holding the queue lock which i have now moved post queue lock; i dont have any speacial interfaces/tricks just for batching, i provide hints to the core of how much the driver can take etc etc. I have offered Krishna co-authorship if he makes the IPOIB driver to work on my patches, that offer still stands if he chooses to take it. My feeling is that since the approaches are very different, it would be a good idea to test the two for performance. Do you mind me doing that? Ofcourse others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way. *** Previous mail ** Hi Jamal, If you don't mind, I am trying to run your approach vs mine to get some results for comparison. For starters, I am having issues with iperf when using your infrastructure code with my IPoIB driver - about 100MB is sent and then everything stops for some reason. The changes in the IPoIB driver that I made to support batching is to set BTX, set xmit_win, and dynamically reduce xmit_win on every xmit and increase xmit_win on every xmit completion. Is there anything else that is required from the driver? thanks, - KK - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, Oct 09, 2007 at 09:51:25AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: On Tue, 2007-10-09 at 11:34 +0400, Evgeniy Polyakov wrote: On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: Did you change eth_type_trans() to catch your proto? Just fond out something: if I redirect my prog like so: ./sniff log and press Ctrl-C after a packet has been sent to it, it does NOT work. I don't get ANY output in my log file, not even the printf(-\n) appears. But if I run whithout redirect it works(at least with ETH_P_BPQ) Anyone else see this too? I only tested with IP and ARP packets - I can not say when packet was actually received and written to log, but it does start filling up, but maybe not immediately - it can be output buffering in shell though. Did you receive many packets? Seems like when I receive just 1 or 2 pkgs I get the empty log. If I strace ./sniff log I see that recvfrom gets pkgs, but there are no trace of writes. I guess this is a bash(3.2_p17) or glibc(2.5.-r4) bug? I received 1396 bytes of logs before terminated, which is 27 ARP packets, so there is quite big number of packet there. Your application works correctly (although you swapped source and destination ethernet fields) - buffered writing is not a bug, if you do not like it, use write(2), mmap(2) or turn buffering off as Herbert suggested. To get packets with your own ethernet protocol number you have to change eth_type_trans() function in kernel, which parses ethernet header and returns protocol number, under some conditions it will just return your number automatically, but you should check it. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: raw PF_PACKET protocol selection
On Tue, 2007-10-09 at 12:17 +0400, Evgeniy Polyakov wrote: On Tue, Oct 09, 2007 at 09:51:25AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: On Tue, 2007-10-09 at 11:34 +0400, Evgeniy Polyakov wrote: On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) wrote: Did you change eth_type_trans() to catch your proto? Just fond out something: if I redirect my prog like so: ./sniff log and press Ctrl-C after a packet has been sent to it, it does NOT work. I don't get ANY output in my log file, not even the printf(-\n) appears. But if I run whithout redirect it works(at least with ETH_P_BPQ) Anyone else see this too? I only tested with IP and ARP packets - I can not say when packet was actually received and written to log, but it does start filling up, but maybe not immediately - it can be output buffering in shell though. Did you receive many packets? Seems like when I receive just 1 or 2 pkgs I get the empty log. If I strace ./sniff log I see that recvfrom gets pkgs, but there are no trace of writes. I guess this is a bash(3.2_p17) or glibc(2.5.-r4) bug? I received 1396 bytes of logs before terminated, which is 27 ARP packets, so there is quite big number of packet there. Your application works correctly (although you swapped source and destination ethernet fields) - buffered writing is not a bug, if you do not like it, use write(2), mmap(2) or turn buffering off as Herbert suggested. To get packets with your own ethernet protocol number you have to change eth_type_trans() function in kernel, which parses ethernet header and returns protocol number, under some conditions it will just return your number automatically, but you should check it. I thought that flushing was done automatically when SIGINT happened but I was apperently wrong. Sorry for the noise and thanks for your help. I have added setvbuf calls to make it unbuffered. Jocke - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
net-2.6.24 rebased...
The TCP bug fix that went into Linus's tree today created some conflicts with net-2.6.24, so I rebased: kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.24.git I took the opportunity to integrate fix patches into the patches which introduced those bugs or build failures. BTW, when the tree gets this big I tend to move the pre-rebase tree to: kernel.org:/pub/scm/linux/kernel/git/davem/bak-net-2.6.24.git It can be useful if a mistake is made during a rebase and we need to figure out what happened. 9.1MB, 739 changesets, keep those patches flowing! Enjoy - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][NET-2.6.24] Remove double dev-flags checking when calling dev_close()
The unregister_netdevice() and dev_change_net_namespace() both check for dev-flags to be IFF_UP before calling the dev_close(), but the dev_close() checks for IFF_UP itself, so remove those unneeded checks. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/net/core/dev.c b/net/core/dev.c index e7e728a..1e169a5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3893,8 +3893,7 @@ void unregister_netdevice(struct net_dev BUG_ON(dev-reg_state != NETREG_REGISTERED); /* If device is running, close it first. */ - if (dev-flags IFF_UP) - dev_close(dev); + dev_close(dev); /* And unlink it from device chain. */ unlist_netdevice(dev); @@ -4018,8 +4017,7 @@ int dev_change_net_namespace(struct net_ */ /* If device is running close it first. */ - if (dev-flags IFF_UP) - dev_close(dev); + dev_close(dev); /* And unlink it from device chain */ err = -ENODEV; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
Hi Peter, Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote on 10/09/2007 04:03:42 AM: true, that needs some resolution. Heres a hand-waving thought: Assuming all packets of a specific map end up in the same qdiscn queue, it seems feasible to ask the qdisc scheduler to give us enough packages (ive seen people use that terms to refer to packets) for each hardware ring's available space. With the patches i posted, i do that via dev-xmit_win that assumes only one view of the driver; essentially a single ring. If that is doable, then it is up to the driver to say i have space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on what scheduling scheme the driver implements - the dev-blist can stay the same. Its a handwave, so there may be issues there and there could be better ways to handle this. Note: The other issue that needs resolving that i raised earlier was in regards to multiqueue running on multiple cpus servicing different rings concurently. I can see the qdisc being modified to send batches per queue_mapping. This shouldn't be too difficult, and if we had the xmit_win per queue (in the subqueue struct like Dave pointed out). I hope my understanding of multiqueue is correct for this mail to make sense :-) Isn't it enough that the multiqueue+batching drivers handle skbs belonging to different queue's themselves, instead of qdisc having to figure that out? This will reduce costs for most skbs that are neither batched nor sent to multiqueue devices. Eg, driver can keep processing skbs and put to the correct tx_queue as long as mapping remains the same. If the mapping changes, it posts earlier skbs (with the correct lock) and then iterates for the other skbs that have the next different mapping, and so on. (This is required only if driver is supposed to transmit 1 skb in one call, otherwise it is not an issue) Alternatively, supporting drivers could return a different code on mapping change, like: NETDEV_TX_MAPPING_CHANGED (for batching only) so that qdisc_run() could retry. Would that work? Secondly having xmit_win per queue: would it help in multiple skb case? Currently there is no way to tell qdisc to dequeue skbs from a particular band - it returns skb from highest priority band. thanks, - KK - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:28:27 +0530 Isn't it enough that the multiqueue+batching drivers handle skbs belonging to different queue's themselves, instead of qdisc having to figure that out? This will reduce costs for most skbs that are neither batched nor sent to multiqueue devices. Eg, driver can keep processing skbs and put to the correct tx_queue as long as mapping remains the same. If the mapping changes, it posts earlier skbs (with the correct lock) and then iterates for the other skbs that have the next different mapping, and so on. The complexity in most of these suggestions is beginning to drive me a bit crazy :-) This should be the simplest thing in the world, when TX queue has space, give it packets. Period. When I hear suggestions like have the driver pick the queue in -hard_start_xmit() and return some special status if the queue becomes different. you know, I really begin to wonder :-) If we have to go back, get into the queueing layer locks, have these special cases, and whatnot, what's the point? This code should eventually be able to run lockless all the way to the TX queue handling code of the driver. The queueing code should know what TX queue the packet will be bound for, and always precisely invoke the driver in a state where the driver can accept the packet. Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
Hi Dave, David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Isn't it enough that the multiqueue+batching drivers handle skbs belonging to different queue's themselves, instead of qdisc having to figure that out? This will reduce costs for most skbs that are neither batched nor sent to multiqueue devices. Eg, driver can keep processing skbs and put to the correct tx_queue as long as mapping remains the same. If the mapping changes, it posts earlier skbs (with the correct lock) and then iterates for the other skbs that have the next different mapping, and so on. The complexity in most of these suggestions is beginning to drive me a bit crazy :-) This should be the simplest thing in the world, when TX queue has space, give it packets. Period. When I hear suggestions like have the driver pick the queue in -hard_start_xmit() and return some special status if the queue becomes different. you know, I really begin to wonder :-) If we have to go back, get into the queueing layer locks, have these special cases, and whatnot, what's the point? I understand your point, but the qdisc code itself needs almost no change, as small as: qdisc_restart() { ... case NETDEV_TX_MAPPING_CHANGED: /* * Driver sent some skbs from one mapping, and found others * are for different queue_mapping. Try again. */ ret = 1; /* guaranteed to have atleast 1 skb in batch list */ break; ... } Alternatively if the driver does all the dirty work, qdisc needs no change at all. However, I am not sure if this addresses all the concerns raised by you, Peter, Jamal, others. This code should eventually be able to run lockless all the way to the TX queue handling code of the driver. The queueing code should know what TX queue the packet will be bound for, and always precisely invoke the driver in a state where the driver can accept the packet. This sounds like a good idea :) I need to think more on this, esp as my batching sends multiple skbs of possibly different mappings to device, and those skbs stay in batch list if driver couldn't send them out. thanks, - KK - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. thanks, - KK - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:51:14 +0530 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NET-2.6.24] Remove double dev-flags checking when calling dev_close()
Pavel Emelyanov wrote: The unregister_netdevice() and dev_change_net_namespace() both check for dev-flags to be IFF_UP before calling the dev_close(), but the dev_close() checks for IFF_UP itself, so remove those unneeded checks. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/net/core/dev.c b/net/core/dev.c index e7e728a..1e169a5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3893,8 +3893,7 @@ void unregister_netdevice(struct net_dev BUG_ON(dev-reg_state != NETREG_REGISTERED); /* If device is running, close it first. */ - if (dev-flags IFF_UP) - dev_close(dev); + dev_close(dev); /* And unlink it from device chain. */ unlist_netdevice(dev); @@ -4018,8 +4017,7 @@ int dev_change_net_namespace(struct net_ */ /* If device is running close it first. */ - if (dev-flags IFF_UP) - dev_close(dev); + dev_close(dev); /* And unlink it from device chain */ err = -ENODEV; One side effect of this patch: might_sleep() is now called unconditionally. If that is irrelevant... ACK. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NET-2.6.24] Remove double dev-flags checking when calling dev_close()
Jeff Garzik wrote: Pavel Emelyanov wrote: The unregister_netdevice() and dev_change_net_namespace() both check for dev-flags to be IFF_UP before calling the dev_close(), but the dev_close() checks for IFF_UP itself, so remove those unneeded checks. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/net/core/dev.c b/net/core/dev.c index e7e728a..1e169a5 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3893,8 +3893,7 @@ void unregister_netdevice(struct net_dev BUG_ON(dev-reg_state != NETREG_REGISTERED); /* If device is running, close it first. */ -if (dev-flags IFF_UP) -dev_close(dev); +dev_close(dev); /* And unlink it from device chain. */ unlist_netdevice(dev); @@ -4018,8 +4017,7 @@ int dev_change_net_namespace(struct net_ */ /* If device is running close it first. */ -if (dev-flags IFF_UP) -dev_close(dev); +dev_close(dev); /* And unlink it from device chain */ err = -ENODEV; One side effect of this patch: might_sleep() is now called unconditionally. That's not a big deal - see, the synchronize_net() is called further in both places unconditionally (!) and also calls might_sleep(), so this is OK. If that is irrelevant... ACK. Thanks :) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][NETNS] Don't memset() netns to zero manually
The newly created net namespace is set to 0 with memset() in setup_net(). The setup_net() is also called for the init_net_ns(), which is zeroed naturally as a global var. So remove this memset and allocate new nets with the kmem_cache_zalloc(). Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 0e0ca6d..6f71db8 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -24,7 +24,7 @@ EXPORT_SYMBOL_GPL(init_net); static struct net *net_alloc(void) { - return kmem_cache_alloc(net_cachep, GFP_KERNEL); + return kmem_cache_zalloc(net_cachep, GFP_KERNEL); } static void net_free(struct net *net) @@ -90,7 +90,6 @@ static int setup_net(struct net *net) struct pernet_operations *ops; int error; - memset(net, 0, sizeof(struct net)); atomic_set(net-count, 1); atomic_set(net-use_count, 0); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection
On Tue, 9 Oct 2007, TAKANO Ryousei wrote: From: Ilpo Järvinen [EMAIL PROTECTED] Subject: Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection Date: Mon, 8 Oct 2007 14:11:55 +0300 (EEST) On Sun, 7 Oct 2007, TAKANO Ryousei wrote: From: Ilpo Järvinen [EMAIL PROTECTED] Just couple of thoughts, not that this change itself is incorrect... ...Added this line back, it's important. :-) In case sacktag uses fastpath, this code won't be executed for the skb's that we would like to check (those with SACKED_RETRANS set, that are below the fastpath_skb_hint). We will eventually deal with the whole queue when fastpath_skb_hint gets set to NULL, with the next cumulative ACK that fully ACKs an skb at the latest. Maybe there's a need for a larger surgery than this to fix it. I think we need additional field to tcp_sock to avoid doing a full-walk per ACK: I think the problem occurs in slowpath. For example, in case when the receiver detects and sends back a new SACK block, the sender may fail to detect loss of a retransmitted packet. ...No, the slow path is very likely to _correct_ the problem! The problem occurs because fastpath _skips_ processing of skbs below fastpath_skb_hint whereas slowpath starts from tcp_write_queue_head. The retransmitted skbs we're interested in reside exactly on that skipped range. Sorry, my explanaton is insufficient. This problem occurs even if we are in the slowpath. [...long explination snip...] Is it corrent? Yes, we're dealing with two related, but different problems here. I understood well what you're trying to fix, and even acknowledged your change (see what I added back from my original reply above). The far worse problem which I describe, however, occurs with fastpath only (well, very unlikely to occur with slow-path but it could in theory), has been like this since the hints were added. And it's preventing the execution of this portion you fixed (until slowpath is taken). ...What makes the other problem even nastier is the fact that it becomes more and more significant when fastpath (or it's equivalent) skips more and more skb processing. It's good that we noticed it now... I tried to do a fix to that other problem, it seems that the solution will be intertwined with the problem you're describing so that I in fact ended removing the code block where your key modification is and placing somewhat similar sequence gathering elsewhere... Posting it shortly. -- i.
[RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness
Lost_retrans handling of sacktag was found to be flawed, two problems that were found have an intertwined solution. Fastpath problem has existed since hints got added and the other problem has probably been there even longer than that. ...This change may add non-trivial processing cost. Initial sketch, only compile tested. This will become more and more useful, when sacktag starts to process less and less skbs, which hopefully happens quite soon... :-) Sadly enough it will probably then be consuming part of the benefits we're able to achieve by less skb walking... First one is trivial, so Dave might want to apply it already. -- i. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] [TCP]: Separate lost_retrans loop into own function
Follows own function for each task principle, this is really somewhat separate task being done in sacktag. Also reduces indentation. In addition, added ack_seq local var to break some long lines fixed coding style things. Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 80 +++--- 1 files changed, 43 insertions(+), 37 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a04e78e..56aa34a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1106,6 +1106,47 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, return !before(start_seq, end_seq - tp-max_window); } +/* Check for lost retransmit. This superb idea is borrowed from ratehalving. + * Event C. Later note: FACK people cheated me again 8), we have to account + * for reordering! Ugly, but should help. + */ +static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; + int flag = 0; + + tcp_for_write_queue(skb, sk) { + u32 ack_seq = TCP_SKB_CB(skb)-ack_seq; + + if (skb == tcp_send_head(sk)) + break; + if (after(TCP_SKB_CB(skb)-seq, lost_retrans)) + break; + if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una)) + continue; + + if ((TCP_SKB_CB(skb)-sacked TCPCB_SACKED_RETRANS) + after(lost_retrans, ack_seq) + (tcp_is_fack(tp) || +!before(lost_retrans, +ack_seq + tp-reordering * tp-mss_cache))) { + TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_RETRANS; + tp-retrans_out -= tcp_skb_pcount(skb); + + /* clear lost hint */ + tp-retransmit_skb_hint = NULL; + + if (!(TCP_SKB_CB(skb)-sacked (TCPCB_LOST|TCPCB_SACKED_ACKED))) { + tp-lost_out += tcp_skb_pcount(skb); + TCP_SKB_CB(skb)-sacked |= TCPCB_LOST; + flag |= FLAG_DATA_SACKED; + NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT); + } + } + } + return flag; +} static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb, struct tcp_sack_block_wire *sp, int num_sacks, @@ -1422,43 +1463,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ } } - /* Check for lost retransmit. This superb idea is -* borrowed from ratehalving. Event C. -* Later note: FACK people cheated me again 8), -* we have to account for reordering! Ugly, -* but should help. -*/ - if (lost_retrans icsk-icsk_ca_state == TCP_CA_Recovery) { - struct sk_buff *skb; - - tcp_for_write_queue(skb, sk) { - if (skb == tcp_send_head(sk)) - break; - if (after(TCP_SKB_CB(skb)-seq, lost_retrans)) - break; - if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una)) - continue; - if ((TCP_SKB_CB(skb)-sackedTCPCB_SACKED_RETRANS) - after(lost_retrans, TCP_SKB_CB(skb)-ack_seq) - (tcp_is_fack(tp) || -!before(lost_retrans, -TCP_SKB_CB(skb)-ack_seq + tp-reordering * -tp-mss_cache))) { - TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_RETRANS; - tp-retrans_out -= tcp_skb_pcount(skb); - - /* clear lost hint */ - tp-retransmit_skb_hint = NULL; - - if (!(TCP_SKB_CB(skb)-sacked(TCPCB_LOST|TCPCB_SACKED_ACKED))) { - tp-lost_out += tcp_skb_pcount(skb); - TCP_SKB_CB(skb)-sacked |= TCPCB_LOST; - flag |= FLAG_DATA_SACKED; - NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT); - } - } - } - } + if (lost_retrans icsk-icsk_ca_state == TCP_CA_Recovery) + flag |= tcp_mark_lost_retrans(sk, lost_retrans); tcp_verify_left_out(tp); -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems
Detection implemented with lost_retrans must work also when fastpath is taken, yet most of the queue is skipped including (very likely) those retransmitted skb's we're interested in. This problem appeared when the hints got added, which removed a need to always walk over the whole write queue head. Therefore decicion for the lost_retrans worker loop entry must be separated from the sacktag processing more than it was necessary before. It turns out to be problematic to optimize the worker loop very heavily because ack_seqs of skb may have a number of discontinuity points. Maybe similar approach as currently is implemented could be attempted but that's becoming more and more complex because the trend is towards less skb walking in sacktag marker. Maybe after(highest_sack_end_seq, tp-high_seq) checking is not sufficiently accurate and causes entry too often in no-work-to-do cases. Since that's not known, I've separated solution to that from this patch. Noticed because of report against a related problem from TAKANO Ryousei [EMAIL PROTECTED]. He also provided a patch to that part of the problem. This patch includes solution to it (though this patch has to use somewhat different placement). TAKANO's description and patch is available here: http://marc.info/?l=linux-netdevm=119149311913288w=2 ...In short, TAKANO's problem is that end_seq the loop is using not necessarily the largest SACK block's end_seq because the current ACK may still have higher SACK blocks which are later by the loop. Cc: TAKANO Ryousei [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- net/ipv4/tcp_input.c | 37 ++--- 1 files changed, 22 insertions(+), 15 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 56aa34a..b90c2fc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1109,27 +1109,34 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, /* Check for lost retransmit. This superb idea is borrowed from ratehalving. * Event C. Later note: FACK people cheated me again 8), we have to account * for reordering! Ugly, but should help. + * + * Search retransmitted skbs from write_queue that were sent when snd_nxt was + * less than what is now known to be received by the other end (derived from + * SACK blocks by the caller). */ -static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans) +static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; int flag = 0; + int cnt = 0; tcp_for_write_queue(skb, sk) { u32 ack_seq = TCP_SKB_CB(skb)-ack_seq; if (skb == tcp_send_head(sk)) break; - if (after(TCP_SKB_CB(skb)-seq, lost_retrans)) + if (cnt == tp-retrans_out) break; if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una)) continue; - if ((TCP_SKB_CB(skb)-sacked TCPCB_SACKED_RETRANS) - after(lost_retrans, ack_seq) + if (!(TCP_SKB_CB(skb)-sacked TCPCB_SACKED_RETRANS)) + continue; + + if (after(received_upto, ack_seq) (tcp_is_fack(tp) || -!before(lost_retrans, +!before(received_upto, ack_seq + tp-reordering * tp-mss_cache))) { TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_RETRANS; tp-retrans_out -= tcp_skb_pcount(skb); @@ -1143,6 +1150,8 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans) flag |= FLAG_DATA_SACKED; NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT); } + } else { + cnt += tcp_skb_pcount(skb); } } return flag; @@ -1193,7 +1202,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)3; int reord = tp-packets_out; int prior_fackets; - u32 lost_retrans = 0; + u32 highest_sack_end_seq = 0; int flag = 0; int found_dup_sack = 0; int cached_fack_count; @@ -1383,11 +1392,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ continue; } - if ((sackedTCPCB_SACKED_RETRANS) - after(end_seq, TCP_SKB_CB(skb)-ack_seq) - (!lost_retrans || after(end_seq, lost_retrans))) - lost_retrans = end_seq; - if (!in_sack) continue; @@ -1441,9 +1445,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb,
[RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases
This addition of lost_retrans_low to tcp_sock might be unnecessary, it's not clear how often lost_retrans worker is executed when there wasn't work to do. Cc: TAKANO Ryousei [EMAIL PROTECTED] Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED] --- include/linux/tcp.h |2 ++ net/ipv4/tcp_input.c | 15 +++ net/ipv4/tcp_output.c |2 ++ 3 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 9ff456e..c5b94c1 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -348,6 +348,8 @@ struct tcp_sock { int lost_cnt_hint; int retransmit_cnt_hint; + u32 lost_retrans_low; /* Sent seq after any rxmit (lowest) */ + u16 advmss; /* Advertised MSS */ u16 prior_ssthresh; /* ssthresh saved at recovery start */ u32 lost_out; /* Lost packets */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index b90c2fc..e7bc720 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1112,7 +1112,8 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack, * * Search retransmitted skbs from write_queue that were sent when snd_nxt was * less than what is now known to be received by the other end (derived from - * SACK blocks by the caller). + * SACK blocks by the caller). Also calculate the lowest snd_nxt among the + * remaining retransmitted skbs to avoid some costly processing per ACKs. */ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) { @@ -1120,6 +1121,7 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) struct sk_buff *skb; int flag = 0; int cnt = 0; + u32 new_low_seq = 0; tcp_for_write_queue(skb, sk) { u32 ack_seq = TCP_SKB_CB(skb)-ack_seq; @@ -1150,10 +1152,15 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto) flag |= FLAG_DATA_SACKED; NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT); } - } else { + } else if (!new_low_seq || before(ack_seq, new_low_seq)) { + new_low_seq = ack_seq; cnt += tcp_skb_pcount(skb); } } + + if (tp-retrans_out) + tp-lost_retrans_low = new_low_seq; + return flag; } @@ -1468,8 +1475,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_ } } - if (tp-retrans_out highest_sack_end_seq - after(highest_sack_end_seq, tp-high_seq) + if (tp-retrans_out + after(highest_sack_end_seq, tp-lost_retrans_low) icsk-icsk_ca_state == TCP_CA_Recovery) flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5329675..324b420 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1914,6 +1914,8 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) printk(KERN_DEBUG retrans_out leaked.\n); } #endif + if (!tp-retrans_out) + tp-lost_retrans_low = tp-snd_nxt; TCP_SKB_CB(skb)-sacked |= TCPCB_RETRANS; tp-retrans_out += tcp_skb_pcount(skb); -- 1.5.0.6 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH][NETNS] Make ifindex generation per-namespace
Currently indexes for netdevices come sequentially one by one, and the same stays true even for devices that are created for namespaces. Side effects of this are: * lo device has not 1 index in a namespace. This may break some userspace that relies on it (and AFAIR something really broke in OpenVZ VEs without this); * after some time namespaces will have devices with indexes like 100 os similar. This might be confusing for a human (tools will not mind). So move the (currently global and static) ifindex variable on the struct net, making the indexes allocation look more like on a standalone machine. Moreover - when we have indexes intersect between namespaces, we may catch more BUGs in the future related to wrong device was found for a given index. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 93aa87d..83a18d0 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -29,6 +29,8 @@ struct net { struct list_headdev_base_head; struct hlist_head *dev_name_head; struct hlist_head *dev_index_head; + + int ifindex; }; #ifdef CONFIG_NET diff --git a/net/core/dev.c b/net/core/dev.c index e7e728a..a08ed8c 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3443,12 +3443,11 @@ int dev_ioctl(struct net *net, unsigned */ static int dev_new_index(struct net *net) { - static int ifindex; for (;;) { - if (++ifindex = 0) - ifindex = 1; - if (!__dev_get_by_index(net, ifindex)) - return ifindex; + if (++net-ifindex = 0) + net-ifindex = 1; + if (!__dev_get_by_index(net, net-ifindex)) + return net-ifindex; } } - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller wrote: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:51:14 +0530 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Jeff drivers/net/atl1/atl1_main.c | 16 +--- drivers/net/chelsio/cxgb2.c|1 - drivers/net/chelsio/sge.c | 20 +--- drivers/net/e1000/e1000_main.c |6 +- drivers/net/ixgb/ixgb_main.c | 24 drivers/net/pasemi_mac.c |2 +- drivers/net/rionet.c | 19 +++ drivers/net/spider_net.c |2 +- drivers/net/sungem.c | 17 ++--- drivers/net/tehuti.c | 12 +--- drivers/net/tehuti.h |3 +-- 11 files changed, 32 insertions(+), 90 deletions(-) diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c index 4c728f1..03e94fe 100644 --- a/drivers/net/atl1/atl1_main.c +++ b/drivers/net/atl1/atl1_main.c @@ -1665,10 +1665,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev) len -= skb-data_len; - if (unlikely(skb-len == 0)) { - dev_kfree_skb_any(skb); - return NETDEV_TX_OK; - } + WARN_ON(skb-len == 0); param.data = 0; param.tso.tsopu = 0; @@ -1703,11 +1700,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev) } } - if (!spin_trylock_irqsave(adapter-lock, flags)) { - /* Can't get lock - tell upper layer to requeue */ - dev_printk(KERN_DEBUG, adapter-pdev-dev, tx locked\n); - return NETDEV_TX_LOCKED; - } + spin_lock_irqsave(adapter-lock, flags); if (atl1_tpd_avail(adapter-tpd_ring) count) { /* not enough descriptors */ @@ -1749,8 +1742,11 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct net_device *netdev) atl1_tx_map(adapter, skb, 1 == val); atl1_tx_queue(adapter, count, param); netdev-trans_start = jiffies; + spin_unlock_irqrestore(adapter-lock, flags); + atl1_update_mailbox(adapter); + return NETDEV_TX_OK; } @@ -2301,8 +2297,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev, */ /* netdev-features |= NETIF_F_TSO; */ - netdev-features |= NETIF_F_LLTX; - /* * patch for some L1 of old version, * the final version of L1 may not need these diff --git a/drivers/net/chelsio/cxgb2.c b/drivers/net/chelsio/cxgb2.c index 2dbf8dc..0aba7e7 100644 --- a/drivers/net/chelsio/cxgb2.c +++ b/drivers/net/chelsio/cxgb2.c @@ -1084,7 +1084,6 @@ static int __devinit init_one(struct pci_dev *pdev, netdev-mem_end = mmio_start + mmio_len - 1; netdev-priv = adapter; netdev-features |= NETIF_F_SG | NETIF_F_IP_CSUM; - netdev-features |= NETIF_F_LLTX; adapter-flags |= RX_CSUM_ENABLED | TCP_CSUM_CAPABLE; if (pci_using_dac) diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c index ffa7e64..84f5869 100644 --- a/drivers/net/chelsio/sge.c +++ b/drivers/net/chelsio/sge.c @@ -1739,8 +1739,7 @@ static int t1_sge_tx(struct sk_buff *skb, struct adapter *adapter, struct cmdQ *q = sge-cmdQ[qid]; unsigned int credits, pidx, genbit, count, use_sched_skb = 0; - if (!spin_trylock(q-lock)) - return NETDEV_TX_LOCKED; + spin_lock(q-lock); reclaim_completed_tx(sge, q); @@ -1817,12 +1816,12 @@ use_sched: } if (use_sched_skb) { - if (spin_trylock(q-lock)) { - credits = q-size - q-in_use; - skb = NULL; - goto use_sched; - } + spin_lock(q-lock); + credits = q-size - q-in_use; + skb = NULL; + goto use_sched; } + return NETDEV_TX_OK; } @@ -1977,13 +1976,12 @@ static void sge_tx_reclaim_cb(unsigned long data) for (i = 0; i SGE_CMDQ_N; ++i) { struct cmdQ *q = sge-cmdQ[i]; - if (!spin_trylock(q-lock)) - continue; + spin_lock(q-lock); reclaim_completed_tx(sge, q); - if (i == 0 q-in_use) {/* flush pending credits */ + if (i == 0 q-in_use)/* flush pending credits */ writel(F_CMDQ0_ENABLE,
ax88796: add superh to kconfig depencencies
ax88796: add superh to kconfig depencencies This patch adds sh architecture support to the ax88796 kconfig. Signed-off-by: Magnus Damm [EMAIL PROTECTED] --- This is a broken out version of the larger patch recently posted to netdev: http://www.mail-archive.com/netdev@vger.kernel.org/msg47278.html drivers/net/Kconfig |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- 0002/drivers/net/Kconfig +++ work/drivers/net/Kconfig2007-10-09 19:37:27.0 +0900 @@ -218,7 +218,7 @@ source drivers/net/arm/Kconfig config AX88796 tristate ASIX AX88796 NE2000 clone support - depends on ARM || MIPS + depends on ARM || MIPS || SUPERH select CRC32 select MII help - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: David Miller wrote: I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Awsome :) -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
Herbert Xu wrote: On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote: David Miller wrote: I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Awsome :) Note my patch is just to get the maintainers going. :) I'm not going to commit that, since I don't have any way to test any of the drivers I touched (but I wouldn't scream if it appeared in net-2.6.24 either) Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness
On Tue, 9 Oct 2007, Ilpo Järvinen wrote: Lost_retrans handling of sacktag was found to be flawed, two problems that were found have an intertwined solution. Fastpath problem has existed since hints got added and the other problem has probably been there even longer than that. ...This change may add non-trivial processing cost. Initial sketch, only compile tested. This will become more and more useful, when sacktag starts to process less and less skbs, which hopefully happens quite soon... :-) Sadly enough it will probably then be consuming part of the benefits we're able to achieve by less skb walking... First one is trivial, so Dave might want to apply it already. Hmm, forgot to add -n to git-format-patch. Since it's currently RFC, I won't bother to resubmit with numbers unless somebody really wants that. Here's the correct ordering, if it's not obvious from the patches alone: $ git-log -n 3 --pretty=oneline HEAD | cat c628f5040cf31323f9166d41cb46070aa7be8cc5 [TCP]: Limit processing lost_retrans loop to work-to-do cases b7388980b7bfa1b9159a65a81b9501df43c67b08 [TCP]: Fix lost_retrans loop vs fastpath problems a1c39117034c58bf786a5941dd7e9a5968439007 [TCP]: Separate lost_retrans loop into own function -- i.
Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote: Driver might ask for 10 and we send 10, but LLTX driver might fail to get lock and return TX_LOCKED. I haven't seen your code in greater detail, but don't you requeue in that case too? For others drivers that are non-batching and LLTX, it is possible - at the moment in my patch i whine that the driver is buggy. I will fix this up so it checks for NETIF_F_BTX. Thanks for pointing the above use case. cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHES] TX batching
On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote: My feeling is that since the approaches are very different, My concern is the approaches are different only for short periods of time. For example, I do requeueing, have xmit_win, have -end_xmit, do batching from core etc; if you see value in any of these concepts, they will appear in your patches and this goes on a loop. Perhaps what we need is a referee and use our energies in something more positive. it would be a good idea to test the two for performance. Which i dont mind as long as it has an analysis that goes with it. If all you post is heres what netperf showed, it is not useful at all. There are also a lot of affecting variables. For example, is the receiver a bottleneck? To make it worse, I could demonstrate to you that if i slowed down the driver and allowed more packets to queue up on the qdisc, batching will do well. In the past my feeling is you glossed over such details and i am sucker for things like that - hence the conflict. Do you mind me doing that? Ofcourse others and/or you are more than welcome to do the same. I had sent a note to you yesterday about this, please let me know either way. I responded to you - but it may have been lost in the noise; heres a copy: http://marc.info/?l=linux-netdevm=119185137124008w=2 cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][E1000E] some cleanups
On Mon, 2007-08-10 at 15:40 -0700, Kok, Auke wrote: My biggest problem with the patch as you sent it that it's a tonload of changes and no implicit benefit immediately as I can see. The patch looks scary but is pretty tame when you apply it and stare at it. I would really have to see the LLTX change as well to make a wellfounded decision whether it is the right thing to go and overhaul this part of the driver or not :) I'm not particularly against the changes per se though. so please, if needed offlist send me those LLTX changes as well. Dont worry about it - i didnt get any love for a similar patch i did for tg3 either ;- I will hold onto it for now. I could do the un-LLTX outside of this - would you like that for both e1000/e1000e ? cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][NET_SCHED] explict hold dev tx lock
On Tue, 2007-09-10 at 12:00 +0800, Herbert Xu wrote: OK, after waking up a bit more me too;- What I'm worried about is would we see worse behaviour with drivers that do all their TX clean-up with the TX lock held Good point Herbert. When i looked around i only found one driver that behaved like that; some IBM mainframe one from the looks of it. That driver did a lot of other obscure things (i think they maintain their own napi calls etc), so it didnt worry me very much. IIRC, I think it could be fixed to do what tg3 and relatives like bnx do (I really like the approach) - just didnt have the time to chase it. There are a _lot_ more drivers that have no respect for netif_tx_lock and implement their own locking down in the driver. Those already suffer from the phenomena you describe whether TX_LOCK is held or not. (which would cause qdisc_restart to spin while this is happening)? Yes, with such a driver, we spin in the worst case. But that provides opportunities for optimization of a driver behaving that way; two approaches off top of my head: a) we could prune the tx descriptors on the the tx path since it is safe to do so and reduce the amount of time spent doing such work in the napi poll or b) Have the napi side do a trylock (sort of what the e1000 attempts to do) and reschedule the poll to retry. #b is fair because the cost of a queue_lock goes up as the number of cpus goes up (which is the case i was optimizing for). The cost of tx lock is contention by two cpus in worst case. Thoughts? cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
On Mon, 8 Oct 2007, Steve Wise wrote: The correct solution, IMO, is to enhance the core low level 4-tuple allocation services to be more generic (eg: not be tied to a struct sock). Then the host tcp stack and the host rdma stack can allocate TCP/iWARP ports/4tuples from this common exported service and share the port space. This allocation service could also be used by other deep adapters like iscsi adapters if needed. As a developer of an RDMA ULP, NFS-RDMA, I like this approach because it will simplify the configuration of an RDMA device and the services that use it. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ax88796: add superh to kconfig depencencies
On Tue, Oct 09, 2007 at 09:51:38PM +0900, Magnus Damm wrote: ax88796: add superh to kconfig depencencies This patch adds sh architecture support to the ax88796 kconfig. Signed-off-by: Magnus Damm [EMAIL PROTECTED] Acked-by: Paul Mundt [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] ipv4: kernel panic when only one unsecured port available
Steps to reproduce: Server: [EMAIL PROTECTED] ~]# cat /etc/exports /export *(ro,insecure) // there is insecure ... I am using ports like 1024 to 61000 [EMAIL PROTECTED] ~] service nfs restart Client: [EMAIL PROTECTED] ~]# echo 32768 32768 /proc/sys/net/ipv4/ip_local_port_range 32768 32768 // two same numbers, for ex 32769 32769 etc. [EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/ip_local_port_range 32768 32768 [EMAIL PROTECTED] ~]# mount server:/export /import Actual results: Kernel always panics [PATCH] ipv4: kernel panic when only one unsecured port available Patch prevents division by zero. Kernel panics if only one unsecured port available. Signed-off-by: Anton Arapov [EMAIL PROTECTED] --- net/ipv4/inet_connection_sock.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index fbe7714..00ad079 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -80,7 +80,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo, int low = sysctl_local_port_range[0]; int high = sysctl_local_port_range[1]; int remaining = (high - low) + 1; - int rover = net_random() % (high - low) + low; + int rover = net_random() % remaining + low; do { head = hashinfo-bhash[inet_bhashfn(rover, hashinfo-bhash_size)]; -- Anton Arapov, [EMAIL PROTECTED] Kernel Development, Red Hat GPG Key ID: 0x6FA8C812 pgp0y91epZLHV.pgp Description: PGP signature
[0/7] IPsec: More work on async crypto on output
Hi Dave: Here's another round of patches which ends with the moving down of the state lock into x-type-output. I need to stop getting distracted by fixing bugs and concentrate on creating them :) Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] [IPSEC]: Remove bogus ref count in xfrm_secpath_reject
[IPSEC]: Remove bogus ref count in xfrm_secpath_reject Constructs of the form xfrm_state_hold(x); foo(x); xfrm_state_put(x); tend to be broken because foo is either synchronous where this is totally unnecessary or if foo is asynchronous then the reference count is in the wrong spot. In the case of xfrm_secpath_reject, the function is synchronous and therefore we should just kill the reference count. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- net/xfrm/xfrm_policy.c |6 +- 1 files changed, 1 insertion(+), 5 deletions(-) diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index 76f172f..af27c19 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -1682,17 +1682,13 @@ static inline int xfrm_secpath_reject(int idx, struct sk_buff *skb, struct flowi *fl) { struct xfrm_state *x; - int err; if (!skb-sp || idx 0 || idx = skb-sp-len) return 0; x = skb-sp-xvec[idx]; if (!x-type-reject) return 0; - xfrm_state_hold(x); - err = x-type-reject(x, skb, fl); - xfrm_state_put(x); - return err; + return x-type-reject(x, skb, fl); } /* When skb is transformed back to its native form, we have to - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/7] [IPSEC]: Store IPv6 nh pointer in mac_header on output
[IPSEC]: Store IPv6 nh pointer in mac_header on output Current the x-mode-output functions store the IPv6 nh pointer in the skb network header. This is inconvenient because the network header then has to be fixed up before the packet can leave the IPsec stack. The mac header field is unused on output so we can use that to store this instead. This patch does that and removes the network header fix-up in xfrm_output. It also uses ipv6_hdr where appropriate in the x-type-output functions. There is also a minor clean-up in esp4 to make it use the same code as esp6 to help any subsequent effort to merge the two. Lastly it kills two redundant skb_set_* statements in BEET that were simply copied over from transport mode. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- net/ipv4/esp4.c |2 +- net/ipv6/ah6.c |6 +++--- net/ipv6/esp6.c |7 --- net/ipv6/ipcomp6.c |6 +++--- net/ipv6/mip6.c | 12 ++-- net/ipv6/xfrm6_mode_beet.c | 16 +++- net/ipv6/xfrm6_mode_ro.c| 13 +++-- net/ipv6/xfrm6_mode_transport.c | 13 +++-- net/ipv6/xfrm6_mode_tunnel.c| 13 +++-- net/ipv6/xfrm6_tunnel.c |2 +- net/xfrm/xfrm_output.c |2 -- 11 files changed, 46 insertions(+), 46 deletions(-) diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 0f62af9..ffd5653 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -59,7 +59,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb) tail[clen - skb-len - 2] = (clen - skb-len) - 2; pskb_put(skb, trailer, clen - skb-len); - __skb_push(skb, skb-data - skb_network_header(skb)); + __skb_push(skb, -skb_network_offset(skb)); top_iph = ip_hdr(skb); esph = (struct ip_esp_hdr *)(skb_network_header(skb) + top_iph-ihl * 4); diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c index ae68a90..ff904a7 100644 --- a/net/ipv6/ah6.c +++ b/net/ipv6/ah6.c @@ -235,11 +235,11 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff *skb) char hdrs[0]; } *tmp_ext; - top_iph = (struct ipv6hdr *)skb-data; + top_iph = ipv6_hdr(skb); top_iph-payload_len = htons(skb-len - sizeof(*top_iph)); - nexthdr = *skb_network_header(skb); - *skb_network_header(skb) = IPPROTO_AH; + nexthdr = *skb_mac_header(skb); + *skb_mac_header(skb) = IPPROTO_AH; /* When there are no extension headers, we only need to save the first * 8 bytes of the base IP header. diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 0c5fb81..9fc1940 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -88,11 +88,12 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb) tail[clen-skb-len - 2] = (clen - skb-len) - 2; pskb_put(skb, trailer, clen - skb-len); - top_iph = (struct ipv6hdr *)__skb_push(skb, hdr_len); + __skb_push(skb, -skb_network_offset(skb)); + top_iph = ipv6_hdr(skb); esph = (struct ipv6_esp_hdr *)skb_transport_header(skb); top_iph-payload_len = htons(skb-len + alen - sizeof(*top_iph)); - *(skb_tail_pointer(trailer) - 1) = *skb_network_header(skb); - *skb_network_header(skb) = IPPROTO_ESP; + *(skb_tail_pointer(trailer) - 1) = *skb_mac_header(skb); + *skb_mac_header(skb) = IPPROTO_ESP; esph-spi = x-id.spi; esph-seq_no = htonl(XFRM_SKB_CB(skb)-seq); diff --git a/net/ipv6/ipcomp6.c b/net/ipv6/ipcomp6.c index 91b2a75..71a14c0 100644 --- a/net/ipv6/ipcomp6.c +++ b/net/ipv6/ipcomp6.c @@ -157,15 +157,15 @@ static int ipcomp6_output(struct xfrm_state *x, struct sk_buff *skb) pskb_trim(skb, hdr_len + dlen + sizeof(struct ip_comp_hdr)); /* insert ipcomp header and replace datagram */ - top_iph = (struct ipv6hdr *)skb-data; + top_iph = ipv6_hdr(skb); top_iph-payload_len = htons(skb-len - sizeof(struct ipv6hdr)); ipch = (struct ipv6_comp_hdr *)start; - ipch-nexthdr = *skb_network_header(skb); + ipch-nexthdr = *skb_mac_header(skb); ipch-flags = 0; ipch-cpi = htons((u16 )ntohl(x-id.spi)); - *skb_network_header(skb) = IPPROTO_COMP; + *skb_mac_header(skb) = IPPROTO_COMP; out_ok: return 0; diff --git a/net/ipv6/mip6.c b/net/ipv6/mip6.c index 8a1399c..7261c29 100644 --- a/net/ipv6/mip6.c +++ b/net/ipv6/mip6.c @@ -153,11 +153,11 @@ static int mip6_destopt_output(struct xfrm_state *x, struct sk_buff *skb) u8 nexthdr; int len; - iph = (struct ipv6hdr *)skb-data; + iph = ipv6_hdr(skb); iph-payload_len = htons(skb-len - sizeof(*iph)); - nexthdr = *skb_network_header(skb); - *skb_network_header(skb) = IPPROTO_DSTOPTS; + nexthdr = *skb_mac_header(skb); + *skb_mac_header(skb) = IPPROTO_DSTOPTS; dstopt = (struct
[PATCH 3/7] [IPSEC]: Remove gratuitous km wake-up events on ACQUIRE
[IPSEC]: Remove gratuitous km wake-up events on ACQUIRE There is no point in waking people up when creating/updating larval states because they'll just go back to sleep again as larval states by definition cannot be found by xfrm_state_find. We should only wake them up when the larvals mature or die. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- net/xfrm/xfrm_state.c |2 -- 1 files changed, 2 deletions(-) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index a00745a..0d07f6b 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -848,7 +848,6 @@ static struct xfrm_state *__find_acq_core(unsigned short family, u8 mode, u32 re hlist_add_head(x-bydst, xfrm_state_bydst+h); h = xfrm_src_hash(daddr, saddr, family); hlist_add_head(x-bysrc, xfrm_state_bysrc+h); - wake_up(km_waitq); xfrm_state_num++; @@ -1311,7 +1310,6 @@ xfrm_alloc_spi(struct xfrm_state *x, __be32 minspi, __be32 maxspi) h = xfrm_spi_hash(x-id.daddr, x-id.spi, x-id.proto, x-props.family); hlist_add_head(x-byspi, xfrm_state_byspi+h); spin_unlock_bh(xfrm_state_lock); - wake_up(km_waitq); } } EXPORT_SYMBOL(xfrm_alloc_spi); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7] [IPSEC]: Move common code into xfrm_alloc_spi
[IPSEC]: Move common code into xfrm_alloc_spi This patch moves some common code that conceptually belongs to the xfrm core from af_key/xfrm_user into xfrm_alloc_spi. In particular, the spin lock on the state is now taken inside xfrm_alloc_spi. Previously it also protected the construction of the response PF_KEY/XFRM messages to user-space. This is inconsistent as other identical constructions are not protected by the state lock. This is bad because they in fact should be protected but only in certain spots (so as not to hold the lock for too long which may cause packet drops). The SPI byte order conversion has also been moved. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- include/net/xfrm.h|2 +- net/key/af_key.c | 29 - net/xfrm/xfrm_state.c | 26 -- net/xfrm/xfrm_user.c | 13 - 4 files changed, 37 insertions(+), 33 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index 064a4ca..1c116dc 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1084,7 +1084,7 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(u8 type, int dir, struct xfrm_policy *xfrm_policy_byid(u8, int dir, u32 id, int delete, int *err); int xfrm_policy_flush(u8 type, struct xfrm_audit *audit_info); u32 xfrm_get_acqseq(void); -void xfrm_alloc_spi(struct xfrm_state *x, __be32 minspi, __be32 maxspi); +extern int xfrm_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi); struct xfrm_state * xfrm_find_acq(u8 mode, u32 reqid, u8 proto, xfrm_address_t *daddr, xfrm_address_t *saddr, int create, unsigned short family); diff --git a/net/key/af_key.c b/net/key/af_key.c index ff5c3d0..143d46f 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -1253,8 +1253,11 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff *skb, struct sadb_msg *h struct sadb_x_sa2 *sa2; struct sadb_address *saddr, *daddr; struct sadb_msg *out_hdr; + struct sadb_spirange *range; struct xfrm_state *x = NULL; int mode; + int err; + u32 min_spi, max_spi; u32 reqid; u8 proto; unsigned short family; @@ -1309,25 +1312,17 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff *skb, struct sadb_msg *h if (x == NULL) return -ENOENT; - resp_skb = ERR_PTR(-ENOENT); - - spin_lock_bh(x-lock); - if (x-km.state != XFRM_STATE_DEAD) { - struct sadb_spirange *range = ext_hdrs[SADB_EXT_SPIRANGE-1]; - u32 min_spi, max_spi; + min_spi = 0x100; + max_spi = 0x0fff; - if (range != NULL) { - min_spi = range-sadb_spirange_min; - max_spi = range-sadb_spirange_max; - } else { - min_spi = 0x100; - max_spi = 0x0fff; - } - xfrm_alloc_spi(x, htonl(min_spi), htonl(max_spi)); - if (x-id.spi) - resp_skb = pfkey_xfrm_state2msg(x, 0, 3); + range = ext_hdrs[SADB_EXT_SPIRANGE-1]; + if (range) { + min_spi = range-sadb_spirange_min; + max_spi = range-sadb_spirange_max; } - spin_unlock_bh(x-lock); + + err = xfrm_alloc_spi(x, min_spi, max_spi); + resp_skb = err ? ERR_PTR(err) : pfkey_xfrm_state2msg(x, 0, 3); if (IS_ERR(resp_skb)) { xfrm_state_put(x); diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 0d07f6b..344f0a6 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -1275,26 +1275,33 @@ u32 xfrm_get_acqseq(void) } EXPORT_SYMBOL(xfrm_get_acqseq); -void -xfrm_alloc_spi(struct xfrm_state *x, __be32 minspi, __be32 maxspi) +int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high) { unsigned int h; struct xfrm_state *x0; + int err = -ENOENT; + __be32 minspi = htonl(low); + __be32 maxspi = htonl(high); + + spin_lock_bh(x-lock); + if (x-km.state == XFRM_STATE_DEAD) + goto unlock; + err = 0; if (x-id.spi) - return; + goto unlock; + + err = -ENOENT; if (minspi == maxspi) { x0 = xfrm_state_lookup(x-id.daddr, minspi, x-id.proto, x-props.family); if (x0) { xfrm_state_put(x0); - return; + goto unlock; } x-id.spi = minspi; } else { u32 spi = 0; - u32 low = ntohl(minspi); - u32 high = ntohl(maxspi); for (h=0; hhigh-low+1; h++) { spi = low + net_random()%(high-low+1); x0 = xfrm_state_lookup(x-id.daddr, htonl(spi), x-id.proto, x-props.family); @@ -1310,7 +1317,14 @@ xfrm_alloc_spi(struct xfrm_state
[PATCH 6/7] [IPSEC]: Lock state when copying non-atomic fields to user-space
[IPSEC]: Lock state when copying non-atomic fields to user-space This patch adds locking so that when we're copying non-atomic fields such as life-time or coaddr to user-space we don't get a partial result. For af_key I've changed every instance of pfkey_xfrm_state2msg apart from expiration notification to include the keys and life-times. This is in-line with XFRM behaviour. The actual cases affected are: * pfkey_getspi: No change as we don't have any keys to copy. * key_notify_sa: + ADD/UPD: This wouldn't work otherwise. + DEL: It can't hurt. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- net/key/af_key.c | 35 +-- net/xfrm/xfrm_user.c | 14 -- 2 files changed, 33 insertions(+), 16 deletions(-) diff --git a/net/key/af_key.c b/net/key/af_key.c index 143d46f..7969f8a 100644 --- a/net/key/af_key.c +++ b/net/key/af_key.c @@ -655,7 +655,8 @@ static inline int pfkey_mode_to_xfrm(int mode) } } -static struct sk_buff * pfkey_xfrm_state2msg(struct xfrm_state *x, int add_keys, int hsc) +static struct sk_buff *__pfkey_xfrm_state2msg(struct xfrm_state *x, + int add_keys, int hsc) { struct sk_buff *skb; struct sadb_msg *hdr; @@ -1009,6 +1010,24 @@ static struct sk_buff * pfkey_xfrm_state2msg(struct xfrm_state *x, int add_keys, return skb; } + +static inline struct sk_buff *pfkey_xfrm_state2msg(struct xfrm_state *x) +{ + struct sk_buff *skb; + + spin_lock_bh(x-lock); + skb = __pfkey_xfrm_state2msg(x, 1, 3); + spin_unlock_bh(x-lock); + + return skb; +} + +static inline struct sk_buff *pfkey_xfrm_state2msg_expire(struct xfrm_state *x, + int hsc) +{ + return __pfkey_xfrm_state2msg(x, 0, hsc); +} + static struct xfrm_state * pfkey_msg2xfrm_state(struct sadb_msg *hdr, void **ext_hdrs) { @@ -1322,7 +1341,7 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff *skb, struct sadb_msg *h } err = xfrm_alloc_spi(x, min_spi, max_spi); - resp_skb = err ? ERR_PTR(err) : pfkey_xfrm_state2msg(x, 0, 3); + resp_skb = err ? ERR_PTR(err) : pfkey_xfrm_state2msg(x); if (IS_ERR(resp_skb)) { xfrm_state_put(x); @@ -1412,12 +1431,8 @@ static int key_notify_sa(struct xfrm_state *x, struct km_event *c) { struct sk_buff *skb; struct sadb_msg *hdr; - int hsc = 3; - - if (c-event == XFRM_MSG_DELSA) - hsc = 0; - skb = pfkey_xfrm_state2msg(x, 0, hsc); + skb = pfkey_xfrm_state2msg(x); if (IS_ERR(skb)) return PTR_ERR(skb); @@ -1529,7 +1544,7 @@ static int pfkey_get(struct sock *sk, struct sk_buff *skb, struct sadb_msg *hdr, if (x == NULL) return -ESRCH; - out_skb = pfkey_xfrm_state2msg(x, 1, 3); + out_skb = pfkey_xfrm_state2msg(x); proto = x-id.proto; xfrm_state_put(x); if (IS_ERR(out_skb)) @@ -1709,7 +1724,7 @@ static int dump_sa(struct xfrm_state *x, int count, void *ptr) struct sk_buff *out_skb; struct sadb_msg *out_hdr; - out_skb = pfkey_xfrm_state2msg(x, 1, 3); + out_skb = pfkey_xfrm_state2msg(x); if (IS_ERR(out_skb)) return PTR_ERR(out_skb); @@ -2910,7 +2925,7 @@ static int key_notify_sa_expire(struct xfrm_state *x, struct km_event *c) else hsc = 1; - out_skb = pfkey_xfrm_state2msg(x, 0, hsc); + out_skb = pfkey_xfrm_state2msg_expire(x, hsc); if (IS_ERR(out_skb)) return PTR_ERR(out_skb); diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 2cbbe5e..5238f6a 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -507,8 +507,16 @@ static int copy_to_user_state_extra(struct xfrm_state *x, struct xfrm_usersa_info *p, struct sk_buff *skb) { + spin_lock_bh(x-lock); copy_to_user_state(x, p); + if (x-coaddr) + NLA_PUT(skb, XFRMA_COADDR, sizeof(*x-coaddr), x-coaddr); + + if (x-lastused) + NLA_PUT_U64(skb, XFRMA_LASTUSED, x-lastused); + spin_unlock_bh(x-lock); + if (x-aalg) NLA_PUT(skb, XFRMA_ALG_AUTH, alg_len(x-aalg), x-aalg); if (x-ealg) @@ -522,12 +530,6 @@ static int copy_to_user_state_extra(struct xfrm_state *x, if (x-security copy_sec_ctx(x-security, skb) 0) goto nla_put_failure; - if (x-coaddr) - NLA_PUT(skb, XFRMA_COADDR, sizeof(*x-coaddr), x-coaddr); - - if (x-lastused) - NLA_PUT_U64(skb, XFRMA_LASTUSED, x-lastused); - return 0; nla_put_failure: - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More
[PATCH 5/7] [XFRM] user: Move attribute copying code into copy_to_user_state_extra
[XFRM] user: Move attribute copying code into copy_to_user_state_extra Here's a good example of code duplication leading to code rot. The notification patch did its own netlink message creation for xfrm states. It duplicated code that was already in dump_one_state. Guess what, the next time (and the time after) when someone updated dump_one_state the notification path got zilch. This patch moves that code from dump_one_state to copy_to_user_state_extra and uses it in xfrm_notify_sa too. Unfortunately whoever updates this still needs to update xfrm_sa_len since the notification path wants to know the exact size for allocation. At least I've added a comment saying so and if someone still forgest, we'll have a WARN_ON telling us so. I also changed the security size calculation to use xfrm_user_sec_ctx since that's what we actually put into the skb. However it makes no practical difference since it has the same size as xfrm_sec_ctx. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- net/xfrm/xfrm_user.c | 76 +++ 1 files changed, 47 insertions(+), 29 deletions(-) diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c index 52c7fce..2cbbe5e 100644 --- a/net/xfrm/xfrm_user.c +++ b/net/xfrm/xfrm_user.c @@ -483,9 +483,9 @@ struct xfrm_dump_info { static int copy_sec_ctx(struct xfrm_sec_ctx *s, struct sk_buff *skb) { - int ctx_size = sizeof(struct xfrm_sec_ctx) + s-ctx_len; struct xfrm_user_sec_ctx *uctx; struct nlattr *attr; + int ctx_size = sizeof(*uctx) + s-ctx_len; attr = nla_reserve(skb, XFRMA_SEC_CTX, ctx_size); if (attr == NULL) @@ -502,23 +502,11 @@ static int copy_sec_ctx(struct xfrm_sec_ctx *s, struct sk_buff *skb) return 0; } -static int dump_one_state(struct xfrm_state *x, int count, void *ptr) +/* Don't change this without updating xfrm_sa_len! */ +static int copy_to_user_state_extra(struct xfrm_state *x, + struct xfrm_usersa_info *p, + struct sk_buff *skb) { - struct xfrm_dump_info *sp = ptr; - struct sk_buff *in_skb = sp-in_skb; - struct sk_buff *skb = sp-out_skb; - struct xfrm_usersa_info *p; - struct nlmsghdr *nlh; - - if (sp-this_idx sp-start_idx) - goto out; - - nlh = nlmsg_put(skb, NETLINK_CB(in_skb).pid, sp-nlmsg_seq, - XFRM_MSG_NEWSA, sizeof(*p), sp-nlmsg_flags); - if (nlh == NULL) - return -EMSGSIZE; - - p = nlmsg_data(nlh); copy_to_user_state(x, p); if (x-aalg) @@ -540,6 +528,35 @@ static int dump_one_state(struct xfrm_state *x, int count, void *ptr) if (x-lastused) NLA_PUT_U64(skb, XFRMA_LASTUSED, x-lastused); + return 0; + +nla_put_failure: + return -EMSGSIZE; +} + +static int dump_one_state(struct xfrm_state *x, int count, void *ptr) +{ + struct xfrm_dump_info *sp = ptr; + struct sk_buff *in_skb = sp-in_skb; + struct sk_buff *skb = sp-out_skb; + struct xfrm_usersa_info *p; + struct nlmsghdr *nlh; + int err; + + if (sp-this_idx sp-start_idx) + goto out; + + nlh = nlmsg_put(skb, NETLINK_CB(in_skb).pid, sp-nlmsg_seq, + XFRM_MSG_NEWSA, sizeof(*p), sp-nlmsg_flags); + if (nlh == NULL) + return -EMSGSIZE; + + p = nlmsg_data(nlh); + + err = copy_to_user_state_extra(x, p, skb); + if (err) + goto nla_put_failure; + nlmsg_end(skb, nlh); out: sp-this_idx++; @@ -547,7 +564,7 @@ out: nla_put_failure: nlmsg_cancel(skb, nlh); - return -EMSGSIZE; + return err; } static int xfrm_dump_sa(struct sk_buff *skb, struct netlink_callback *cb) @@ -1973,6 +1990,14 @@ static inline size_t xfrm_sa_len(struct xfrm_state *x) l += nla_total_size(sizeof(*x-calg)); if (x-encap) l += nla_total_size(sizeof(*x-encap)); + if (x-security) + l += nla_total_size(sizeof(struct xfrm_user_sec_ctx) + + x-security-ctx_len); + if (x-coaddr) + l += nla_total_size(sizeof(*x-coaddr)); + + /* Must count this as this may become non-zero behind our back. */ + l += nla_total_size(sizeof(x-lastused)); return l; } @@ -2018,23 +2043,16 @@ static int xfrm_notify_sa(struct xfrm_state *x, struct km_event *c) p = nla_data(attr); } - copy_to_user_state(x, p); - - if (x-aalg) - NLA_PUT(skb, XFRMA_ALG_AUTH, alg_len(x-aalg), x-aalg); - if (x-ealg) - NLA_PUT(skb, XFRMA_ALG_CRYPT, alg_len(x-ealg), x-ealg); - if (x-calg) - NLA_PUT(skb, XFRMA_ALG_COMP, sizeof(*(x-calg)), x-calg); - - if (x-encap) - NLA_PUT(skb, XFRMA_ENCAP, sizeof(*x-encap), x-encap); + if
[PATCH 7/7] [IPSEC]: Move state lock into x-type-output
[IPSEC]: Move state lock into x-type-output This patch releases the lock on the state before calling x-type-output. It also adds the lock to the spots where they're currently needed. Most of those places (all except mip6) are expected to disappear with async crypto. Signed-off-by: Herbert Xu [EMAIL PROTECTED] --- net/ipv4/ah4.c |7 ++- net/ipv4/esp4.c| 10 -- net/ipv6/ah6.c |9 ++--- net/ipv6/esp6.c| 10 -- net/ipv6/mip6.c|4 net/xfrm/xfrm_output.c |8 6 files changed, 36 insertions(+), 12 deletions(-) diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c index 58af298..3513149 100644 --- a/net/ipv4/ah4.c +++ b/net/ipv4/ah4.c @@ -5,6 +5,7 @@ #include net/ah.h #include linux/crypto.h #include linux/pfkeyv2.h +#include linux/spinlock.h #include net/icmp.h #include net/protocol.h #include asm/scatterlist.h @@ -97,10 +98,14 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb) ah-reserved = 0; ah-spi = x-id.spi; ah-seq_no = htonl(XFRM_SKB_CB(skb)-seq); + + spin_lock_bh(x-lock); err = ah_mac_digest(ahp, skb, ah-auth_data); + memcpy(ah-auth_data, ahp-work_icv, ahp-icv_trunc_len); + spin_unlock_bh(x-lock); + if (err) goto error; - memcpy(ah-auth_data, ahp-work_icv, ahp-icv_trunc_len); top_iph-tos = iph-tos; top_iph-ttl = iph-ttl; diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index ffd5653..452910d 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -8,6 +8,7 @@ #include linux/kernel.h #include linux/pfkeyv2.h #include linux/random.h +#include linux/spinlock.h #include net/icmp.h #include net/protocol.h #include net/udp.h @@ -66,6 +67,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb) top_iph-tot_len = htons(skb-len + alen); *(skb_tail_pointer(trailer) - 1) = top_iph-protocol; + spin_lock_bh(x-lock); + /* this is non-NULL only with UDP Encapsulation */ if (x-encap) { struct xfrm_encap_tmpl *encap = x-encap; @@ -111,7 +114,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb) if (unlikely(nfrags ESP_NUM_FAST_SG)) { sg = kmalloc(sizeof(struct scatterlist)*nfrags, GFP_ATOMIC); if (!sg) - goto error; + goto unlock; } skb_to_sgvec(skb, sg, esph-enc_data+esp-conf.ivlen-skb-data, clen); err = crypto_blkcipher_encrypt(desc, sg, sg, clen); @@ -120,7 +123,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb) } while (0); if (unlikely(err)) - goto error; + goto unlock; if (esp-conf.ivlen) { memcpy(esph-enc_data, esp-conf.ivec, esp-conf.ivlen); @@ -133,6 +136,9 @@ static int esp_output(struct xfrm_state *x, struct sk_buff *skb) memcpy(pskb_put(skb, trailer, alen), esp-auth.work_icv, alen); } +unlock: + spin_unlock_bh(x-lock); + ip_send_check(top_iph); error: diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c index ff904a7..c51d775 100644 --- a/net/ipv6/ah6.c +++ b/net/ipv6/ah6.c @@ -29,6 +29,7 @@ #include net/ah.h #include linux/crypto.h #include linux/pfkeyv2.h +#include linux/spinlock.h #include linux/string.h #include net/icmp.h #include net/ipv6.h @@ -284,12 +285,14 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff *skb) ah-reserved = 0; ah-spi = x-id.spi; ah-seq_no = htonl(XFRM_SKB_CB(skb)-seq); + + spin_lock_bh(x-lock); err = ah_mac_digest(ahp, skb, ah-auth_data); - if (err) - goto error_free_iph; memcpy(ah-auth_data, ahp-work_icv, ahp-icv_trunc_len); + spin_unlock_bh(x-lock); - err = 0; + if (err) + goto error_free_iph; memcpy(top_iph, tmp_base, sizeof(tmp_base)); if (tmp_ext) { diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 9fc1940..7355bb0 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -34,6 +34,7 @@ #include linux/kernel.h #include linux/pfkeyv2.h #include linux/random.h +#include linux/spinlock.h #include net/icmp.h #include net/ipv6.h #include net/protocol.h @@ -98,6 +99,8 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb) esph-spi = x-id.spi; esph-seq_no = htonl(XFRM_SKB_CB(skb)-seq); + spin_lock_bh(x-lock); + if (esp-conf.ivlen) { if (unlikely(!esp-conf.ivinitted)) { get_random_bytes(esp-conf.ivec, esp-conf.ivlen); @@ -112,7 +115,7 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff *skb) if (unlikely(nfrags ESP_NUM_FAST_SG)) { sg = kmalloc(sizeof(struct scatterlist)*nfrags, GFP_ATOMIC);
Re: [PATCH][NETNS] Make ifindex generation per-namespace
Pavel Emelyanov wrote: Currently indexes for netdevices come sequentially one by one, and the same stays true even for devices that are created for namespaces. Side effects of this are: * lo device has not 1 index in a namespace. This may break some userspace that relies on it (and AFAIR something really broke in OpenVZ VEs without this); * after some time namespaces will have devices with indexes like 100 os similar. This might be confusing for a human (tools will not mind). So move the (currently global and static) ifindex variable on the struct net, making the indexes allocation look more like on a standalone machine. Moreover - when we have indexes intersect between namespaces, we may catch more BUGs in the future related to wrong device was found for a given index. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied and tested against netns49. Works fine. Acked-by: Daniel Lezcano [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND]: RNDIS host: adjust MTU instead of refusing to talk to devices with MTU 1.5k
On Tue, 2007-10-09 at 08:18 -0700, David Brownell wrote: This patch makes the host RNDIS driver talk to RNDIS devices with an MTU less than 1.5k, instead of refusing to talk to such a device. Please apply. Signed-Off-by: Thomas Sailer [EMAIL PROTECTED] Acked-by: David Brownell [EMAIL PROTECTED] ... but patches in drivers/net go to netdev, not linux-usb-devel. And it'd be nice if the patch comment mentioned some particular device that's been observed to act this way. Ok, next try. The device in question is a low power radio device that speaks the bpqether protocol over ethernet. bpqether frames are never longer than 320 bytes, so there is no need to support 1.5kbyte MTU in the device firmware. Tom Signed-Off-by: Thomas Sailer [EMAIL PROTECTED] --- 1/drivers/net/usb/rndis_host.c.orig 2007-09-04 17:51:11.0 +0200 +++ 2/drivers/net/usb/rndis_host.c 2007-09-04 17:54:26.0 +0200 @@ -512,11 +512,18 @@ } tmp = le32_to_cpu(u.init_c-max_transfer_size); if (tmp dev-hard_mtu) { - dev_err(intf-dev, - dev can't take %u byte packets (max %u)\n, - dev-hard_mtu, tmp); - retval = -EINVAL; - goto fail_and_release; + if (tmp = net-hard_header_len) { + dev_warn(intf-dev, +dev can't take %u byte packets (max %u)\n, +net-hard_header_len+1, tmp); + retval = -EINVAL; + goto fail_and_release; + } + dev-hard_mtu = tmp; + net-mtu = dev-hard_mtu - net-hard_header_len; + dev_warn(intf-dev, +dev can't take %u byte packets (max %u), adjusting MTU to %u\n, +dev-hard_mtu, tmp, net-mtu); } /* REVISIT: peripheral alignment request is ignored ... */ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/6][NET-2.6.24] Consolidate private allocations in seq files
Many (very many) seq files in net/ allocate some private data to use it later (mostly for iteration state). All this code was obviously get using copy-paste method, so move it into one place. Almost all of these places either set this private to 0, or keep uninitialized. Some places, however, pre-initialize this area, but there are few of them. The seq_open_private() call just opens the seq file with allocated and set to zero area. The __seq_open_private() call makes the same, but returns the allocated memory to the called to be initialized later. I didn't measure how much of the .text section this saves, but I suspect a lot of :) As far as the code is concerned, this set saves ~450 lines. Such thing may be useful for any subsystem, but I found this mostly in the networking code and fixed only it (for a while). Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6][NET-2.6.24] Introduce the seq_open_private()
This function allocates the zeroed chunk of memory and call seq_open(). The __seq_open_private() helper returns the allocated memory to make it possible for the caller to initialize it. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/fs/seq_file.c b/fs/seq_file.c index bbb19be..ca71c11 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -429,6 +429,39 @@ int seq_release_private(struct inode *in } EXPORT_SYMBOL(seq_release_private); +void *__seq_open_private(struct file *f, const struct seq_operations *ops, + int psize) +{ + int rc; + void *private; + struct seq_file *seq; + + private = kzalloc(psize, GFP_KERNEL); + if (private == NULL) + goto out; + + rc = seq_open(f, ops); + if (rc 0) + goto out_free; + + seq = f-private_data; + seq-private = private; + return private; + +out_free: + kfree(private); +out: + return NULL; +} +EXPORT_SYMBOL(__seq_open_private); + +int seq_open_private(struct file *filp, const struct seq_operations *ops, + int psize) +{ + return __seq_open_private(filp, ops, psize) ? 0 : -ENOMEM; +} +EXPORT_SYMBOL(seq_open_private); + int seq_putc(struct seq_file *m, char c) { if (m-count m-size) { diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 83783ab..8bf1e05 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -46,6 +46,8 @@ int seq_path(struct seq_file *, struct v int single_open(struct file *, int (*)(struct seq_file *, void *), void *); int single_release(struct inode *, struct file *); +void *__seq_open_private(struct file *, const struct seq_operations *, int); +int seq_open_private(struct file *, const struct seq_operations *, int); int seq_release_private(struct inode *, struct file *); #define SEQ_START_TOKEN ((void *)1) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6][NET-2.6.24] Make core networking code use seq_open_private
This concerns the ipv4 and ipv6 code mostly, but also the netlink and unix sockets. The netlink code is an example of how to use the __seq_open_private() call - it saves the net namespace on this private. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c index d824819..36d6798 100644 --- a/net/ipv4/arp.c +++ b/net/ipv4/arp.c @@ -1378,24 +1378,8 @@ static const struct seq_operations arp_s static int arp_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct neigh_seq_state *s = kzalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - - rc = seq_open(file, arp_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, arp_seq_ops, + sizeof(struct neigh_seq_state)); } static const struct file_operations arp_seq_fops = { diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c index 9fafbee..527a6e0 100644 --- a/net/ipv4/fib_hash.c +++ b/net/ipv4/fib_hash.c @@ -1039,24 +1039,8 @@ static const struct seq_operations fib_s static int fib_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct fib_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - - rc = seq_open(file, fib_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, fib_seq_ops, + sizeof(struct fib_iter_state)); } static const struct file_operations fib_seq_fops = { diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index be34bd5..81a8285 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -2379,25 +2379,8 @@ static const struct seq_operations fib_t static int fib_trie_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct fib_trie_iter *s = kmalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - - rc = seq_open(file, fib_trie_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; - memset(s, 0, sizeof(*s)); -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, fib_trie_seq_ops, + sizeof(struct fib_trie_iter)); } static const struct file_operations fib_trie_fops = { @@ -2500,25 +2483,8 @@ static const struct seq_operations fib_r static int fib_route_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct fib_trie_iter *s = kmalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - - rc = seq_open(file, fib_route_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; - memset(s, 0, sizeof(*s)); -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, fib_route_seq_ops, + sizeof(struct fib_trie_iter)); } static const struct file_operations fib_route_fops = { diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c index 2b6e59c..7dbc282 100644 --- a/net/ipv4/igmp.c +++ b/net/ipv4/igmp.c @@ -2410,23 +2410,8 @@ static const struct seq_operations igmp_ static int igmp_mc_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct igmp_mc_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - rc = seq_open(file, igmp_mc_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, igmp_mc_seq_ops, + sizeof(struct igmp_mc_iter_state)); } static const struct file_operations igmp_mc_seq_fops = { @@ -2584,23 +2569,8 @@ static const struct seq_operations igmp_ static int igmp_mcf_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct igmp_mcf_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - rc = seq_open(file, igmp_mcf_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, igmp_mcf_seq_ops, +
[PATCH 3/6][NET-2.6.24] Make netfilter code use the seq_open_private
Just switch to the consolidated calls. ipt_recent() has to initialize the private, so use the __seq_open_private() helper. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Cc: Patrick McHardy [EMAIL PROTECTED] --- diff --git a/net/ipv4/netfilter/ipt_recent.c b/net/ipv4/netfilter/ipt_recent.c index db2a798..11d39fb 100644 --- a/net/ipv4/netfilter/ipt_recent.c +++ b/net/ipv4/netfilter/ipt_recent.c @@ -381,25 +381,14 @@ static const struct seq_operations recen static int recent_seq_open(struct inode *inode, struct file *file) { struct proc_dir_entry *pde = PDE(inode); - struct seq_file *seq; struct recent_iter_state *st; - int ret; - st = kzalloc(sizeof(*st), GFP_KERNEL); + st = __seq_open_private(file, recent_seq_ops, sizeof(*st)); if (st == NULL) return -ENOMEM; - ret = seq_open(file, recent_seq_ops); - if (ret) { - kfree(st); - goto out; - } - st-table= pde-data; - seq = file-private_data; - seq-private = st; -out: - return ret; + return 0; } static ssize_t recent_proc_write(struct file *file, const char __user *input, diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c index a5ae2ea..741f3df 100644 --- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c +++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c @@ -174,22 +174,8 @@ static const struct seq_operations ct_se static int ct_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - struct ct_iter_state *st; - int ret; - - st = kzalloc(sizeof(struct ct_iter_state), GFP_KERNEL); - if (st == NULL) - return -ENOMEM; - ret = seq_open(file, ct_seq_ops); - if (ret) - goto out_free; - seq = file-private_data; - seq-private = st; - return ret; -out_free: - kfree(st); - return ret; + return seq_open_private(file, ct_seq_ops, + sizeof(struct ct_iter_state)); } static const struct file_operations ct_file_ops = { @@ -291,22 +277,8 @@ static const struct seq_operations exp_s static int exp_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - struct ct_expect_iter_state *st; - int ret; - - st = kzalloc(sizeof(struct ct_expect_iter_state), GFP_KERNEL); - if (!st) - return -ENOMEM; - ret = seq_open(file, exp_seq_ops); - if (ret) - goto out_free; - seq = file-private_data; - seq-private = st; - return ret; -out_free: - kfree(st); - return ret; + return seq_open_private(file, exp_seq_ops, + sizeof(struct ct_expect_iter_state)); } static const struct file_operations ip_exp_file_ops = { diff --git a/net/netfilter/nf_conntrack_expect.c b/net/netfilter/nf_conntrack_expect.c index 7a0ae36..175c8d1 100644 --- a/net/netfilter/nf_conntrack_expect.c +++ b/net/netfilter/nf_conntrack_expect.c @@ -472,22 +472,8 @@ static const struct seq_operations exp_s static int exp_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - struct ct_expect_iter_state *st; - int ret; - - st = kzalloc(sizeof(struct ct_expect_iter_state), GFP_KERNEL); - if (!st) - return -ENOMEM; - ret = seq_open(file, exp_seq_ops); - if (ret) - goto out_free; - seq = file-private_data; - seq-private = st; - return ret; -out_free: - kfree(st); - return ret; + return seq_open_private(file, exp_seq_ops, + sizeof(struct ct_expect_iter_state)); } static const struct file_operations exp_file_ops = { diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 2a19c5f..9efdd37 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -195,22 +195,8 @@ static const struct seq_operations ct_se static int ct_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - struct ct_iter_state *st; - int ret; - - st = kzalloc(sizeof(struct ct_iter_state), GFP_KERNEL); - if (st == NULL) - return -ENOMEM; - ret = seq_open(file, ct_seq_ops); - if (ret) - goto out_free; - seq = file-private_data; - seq-private = st; - return ret; -out_free: - kfree(st); - return ret; + return seq_open_private(file, ct_seq_ops, + sizeof(struct ct_iter_state)); } static const struct file_operations ct_file_ops = { diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c index 2135926..2c7bd2e 100644 --- a/net/netfilter/nfnetlink_log.c +++ b/net/netfilter/nfnetlink_log.c
Re: [PATCH][E1000E] some cleanups
jamal wrote: On Mon, 2007-08-10 at 15:40 -0700, Kok, Auke wrote: My biggest problem with the patch as you sent it that it's a tonload of changes and no implicit benefit immediately as I can see. The patch looks scary but is pretty tame when you apply it and stare at it. I would really have to see the LLTX change as well to make a wellfounded decision whether it is the right thing to go and overhaul this part of the driver or not :) I'm not particularly against the changes per se though. so please, if needed offlist send me those LLTX changes as well. Dont worry about it - i didnt get any love for a similar patch i did for tg3 either ;- I will hold onto it for now. I could do the un-LLTX outside of this - would you like that for both e1000/e1000e ? if we're going to remove LLTX from e1000 I prefer to do that at a much later time. Let's focus on e1000e instead - while it is still moving ;) Auke - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6][NET-2.6.24] Make decnet code use the seq_open_private()
Just switch to the consolidated code. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Cc: Patrick Caulfield [EMAIL PROTECTED] --- diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c index b66e3be..e851b14 100644 --- a/net/decnet/dn_neigh.c +++ b/net/decnet/dn_neigh.c @@ -580,24 +580,8 @@ static const struct seq_operations dn_ne static int dn_neigh_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct neigh_seq_state *s = kzalloc(sizeof(*s), GFP_KERNEL); - - if (!s) - goto out; - - rc = seq_open(file, dn_neigh_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, dn_neigh_seq_ops, + sizeof(struct neigh_seq_state)); } static const struct file_operations dn_neigh_seq_fops = { diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index b7ebf99..97eee5e 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1739,23 +1739,8 @@ static const struct seq_operations dn_rt static int dn_rt_cache_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct dn_rt_cache_iter_state *s; - - s = kzalloc(sizeof(*s), GFP_KERNEL); - if (!s) - goto out; - rc = seq_open(file, dn_rt_cache_seq_ops); - if (rc) - goto out_kfree; - seq = file-private_data; - seq-private= s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, dn_rt_cache_seq_ops, + sizeof(struct dn_rt_cache_iter_state)); } static const struct file_operations dn_rt_cache_seq_fops = { - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6][NET-2.6.24] Make the IRDA use the seq_open_private()
Just switch to the consolidated code Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Cc: Samuel Ortiz [EMAIL PROTECTED] --- diff --git a/net/irda/irlap.c b/net/irda/irlap.c index 3d76aaf..f3236ac 100644 --- a/net/irda/irlap.c +++ b/net/irda/irlap.c @@ -1219,29 +1219,11 @@ static const struct seq_operations irlap static int irlap_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct irlap_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL); + if (irlap == NULL) + return -EINVAL; - if (!s) - goto out; - - if (irlap == NULL) { - rc = -EINVAL; - goto out_kfree; - } - - rc = seq_open(file, irlap_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, irlap_seq_ops, + sizeof(struct irlap_iter_state)); } const struct file_operations irlap_seq_fops = { diff --git a/net/irda/irlmp.c b/net/irda/irlmp.c index 7efa930..7db92ce 100644 --- a/net/irda/irlmp.c +++ b/net/irda/irlmp.c @@ -2003,27 +2003,10 @@ static const struct seq_operations irlmp static int irlmp_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct irlmp_iter_state *s; - IRDA_ASSERT(irlmp != NULL, return -EINVAL;); - s = kmalloc(sizeof(*s), GFP_KERNEL); - if (!s) - goto out; - - rc = seq_open(file, irlmp_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, irlmp_seq_ops, + sizeof(struct irlmp_iter_state)); } const struct file_operations irlmp_seq_fops = { diff --git a/net/irda/irttp.c b/net/irda/irttp.c index 3d7ab03..1311976 100644 --- a/net/irda/irttp.c +++ b/net/irda/irttp.c @@ -1884,25 +1884,8 @@ static const struct seq_operations irttp static int irttp_seq_open(struct inode *inode, struct file *file) { - struct seq_file *seq; - int rc = -ENOMEM; - struct irttp_iter_state *s; - - s = kzalloc(sizeof(*s), GFP_KERNEL); - if (!s) - goto out; - - rc = seq_open(file, irttp_seq_ops); - if (rc) - goto out_kfree; - - seq = file-private_data; - seq-private = s; -out: - return rc; -out_kfree: - kfree(s); - goto out; + return seq_open_private(file, irttp_seq_ops, + sizeof(struct irttp_iter_state)); } const struct file_operations irttp_seq_fops = { - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6][NET-2.6.24] Make the sunrpc use the seq_open_private()
Just switch to the consolidated code. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Cc: Neil Brown [EMAIL PROTECTED] --- diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c index ebe344f..8e05557 100644 --- a/net/sunrpc/cache.c +++ b/net/sunrpc/cache.c @@ -1218,23 +1218,15 @@ static const struct seq_operations cache static int content_open(struct inode *inode, struct file *file) { - int res; struct handle *han; struct cache_detail *cd = PDE(inode)-data; - han = kmalloc(sizeof(*han), GFP_KERNEL); + han = __seq_open_private(file, cache_content_op, sizeof(*han)); if (han == NULL) return -ENOMEM; han-cd = cd; - - res = seq_open(file, cache_content_op); - if (res) - kfree(han); - else - ((struct seq_file *)file-private_data)-private = han; - - return res; + return 0; } static const struct file_operations content_file_operations = { - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8][BNX2X] resubmit as attachments: add bnx2x to Kconfig and Makefile
On Mon, 2007-10-08 at 21:29 -0700, David Miller wrote: From: Eliezer Tamir [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 06:13:23 +0200 Due to the size of the patch I can not post it to the list. Understood Here is an FTP link. ftp://[EMAIL PROTECTED]/bnx2x-0.40.10-net-2.6.24-one.patch.txt Or if you prefer, I can post it gzipped. I took a look at what takes up so much space and it's the firmware and this register init stuff. The firmware is unavoidable, but wrt. the register init tables there appears to be a ton of superfluous information in that header file. First, I must admit that looking over the file again I noticed an error on my part in generating the file, all the data that initializes FPGA or EMULATION should not be there. fixed. (This reduced about half of the lines but only a third in size.) It seem extreme overkill and I would guess it exists purely to simplify your in-house chip validation. I would really wish you wouldn't do this as these tables take up unswappable kernel memory. No other driver goes about initializing the chip registers using tables like this. Many of them are sequences of nothing but zeros or the same value repeated over and over! The non-zero values that are specific are magic constants with no macros to describe up the meaning of the bits being set in these registers. Almost all of the zero-filled tables will be removed. The rest of the registers do need to be initialized. I agree that the number of registers that needs to be initialized is huge, but that is caused by the way the hardware was designed. The values for the initialization come from several sources: Some are derived from HW code (the XML files used to derive the verilog code), Others (along with much of the machine generated .h files) are generated at microcode build time, adding a microcode routine will cause the init values to change, using a new variable can cause an .h file to change. In the last group which is very small, are registers that are controlled by the driver. The values in this file really are machine generated, they really are not meant to be modified directly by editing the file. The registers that are under the driver's control are in the main .c and .h files. Huge functions like bnx2x_idle_chk() that just validate register values with hundreds of lines of C code are not appropriate for submission into a Linux kernel network driver. Again, this looks like code that assists you with in-house chip validation. The idle check code is not a manufacturing test, it is meant to help debug the driver and microcode. If the driver sends an invalid command to one of the CPUs which then chokes on it, this will tell you which one of them died and the general whereabouts of the problem. (ingress CPU X is stuck because output queue Y is full) I agree that it is not a clean implementation. I will remove it and it will be re-written. I'm open to putting self tests into the driver, but this adds so much bloat it really takes things beyond reasonable. Please remove this self-test code and compress the register initialization information, preferrably into foo_hw_init() C functions as is done traditionally in drivers. It will take up less space, remove the magic values, and make the hardware easier to understand for other developers. This should get the driver under the posting limit of 400K, which to be honest no driver's should be larger than. Unfortunately that firmware file is 350K so you won't have much left to work with :-) What we could do is post the driver without the firmware, so that people can review the actual C-code by just replying to this posting and it won't go over the posting size limit. ( Michael has showed me the trick of how to post with evolution, so I hope that the mangled patch problem is behind us and I think that I can now post everything without a problem, Hallelujah!) Thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
Sorry if this is a dumb question, but what is the model you intend for SNMP? Do you want each namespace to be its own virtual machine with its own, separate MIB? Ifindex's have to uniquely identify the interface (virtual or otherwise) to remote queriers (not just local applications), so unless you pay the price of separating all the SNMP MIBs per namespace too, it seems you'll need some way to remap these for SNMP queries, right? +-DLS - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
Pavel Emelyanov [EMAIL PROTECTED] writes: Currently indexes for netdevices come sequentially one by one, and the same stays true even for devices that are created for namespaces. Side effects of this are: * lo device has not 1 index in a namespace. This may break some userspace that relies on it (and AFAIR something really broke in OpenVZ VEs without this); As it happens lo hasn't been registered first for some time so it hasn't had ifindex of 1 in the normal kernel. * after some time namespaces will have devices with indexes like 100 os similar. This might be confusing for a human (tools will not mind). Only if we wind up creating that many devices. So move the (currently global and static) ifindex variable on the struct net, making the indexes allocation look more like on a standalone machine. Moreover - when we have indexes intersect between namespaces, we may catch more BUGs in the future related to wrong device was found for a given index. Not yet. I know there are several data structures internal to the kernel that are indexed by ifindex, and not struct net_device *. There is the iflink field in struct net_device. We need a way to refer to network devices in other namespaces in rtnetlink in an unambiguous way. I don't see any real problems with a global ifindex assignment until we start migrating applications. So please hold off on this until the kernel has been audited and we have removed all of the uses of ifindex that assume ifindex is global, that we can find. Right now a namespace local ifindex seems to be just asking for trouble. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
David Stevens [EMAIL PROTECTED] writes: Sorry if this is a dumb question, but what is the model you intend for SNMP? Do you want each namespace to be its own virtual machine with its own, separate MIB? Each network namespace appears to user space as a completely separate network stack. So yes a separate instance of the MIB is appropriate. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On 09 Oct 2007 18:51:51 +0200 Andi Kleen [EMAIL PROTECTED] wrote: David Miller [EMAIL PROTECTED] writes: 2) Switch the default qdisc away from pfifo_fast to a new DRR fifo with load balancing using the code in #1. I think this is kind of in the territory of what Peter said he is working on. Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. I know this is controversial, but realistically I doubt users benefit at all from the prioritization that pfifo provides. I agree. For most interfaces the priority is probably dubious. Even for DSL the prioritization will be likely usually done in a router these days. Also for the fast interfaces where we do TSO priority doesn't work very well anyways -- with large packets there is not too much to prioritize. 3) Work on discovering a way to make the locking on transmit as localized to the current thread of execution as possible. Things like RCU and statistic replication, techniques we use widely elsewhere in the stack, begin to come to mind. If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course) -Andi I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU; does it make sense to even have any queueing in software at all? -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
I wonder about the whole idea of queueing in general at such high speeds. Given the normal bi-modal distribution of packets, and the predominance of 1500 byte MTU; does it make sense to even have any queueing in software at all? Yes that is my point -- it should just pass it through directly and the driver can then put it into the different per CPU (or per whatever) queues managed by the hardware. The only thing the qdisc needs to do is to set some bit that says it is ok to put this into difference queues; don't need strict ordering Otherwise if the drivers did that unconditionally they might cause problems with other qdiscs. This would also require that the driver exports some hint to the upper layer on how large its internal queues are. A device with a short queue would still require pfifo_fast. Long queue devices could just pass through. That again could be a single flag. -Andi - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] xen-netfront: rearrange netfront_info structure to separate tx and rx
Keep tx and rx elements separate on different cachelines to prevent bouncing. Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED] Cc: Stephen Hemminger [EMAIL PROTECTED] Cc: Christoph Hellwig [EMAIL PROTECTED] --- drivers/net/xen-netfront.c | 37 ++--- 1 file changed, 18 insertions(+), 19 deletions(-) === --- a/drivers/net/xen-netfront.c +++ b/drivers/net/xen-netfront.c @@ -72,22 +72,12 @@ struct netfront_info { struct list_head list; struct net_device *netdev; + unsigned int evtchn; + struct xenbus_device *xbdev; + + spinlock_t tx_lock; struct xen_netif_tx_front_ring tx; - struct xen_netif_rx_front_ring rx; - - spinlock_t tx_lock; - spinlock_t rx_lock; - - unsigned int evtchn; - - /* Receive-ring batched refills. */ -#define RX_MIN_TARGET 8 -#define RX_DFL_MIN_TARGET 64 -#define RX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256) - unsigned rx_min_target, rx_max_target, rx_target; - struct sk_buff_head rx_batch; - - struct timer_list rx_refill_timer; + int tx_ring_ref; /* * {tx,rx}_skbs store outstanding skbuffs. Free tx_skb entries @@ -106,13 +96,22 @@ struct netfront_info { grant_ref_t grant_tx_ref[NET_TX_RING_SIZE]; unsigned tx_skb_freelist; + spinlock_t rx_lock cacheline_aligned_in_smp; + struct xen_netif_rx_front_ring rx; + int rx_ring_ref; + + /* Receive-ring batched refills. */ +#define RX_MIN_TARGET 8 +#define RX_DFL_MIN_TARGET 64 +#define RX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256) + unsigned rx_min_target, rx_max_target, rx_target; + struct sk_buff_head rx_batch; + + struct timer_list rx_refill_timer; + struct sk_buff *rx_skbs[NET_RX_RING_SIZE]; grant_ref_t gref_rx_head; grant_ref_t grant_rx_ref[NET_RX_RING_SIZE]; - - struct xenbus_device *xbdev; - int tx_ring_ref; - int rx_ring_ref; unsigned long rx_pfn_array[NET_RX_RING_SIZE]; struct multicall_entry rx_mcl[NET_RX_RING_SIZE+1]; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
IMO the net driver really should provide a hint as to what it wants. 8139cp and tg3 would probably prefer multiple TX queue behavior to match silicon behavior -- strict prio. If I understand what you just said, I disagree. If your hardware is running strict prio, you don't want to enforce strict prio in the qdisc layer; performing two layers of QoS is excessive, and may lead to results you don't want. The reason I added the DRR qdisc is for the Si that has its own queueing strategy that is not RR. For Si that implements RR (like e1000), you can either use the DRR qdisc, or if you want to prioritize your flows, use PRIO. -PJ Waskiewicz [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
Waskiewicz Jr, Peter P wrote: IMO the net driver really should provide a hint as to what it wants. 8139cp and tg3 would probably prefer multiple TX queue behavior to match silicon behavior -- strict prio. If I understand what you just said, I disagree. If your hardware is running strict prio, you don't want to enforce strict prio in the qdisc layer; performing two layers of QoS is excessive, and may lead to results you don't want. The reason I added the DRR qdisc is for the Si that has its own queueing strategy that is not RR. For Si that implements RR (like e1000), you can either use the DRR qdisc, or if you want to prioritize your flows, use PRIO. A misunderstanding, I think. To my brain, DaveM's item #2 seemed to assume/require the NIC hardware to balance fairly across hw TX rings, which seemed to preclude the 8139cp/tg3 style of strict-prio hardware. That's what I was responding to. As long as there is some modular way to fit 8139cp/tg3 style multi-TX into our universe, I'm happy :) Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc8-mm2 BUG: register_netdevice() issue as (ab)used by ISDN
Hi, On Mon, Oct 08, 2007 at 05:58:36PM +0200, Karsten Keil wrote: You could try following patch with 2.6.23-rc8-mm2, it change I4L to use alloc_netdev(). I just did the horribly unthinkable: I rebooted the production internet gateway *remotely*, and: [EMAIL PROTECTED]:/home/andi# uname -a Linux gate 2.6.23-rc8-mm2-gate #1 Mon Oct 8 22:14:00 CEST 2007 i586 GNU/Linux [EMAIL PROTECTED]:/home/andi# dmesg|grep BUG [EMAIL PROTECTED]:/home/andi# grep BUG /var/log/messages Oct 7 13:07:40 gate kernel: kernel BUG at net/core/dev.c:3485! ^^^ This 2.6.23-rc8-mm2-gate was ISDN-patched, obviously. Oh, and isdnlog *is* indeed running, so this is a first check that ISDN functionality seems to be up and running... IOW, thanks!! Andreas Mohr -- Want to provide valuable development feedback about your kernel version? Install http://klive.cpushare.com/ ! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/3][NET_BATCH] net core use batching
A misunderstanding, I think. To my brain, DaveM's item #2 seemed to assume/require the NIC hardware to balance fairly across hw TX rings, which seemed to preclude the 8139cp/tg3 style of strict-prio hardware. That's what I was responding to. As long as there is some modular way to fit 8139cp/tg3 style multi-TX into our universe, I'm happy :) Ah hah. Yes, a misunderstanding on my part. Thanks for the clarification. Methinks more caffeine is required for today... -PJ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Don't memset() netns to zero manually
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 15:48:19 +0400 The newly created net namespace is set to 0 with memset() in setup_net(). The setup_net() is also called for the init_net_ns(), which is zeroed naturally as a global var. So remove this memset and allocate new nets with the kmem_cache_zalloc(). Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Applied, thanks Pavel! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Tue, 09 Oct 2007 11:43:58 -0600 David Stevens [EMAIL PROTECTED] writes: Sorry if this is a dumb question, but what is the model you intend for SNMP? Do you want each namespace to be its own virtual machine with its own, separate MIB? Each network namespace appears to user space as a completely separate network stack. So yes a separate instance of the MIB is appropriate. We don't think you can validly do that, as David tried to explain. The interface indexes are visible remotely to remote SNMP querying applications. They have to be unique within the physical system. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
From: David Stevens [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 09:18:25 -0700 Ifindex's have to uniquely identify the interface (virtual or otherwise) to remote queriers (not just local applications), so unless you pay the price of separating all the SNMP MIBs per namespace too, it seems you'll need some way to remap these for SNMP queries, right? I don't see how it can work even with per-namespace MIBs, the interface indexes have to be unique per system. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
From: Pavel Emelyanov [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 16:19:25 +0400 Currently indexes for netdevices come sequentially one by one, and the same stays true even for devices that are created for namespaces. Side effects of this are: * lo device has not 1 index in a namespace. This may break some userspace that relies on it (and AFAIR something really broke in OpenVZ VEs without this); * after some time namespaces will have devices with indexes like 100 os similar. This might be confusing for a human (tools will not mind). So move the (currently global and static) ifindex variable on the struct net, making the indexes allocation look more like on a standalone machine. Moreover - when we have indexes intersect between namespaces, we may catch more BUGs in the future related to wrong device was found for a given index. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Based upon Eric's and other's comments, I'm holding off on this for now. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 08:44:25 -0400 David Miller wrote: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:51:14 +0530 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Thanks, but it's probably going to need some corrections and/or an audit. If you unconditionally take those locks in the transmit function, there is probably an ABBA deadlock elsewhere in the driver now, most likely in the TX reclaim processing, and you therefore need to handle that too. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
David Miller wrote: From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 08:44:25 -0400 David Miller wrote: From: Krishna Kumar2 [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 16:51:14 +0530 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM: Ignore LLTX, it sucks, it was a big mistake, and we will get rid of it. Great, this will make life easy. Any idea how long that would take? It seems simple enough to do. I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) What, like this? :) Thanks, but it's probably going to need some corrections and/or an audit. I would be happy if someone wanted to audit that patch. If you unconditionally take those locks in the transmit function, there is probably an ABBA deadlock elsewhere in the driver now, most likely in the TX reclaim processing, and you therefore need to handle that too. And I most certainly checked the relevant transmit paths and other locking to make sure lock ordering was correct. Jeff - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
I'd say we can probably try to get rid of it in 2.6.25, this is assuming we get driver authors to cooperate and do the conversions or alternatively some other motivated person. I can just threaten to do them all and that should get the driver maintainers going :-) I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to my TODO list so I don't forget. In fact if 2.6.23 drags on long enough I may do it for 2.6.24 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] [IPSEC]: Remove bogus ref count in xfrm_secpath_reject
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:31 +0800 [IPSEC]: Remove bogus ref count in xfrm_secpath_reject Constructs of the form xfrm_state_hold(x); foo(x); xfrm_state_put(x); tend to be broken because foo is either synchronous where this is totally unnecessary or if foo is asynchronous then the reference count is in the wrong spot. In the case of xfrm_secpath_reject, the function is synchronous and therefore we should just kill the reference count. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/7] [IPSEC]: Store IPv6 nh pointer in mac_header on output
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:32 +0800 [IPSEC]: Store IPv6 nh pointer in mac_header on output Current the x-mode-output functions store the IPv6 nh pointer in the skb network header. This is inconvenient because the network header then has to be fixed up before the packet can leave the IPsec stack. The mac header field is unused on output so we can use that to store this instead. This patch does that and removes the network header fix-up in xfrm_output. It also uses ipv6_hdr where appropriate in the x-type-output functions. There is also a minor clean-up in esp4 to make it use the same code as esp6 to help any subsequent effort to merge the two. Lastly it kills two redundant skb_set_* statements in BEET that were simply copied over from transport mode. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks Herbert. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/7] [IPSEC]: Remove gratuitous km wake-up events on ACQUIRE
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:33 +0800 [IPSEC]: Remove gratuitous km wake-up events on ACQUIRE There is no point in waking people up when creating/updating larval states because they'll just go back to sleep again as larval states by definition cannot be found by xfrm_state_find. We should only wake them up when the larvals mature or die. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/7] [IPSEC]: Move common code into xfrm_alloc_spi
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:34 +0800 [IPSEC]: Move common code into xfrm_alloc_spi This patch moves some common code that conceptually belongs to the xfrm core from af_key/xfrm_user into xfrm_alloc_spi. In particular, the spin lock on the state is now taken inside xfrm_alloc_spi. Previously it also protected the construction of the response PF_KEY/XFRM messages to user-space. This is inconsistent as other identical constructions are not protected by the state lock. This is bad because they in fact should be protected but only in certain spots (so as not to hold the lock for too long which may cause packet drops). The SPI byte order conversion has also been moved. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/7] [XFRM] user: Move attribute copying code into copy_to_user_state_extra
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:35 +0800 [XFRM] user: Move attribute copying code into copy_to_user_state_extra Here's a good example of code duplication leading to code rot. The notification patch did its own netlink message creation for xfrm states. It duplicated code that was already in dump_one_state. Guess what, the next time (and the time after) when someone updated dump_one_state the notification path got zilch. This patch moves that code from dump_one_state to copy_to_user_state_extra and uses it in xfrm_notify_sa too. Unfortunately whoever updates this still needs to update xfrm_sa_len since the notification path wants to know the exact size for allocation. At least I've added a comment saying so and if someone still forgest, we'll have a WARN_ON telling us so. I also changed the security size calculation to use xfrm_user_sec_ctx since that's what we actually put into the skb. However it makes no practical difference since it has the same size as xfrm_sec_ctx. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/7] [IPSEC]: Lock state when copying non-atomic fields to user-space
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:36 +0800 [IPSEC]: Lock state when copying non-atomic fields to user-space This patch adds locking so that when we're copying non-atomic fields such as life-time or coaddr to user-space we don't get a partial result. For af_key I've changed every instance of pfkey_xfrm_state2msg apart from expiration notification to include the keys and life-times. This is in-line with XFRM behaviour. The actual cases affected are: * pfkey_getspi: No change as we don't have any keys to copy. * key_notify_sa: + ADD/UPD: This wouldn't work otherwise. + DEL: It can't hurt. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied. I would be more careful with the changelog description for something like this in the future. It sounds like this patch will cause us to touch userspace with locks held, which obviously only works in very limited scenerios and is usually a BUG. But you're actually just constructing SKB response netlink bits, which later will be copied into userspace but long after we've released these locks. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 7/7] [IPSEC]: Move state lock into x-type-output
From: Herbert Xu [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 22:36:37 +0800 [IPSEC]: Move state lock into x-type-output This patch releases the lock on the state before calling x-type-output. It also adds the lock to the spots where they're currently needed. Most of those places (all except mip6) are expected to disappear with async crypto. Signed-off-by: Herbert Xu [EMAIL PROTECTED] Applied, thanks! - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Andi Kleen [EMAIL PROTECTED] Date: 09 Oct 2007 18:51:51 +0200 Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. Indeed, I also realized last night that for the default qdiscs we do a lot of stupid useless work. If the queue is a FIFO and the device can take packets, we should send it directly and not stick it into the qdisc at all. If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course) Absolutely. Our packet scheduler subsystem is great, but by default it should just get out of the way. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Roland Dreier [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 13:22:44 -0700 I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to my TODO list so I don't forget. In fact if 2.6.23 drags on long enough I may do it for 2.6.24 Before you add new entries to your list, how is that ibm driver NAPI conversion coming along? :-) Right now that's a more pressing task to complete. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 09 Oct 2007 13:43:31 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: Andi Kleen [EMAIL PROTECTED] Date: 09 Oct 2007 18:51:51 +0200 Hopefully that new qdisc will just use the TX rings of the hardware directly. They are typically large enough these days. That might avoid some locking in this critical path. Indeed, I also realized last night that for the default qdiscs we do a lot of stupid useless work. If the queue is a FIFO and the device can take packets, we should send it directly and not stick it into the qdisc at all. If the data is just passed on to the hardware queue, why is any locking needed at all? (except for the driver locking of course) Absolutely. Our packet scheduler subsystem is great, but by default it should just get out of the way. I was thinking why not have a default transmit queue len of 0 like the virtual devices. -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
David Miller [EMAIL PROTECTED] writes: From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Tue, 09 Oct 2007 11:43:58 -0600 David Stevens [EMAIL PROTECTED] writes: Sorry if this is a dumb question, but what is the model you intend for SNMP? Do you want each namespace to be its own virtual machine with its own, separate MIB? Each network namespace appears to user space as a completely separate network stack. So yes a separate instance of the MIB is appropriate. We don't think you can validly do that, as David tried to explain. The interface indexes are visible remotely to remote SNMP querying applications. They have to be unique within the physical system. I think figuring out what we are doing with SNMP is not any harder or easier then any other user space interface, and like I said I don't think we are ready yet. From the perspective of monitoring network namespaces make the entire system looks more like a cluster then it does a single machine, and that is how I would look at portraying the system to SNMP if I had to do that work today. A switch with a bunch of different machines behind it. Especially in the context of container migration this becomes an attractive model. Regardless it is early yet and there is plenty of time to revisit this after we solved the easier and less controversial problems. Eric - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: net-2.6.24 rebased...
Hallo Dave, 9.1MB, 739 changesets, keep those patches flowing! Last week I have sent another version of our patch series for PF_CAN. The changes after the last review feedback were only cosmetics. Do you have any plans with that code for this or the next release? Regards, urs - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][NETNS] Make ifindex generation per-namespace
From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Tue, 09 Oct 2007 15:00:10 -0600 Regardless it is early yet and there is plenty of time to revisit this after we solved the easier and less controversial problems. Ok. I would encourage you to learn how the SNMP mibs work, and whether they associate things with interfaces and/or unique MAC addresses. The semantics may have conflicts with your envisioned cluster abstraction. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: net-2.6.24 rebased...
From: Urs Thuermann [EMAIL PROTECTED] Date: 09 Oct 2007 23:13:42 +0200 Last week I have sent another version of our patch series for PF_CAN. The changes after the last review feedback were only cosmetics. Do you have any plans with that code for this or the next release? I think PF_CAN will go in 2.6.25 - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Stephen Hemminger [EMAIL PROTECTED] Date: Tue, 9 Oct 2007 13:53:40 -0700 I was thinking why not have a default transmit queue len of 0 like the virtual devices. I'm not so sure. Even if the device has huge queues I still think we need a software queue for when the hardware one backs up. It is even beneficial to stick with reasonably sized TX queues because it keeps the total resident state accessed by the CPU within the bounds of the L2 cache. If you go past that it actually hurts to make the TX queue larger instead of helps even if it means you never hit back pressure. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 09 Oct 2007 16:20:14 -0400 David Miller wrote: If you unconditionally take those locks in the transmit function, there is probably an ABBA deadlock elsewhere in the driver now, most likely in the TX reclaim processing, and you therefore need to handle that too. And I most certainly checked the relevant transmit paths and other locking to make sure lock ordering was correct. Awesome. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3][NET_BATCH] net core use batching
Before you add new entries to your list, how is that ibm driver NAPI conversion coming along? :-) I still haven't done much. OK, I will try to get my board booting again this week. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching
On Tue, 2007-09-10 at 14:22 -0700, David Miller wrote: Even if the device has huge queues I still think we need a software queue for when the hardware one backs up. It should be fine to just pretend the qdisc exists despite it sitting in the driver and not have s/ware queues at all to avoid all the challenges that qdiscs bring; if the h/ware queues are full because of link pressure etc, you drop. We drop today when the s/ware queues are full. The driver txmit lock takes place of the qdisc queue lock etc. I am assuming there is still need for that locking. The filter/classification scheme still works as is and select classes which map to rings. tc still works as is etc. cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHES] TX batching rev2.5
Please provide feedback on the code and/or architecture. They are now updated to work with the latest rebased net-2.6.24 from a few hours ago. I am on travel mode so wont have time to do more testing for the next few days - i do consider this to be stable at this point based on what i have been testing (famous last words). Patch 1: Introduces batching interface Patch 2: Core uses batching interface Patch 3: get rid of dev-gso_skb What has changed since i posted last: - 1) There was some cruft leftover from prep_frame feature that i forgot to remove last time - it is now gone. 2) In the shower this AM, i realized that it is plausible that a batch of packets sent to the driver may all be dropped because they are badly formatted. Current drivers all return NETDEV_TX_OK in all such cases. This will lead for us to call dev-hard_end_xmit() to be invoked unnecessarily. I already had a NETDEV_TX_DROPPED within batching drivers, so i made that global and make the batching drivers return it if they drop packets. The core calls dev-hard_end_xmit() when at least one packet makes it through. Things i am gonna say that nobody will see (wink) - Dave please let me know if this meets your desires to allow devices which are SG and able to compute CSUM benefit just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreaciate it (since it kills dev-gso_skb that you introduced). UPCOMING PATCHES --- As before: More patches to follow later if i get some feedback - i didnt want to overload people by dumping too many patches. Most of these patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel: - tg3 driver - tun driver - pktgen - netiron driver - e1000e driver (non-LLTX) - ethtool interface - There is at least one other driver promised to me Theres also a driver-howto that i will post soon today. PERFORMANCE TESTING System under test hardware is still a 2xdual core opteron with a couple of tg3s. A test tool generates udp traffic of different sizes for upto 60 seconds per run or a total of 30M packets. I have 4 threads each running on a specific CPU which keep all the CPUs as busy as they can sending packets targetted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected. I do plan also to run forwarding and TCP tests in the future when the dust settles. cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5
This patch introduces the netdevice interface for batching. cheers, jamal [NET_BATCH] Introduce batching interface This patch introduces the netdevice interface for batching. BACKGROUND - A driver dev-hard_start_xmit() has 4 typical parts: a) packet formating (example vlan, mss, descriptor counting etc) b) chip specific formatting c) enqueueing the packet on a DMA ring d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interupts, set last tx time, etc [For code cleanliness/readability sake, regardless of this work, one should break the dev-hard_start_xmit() into those 4 functions anyways]. INTRODUCING API --- With the api introduced in this patch, a driver which has all 4 parts and needing to support batching is advised to split its dev-hard_start_xmit() in the following manner: 1)Remove #d from dev-hard_start_xmit() and put it in dev-hard_end_xmit() method. 2)#b and #c can stay in -hard_start_xmit() (or whichever way you want to do this) 3) #a is deffered to future work to reduce confusion (since it holds on its own). Note: There are drivers which may need not support any of the two approaches (example the tun driver i patched) so the methods are optional. xmit_win variable is set by the driver to tell the core how much space it has to take on new skbs. It is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we dont requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us when it invokes netif_wake_queue how much space it has for descriptors by setting this variable. Refer to the driver howto for more details. THEORY OF OPERATION --- 1. Core dequeues from qdiscs upto dev-xmit_win packets. Fragmented and GSO packets are accounted for as well. 2. Core grabs TX_LOCK 3. Core loop for all skbs: invokes driver dev-hard_start_xmit() 4. Core invokes driver dev-hard_end_xmit() ACKNOWLEDGEMENT AND SOME HISTORY There's a lot of history and reasoning of why batching in a document i am writting which i may submit as a patch. Thomas Graf (who doesnt know this probably) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented in SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller, and Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and neterion folks have been outstanding in their help. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 98d39ea7922fa2719a80eecd02cae359f3d7 tree 63822bf3040ea41846399c589c912c2be654f008 parent 7b4cd20628fe5c4e145c383fcd8d954d38f7be61 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400 include/linux/netdevice.h |9 +- net/core/dev.c| 67 ++--- net/sched/sch_generic.c |4 +-- 3 files changed, 73 insertions(+), 7 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 91cd3f3..b0e71c9 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -86,6 +86,7 @@ struct wireless_dev; /* Driver transmit return codes */ #define NETDEV_TX_OK 0 /* driver took care of packet */ #define NETDEV_TX_BUSY 1 /* driver tx path was busy*/ +#define NETDEV_TX_DROPPED 2 /* driver tx path dropped packet*/ #define NETDEV_TX_LOCKED -1 /* driver tx lock was already taken */ /* @@ -467,6 +468,7 @@ struct net_device #define NETIF_F_NETNS_LOCAL 8192 /* Does not change network namespaces */ #define NETIF_F_MULTI_QUEUE 16384 /* Has multiple TX/RX queues */ #define NETIF_F_LRO 32768 /* large receive offload */ +#define NETIF_F_BTX 65536 /* Capable of batch tx */ /* Segmentation offload features */ #define NETIF_F_GSO_SHIFT 16 @@ -595,6 +597,9 @@ struct net_device void *priv; /* pointer to private data */ int (*hard_start_xmit) (struct sk_buff *skb, struct net_device *dev); + void (*hard_end_xmit) (struct net_device *dev); + int xmit_win; + /* These may be needed for future network-power-down code. */ unsigned long trans_start; /* Time (in jiffies) of last Tx */ @@ -609,6 +614,7 @@ struct net_device /* delayed register/unregister */ struct list_head todo_list; + struct sk_buff_head blist; /* device index hash chain */ struct hlist_node index_hlist; @@ -1043,7 +1049,8 @@
[PATCH 2/3][NET_BATCH] Rev2.5 net core use batching
This patch adds the usage of batching within the core. cheers, jamal [NET_BATCH] net core use batching This patch adds the usage of batching within the core. Performance results demonstrating improvement are provided separately. I have #if-0ed some of the old functions so the patch is more readable. A future patch will remove all if-0ed content. Patrick McHardy eyeballed a bug that will cause re-ordering in case of a requeue. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit c73d8ee8cce61a98f55fbfb2cafe813a7eca472c tree 8b9155fe15baa4c2e7adb69585c7aa275a6bc896 parent 98d39ea7922fa2719a80eecd02cae359f3d7 author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400 net/sched/sch_generic.c | 104 ++- 1 files changed, 94 insertions(+), 10 deletions(-) diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 424c08b..d98c680 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q) return q-q.qlen; } +#if 0 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q) { @@ -110,6 +111,85 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, return ret; } +#endif + +static inline int handle_dev_cpu_collision(struct net_device *dev) +{ + if (unlikely(dev-xmit_lock_owner == smp_processor_id())) { + if (net_ratelimit()) + printk(KERN_WARNING +Dead loop on netdevice %s, fix it urgently!\n, +dev-name); + return 1; + } + __get_cpu_var(netdev_rx_stat).cpu_collision++; + return 0; +} + +static inline int +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + + struct sk_buff *skb; + + while ((skb = __skb_dequeue_tail(skbs)) != NULL) + q-ops-requeue(skb, q); + + netif_schedule(dev); + return 0; +} + +static inline int +xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev, + struct Qdisc *q) +{ + int ret = handle_dev_cpu_collision(dev); + + if (ret) { + if (!skb_queue_empty(skbs)) + skb_queue_purge(skbs); + return qdisc_qlen(q); + } + + return dev_requeue_skbs(skbs, dev, q); +} + +static int xmit_count_skbs(struct sk_buff *skb) +{ + int count = 0; + for (; skb; skb = skb-next) { + count += skb_shinfo(skb)-nr_frags; + count += 1; + } + return count; +} + +static int xmit_get_pkts(struct net_device *dev, + struct Qdisc *q, + struct sk_buff_head *pktlist) +{ + struct sk_buff *skb; + int count = dev-xmit_win; + + if (count dev-gso_skb) { + skb = dev-gso_skb; + dev-gso_skb = NULL; + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + while (count 0) { + skb = q-dequeue(q); + if (!skb) + break; + + count -= xmit_count_skbs(skb); + __skb_queue_tail(pktlist, skb); + } + + return skb_queue_len(pktlist); +} /* * NOTE: Called under dev-queue_lock with locally disabled BH. @@ -133,19 +213,20 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb, static inline int qdisc_restart(struct net_device *dev) { struct Qdisc *q = dev-qdisc; - struct sk_buff *skb; - int ret, xcnt = 0; + int ret = 0; - /* Dequeue packet */ - if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL)) - return 0; + /* Dequeue packets */ + ret = xmit_get_pkts(dev, q, dev-blist); + if (!ret) + return 0; - /* And release queue */ + /* We got em packets */ spin_unlock(dev-queue_lock); + /* bye packets */ HARD_TX_LOCK(dev, smp_processor_id()); - ret = dev_hard_start_xmit(skb, dev, xcnt); + ret = dev_batch_xmit(dev); HARD_TX_UNLOCK(dev); spin_lock(dev-queue_lock); @@ -158,8 +239,8 @@ static inline int qdisc_restart(struct net_device *dev) break; case NETDEV_TX_LOCKED: - /* Driver try lock failed */ - ret = handle_dev_cpu_collision(skb, dev, q); + /* Driver lock failed */ + ret = xmit_islocked(dev-blist, dev, q); break; default: @@ -168,7 +249,7 @@ static inline int qdisc_restart(struct net_device *dev) printk(KERN_WARNING BUG %s code %d qlen %d\n, dev-name, ret, q-q.qlen); - ret = dev_requeue_skb(skb, dev, q); + ret = dev_requeue_skbs(dev-blist, dev, q); break; } @@ -564,6 +645,9 @@ void dev_deactivate(struct net_device *dev) skb = dev-gso_skb; dev-gso_skb = NULL; + if (!skb_queue_empty(dev-blist)) + skb_queue_purge(dev-blist); + dev-xmit_win = 1; spin_unlock_bh(dev-queue_lock); kfree_skb(skb);