RE: raw PF_PACKET protocol selection

2007-10-09 Thread Joakim Tjernlund
 

 -Original Message-
 From: Herbert Xu [mailto:[EMAIL PROTECTED] 
 Sent: den 9 oktober 2007 05:17
 To: [EMAIL PROTECTED]
 Cc: netdev@vger.kernel.org
 Subject: Re: raw PF_PACKET protocol selection
 
 Joakim Tjernlund [EMAIL PROTECTED] wrote:
  
  I trying to open my own raw PF_PACKET socket to receive 
  pkgs sent to this socket. I can only make ETH_P_ALL protocol
  work, but then I receive all pkgs and I want pkgs with a specific
  protocol type. I have tried lots of ETH_P types and none of 
 them work.
  Naturally I make sure the sender is using the same protocol 
 as my test
  program below. I guess I must be doing something wrong???
 
 Your program works fine here.  You did run it as root, right?

Yes and ETH_P_ALL is the only protocol that prints anything
I am on 2.6.22

 Did you try stracing it?

Just did and now it works, it didn't yesterday :(
But if I change protocol to ETH_P_MOBITEX, I don't get any
pkgs(I did change protocol on sending side too)

 
 Cheers,
 -- 
 Visit Openswan at http://www.openswan.org/
 Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
 Home Page: http://gondor.apana.org.au/~herbert/
 PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
 
 
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection

2007-10-09 Thread TAKANO Ryousei
From: Ilpo Järvinen [EMAIL PROTECTED]
Subject: Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection
Date: Mon, 8 Oct 2007 14:11:55 +0300 (EEST)

 On Sun, 7 Oct 2007, TAKANO Ryousei wrote:
 
  From: Ilpo Järvinen [EMAIL PROTECTED]
  Subject: Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection
  Date: Fri, 5 Oct 2007 13:02:07 +0300 (EEST)
  
   On Thu, 4 Oct 2007, TAKANO Ryousei wrote:
   
   In case sacktag uses fastpath, this code won't be executed for the skb's 
   that we would like to check (those with SACKED_RETRANS set, that are 
   below the fastpath_skb_hint). We will eventually deal with the whole 
   queue 
   when fastpath_skb_hint gets set to NULL, with the next cumulative ACK 
   that 
   fully ACKs an skb at the latest. Maybe there's a need for a larger 
   surgery 
   than this to fix it. I think we need additional field to tcp_sock to 
   avoid 
   doing a full-walk per ACK:
 
  I think the problem occurs in slowpath. For example, in case when the 
  receiver
  detects and sends back a new SACK block, the sender may fail to detect loss
  of a retransmitted packet.
 
 ...No, the slow path is very likely to _correct_ the problem! The problem 
 occurs because fastpath _skips_ processing of skbs below fastpath_skb_hint 
 whereas slowpath starts from tcp_write_queue_head. The retransmitted skbs 
 we're interested in reside exactly on that skipped range.
 
Sorry, my explanaton is insufficient.
This problem occurs even if we are in the slowpath.

For example, consider the case when the state of the retransmission queue is
as shown in Fig.1 (http://projects.gtrc.aist.go.jp/gnet/meeting/sack-bug.html).
There are 13 packets in the retransmission queue, P(0) and P(7) are 
retransmitted, and P(1) to P(9) except P(7) are already SACKed.

If only a SACK block SACK(8,11) is received, P(0) is re-transmitted, 
because the ack_seq of P(0) is smaller than the end sequence number (11) 
of the SACK block. However, if the ACK packet has SACK(8,11) and SACK(1,3), 
these SACK blocks are sorted, and SACK(1,3) is process from P(0) to P(2) 
first. In this case, P(0) is not detected as lost because the ack_seq of 
P(0) is not smaller than the end sequence number (3) of this SACK block. 
Next, SACK(8,11) is scanned from P(3) to P(10). P(0) is not checked here, 
and the loss is not detected.

Is it corrent?

Ryousei Takano
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net-2.6.24][trivial] fix inconsistency of terms

2007-10-09 Thread TAKANO Ryousei

Fix inconsistency of terms:
1) D-SACK
2) F-RTO

Signed-off-by: Ryousei Takano [EMAIL PROTECTED]
---
 Documentation/networking/ip-sysctl.txt |6 +++---
 net/ipv4/tcp_input.c   |   16 
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 6ae2fef..829a4d8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -184,14 +184,14 @@ tcp_frto - INTEGER
F-RTO is an enhanced recovery algorithm for TCP retransmission
timeouts.  It is particularly beneficial in wireless environments
where packet loss is typically due to random radio interference
-   rather than intermediate router congestion.  FRTO is sender-side
+   rather than intermediate router congestion.  F-RTO is sender-side
only modification.  Therefore it does not require any support from
the peer, but in a typical case, however, where wireless link is
the local access link and most of the data flows downlink, the
-   faraway servers should have FRTO enabled to take advantage of it.
+   faraway servers should have F-RTO enabled to take advantage of it.
If set to 1, basic version is enabled.  2 enables SACK enhanced
F-RTO if flow uses SACK.  The basic version can be used also when
-   SACK is in use though scenario(s) with it exists where FRTO
+   SACK is in use though scenario(s) with it exists where F-RTO
interacts badly with the packet counting of the SACK enabled TCP
flow.
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e8c3948..372ff6f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -103,7 +103,7 @@ int sysctl_tcp_abc __read_mostly;
 #define FLAG_SLOWPATH  0x100 /* Do not skip RFC checks for window 
update.*/
 #define FLAG_ONLY_ORIG_SACKED  0x200 /* SACKs only non-rexmit sent before RTO 
*/
 #define FLAG_SND_UNA_ADVANCED  0x400 /* Snd_una was changed (!= 
FLAG_DATA_ACKED) */
-#define FLAG_DSACKING_ACK  0x800 /* SACK blocks contained DSACK info */
+#define FLAG_DSACKING_ACK  0x800 /* SACK blocks contained D-SACK info */
 #define FLAG_NONHEAD_RETRANS_ACKED 0x1000 /* Non-head rexmitted data was 
ACKed */
 
 #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
@@ -866,7 +866,7 @@ static void tcp_disable_fack(struct tcp_sock *tp)
tp-rx_opt.sack_ok = ~2;
 }
 
-/* Take a notice that peer is sending DSACKs */
+/* Take a notice that peer is sending D-SACKs */
 static void tcp_dsack_seen(struct tcp_sock *tp)
 {
tp-rx_opt.sack_ok |= 4;
@@ -1058,7 +1058,7 @@ static void tcp_update_reordering(struct sock *sk, const 
int metric,
  *
  * With D-SACK the lower bound is extended to cover sequence space below
  * SND.UNA down to undo_marker, which is the last point of interest. Yet
- * again, DSACK block must not to go across snd_una (for the same reason as
+ * again, D-SACK block must not to go across snd_una (for the same reason as
  * for the normal SACK blocks, explained above). But there all simplicity
  * ends, TCP might receive valid D-SACKs below that. As long as they reside
  * fully below undo_marker they do not affect behavior in anyway and can
@@ -1080,7 +1080,7 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, 
int is_dsack,
if (!before(start_seq, tp-snd_nxt))
return 0;
 
-   /* In outstanding window? ...This is valid exit for DSACKs too.
+   /* In outstanding window? ...This is valid exit for D-SACKs too.
 * start_seq == snd_una is non-sensical (see comments above)
 */
if (after(start_seq, tp-snd_una))
@@ -1581,7 +1581,7 @@ void tcp_enter_frto(struct sock *sk)
 !icsk-icsk_retransmits)) {
tp-prior_ssthresh = tcp_current_ssthresh(sk);
/* Our state is too optimistic in ssthresh() call because cwnd
-* is not reduced until tcp_enter_frto_loss() when previous FRTO
+* is not reduced until tcp_enter_frto_loss() when previous 
F-RTO
 * recovery has not yet completed. Pattern would be this: RTO,
 * Cumulative ACK, RTO (2xRTO for the same segment does not end
 * up here twice).
@@ -1760,7 +1760,7 @@ void tcp_enter_loss(struct sock *sk, int how)
tcp_set_ca_state(sk, TCP_CA_Loss);
tp-high_seq = tp-snd_nxt;
TCP_ECN_queue_cwr(tp);
-   /* Abort FRTO algorithm if one is in progress */
+   /* Abort F-RTO algorithm if one is in progress */
tp-frto_counter = 0;
 }
 
@@ -1905,7 +1905,7 @@ static int tcp_time_to_recover(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
__u32 packets_out;
 
-   /* Do not perform any recovery during FRTO algorithm */
+   /* Do not perform any recovery during F-RTO algorithm */
if (tp-frto_counter)
return 0;
 
@@ -2921,7 

Re: [PATCH 4/8][BNX2X] resubmit as attachments: bnx2x_hsi.h

2007-10-09 Thread Chris Wedgwood
On Mon, Oct 08, 2007 at 06:16:17PM +0200, Eliezer Tamir wrote:

 bnx2x_hsi.h - machine generated hardware and microcode definitions

over 7000 lines... how much of this is really needed?
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][TCP] tcp_vegas.c: (tcp_vegas_cong_avoid) fix a bug in disabling slow start by gamma parameter

2007-10-09 Thread Xiaoliang (David) Wei
TCP Vegas implementation has a bug in the process of disabling
slow-start with gamma parameter. The bug may lead to extreme
unfairness in the presence of early packet loss. See details in:
http://www.cs.caltech.edu/~weixl/technical/ns2linux/known_linux/index.html#vegas


Switch the order of if (tp-snd_cwnd = tp-snd_ssthresh) statement
and if (diff  gamma) statement to eliminate the problem.

Signed-off-by: Xiaoliang (David) Wei  http://www.davidwei.org
---
 linux-2.6.22.9-fixedvegas/net/ipv4/tcp_vegas.c |   35 -
 1 files changed, 17 insertions(+), 18 deletions(-)

diff -uprN linux-2.6.22.9/net/ipv4/tcp_vegas.c
linux-2.6.22.9-fixedvegas/net/ipv4/tcp_vegas.c
--- linux-2.6.22.9/net/ipv4/tcp_vegas.c 2007-09-26 11:03:01.0 -0700
+++ linux-2.6.22.9-fixedvegas/net/ipv4/tcp_vegas.c  2007-10-08
22:44:46.0 -0700
@@ -266,26 +266,25 @@ static void tcp_vegas_cong_avoid(struct
 */
diff = (old_wnd  V_PARAM_SHIFT) - target_cwnd;

-   if (tp-snd_cwnd = tp-snd_ssthresh) {
-   /* Slow start.  */
-   if (diff  gamma) {
-   /* Going too fast. Time to slow down
-* and switch to congestion avoidance.
-*/
-   tp-snd_ssthresh = 2;
+   if (diff  gamma  tp-snd_ssthresh  2 ) {
+   /* Going too fast. Time to slow down
+* and switch to congestion avoidance.
+*/
+   tp-snd_ssthresh = 2;

-   /* Set cwnd to match the actual rate
-* exactly:
-*   cwnd = (actual rate) * baseRTT
-* Then we add 1 because the integer
-* truncation robs us of full link
-* utilization.
-*/
-   tp-snd_cwnd = min(tp-snd_cwnd,
-  (target_cwnd 
-   V_PARAM_SHIFT)+1);
+   /* Set cwnd to match the actual rate
+* exactly:
+*   cwnd = (actual rate) * baseRTT
+* Then we add 1 because the integer
+* truncation robs us of full link
+* utilization.
+*/
+   tp-snd_cwnd = min(tp-snd_cwnd,
+  (target_cwnd 
+   V_PARAM_SHIFT)+1);

-   }
+   } else if (tp-snd_cwnd = tp-snd_ssthresh) {
+   /* Slow start.  */
tcp_slow_start(tp);
} else {
/* Congestion avoidance. */
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Evgeniy Polyakov
On Tue, Oct 09, 2007 at 08:08:22AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) 
wrote:
  Your program works fine here.  You did run it as root, right?
 
 Yes and ETH_P_ALL is the only protocol that prints anything
 I am on 2.6.22

ETH_P_ARP works too.

  Did you try stracing it?
 
 Just did and now it works, it didn't yesterday :(
 But if I change protocol to ETH_P_MOBITEX, I don't get any
 pkgs(I did change protocol on sending side too)

Did you change eth_type_trans() to catch your proto?


-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [IPV6] Defer IPv6 device initialization until a valid qdisc is specified

2007-10-09 Thread Mitsuru Chinen
To judge the timing for DAD, netif_carrier_ok() is used. However,
there is a possibility that dev-qdisc stays noop_qdisc even if
netif_carrier_ok() returns true. In that case, DAD NS is not sent out.
We need to defer the IPv6 device initialization until a valid qdisc
is specified.

Signed-off-by: Mitsuru Chinen [EMAIL PROTECTED]
Signed-off-by: YOSHIFUJI Hideaki [EMAIL PROTECTED]
---
 net/ipv6/addrconf.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 6d5c3c2..da8c79c 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -74,6 +74,7 @@
 #include net/tcp.h
 #include net/ip.h
 #include net/netlink.h
+#include net/pkt_sched.h
 #include linux/if_tunnel.h
 #include linux/rtnetlink.h
 
@@ -213,6 +214,12 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly 
= {
 const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT;
 const struct in6_addr in6addr_loopback = IN6ADDR_LOOPBACK_INIT;
 
+/* Check if a valid qdisc is available */
+static inline int addrconf_qdisc_ok(struct net_device *dev)
+{
+   return (dev-qdisc != noop_qdisc);
+}
+
 static void addrconf_del_timer(struct inet6_ifaddr *ifp)
 {
if (del_timer(ifp-timer))
@@ -384,7 +391,7 @@ static struct inet6_dev * ipv6_add_dev(struct net_device 
*dev)
}
 #endif
 
-   if (netif_running(dev)  netif_carrier_ok(dev))
+   if (netif_running(dev)  addrconf_qdisc_ok(dev))
ndev-if_flags |= IF_READY;
 
ipv6_mc_init_dev(ndev);
@@ -2283,7 +2290,7 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
break;
 
if (event == NETDEV_UP) {
-   if (!netif_carrier_ok(dev)) {
+   if (!addrconf_qdisc_ok(dev)) {
/* device is not ready yet. */
printk(KERN_INFO
ADDRCONF(NETDEV_UP): %s: 
@@ -2295,7 +2302,7 @@ static int addrconf_notify(struct notifier_block *this, 
unsigned long event,
if (idev)
idev-if_flags |= IF_READY;
} else {
-   if (!netif_carrier_ok(dev)) {
+   if (!addrconf_qdisc_ok(dev)) {
/* device is still not ready. */
break;
}
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH V6 0/9] net/bonding: ADD IPoIB support for the bonding driver

2007-10-09 Thread Moni Shoua
Jay Vosburgh wrote:
 Jeff Garzik [EMAIL PROTECTED] wrote:
 
 Moni Shoua wrote:
 Jay Vosburgh wrote:
 ACK patches 3 - 9.

Roland, are you comfortable with the IB changes in patches 1 and 2?

Jeff, when Roland acks patches 1 and 2, please apply all 9.

-J
 Hi Jeff,
 Roland acked the IPoIB patches. If you haven't done so already can you 
 please apply them?
 I'm not sure when 2.6.24 is going to open and I'm afraid to miss it.
 hrm, I don't see them in my inbox for some reason.  can someone bounce
 them to me?  or give me a git tree to pull from?
 
   Moni, can you repost the patch series to Jeff, and put the
 appropriate Acked-by lines in for myself (patches 3 - 8) and Roland
 (patches 1 and 2)?  You can probably leave off the netdev and
 openfabrics lists, but cc me.
 
   -J
 
 ---
   -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
 
Hi Jeff,
I don't commits of the patches in 
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=summary
(I hope that I'm looking in the right place).

Did you get them?

thanks
MoniS


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Joakim Tjernlund
On Tue, 2007-10-09 at 11:13 +0400, Evgeniy Polyakov wrote:
 On Tue, Oct 09, 2007 at 08:08:22AM +0200, Joakim Tjernlund ([EMAIL 
 PROTECTED]) wrote:
   Your program works fine here.  You did run it as root, right?
  
  Yes and ETH_P_ALL is the only protocol that prints anything
  I am on 2.6.22
 
 ETH_P_ARP works too.
 
   Did you try stracing it?
  
  Just did and now it works, it didn't yesterday :(
  But if I change protocol to ETH_P_MOBITEX, I don't get any
  pkgs(I did change protocol on sending side too)
 
 Did you change eth_type_trans() to catch your proto?
 

Just fond out something:
if I redirect my prog like so:
./sniff  log
and press Ctrl-C after a packet has been sent to it, 
it does NOT work. I don't get ANY output in my log file, not
even the printf(-\n) appears.
But if I run whithout redirect it works(at least with ETH_P_BPQ)
Anyone else see this too?

  Jocke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Evgeniy Polyakov
On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) 
wrote:
  Did you change eth_type_trans() to catch your proto?
  
 
 Just fond out something:
 if I redirect my prog like so:
 ./sniff  log
 and press Ctrl-C after a packet has been sent to it, 
 it does NOT work. I don't get ANY output in my log file, not
 even the printf(-\n) appears.
 But if I run whithout redirect it works(at least with ETH_P_BPQ)
 Anyone else see this too?

I only tested with IP and ARP packets - I can not say when packet was
actually received and written to log, but it does start filling up, but
maybe not immediately - it can be output buffering in shell though.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Joakim Tjernlund
On Tue, 2007-10-09 at 11:34 +0400, Evgeniy Polyakov wrote:
 On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL 
 PROTECTED]) wrote:
   Did you change eth_type_trans() to catch your proto?
   
  
  Just fond out something:
  if I redirect my prog like so:
  ./sniff  log
  and press Ctrl-C after a packet has been sent to it, 
  it does NOT work. I don't get ANY output in my log file, not
  even the printf(-\n) appears.
  But if I run whithout redirect it works(at least with ETH_P_BPQ)
  Anyone else see this too?
 
 I only tested with IP and ARP packets - I can not say when packet was
 actually received and written to log, but it does start filling up, but
 maybe not immediately - it can be output buffering in shell though.

Did you receive many packets? Seems like when I receive just 1 or 2 pkgs
I get the empty log. If I strace ./sniff  log I see that recvfrom gets
pkgs, but there are no trace of writes. I guess this
is a bash(3.2_p17) or glibc(2.5.-r4) bug?
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Herbert Xu
On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund wrote:

 Just fond out something:
 if I redirect my prog like so:
 ./sniff  log
 and press Ctrl-C after a packet has been sent to it, 
 it does NOT work. I don't get ANY output in my log file, not
 even the printf(-\n) appears.
 But if I run whithout redirect it works(at least with ETH_P_BPQ)
 Anyone else see this too?

Um, this is what we call buffering.

You either need to turn buffering off with setbuf(3) or you
should install a SIGINT handler to flush the output before
exiting.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHES] TX batching

2007-10-09 Thread Krishna Kumar2
J Hadi Salim [EMAIL PROTECTED] wrote on 10/08/2007 07:35:20 PM:

 I dont see something from Krishna's approach that i can take and reuse.
 This maybe because my old approaches have evolved from the same path.
 There is a long list but as a sample: i used to do a lot more work while
 holding the queue lock which i have now moved post queue lock; i dont
 have any speacial interfaces/tricks just for batching, i provide hints
 to the core of how much the driver can take etc etc. I have offered
 Krishna co-authorship if he makes the IPOIB driver to work on my
 patches, that offer still stands if he chooses to take it.

My feeling is that since the approaches are very different, it would
be a good idea to test the two for performance. Do you mind me doing
that? Ofcourse others and/or you are more than welcome to do the same.

I had sent a note to you yesterday about this, please let me know
either way.

*** Previous mail **

Hi Jamal,

If you don't mind, I am trying to run your approach vs mine to get some
results
for comparison.

For starters, I am having issues with iperf when using your infrastructure
code with
my IPoIB driver - about 100MB is sent and then everything stops for some
reason.
The changes in the IPoIB driver that I made to support batching is to set
BTX, set
xmit_win, and dynamically reduce xmit_win on every xmit and increase
xmit_win on
every xmit completion. Is there anything else that is required from the
driver?

thanks,

- KK

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Evgeniy Polyakov
On Tue, Oct 09, 2007 at 09:51:25AM +0200, Joakim Tjernlund ([EMAIL PROTECTED]) 
wrote:
 On Tue, 2007-10-09 at 11:34 +0400, Evgeniy Polyakov wrote:
  On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL 
  PROTECTED]) wrote:
Did you change eth_type_trans() to catch your proto?

   
   Just fond out something:
   if I redirect my prog like so:
   ./sniff  log
   and press Ctrl-C after a packet has been sent to it, 
   it does NOT work. I don't get ANY output in my log file, not
   even the printf(-\n) appears.
   But if I run whithout redirect it works(at least with ETH_P_BPQ)
   Anyone else see this too?
  
  I only tested with IP and ARP packets - I can not say when packet was
  actually received and written to log, but it does start filling up, but
  maybe not immediately - it can be output buffering in shell though.
 
 Did you receive many packets? Seems like when I receive just 1 or 2 pkgs
 I get the empty log. If I strace ./sniff  log I see that recvfrom gets
 pkgs, but there are no trace of writes. I guess this
 is a bash(3.2_p17) or glibc(2.5.-r4) bug?

I received 1396 bytes of logs before terminated, which is 27 ARP packets, 
so there is quite big number of packet there. 
Your application works correctly (although you swapped source and
destination ethernet fields) - buffered writing is not a bug,
if you do not like it, use write(2), mmap(2) or turn buffering off as
Herbert suggested. To get packets with your own ethernet protocol number
you have to change eth_type_trans() function in kernel, which parses
ethernet header and returns protocol number, under some conditions it
will just return your number automatically, but you should check it.

-- 
Evgeniy Polyakov
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raw PF_PACKET protocol selection

2007-10-09 Thread Joakim Tjernlund
On Tue, 2007-10-09 at 12:17 +0400, Evgeniy Polyakov wrote:
 On Tue, Oct 09, 2007 at 09:51:25AM +0200, Joakim Tjernlund ([EMAIL 
 PROTECTED]) wrote:
  On Tue, 2007-10-09 at 11:34 +0400, Evgeniy Polyakov wrote:
   On Tue, Oct 09, 2007 at 09:27:38AM +0200, Joakim Tjernlund ([EMAIL 
   PROTECTED]) wrote:
 Did you change eth_type_trans() to catch your proto?
 

Just fond out something:
if I redirect my prog like so:
./sniff  log
and press Ctrl-C after a packet has been sent to it, 
it does NOT work. I don't get ANY output in my log file, not
even the printf(-\n) appears.
But if I run whithout redirect it works(at least with ETH_P_BPQ)
Anyone else see this too?
   
   I only tested with IP and ARP packets - I can not say when packet was
   actually received and written to log, but it does start filling up, but
   maybe not immediately - it can be output buffering in shell though.
  
  Did you receive many packets? Seems like when I receive just 1 or 2 pkgs
  I get the empty log. If I strace ./sniff  log I see that recvfrom gets
  pkgs, but there are no trace of writes. I guess this
  is a bash(3.2_p17) or glibc(2.5.-r4) bug?
 
 I received 1396 bytes of logs before terminated, which is 27 ARP packets, 
 so there is quite big number of packet there. 
 Your application works correctly (although you swapped source and
 destination ethernet fields) - buffered writing is not a bug,
 if you do not like it, use write(2), mmap(2) or turn buffering off as
 Herbert suggested. To get packets with your own ethernet protocol number
 you have to change eth_type_trans() function in kernel, which parses
 ethernet header and returns protocol number, under some conditions it
 will just return your number automatically, but you should check it.

I thought that flushing was done automatically when SIGINT happened
but I was apperently wrong. Sorry for the noise and thanks for your
help. I have added setvbuf calls to make it unbuffered.

 Jocke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


net-2.6.24 rebased...

2007-10-09 Thread David Miller

The TCP bug fix that went into Linus's tree today created some
conflicts with net-2.6.24, so I rebased:

kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.24.git

I took the opportunity to integrate fix patches into the patches which
introduced those bugs or build failures.

BTW, when the tree gets this big I tend to move the pre-rebase tree
to:

kernel.org:/pub/scm/linux/kernel/git/davem/bak-net-2.6.24.git

It can be useful if a mistake is made during a rebase and we need
to figure out what happened.

9.1MB, 739 changesets, keep those patches flowing!

Enjoy
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][NET-2.6.24] Remove double dev-flags checking when calling dev_close()

2007-10-09 Thread Pavel Emelyanov
The unregister_netdevice() and dev_change_net_namespace() 
both check for dev-flags to be IFF_UP before calling the 
dev_close(), but the dev_close() checks for IFF_UP itself, 
so remove those unneeded checks.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/core/dev.c b/net/core/dev.c
index e7e728a..1e169a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3893,8 +3893,7 @@ void unregister_netdevice(struct net_dev
BUG_ON(dev-reg_state != NETREG_REGISTERED);
 
/* If device is running, close it first. */
-   if (dev-flags  IFF_UP)
-   dev_close(dev);
+   dev_close(dev);
 
/* And unlink it from device chain. */
unlist_netdevice(dev);
@@ -4018,8 +4017,7 @@ int dev_change_net_namespace(struct net_
 */
 
/* If device is running close it first. */
-   if (dev-flags  IFF_UP)
-   dev_close(dev);
+   dev_close(dev);
 
/* And unlink it from device chain */
err = -ENODEV;

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
Hi Peter,

Waskiewicz Jr, Peter P [EMAIL PROTECTED] wrote on
10/09/2007 04:03:42 AM:

  true, that needs some resolution. Heres a hand-waving thought:
  Assuming all packets of a specific map end up in the same
  qdiscn queue, it seems feasible to ask the qdisc scheduler to
  give us enough packages (ive seen people use that terms to
  refer to packets) for each hardware ring's available space.
  With the patches i posted, i do that via
  dev-xmit_win that assumes only one view of the driver; essentially a
  single ring.
  If that is doable, then it is up to the driver to say i have
  space for 5 in ring[0], 10 in ring[1] 0 in ring[2] based on
  what scheduling scheme the driver implements - the dev-blist
  can stay the same. Its a handwave, so there may be issues
  there and there could be better ways to handle this.
 
  Note: The other issue that needs resolving that i raised
  earlier was in regards to multiqueue running on multiple cpus
  servicing different rings concurently.

 I can see the qdisc being modified to send batches per queue_mapping.
 This shouldn't be too difficult, and if we had the xmit_win per queue
 (in the subqueue struct like Dave pointed out).

I hope my understanding of multiqueue is correct for this mail to make
sense :-)

Isn't it enough that the multiqueue+batching drivers handle skbs
belonging to different queue's themselves, instead of qdisc having
to figure that out? This will reduce costs for most skbs that are
neither batched nor sent to multiqueue devices.

Eg, driver can keep processing skbs and put to the correct tx_queue
as long as mapping remains the same. If the mapping changes, it posts
earlier skbs (with the correct lock) and then iterates for the other
skbs that have the next different mapping, and so on.

(This is required only if driver is supposed to transmit 1 skb in one
call, otherwise it is not an issue)

Alternatively, supporting drivers could return a different code on
mapping change, like: NETDEV_TX_MAPPING_CHANGED (for batching only)
so that qdisc_run() could retry. Would that work?

Secondly having xmit_win per queue: would it help in multiple skb
case? Currently there is no way to tell qdisc to dequeue skbs from
a particular band - it returns skb from highest priority band.

thanks,

- KK

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:28:27 +0530

 Isn't it enough that the multiqueue+batching drivers handle skbs
 belonging to different queue's themselves, instead of qdisc having
 to figure that out? This will reduce costs for most skbs that are
 neither batched nor sent to multiqueue devices.
 
 Eg, driver can keep processing skbs and put to the correct tx_queue
 as long as mapping remains the same. If the mapping changes, it posts
 earlier skbs (with the correct lock) and then iterates for the other
 skbs that have the next different mapping, and so on.

The complexity in most of these suggestions is beginning to drive me a
bit crazy :-)

This should be the simplest thing in the world, when TX queue has
space, give it packets.  Period.

When I hear suggestions like have the driver pick the queue in
-hard_start_xmit() and return some special status if the queue
becomes different.  you know, I really begin to wonder :-)

If we have to go back, get into the queueing layer locks, have these
special cases, and whatnot, what's the point?

This code should eventually be able to run lockless all the way to the
TX queue handling code of the driver.  The queueing code should know
what TX queue the packet will be bound for, and always precisely
invoke the driver in a state where the driver can accept the packet.

Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
Hi Dave,

David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

  Isn't it enough that the multiqueue+batching drivers handle skbs
  belonging to different queue's themselves, instead of qdisc having
  to figure that out? This will reduce costs for most skbs that are
  neither batched nor sent to multiqueue devices.
 
  Eg, driver can keep processing skbs and put to the correct tx_queue
  as long as mapping remains the same. If the mapping changes, it posts
  earlier skbs (with the correct lock) and then iterates for the other
  skbs that have the next different mapping, and so on.

 The complexity in most of these suggestions is beginning to drive me a
 bit crazy :-)

 This should be the simplest thing in the world, when TX queue has
 space, give it packets.  Period.

 When I hear suggestions like have the driver pick the queue in
 -hard_start_xmit() and return some special status if the queue
 becomes different.  you know, I really begin to wonder :-)

 If we have to go back, get into the queueing layer locks, have these
 special cases, and whatnot, what's the point?

I understand your point, but the qdisc code itself needs almost no
change, as small as:

qdisc_restart()
{
  ...
  case NETDEV_TX_MAPPING_CHANGED:
  /*
   * Driver sent some skbs from one mapping, and found others
   * are for different queue_mapping. Try again.
   */
  ret = 1; /* guaranteed to have atleast 1 skb in batch list */
  break;
  ...
}

Alternatively if the driver does all the dirty work, qdisc needs no
change at all. However, I am not sure if this addresses all the
concerns raised by you, Peter, Jamal, others.

 This code should eventually be able to run lockless all the way to the
 TX queue handling code of the driver.  The queueing code should know
 what TX queue the packet will be bound for, and always precisely
 invoke the driver in a state where the driver can accept the packet.

This sounds like a good idea :)

I need to think more on this, esp as my batching sends multiple skbs of
possibly different mappings to device, and those skbs stay in batch list
if driver couldn't send them out.

thanks,

- KK

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Krishna Kumar2
David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:

 Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
 it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.

thanks,

- KK

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530

 David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
 
  Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
  it.
 
 Great, this will make life easy. Any idea how long that would take?
 It seems simple enough to do.

I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NET-2.6.24] Remove double dev-flags checking when calling dev_close()

2007-10-09 Thread Jeff Garzik

Pavel Emelyanov wrote:
The unregister_netdevice() and dev_change_net_namespace() 
both check for dev-flags to be IFF_UP before calling the 
dev_close(), but the dev_close() checks for IFF_UP itself, 
so remove those unneeded checks.


Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/core/dev.c b/net/core/dev.c
index e7e728a..1e169a5 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3893,8 +3893,7 @@ void unregister_netdevice(struct net_dev
BUG_ON(dev-reg_state != NETREG_REGISTERED);
 
 	/* If device is running, close it first. */

-   if (dev-flags  IFF_UP)
-   dev_close(dev);
+   dev_close(dev);
 
 	/* And unlink it from device chain. */

unlist_netdevice(dev);
@@ -4018,8 +4017,7 @@ int dev_change_net_namespace(struct net_
 */
 
 	/* If device is running close it first. */

-   if (dev-flags  IFF_UP)
-   dev_close(dev);
+   dev_close(dev);
 
 	/* And unlink it from device chain */

err = -ENODEV;


One side effect of this patch:  might_sleep() is now called unconditionally.

If that is irrelevant... ACK.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NET-2.6.24] Remove double dev-flags checking when calling dev_close()

2007-10-09 Thread Pavel Emelyanov
Jeff Garzik wrote:
 Pavel Emelyanov wrote:
 The unregister_netdevice() and dev_change_net_namespace() 
 both check for dev-flags to be IFF_UP before calling the 
 dev_close(), but the dev_close() checks for IFF_UP itself, 
 so remove those unneeded checks.

 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

 ---

 diff --git a/net/core/dev.c b/net/core/dev.c
 index e7e728a..1e169a5 100644
 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -3893,8 +3893,7 @@ void unregister_netdevice(struct net_dev
  BUG_ON(dev-reg_state != NETREG_REGISTERED);
  
  /* If device is running, close it first. */
 -if (dev-flags  IFF_UP)
 -dev_close(dev);
 +dev_close(dev);
  
  /* And unlink it from device chain. */
  unlist_netdevice(dev);
 @@ -4018,8 +4017,7 @@ int dev_change_net_namespace(struct net_
   */
  
  /* If device is running close it first. */
 -if (dev-flags  IFF_UP)
 -dev_close(dev);
 +dev_close(dev);
  
  /* And unlink it from device chain */
  err = -ENODEV;
 
 One side effect of this patch:  might_sleep() is now called unconditionally.

That's not a big deal - see, the synchronize_net() is called further
in both places unconditionally (!) and also calls might_sleep(), so
this is OK.

 If that is irrelevant... ACK.

Thanks :)

 
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][NETNS] Don't memset() netns to zero manually

2007-10-09 Thread Pavel Emelyanov
The newly created net namespace is set to 0 with memset()
in setup_net(). The setup_net() is also called for the
init_net_ns(), which is zeroed naturally as a global var.

So remove this memset and allocate new nets with the
kmem_cache_zalloc().

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 0e0ca6d..6f71db8 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -24,7 +24,7 @@ EXPORT_SYMBOL_GPL(init_net);
 
 static struct net *net_alloc(void)
 {
-   return kmem_cache_alloc(net_cachep, GFP_KERNEL);
+   return kmem_cache_zalloc(net_cachep, GFP_KERNEL);
 }
 
 static void net_free(struct net *net)
@@ -90,7 +90,6 @@ static int setup_net(struct net *net)
struct pernet_operations *ops;
int error;
 
-   memset(net, 0, sizeof(struct net));
atomic_set(net-count, 1);
atomic_set(net-use_count, 0);
 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection

2007-10-09 Thread Ilpo Järvinen
On Tue, 9 Oct 2007, TAKANO Ryousei wrote:

 From: Ilpo Järvinen [EMAIL PROTECTED]
 Subject: Re: [RFC][PATCH 1/2] TCP: fix lost retransmit detection
 Date: Mon, 8 Oct 2007 14:11:55 +0300 (EEST)
 
  On Sun, 7 Oct 2007, TAKANO Ryousei wrote:
  
   From: Ilpo Järvinen [EMAIL PROTECTED]
   
Just couple of thoughts, not that this change itself is incorrect...

...Added this line back, it's important. :-)

In case sacktag uses fastpath, this code won't be executed for the 
skb's 
that we would like to check (those with SACKED_RETRANS set, that are 
below the fastpath_skb_hint). We will eventually deal with the whole 
queue 
when fastpath_skb_hint gets set to NULL, with the next cumulative ACK 
that 
fully ACKs an skb at the latest. Maybe there's a need for a larger 
surgery 
than this to fix it. I think we need additional field to tcp_sock to 
avoid 
doing a full-walk per ACK:
  
   I think the problem occurs in slowpath. For example, in case when the 
   receiver
   detects and sends back a new SACK block, the sender may fail to detect 
   loss
   of a retransmitted packet.
  
  ...No, the slow path is very likely to _correct_ the problem! The problem 
  occurs because fastpath _skips_ processing of skbs below fastpath_skb_hint 
  whereas slowpath starts from tcp_write_queue_head. The retransmitted skbs 
  we're interested in reside exactly on that skipped range.
  
 Sorry, my explanaton is insufficient.
 This problem occurs even if we are in the slowpath.

 [...long explination snip...]

 Is it corrent?

Yes, we're dealing with two related, but different problems here. I 
understood well what you're trying to fix, and even acknowledged your 
change (see what I added back from my original reply above). The far
worse problem which I describe, however, occurs with fastpath only (well, 
very unlikely to occur with slow-path but it could in theory), has been 
like this since the hints were added. And it's preventing the execution
of this portion you fixed (until slowpath is taken).

...What makes the other problem even nastier is the fact that it
becomes more and more significant when fastpath (or it's equivalent)
skips more and more skb processing. It's good that we noticed it
now...

I tried to do a fix to that other problem, it seems that the solution
will be intertwined with the problem you're describing so that I in
fact ended removing the code block where your key modification is and
placing  somewhat similar sequence gathering elsewhere... Posting it 
shortly.

-- 
 i.

[RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness

2007-10-09 Thread Ilpo Järvinen
Lost_retrans handling of sacktag was found to be flawed, two
problems that were found have an intertwined solution. Fastpath
problem has existed since hints got added and the other problem
has probably been there even longer than that. ...This change
may add non-trivial processing cost.

Initial sketch, only compile tested. This will become more and
more useful, when sacktag starts to process less and less skbs,
which hopefully happens quite soon... :-) Sadly enough it will
probably then be consuming part of the benefits we're able to
achieve by less skb walking...

First one is trivial, so Dave might want to apply it already.

--
 i.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] [TCP]: Separate lost_retrans loop into own function

2007-10-09 Thread Ilpo Järvinen
Follows own function for each task principle, this is really
somewhat separate task being done in sacktag. Also reduces
indentation.

In addition, added ack_seq local var to break some long
lines  fixed coding style things.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |   80 +++---
 1 files changed, 43 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a04e78e..56aa34a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1106,6 +1106,47 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, 
int is_dsack,
return !before(start_seq, end_seq - tp-max_window);
 }
 
+/* Check for lost retransmit. This superb idea is borrowed from ratehalving.
+ * Event C. Later note: FACK people cheated me again 8), we have to account
+ * for reordering! Ugly, but should help.
+ */
+static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+   struct sk_buff *skb;
+   int flag = 0;
+
+   tcp_for_write_queue(skb, sk) {
+   u32 ack_seq = TCP_SKB_CB(skb)-ack_seq;
+
+   if (skb == tcp_send_head(sk))
+   break;
+   if (after(TCP_SKB_CB(skb)-seq, lost_retrans))
+   break;
+   if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una))
+   continue;
+
+   if ((TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_RETRANS) 
+   after(lost_retrans, ack_seq) 
+   (tcp_is_fack(tp) ||
+!before(lost_retrans,
+ack_seq + tp-reordering * tp-mss_cache))) {
+   TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_RETRANS;
+   tp-retrans_out -= tcp_skb_pcount(skb);
+
+   /* clear lost hint */
+   tp-retransmit_skb_hint = NULL;
+
+   if (!(TCP_SKB_CB(skb)-sacked  
(TCPCB_LOST|TCPCB_SACKED_ACKED))) {
+   tp-lost_out += tcp_skb_pcount(skb);
+   TCP_SKB_CB(skb)-sacked |= TCPCB_LOST;
+   flag |= FLAG_DATA_SACKED;
+   NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
+   }
+   }
+   }
+   return flag;
+}
 
 static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb,
   struct tcp_sack_block_wire *sp, int num_sacks,
@@ -1422,43 +1463,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff 
*ack_skb, u32 prior_snd_
}
}
 
-   /* Check for lost retransmit. This superb idea is
-* borrowed from ratehalving. Event C.
-* Later note: FACK people cheated me again 8),
-* we have to account for reordering! Ugly,
-* but should help.
-*/
-   if (lost_retrans  icsk-icsk_ca_state == TCP_CA_Recovery) {
-   struct sk_buff *skb;
-
-   tcp_for_write_queue(skb, sk) {
-   if (skb == tcp_send_head(sk))
-   break;
-   if (after(TCP_SKB_CB(skb)-seq, lost_retrans))
-   break;
-   if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una))
-   continue;
-   if ((TCP_SKB_CB(skb)-sackedTCPCB_SACKED_RETRANS) 
-   after(lost_retrans, TCP_SKB_CB(skb)-ack_seq) 
-   (tcp_is_fack(tp) ||
-!before(lost_retrans,
-TCP_SKB_CB(skb)-ack_seq + tp-reordering *
-tp-mss_cache))) {
-   TCP_SKB_CB(skb)-sacked = 
~TCPCB_SACKED_RETRANS;
-   tp-retrans_out -= tcp_skb_pcount(skb);
-
-   /* clear lost hint */
-   tp-retransmit_skb_hint = NULL;
-
-   if 
(!(TCP_SKB_CB(skb)-sacked(TCPCB_LOST|TCPCB_SACKED_ACKED))) {
-   tp-lost_out += tcp_skb_pcount(skb);
-   TCP_SKB_CB(skb)-sacked |= TCPCB_LOST;
-   flag |= FLAG_DATA_SACKED;
-   
NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
-   }
-   }
-   }
-   }
+   if (lost_retrans  icsk-icsk_ca_state == TCP_CA_Recovery)
+   flag |= tcp_mark_lost_retrans(sk, lost_retrans);
 
tcp_verify_left_out(tp);
 
-- 
1.5.0.6

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH] [TCP]: Fix lost_retrans loop vs fastpath problems

2007-10-09 Thread Ilpo Järvinen
Detection implemented with lost_retrans must work also when
fastpath is taken, yet most of the queue is skipped including
(very likely) those retransmitted skb's we're interested in.
This problem appeared when the hints got added, which removed
a need to always walk over the whole write queue head.
Therefore decicion for the lost_retrans worker loop entry must
be separated from the sacktag processing more than it was
necessary before.

It turns out to be problematic to optimize the worker loop
very heavily because ack_seqs of skb may have a number of
discontinuity points. Maybe similar approach as currently is
implemented could be attempted but that's becoming more and
more complex because the trend is towards less skb walking
in sacktag marker.

Maybe after(highest_sack_end_seq, tp-high_seq) checking is not
sufficiently accurate and causes entry too often in no-work-to-do
cases. Since that's not known, I've separated solution to that
from this patch.

Noticed because of report against a related problem from TAKANO
Ryousei [EMAIL PROTECTED]. He also provided a patch to
that part of the problem. This patch includes solution to it
(though this patch has to use somewhat different placement).
TAKANO's description and patch is available here:

  http://marc.info/?l=linux-netdevm=119149311913288w=2

...In short, TAKANO's problem is that end_seq the loop is using
not necessarily the largest SACK block's end_seq because the
current ACK may still have higher SACK blocks which are later
by the loop.

Cc: TAKANO Ryousei [EMAIL PROTECTED]
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |   37 ++---
 1 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 56aa34a..b90c2fc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1109,27 +1109,34 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, 
int is_dsack,
 /* Check for lost retransmit. This superb idea is borrowed from ratehalving.
  * Event C. Later note: FACK people cheated me again 8), we have to account
  * for reordering! Ugly, but should help.
+ *
+ * Search retransmitted skbs from write_queue that were sent when snd_nxt was
+ * less than what is now known to be received by the other end (derived from
+ * SACK blocks by the caller).
  */
-static int tcp_mark_lost_retrans(struct sock *sk, u32 lost_retrans)
+static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
 {
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
int flag = 0;
+   int cnt = 0;
 
tcp_for_write_queue(skb, sk) {
u32 ack_seq = TCP_SKB_CB(skb)-ack_seq;
 
if (skb == tcp_send_head(sk))
break;
-   if (after(TCP_SKB_CB(skb)-seq, lost_retrans))
+   if (cnt == tp-retrans_out)
break;
if (!after(TCP_SKB_CB(skb)-end_seq, tp-snd_una))
continue;
 
-   if ((TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_RETRANS) 
-   after(lost_retrans, ack_seq) 
+   if (!(TCP_SKB_CB(skb)-sacked  TCPCB_SACKED_RETRANS))
+   continue;
+
+   if (after(received_upto, ack_seq) 
(tcp_is_fack(tp) ||
-!before(lost_retrans,
+!before(received_upto,
 ack_seq + tp-reordering * tp-mss_cache))) {
TCP_SKB_CB(skb)-sacked = ~TCPCB_SACKED_RETRANS;
tp-retrans_out -= tcp_skb_pcount(skb);
@@ -1143,6 +1150,8 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 
lost_retrans)
flag |= FLAG_DATA_SACKED;
NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
}
+   } else {
+   cnt += tcp_skb_pcount(skb);
}
}
return flag;
@@ -1193,7 +1202,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff 
*ack_skb, u32 prior_snd_
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)3;
int reord = tp-packets_out;
int prior_fackets;
-   u32 lost_retrans = 0;
+   u32 highest_sack_end_seq = 0;
int flag = 0;
int found_dup_sack = 0;
int cached_fack_count;
@@ -1383,11 +1392,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff 
*ack_skb, u32 prior_snd_
continue;
}
 
-   if ((sackedTCPCB_SACKED_RETRANS) 
-   after(end_seq, TCP_SKB_CB(skb)-ack_seq) 
-   (!lost_retrans || after(end_seq, lost_retrans)))
-   lost_retrans = end_seq;
-
if (!in_sack)
continue;
 
@@ -1441,9 +1445,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff 
*ack_skb, 

[RFC PATCH] [TCP]: Limit processing lost_retrans loop to work-to-do cases

2007-10-09 Thread Ilpo Järvinen
This addition of lost_retrans_low to tcp_sock might be
unnecessary, it's not clear how often lost_retrans worker is
executed when there wasn't work to do.

Cc: TAKANO Ryousei [EMAIL PROTECTED]
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 include/linux/tcp.h   |2 ++
 net/ipv4/tcp_input.c  |   15 +++
 net/ipv4/tcp_output.c |2 ++
 3 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9ff456e..c5b94c1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -348,6 +348,8 @@ struct tcp_sock {
int lost_cnt_hint;
int retransmit_cnt_hint;
 
+   u32 lost_retrans_low;   /* Sent seq after any rxmit (lowest) */
+
u16 advmss; /* Advertised MSS   */
u16 prior_ssthresh; /* ssthresh saved at recovery start */
u32 lost_out;   /* Lost packets */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b90c2fc..e7bc720 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1112,7 +1112,8 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, 
int is_dsack,
  *
  * Search retransmitted skbs from write_queue that were sent when snd_nxt was
  * less than what is now known to be received by the other end (derived from
- * SACK blocks by the caller).
+ * SACK blocks by the caller). Also calculate the lowest snd_nxt among the
+ * remaining retransmitted skbs to avoid some costly processing per ACKs.
  */
 static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
 {
@@ -1120,6 +1121,7 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 
received_upto)
struct sk_buff *skb;
int flag = 0;
int cnt = 0;
+   u32 new_low_seq = 0;
 
tcp_for_write_queue(skb, sk) {
u32 ack_seq = TCP_SKB_CB(skb)-ack_seq;
@@ -1150,10 +1152,15 @@ static int tcp_mark_lost_retrans(struct sock *sk, u32 
received_upto)
flag |= FLAG_DATA_SACKED;
NET_INC_STATS_BH(LINUX_MIB_TCPLOSTRETRANSMIT);
}
-   } else {
+   } else if (!new_low_seq || before(ack_seq, new_low_seq)) {
+   new_low_seq = ack_seq;
cnt += tcp_skb_pcount(skb);
}
}
+
+   if (tp-retrans_out)
+   tp-lost_retrans_low = new_low_seq;
+
return flag;
 }
 
@@ -1468,8 +1475,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff 
*ack_skb, u32 prior_snd_
}
}
 
-   if (tp-retrans_out  highest_sack_end_seq 
-   after(highest_sack_end_seq, tp-high_seq) 
+   if (tp-retrans_out 
+   after(highest_sack_end_seq, tp-lost_retrans_low) 
icsk-icsk_ca_state == TCP_CA_Recovery)
flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5329675..324b420 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1914,6 +1914,8 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb)
printk(KERN_DEBUG retrans_out leaked.\n);
}
 #endif
+   if (!tp-retrans_out)
+   tp-lost_retrans_low = tp-snd_nxt;
TCP_SKB_CB(skb)-sacked |= TCPCB_RETRANS;
tp-retrans_out += tcp_skb_pcount(skb);
 
-- 
1.5.0.6

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread Pavel Emelyanov
Currently indexes for netdevices come sequentially one by
one, and the same stays true even for devices that are 
created for namespaces.

Side effects of this are:
 * lo device has not 1 index in a namespace. This may break
   some userspace that relies on it (and AFAIR something
   really broke in OpenVZ VEs without this);
 * after some time namespaces will have devices with indexes
   like 100 os similar. This might be confusing for a
   human (tools will not mind).

So move the (currently global and static) ifindex variable
on the struct net, making the indexes allocation look more
like on a standalone machine.

Moreover - when we have indexes intersect between namespaces,
we may catch more BUGs in the future related to wrong device 
was found for a given index.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 93aa87d..83a18d0 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -29,6 +29,8 @@ struct net {
struct list_headdev_base_head;
struct hlist_head   *dev_name_head;
struct hlist_head   *dev_index_head;
+
+   int ifindex;
 };
 
 #ifdef CONFIG_NET
diff --git a/net/core/dev.c b/net/core/dev.c
index e7e728a..a08ed8c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3443,12 +3443,11 @@ int dev_ioctl(struct net *net, unsigned 
  */
 static int dev_new_index(struct net *net)
 {
-   static int ifindex;
for (;;) {
-   if (++ifindex = 0)
-   ifindex = 1;
-   if (!__dev_get_by_index(net, ifindex))
-   return ifindex;
+   if (++net-ifindex = 0)
+   net-ifindex = 1;
+   if (!__dev_get_by_index(net, net-ifindex))
+   return net-ifindex;
}
 }
 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

David Miller wrote:

From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530


David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:


Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.


I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)


What, like this?  :)

Jeff




 drivers/net/atl1/atl1_main.c   |   16 +---
 drivers/net/chelsio/cxgb2.c|1 -
 drivers/net/chelsio/sge.c  |   20 +---
 drivers/net/e1000/e1000_main.c |6 +-
 drivers/net/ixgb/ixgb_main.c   |   24 
 drivers/net/pasemi_mac.c   |2 +-
 drivers/net/rionet.c   |   19 +++
 drivers/net/spider_net.c   |2 +-
 drivers/net/sungem.c   |   17 ++---
 drivers/net/tehuti.c   |   12 +---
 drivers/net/tehuti.h   |3 +--
 11 files changed, 32 insertions(+), 90 deletions(-)

diff --git a/drivers/net/atl1/atl1_main.c b/drivers/net/atl1/atl1_main.c
index 4c728f1..03e94fe 100644
--- a/drivers/net/atl1/atl1_main.c
+++ b/drivers/net/atl1/atl1_main.c
@@ -1665,10 +1665,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct 
net_device *netdev)
 
len -= skb-data_len;
 
-   if (unlikely(skb-len == 0)) {
-   dev_kfree_skb_any(skb);
-   return NETDEV_TX_OK;
-   }
+   WARN_ON(skb-len == 0);
 
param.data = 0;
param.tso.tsopu = 0;
@@ -1703,11 +1700,7 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct 
net_device *netdev)
}
}
 
-   if (!spin_trylock_irqsave(adapter-lock, flags)) {
-   /* Can't get lock - tell upper layer to requeue */
-   dev_printk(KERN_DEBUG, adapter-pdev-dev, tx locked\n);
-   return NETDEV_TX_LOCKED;
-   }
+   spin_lock_irqsave(adapter-lock, flags);
 
if (atl1_tpd_avail(adapter-tpd_ring)  count) {
/* not enough descriptors */
@@ -1749,8 +1742,11 @@ static int atl1_xmit_frame(struct sk_buff *skb, struct 
net_device *netdev)
atl1_tx_map(adapter, skb, 1 == val);
atl1_tx_queue(adapter, count, param);
netdev-trans_start = jiffies;
+
spin_unlock_irqrestore(adapter-lock, flags);
+
atl1_update_mailbox(adapter);
+
return NETDEV_TX_OK;
 }
 
@@ -2301,8 +2297,6 @@ static int __devinit atl1_probe(struct pci_dev *pdev,
 */
/* netdev-features |= NETIF_F_TSO; */
 
-   netdev-features |= NETIF_F_LLTX;
-
/*
 * patch for some L1 of old version,
 * the final version of L1 may not need these
diff --git a/drivers/net/chelsio/cxgb2.c b/drivers/net/chelsio/cxgb2.c
index 2dbf8dc..0aba7e7 100644
--- a/drivers/net/chelsio/cxgb2.c
+++ b/drivers/net/chelsio/cxgb2.c
@@ -1084,7 +1084,6 @@ static int __devinit init_one(struct pci_dev *pdev,
netdev-mem_end = mmio_start + mmio_len - 1;
netdev-priv = adapter;
netdev-features |= NETIF_F_SG | NETIF_F_IP_CSUM;
-   netdev-features |= NETIF_F_LLTX;
 
adapter-flags |= RX_CSUM_ENABLED | TCP_CSUM_CAPABLE;
if (pci_using_dac)
diff --git a/drivers/net/chelsio/sge.c b/drivers/net/chelsio/sge.c
index ffa7e64..84f5869 100644
--- a/drivers/net/chelsio/sge.c
+++ b/drivers/net/chelsio/sge.c
@@ -1739,8 +1739,7 @@ static int t1_sge_tx(struct sk_buff *skb, struct adapter 
*adapter,
struct cmdQ *q = sge-cmdQ[qid];
unsigned int credits, pidx, genbit, count, use_sched_skb = 0;
 
-   if (!spin_trylock(q-lock))
-   return NETDEV_TX_LOCKED;
+   spin_lock(q-lock);
 
reclaim_completed_tx(sge, q);
 
@@ -1817,12 +1816,12 @@ use_sched:
}
 
if (use_sched_skb) {
-   if (spin_trylock(q-lock)) {
-   credits = q-size - q-in_use;
-   skb = NULL;
-   goto use_sched;
-   }
+   spin_lock(q-lock);
+   credits = q-size - q-in_use;
+   skb = NULL;
+   goto use_sched;
}
+
return NETDEV_TX_OK;
 }
 
@@ -1977,13 +1976,12 @@ static void sge_tx_reclaim_cb(unsigned long data)
for (i = 0; i  SGE_CMDQ_N; ++i) {
struct cmdQ *q = sge-cmdQ[i];
 
-   if (!spin_trylock(q-lock))
-   continue;
+   spin_lock(q-lock);
 
reclaim_completed_tx(sge, q);
-   if (i == 0  q-in_use) {/* flush pending credits */
+   if (i == 0  q-in_use)/* flush pending credits */
writel(F_CMDQ0_ENABLE, 

ax88796: add superh to kconfig depencencies

2007-10-09 Thread Magnus Damm
ax88796: add superh to kconfig depencencies

This patch adds sh architecture support to the ax88796 kconfig.

Signed-off-by: Magnus Damm [EMAIL PROTECTED]
---

 This is a broken out version of the larger patch recently posted to netdev:
 http://www.mail-archive.com/netdev@vger.kernel.org/msg47278.html

 drivers/net/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- 0002/drivers/net/Kconfig
+++ work/drivers/net/Kconfig2007-10-09 19:37:27.0 +0900
@@ -218,7 +218,7 @@ source drivers/net/arm/Kconfig
 
 config AX88796
tristate ASIX AX88796 NE2000 clone support
-   depends on ARM || MIPS
+   depends on ARM || MIPS || SUPERH
select CRC32
select MII
help
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Herbert Xu
On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote:
 David Miller wrote:
 
 I can just threaten to do them all and that should get the driver
 maintainers going :-)
 
 What, like this?  :)

Awsome :)
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

Herbert Xu wrote:

On Tue, Oct 09, 2007 at 08:44:25AM -0400, Jeff Garzik wrote:

David Miller wrote:

I can just threaten to do them all and that should get the driver
maintainers going :-)

What, like this?  :)


Awsome :)


Note my patch is just to get the maintainers going.  :)  I'm not going 
to commit that, since I don't have any way to test any of the drivers I 
touched (but I wouldn't scream if it appeared in net-2.6.24 either)


Jeff


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-2.6.24 0/3]: Attempt to fix lost_retrans brokeness

2007-10-09 Thread Ilpo Järvinen
On Tue, 9 Oct 2007, Ilpo Järvinen wrote:

 Lost_retrans handling of sacktag was found to be flawed, two
 problems that were found have an intertwined solution. Fastpath
 problem has existed since hints got added and the other problem
 has probably been there even longer than that. ...This change
 may add non-trivial processing cost.
 
 Initial sketch, only compile tested. This will become more and
 more useful, when sacktag starts to process less and less skbs,
 which hopefully happens quite soon... :-) Sadly enough it will
 probably then be consuming part of the benefits we're able to
 achieve by less skb walking...
 
 First one is trivial, so Dave might want to apply it already.

Hmm, forgot to add -n to git-format-patch. Since it's currently
RFC, I won't bother to resubmit with numbers unless somebody
really wants that. Here's the correct ordering, if it's not
obvious from the patches alone:

$ git-log -n 3 --pretty=oneline HEAD | cat
c628f5040cf31323f9166d41cb46070aa7be8cc5 [TCP]: Limit processing lost_retrans 
loop to work-to-do cases
b7388980b7bfa1b9159a65a81b9501df43c67b08 [TCP]: Fix lost_retrans loop vs 
fastpath problems
a1c39117034c58bf786a5941dd7e9a5968439007 [TCP]: Separate lost_retrans loop into 
own function


-- 
 i.

Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 08:39 +0530, Krishna Kumar2 wrote:

 Driver might ask for 10 and we send 10, but LLTX driver might fail to get
 lock and return TX_LOCKED. I haven't seen your code in greater detail, but
 don't you requeue in that case too?

For others drivers that are non-batching and LLTX, it is possible - at
the moment in my patch i whine that the driver is buggy. I will fix this
up so it checks for NETIF_F_BTX. Thanks for pointing the above use case.

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHES] TX batching

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 13:44 +0530, Krishna Kumar2 wrote:

 My feeling is that since the approaches are very different, 

My concern is the approaches are different only for short periods of
time. For example, I do requeueing, have xmit_win, have -end_xmit,
do batching from core etc; if you see value in any of these concepts,
they will appear in your patches and this goes on a loop. Perhaps what
we need is a referee and use our energies in something more positive.

 it would be a good idea to test the two for performance. 

Which i dont mind as long as it has an analysis that goes with it.
If all you post is heres what netperf showed, it is not useful at all.
There are also a lot of affecting variables. For example, is the
receiver a bottleneck? To make it worse, I could demonstrate to you that
if i slowed down the driver and allowed more packets to queue up on the
qdisc, batching will do well. In the past my feeling is you glossed over
such details and i am sucker for things like that - hence the conflict.

 Do you mind me doing
 that? Ofcourse others and/or you are more than welcome to do the same.
 
 I had sent a note to you yesterday about this, please let me know
 either way.
 

I responded to you - but it may have been lost in the noise; heres a
copy:
http://marc.info/?l=linux-netdevm=119185137124008w=2

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][E1000E] some cleanups

2007-10-09 Thread jamal
On Mon, 2007-08-10 at 15:40 -0700, Kok, Auke wrote:

 
 My biggest problem with the patch as you sent it that it's a tonload of 
 changes
 and no implicit benefit immediately as I can see. 

The patch looks scary but is pretty tame when you apply it and stare at
it.

 I would really have to see the
 LLTX change as well to make a wellfounded decision whether it is the right 
 thing
 to go and overhaul this part of the driver or not :)
 
 I'm not particularly against the changes per se though.
 
 so please, if needed offlist send me those LLTX changes as well.

Dont worry about it - i didnt get any love for a similar patch i did for
tg3 either ;- I will hold onto it for now. I could do the un-LLTX
outside of this - would you like that for both e1000/e1000e ?

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][NET_SCHED] explict hold dev tx lock

2007-10-09 Thread jamal
On Tue, 2007-09-10 at 12:00 +0800, Herbert Xu wrote:

 
 OK, after waking up a bit more 

me too;- 

 What I'm worried about is would we see worse behaviour with
 drivers that do all their TX clean-up with the TX lock held

Good point Herbert.
When i looked around i only found one driver that behaved like that;
some IBM mainframe one from the looks of it. That driver did a lot of
other obscure things (i think they maintain their own napi calls etc),
so it didnt worry me very much. 
IIRC, I think it could be fixed to do what tg3 and relatives like bnx do
(I really like the approach) - just didnt have the time to chase it. 
There are a _lot_ more drivers that have no respect for netif_tx_lock
and implement their own locking down in the driver. Those already suffer
from the phenomena you describe whether TX_LOCK is held or not.

 (which would cause qdisc_restart to spin while this is happening)?

Yes, with such a driver, we spin in the worst case. But that provides
opportunities for optimization of a driver behaving that way; two
approaches off top of my head:
a) we could prune the tx descriptors on the the tx path since it is safe
to do so and reduce the amount of time spent doing such work in the napi
poll
or
b) Have the napi side do a trylock (sort of what the e1000 attempts to
do) and reschedule the poll to retry.

#b is fair because the cost of a queue_lock goes up as the number of
cpus goes up (which is the case i was optimizing for). The cost of tx
lock is contention by two cpus in worst case. Thoughts?

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

2007-10-09 Thread James Lentini

On Mon, 8 Oct 2007, Steve Wise wrote:

 The correct solution, IMO, is to enhance the core low level 4-tuple 
 allocation services to be more generic (eg: not be tied to a struct 
 sock).  Then the host tcp stack and the host rdma stack can allocate 
 TCP/iWARP ports/4tuples from this common exported service and share 
 the port space.  This allocation service could also be used by other 
 deep adapters like iscsi adapters if needed.

As a developer of an RDMA ULP, NFS-RDMA, I like this approach because 
it will simplify the configuration of an RDMA device and the services 
that use it.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ax88796: add superh to kconfig depencencies

2007-10-09 Thread Paul Mundt
On Tue, Oct 09, 2007 at 09:51:38PM +0900, Magnus Damm wrote:
 ax88796: add superh to kconfig depencencies
 
 This patch adds sh architecture support to the ax88796 kconfig.
 
 Signed-off-by: Magnus Damm [EMAIL PROTECTED]

Acked-by: Paul Mundt [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ipv4: kernel panic when only one unsecured port available

2007-10-09 Thread Anton Arapov
Steps to reproduce:
Server:
  [EMAIL PROTECTED] ~]# cat /etc/exports
  /export *(ro,insecure)
// there is insecure ... I am using ports like 1024 to 61000
  [EMAIL PROTECTED] ~] service nfs restart 

Client:
  [EMAIL PROTECTED] ~]# echo 32768 32768  
/proc/sys/net/ipv4/ip_local_port_range
  32768   32768
// two same numbers, for ex 32769 32769 etc.
  [EMAIL PROTECTED] ~]# cat /proc/sys/net/ipv4/ip_local_port_range
  32768   32768
  [EMAIL PROTECTED] ~]# mount server:/export /import
  
Actual results:
  Kernel always panics 


[PATCH] ipv4: kernel panic when only one unsecured port available

  Patch prevents division by zero. Kernel panics if only one 
unsecured port available.

Signed-off-by: Anton Arapov [EMAIL PROTECTED]
---

 net/ipv4/inet_connection_sock.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index fbe7714..00ad079 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -80,7 +80,7 @@ int inet_csk_get_port(struct inet_hashinfo *hashinfo,
int low = sysctl_local_port_range[0];
int high = sysctl_local_port_range[1];
int remaining = (high - low) + 1;
-   int rover = net_random() % (high - low) + low;
+   int rover = net_random() % remaining + low;
 
do {
head = hashinfo-bhash[inet_bhashfn(rover, 
hashinfo-bhash_size)];

-- 
Anton Arapov, [EMAIL PROTECTED]
Kernel Development, Red Hat
GPG Key ID: 0x6FA8C812


pgp0y91epZLHV.pgp
Description: PGP signature


[0/7] IPsec: More work on async crypto on output

2007-10-09 Thread Herbert Xu
Hi Dave:

Here's another round of patches which ends with the moving down
of the state lock into x-type-output.  I need to stop getting
distracted by fixing bugs and concentrate on creating them :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] [IPSEC]: Remove bogus ref count in xfrm_secpath_reject

2007-10-09 Thread Herbert Xu
[IPSEC]: Remove bogus ref count in xfrm_secpath_reject

Constructs of the form

xfrm_state_hold(x);
foo(x);
xfrm_state_put(x);

tend to be broken because foo is either synchronous where this is totally
unnecessary or if foo is asynchronous then the reference count is in the
wrong spot.

In the case of xfrm_secpath_reject, the function is synchronous and therefore
we should just kill the reference count.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 net/xfrm/xfrm_policy.c |6 +-
 1 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 76f172f..af27c19 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1682,17 +1682,13 @@ static inline int
 xfrm_secpath_reject(int idx, struct sk_buff *skb, struct flowi *fl)
 {
struct xfrm_state *x;
-   int err;
 
if (!skb-sp || idx  0 || idx = skb-sp-len)
return 0;
x = skb-sp-xvec[idx];
if (!x-type-reject)
return 0;
-   xfrm_state_hold(x);
-   err = x-type-reject(x, skb, fl);
-   xfrm_state_put(x);
-   return err;
+   return x-type-reject(x, skb, fl);
 }
 
 /* When skb is transformed back to its native form, we have to
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] [IPSEC]: Store IPv6 nh pointer in mac_header on output

2007-10-09 Thread Herbert Xu
[IPSEC]: Store IPv6 nh pointer in mac_header on output

Current the x-mode-output functions store the IPv6 nh pointer in the
skb network header.  This is inconvenient because the network header then
has to be fixed up before the packet can leave the IPsec stack.  The mac
header field is unused on output so we can use that to store this instead.

This patch does that and removes the network header fix-up in xfrm_output.

It also uses ipv6_hdr where appropriate in the x-type-output functions.

There is also a minor clean-up in esp4 to make it use the same code as
esp6 to help any subsequent effort to merge the two.

Lastly it kills two redundant skb_set_* statements in BEET that were
simply copied over from transport mode.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 net/ipv4/esp4.c |2 +-
 net/ipv6/ah6.c  |6 +++---
 net/ipv6/esp6.c |7 ---
 net/ipv6/ipcomp6.c  |6 +++---
 net/ipv6/mip6.c |   12 ++--
 net/ipv6/xfrm6_mode_beet.c  |   16 +++-
 net/ipv6/xfrm6_mode_ro.c|   13 +++--
 net/ipv6/xfrm6_mode_transport.c |   13 +++--
 net/ipv6/xfrm6_mode_tunnel.c|   13 +++--
 net/ipv6/xfrm6_tunnel.c |2 +-
 net/xfrm/xfrm_output.c  |2 --
 11 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 0f62af9..ffd5653 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -59,7 +59,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
tail[clen - skb-len - 2] = (clen - skb-len) - 2;
pskb_put(skb, trailer, clen - skb-len);
 
-   __skb_push(skb, skb-data - skb_network_header(skb));
+   __skb_push(skb, -skb_network_offset(skb));
top_iph = ip_hdr(skb);
esph = (struct ip_esp_hdr *)(skb_network_header(skb) +
 top_iph-ihl * 4);
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index ae68a90..ff904a7 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -235,11 +235,11 @@ static int ah6_output(struct xfrm_state *x, struct 
sk_buff *skb)
char hdrs[0];
} *tmp_ext;
 
-   top_iph = (struct ipv6hdr *)skb-data;
+   top_iph = ipv6_hdr(skb);
top_iph-payload_len = htons(skb-len - sizeof(*top_iph));
 
-   nexthdr = *skb_network_header(skb);
-   *skb_network_header(skb) = IPPROTO_AH;
+   nexthdr = *skb_mac_header(skb);
+   *skb_mac_header(skb) = IPPROTO_AH;
 
/* When there are no extension headers, we only need to save the first
 * 8 bytes of the base IP header.
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 0c5fb81..9fc1940 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -88,11 +88,12 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff 
*skb)
tail[clen-skb-len - 2] = (clen - skb-len) - 2;
pskb_put(skb, trailer, clen - skb-len);
 
-   top_iph = (struct ipv6hdr *)__skb_push(skb, hdr_len);
+   __skb_push(skb, -skb_network_offset(skb));
+   top_iph = ipv6_hdr(skb);
esph = (struct ipv6_esp_hdr *)skb_transport_header(skb);
top_iph-payload_len = htons(skb-len + alen - sizeof(*top_iph));
-   *(skb_tail_pointer(trailer) - 1) = *skb_network_header(skb);
-   *skb_network_header(skb) = IPPROTO_ESP;
+   *(skb_tail_pointer(trailer) - 1) = *skb_mac_header(skb);
+   *skb_mac_header(skb) = IPPROTO_ESP;
 
esph-spi = x-id.spi;
esph-seq_no = htonl(XFRM_SKB_CB(skb)-seq);
diff --git a/net/ipv6/ipcomp6.c b/net/ipv6/ipcomp6.c
index 91b2a75..71a14c0 100644
--- a/net/ipv6/ipcomp6.c
+++ b/net/ipv6/ipcomp6.c
@@ -157,15 +157,15 @@ static int ipcomp6_output(struct xfrm_state *x, struct 
sk_buff *skb)
pskb_trim(skb, hdr_len + dlen + sizeof(struct ip_comp_hdr));
 
/* insert ipcomp header and replace datagram */
-   top_iph = (struct ipv6hdr *)skb-data;
+   top_iph = ipv6_hdr(skb);
 
top_iph-payload_len = htons(skb-len - sizeof(struct ipv6hdr));
 
ipch = (struct ipv6_comp_hdr *)start;
-   ipch-nexthdr = *skb_network_header(skb);
+   ipch-nexthdr = *skb_mac_header(skb);
ipch-flags = 0;
ipch-cpi = htons((u16 )ntohl(x-id.spi));
-   *skb_network_header(skb) = IPPROTO_COMP;
+   *skb_mac_header(skb) = IPPROTO_COMP;
 
 out_ok:
return 0;
diff --git a/net/ipv6/mip6.c b/net/ipv6/mip6.c
index 8a1399c..7261c29 100644
--- a/net/ipv6/mip6.c
+++ b/net/ipv6/mip6.c
@@ -153,11 +153,11 @@ static int mip6_destopt_output(struct xfrm_state *x, 
struct sk_buff *skb)
u8 nexthdr;
int len;
 
-   iph = (struct ipv6hdr *)skb-data;
+   iph = ipv6_hdr(skb);
iph-payload_len = htons(skb-len - sizeof(*iph));
 
-   nexthdr = *skb_network_header(skb);
-   *skb_network_header(skb) = IPPROTO_DSTOPTS;
+   nexthdr = *skb_mac_header(skb);
+   *skb_mac_header(skb) = IPPROTO_DSTOPTS;
 
dstopt = (struct 

[PATCH 3/7] [IPSEC]: Remove gratuitous km wake-up events on ACQUIRE

2007-10-09 Thread Herbert Xu
[IPSEC]: Remove gratuitous km wake-up events on ACQUIRE

There is no point in waking people up when creating/updating larval states
because they'll just go back to sleep again as larval states by definition
cannot be found by xfrm_state_find.

We should only wake them up when the larvals mature or die.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 net/xfrm/xfrm_state.c |2 --
 1 files changed, 2 deletions(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index a00745a..0d07f6b 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -848,7 +848,6 @@ static struct xfrm_state *__find_acq_core(unsigned short 
family, u8 mode, u32 re
hlist_add_head(x-bydst, xfrm_state_bydst+h);
h = xfrm_src_hash(daddr, saddr, family);
hlist_add_head(x-bysrc, xfrm_state_bysrc+h);
-   wake_up(km_waitq);
 
xfrm_state_num++;
 
@@ -1311,7 +1310,6 @@ xfrm_alloc_spi(struct xfrm_state *x, __be32 minspi, 
__be32 maxspi)
h = xfrm_spi_hash(x-id.daddr, x-id.spi, x-id.proto, 
x-props.family);
hlist_add_head(x-byspi, xfrm_state_byspi+h);
spin_unlock_bh(xfrm_state_lock);
-   wake_up(km_waitq);
}
 }
 EXPORT_SYMBOL(xfrm_alloc_spi);
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] [IPSEC]: Move common code into xfrm_alloc_spi

2007-10-09 Thread Herbert Xu
[IPSEC]: Move common code into xfrm_alloc_spi

This patch moves some common code that conceptually belongs to the xfrm core
from af_key/xfrm_user into xfrm_alloc_spi.

In particular, the spin lock on the state is now taken inside xfrm_alloc_spi.
Previously it also protected the construction of the response PF_KEY/XFRM
messages to user-space.  This is inconsistent as other identical constructions
are not protected by the state lock.  This is bad because they in fact should
be protected but only in certain spots (so as not to hold the lock for too
long which may cause packet drops).

The SPI byte order conversion has also been moved.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 include/net/xfrm.h|2 +-
 net/key/af_key.c  |   29 -
 net/xfrm/xfrm_state.c |   26 --
 net/xfrm/xfrm_user.c  |   13 -
 4 files changed, 37 insertions(+), 33 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 064a4ca..1c116dc 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1084,7 +1084,7 @@ struct xfrm_policy *xfrm_policy_bysel_ctx(u8 type, int 
dir,
 struct xfrm_policy *xfrm_policy_byid(u8, int dir, u32 id, int delete, int 
*err);
 int xfrm_policy_flush(u8 type, struct xfrm_audit *audit_info);
 u32 xfrm_get_acqseq(void);
-void xfrm_alloc_spi(struct xfrm_state *x, __be32 minspi, __be32 maxspi);
+extern int xfrm_alloc_spi(struct xfrm_state *x, u32 minspi, u32 maxspi);
 struct xfrm_state * xfrm_find_acq(u8 mode, u32 reqid, u8 proto,
  xfrm_address_t *daddr, xfrm_address_t *saddr,
  int create, unsigned short family);
diff --git a/net/key/af_key.c b/net/key/af_key.c
index ff5c3d0..143d46f 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -1253,8 +1253,11 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff 
*skb, struct sadb_msg *h
struct sadb_x_sa2 *sa2;
struct sadb_address *saddr, *daddr;
struct sadb_msg *out_hdr;
+   struct sadb_spirange *range;
struct xfrm_state *x = NULL;
int mode;
+   int err;
+   u32 min_spi, max_spi;
u32 reqid;
u8 proto;
unsigned short family;
@@ -1309,25 +1312,17 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff 
*skb, struct sadb_msg *h
if (x == NULL)
return -ENOENT;
 
-   resp_skb = ERR_PTR(-ENOENT);
-
-   spin_lock_bh(x-lock);
-   if (x-km.state != XFRM_STATE_DEAD) {
-   struct sadb_spirange *range = ext_hdrs[SADB_EXT_SPIRANGE-1];
-   u32 min_spi, max_spi;
+   min_spi = 0x100;
+   max_spi = 0x0fff;
 
-   if (range != NULL) {
-   min_spi = range-sadb_spirange_min;
-   max_spi = range-sadb_spirange_max;
-   } else {
-   min_spi = 0x100;
-   max_spi = 0x0fff;
-   }
-   xfrm_alloc_spi(x, htonl(min_spi), htonl(max_spi));
-   if (x-id.spi)
-   resp_skb = pfkey_xfrm_state2msg(x, 0, 3);
+   range = ext_hdrs[SADB_EXT_SPIRANGE-1];
+   if (range) {
+   min_spi = range-sadb_spirange_min;
+   max_spi = range-sadb_spirange_max;
}
-   spin_unlock_bh(x-lock);
+
+   err = xfrm_alloc_spi(x, min_spi, max_spi);
+   resp_skb = err ? ERR_PTR(err) : pfkey_xfrm_state2msg(x, 0, 3);
 
if (IS_ERR(resp_skb)) {
xfrm_state_put(x);
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 0d07f6b..344f0a6 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -1275,26 +1275,33 @@ u32 xfrm_get_acqseq(void)
 }
 EXPORT_SYMBOL(xfrm_get_acqseq);
 
-void
-xfrm_alloc_spi(struct xfrm_state *x, __be32 minspi, __be32 maxspi)
+int xfrm_alloc_spi(struct xfrm_state *x, u32 low, u32 high)
 {
unsigned int h;
struct xfrm_state *x0;
+   int err = -ENOENT;
+   __be32 minspi = htonl(low);
+   __be32 maxspi = htonl(high);
+
+   spin_lock_bh(x-lock);
+   if (x-km.state == XFRM_STATE_DEAD)
+   goto unlock;
 
+   err = 0;
if (x-id.spi)
-   return;
+   goto unlock;
+
+   err = -ENOENT;
 
if (minspi == maxspi) {
x0 = xfrm_state_lookup(x-id.daddr, minspi, x-id.proto, 
x-props.family);
if (x0) {
xfrm_state_put(x0);
-   return;
+   goto unlock;
}
x-id.spi = minspi;
} else {
u32 spi = 0;
-   u32 low = ntohl(minspi);
-   u32 high = ntohl(maxspi);
for (h=0; hhigh-low+1; h++) {
spi = low + net_random()%(high-low+1);
x0 = xfrm_state_lookup(x-id.daddr, htonl(spi), 
x-id.proto, x-props.family);
@@ -1310,7 +1317,14 @@ xfrm_alloc_spi(struct xfrm_state 

[PATCH 6/7] [IPSEC]: Lock state when copying non-atomic fields to user-space

2007-10-09 Thread Herbert Xu
[IPSEC]: Lock state when copying non-atomic fields to user-space

This patch adds locking so that when we're copying non-atomic fields such as
life-time or coaddr to user-space we don't get a partial result.

For af_key I've changed every instance of pfkey_xfrm_state2msg apart from
expiration notification to include the keys and life-times.  This is in-line
with XFRM behaviour.

The actual cases affected are:

* pfkey_getspi: No change as we don't have any keys to copy.
* key_notify_sa:
+ ADD/UPD: This wouldn't work otherwise.
+ DEL: It can't hurt.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 net/key/af_key.c |   35 +--
 net/xfrm/xfrm_user.c |   14 --
 2 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/net/key/af_key.c b/net/key/af_key.c
index 143d46f..7969f8a 100644
--- a/net/key/af_key.c
+++ b/net/key/af_key.c
@@ -655,7 +655,8 @@ static inline int pfkey_mode_to_xfrm(int mode)
}
 }
 
-static struct sk_buff * pfkey_xfrm_state2msg(struct xfrm_state *x, int 
add_keys, int hsc)
+static struct sk_buff *__pfkey_xfrm_state2msg(struct xfrm_state *x,
+ int add_keys, int hsc)
 {
struct sk_buff *skb;
struct sadb_msg *hdr;
@@ -1009,6 +1010,24 @@ static struct sk_buff * pfkey_xfrm_state2msg(struct 
xfrm_state *x, int add_keys,
return skb;
 }
 
+
+static inline struct sk_buff *pfkey_xfrm_state2msg(struct xfrm_state *x)
+{
+   struct sk_buff *skb;
+
+   spin_lock_bh(x-lock);
+   skb = __pfkey_xfrm_state2msg(x, 1, 3);
+   spin_unlock_bh(x-lock);
+
+   return skb;
+}
+
+static inline struct sk_buff *pfkey_xfrm_state2msg_expire(struct xfrm_state *x,
+ int hsc)
+{
+   return __pfkey_xfrm_state2msg(x, 0, hsc);
+}
+
 static struct xfrm_state * pfkey_msg2xfrm_state(struct sadb_msg *hdr,
void **ext_hdrs)
 {
@@ -1322,7 +1341,7 @@ static int pfkey_getspi(struct sock *sk, struct sk_buff 
*skb, struct sadb_msg *h
}
 
err = xfrm_alloc_spi(x, min_spi, max_spi);
-   resp_skb = err ? ERR_PTR(err) : pfkey_xfrm_state2msg(x, 0, 3);
+   resp_skb = err ? ERR_PTR(err) : pfkey_xfrm_state2msg(x);
 
if (IS_ERR(resp_skb)) {
xfrm_state_put(x);
@@ -1412,12 +1431,8 @@ static int key_notify_sa(struct xfrm_state *x, struct 
km_event *c)
 {
struct sk_buff *skb;
struct sadb_msg *hdr;
-   int hsc = 3;
-
-   if (c-event == XFRM_MSG_DELSA)
-   hsc = 0;
 
-   skb = pfkey_xfrm_state2msg(x, 0, hsc);
+   skb = pfkey_xfrm_state2msg(x);
 
if (IS_ERR(skb))
return PTR_ERR(skb);
@@ -1529,7 +1544,7 @@ static int pfkey_get(struct sock *sk, struct sk_buff 
*skb, struct sadb_msg *hdr,
if (x == NULL)
return -ESRCH;
 
-   out_skb = pfkey_xfrm_state2msg(x, 1, 3);
+   out_skb = pfkey_xfrm_state2msg(x);
proto = x-id.proto;
xfrm_state_put(x);
if (IS_ERR(out_skb))
@@ -1709,7 +1724,7 @@ static int dump_sa(struct xfrm_state *x, int count, void 
*ptr)
struct sk_buff *out_skb;
struct sadb_msg *out_hdr;
 
-   out_skb = pfkey_xfrm_state2msg(x, 1, 3);
+   out_skb = pfkey_xfrm_state2msg(x);
if (IS_ERR(out_skb))
return PTR_ERR(out_skb);
 
@@ -2910,7 +2925,7 @@ static int key_notify_sa_expire(struct xfrm_state *x, 
struct km_event *c)
else
hsc = 1;
 
-   out_skb = pfkey_xfrm_state2msg(x, 0, hsc);
+   out_skb = pfkey_xfrm_state2msg_expire(x, hsc);
if (IS_ERR(out_skb))
return PTR_ERR(out_skb);
 
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 2cbbe5e..5238f6a 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -507,8 +507,16 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
struct xfrm_usersa_info *p,
struct sk_buff *skb)
 {
+   spin_lock_bh(x-lock);
copy_to_user_state(x, p);
 
+   if (x-coaddr)
+   NLA_PUT(skb, XFRMA_COADDR, sizeof(*x-coaddr), x-coaddr);
+
+   if (x-lastused)
+   NLA_PUT_U64(skb, XFRMA_LASTUSED, x-lastused);
+   spin_unlock_bh(x-lock);
+
if (x-aalg)
NLA_PUT(skb, XFRMA_ALG_AUTH, alg_len(x-aalg), x-aalg);
if (x-ealg)
@@ -522,12 +530,6 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
if (x-security  copy_sec_ctx(x-security, skb)  0)
goto nla_put_failure;
 
-   if (x-coaddr)
-   NLA_PUT(skb, XFRMA_COADDR, sizeof(*x-coaddr), x-coaddr);
-
-   if (x-lastused)
-   NLA_PUT_U64(skb, XFRMA_LASTUSED, x-lastused);
-
return 0;
 
 nla_put_failure:
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More 

[PATCH 5/7] [XFRM] user: Move attribute copying code into copy_to_user_state_extra

2007-10-09 Thread Herbert Xu
[XFRM] user: Move attribute copying code into copy_to_user_state_extra

Here's a good example of code duplication leading to code rot.  The
notification patch did its own netlink message creation for xfrm states.
It duplicated code that was already in dump_one_state.  Guess what, the
next time (and the time after) when someone updated dump_one_state the
notification path got zilch.

This patch moves that code from dump_one_state to copy_to_user_state_extra
and uses it in xfrm_notify_sa too.  Unfortunately whoever updates this
still needs to update xfrm_sa_len since the notification path wants to
know the exact size for allocation.

At least I've added a comment saying so and if someone still forgest, we'll
have a WARN_ON telling us so.

I also changed the security size calculation to use xfrm_user_sec_ctx since
that's what we actually put into the skb.  However it makes no practical
difference since it has the same size as xfrm_sec_ctx.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 net/xfrm/xfrm_user.c |   76 +++
 1 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 52c7fce..2cbbe5e 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -483,9 +483,9 @@ struct xfrm_dump_info {
 
 static int copy_sec_ctx(struct xfrm_sec_ctx *s, struct sk_buff *skb)
 {
-   int ctx_size = sizeof(struct xfrm_sec_ctx) + s-ctx_len;
struct xfrm_user_sec_ctx *uctx;
struct nlattr *attr;
+   int ctx_size = sizeof(*uctx) + s-ctx_len;
 
attr = nla_reserve(skb, XFRMA_SEC_CTX, ctx_size);
if (attr == NULL)
@@ -502,23 +502,11 @@ static int copy_sec_ctx(struct xfrm_sec_ctx *s, struct 
sk_buff *skb)
return 0;
 }
 
-static int dump_one_state(struct xfrm_state *x, int count, void *ptr)
+/* Don't change this without updating xfrm_sa_len! */
+static int copy_to_user_state_extra(struct xfrm_state *x,
+   struct xfrm_usersa_info *p,
+   struct sk_buff *skb)
 {
-   struct xfrm_dump_info *sp = ptr;
-   struct sk_buff *in_skb = sp-in_skb;
-   struct sk_buff *skb = sp-out_skb;
-   struct xfrm_usersa_info *p;
-   struct nlmsghdr *nlh;
-
-   if (sp-this_idx  sp-start_idx)
-   goto out;
-
-   nlh = nlmsg_put(skb, NETLINK_CB(in_skb).pid, sp-nlmsg_seq,
-   XFRM_MSG_NEWSA, sizeof(*p), sp-nlmsg_flags);
-   if (nlh == NULL)
-   return -EMSGSIZE;
-
-   p = nlmsg_data(nlh);
copy_to_user_state(x, p);
 
if (x-aalg)
@@ -540,6 +528,35 @@ static int dump_one_state(struct xfrm_state *x, int count, 
void *ptr)
if (x-lastused)
NLA_PUT_U64(skb, XFRMA_LASTUSED, x-lastused);
 
+   return 0;
+
+nla_put_failure:
+   return -EMSGSIZE;
+}
+
+static int dump_one_state(struct xfrm_state *x, int count, void *ptr)
+{
+   struct xfrm_dump_info *sp = ptr;
+   struct sk_buff *in_skb = sp-in_skb;
+   struct sk_buff *skb = sp-out_skb;
+   struct xfrm_usersa_info *p;
+   struct nlmsghdr *nlh;
+   int err;
+
+   if (sp-this_idx  sp-start_idx)
+   goto out;
+
+   nlh = nlmsg_put(skb, NETLINK_CB(in_skb).pid, sp-nlmsg_seq,
+   XFRM_MSG_NEWSA, sizeof(*p), sp-nlmsg_flags);
+   if (nlh == NULL)
+   return -EMSGSIZE;
+
+   p = nlmsg_data(nlh);
+
+   err = copy_to_user_state_extra(x, p, skb);
+   if (err)
+   goto nla_put_failure;
+
nlmsg_end(skb, nlh);
 out:
sp-this_idx++;
@@ -547,7 +564,7 @@ out:
 
 nla_put_failure:
nlmsg_cancel(skb, nlh);
-   return -EMSGSIZE;
+   return err;
 }
 
 static int xfrm_dump_sa(struct sk_buff *skb, struct netlink_callback *cb)
@@ -1973,6 +1990,14 @@ static inline size_t xfrm_sa_len(struct xfrm_state *x)
l += nla_total_size(sizeof(*x-calg));
if (x-encap)
l += nla_total_size(sizeof(*x-encap));
+   if (x-security)
+   l += nla_total_size(sizeof(struct xfrm_user_sec_ctx) +
+   x-security-ctx_len);
+   if (x-coaddr)
+   l += nla_total_size(sizeof(*x-coaddr));
+
+   /* Must count this as this may become non-zero behind our back. */
+   l += nla_total_size(sizeof(x-lastused));
 
return l;
 }
@@ -2018,23 +2043,16 @@ static int xfrm_notify_sa(struct xfrm_state *x, struct 
km_event *c)
p = nla_data(attr);
}
 
-   copy_to_user_state(x, p);
-
-   if (x-aalg)
-   NLA_PUT(skb, XFRMA_ALG_AUTH, alg_len(x-aalg), x-aalg);
-   if (x-ealg)
-   NLA_PUT(skb, XFRMA_ALG_CRYPT, alg_len(x-ealg), x-ealg);
-   if (x-calg)
-   NLA_PUT(skb, XFRMA_ALG_COMP, sizeof(*(x-calg)), x-calg);
-
-   if (x-encap)
-   NLA_PUT(skb, XFRMA_ENCAP, sizeof(*x-encap), x-encap);
+   if 

[PATCH 7/7] [IPSEC]: Move state lock into x-type-output

2007-10-09 Thread Herbert Xu
[IPSEC]: Move state lock into x-type-output

This patch releases the lock on the state before calling x-type-output.
It also adds the lock to the spots where they're currently needed.

Most of those places (all except mip6) are expected to disappear with
async crypto.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]
---

 net/ipv4/ah4.c |7 ++-
 net/ipv4/esp4.c|   10 --
 net/ipv6/ah6.c |9 ++---
 net/ipv6/esp6.c|   10 --
 net/ipv6/mip6.c|4 
 net/xfrm/xfrm_output.c |8 
 6 files changed, 36 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 58af298..3513149 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -5,6 +5,7 @@
 #include net/ah.h
 #include linux/crypto.h
 #include linux/pfkeyv2.h
+#include linux/spinlock.h
 #include net/icmp.h
 #include net/protocol.h
 #include asm/scatterlist.h
@@ -97,10 +98,14 @@ static int ah_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah-reserved = 0;
ah-spi = x-id.spi;
ah-seq_no = htonl(XFRM_SKB_CB(skb)-seq);
+
+   spin_lock_bh(x-lock);
err = ah_mac_digest(ahp, skb, ah-auth_data);
+   memcpy(ah-auth_data, ahp-work_icv, ahp-icv_trunc_len);
+   spin_unlock_bh(x-lock);
+
if (err)
goto error;
-   memcpy(ah-auth_data, ahp-work_icv, ahp-icv_trunc_len);
 
top_iph-tos = iph-tos;
top_iph-ttl = iph-ttl;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index ffd5653..452910d 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -8,6 +8,7 @@
 #include linux/kernel.h
 #include linux/pfkeyv2.h
 #include linux/random.h
+#include linux/spinlock.h
 #include net/icmp.h
 #include net/protocol.h
 #include net/udp.h
@@ -66,6 +67,8 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
top_iph-tot_len = htons(skb-len + alen);
*(skb_tail_pointer(trailer) - 1) = top_iph-protocol;
 
+   spin_lock_bh(x-lock);
+
/* this is non-NULL only with UDP Encapsulation */
if (x-encap) {
struct xfrm_encap_tmpl *encap = x-encap;
@@ -111,7 +114,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
if (unlikely(nfrags  ESP_NUM_FAST_SG)) {
sg = kmalloc(sizeof(struct scatterlist)*nfrags, 
GFP_ATOMIC);
if (!sg)
-   goto error;
+   goto unlock;
}
skb_to_sgvec(skb, sg, esph-enc_data+esp-conf.ivlen-skb-data, 
clen);
err = crypto_blkcipher_encrypt(desc, sg, sg, clen);
@@ -120,7 +123,7 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
} while (0);
 
if (unlikely(err))
-   goto error;
+   goto unlock;
 
if (esp-conf.ivlen) {
memcpy(esph-enc_data, esp-conf.ivec, esp-conf.ivlen);
@@ -133,6 +136,9 @@ static int esp_output(struct xfrm_state *x, struct sk_buff 
*skb)
memcpy(pskb_put(skb, trailer, alen), esp-auth.work_icv, alen);
}
 
+unlock:
+   spin_unlock_bh(x-lock);
+
ip_send_check(top_iph);
 
 error:
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index ff904a7..c51d775 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -29,6 +29,7 @@
 #include net/ah.h
 #include linux/crypto.h
 #include linux/pfkeyv2.h
+#include linux/spinlock.h
 #include linux/string.h
 #include net/icmp.h
 #include net/ipv6.h
@@ -284,12 +285,14 @@ static int ah6_output(struct xfrm_state *x, struct 
sk_buff *skb)
ah-reserved = 0;
ah-spi = x-id.spi;
ah-seq_no = htonl(XFRM_SKB_CB(skb)-seq);
+
+   spin_lock_bh(x-lock);
err = ah_mac_digest(ahp, skb, ah-auth_data);
-   if (err)
-   goto error_free_iph;
memcpy(ah-auth_data, ahp-work_icv, ahp-icv_trunc_len);
+   spin_unlock_bh(x-lock);
 
-   err = 0;
+   if (err)
+   goto error_free_iph;
 
memcpy(top_iph, tmp_base, sizeof(tmp_base));
if (tmp_ext) {
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 9fc1940..7355bb0 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -34,6 +34,7 @@
 #include linux/kernel.h
 #include linux/pfkeyv2.h
 #include linux/random.h
+#include linux/spinlock.h
 #include net/icmp.h
 #include net/ipv6.h
 #include net/protocol.h
@@ -98,6 +99,8 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff 
*skb)
esph-spi = x-id.spi;
esph-seq_no = htonl(XFRM_SKB_CB(skb)-seq);
 
+   spin_lock_bh(x-lock);
+
if (esp-conf.ivlen) {
if (unlikely(!esp-conf.ivinitted)) {
get_random_bytes(esp-conf.ivec, esp-conf.ivlen);
@@ -112,7 +115,7 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff 
*skb)
if (unlikely(nfrags  ESP_NUM_FAST_SG)) {
sg = kmalloc(sizeof(struct scatterlist)*nfrags, 
GFP_ATOMIC);

Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread Daniel Lezcano

Pavel Emelyanov wrote:

Currently indexes for netdevices come sequentially one by
one, and the same stays true even for devices that are 
created for namespaces.


Side effects of this are:
 * lo device has not 1 index in a namespace. This may break
   some userspace that relies on it (and AFAIR something
   really broke in OpenVZ VEs without this);
 * after some time namespaces will have devices with indexes
   like 100 os similar. This might be confusing for a
   human (tools will not mind).

So move the (currently global and static) ifindex variable
on the struct net, making the indexes allocation look more
like on a standalone machine.

Moreover - when we have indexes intersect between namespaces,
we may catch more BUGs in the future related to wrong device 
was found for a given index.


Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]


Applied and tested against netns49. Works fine.

Acked-by: Daniel Lezcano [EMAIL PROTECTED]

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RESEND]: RNDIS host: adjust MTU instead of refusing to talk to devices with MTU 1.5k

2007-10-09 Thread Thomas Sailer
On Tue, 2007-10-09 at 08:18 -0700, David Brownell wrote:

  This patch makes the host RNDIS driver talk to RNDIS devices with an MTU
  less than 1.5k, instead of refusing to talk to such a device.
  Please apply.
 
  Signed-Off-by: Thomas Sailer [EMAIL PROTECTED]
 
 Acked-by: David Brownell [EMAIL PROTECTED]
 
 ... but patches in drivers/net go to netdev, not linux-usb-devel.
 And it'd be nice if the patch comment mentioned some particular
 device that's been observed to act this way.

Ok, next try.

The device in question is a low power radio device that speaks the
bpqether protocol over ethernet. bpqether frames are never longer than
320 bytes, so there is no need to support 1.5kbyte MTU in the device
firmware.

Tom

Signed-Off-by: Thomas Sailer [EMAIL PROTECTED]

--- 1/drivers/net/usb/rndis_host.c.orig 2007-09-04 17:51:11.0 +0200
+++ 2/drivers/net/usb/rndis_host.c  2007-09-04 17:54:26.0 +0200
@@ -512,11 +512,18 @@
}
tmp = le32_to_cpu(u.init_c-max_transfer_size);
if (tmp  dev-hard_mtu) {
-   dev_err(intf-dev,
-   dev can't take %u byte packets (max %u)\n,
-   dev-hard_mtu, tmp);
-   retval = -EINVAL;
-   goto fail_and_release;
+   if (tmp = net-hard_header_len) {
+   dev_warn(intf-dev,
+dev can't take %u byte packets (max %u)\n,
+net-hard_header_len+1, tmp);
+   retval = -EINVAL;
+   goto fail_and_release;
+   }
+   dev-hard_mtu = tmp;
+   net-mtu = dev-hard_mtu - net-hard_header_len;
+   dev_warn(intf-dev,
+dev can't take %u byte packets (max %u), adjusting 
MTU to %u\n,
+dev-hard_mtu, tmp, net-mtu);
}
 
/* REVISIT:  peripheral alignment request is ignored ... */


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6][NET-2.6.24] Consolidate private allocations in seq files

2007-10-09 Thread Pavel Emelyanov
Many (very many) seq files in net/ allocate some private data
to use it later (mostly for iteration state). All this code
was obviously get using copy-paste method, so move it into
one place.

Almost all of these places either set this private to 0, or
keep uninitialized. Some places, however, pre-initialize this
area, but there are few of them.

The seq_open_private() call just opens the seq file with
allocated and set to zero area. The __seq_open_private() call
makes the same, but returns the allocated memory to the called
to be initialized later.

I didn't measure how much of the .text section this saves,
but I suspect a lot of :) As far as the code is concerned,
this set saves ~450 lines.

Such thing may be useful for any subsystem, but I found this
mostly in the networking code and fixed only it (for a while).

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6][NET-2.6.24] Introduce the seq_open_private()

2007-10-09 Thread Pavel Emelyanov
This function allocates the zeroed chunk of memory and
call seq_open(). The __seq_open_private() helper returns
the allocated memory to make it possible for the caller
to initialize it.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/fs/seq_file.c b/fs/seq_file.c
index bbb19be..ca71c11 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -429,6 +429,39 @@ int seq_release_private(struct inode *in
 }
 EXPORT_SYMBOL(seq_release_private);
 
+void *__seq_open_private(struct file *f, const struct seq_operations *ops,
+   int psize)
+{
+   int rc;
+   void *private;
+   struct seq_file *seq;
+
+   private = kzalloc(psize, GFP_KERNEL);
+   if (private == NULL)
+   goto out;
+
+   rc = seq_open(f, ops);
+   if (rc  0)
+   goto out_free;
+
+   seq = f-private_data;
+   seq-private = private;
+   return private;
+
+out_free:
+   kfree(private);
+out:
+   return NULL;
+}
+EXPORT_SYMBOL(__seq_open_private);
+
+int seq_open_private(struct file *filp, const struct seq_operations *ops,
+   int psize)
+{
+   return __seq_open_private(filp, ops, psize) ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL(seq_open_private);
+
 int seq_putc(struct seq_file *m, char c)
 {
if (m-count  m-size) {
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 83783ab..8bf1e05 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -46,6 +46,8 @@ int seq_path(struct seq_file *, struct v
 
 int single_open(struct file *, int (*)(struct seq_file *, void *), void *);
 int single_release(struct inode *, struct file *);
+void *__seq_open_private(struct file *, const struct seq_operations *, int);
+int seq_open_private(struct file *, const struct seq_operations *, int);
 int seq_release_private(struct inode *, struct file *);
 
 #define SEQ_START_TOKEN ((void *)1)

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6][NET-2.6.24] Make core networking code use seq_open_private

2007-10-09 Thread Pavel Emelyanov
This concerns the ipv4 and ipv6 code mostly, but also the netlink
and unix sockets.

The netlink code is an example of how to use the __seq_open_private()
call - it saves the net namespace on this private.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index d824819..36d6798 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -1378,24 +1378,8 @@ static const struct seq_operations arp_s
 
 static int arp_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct neigh_seq_state *s = kzalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, arp_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, arp_seq_ops,
+   sizeof(struct neigh_seq_state));
 }
 
 static const struct file_operations arp_seq_fops = {
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 9fafbee..527a6e0 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -1039,24 +1039,8 @@ static const struct seq_operations fib_s
 
 static int fib_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct fib_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, fib_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, fib_seq_ops,
+   sizeof(struct fib_iter_state));
 }
 
 static const struct file_operations fib_seq_fops = {
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index be34bd5..81a8285 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2379,25 +2379,8 @@ static const struct seq_operations fib_t
 
 static int fib_trie_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct fib_trie_iter *s = kmalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, fib_trie_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-   memset(s, 0, sizeof(*s));
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, fib_trie_seq_ops,
+   sizeof(struct fib_trie_iter));
 }
 
 static const struct file_operations fib_trie_fops = {
@@ -2500,25 +2483,8 @@ static const struct seq_operations fib_r
 
 static int fib_route_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct fib_trie_iter *s = kmalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, fib_route_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-   memset(s, 0, sizeof(*s));
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, fib_route_seq_ops,
+   sizeof(struct fib_trie_iter));
 }
 
 static const struct file_operations fib_route_fops = {
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 2b6e59c..7dbc282 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2410,23 +2410,8 @@ static const struct seq_operations igmp_
 
 static int igmp_mc_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct igmp_mc_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-   rc = seq_open(file, igmp_mc_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, igmp_mc_seq_ops,
+   sizeof(struct igmp_mc_iter_state));
 }
 
 static const struct file_operations igmp_mc_seq_fops = {
@@ -2584,23 +2569,8 @@ static const struct seq_operations igmp_
 
 static int igmp_mcf_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct igmp_mcf_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-   rc = seq_open(file, igmp_mcf_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, igmp_mcf_seq_ops,
+   

[PATCH 3/6][NET-2.6.24] Make netfilter code use the seq_open_private

2007-10-09 Thread Pavel Emelyanov
Just switch to the consolidated calls.

ipt_recent() has to initialize the private, so use
the __seq_open_private() helper.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Cc: Patrick McHardy [EMAIL PROTECTED]

---

diff --git a/net/ipv4/netfilter/ipt_recent.c b/net/ipv4/netfilter/ipt_recent.c
index db2a798..11d39fb 100644
--- a/net/ipv4/netfilter/ipt_recent.c
+++ b/net/ipv4/netfilter/ipt_recent.c
@@ -381,25 +381,14 @@ static const struct seq_operations recen
 static int recent_seq_open(struct inode *inode, struct file *file)
 {
struct proc_dir_entry *pde = PDE(inode);
-   struct seq_file *seq;
struct recent_iter_state *st;
-   int ret;
 
-   st = kzalloc(sizeof(*st), GFP_KERNEL);
+   st = __seq_open_private(file, recent_seq_ops, sizeof(*st));
if (st == NULL)
return -ENOMEM;
 
-   ret = seq_open(file, recent_seq_ops);
-   if (ret) {
-   kfree(st);
-   goto out;
-   }
-
st-table= pde-data;
-   seq  = file-private_data;
-   seq-private = st;
-out:
-   return ret;
+   return 0;
 }
 
 static ssize_t recent_proc_write(struct file *file, const char __user *input,
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c 
b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index a5ae2ea..741f3df 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -174,22 +174,8 @@ static const struct seq_operations ct_se
 
 static int ct_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   struct ct_iter_state *st;
-   int ret;
-
-   st = kzalloc(sizeof(struct ct_iter_state), GFP_KERNEL);
-   if (st == NULL)
-   return -ENOMEM;
-   ret = seq_open(file, ct_seq_ops);
-   if (ret)
-   goto out_free;
-   seq  = file-private_data;
-   seq-private = st;
-   return ret;
-out_free:
-   kfree(st);
-   return ret;
+   return seq_open_private(file, ct_seq_ops,
+   sizeof(struct ct_iter_state));
 }
 
 static const struct file_operations ct_file_ops = {
@@ -291,22 +277,8 @@ static const struct seq_operations exp_s
 
 static int exp_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   struct ct_expect_iter_state *st;
-   int ret;
-
-   st = kzalloc(sizeof(struct ct_expect_iter_state), GFP_KERNEL);
-   if (!st)
-   return -ENOMEM;
-   ret = seq_open(file, exp_seq_ops);
-   if (ret)
-   goto out_free;
-   seq  = file-private_data;
-   seq-private = st;
-   return ret;
-out_free:
-   kfree(st);
-   return ret;
+   return seq_open_private(file, exp_seq_ops,
+   sizeof(struct ct_expect_iter_state));
 }
 
 static const struct file_operations ip_exp_file_ops = {
diff --git a/net/netfilter/nf_conntrack_expect.c 
b/net/netfilter/nf_conntrack_expect.c
index 7a0ae36..175c8d1 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -472,22 +472,8 @@ static const struct seq_operations exp_s
 
 static int exp_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   struct ct_expect_iter_state *st;
-   int ret;
-
-   st = kzalloc(sizeof(struct ct_expect_iter_state), GFP_KERNEL);
-   if (!st)
-   return -ENOMEM;
-   ret = seq_open(file, exp_seq_ops);
-   if (ret)
-   goto out_free;
-   seq  = file-private_data;
-   seq-private = st;
-   return ret;
-out_free:
-   kfree(st);
-   return ret;
+   return seq_open_private(file, exp_seq_ops,
+   sizeof(struct ct_expect_iter_state));
 }
 
 static const struct file_operations exp_file_ops = {
diff --git a/net/netfilter/nf_conntrack_standalone.c 
b/net/netfilter/nf_conntrack_standalone.c
index 2a19c5f..9efdd37 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -195,22 +195,8 @@ static const struct seq_operations ct_se
 
 static int ct_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   struct ct_iter_state *st;
-   int ret;
-
-   st = kzalloc(sizeof(struct ct_iter_state), GFP_KERNEL);
-   if (st == NULL)
-   return -ENOMEM;
-   ret = seq_open(file, ct_seq_ops);
-   if (ret)
-   goto out_free;
-   seq  = file-private_data;
-   seq-private = st;
-   return ret;
-out_free:
-   kfree(st);
-   return ret;
+   return seq_open_private(file, ct_seq_ops,
+   sizeof(struct ct_iter_state));
 }
 
 static const struct file_operations ct_file_ops = {
diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index 2135926..2c7bd2e 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c

Re: [PATCH][E1000E] some cleanups

2007-10-09 Thread Kok, Auke
jamal wrote:
 On Mon, 2007-08-10 at 15:40 -0700, Kok, Auke wrote:
 
 My biggest problem with the patch as you sent it that it's a tonload of 
 changes
 and no implicit benefit immediately as I can see. 
 
 The patch looks scary but is pretty tame when you apply it and stare at
 it.
 
 I would really have to see the
 LLTX change as well to make a wellfounded decision whether it is the right 
 thing
 to go and overhaul this part of the driver or not :)

 I'm not particularly against the changes per se though.

 so please, if needed offlist send me those LLTX changes as well.
 
 Dont worry about it - i didnt get any love for a similar patch i did for
 tg3 either ;- I will hold onto it for now. I could do the un-LLTX
 outside of this - would you like that for both e1000/e1000e ?

if we're going to remove LLTX from e1000 I prefer to do that at a much later 
time.
Let's focus on e1000e instead - while it is still moving ;)

Auke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/6][NET-2.6.24] Make decnet code use the seq_open_private()

2007-10-09 Thread Pavel Emelyanov
Just switch to the consolidated code.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Cc: Patrick Caulfield [EMAIL PROTECTED]

---

diff --git a/net/decnet/dn_neigh.c b/net/decnet/dn_neigh.c
index b66e3be..e851b14 100644
--- a/net/decnet/dn_neigh.c
+++ b/net/decnet/dn_neigh.c
@@ -580,24 +580,8 @@ static const struct seq_operations dn_ne
 
 static int dn_neigh_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct neigh_seq_state *s = kzalloc(sizeof(*s), GFP_KERNEL);
-
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, dn_neigh_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, dn_neigh_seq_ops,
+   sizeof(struct neigh_seq_state));
 }
 
 static const struct file_operations dn_neigh_seq_fops = {
diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index b7ebf99..97eee5e 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1739,23 +1739,8 @@ static const struct seq_operations dn_rt
 
 static int dn_rt_cache_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct dn_rt_cache_iter_state *s;
-
-   s = kzalloc(sizeof(*s), GFP_KERNEL);
-   if (!s)
-   goto out;
-   rc = seq_open(file, dn_rt_cache_seq_ops);
-   if (rc)
-   goto out_kfree;
-   seq = file-private_data;
-   seq-private= s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, dn_rt_cache_seq_ops,
+   sizeof(struct dn_rt_cache_iter_state));
 }
 
 static const struct file_operations dn_rt_cache_seq_fops = {

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/6][NET-2.6.24] Make the IRDA use the seq_open_private()

2007-10-09 Thread Pavel Emelyanov
Just switch to the consolidated code

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Cc: Samuel Ortiz [EMAIL PROTECTED]

---

diff --git a/net/irda/irlap.c b/net/irda/irlap.c
index 3d76aaf..f3236ac 100644
--- a/net/irda/irlap.c
+++ b/net/irda/irlap.c
@@ -1219,29 +1219,11 @@ static const struct seq_operations irlap
 
 static int irlap_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct irlap_iter_state *s = kzalloc(sizeof(*s), GFP_KERNEL);
+   if (irlap == NULL)
+   return -EINVAL;
 
-   if (!s)
-   goto out;
-
-   if (irlap == NULL) {
-   rc = -EINVAL;
-   goto out_kfree;
-   }
-
-   rc = seq_open(file, irlap_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, irlap_seq_ops,
+   sizeof(struct irlap_iter_state));
 }
 
 const struct file_operations irlap_seq_fops = {
diff --git a/net/irda/irlmp.c b/net/irda/irlmp.c
index 7efa930..7db92ce 100644
--- a/net/irda/irlmp.c
+++ b/net/irda/irlmp.c
@@ -2003,27 +2003,10 @@ static const struct seq_operations irlmp
 
 static int irlmp_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct irlmp_iter_state *s;
-
IRDA_ASSERT(irlmp != NULL, return -EINVAL;);
 
-   s = kmalloc(sizeof(*s), GFP_KERNEL);
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, irlmp_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, irlmp_seq_ops,
+   sizeof(struct irlmp_iter_state));
 }
 
 const struct file_operations irlmp_seq_fops = {
diff --git a/net/irda/irttp.c b/net/irda/irttp.c
index 3d7ab03..1311976 100644
--- a/net/irda/irttp.c
+++ b/net/irda/irttp.c
@@ -1884,25 +1884,8 @@ static const struct seq_operations irttp
 
 static int irttp_seq_open(struct inode *inode, struct file *file)
 {
-   struct seq_file *seq;
-   int rc = -ENOMEM;
-   struct irttp_iter_state *s;
-
-   s = kzalloc(sizeof(*s), GFP_KERNEL);
-   if (!s)
-   goto out;
-
-   rc = seq_open(file, irttp_seq_ops);
-   if (rc)
-   goto out_kfree;
-
-   seq  = file-private_data;
-   seq-private = s;
-out:
-   return rc;
-out_kfree:
-   kfree(s);
-   goto out;
+   return seq_open_private(file, irttp_seq_ops,
+   sizeof(struct irttp_iter_state));
 }
 
 const struct file_operations irttp_seq_fops = {

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6][NET-2.6.24] Make the sunrpc use the seq_open_private()

2007-10-09 Thread Pavel Emelyanov
Just switch to the consolidated code.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Cc: Neil Brown [EMAIL PROTECTED]

---

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index ebe344f..8e05557 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -1218,23 +1218,15 @@ static const struct seq_operations cache
 
 static int content_open(struct inode *inode, struct file *file)
 {
-   int res;
struct handle *han;
struct cache_detail *cd = PDE(inode)-data;
 
-   han = kmalloc(sizeof(*han), GFP_KERNEL);
+   han = __seq_open_private(file, cache_content_op, sizeof(*han));
if (han == NULL)
return -ENOMEM;
 
han-cd = cd;
-
-   res = seq_open(file, cache_content_op);
-   if (res)
-   kfree(han);
-   else
-   ((struct seq_file *)file-private_data)-private = han;
-
-   return res;
+   return 0;
 }
 
 static const struct file_operations content_file_operations = {

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/8][BNX2X] resubmit as attachments: add bnx2x to Kconfig and Makefile

2007-10-09 Thread Eliezer Tamir
On Mon, 2007-10-08 at 21:29 -0700, David Miller wrote:
 From: Eliezer Tamir [EMAIL PROTECTED]
 Date: Tue, 09 Oct 2007 06:13:23 +0200
 
  Due to the size of the patch I can not post it to the list.
 
 Understood
 
  Here is an FTP link.
  ftp://[EMAIL PROTECTED]/bnx2x-0.40.10-net-2.6.24-one.patch.txt
  
  Or if you prefer, I can post it gzipped.
 
 I took a look at what takes up so much space and it's the firmware and
 this register init stuff.
 
 The firmware is unavoidable, but wrt. the register init tables there
 appears to be a ton of superfluous information in that header file.

First, I must admit that looking over the file again I noticed an error
on my part in generating the file, all the data that initializes FPGA or
EMULATION should not be there. fixed. 
(This reduced about half of the lines but only a third in size.)


 It seem extreme overkill and I would guess it exists purely to
 simplify your in-house chip validation.  I would really wish you
 wouldn't do this as these tables take up unswappable kernel memory.
 
 No other driver goes about initializing the chip registers using
 tables like this.  Many of them are sequences of nothing but zeros or
 the same value repeated over and over!  The non-zero values that are
 specific are magic constants with no macros to describe up the
 meaning of the bits being set in these registers.

Almost all of the zero-filled tables will be removed.
The rest of the registers do need to be initialized.

I agree that the number of registers that needs to be initialized is
huge, but that is caused by the way the hardware was designed.

The values for the initialization come from several sources:
Some are derived from HW code (the XML files used to derive the verilog
code),
Others (along with much of the machine generated .h files) are generated
at microcode build time, adding a microcode routine will cause the init
values to change, using a new variable can cause an .h file to change.
In the last group which is very small, are registers that are controlled
by the driver.

The values in this file really are machine generated, they really are
not meant to be modified directly by editing the file.

The registers that are under the driver's control are in the main .c
and .h files.


 
 Huge functions like bnx2x_idle_chk() that just validate register
 values with hundreds of lines of C code are not appropriate for
 submission into a Linux kernel network driver.  Again, this looks
 like code that assists you with in-house chip validation.

The idle check code is not a manufacturing test, it is meant to help
debug the driver and microcode.
If the driver sends an invalid command to one of the CPUs which then
chokes on it, this will tell you which one of them died and the general
whereabouts of the problem. (ingress CPU X is stuck because output queue
Y is full)

I agree that it is not a clean implementation.
I will remove it and it will be re-written. 


 I'm open to putting self tests into the driver, but this adds so much
 bloat it really takes things beyond reasonable.
 
 Please remove this self-test code and compress the register
 initialization information, preferrably into foo_hw_init() C functions
 as is done traditionally in drivers.  It will take up less space,
 remove the magic values, and make the hardware easier to understand
 for other developers.

 This should get the driver under the posting limit of 400K, which to
 be honest no driver's should be larger than.  Unfortunately that
 firmware file is 350K so you won't have much left to work with :-)
 
 What we could do is post the driver without the firmware, so that
 people can review the actual C-code by just replying to this posting
 and it won't go over the posting size limit.

( Michael has showed me the trick of how to post with evolution, so I
hope that the mangled patch problem is behind us and I think that I can
now post everything without a problem, Hallelujah!)
 
 Thanks!
 


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread David Stevens
Sorry if this is a dumb question, but what is the model you intend for
SNMP? Do you want each namespace to be its own virtual machine with
its own, separate MIB?

Ifindex's have to uniquely identify the interface (virtual or otherwise) 
to remote
queriers (not just local applications), so unless you pay the price of 
separating
all the SNMP MIBs per namespace too, it seems you'll need some way to
remap these for SNMP queries, right?

+-DLS

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread Eric W. Biederman
Pavel Emelyanov [EMAIL PROTECTED] writes:

 Currently indexes for netdevices come sequentially one by
 one, and the same stays true even for devices that are 
 created for namespaces.

 Side effects of this are:
  * lo device has not 1 index in a namespace. This may break
some userspace that relies on it (and AFAIR something
really broke in OpenVZ VEs without this);

As it happens lo hasn't been registered first for some time
so it hasn't had ifindex of 1 in the normal kernel.

  * after some time namespaces will have devices with indexes
like 100 os similar. This might be confusing for a
human (tools will not mind).

Only if we wind up creating that many devices.

 So move the (currently global and static) ifindex variable
 on the struct net, making the indexes allocation look more
 like on a standalone machine.

 Moreover - when we have indexes intersect between namespaces,
 we may catch more BUGs in the future related to wrong device 
 was found for a given index.

Not yet.

I know there are several data structures internal to the kernel that
are indexed by ifindex, and not struct net_device *.  There is the
iflink field in struct net_device.  We need a way to refer to network
devices in other namespaces in rtnetlink in an unambiguous way.   I
don't see any real problems with a global ifindex assignment until
we start migrating applications.

So please hold off on this until the kernel has been audited and
we have removed all of the uses of ifindex that assume ifindex is
global, that we can find.

Right now a namespace local ifindex seems to be just asking for
trouble.

Eric

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread Eric W. Biederman
David Stevens [EMAIL PROTECTED] writes:

 Sorry if this is a dumb question, but what is the model you intend for
 SNMP? Do you want each namespace to be its own virtual machine with
 its own, separate MIB?

Each network namespace appears to user space as a completely separate
network stack.  So yes a separate instance of the MIB is appropriate.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Stephen Hemminger
On 09 Oct 2007 18:51:51 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 David Miller [EMAIL PROTECTED] writes:
  
  2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
 with load balancing using the code in #1.  I think this is kind
 of in the territory of what Peter said he is working on.
 
 Hopefully that new qdisc will just use the TX rings of the hardware
 directly. They are typically large enough these days. That might avoid
 some locking in this critical path.
 
 I know this is controversial, but realistically I doubt users
 benefit at all from the prioritization that pfifo provides.
 
 I agree. For most interfaces the priority is probably dubious.
 Even for DSL the prioritization will be likely usually done in a router
 these days.
 
 Also for the fast interfaces where we do TSO priority doesn't work
 very well anyways -- with large packets there is not too much 
 to prioritize.
 
  3) Work on discovering a way to make the locking on transmit as
 localized to the current thread of execution as possible.  Things
 like RCU and statistic replication, techniques we use widely
 elsewhere in the stack, begin to come to mind.
 
 If the data is just passed on to the hardware queue, why is any 
 locking needed at all? (except for the driver locking of course)
 
 -Andi

I wonder about the whole idea of queueing in general at such high speeds.
Given the normal bi-modal distribution of packets, and the predominance
of 1500 byte MTU; does it make sense to even have any queueing in software
at all?


-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Andi Kleen
 I wonder about the whole idea of queueing in general at such high speeds.
 Given the normal bi-modal distribution of packets, and the predominance
 of 1500 byte MTU; does it make sense to even have any queueing in software
 at all?

Yes that is my point -- it should just pass it through directly
and the driver can then put it into the different per CPU (or per
whatever) queues managed by the hardware.

The only thing the qdisc needs to do is to set some bit that says
it is ok to put this into difference queues; don't need strict ordering 

Otherwise if the drivers did that unconditionally they might cause
problems with other qdiscs.

This would also require that the driver exports some hint 
to the upper layer on how large its internal queues are. A device
with a short queue would still require pfifo_fast. Long queue
devices could just pass through. That again could be a single flag.

-Andi
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] xen-netfront: rearrange netfront_info structure to separate tx and rx

2007-10-09 Thread Jeremy Fitzhardinge
Keep tx and rx elements separate on different cachelines to prevent
bouncing.

Signed-off-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
Cc: Stephen Hemminger [EMAIL PROTECTED]
Cc: Christoph Hellwig [EMAIL PROTECTED]

---
 drivers/net/xen-netfront.c |   37 ++---
 1 file changed, 18 insertions(+), 19 deletions(-)

===
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -72,22 +72,12 @@ struct netfront_info {
struct list_head list;
struct net_device *netdev;
 
+   unsigned int evtchn;
+   struct xenbus_device *xbdev;
+
+   spinlock_t   tx_lock;
struct xen_netif_tx_front_ring tx;
-   struct xen_netif_rx_front_ring rx;
-
-   spinlock_t   tx_lock;
-   spinlock_t   rx_lock;
-
-   unsigned int evtchn;
-
-   /* Receive-ring batched refills. */
-#define RX_MIN_TARGET 8
-#define RX_DFL_MIN_TARGET 64
-#define RX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256)
-   unsigned rx_min_target, rx_max_target, rx_target;
-   struct sk_buff_head rx_batch;
-
-   struct timer_list rx_refill_timer;
+   int tx_ring_ref;
 
/*
 * {tx,rx}_skbs store outstanding skbuffs. Free tx_skb entries
@@ -106,13 +96,22 @@ struct netfront_info {
grant_ref_t grant_tx_ref[NET_TX_RING_SIZE];
unsigned tx_skb_freelist;
 
+   spinlock_t   rx_lock cacheline_aligned_in_smp;
+   struct xen_netif_rx_front_ring rx;
+   int rx_ring_ref;
+
+   /* Receive-ring batched refills. */
+#define RX_MIN_TARGET 8
+#define RX_DFL_MIN_TARGET 64
+#define RX_MAX_TARGET min_t(int, NET_RX_RING_SIZE, 256)
+   unsigned rx_min_target, rx_max_target, rx_target;
+   struct sk_buff_head rx_batch;
+
+   struct timer_list rx_refill_timer;
+
struct sk_buff *rx_skbs[NET_RX_RING_SIZE];
grant_ref_t gref_rx_head;
grant_ref_t grant_rx_ref[NET_RX_RING_SIZE];
-
-   struct xenbus_device *xbdev;
-   int tx_ring_ref;
-   int rx_ring_ref;
 
unsigned long rx_pfn_array[NET_RX_RING_SIZE];
struct multicall_entry rx_mcl[NET_RX_RING_SIZE+1];

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Waskiewicz Jr, Peter P
 IMO the net driver really should provide a hint as to what it wants.
 
 8139cp and tg3 would probably prefer multiple TX queue 
 behavior to match silicon behavior -- strict prio.

If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.  The reason I added the DRR qdisc is for the Si
that has its own queueing strategy that is not RR.  For Si that
implements RR (like e1000), you can either use the DRR qdisc, or if you
want to prioritize your flows, use PRIO.

-PJ Waskiewicz
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

Waskiewicz Jr, Peter P wrote:

IMO the net driver really should provide a hint as to what it wants.

8139cp and tg3 would probably prefer multiple TX queue 
behavior to match silicon behavior -- strict prio.


If I understand what you just said, I disagree.  If your hardware is
running strict prio, you don't want to enforce strict prio in the qdisc
layer; performing two layers of QoS is excessive, and may lead to
results you don't want.  The reason I added the DRR qdisc is for the Si
that has its own queueing strategy that is not RR.  For Si that
implements RR (like e1000), you can either use the DRR qdisc, or if you
want to prioritize your flows, use PRIO.


A misunderstanding, I think.

To my brain, DaveM's item #2 seemed to assume/require the NIC hardware 
to balance fairly across hw TX rings, which seemed to preclude the 
8139cp/tg3 style of strict-prio hardware.  That's what I was responding to.


As long as there is some modular way to fit 8139cp/tg3 style multi-TX 
into our universe, I'm happy :)


Jeff



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.23-rc8-mm2 BUG: register_netdevice() issue as (ab)used by ISDN

2007-10-09 Thread Andreas Mohr
Hi,

On Mon, Oct 08, 2007 at 05:58:36PM +0200, Karsten Keil wrote:
 You could try following patch with 2.6.23-rc8-mm2, it change I4L to use
 alloc_netdev().

I just did the horribly unthinkable:
I rebooted the production internet gateway *remotely*, and:

[EMAIL PROTECTED]:/home/andi# uname -a
Linux gate 2.6.23-rc8-mm2-gate #1 Mon Oct 8 22:14:00 CEST 2007 i586
GNU/Linux
[EMAIL PROTECTED]:/home/andi# dmesg|grep BUG
[EMAIL PROTECTED]:/home/andi# grep BUG /var/log/messages
Oct  7 13:07:40 gate kernel: kernel BUG at net/core/dev.c:3485!
^^^

This 2.6.23-rc8-mm2-gate was ISDN-patched, obviously.
Oh, and isdnlog *is* indeed running, so this is a first check that
ISDN functionality seems to be up and running...

IOW, thanks!!

Andreas Mohr

-- 
Want to provide valuable development feedback about your kernel version?
Install http://klive.cpushare.com/ !
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Waskiewicz Jr, Peter P
 A misunderstanding, I think.
 
 To my brain, DaveM's item #2 seemed to assume/require the NIC 
 hardware to balance fairly across hw TX rings, which seemed 
 to preclude the
 8139cp/tg3 style of strict-prio hardware.  That's what I was 
 responding to.
 
 As long as there is some modular way to fit 8139cp/tg3 style 
 multi-TX into our universe, I'm happy :)

Ah hah.  Yes, a misunderstanding on my part.  Thanks for the
clarification.  Methinks more caffeine is required for today...

-PJ
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Don't memset() netns to zero manually

2007-10-09 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 15:48:19 +0400

 The newly created net namespace is set to 0 with memset()
 in setup_net(). The setup_net() is also called for the
 init_net_ns(), which is zeroed naturally as a global var.
 
 So remove this memset and allocate new nets with the
 kmem_cache_zalloc().
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Applied, thanks Pavel!
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread David Miller
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Tue, 09 Oct 2007 11:43:58 -0600

 David Stevens [EMAIL PROTECTED] writes:
 
  Sorry if this is a dumb question, but what is the model you intend for
  SNMP? Do you want each namespace to be its own virtual machine with
  its own, separate MIB?
 
 Each network namespace appears to user space as a completely separate
 network stack.  So yes a separate instance of the MIB is appropriate.

We don't think you can validly do that, as David tried to explain.

The interface indexes are visible remotely to remote SNMP querying
applications.  They have to be unique within the physical system.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread David Miller
From: David Stevens [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 09:18:25 -0700

 Ifindex's have to uniquely identify the interface (virtual or
 otherwise) to remote queriers (not just local applications), so
 unless you pay the price of separating all the SNMP MIBs per
 namespace too, it seems you'll need some way to remap these for SNMP
 queries, right?

I don't see how it can work even with per-namespace MIBs,
the interface indexes have to be unique per system.


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 16:19:25 +0400

 Currently indexes for netdevices come sequentially one by
 one, and the same stays true even for devices that are 
 created for namespaces.
 
 Side effects of this are:
  * lo device has not 1 index in a namespace. This may break
some userspace that relies on it (and AFAIR something
really broke in OpenVZ VEs without this);
  * after some time namespaces will have devices with indexes
like 100 os similar. This might be confusing for a
human (tools will not mind).
 
 So move the (currently global and static) ifindex variable
 on the struct net, making the indexes allocation look more
 like on a standalone machine.
 
 Moreover - when we have indexes intersect between namespaces,
 we may catch more BUGs in the future related to wrong device 
 was found for a given index.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

Based upon Eric's and other's comments, I'm holding off on
this for now.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 08:44:25 -0400

 David Miller wrote:
  From: Krishna Kumar2 [EMAIL PROTECTED]
  Date: Tue, 9 Oct 2007 16:51:14 +0530
  
  David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:
 
  Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
  it.
  Great, this will make life easy. Any idea how long that would take?
  It seems simple enough to do.
  
  I'd say we can probably try to get rid of it in 2.6.25, this is
  assuming we get driver authors to cooperate and do the conversions
  or alternatively some other motivated person.
  
  I can just threaten to do them all and that should get the driver
  maintainers going :-)
 
 What, like this?  :)

Thanks, but it's probably going to need some corrections and/or
an audit.

If you unconditionally take those locks in the transmit function,
there is probably an ABBA deadlock elsewhere in the driver now, most
likely in the TX reclaim processing, and you therefore need to handle
that too.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Jeff Garzik

David Miller wrote:

From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 08:44:25 -0400


David Miller wrote:

From: Krishna Kumar2 [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 16:51:14 +0530


David Miller [EMAIL PROTECTED] wrote on 10/09/2007 04:32:55 PM:


Ignore LLTX, it sucks, it was a big mistake, and we will get rid of
it.

Great, this will make life easy. Any idea how long that would take?
It seems simple enough to do.

I'd say we can probably try to get rid of it in 2.6.25, this is
assuming we get driver authors to cooperate and do the conversions
or alternatively some other motivated person.

I can just threaten to do them all and that should get the driver
maintainers going :-)

What, like this?  :)


Thanks, but it's probably going to need some corrections and/or
an audit.


I would be happy if someone wanted to audit that patch.



If you unconditionally take those locks in the transmit function,
there is probably an ABBA deadlock elsewhere in the driver now, most
likely in the TX reclaim processing, and you therefore need to handle
that too.


And I most certainly checked the relevant transmit paths and other 
locking to make sure lock ordering was correct.


Jeff



-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  I'd say we can probably try to get rid of it in 2.6.25, this is
  assuming we get driver authors to cooperate and do the conversions
  or alternatively some other motivated person.
  
  I can just threaten to do them all and that should get the driver
  maintainers going :-)

I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
my TODO list so I don't forget.

In fact if 2.6.23 drags on long enough I may do it for 2.6.24
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] [IPSEC]: Remove bogus ref count in xfrm_secpath_reject

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:31 +0800

 [IPSEC]: Remove bogus ref count in xfrm_secpath_reject
 
 Constructs of the form
 
   xfrm_state_hold(x);
   foo(x);
   xfrm_state_put(x);
 
 tend to be broken because foo is either synchronous where this is totally
 unnecessary or if foo is asynchronous then the reference count is in the
 wrong spot.
 
 In the case of xfrm_secpath_reject, the function is synchronous and therefore
 we should just kill the reference count.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks!
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/7] [IPSEC]: Store IPv6 nh pointer in mac_header on output

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:32 +0800

 [IPSEC]: Store IPv6 nh pointer in mac_header on output
 
 Current the x-mode-output functions store the IPv6 nh pointer in the
 skb network header.  This is inconvenient because the network header then
 has to be fixed up before the packet can leave the IPsec stack.  The mac
 header field is unused on output so we can use that to store this instead.
 
 This patch does that and removes the network header fix-up in xfrm_output.
 
 It also uses ipv6_hdr where appropriate in the x-type-output functions.
 
 There is also a minor clean-up in esp4 to make it use the same code as
 esp6 to help any subsequent effort to merge the two.
 
 Lastly it kills two redundant skb_set_* statements in BEET that were
 simply copied over from transport mode.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks Herbert.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/7] [IPSEC]: Remove gratuitous km wake-up events on ACQUIRE

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:33 +0800

 [IPSEC]: Remove gratuitous km wake-up events on ACQUIRE
 
 There is no point in waking people up when creating/updating larval states
 because they'll just go back to sleep again as larval states by definition
 cannot be found by xfrm_state_find.
 
 We should only wake them up when the larvals mature or die.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks!
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7] [IPSEC]: Move common code into xfrm_alloc_spi

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:34 +0800

 [IPSEC]: Move common code into xfrm_alloc_spi
 
 This patch moves some common code that conceptually belongs to the xfrm core
 from af_key/xfrm_user into xfrm_alloc_spi.
 
 In particular, the spin lock on the state is now taken inside xfrm_alloc_spi.
 Previously it also protected the construction of the response PF_KEY/XFRM
 messages to user-space.  This is inconsistent as other identical constructions
 are not protected by the state lock.  This is bad because they in fact should
 be protected but only in certain spots (so as not to hold the lock for too
 long which may cause packet drops).
 
 The SPI byte order conversion has also been moved.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks!
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] [XFRM] user: Move attribute copying code into copy_to_user_state_extra

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:35 +0800

 [XFRM] user: Move attribute copying code into copy_to_user_state_extra
 
 Here's a good example of code duplication leading to code rot.  The
 notification patch did its own netlink message creation for xfrm states.
 It duplicated code that was already in dump_one_state.  Guess what, the
 next time (and the time after) when someone updated dump_one_state the
 notification path got zilch.
 
 This patch moves that code from dump_one_state to copy_to_user_state_extra
 and uses it in xfrm_notify_sa too.  Unfortunately whoever updates this
 still needs to update xfrm_sa_len since the notification path wants to
 know the exact size for allocation.
 
 At least I've added a comment saying so and if someone still forgest, we'll
 have a WARN_ON telling us so.
 
 I also changed the security size calculation to use xfrm_user_sec_ctx since
 that's what we actually put into the skb.  However it makes no practical
 difference since it has the same size as xfrm_sec_ctx.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks!
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] [IPSEC]: Lock state when copying non-atomic fields to user-space

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:36 +0800

 [IPSEC]: Lock state when copying non-atomic fields to user-space
 
 This patch adds locking so that when we're copying non-atomic fields such as
 life-time or coaddr to user-space we don't get a partial result.
 
 For af_key I've changed every instance of pfkey_xfrm_state2msg apart from
 expiration notification to include the keys and life-times.  This is in-line
 with XFRM behaviour.
 
 The actual cases affected are:
 
 * pfkey_getspi: No change as we don't have any keys to copy.
 * key_notify_sa:
   + ADD/UPD: This wouldn't work otherwise.
   + DEL: It can't hurt.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied.

I would be more careful with the changelog description for
something like this in the future.  It sounds like this
patch will cause us to touch userspace with locks held,
which obviously only works in very limited scenerios
and is usually a BUG.

But you're actually just constructing SKB response netlink
bits, which later will be copied into userspace but long
after we've released these locks.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] [IPSEC]: Move state lock into x-type-output

2007-10-09 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 22:36:37 +0800

 [IPSEC]: Move state lock into x-type-output
 
 This patch releases the lock on the state before calling x-type-output.
 It also adds the lock to the spots where they're currently needed.
 
 Most of those places (all except mip6) are expected to disappear with
 async crypto.
 
 Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Applied, thanks!
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: 09 Oct 2007 18:51:51 +0200

 Hopefully that new qdisc will just use the TX rings of the hardware
 directly. They are typically large enough these days. That might avoid
 some locking in this critical path.

Indeed, I also realized last night that for the default qdiscs
we do a lot of stupid useless work.  If the queue is a FIFO
and the device can take packets, we should send it directly
and not stick it into the qdisc at all.

 If the data is just passed on to the hardware queue, why is any 
 locking needed at all? (except for the driver locking of course)

Absolutely.

Our packet scheduler subsystem is great, but by default it should just
get out of the way.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Roland Dreier [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 13:22:44 -0700

 I can definitely kill LLTX for IPoIB by 2.6.25 and I just added it to
 my TODO list so I don't forget.
 
 In fact if 2.6.23 drags on long enough I may do it for 2.6.24

Before you add new entries to your list, how is that ibm driver NAPI
conversion coming along? :-)

Right now that's a more pressing task to complete.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Stephen Hemminger
On Tue, 09 Oct 2007 13:43:31 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Andi Kleen [EMAIL PROTECTED]
 Date: 09 Oct 2007 18:51:51 +0200
 
  Hopefully that new qdisc will just use the TX rings of the hardware
  directly. They are typically large enough these days. That might avoid
  some locking in this critical path.
 
 Indeed, I also realized last night that for the default qdiscs
 we do a lot of stupid useless work.  If the queue is a FIFO
 and the device can take packets, we should send it directly
 and not stick it into the qdisc at all.
 
  If the data is just passed on to the hardware queue, why is any 
  locking needed at all? (except for the driver locking of course)
 
 Absolutely.
 
 Our packet scheduler subsystem is great, but by default it should just
 get out of the way.

I was thinking why not have a default transmit queue len of 0 like
the virtual devices.

-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread Eric W. Biederman
David Miller [EMAIL PROTECTED] writes:

 From: [EMAIL PROTECTED] (Eric W. Biederman)
 Date: Tue, 09 Oct 2007 11:43:58 -0600

 David Stevens [EMAIL PROTECTED] writes:
 
  Sorry if this is a dumb question, but what is the model you intend for
  SNMP? Do you want each namespace to be its own virtual machine with
  its own, separate MIB?
 
 Each network namespace appears to user space as a completely separate
 network stack.  So yes a separate instance of the MIB is appropriate.

 We don't think you can validly do that, as David tried to explain.

 The interface indexes are visible remotely to remote SNMP querying
 applications.  They have to be unique within the physical system.

I think figuring out what we are doing with SNMP is not any harder
or easier then any other user space interface, and like I said I
don't think we are ready yet.

From the perspective of monitoring network namespaces make the entire
system looks more like a cluster then it does a single machine, and
that is how I would look at portraying the system to SNMP if I had to
do that work today.  A switch with a bunch of different machines 
behind it.  Especially in the context of container migration this
becomes an attractive model.

Regardless it is early yet and there is plenty of time to revisit this
after we solved the easier and less controversial problems.

Eric
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net-2.6.24 rebased...

2007-10-09 Thread Urs Thuermann
Hallo Dave,

 9.1MB, 739 changesets, keep those patches flowing!

Last week I have sent another version of our patch series for PF_CAN.
The changes after the last review feedback were only cosmetics.

Do you have any plans with that code for this or the next release?

Regards,
urs
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][NETNS] Make ifindex generation per-namespace

2007-10-09 Thread David Miller
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Tue, 09 Oct 2007 15:00:10 -0600

 Regardless it is early yet and there is plenty of time to revisit this
 after we solved the easier and less controversial problems.

Ok.

I would encourage you to learn how the SNMP mibs work, and whether
they associate things with interfaces and/or unique MAC addresses.
The semantics may have conflicts with your envisioned cluster
abstraction.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net-2.6.24 rebased...

2007-10-09 Thread David Miller
From: Urs Thuermann [EMAIL PROTECTED]
Date: 09 Oct 2007 23:13:42 +0200

 Last week I have sent another version of our patch series for PF_CAN.
 The changes after the last review feedback were only cosmetics.
 
 Do you have any plans with that code for this or the next release?

I think PF_CAN will go in 2.6.25
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 13:53:40 -0700

 I was thinking why not have a default transmit queue len of 0 like
 the virtual devices.

I'm not so sure.

Even if the device has huge queues I still think we need a software
queue for when the hardware one backs up.

It is even beneficial to stick with reasonably sized TX queues because
it keeps the total resident state accessed by the CPU within the
bounds of the L2 cache.  If you go past that it actually hurts to make
the TX queue larger instead of helps even if it means you never hit
back pressure.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Jeff Garzik [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 16:20:14 -0400

 David Miller wrote:
  If you unconditionally take those locks in the transmit function,
  there is probably an ABBA deadlock elsewhere in the driver now, most
  likely in the TX reclaim processing, and you therefore need to handle
  that too.
 
 And I most certainly checked the relevant transmit paths and other 
 locking to make sure lock ordering was correct.

Awesome.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Roland Dreier
  Before you add new entries to your list, how is that ibm driver NAPI
  conversion coming along? :-)

I still haven't done much.  OK, I will try to get my board booting
again this week.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread jamal

On Tue, 2007-09-10 at 14:22 -0700, David Miller wrote:

 Even if the device has huge queues I still think we need a software
 queue for when the hardware one backs up.

It should be fine to just pretend the qdisc exists despite it sitting
in the driver and not have s/ware queues at all to avoid all the
challenges that qdiscs bring; 
if the h/ware queues are full because of link pressure etc, you drop. We
drop today when the s/ware queues are full. The driver txmit lock takes
place of the qdisc queue lock etc. I am assuming there is still need for
that locking. The filter/classification scheme still works as is and
select classes which map to rings. tc still works as is etc.

cheers,
jamal

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHES] TX batching rev2.5

2007-10-09 Thread jamal

Please provide feedback on the code and/or architecture.
They are now updated to work with the latest rebased net-2.6.24 
from a few hours ago.
I am on travel mode so wont have time to do more testing for the next
few days - i do consider this to be stable at this point based on
what i have been testing (famous last words).

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev-gso_skb

What has changed since i posted last:
-

1) There was some cruft leftover from prep_frame feature that i forgot
to remove last time - it is now gone.
2) In the shower this AM, i realized that it is plausible that a 
batch of packets sent to the driver may all be dropped because
they are badly formatted. Current drivers all return NETDEV_TX_OK
in all such cases. This will lead for us to call dev-hard_end_xmit() 
to be invoked unnecessarily. I already had a NETDEV_TX_DROPPED within
batching drivers, so i made that global and make the batching drivers
return it if they drop packets. The core calls dev-hard_end_xmit()
when at least one packet makes it through. 


Things i am gonna say that nobody will see (wink)
-
Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit just in case i
misunderstood. 
Herbert, if you can look at at least patch 3 i will appreaciate it
(since it kills dev-gso_skb that you introduced).

UPCOMING PATCHES
---
As before:
More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver 
- tun driver
- pktgen
- netiron driver
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto that i will post soon today. 

PERFORMANCE TESTING


System under test hardware is still a 2xdual core opteron with a 
couple of tg3s. 
A test tool generates udp traffic of different sizes for upto 60 
seconds per run or a total of 30M packets. I have 4 threads each 
running on a specific CPU which keep all the CPUs as busy as they can 
sending packets targetted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {8B, 32B, 64B, 128B, 256B, 512B, 1024B}. 
Each packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

I do plan also to run forwarding and TCP tests in the future when the
dust settles.

cheers,
jamal


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] [NET_BATCH] Introduce batching interface Rev2.5

2007-10-09 Thread jamal
This patch introduces the netdevice interface for batching.

cheers,
jamal
[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

BACKGROUND
-

A driver dev-hard_start_xmit() has 4 typical parts:
a) packet formating (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on,
tx completion interupts, set last tx time, etc

[For code cleanliness/readability sake, regardless of this work,
one should break the dev-hard_start_xmit() into those 4 functions
anyways].

INTRODUCING API
---

With the api introduced in this patch, a driver which has all
4 parts and needing to support batching is advised to split its
dev-hard_start_xmit() in the following manner:
1)Remove #d from dev-hard_start_xmit() and put it in
dev-hard_end_xmit() method.
2)#b and #c can stay in -hard_start_xmit() (or whichever way you want
to do this)
3) #a is deffered to future work to reduce confusion (since it holds
on its own).

Note: There are drivers which may need not support any of the two
approaches (example the tun driver i patched) so the methods are
optional.

xmit_win variable is set by the driver to tell the core how much space
it has to take on new skbs. It is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we dont requeue to the qdisc (and avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us when it invokes netif_wake_queue how much space it has for
descriptors by setting this variable.

Refer to the driver howto for more details.

THEORY OF OPERATION
---

1. Core dequeues from qdiscs upto dev-xmit_win packets. Fragmented
and GSO packets are accounted for as well.
2. Core grabs TX_LOCK
3. Core loop for all skbs:
invokes driver dev-hard_start_xmit()
4. Core invokes driver dev-hard_end_xmit()

ACKNOWLEDGEMENT AND SOME HISTORY


There's a lot of history and reasoning of why batching in a document
i am writting which i may submit as a patch.
Thomas Graf (who doesnt know this probably) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, David Miller,
and Patrick McHardy, Jeff Garzik and Bill Fink have contributed in one or
more of {bug fixes, enhancements, testing, lively discussion}. The
Broadcom and neterion folks have been outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 98d39ea7922fa2719a80eecd02cae359f3d7
tree 63822bf3040ea41846399c589c912c2be654f008
parent 7b4cd20628fe5c4e145c383fcd8d954d38f7be61
author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:06:28 -0400

 include/linux/netdevice.h |9 +-
 net/core/dev.c|   67 ++---
 net/sched/sch_generic.c   |4 +--
 3 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..b0e71c9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -86,6 +86,7 @@ struct wireless_dev;
 /* Driver transmit return codes */
 #define NETDEV_TX_OK 0		/* driver took care of packet */
 #define NETDEV_TX_BUSY 1	/* driver tx path was busy*/
+#define NETDEV_TX_DROPPED 2	/* driver tx path dropped packet*/
 #define NETDEV_TX_LOCKED -1	/* driver tx lock was already taken */
 
 /*
@@ -467,6 +468,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +597,9 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 		struct net_device *dev);
+	void			(*hard_end_xmit) (struct net_device *dev);
+	int			xmit_win;
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -609,6 +614,7 @@ struct net_device
 
 	/* delayed register/unregister */
 	struct list_head	todo_list;
+	struct sk_buff_head blist;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
 
@@ -1043,7 +1049,8 @@ 

[PATCH 2/3][NET_BATCH] Rev2.5 net core use batching

2007-10-09 Thread jamal
This patch adds the usage of batching within the core.

cheers,
jamal
[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
Performance results demonstrating improvement are provided separately.

I have #if-0ed some of the old functions so the patch is more readable.
A future patch will remove all if-0ed content.
Patrick McHardy eyeballed a bug that will cause re-ordering in case
of a requeue.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit c73d8ee8cce61a98f55fbfb2cafe813a7eca472c
tree 8b9155fe15baa4c2e7adb69585c7aa275a6bc896
parent 98d39ea7922fa2719a80eecd02cae359f3d7
author Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Tue, 09 Oct 2007 11:13:30 -0400

 net/sched/sch_generic.c |  104 ++-
 1 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 424c08b..d98c680 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q-q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
   struct Qdisc *q)
 {
@@ -110,6 +111,85 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev-xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+Dead loop on netdevice %s, fix it urgently!\n,
+dev-name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+	   struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue_tail(skbs)) != NULL)
+		q-ops-requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb-next) {
+		count += skb_shinfo(skb)-nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			   struct Qdisc *q,
+			   struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev-xmit_win;
+
+	if (count   dev-gso_skb) {
+		skb = dev-gso_skb;
+		dev-gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count  0) {
+		skb = q-dequeue(q);
+		if (!skb)
+			break;
+
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	return skb_queue_len(pktlist);
+}
 
 /*
  * NOTE: Called under dev-queue_lock with locally disabled BH.
@@ -133,19 +213,20 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev-qdisc;
-	struct sk_buff *skb;
-	int ret, xcnt = 0;
+	int ret = 0;
 
-	/* Dequeue packet */
-	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
-		return 0;
+	/* Dequeue packets */
+	ret = xmit_get_pkts(dev, q, dev-blist);
 
+	if (!ret)
+		return 0;
 
-	/* And release queue */
+	/* We got em packets */
 	spin_unlock(dev-queue_lock);
 
+	/* bye packets */
 	HARD_TX_LOCK(dev, smp_processor_id());
-	ret = dev_hard_start_xmit(skb, dev, xcnt);
+	ret = dev_batch_xmit(dev);
 	HARD_TX_UNLOCK(dev);
 
 	spin_lock(dev-queue_lock);
@@ -158,8 +239,8 @@ static inline int qdisc_restart(struct net_device *dev)
 		break;
 
 	case NETDEV_TX_LOCKED:
-		/* Driver try lock failed */
-		ret = handle_dev_cpu_collision(skb, dev, q);
+		/* Driver lock failed */
+		ret = xmit_islocked(dev-blist, dev, q);
 		break;
 
 	default:
@@ -168,7 +249,7 @@ static inline int qdisc_restart(struct net_device *dev)
 			printk(KERN_WARNING BUG %s code %d qlen %d\n,
 			   dev-name, ret, q-q.qlen);
 
-		ret = dev_requeue_skb(skb, dev, q);
+		ret = dev_requeue_skbs(dev-blist, dev, q);
 		break;
 	}
 
@@ -564,6 +645,9 @@ void dev_deactivate(struct net_device *dev)
 
 	skb = dev-gso_skb;
 	dev-gso_skb = NULL;
+	if (!skb_queue_empty(dev-blist))
+		skb_queue_purge(dev-blist);
+	dev-xmit_win = 1;
 	spin_unlock_bh(dev-queue_lock);
 
 	kfree_skb(skb);


  1   2   >