Re: [PATCH] [IPv4] Reply net unreachable ICMP message

2007-12-06 Thread Rami Rosen
Hello, Jarek,

I am sorry, but I am not sure I understand exactly what you mean when
you say:
"It overrides err codes from fib_lookup, where such decisions should be made."

What is incorrect here ?

There are two lines added in this patch;

IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES);
and err = -ENETUNREACH;

The first one, needless to say, is not relevant to err codes.

The second, err = -ENETUNREACH, is in ip_route_input_slow()
(net/ipv4/route.c).

Values are assigned to err more than once in this function;
for example,
e_hostunreach:
err = -EHOSTUNREACH;

e_inval:
err = -EINVAL;

e_nobufs:
err = -ENOBUFS;

So I don't think anything is incorrect here.


Regards,
Rami Rosen




On Dec 6, 2007 9:49 AM, Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 06-12-2007 07:31, Mitsuru Chinen wrote:
  The IPv4 stack doesn't reply with any ICMP destination unreachable message
  with net unreachable code when IP datagrams are being discarded
  because no route could be found in the forwarding path.
  Incidentally, the IPv6 stack replies with such an ICMPv6 message in a similar
  situation.
 
  Signed-off-by: Mitsuru Chinen [EMAIL PROTECTED]
  ---
   net/ipv4/route.c |2 ++
   1 files changed, 2 insertions(+), 0 deletions(-)
 
  diff --git a/net/ipv4/route.c b/net/ipv4/route.c
  index 6714bbc..ba85ec9 100644
  --- a/net/ipv4/route.c
  +++ b/net/ipv4/route.c
  @@ -1375,6 +1375,7 @@ static int ip_error(struct sk_buff *skb)
break;
case ENETUNREACH:
code = ICMP_NET_UNREACH;
  + IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES);
break;
case EACCES:
code = ICMP_PKT_FILTERED;
  @@ -2004,6 +2005,7 @@ no_route:
RT_CACHE_STAT_INC(in_no_route);
spec_dst = inet_select_addr(dev, 0, RT_SCOPE_UNIVERSE);
res.type = RTN_UNREACHABLE;
  + err = -ENETUNREACH;
goto local_input;
 
/*

 This patch seems to be wrong. It overrides err codes from
 fib_lookup, where such decisions should be made.

 Regards,
 Jarek P.

 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH] [IPv4] Reply net unreachable ICMP message

2007-12-06 Thread Mitsuru Chinen
On Thu, 6 Dec 2007 08:49:47 +0100
Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 06-12-2007 07:31, Mitsuru Chinen wrote:
  The IPv4 stack doesn't reply with any ICMP destination unreachable message
  with net unreachable code when IP datagrams are being discarded
  because no route could be found in the forwarding path.
  Incidentally, the IPv6 stack replies with such an ICMPv6 message in a similar
  situation.
  
  Signed-off-by: Mitsuru Chinen [EMAIL PROTECTED]
  ---
   net/ipv4/route.c |2 ++
   1 files changed, 2 insertions(+), 0 deletions(-)
  
  diff --git a/net/ipv4/route.c b/net/ipv4/route.c
  index 6714bbc..ba85ec9 100644
  --- a/net/ipv4/route.c
  +++ b/net/ipv4/route.c
  @@ -1375,6 +1375,7 @@ static int ip_error(struct sk_buff *skb)
  break;
  case ENETUNREACH:
  code = ICMP_NET_UNREACH;
  +   IP_INC_STATS_BH(IPSTATS_MIB_INNOROUTES);
  break;
  case EACCES:
  code = ICMP_PKT_FILTERED;
  @@ -2004,6 +2005,7 @@ no_route:
  RT_CACHE_STAT_INC(in_no_route);
  spec_dst = inet_select_addr(dev, 0, RT_SCOPE_UNIVERSE);
  res.type = RTN_UNREACHABLE;
  +   err = -ENETUNREACH;
  goto local_input;
   
  /*
 
 This patch seems to be wrong. It overrides err codes from
 fib_lookup, where such decisions should be made.

fib_lookup() replies -ESRCH in this situation.
It is necessary to override the variable by the suitable error
number like the code under e_hostunreach label.
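
For readers following along, the errno-to-ICMP mapping that the switch in
ip_error() performs can be sketched in plain userspace C. This is only an
illustration under stated assumptions, not the kernel code: the helper name
demo_icmp_code_for() is hypothetical, and the ICMP code values are spelled
out as local constants instead of being taken from kernel headers. The point
it shows is that an error with no case in the switch (such as ESRCH) yields
no reply at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* ICMP destination-unreachable codes, written out locally for the sketch. */
enum {
    DEMO_ICMP_NET_UNREACH  = 0,
    DEMO_ICMP_HOST_UNREACH = 1,
    DEMO_ICMP_PKT_FILTERED = 13,
};

/*
 * Hypothetical helper: translate the positive errno recorded for a route
 * into an ICMP unreachable code. Returns 0 and fills *code when a reply
 * would be sent, or -1 when the errno has no mapping and nothing is sent.
 */
static int demo_icmp_code_for(int route_errno, int *code)
{
    assert(code != NULL);

    switch (route_errno) {
    case EHOSTUNREACH:
        *code = DEMO_ICMP_HOST_UNREACH;
        return 0;
    case ENETUNREACH:
        *code = DEMO_ICMP_NET_UNREACH;
        return 0;
    case EACCES:
        *code = DEMO_ICMP_PKT_FILTERED;
        return 0;
    default:
        /* ESRCH and anything else unmapped: no ICMP message. */
        return -1;
    }
}
```

With this model, passing ESRCH returns -1, while ENETUNREACH selects the net
unreachable code.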

Best Regards,

Mitsuru Chinen [EMAIL PROTECTED]


Re: [PATCH] [IPv4] Reply net unreachable ICMP message

2007-12-06 Thread Jarek Poplawski
On 06-12-2007 09:14, Mitsuru Chinen wrote:
 On Thu, 6 Dec 2007 08:49:47 +0100
 Jarek Poplawski [EMAIL PROTECTED] wrote:
 
 On 06-12-2007 07:31, Mitsuru Chinen wrote:
 The IPv4 stack doesn't reply with any ICMP destination unreachable message
 with net unreachable code when IP datagrams are being discarded
 because no route could be found in the forwarding path.
 Incidentally, the IPv6 stack replies with such an ICMPv6 message in a similar
 situation.
...
 This patch seems to be wrong. It overrides err codes from
 fib_lookup, where such decisions should be made.
 
 fib_lookup() replies -ESRCH in this situation.
 It is necessary to override the variable by the suitable error
 number like the code under e_hostunreach label.

Probably I'm missing something, but I can't see how you can be sure that
only -ESRCH is possible here. Isn't ops->action() in fib_rules_lookup()
supposed to return this -ENETUNREACH when needed?
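
To make the concern concrete: if the lookup can already hand back a specific
error, an unconditional assignment at the no_route label would discard it. A
narrower alternative, sketched below in userspace C, translates only -ESRCH
and passes every other code through unchanged. The helper name
demo_no_route_err() is hypothetical, and this is an illustration of the idea
rather than a proposed fix; which codes can actually reach that label has not
been verified here.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: pick the error to record for a packet that has no
 * route. Only the generic "nothing found" result is rewritten; a more
 * specific error chosen by the lookup is preserved.
 */
static int demo_no_route_err(int lookup_err)
{
    /* Lookup errors are expected as negative errno values. */
    assert(lookup_err < 0);

    if (lookup_err == -ESRCH)
        return -ENETUNREACH;
    return lookup_err;
}
```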

Jarek P.


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread Stefan Rompf
Am Donnerstag, 6. Dezember 2007 03:25 schrieb David Miller:

 POSIX says nothing about the semantics of route resolution.

Of course not. Applications must not care about what happens at the transport 
layer.

 Non-blocking doesn't mean "cannot sleep no matter what".

... and as O_CREAT on open() isn't specifically documented to apply to
filenames starting with 'a', it is perfectly normal that "echo x >ash"
always fails since 2.6.22. To revert to the old behaviour, please do
"echo 1 >/proc/sys/fs/allow_a_file_creation".

Ok, irony aside. Just have a look at
http://www.opengroup.org/onlinepubs/009695399/functions/connect.html (I hope 
009695399 is not a personalization cookie ;-)

If the connection cannot be established immediately and O_NONBLOCK is set for 
the file descriptor for the socket, connect() shall fail and set errno to 
[EINPROGRESS], but the connection request shall not be aborted, and the 
connection shall be established asynchronously.

I think the words "shall fail" and "immediately" are quite clear.

  If this is changed for some IP sockets, event-driven applications
  will randomly and subtly break.

 If this was such a clear cut case we'd have changed things
 a long time ago, but it isn't so don't pretend this is the
 case.

Well, the only reason this doesn't break on a daily basis is because the code 
isn't in the kernel that long and not many people run applications on an 
IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior 
connections or dnssec based anonymous IPSEC someday. Trust me, you will 
revert this misbehaviour in -stable then.

For some real-life applications that break when nonblocking connect() blocks, 
look at squid or Mozilla Firefox, for example.

Stefan


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread David Miller
From: Stefan Rompf [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 09:49:01 +0100

 If the connection cannot be established immediately and O_NONBLOCK is set
 for the file descriptor for the socket, connect() shall fail and set errno to
 [EINPROGRESS], but the connection request shall not be aborted, and the
 connection shall be established asynchronously.
 
 I think the words "shall fail" and "immediately" are quite clear.

They are, but the context in which they apply is vague.

I can equally generate examples where the non-blocking behavior you
are a proponent of would break non-blocking UDP apps during a
sendmsg() call when we hit IPSEC resolution.  Yet similar language on
blocking semantics exists for sendmsg() in the standards.

The world is shades of gray, implying anything else is foolhardy and
that's how I'm handling this.

 Well, the only reason this doesn't break on a daily basis is because the code 
 isn't in the kernel that long and not many people run applications on an 
 IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior 
 connections or dnssec based anonymous IPSEC someday. Trust me, you will 
 revert this misbehaviour in -stable then.

I use IPSEC every single day in this fashion, and I haven't.


Re: Reproducible data corruption with sendfile+vsftp - splice regression?

2007-12-06 Thread Holger Hoffstaette

On Wed, 05 Dec 2007 23:54:29 +0100, Francois Romieu wrote:

 Holger Hoffstaette [EMAIL PROTECTED] : [...]
 Should I file this in bugzilla?
 
 Yes.

Thanks for responding - will do. I verified with 2.6.24-rc4 (same bug) and
have some new information about this.
Despite my previous posting the corruption is NOT triggered by NAPI. It
may be related, but even without NAPI but tso on again I got corruption,
now also on the gbit client (Thinkpad T60). When ftp'ing to ramdisk with
full speed (at a reasonable ~77 MB/sec) it often works, but intermediate
writes that cause the ftp to temporarily slow down reliably cause
corrupted files, so I guess tso gets confused when some kind of throttling
sets in during transfer. That is probably why I first noticed it on the
slow 100mbit client.
Maybe turning off sendfile or NAPI just led to random success - so far it
really looks like tso on the r8169 is the common cause.

thank you
Holger




Re: Reproducible data corruption with sendfile+vsftp - splice regression?

2007-12-06 Thread Holger Hoffstaette
(removing .kernel as it seems to concern netdev only)

On Thu, 06 Dec 2007 02:13:00 +0100, Francois Romieu wrote:

 Francois Romieu [EMAIL PROTECTED] :
 Holger Hoffstaette [EMAIL PROTECTED] : [...]
  Should I file this in bugzilla?
 
 Yes.
 
 5326 5585327 5585328 5585329 5585330 5585331 5585332 5585333 5585334
 5585335 558 5336 5585337 5585338 5585339 5585340 5585341 5585342 5585343
 5589440 5589441 558
  ^^^ ^^^
 9442 5589443 5589444 5589445 5589446 5589447 5589448 5589449 5589450
 5589451 558 9452 5589453 5589454 5589455 5589456 5589457 5589458 5589459
 5589460 5589461 558
 
 It misses 8*4096 bytes.
 
 8443 9068442 9068441 9068440 9068439 9068438 9068437 9068436 9068435
 9068434 906 8433 9068432 9068431 9068430 9068429 9068428 9068427 9064330
 9064329 9064328 906
  ^^^ ^^^
 4327 9064326 9064325 9064324 9064323 9064322 9064321 9064320 9064319
 9064318 906
 
 Same thing later.
 
 But the amount of data transmitted is fine.
 
 Could you locate the offsets where the sequence is broken ?

According to my hex editor the offsets are:

0x02aa43e4
0x02feb473
0x03142994
0x03765f33
0x03e42ff3
0x03e5079c
0x03e60d9c
0x0451db54
0x0452e7ec
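
For anyone wanting to reproduce this without a hex editor, here is a rough
sketch of how such offsets could be located programmatically. It assumes the
test file is a run of whitespace-separated decimal counters that change by a
constant step (+1 in the first excerpt above, -1 in the second); that layout
is inferred from the quoted excerpts, and the function name
demo_find_breaks() is made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Scan a NUL-terminated buffer of whitespace-separated decimal numbers and
 * record the byte offset of every number that is not equal to its
 * predecessor plus `step`. At most max_offsets offsets are stored; the
 * return value is the total number of breaks seen, which may be larger.
 */
static size_t demo_find_breaks(const char *buf, long step,
                               size_t *offsets, size_t max_offsets)
{
    const char *p = buf;
    long prev = 0;
    int have_prev = 0;
    size_t found = 0;

    assert(buf != NULL);

    for (;;) {
        char *end;
        long val;

        while (isspace((unsigned char)*p))
            p++;
        if (*p == '\0')
            break;

        val = strtol(p, &end, 10);
        if (end == p)
            break;              /* not a number: stop scanning */

        if (have_prev && val != prev + step) {
            if (found < max_offsets)
                offsets[found] = (size_t)(p - buf);
            found++;
        }
        prev = val;
        have_prev = 1;
        p = end;
    }
    return found;
}
```

Reading the received file into memory and calling this with step 1 (or -1
for the descending file) would list the start of each discontinuity, which
can then be compared against the offsets above.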

I'll also put all this into bugzilla.

thanks!
Holger




Re: bonding sysfs output

2007-12-06 Thread Ferenc Wagner
Jean Delvare [EMAIL PROTECTED] writes:

 On Mon, 26 Nov 2007 09:29:40 +0100, Wagner Ferenc wrote:

 On the policy side: some files are not applicable to some types of
 bonds, and return a single linefeed in that case.  Except for one
 single case, which returns 'NA\n'.  The patch changes these cases into
 empty files.

 IMHO a better approach would be to not create the files at all when
 they make no sense for a given type of bond.

That would require much more in-depth changes in the sysfs code, I'm
afraid.  But see also the 5th patch in the series, which responds to
Jay's suggestion.  And as such, goes in the opposite direction.
-- 
Thanks,
Feri.


[PATCH] remove prototype of ip_rt_advice

2007-12-06 Thread Denis V. Lunev
ip_rt_advice is gone, so there is no need to keep its prototype and debug message.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
--
diff --git a/include/net/route.h b/include/net/route.h
index f7ce625..59b0b19 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -109,7 +109,6 @@ struct in_device;
 extern int		ip_rt_init(void);
 extern void		ip_rt_redirect(__be32 old_gw, __be32 dst, __be32 new_gw,
 				       __be32 src, struct net_device *dev);
-extern void		ip_rt_advice(struct rtable **rp, int advice);
 extern void		rt_cache_flush(int how);
 extern int		__ip_route_output_key(struct rtable **, const struct flowi *flp);
 extern int		ip_route_output_key(struct rtable **, struct flowi *flp);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 134cab5..cefae61 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1198,7 +1198,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
 			unsigned hash = rt_hash(rt->fl.fl4_dst, rt->fl.fl4_src,
 						rt->fl.oif);
 #if RT_CACHE_DEBUG >= 1
-			printk(KERN_DEBUG "ip_rt_advice: redirect to "
+			printk(KERN_DEBUG "ipv4_negative_advice: redirect to "
 					  "%u.%u.%u.%u/%02x dropped\n",
 				NIPQUAD(rt->rt_dst), rt->fl.fl4_tos);
 #endif


Re: [PATCH] remove prototype of ip_rt_advice

2007-12-06 Thread David Miller
From: Denis V. Lunev [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 13:17:43 +0300

 ip_rt_advice has been gone, so no need to keep prototype and debug message.
 
 Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

Applied to net-2.6, thanks.


Re: TCP event tracking via netlink...

2007-12-06 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Wed, 5 Dec 2007 16:33:38 -0500

 On Wed, 05 Dec 2007 08:53:07 -0800
 Joe Perches [EMAIL PROTECTED] wrote:
 
   it occurred to me that we might want to do something
   like a state change event generator.
  
  This could be a basis for an interesting TCP
  performance tester.
 
 That is what tcpprobe does but it isn't detailed enough to address SACK
 issues.

Indeed, this could be done via the jprobe there.

Silly me I didn't do this in the implementation I whipped
up, which I'll likely correct.


Re: TCP event tracking via netlink...

2007-12-06 Thread David Miller
From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)

 On Wed, 5 Dec 2007, David Miller wrote:
 
  I assume you're using something like carefully crafted printk's,
  kprobes, or even ad-hoc statistic counters.  That's what I used to do
  :-)
 
 No, that's not at all what I do :-). I usually look at time-seq graphs 
 except for the cases when I just find things out by reading code (or
 by just thinking of it).

Can you briefly detail what graph tools and command lines
you are using?

The last time I did graphing to analyze things, the tools
were hit-or-miss.

 Much of the info is available in tcpdump already, it's just hard to read 
 without graphing it first because there are so many overlapping things 
 to track in two-dimensional space.
 
 ...But yes, I have to admit that couple of problems come to my mind
 where having some variable from tcp_sock would have made the problem
 more obvious.

The most important are the cwnd and ssthresh, which you could guess
using graphs but it is important to know on a packet to packet
basis why we might have sent a packet or not because this has
rippling effects down the rest of the RTT.

 Not sure what is the benefit of having distributions with it because 
 those people hardly report problems anyway to here, they're just too 
 happy with TCP performance unless we print something to their logs,
 which implies that we must setup a *_ON() condition :-(.

That may be true, but if we could integrate the information with
tcpdumps, we could gather internal state using tools the user
already has available.

Imagine if tcpdump printed out:

02:26:14.865805 IP $SRC > $DEST: . 11226:12686(1460) ack 0 win 108
ss_thresh: 129 cwnd: 133 packets_out: 132

or something like that.

 Some problems are simply such that things cannot be accurately verified 
 without high processing overhead until it's far too late (eg skb bits vs 
 *_out counters). Maybe we should start to build an expensive state 
 validator as well which would automatically check invariants of the write 
 queue and tcp_sock in a straightforward, unoptimized manner? That would 
 definitely do a lot of work for us, just ask people to turn it on and it 
 spits out everything that went wrong :-) (unless they really depend on 
 very high-speed things and are therefore unhappy if we scan thousands of 
 packets unnecessarily per ACK :-)). ...Early enough! ...That would work 
 also for distros but there's always human judgement needed to decide 
 whether the bug reporter will be happy when his TCP processing does no 
 longer scale ;-).

I think it's useful as a TCP_DEBUG config option or similar, sure.

But sometimes the algorithms are working as designed, it's just that
they provide poor pipe utilization and CWND analysis embedded inside
of a tcpdump would be one way to see that as well as determine the
flaw in the algorithm.

 ...Hopefully you found any of my comments useful.

Very much so, thanks.

I put together a sample implementation anyways just to show the idea,
against net-2.6.25 below.

It is untested since I didn't write the userland app yet to see that
proper things get logged.  Basically you could run a daemon that
writes per-connection traces into files based upon the incoming
netlink events.  Later, using the binary pcap file and these traces,
you can piece together traces like the above using the timestamps
etc. to match up pcap packets to ones from the TCP logger.
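
As a rough illustration of the matching step, a userland tool could look up,
for each pcap packet, the logger record with the nearest timestamp. The
sketch below assumes records carry a seconds/microseconds pair like the
tcp_log_stamp in the patch and are already sorted by time; the names
demo_stamp, demo_stamp_us() and demo_nearest() are invented for this example
and nothing like them exists in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Seconds plus microseconds, mirroring the tcp_log_stamp layout. */
struct demo_stamp {
    uint32_t tv_sec;
    uint32_t tv_usec;
};

/* Flatten a stamp to microseconds so two stamps compare as integers. */
static uint64_t demo_stamp_us(const struct demo_stamp *s)
{
    return (uint64_t)s->tv_sec * 1000000u + s->tv_usec;
}

/*
 * Return the index of the entry whose timestamp is closest to *pkt.
 * entries[] must be sorted ascending and n must be non-zero. On a tie
 * the earlier entry wins.
 */
static size_t demo_nearest(const struct demo_stamp *entries, size_t n,
                           const struct demo_stamp *pkt)
{
    uint64_t target;
    size_t lo = 0, hi = n;

    assert(entries != NULL && pkt != NULL && n > 0);
    target = demo_stamp_us(pkt);

    /* Binary search for the first entry at or after the target time. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (demo_stamp_us(&entries[mid]) < target)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return 0;
    if (lo == n)
        return n - 1;

    /* entries[lo - 1] is before the target, entries[lo] at or after it. */
    if (target - demo_stamp_us(&entries[lo - 1]) <=
        demo_stamp_us(&entries[lo]) - target)
        return lo - 1;
    return lo;
}
```

A real tool would also have to match on the connection key and cope with
clock differences between the capture and the logger; the sketch ignores
both.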

The userland tools could do analysis and print pre-cooked state diff
logs, like "this ACK raised CWND by one" or whatever else you wanted
to know.

It's nice that an expert like you can look at graphs and understand,
but we'd like to create more experts and besides reading code one
way to become an expert is to be able to extract live real data
from the kernel's working state and try to understand how things
got that way.  This information is permanently lost currently.

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 56342c3..c0e61d0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -170,6 +170,47 @@ struct tcp_md5sig {
 	__u8	tcpm_key[TCP_MD5SIG_MAXKEYLEN];		/* key (binary) */
 };
 
+/* TCP netlink event logger.  */
+struct tcp_log_key {
+	union {
+		__be32		a4;
+		__be32		a6[4];
+	} saddr, daddr;
+	__be16	sport;
+	__be16	dport;
+	unsigned short family;
+	unsigned short __pad;
+};
+
+struct tcp_log_stamp {
+	__u32	tv_sec;
+	__u32	tv_usec;
+};
+
+struct tcp_log_payload {
+	struct tcp_log_key	key;
+	struct tcp_log_stamp	stamp;
+	struct tcp_info		info;
+};
+
+enum {
+	TCP_LOG_A_UNSPEC = 0,
+	__TCP_LOG_A_MAX,
+};
+#define TCP_LOG_A_MAX		(__TCP_LOG_A_MAX - 1)
+
+#define TCP_LOG_GENL_NAME	"tcp_log"
+#define TCP_LOG_GENL_VERSION	1
+
+enum {
+	TCP_LOG_CMD_UNSPEC = 0,
+	TCP_LOG_CMD_HELLO,
+	TCP_LOG_CMD_GOODBYE,
+   

Re: TCP event tracking via netlink...

2007-12-06 Thread Evgeniy Polyakov
On Wed, Dec 05, 2007 at 09:03:43PM -0800, David Miller ([EMAIL PROTECTED]) 
wrote:
 I think this work is very different.
 
 When I say state I mean something more significant than
 CLOSE, ESTABLISHED, etc. which is what Samir's patches are
 tracking.
 
 I'm talking about all of the sequence numbers, SACK information,
 congestion control knobs, etc. whose values are nearly impossible to
 track on a packet to packet basis in order to diagnose problems.

I pointed to that work as a possible basis for collecting more info if you
need it, including sequence numbers, window sizes and so on.
It just requires a useful structure layout to be put in place, so that one
would not have to recreate the same bits again and it could be called from
any place inside the stack.

-- 
Evgeniy Polyakov


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread Stefan Rompf
Am Donnerstag, 6. Dezember 2007 09:53 schrieb David Miller:

  I think the words "shall fail" and "immediately" are quite clear.

 They are, but the context in which they apply is vague.

socket is connection-mode => SOCK_STREAM

 I can equally generate examples where the non-blocking behavior you
 are a proponent of would break non-blocking UDP apps during a
 sendmsg() call when we hit IPSEC resolution.  Yet similar language on
 blocking semantics exists for sendmsg() in the standards.

I am not a good enough kernel hacker to exactly understand the code flow in 
udp_sendmsg(). However, it seems that it first checks destination validity 
via ip_route_output_flow() and queues the message then. The sendmsg() 
documentation only talks about buffer space. I can see your dilemma.

The reason why I'm pushing this issue another time is that I know quite a 
bit about system level application development. A very typical design pattern 
for non-naive single or multi threaded programs is that they set all 
communication sockets to be nonblocking and use a select()/epoll() based loop 
to dispatch IO. This often includes initiating a TCP connect() and 
asynchronously waiting for it to finish or fail from the main loop.
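
A minimal sketch of that pattern, assuming IPv4 and poll() instead of a full
select()/epoll() loop, is shown here. The function names
demo_connect_start() and demo_connect_finish() are invented for the example.
It relies only on the documented behaviour quoted earlier: with O_NONBLOCK
set, connect() is expected to fail with EINPROGRESS instead of sleeping, and
completion is observed later through writability plus SO_ERROR.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Begin a TCP connect to an IPv4 address (host byte order) without blocking.
 * Returns the socket fd, or -1 with errno set. *pending is set to 1 when the
 * connect is still in progress (EINPROGRESS), 0 when it completed at once.
 */
static int demo_connect_start(uint32_t addr, uint16_t port, int *pending)
{
    struct sockaddr_in sa;
    int fd, flags, saved;

    assert(pending != NULL);

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(addr);
    sa.sin_port = htons(port);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) == 0) {
        *pending = 0;
        return fd;
    }
    if (errno == EINPROGRESS) {
        *pending = 1;
        return fd;
    }
fail:
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}

/*
 * Wait up to timeout_ms for a pending connect to finish. Returns 0 on
 * success, -1 with errno set on failure or timeout (ETIMEDOUT).
 */
static int demo_connect_finish(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int err = 0;
    socklen_t len = sizeof(err);
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    if (rc == 0) {
        errno = ETIMEDOUT;
        return -1;
    }
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}
```

In an event-driven server the poll() call would be the shared main loop, so
a connect() that sleeps inside demo_connect_start() stalls every other
connection handled by that thread.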

The dangerous situation here is that in 99% of all cases things will just work 
because the phase 2 SA exists. In 0.8%, the SA will be established in 1 sec. 
However, in the rest of time the server application that you have considered 
to be stable will end up sleeping with all threads in a connect() call that 
is supposed to return immediately.

 The world is shades of gray, implying anything else is foolhardy and
 that's how I'm handling this.

Even though I consider programmers that ignore the result code on a 
nonblocking UDP sendmsg() fools, I agree. Maybe the best compromise is what 
Herbert Xu suggested in [EMAIL PROTECTED] in this 
thread: At least for connect(), O_NONBLOCK is ALWAYS respected, because this 
is where the chance for breakage is highest.

Stefan


[Patch] net/xfrm/xfrm_policy.c: Some small improvements

2007-12-06 Thread WANG Cong

This patch contains the following changes.

- Use 'bool' instead of 'int' for booleans.
- Use 'size_t' instead of 'int' for 'sizeof' return value.
- Some style fixes.
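
To illustrate the first two points outside the kernel, here is a small
userspace sketch of a word-by-word comparison in the spirit of
selector_cmp(), using bool for the result and size_t for the sizeof-derived
length and loop index. The struct demo_selector is a hypothetical stand-in
for struct xfrm_selector, and the function is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a selector; its size is a multiple of 4. */
struct demo_selector {
    uint32_t daddr;
    uint32_t saddr;
    uint32_t ports;
    uint32_t ifindex;
};

/* Return true when the two selectors differ in any 32-bit word. */
static bool demo_selector_differs(const struct demo_selector *s1,
                                  const struct demo_selector *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    size_t len = sizeof(struct demo_selector) / sizeof(uint32_t);
    size_t i;

    assert(s1 != NULL && s2 != NULL);

    for (i = 0; i < len; i++) {
        uint32_t w1, w2;

        /* memcpy avoids reading the struct through a mistyped pointer. */
        memcpy(&w1, p1 + i * sizeof(uint32_t), sizeof(w1));
        memcpy(&w2, p2 + i * sizeof(uint32_t), sizeof(w2));
        if (w1 != w2)
            return true;
    }
    return false;
}
```

One caveat with such conversions: a loop that counts down with a test like
i >= 0 has to keep a signed index, because that test is always true for
size_t.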

Cc: Herbert Xu [EMAIL PROTECTED]
Cc: David Miller [EMAIL PROTECTED]
Signed-off-by: WANG Cong [EMAIL PROTECTED]

---
 net/xfrm/xfrm_policy.c |   23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5d6a81d..311b08f 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -476,17 +476,17 @@ static u32 xfrm_gen_index(u8 type, int dir)
 		struct hlist_head *list;
 		struct xfrm_policy *p;
 		u32 idx;
-		int found;
+		bool found;
 
 		idx = (idx_generator | dir);
 		idx_generator += 8;
 		if (idx == 0)
 			idx = 8;
 		list = xfrm_policy_byidx + idx_hash(idx);
-		found = 0;
+		found = false;
 		hlist_for_each_entry(p, entry, list, byidx) {
 			if (p->index == idx) {
-				found = 1;
+				found = true;
 				break;
 			}
 		}
@@ -499,8 +499,8 @@ static inline int selector_cmp(struct xfrm_selector *s1, struct xfrm_selector *s
 {
 	u32 *p1 = (u32 *) s1;
 	u32 *p2 = (u32 *) s2;
-	int len = sizeof(struct xfrm_selector) / sizeof(u32);
-	int i;
+	size_t len = sizeof(struct xfrm_selector) / sizeof(u32);
+	size_t i;
 
 	for (i = 0; i < len; i++) {
 		if (p1[i] != p2[i])
@@ -953,7 +953,7 @@ static int xfrm_policy_lookup(struct flowi *fl, u16 family, u8 dir,
 #ifdef CONFIG_XFRM_SUB_POLICY
 end:
 #endif
-	if ((*objp = (void *) pol) != NULL)
+	if ((*objp = pol) != NULL)
 		*obj_refp = &pol->refcnt;
 	return err;
 }
@@ -1137,7 +1137,7 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, struct flowi *fl,
 	xfrm_address_t *saddr = xfrm_flowi_saddr(fl, family);
 	xfrm_address_t tmp;
 
-	for (nx=0, i = 0; i < policy->xfrm_nr; i++) {
+	for (nx = 0, i = 0; i < policy->xfrm_nr; i++) {
 		struct xfrm_state *x;
 		xfrm_address_t *remote = daddr;
 		xfrm_address_t *local  = saddr;
@@ -1395,7 +1395,7 @@ free_dst:
 }
 
 static int inline
-xfrm_dst_alloc_copy(void **target, void *src, int size)
+xfrm_dst_alloc_copy(void **target, void *src, size_t size)
 {
 	if (!*target) {
 		*target = kmalloc(size, GFP_ATOMIC);
@@ -1554,7 +1554,7 @@ restart:
 #endif
 		nx = xfrm_tmpl_resolve(pols, npols, fl, xfrm, family);
 
-		if (unlikely(nx<0)) {
+		if (unlikely(nx < 0)) {
 			err = nx;
 			if (err == -EAGAIN && sysctl_xfrm_larval_drop) {
 				/* EREMOTE tells the caller to generate
@@ -1688,7 +1688,8 @@ xfrm_state_ok(struct xfrm_tmpl *tmpl, struct xfrm_state *x,
 	      unsigned short family)
 {
 	if (xfrm_state_kern(x))
-		return tmpl->optional && !xfrm_state_addr_cmp(tmpl, x, tmpl->encap_family);
+		return tmpl->optional &&
+			!xfrm_state_addr_cmp(tmpl, x, tmpl->encap_family);
 	return	x->id.proto == tmpl->id.proto &&
 		(x->id.spi == tmpl->id.spi || !tmpl->id.spi) &&
 		(x->props.reqid == tmpl->reqid || !tmpl->reqid) &&
@@ -1777,7 +1778,7 @@ int __xfrm_policy_check(struct sock *sk, int dir, struct sk_buff *skb,
 	if (skb->sp) {
 		int i;
 
-		for (i=skb->sp->len-1; i>=0; i--) {
+		for (i = skb->sp->len-1; i >= 0; i--) {
 			struct xfrm_state *x = skb->sp->xvec[i];
 			if (!xfrm_selector_match(&x->sel, &fl, family))
 				return 0;


Re: [PATCH net-2.6.25 10/11][INET] Eliminate difference in actions of sysctl and proc handler for conf.all.forwarding

2007-12-06 Thread Herbert Xu
On Wed, Dec 05, 2007 at 09:39:33PM -0800, David Miller wrote:

 But we go back again to the question of how to get this current
 behavior setting instantiated early enough.  So much stuff happens
 via initrd's etc. before the real userland has a chance to run things,
 read settings from the real filesystem config files, in order to change
 this.

Perhaps a boot time command line option?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread David Miller
From: Stefan Rompf [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 11:56:48 +0100

 Am Donnerstag, 6. Dezember 2007 09:53 schrieb David Miller:
 
   I think the words "shall fail" and "immediately" are quite clear.
 
  They are, but the context in which they apply is vague.
 
 socket is connection-mode => SOCK_STREAM

I meant whether "immediately" is meant in reference to socket
state or includes auxiliary things like route lookups.

When you do a non-blocking write on a socket, things like
memory allocations can block, potentially for a long time.
It is an example where there are definite boundaries to where
the non-blocking'ness applies.

And therefore it is not as cut and dried as you present this
issue.

 The reason why I'm pushing this issue another time is that I know quite a 
 bit about system level application development. A very typical design pattern 
 for non-naive single or multi threaded programs is that they set all 
 communication sockets to be nonblocking and use a select()/epoll() based loop 
 to dispatch IO. This often includes initiating a TCP connect() and 
 asynchronously waiting for it to finish or fail from the main loop.

 The dangerous situation here is that in 99% of all cases things will just 
 work because the phase 2 SA exists. In 0.8%, the SA will be established in 
 1 sec. However, in the rest of time the server application that you have 
 considered to be stable will end up sleeping with all threads in a connect() 
 call that is supposed to return immediately.

And that connect() call can hang for a long time due to any memory
allocation done in the connect() path.

You are not avoiding blocking by setting O_NONBLOCK on the socket, it
is quite foolhardy to think that it does so unilaterally.

And that's why this is a grey area.  Why is waiting for memory
allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
resolution is not?


Re: [Patch] net/xfrm/xfrm_policy.c: Some small improvements

2007-12-06 Thread David Miller
From: WANG Cong [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 19:01:23 +0800

 
 This patch contains the following changes.
 
   - Use 'bool' instead of 'int' for booleans.
   - Use 'size_t' instead of 'int' for 'sizeof' return value.
   - Some style fixes.
 
 Cc: Herbert Xu [EMAIL PROTECTED]
 Cc: David Miller [EMAIL PROTECTED]
 Signed-off-by: WANG Cong [EMAIL PROTECTED]

Normally I would let a patch like this sit in my mailbox
for a week and then delete it.

But this time I'll just let you know up front that I
don't see much value in this patch.  It is not a clear
improvement to replace int's with bool's in my mind and
the other changes are just whitespace changes.

And thus I can delete the patch from my mailbox
immediately :-)

Sorry.


Re: [PATCH net-2.6.25 10/11][INET] Eliminate difference in actions of sysctl and proc handler for conf.all.forwarding

2007-12-06 Thread David Miller
From: Herbert Xu [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 22:06:01 +1100

 On Wed, Dec 05, 2007 at 09:39:33PM -0800, David Miller wrote:
 
  But we go back again to the question of how to get this current
  behavior setting instantiated early enough.  So much stuff happens
  via initrd's etc. before the real userland has a chance to run things,
  read settings from the real filesystem config files, in order to change
  this.
 
 Perhaps a boot time command line option?

It's not pleasant but it would indeed work.


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread Stefan Rompf
Am Donnerstag, 6. Dezember 2007 12:13 schrieb David Miller:

 And that's why this is a grey area.  Why is waiting for memory
 allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
 resolution is not?

Because you will just put enough RAM modules into your server when setting up a 
scalable system. Local resource, manageable by the admin. What you cannot 
control in many cases is the network connection to the remote node. Simon 
Arlott has been talking about an 8 hour network outage.

Stefan


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread David Miller
From: Stefan Rompf [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 12:35:05 +0100

 Because you just will put enough RAM modules into your server when
 setting up a scalable system.

This suggestion is avoiding the important semantic issue, and
won't lead to a real discussion of the core problem.



[PATCH 2.6.25] multiple namespaces in the all dst_ifdown routines

2007-12-06 Thread Denis V. Lunev
Move dst entries to the loopback device of their own namespace, rather than
init_net's, to catch refcounting leaks.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/core/dst.c  |4 ++--
 net/ipv4/route.c|5 +++--
 net/ipv4/xfrm4_policy.c |3 ++-
 net/ipv6/route.c|7 +--
 net/ipv6/xfrm6_policy.c |3 ++-
 net/xfrm/xfrm_policy.c  |2 +-
 6 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index f538061..5c6cfc4 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -279,11 +279,11 @@ static inline void dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 	if (!unregister) {
 		dst->input = dst->output = dst_discard;
 	} else {
-		dst->dev = init_net.loopback_dev;
+		dst->dev = dst->dev->nd_net->loopback_dev;
 		dev_hold(dst->dev);
 		dev_put(dev);
 		if (dst->neighbour && dst->neighbour->dev == dev) {
-			dst->neighbour->dev = init_net.loopback_dev;
+			dst->neighbour->dev = dst->dev;
 			dev_put(dev);
 			dev_hold(dst->neighbour->dev);
 		}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index dae1290..e4aa97e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1425,8 +1425,9 @@ static void ipv4_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 {
 	struct rtable *rt = (struct rtable *) dst;
 	struct in_device *idev = rt->idev;
-	if (dev != init_net.loopback_dev && idev && idev->dev == dev) {
-		struct in_device *loopback_idev = in_dev_get(init_net.loopback_dev);
+	if (dev != dev->nd_net->loopback_dev && idev && idev->dev == dev) {
+		struct in_device *loopback_idev =
+			in_dev_get(dev->nd_net->loopback_dev);
 		if (loopback_idev) {
 			rt->idev = loopback_idev;
 			in_dev_put(idev);
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 19fdf8a..e086260 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -216,7 +216,8 @@ static void xfrm4_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 
 	xdst = (struct xfrm_dst *)dst;
 	if (xdst->u.rt.idev->dev == dev) {
-		struct in_device *loopback_idev = in_dev_get(init_net.loopback_dev);
+		struct in_device *loopback_idev =
+			in_dev_get(dev->nd_net->loopback_dev);
 		BUG_ON(!loopback_idev);
 
 		do {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e36cac9..e757a3c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -216,9 +216,12 @@ static void ip6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 {
 	struct rt6_info *rt = (struct rt6_info *)dst;
 	struct inet6_dev *idev = rt->rt6i_idev;
+	struct net_device *loopback_dev =
+		dev->nd_net->loopback_dev;
 
-	if (dev != init_net.loopback_dev && idev != NULL && idev->dev == dev) {
-		struct inet6_dev *loopback_idev = in6_dev_get(init_net.loopback_dev);
+	if (dev != loopback_dev && idev != NULL && idev->dev == dev) {
+		struct inet6_dev *loopback_idev =
+			in6_dev_get(loopback_dev);
 		if (loopback_idev != NULL) {
 			rt->rt6i_idev = loopback_idev;
 			in6_dev_put(idev);
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index cc0d151..7b360ea 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -233,7 +233,8 @@ static void xfrm6_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 
 	xdst = (struct xfrm_dst *)dst;
 	if (xdst->u.rt6.rt6i_idev->dev == dev) {
-		struct inet6_dev *loopback_idev = in6_dev_get(init_net.loopback_dev);
+		struct inet6_dev *loopback_idev =
+			in6_dev_get(dev->nd_net->loopback_dev);
 		BUG_ON(!loopback_idev);
 
 		do {
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index a9ac748..900f6b6 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1932,7 +1932,7 @@ static int stale_bundle(struct dst_entry *dst)
 void xfrm_dst_ifdown(struct dst_entry *dst, struct net_device *dev)
 {
 	while ((dst = dst->child) && dst->xfrm && dst->dev == dev) {
-		dst->dev = init_net.loopback_dev;
+		dst->dev = dev->nd_net->loopback_dev;
 		dev_hold(dst->dev);
 		dev_put(dev);
 	}
-- 
1.5.3.rc5
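
The effect of the change can be modelled outside the kernel. The sketch below
is only an illustration under stated assumptions: model_net, model_dev,
model_dst and model_dst_ifdown() are hypothetical stand-ins rather than kernel
types, and plain integer counters stand in for dev_hold()/dev_put(). It shows
a dst being re-homed to the loopback device of the namespace that owns the
departing device, with one reference moving along with it.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct net, struct net_device, struct dst_entry. */
struct model_net;

struct model_dev {
	int refcnt;                 /* stands in for the device reference count */
	struct model_net *nd_net;   /* namespace the device belongs to */
};

struct model_net {
	struct model_dev *loopback_dev;
};

struct model_dst {
	struct model_dev *dev;
};

/*
 * Re-home a dst from a device that is going away to the loopback device of
 * that device's own namespace, moving one reference along with it.
 */
void model_dst_ifdown(struct model_dst *dst, struct model_dev *dev)
{
	if (dst->dev != dev)
		return;
	dst->dev = dev->nd_net->loopback_dev;
	dst->dev->refcnt++;         /* models dev_hold(dst->dev) */
	dev->refcnt--;              /* models dev_put(dev) */
}
```

With two namespaces in the model, a device in the second namespace hands its
dst to the second namespace's loopback, so the first namespace's loopback
count stays untouched and a leaked reference shows up where it belongs.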



Re: [BRIDGE] warning message when add an interface to bridge

2007-12-06 Thread Chung-Chi Lo
Thanks. After applying this patch, the warning message is gone.

[PATCH] net: Fix running without sysfs

On Dec 6, 2007 2:00 PM, Eric W. Biederman [EMAIL PROTECTED] wrote:

 Stephen Hemminger [EMAIL PROTECTED] writes:

  On Wed, 5 Dec 2007 10:44:17 +0800
  Chung-Chi Lo [EMAIL PROTECTED] wrote:
 
  My kernel is Linux 2.6.22.1. SYSFS is off.
  When adding an interface to a bridge, the console shows a WARNING message.
  If I turn SYSFS on, the WARNING message is gone.
  Any suggestions on how to debug this problem? Thanks.
 
  # ifconfig eth0 0.0.0.0
  eth0: starting interface.
  # brctl addbr br0
  # brctl addif br0 eth0
  WARNING: at lib/kref.c:33 kref_get()
  Call Trace:
  [80027844] dump_stack+0x8/0x38
  [8011f348] kref_get+0xdc/0xe4
  [8011ee20] kobject_get+0x20/0x34
  [8011e910] kobject_shadow_add+0x5c/0x170
  [8011ea34] kobject_add+0x10/0x20
  [8020aac0] br_add_if+0xb4/0x1b4
  [8020b354] add_del_if+0x5c/0x118
  [8020bcc4] br_dev_ioctl+0x6c/0x88
  [80182edc] dev_ifsioc+0x334/0x3c0
  [80183184] dev_ioctl+0x21c/0x2ec
  [8016f76c] sock_ioctl+0x130/0x2e4
  [800b3b2c] do_ioctl+0x6c/0x84
  [800b3d40] vfs_ioctl+0x80/0x248
  [800b3f58] sys_ioctl+0x50/0x98
  [8002a8a8] stack_done+0x20/0x3c
 
  This is an artifact of the kobject_shadow code, which was reverted in
  later kernels.
  It is gone in 2.6.23.

 I don't think it was the kobject_shadow, but rather that we didn't
 initialize the kref or something like that in net/core/dev.c

 I believe commit 8b41d1887db718be9a2cd9e18c58ce25a4c7fd93 was the fix.

 Disabling sysfs can be a fun exercise in finding corner case bugs right now.

 Eric





-- 
Lino, Chung-Chi Lo


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread Stefan Rompf
Am Donnerstag, 6. Dezember 2007 12:39 schrieb David Miller:

  Because you just will put enough RAM modules into your server when
  setting up a scalable system.

 This suggestion is avoiding the important semantic issue, and
 won't lead to a real discussion of the core problem.

When writing applications for Unix operating systems, it has been known for
ages that stuff can be swapped out and that even things like memory accesses
can block. So it is not really surprising when a system call has to wait for
memory - just imagine the kernel code for connect() could be and has been
swapped out.

Even with moderate swap activity, this memory should be available in much less
than one second. If on the other hand the system is already thrashing, it makes
no difference whether it does so within connect() or while reaching the
connect() system call in the application flow.

Btw, this is where admin responsibility to size their systems kicks in.

So where I would draw the line: connect() is clearly a network related
function. Therefore, if a nonblocking connect() has to sleep for a local,
controllable resource like memory to become available, this is ok. Maybe it
shouldn't wait for a 128MB buffer if someone configured such an abomination,
haven't thought deeply about that. But when being told not to wait for the
connection to complete, it should never ever wait for another network related
activity like IPSEC SA setup to complete, especially not for hours.

IMHO this is what developers expect, and is also consistent with the fact that
POSIX does not define O_NONBLOCK behaviour for local files.
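
For reference, the application-side pattern this expectation rests on looks
roughly like the sketch below. It only illustrates the usual POSIX calling
sequence (O_NONBLOCK, EINPROGRESS, poll(), SO_ERROR); the function names
start_nonblocking_connect() and finish_connect() are made up for the example,
and it does not reproduce the IPsec behaviour discussed in this thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a TCP socket, switch it to O_NONBLOCK and start connecting.
 * Returns the fd if the connect completed or is in progress, -1 on error.
 */
int start_nonblocking_connect(const struct sockaddr_in *addr)
{
	int flags;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		goto fail;
	if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
		return fd;              /* completed immediately */
	if (errno == EINPROGRESS)
		return fd;              /* the caller is expected to poll() */
fail:
	close(fd);
	return -1;
}

/*
 * Wait at most timeout_ms for the connect to finish.
 * Returns 0 on success, otherwise an errno value (ETIMEDOUT on timeout).
 */
int finish_connect(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int err = 0;
	socklen_t len = sizeof(err);
	int n = poll(&pfd, 1, timeout_ms);

	if (n == 0)
		return ETIMEDOUT;
	if (n < 0)
		return errno;
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
		return errno;
	return err;
}
```

The point of contention is the first connect() call: the application assumes
it returns promptly, and that any waiting on the network happens inside the
poll() with a timeout the application chose.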

Stefan


Re: [PATCH net-2.6.25 10/11][INET] Eliminate difference in actions of sysctl and proc handler for conf.all.forwarding

2007-12-06 Thread Pavel Emelyanov
Herbert Xu wrote:
 David Miller [EMAIL PROTECTED] wrote:
 The user is pretty much screwed in one way or the other.
 For example:

 1) If 'default' propagates to all devices, any specific
   setting for a device is lost.

 2) If 'default' does not propagate, there is no way to
   have 'default' influence devices which have already
   been loaded.
 
 Well the way it works on IPv4 currently (for most options) is
 that we'll propagate default settings to a device until either:
 
 1) the user modifies the setting for that device;
 2) or that an IPv4 address has been added to the device.

BTW, this is not 100% true. Look, in rtm_to_ifaddr()
I see the following code flow:

ipv4_devconf_setall(in_dev);

ifa = inet_alloc_ifa();
if (ifa == NULL) {
/*
 * A potential indev allocation can be left alive, it stays
 * assigned to its device and is destroy with it.
 */
err = -ENOBUFS;
goto errout;
}

If we fail to allocate the ifa (unlikely, but possible), ipv4_devconf_setall()
has already run, so the device will no longer accept the default propagation
even though no address was added.

If this is a relevant note, I can prepare the patch.
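
To make the ordering issue concrete, here is a small user-space model. It is
a sketch under assumptions: model_in_dev and the model_* functions are
hypothetical, and a single boolean stands in for whatever
ipv4_devconf_setall() marks on the device. It is not the patch offered above,
only an illustration of why the order of the two steps matters.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct in_device. */
struct model_in_dev {
	bool follows_default;   /* true while default settings still propagate */
};

/* Stands in for inet_alloc_ifa(); 'fail' simulates an allocation failure. */
static void *model_alloc_ifa(bool fail)
{
	return fail ? NULL : malloc(16);
}

/* Order described in the message: pin the settings, then allocate. */
int model_add_addr_pin_first(struct model_in_dev *dev, bool fail)
{
	void *ifa;

	dev->follows_default = false;   /* models ipv4_devconf_setall() */
	ifa = model_alloc_ifa(fail);
	if (ifa == NULL)
		return -ENOBUFS;        /* settings stay pinned anyway */
	free(ifa);
	return 0;
}

/* Alternative order: allocate first, pin only once that succeeded. */
int model_add_addr_alloc_first(struct model_in_dev *dev, bool fail)
{
	void *ifa = model_alloc_ifa(fail);

	if (ifa == NULL)
		return -ENOBUFS;        /* device left untouched */
	dev->follows_default = false;
	free(ifa);
	return 0;
}
```

In the first variant a failed allocation still leaves the device detached
from the defaults; in the second the failure has no side effect.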
 
 2) was done to preserve backwards compatibility as the controls
 were previously only available after address addition and we did
 not propagate default settings in that case..
 
 We could easily extend this so that the default propagation
 worked until the user modified the setting, with an ioctl to
 revert to the current behaviour for compatibility.
 
 Cheers,



[RFC2][PATCH 7/7] [TFRC]: New rx history code

2007-12-06 Thread Arnaldo Carvalho de Melo
Gerrit,

I think I got this right this time, please see if there is
anything left so that we can move on. I plan to go thru the following
patches restricting myself to namespacing and consistency issues,
leaving ideas I have for later, when we get more of your backlog merged.

The first six patches in this series are unmodified, so if you
are OK with them please send me your Signed-off-by.

Thanks a lot,

- Arnaldo

From 2a3b4067dd514ce0e307d165783bc561cc7f17c4 Mon Sep 17 00:00:00 2001
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 10:56:58 -0200
Subject: [PATCH 7/7] [TFRC]: New rx history code

Credit here goes to Gerrit Renker, who provided the initial implementation for
this new codebase.

I modified it just to try to make it closer to the existing API, renaming some
functions, adding namespacing and fixing one bug where tfrc_rx_hist_alloc was
not freeing the allocated ring entries on the error path.

Original changeset comment from Gerrit:
  ---
This provides a new, self-contained and generic RX history service for TFRC
based protocols.

Details:
 * new data structure, initialisation and cleanup routines;
 * allocation of dccp_rx_hist entries local to packet_history.c,
   as a service exported by the dccp_tfrc_lib module.
 * interface to automatically track highest-received seqno;
 * receiver-based RTT estimation (needed for instance by RFC 3448, 6.3.1);
 * a generic function to test for `data packets' as per  RFC 4340, sec. 7.7.

Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/ccid3.c  |  292 +--
 net/dccp/ccids/ccid3.h  |   14 +-
 net/dccp/ccids/lib/loss_interval.c  |   13 ++-
 net/dccp/ccids/lib/packet_history.c |  290 +--
 net/dccp/ccids/lib/packet_history.h |   83 +--
 5 files changed, 334 insertions(+), 358 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index 5ff5aab..28a5e4d 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -641,6 +641,15 @@ static int ccid3_hc_tx_getsockopt(struct sock *sk, const int optname, int len,
 /*
  * Receiver Half-Connection Routines
  */
+
+/* CCID3 feedback types */
+enum ccid3_fback_type {
+	CCID3_FBACK_NONE = 0,
+	CCID3_FBACK_INITIAL,
+	CCID3_FBACK_PERIODIC,
+	CCID3_FBACK_PARAM_CHANGE
+};
+
 #ifdef CONFIG_IP_DCCP_CCID3_DEBUG
 static const char *ccid3_rx_state_name(enum ccid3_hc_rx_states state)
 {
@@ -667,59 +676,60 @@ static void ccid3_hc_rx_set_state(struct sock *sk,
 	hcrx->ccid3hcrx_state = state;
 }
 
-static inline void ccid3_hc_rx_update_s(struct ccid3_hc_rx_sock *hcrx, int len)
-{
-	if (likely(len > 0))	/* don't update on empty packets (e.g. ACKs) */
-		hcrx->ccid3hcrx_s = tfrc_ewma(hcrx->ccid3hcrx_s, len, 9);
-}
-
-static void ccid3_hc_rx_send_feedback(struct sock *sk)
+static void ccid3_hc_rx_send_feedback(struct sock *sk,
+				      const struct sk_buff *skb,
+				      enum ccid3_fback_type fbtype)
 {
 	struct ccid3_hc_rx_sock *hcrx = ccid3_hc_rx_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
-	struct tfrc_rx_hist_entry *packet;
 	ktime_t now;
-	suseconds_t delta;
+	s64 delta = 0;
 
 	ccid3_pr_debug("%s(%p) - entry \n", dccp_role(sk), sk);
 
+	if (unlikely(hcrx->ccid3hcrx_state == TFRC_RSTATE_TERM))
+		return;
+
 	now = ktime_get_real();
 
-	switch (hcrx->ccid3hcrx_state) {
-	case TFRC_RSTATE_NO_DATA:
+	switch (fbtype) {
+	case CCID3_FBACK_INITIAL:
 		hcrx->ccid3hcrx_x_recv = 0;
+		hcrx->ccid3hcrx_pinv   = ~0U;   /* see RFC 4342, 8.5 */
 		break;
-	case TFRC_RSTATE_DATA:
-		delta = ktime_us_delta(now,
-				       hcrx->ccid3hcrx_tstamp_last_feedback);
-		DCCP_BUG_ON(delta < 0);
-		hcrx->ccid3hcrx_x_recv =
-			scaled_div32(hcrx->ccid3hcrx_bytes_recv, delta);
+	case CCID3_FBACK_PARAM_CHANGE:
+		/*
+		 * When parameters change (new loss or p > p_prev), we do not
+		 * have a reliable estimate for R_m of [RFC 3448, 6.2] and so
+		 * need to  reuse the previous value of X_recv. However, when
+		 * X_recv was 0 (due to early loss), this would kill X down to
+		 * s/t_mbi (i.e. one packet in 64 seconds).
+		 * To avoid such drastic reduction, we approximate X_recv as
+		 * the number of bytes since last feedback.
+		 * This is a safe fallback, since X is bounded above by X_calc.
+		 */
+		if (hcrx->ccid3hcrx_x_recv > 0)
+			break;
+		/* fall through */
+	case CCID3_FBACK_PERIODIC:
+		delta = ktime_us_delta(now, 
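
(The archived copy of the patch is cut off at this point.)

Two small computations appear in the hunk above: the moving average of the
packet size (the tfrc_ewma() call with weight 9) and the receive rate derived
from a byte count and a microsecond interval. The sketch below shows both in
user space. It rests on assumptions: that the weight is expressed in tenths
given to the old average, and that the rate is bytes per second; the names
ewma_tenths() and bytes_per_second() are made up for this example and are not
the kernel helpers.

```c
#include <stdint.h>

/*
 * Moving average with the weight given in tenths: weight = 9 keeps 90% of
 * the old average and takes 10% of the new sample. The first sample seeds
 * the average. Large averages can overflow 32 bits in this simple form.
 */
uint32_t ewma_tenths(uint32_t avg, uint32_t newval, uint8_t weight)
{
	return avg ? (weight * avg + (10 - weight) * newval) / 10 : newval;
}

/*
 * Bytes received over an interval of delta_us microseconds, expressed as
 * bytes per second. A non-positive interval yields 0 rather than a
 * division by zero.
 */
uint32_t bytes_per_second(uint32_t bytes, int64_t delta_us)
{
	if (delta_us <= 0)
		return 0;
	return (uint32_t)(((uint64_t)bytes * 1000000u) / (uint64_t)delta_us);
}
```

For example, an average of 1000 bytes followed by a 500 byte packet moves to
950 with weight 9, and 1500 bytes seen in half a second correspond to 3000
bytes per second.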

Re: TCP event tracking via netlink...

2007-12-06 Thread Arnaldo Carvalho de Melo
Em Thu, Dec 06, 2007 at 02:20:58AM -0800, David Miller escreveu:
 From: Stephen Hemminger [EMAIL PROTECTED]
 Date: Wed, 5 Dec 2007 16:33:38 -0500
 
  On Wed, 05 Dec 2007 08:53:07 -0800
  Joe Perches [EMAIL PROTECTED] wrote:
  
it occurred to me that we might want to do something
like a state change event generator.
   
   This could be a basis for an interesting TCP
   performance tester.
  
  That is what tcpprobe does but it isn't detailed enough to address SACK
  issues.
 
 Indeed, this could be done via the jprobe there.
 
 Silly me I didn't do this in the implementation I whipped
 up, which I'll likely correct.

I have some experiments from the past on this area:

This is what is produced by ctracer + the ostra callgrapher when
tracking many sk_buff objects, tracing sk_buff routines as well as all
other structs that have a pointer to a sk_buff, i.e. where the sk_buff
can be obtained from the struct that points to it. tcp_sock is an
alias to struct inet_sock, which is an alias to struct sock, etc., so
when tracing tcp_sock you also trace inet_connection_sock, inet_sock,
sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/many_objects/

With just one object (that is reused, so appears many times):

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sk_buff/0x8101013130e8/

Following struct sock methods:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/many_objects/

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/

struct socket:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/socket/many_objects/

It works by using the DWARF information to generate a systemtap module
that in turn will create a relayfs channel where we store the traces and
an automatically reorganized struct with just the base types (int, char,
long, etc) and typedefs that end up being base types.

Here is an example of the struct minisock recreated from the debugging
information and reorganized using the algorithms in pahole to save space,
as generated by this tool. Go to the bottom, where you'll find struct
ctracer__mini_sock and the collector, which creates the mini struct from
a full sized object.

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_collector.struct.sock.c
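
The idea of a mini struct plus collector can be sketched by hand. Everything
in the block below is hypothetical (model_full, model_mini and
model_collect() are invented for illustration and are not ctracer output); it
only shows the shape: keep the base-type members, order them to avoid
padding, and copy them out of the full object.

```c
#include <stddef.h>

/* Hypothetical full-sized object: pointers mixed with base-type members. */
struct model_full {
	char flag;
	struct model_full *next;
	int state;
	void *owner;
	unsigned long bytes;
};

/* Hypothetical mini struct: base-type members only, largest first. */
struct model_mini {
	unsigned long bytes;
	int state;
	char flag;
};

/* Collector: copy only the base-type members out of the full object. */
void model_collect(struct model_mini *mini, const struct model_full *full)
{
	mini->bytes = full->bytes;
	mini->state = full->state;
	mini->flag = full->flag;
}
```

The mini struct is what would be written to the trace channel, so its size
directly affects the tracing overhead.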

And the systemtap module (the tcpprobe on steroids) automatically
generated:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/ctracer_methods.struct.sock.stp

This requires more work to:

. reduce the overhead
. filter out undesired functions, creating a project with the desired
  functions using some gui editor
. specify lists of fields to put on the internal state to be collected,
  again using a gui or plain ctracer-edit using vi, instead of getting
  just base types
. be able to say: collect just the fields on the second and fourth cacheline
. add collectors for complex objects such as spinlocks, socket lock, mutexes

But since people are wanting to work on tools to watch state
transitions, fields changing, etc, I thought I should dust off the ostra
experiments and the more recent dwarves ctracer work I'm doing in my
copious spare time 8)

In the callgrapher there is some more interesting stuff:

Interface to see where fields changed:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/changes.html

In this page clicking on a field name, such as:

http://oops.ghostprotocols.net:81/acme/dwarves/callgraphs/sock/0xf61bf500/sk_forward_alloc.png

You'll get graphs over time.

Code is in the dwarves repo at:

http://master.kernel.org/git/?p=linux/kernel/git/acme/pahole.git;a=summary

Thanks,

- Arnaldo


[PATCH][VLAN] Lost rtnl_unlock() in vlan_ioctl()

2007-12-06 Thread Pavel Emelyanov
The SET_VLAN_NAME_TYPE_CMD command w/o CAP_NET_ADMIN capability
doesn't release the rtnl lock.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 6567213..5b18315 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -776,7 +776,7 @@ static int vlan_ioctl_handler(struct net *net, void __user *arg)
 	case SET_VLAN_NAME_TYPE_CMD:
 		err = -EPERM;
 		if (!capable(CAP_NET_ADMIN))
-			return -EPERM;
+			break;
 		if ((args.u.name_type >= 0) &&
 		    (args.u.name_type < VLAN_NAME_TYPE_HIGHEST)) {
 			vlan_name_type = args.u.name_type;
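
The class of bug fixed here is easy to reproduce in a user-space model. The
sketch below assumes, as the description implies, that the handler takes the
lock before the switch and releases it at a common exit. All names
(model_lock_depth, model_ioctl_early_return(), model_ioctl_break(),
MODEL_SET_NAME_TYPE) are hypothetical, and a counter stands in for the rtnl
lock.

```c
#include <errno.h>
#include <stdbool.h>

int model_lock_depth;   /* > 0 means the model lock is still held */

static void model_lock(void)   { model_lock_depth++; }
static void model_unlock(void) { model_lock_depth--; }

#define MODEL_SET_NAME_TYPE 1   /* hypothetical command number */

/* Before: the permission check returns directly from inside the switch. */
int model_ioctl_early_return(int cmd, bool capable)
{
	int err = -EINVAL;

	model_lock();
	switch (cmd) {
	case MODEL_SET_NAME_TYPE:
		if (!capable)
			return -EPERM;  /* bug: leaves with the lock held */
		err = 0;
		break;
	}
	model_unlock();
	return err;
}

/* After: the permission check breaks out to the common unlock path. */
int model_ioctl_break(int cmd, bool capable)
{
	int err = -EINVAL;

	model_lock();
	switch (cmd) {
	case MODEL_SET_NAME_TYPE:
		err = -EPERM;
		if (!capable)
			break;          /* reaches model_unlock() below */
		err = 0;
		break;
	}
	model_unlock();
	return err;
}
```

Both variants return -EPERM to an unprivileged caller; only the second one
leaves the lock count balanced.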


Re: [PATCH 3/7] [TFRC]: Rename tfrc_tx_hist to tfrc_tx_hist_slab, for consistency

2007-12-06 Thread Gerrit Renker
| Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Signed-off-by: Gerrit Renker [EMAIL PROTECTED]


Re: [PATCH][VLAN] Lost rtnl_unlock() in vlan_ioctl()

2007-12-06 Thread Patrick McHardy

Pavel Emelyanov wrote:

The SET_VLAN_NAME_TYPE_CMD command w/o CAP_NET_ADMIN capability
doesn't release the rtnl lock.



Thanks Pavel. I somehow recall that we already fixed this
one, but can't find the patch :) Dave, please apply.



Re: [PATCH 4/7] [TFRC]: Make the rx history slab be global

2007-12-06 Thread Gerrit Renker
| Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Signed-off-by: Gerrit Renker [EMAIL PROTECTED]


Re: [PATCH 5/7] [TFRC]: Rename dccp_rx_ to tfrc_rx_

2007-12-06 Thread Gerrit Renker
| Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Signed-off-by: Gerrit Renker [EMAIL PROTECTED]


Re: [PATCH][VLAN] Lost rtnl_unlock() in vlan_ioctl()

2007-12-06 Thread David Miller
From: Patrick McHardy [EMAIL PROTECTED]
Date: Thu, 06 Dec 2007 14:59:24 +0100

 Pavel Emelyanov wrote:
  The SET_VLAN_NAME_TYPE_CMD command w/o CAP_NET_ADMIN capability
  doesn't release the rtnl lock.
 
 
 Thanks Pavel. I somehow recall that we already fixed this
 one, but can't find the patch :) Dave, please apply.

I think we even added this bug to -stable, or something like
that, didn't we?  Yikes...


Re: [RFC2][PATCH 7/7] [TFRC]: New rx history code

2007-12-06 Thread Gerrit Renker
|   The first six patches in this series are unmodified, so if you
| are OK with them please send me your Signed-off-by.
Patches [1/7], [2/7], and [6/7] already have a signed-off and there are
no changes. Just acknowledged [3..5/7], will look at [7/7] now.

Cheers
Gerrit


[patch 2/6] ipv6 - make xfrm6_init to return an error code

2007-12-06 Thread Daniel Lezcano
The xfrm initialization function does not return any error code, so
if there is an error, the caller cannot be informed of it.
This patch checks the return code of the different called functions
in order to report a successful or failed initialization.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 include/net/xfrm.h  |4 ++--
 net/ipv6/xfrm6_policy.c |   22 +-
 net/ipv6/xfrm6_state.c  |4 ++--
 3 files changed, 21 insertions(+), 9 deletions(-)

Index: net-2.6.25/include/net/xfrm.h
===
--- net-2.6.25.orig/include/net/xfrm.h
+++ net-2.6.25/include/net/xfrm.h
@@ -1066,11 +1066,11 @@ struct xfrm6_tunnel {
 
 extern void xfrm_init(void);
 extern void xfrm4_init(void);
-extern void xfrm6_init(void);
+extern int xfrm6_init(void);
 extern void xfrm6_fini(void);
 extern void xfrm_state_init(void);
 extern void xfrm4_state_init(void);
-extern void xfrm6_state_init(void);
+extern int xfrm6_state_init(void);
 extern void xfrm6_state_fini(void);
 
 extern int xfrm_state_walk(u8 proto, int (*func)(struct xfrm_state *, int, void*), void *);
Index: net-2.6.25/net/ipv6/xfrm6_policy.c
===
--- net-2.6.25.orig/net/ipv6/xfrm6_policy.c
+++ net-2.6.25/net/ipv6/xfrm6_policy.c
@@ -269,9 +269,9 @@ static struct xfrm_policy_afinfo xfrm6_p
 	.fill_dst =		xfrm6_fill_dst,
 };
 
-static void __init xfrm6_policy_init(void)
+static int __init xfrm6_policy_init(void)
 {
-	xfrm_policy_register_afinfo(&xfrm6_policy_afinfo);
+	return xfrm_policy_register_afinfo(&xfrm6_policy_afinfo);
 }
 
 static void xfrm6_policy_fini(void)
@@ -279,10 +279,22 @@ static void xfrm6_policy_fini(void)
 	xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo);
 }
 
-void __init xfrm6_init(void)
+int __init xfrm6_init(void)
 {
-	xfrm6_policy_init();
-	xfrm6_state_init();
+	int ret;
+
+	ret = xfrm6_policy_init();
+	if (ret)
+		goto out;
+
+	ret = xfrm6_state_init();
+	if (ret)
+		goto out_policy;
+out:
+	return ret;
+out_policy:
+	xfrm6_policy_fini();
+	goto out;
 }
 
 void xfrm6_fini(void)
 
 void xfrm6_fini(void)
Index: net-2.6.25/net/ipv6/xfrm6_state.c
===
--- net-2.6.25.orig/net/ipv6/xfrm6_state.c
+++ net-2.6.25/net/ipv6/xfrm6_state.c
@@ -198,9 +198,9 @@ static struct xfrm_state_afinfo xfrm6_st
 	.transport_finish	= xfrm6_transport_finish,
 };
 
-void __init xfrm6_state_init(void)
+int __init xfrm6_state_init(void)
 {
-	xfrm_state_register_afinfo(&xfrm6_state_afinfo);
+	return xfrm_state_register_afinfo(&xfrm6_state_afinfo);
 }
 
 void xfrm6_state_fini(void)
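
The control flow introduced in xfrm6_init() (initialize in order, undo the
earlier step if a later one fails) can be tried out in user space. The block
below is a sketch with hypothetical names (model_policy_init(),
model_state_init(), model_init() and the two flags); it mirrors the goto
structure of the patch but nothing else about xfrm.

```c
#include <errno.h>
#include <stdbool.h>

int model_policy_registered;
int model_state_registered;

/* Each step either registers itself or fails with -ENOMEM on request. */
static int model_policy_init(bool fail)
{
	if (fail)
		return -ENOMEM;
	model_policy_registered = 1;
	return 0;
}

static void model_policy_fini(void)
{
	model_policy_registered = 0;
}

static int model_state_init(bool fail)
{
	if (fail)
		return -ENOMEM;
	model_state_registered = 1;
	return 0;
}

/* Same shape as the patched init: later failures unwind earlier steps. */
int model_init(bool fail_policy, bool fail_state)
{
	int ret;

	ret = model_policy_init(fail_policy);
	if (ret)
		goto out;

	ret = model_state_init(fail_state);
	if (ret)
		goto out_policy;
out:
	return ret;
out_policy:
	model_policy_fini();
	goto out;
}
```

When the second step fails, the caller gets the error code and finds nothing
left registered, which is what lets a caller further up fail cleanly.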



[patch 0/6] ipv6 - ipv6 routing initialization

2007-12-06 Thread Daniel Lezcano
This patchset provides modifications around the routes initialization
for ipv6. Currently the init functions do not return an error code,
so the protocol cannot be notified that there was an error while
initializing the routing subsystems.

The patchset makes the init functions return an error code, so ipv6
can safely handle the error and fail gracefully.

The error code also makes it possible to catch a kmem_cache creation failure
without a radical panic. That allows the ipv6 module load to simply fail
without bringing down the machine.



[patch 3/6] ipv6 - make fib6_rules_init to return an error code

2007-12-06 Thread Daniel Lezcano
When the fib_rules initialization finishes, no return code is provided,
so the caller has no way to know whether the initialization
succeeded or failed. This patch fixes that.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 include/net/fib_rules.h |1 +
 include/net/ip6_fib.h   |2 +-
 net/core/fib_rules.c|5 +++--
 net/ipv6/fib6_rules.c   |   19 ---
 4 files changed, 21 insertions(+), 6 deletions(-)

Index: net-2.6.25/include/net/ip6_fib.h
===
--- net-2.6.25.orig/include/net/ip6_fib.h
+++ net-2.6.25/include/net/ip6_fib.h
@@ -226,7 +226,7 @@ extern void fib6_gc_cleanup(void);
 
 extern int fib6_init(void);
 
-extern void			fib6_rules_init(void);
+extern int			fib6_rules_init(void);
 extern void			fib6_rules_cleanup(void);
 
 #endif
Index: net-2.6.25/net/ipv6/fib6_rules.c
===
--- net-2.6.25.orig/net/ipv6/fib6_rules.c
+++ net-2.6.25/net/ipv6/fib6_rules.c
@@ -265,10 +265,23 @@ static int __init fib6_default_rules_ini
 	return 0;
 }
 
-void __init fib6_rules_init(void)
+int __init fib6_rules_init(void)
 {
-	BUG_ON(fib6_default_rules_init());
-	fib_rules_register(&fib6_rules_ops);
+	int ret;
+
+	ret = fib6_default_rules_init();
+	if (ret)
+		goto out;
+
+	ret = fib_rules_register(&fib6_rules_ops);
+	if (ret)
+		goto out_default_rules_init;
+out:
+	return ret;
+
+out_default_rules_init:
+	fib_rules_cleanup_ops(&fib6_rules_ops);
+	goto out;
 }
 
 void fib6_rules_cleanup(void)
Index: net-2.6.25/include/net/fib_rules.h
===
--- net-2.6.25.orig/include/net/fib_rules.h
+++ net-2.6.25/include/net/fib_rules.h
@@ -103,6 +103,7 @@ static inline u32 frh_get_table(struct f
 
 extern int fib_rules_register(struct fib_rules_ops *);
 extern int fib_rules_unregister(struct fib_rules_ops *);
+extern void fib_rules_cleanup_ops(struct fib_rules_ops *);
 
 extern int fib_rules_lookup(struct fib_rules_ops *,
 struct flowi *, int flags,
Index: net-2.6.25/net/core/fib_rules.c
===
--- net-2.6.25.orig/net/core/fib_rules.c
+++ net-2.6.25/net/core/fib_rules.c
@@ -102,7 +102,7 @@ errout:
 
 EXPORT_SYMBOL_GPL(fib_rules_register);
 
-static void cleanup_ops(struct fib_rules_ops *ops)
+void fib_rules_cleanup_ops(struct fib_rules_ops *ops)
 {
struct fib_rule *rule, *tmp;
 
@@ -111,6 +111,7 @@ static void cleanup_ops(struct fib_rules
fib_rule_put(rule);
}
 }
+EXPORT_SYMBOL_GPL(fib_rules_cleanup_ops);
 
 int fib_rules_unregister(struct fib_rules_ops *ops)
 {
@@ -121,7 +122,7 @@ int fib_rules_unregister(struct fib_rule
 	list_for_each_entry(o, &rules_ops, list) {
 		if (o == ops) {
 			list_del_rcu(&o->list);
-			cleanup_ops(ops);
+			fib_rules_cleanup_ops(ops);
 			goto out;
 		}
 	}



Re: [RFC2][PATCH 7/7] [TFRC]: New rx history code

2007-12-06 Thread Arnaldo Carvalho de Melo
Em Thu, Dec 06, 2007 at 02:02:25PM +, Gerrit Renker escreveu:
 | The first six patches in this series are unmodified, so if you
 | are OK with them please send me your Signed-off-by.
 Patches [1/7], [2/7], and [6/7] already have a signed-off and there are
 no changes. Just acknowledged [3..5/7], will look at [7/7] now.

OK, please let me know if there are still any problems.

The removal of timestamp insertion in ccid3_hc_rx_insert_options will be
put in another cset.

- Arnaldo


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread David Miller
From: Stefan Rompf [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 13:30:20 +0100

 IMHO this is what developers expect, and is also consistent with the
 fact that POSIX does not define O_NONBLOCK behaviour for local
 files.

You keep ignoring the fact that, as Herbert and I discussed, not
blocking for IPSEC resolution will make some connect() cases fail that
would otherwise not fail.

There are two sides to this issue, and we need to consider them
both.

Long term a resolution-packet-queue provides a solution that handles
both angles correctly, but we don't have that code yet.


[patch 5/6] ipv6 - make af_inet6 to check ip6_route_init return value

2007-12-06 Thread Daniel Lezcano
The af_inet6 initialization function does not check the return code
of the route initialization, so if something goes wrong, the protocol
initialization will continue anyway.
This patch takes into account the modifications made in the different
route initialization subroutines: it checks the return value and
makes the protocol initialization fail on error.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 net/ipv6/af_inet6.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: net-2.6.25/net/ipv6/af_inet6.c
===
--- net-2.6.25.orig/net/ipv6/af_inet6.c
+++ net-2.6.25/net/ipv6/af_inet6.c
@@ -849,7 +849,9 @@ static int __init inet6_init(void)
 	if (if6_proc_init())
 		goto proc_if6_fail;
 #endif
-	ip6_route_init();
+	err = ip6_route_init();
+	if (err)
+		goto ip6_route_fail;
 	ip6_flowlabel_init();
 	err = addrconf_init();
 	if (err)
@@ -874,6 +876,7 @@ out:
 addrconf_fail:
 	ip6_flowlabel_cleanup();
 	ip6_route_cleanup();
+ip6_route_fail:
 #ifdef CONFIG_PROC_FS
 	if6_proc_exit();
 proc_if6_fail:
@@ -904,6 +907,7 @@ icmp_fail:
 	cleanup_ipv6_mibs();
 out_unregister_sock:
 	sock_unregister(PF_INET6);
+	rtnl_unregister_all(PF_INET6);
 out_unregister_raw_proto:
 	proto_unregister(&rawv6_prot);
 out_unregister_udplite_proto:



[patch 4/6] ipv6 - make ip6_route_init to return an error code

2007-12-06 Thread Daniel Lezcano
The route initialization function does not return any value to indicate
whether the initialization succeeded. This patch checks all calls made
during the initialization in order to return a value to the caller.

Unfortunately, proc_net_fops_create will return a NULL pointer if CONFIG_PROC_FS
is off, so we cannot check the return code without an ifdef CONFIG_PROC_FS
block in the ip6_route_init function.
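
A minimal sketch of why the ifdef is needed, assuming a hypothetical
HAVE_PROC switch in place of CONFIG_PROC_FS and a made-up create_entry()
helper in place of proc_net_fops_create. With the feature compiled out, the
stub returns NULL although nothing failed, so an unguarded check would
report a bogus error.

```c
#include <errno.h>
#include <stddef.h>

struct entry { const char *name; };

#ifdef HAVE_PROC	/* hypothetical switch standing in for CONFIG_PROC_FS */
static struct entry route_entry;

struct entry *create_entry(const char *name)
{
	route_entry.name = name;
	return &route_entry;
}
#else
/* Feature compiled out: the stub returns NULL even though nothing
 * went wrong, so NULL cannot be treated as an error here. */
struct entry *create_entry(const char *name)
{
	(void)name;
	return NULL;
}
#endif

/* Unguarded check: fails spuriously when the feature is off. */
int route_init_unguarded(void)
{
	if (!create_entry("ipv6_route"))
		return -ENOMEM;
	return 0;
}

/* Guarded check, in the spirit of the patch. */
int route_init_guarded(void)
{
#ifdef HAVE_PROC
	if (!create_entry("ipv6_route"))
		return -ENOMEM;
#endif
	return 0;
}
```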

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 include/net/ip6_route.h |2 -
 net/ipv6/route.c|   66 +++-
 2 files changed, 55 insertions(+), 13 deletions(-)

Index: net-2.6.25/include/net/ip6_route.h
===
--- net-2.6.25.orig/include/net/ip6_route.h
+++ net-2.6.25/include/net/ip6_route.h
@@ -50,7 +50,7 @@ extern void			ip6_route_input(struct sk_
 extern struct dst_entry *	ip6_route_output(struct sock *sk,
 						 struct flowi *fl);
 
-extern void			ip6_route_init(void);
+extern int			ip6_route_init(void);
 extern void			ip6_route_cleanup(void);
 
 extern int			ipv6_route_ioctl(unsigned int cmd, void __user *arg);
Index: net-2.6.25/net/ipv6/route.c
===
--- net-2.6.25.orig/net/ipv6/route.c
+++ net-2.6.25/net/ipv6/route.c
@@ -2460,26 +2460,70 @@ ctl_table ipv6_route_table[] = {
 
 #endif
 
-void __init ip6_route_init(void)
+int __init ip6_route_init(void)
 {
+	int ret;
+
 	ip6_dst_ops.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
 				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
 	ip6_dst_blackhole_ops.kmem_cachep = ip6_dst_ops.kmem_cachep;
 
-	fib6_init();
-	proc_net_fops_create(&init_net, "ipv6_route", 0, &ipv6_route_proc_fops);
-	proc_net_fops_create(&init_net, "rt6_stats", S_IRUGO, &rt6_stats_seq_fops);
+	ret = fib6_init();
+	if (ret)
+		goto out_kmem_cache;
+
+#ifdef CONFIG_PROC_FS
+	ret = -ENOMEM;
+	if (!proc_net_fops_create(&init_net, "ipv6_route",
+				  0, &ipv6_route_proc_fops))
+		goto out_fib6_init;
+
+	if (!proc_net_fops_create(&init_net, "rt6_stats",
+				  S_IRUGO, &rt6_stats_seq_fops))
+		goto out_proc_ipv6_route;
+#endif
+
 #ifdef CONFIG_XFRM
-	xfrm6_init();
+	ret = xfrm6_init();
+	if (ret)
+		goto out_proc_rt6_stats;
 #endif
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
-	fib6_rules_init();
-#endif
+	ret = fib6_rules_init();
+	if (ret)
+		goto xfrm6_init;
+#endif
+	ret = -ENOBUFS;
+	if (__rtnl_register(PF_INET6, RTM_NEWROUTE, inet6_rtm_newroute, NULL) ||
+	    __rtnl_register(PF_INET6, RTM_DELROUTE, inet6_rtm_delroute, NULL) ||
+	    __rtnl_register(PF_INET6, RTM_GETROUTE, inet6_rtm_getroute, NULL))
+		goto fib6_rules_init;
+
+	ret = 0;
+out:
+	return ret;
 
-	__rtnl_register(PF_INET6, RTM_NEWROUTE, inet6_rtm_newroute, NULL);
-	__rtnl_register(PF_INET6, RTM_DELROUTE, inet6_rtm_delroute, NULL);
-	__rtnl_register(PF_INET6, RTM_GETROUTE, inet6_rtm_getroute, NULL);
+fib6_rules_init:
+#ifdef CONFIG_IPV6_MULTIPLE_TABLES
+	fib6_rules_cleanup();
+xfrm6_init:
+#endif
+#ifdef CONFIG_XFRM
+	xfrm6_fini();
+out_proc_rt6_stats:
+#endif
+#ifdef CONFIG_PROC_FS
+	proc_net_remove(&init_net, "rt6_stats");
+out_proc_ipv6_route:
+	proc_net_remove(&init_net, "ipv6_route");
+out_fib6_init:
+#endif
+	rt6_ifdown(NULL);
+	fib6_gc_cleanup();
+out_kmem_cache:
+	kmem_cache_destroy(ip6_dst_ops.kmem_cachep);
+	goto out;
 }
 
 void ip6_route_cleanup(void)
@@ -2487,10 +2531,8 @@ void ip6_route_cleanup(void)
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	fib6_rules_cleanup();
 #endif
-#ifdef CONFIG_PROC_FS
 	proc_net_remove(&init_net, "ipv6_route");
 	proc_net_remove(&init_net, "rt6_stats");
-#endif
 #ifdef CONFIG_XFRM
 	xfrm6_fini();
 #endif



[patch 6/6] ipv6 - route6/fib6 : dont panic a kmem_cache_create

2007-12-06 Thread Daniel Lezcano
If kmem_cache_create fails, the kernel will panic. That is acceptable
while the system is booting, but if the ipv6 protocol is compiled as a module
and loaded after the system has booted, do we want to panic instead
of just failing to initialize the protocol?

The init function now returns an error, and that error is checked during
protocol initialization, so the ipv6 protocol will fail safely.
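
The difference between panicking and returning an error can be sketched in
userspace as follows. This is an analogy only: cache_create, CACHE_PANIC and
simulate_oom are hypothetical names, with CACHE_PANIC playing the role of
SLAB_PANIC and malloc standing in for the slab allocator.

```c
#include <errno.h>
#include <stdlib.h>

#define CACHE_PANIC 0x1	/* hypothetical flag, playing the role of SLAB_PANIC */

static int simulate_oom;	/* test knob, not a real allocator feature */
static void *node_cache;

/* Allocate a "cache"; with CACHE_PANIC a failure aborts the program,
 * without it the caller gets NULL and must handle it. */
static void *cache_create(size_t size, unsigned int flags)
{
	void *p = simulate_oom ? NULL : malloc(size);

	if (!p && (flags & CACHE_PANIC))
		abort();
	return p;
}

/* Without the panic flag, the failure turns into an error code that
 * the caller can propagate. */
int fib_init(void)
{
	node_cache = cache_create(64, 0);
	if (!node_cache)
		return -ENOMEM;
	return 0;
}
```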

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 net/ipv6/ip6_fib.c |5 -
 net/ipv6/route.c   |5 -
 2 files changed, 8 insertions(+), 2 deletions(-)

Index: net-2.6.25/net/ipv6/ip6_fib.c
===
--- net-2.6.25.orig/net/ipv6/ip6_fib.c
+++ net-2.6.25/net/ipv6/ip6_fib.c
@@ -1478,8 +1478,11 @@ int __init fib6_init(void)
 	int ret;
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
 					   sizeof(struct fib6_node),
-					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+					   0, SLAB_HWCACHE_ALIGN,
 					   NULL);
+	if (!fib6_node_kmem)
+		return -ENOMEM;
+
 	fib6_tables_init();
 
 	ret = __rtnl_register(PF_INET6, RTM_GETROUTE, NULL, inet6_dump_fib);
Index: net-2.6.25/net/ipv6/route.c
===
--- net-2.6.25.orig/net/ipv6/route.c
+++ net-2.6.25/net/ipv6/route.c
@@ -2466,7 +2466,10 @@ int __init ip6_route_init(void)
 
 	ip6_dst_ops.kmem_cachep =
 		kmem_cache_create("ip6_dst_cache", sizeof(struct rt6_info), 0,
-				  SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+				  SLAB_HWCACHE_ALIGN, NULL);
+	if (!ip6_dst_ops.kmem_cachep)
+		return -ENOMEM;
+
 	ip6_dst_blackhole_ops.kmem_cachep = ip6_dst_ops.kmem_cachep;
 
 	ret = fib6_init();



[patch 1/6] ipv6 - make fib6_init to return an error code

2007-12-06 Thread Daniel Lezcano
If there is an error in the initialization function, nothing is reported
back to the caller. So I add a return value for the init function to set.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
Acked-by: Benjamin Thery [EMAIL PROTECTED]
---
 include/net/ip6_fib.h |2 +-
 net/ipv6/ip6_fib.c|   14 +++---
 2 files changed, 12 insertions(+), 4 deletions(-)

Index: net-2.6.25/include/net/ip6_fib.h
===
--- net-2.6.25.orig/include/net/ip6_fib.h
+++ net-2.6.25/include/net/ip6_fib.h
@@ -224,7 +224,7 @@ extern void fib6_run_gc(unsigned long 
 
 extern void			fib6_gc_cleanup(void);
 
-extern void			fib6_init(void);
+extern int			fib6_init(void);
 
 extern void			fib6_rules_init(void);
 extern void			fib6_rules_cleanup(void);
Index: net-2.6.25/net/ipv6/ip6_fib.c
===
--- net-2.6.25.orig/net/ipv6/ip6_fib.c
+++ net-2.6.25/net/ipv6/ip6_fib.c
@@ -1473,16 +1473,24 @@ void fib6_run_gc(unsigned long dummy)
 	spin_unlock_bh(&fib6_gc_lock);
 }
 
-void __init fib6_init(void)
+int __init fib6_init(void)
 {
+	int ret;
 	fib6_node_kmem = kmem_cache_create("fib6_nodes",
 					   sizeof(struct fib6_node),
 					   0, SLAB_HWCACHE_ALIGN|SLAB_PANIC,
 					   NULL);
-
 	fib6_tables_init();
 
-	__rtnl_register(PF_INET6, RTM_GETROUTE, NULL, inet6_dump_fib);
+	ret = __rtnl_register(PF_INET6, RTM_GETROUTE, NULL, inet6_dump_fib);
+	if (ret)
+		goto out_kmem_cache_create;
+out:
+	return ret;
+
+out_kmem_cache_create:
+	kmem_cache_destroy(fib6_node_kmem);
+	goto out;
 }
 
 void fib6_gc_cleanup(void)



Re: [PATCH][VLAN] Lost rtnl_unlock() in vlan_ioctl()

2007-12-06 Thread Patrick McHardy

David Miller wrote:
 From: Patrick McHardy [EMAIL PROTECTED]
 Date: Thu, 06 Dec 2007 14:59:24 +0100

  Pavel Emelyanov wrote:
   The SET_VLAN_NAME_TYPE_CMD command w/o CAP_NET_ADMIN capability
   doesn't release the rtnl lock.

  Thanks Pavel. I somehow recall that we already fixed this
  one, but can't find the patch :) Dave, please apply.

 I think we even added this bug to -stable, or something like
 that, didn't we?  Yikes...

No, I mixed those two patches up as well. The bug was introduced
with the vlan_netlink stuff, the -stable patch fixed an invalid
return value, but still properly dropped the lock.

This patch should of course go in -stable anyway.




Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread Stefan Rompf
On Thursday, 6 December 2007 14:55, David Miller wrote:

 You keep ignoring the fact that, as Herbert and I discussed, not
 blocking for IPSEC resolution will make some connect() cases fail that
 would otherwise not fail.

 There are two sides to this issue, and we need to consider them
 both.

As far as I've understood Herbert's patch, at least TCP connect can be fixed 
so that a non-blocking connect() will neither fail nor block, but just use the 
first or second retransmission of the SYN packet to complete the handshake 
after IPSEC is up. As this will fix the common breakage case, just do so and 
keep UDP sendmsg() etc. for later.

You are looking at this issue too much from the kernel side. Admittedly, this 
is a corner case, but for that very reason nobody cares if connection 
completion takes two SYNs and three seconds instead of one SYN and maybe two 
seconds. But application developers and users will validly complain if their 
applications block unexpectedly for hours just because some random provider 
has a network outage and IPSEC cannot come up.
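
For reference, this is roughly what an application expects from a
non-blocking connect(), sketched with plain POSIX calls. It illustrates the
application side only, not the kernel change under discussion; the helper
names start_connect and finish_connect are made up for this example.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start a non-blocking connect; returns the socket fd or -1.
 * connect() returns at once: 0 if already connected, or -1 with
 * errno == EINPROGRESS while the handshake continues in the background. */
int start_connect(const struct sockaddr_in *addr)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int fl;

	if (fd < 0)
		return -1;
	fl = fcntl(fd, F_GETFL, 0);
	if (fl < 0 || fcntl(fd, F_SETFL, fl | O_NONBLOCK) < 0) {
		close(fd);
		return -1;
	}
	if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0 &&
	    errno != EINPROGRESS) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Wait up to timeout_ms for the handshake to finish.
 * Returns 0 on success, -1 on failure or timeout. */
int finish_connect(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLOUT };
	int err = 0;
	socklen_t len = sizeof(err);

	if (poll(&pfd, 1, timeout_ms) != 1)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
		return -1;
	return 0;
}
```

The application stays responsible for the timeout: it decides how long to
poll, which is exactly the control that is lost when connect() itself blocks.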

Stefan


Re: [Patch] net/xfrm/xfrm_policy.c: Some small improvements

2007-12-06 Thread Richard Knutsson

David Miller wrote:
 From: WANG Cong [EMAIL PROTECTED]
 Date: Thu, 6 Dec 2007 19:01:23 +0800

  This patch contains the following changes.

  - Use 'bool' instead of 'int' for booleans.
  - Use 'size_t' instead of 'int' for 'sizeof' return value.
  - Some style fixes.

  Cc: Herbert Xu [EMAIL PROTECTED]
  Cc: David Miller [EMAIL PROTECTED]
  Signed-off-by: WANG Cong [EMAIL PROTECTED]

 Normally I would let a patch like this sit in my mailbox
 for a week and then delete it.

That is evil! ;)

 But this time I'll just let you know up front that I
 don't see much value in this patch.  It is not a clear
 improvement to replace int's with bool's in my mind and
 the other changes are just whitespace changes.

Is it not an improvement to distinguish booleans from actual values? Do you 
use integers for ASCII characters too? It can also avoid some potential 
bugs like the 'if (i == TRUE)'...
What is wrong with 'size_t' (since it is unsigned, compared to (some) 
'int')?
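
A small sketch of the 'if (i == TRUE)' pitfall, assuming a hand-rolled
TRUE defined as 1 (the macro and both helper functions are illustrative
only, not taken from the patch):

```c
#include <stdbool.h>

#define TRUE 1	/* a typical hand-rolled definition, assumed here */

/* With an int "boolean", any non-zero value is truthy, yet only 1
 * compares equal to TRUE. */
int int_flag_equals_true(int flag)
{
	return flag == TRUE;
}

/* Conversion to bool normalizes every non-zero value to 1 (C99),
 * so the same comparison behaves as intended. */
int bool_flag_equals_true(int value)
{
	bool flag = value;

	return flag == true;
}
```

Passing 2 to the first helper yields 0 even though 2 is truthy; the second
helper yields 1 for the same input.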


/Richard Knutsson



[PATCH 2.6.25] Remove ip_fib_local_table and ip_fib_main_table defines

2007-12-06 Thread Denis V. Lunev
From: Eric W. Biederman [EMAIL PROTECTED]

There are only 2 users and it doesn't hurt to call fib_get_table
instead, and it makes it easier to make the fib network namespace
aware.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 include/net/ip_fib.h |3 ---
 net/ipv4/fib_hash.c  |5 +++--
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index ed514bf..690fb4d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -191,9 +191,6 @@ extern void __init fib4_rules_init(void);
 extern u32 fib_rules_tclass(struct fib_result *res);
 #endif
 
-#define ip_fib_local_table fib_get_table(RT_TABLE_LOCAL)
-#define ip_fib_main_table fib_get_table(RT_TABLE_MAIN)
-
 extern int fib_lookup(struct flowi *flp, struct fib_result *res);
 
 extern struct fib_table *fib_new_table(u32 id);
diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 9d0cee2..30ff657 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -810,7 +810,8 @@ struct fib_iter_state {
 static struct fib_alias *fib_get_first(struct seq_file *seq)
 {
 	struct fib_iter_state *iter = seq->private;
-	struct fn_hash *table = (struct fn_hash *) ip_fib_main_table->tb_data;
+	struct fib_table *main_table = fib_get_table(RT_TABLE_MAIN);
+	struct fn_hash *table = (struct fn_hash *)main_table->tb_data;
 
 	iter->bucket    = 0;
 	iter->hash_head = NULL;
@@ -949,7 +950,7 @@ static void *fib_seq_start(struct seq_file *seq, loff_t *pos)
 	void *v = NULL;
 
 	read_lock(&fib_hash_lock);
-	if (ip_fib_main_table)
+	if (fib_get_table(RT_TABLE_MAIN))
 		v = *pos ? fib_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
 	return v;
 }
-- 
1.5.3.rc5



[PATCH 2.6.25] net: move trie_local and trie_main into the proc iterator

2007-12-06 Thread Denis V. Lunev
From: Eric W. Biederman [EMAIL PROTECTED]

We only use these variables when displaying the trie in proc so
place them into the iterator to make this explicit.  We should
probably do something smarter to handle the CONFIG_IP_MULTIPLE_TABLES
case but at least this makes it clear that the silliness is limited
to the display in /proc.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
 net/ipv4/fib_trie.c |   47 ++-
 1 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 8d8c291..6385cca 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,7 +164,6 @@ static struct tnode *halve(struct trie *t, struct tnode *tn);
 static void tnode_free(struct tnode *tn);
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
-static struct trie *trie_local = NULL, *trie_main = NULL;
 
 static inline struct tnode *node_parent(struct node *node)
 {
@@ -2000,11 +1999,6 @@ struct fib_table * __init fib_hash_init(u32 id)
 	trie_init(t);
 
 	if (id == RT_TABLE_LOCAL)
-		trie_local = t;
-	else if (id == RT_TABLE_MAIN)
-		trie_main = t;
-
-	if (id == RT_TABLE_LOCAL)
 		printk(KERN_INFO "IPv4 FIB: Using LC-trie version %s\n", VERSION);
 
 	return tb;
@@ -2013,6 +2007,7 @@ struct fib_table * __init fib_hash_init(u32 id)
 #ifdef CONFIG_PROC_FS
 /* Depth first Trie walk iterator */
 struct fib_trie_iter {
+	struct trie *trie_local, *trie_main;
 	struct tnode *tnode;
 	struct trie *trie;
 	unsigned index;
@@ -2179,7 +2174,20 @@ static void trie_show_stats(struct seq_file *seq, struct trie_stat *stat)
 
 static int fib_triestat_seq_show(struct seq_file *seq, void *v)
 {
+	struct trie *trie_local, *trie_main;
 	struct trie_stat *stat;
+	struct fib_table *tb;
+
+	trie_local = NULL;
+	tb = fib_get_table(RT_TABLE_LOCAL);
+	if (tb)
+		trie_local = (struct trie *) tb->tb_data;
+
+	trie_main = NULL;
+	tb = fib_get_table(RT_TABLE_MAIN);
+	if (tb)
+		trie_main = (struct trie *) tb->tb_data;
+
 
 	stat = kmalloc(sizeof(*stat), GFP_KERNEL);
 	if (!stat)
@@ -2223,13 +2231,13 @@ static struct node *fib_trie_get_idx(struct fib_trie_iter *iter,
 	loff_t idx = 0;
 	struct node *n;
 
-	for (n = fib_trie_get_first(iter, trie_local);
+	for (n = fib_trie_get_first(iter, iter->trie_local);
 	     n; ++idx, n = fib_trie_get_next(iter)) {
 		if (pos == idx)
 			return n;
 	}
 
-	for (n = fib_trie_get_first(iter, trie_main);
+	for (n = fib_trie_get_first(iter, iter->trie_main);
 	     n; ++idx, n = fib_trie_get_next(iter)) {
 		if (pos == idx)
 			return n;
@@ -2239,10 +2247,23 @@ static struct node *fib_trie_get_idx(struct fib_trie_iter *iter,
 
 static void *fib_trie_seq_start(struct seq_file *seq, loff_t *pos)
 {
+	struct fib_trie_iter *iter = seq->private;
+	struct fib_table *tb;
+
+	if (!iter->trie_local) {
+		tb = fib_get_table(RT_TABLE_LOCAL);
+		if (tb)
+			iter->trie_local = (struct trie *) tb->tb_data;
+	}
+	if (!iter->trie_main) {
+		tb = fib_get_table(RT_TABLE_MAIN);
+		if (tb)
+			iter->trie_main = (struct trie *) tb->tb_data;
+	}
 	rcu_read_lock();
 	if (*pos == 0)
 		return SEQ_START_TOKEN;
-	return fib_trie_get_idx(seq->private, *pos - 1);
+	return fib_trie_get_idx(iter, *pos - 1);
 }
 
 static void *fib_trie_seq_next(struct seq_file *seq, void *v, loff_t *pos)
@@ -2260,8 +2281,8 @@ static void *fib_trie_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 		return v;
 
 	/* continue scan in next trie */
-	if (iter->trie == trie_local)
-		return fib_trie_get_first(iter, trie_main);
+	if (iter->trie == iter->trie_local)
+		return fib_trie_get_first(iter, iter->trie_main);
 
 	return NULL;
 }
@@ -2327,7 +2348,7 @@ static int fib_trie_seq_show(struct seq_file *seq, void *v)
 		return 0;
 
 	if (!node_parent(n)) {
-		if (iter->trie == trie_local)
+		if (iter->trie == iter->trie_local)
 			seq_puts(seq, "<local>:\n");
 		else
 			seq_puts(seq, "<main>:\n");
@@ -2426,7 +2447,7 @@ static int fib_route_seq_show(struct seq_file *seq, void *v)
 		return 0;
 	}
 
-	if (iter->trie == trie_local)
+	if (iter->trie == iter->trie_local)
 		return 0;
 	if (IS_TNODE(l))
 		return 0;
-- 
1.5.3.rc5


[PATCH] virtio_net: Fix stalled inbound traffic on early packets

2007-12-06 Thread Christian Borntraeger
The current virtio_net driver has a startup race which prevents any
incoming traffic:

If try_fill_recv submits buffers to the host system, data might be
filled in and an interrupt sent before napi_enable finishes.
In that case the interrupt will kick skb_recv_done, which will then
call netif_rx_schedule. netif_rx_schedule checks whether NAPI_STATE_SCHED
is set, which it is not, as we did not run napi_enable. No poll routine
is scheduled. Furthermore, skb_recv_done returns false, which disables
interrupts for this device.

One solution is to enable NAPI before inbound buffers are available.
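
The ordering problem can be modelled in a few lines of single-threaded C.
This is a toy model under stated assumptions, not the NAPI API: 'enabled'
stands for the state set up by napi_enable, 'scheduled' for a pending poll,
and 'irq_on' for the device's receive interrupt; all names are hypothetical.

```c
#include <stdbool.h>

struct model {
	bool enabled;	/* polling has been enabled */
	bool scheduled;	/* a poll run is pending */
	bool irq_on;	/* receive interrupt is armed */
};

/* Receive interrupt: try to schedule the poll routine.  If polling is
 * not enabled yet, nothing is scheduled, and the interrupt is left
 * disabled either way (only a poll run would re-arm it). */
static void rx_interrupt(struct model *m)
{
	if (m->enabled)
		m->scheduled = true;
	m->irq_on = false;
}

/* Hand buffers to the host; a packet may arrive immediately. */
static void fill_recv(struct model *m, bool packet_arrives_early)
{
	if (packet_arrives_early && m->irq_on)
		rx_interrupt(m);
}

/* Old order: fill buffers first, enable polling afterwards.
 * Returns whether the receive path can still make progress. */
bool open_fill_first(struct model *m, bool early)
{
	fill_recv(m, early);
	m->enabled = true;
	return m->scheduled || m->irq_on;
}

/* Patched order: enable polling before buffers become available. */
bool open_enable_first(struct model *m, bool early)
{
	m->enabled = true;
	fill_recv(m, early);
	return m->scheduled || m->irq_on;
}
```

With an early packet, the old order ends with no poll pending and the
interrupt off, which is the stall described above; the patched order leaves
a poll scheduled.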

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
---
 drivers/net/virtio_net.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: kvm/drivers/net/virtio_net.c
===
--- kvm.orig/drivers/net/virtio_net.c
+++ kvm/drivers/net/virtio_net.c
@@ -285,13 +285,15 @@ static int virtnet_open(struct net_devic
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 
+	napi_enable(&vi->napi);
 	try_fill_recv(vi);
 
 	/* If we didn't even get one input buffer, we're useless. */
-	if (vi->num == 0)
+	if (vi->num == 0) {
+		napi_disable(&vi->napi);
 		return -ENOMEM;
+	}
 
-	napi_enable(&vi->napi);
 	return 0;
 }
 


[PATCH 13/20] net/core/dev.c: use LIST_HEAD instead of LIST_HEAD_INIT

2007-12-06 Thread Denis Cheng
A single list_head variable initialized with LIST_HEAD_INIT can almost
always be replaced with a LIST_HEAD declaration; this shrinks the code
and looks better.
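
For readers without the kernel tree at hand, the equivalence can be checked
in userspace. The two macros below mirror the definitions in
include/linux/list.h; the variable names and the list_is_empty helper are
made up for this sketch.

```c
struct list_head {
	struct list_head *next, *prev;
};

/* Userspace copies of the kernel macros. */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

/* Long form, as removed by the patch. */
static struct list_head todo_long = LIST_HEAD_INIT(todo_long);
/* Short form, as introduced by the patch. */
static LIST_HEAD(todo_short);

/* An empty list head points at itself in both directions. */
int list_is_empty(const struct list_head *head)
{
	return head->next == head && head->prev == head;
}
```

Both declarations expand to the same initializer, so the change is purely
cosmetic at the source level.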

Signed-off-by: Denis Cheng [EMAIL PROTECTED]
---
 net/core/dev.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 86d6261..7626db4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3501,7 +3501,7 @@ static int dev_new_index(struct net *net)
 
 /* Delayed registration/unregisteration */
 static DEFINE_SPINLOCK(net_todo_list_lock);
-static struct list_head net_todo_list = LIST_HEAD_INIT(net_todo_list);
+static LIST_HEAD(net_todo_list);
 
 static void net_set_todo(struct net_device *dev)
 {
-- 
1.5.3.4



[PATCH 14/20] net/ipv4/cipso_ipv4.c: use LIST_HEAD instead of LIST_HEAD_INIT

2007-12-06 Thread Denis Cheng
A single list_head variable initialized with LIST_HEAD_INIT can almost
always be replaced with a LIST_HEAD declaration; this shrinks the code
and looks better.

Signed-off-by: Denis Cheng [EMAIL PROTECTED]
---
 net/ipv4/cipso_ipv4.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index f18e88b..d4dc4eb 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -63,7 +63,7 @@ struct cipso_v4_domhsh_entry {
  * probably be turned into a hash table or something similar so we
  * can do quick lookups. */
 static DEFINE_SPINLOCK(cipso_v4_doi_list_lock);
-static struct list_head cipso_v4_doi_list = LIST_HEAD_INIT(cipso_v4_doi_list);
+static LIST_HEAD(cipso_v4_doi_list);
 
 /* Label mapping cache */
 int cipso_v4_cache_enabled = 1;
-- 
1.5.3.4



Re: [PATCH] sky2: RX lockup fix

2007-12-06 Thread Peter Tyser
 I have ways to generate errors, so I'll check

Thanks Stephen.  We didn't spend a lot of time characterizing the issue,
but our test setup had two blades, each with an 88E8062.  Our test
software pumped UDP and TCP traffic of varying packet sizes between the
blades in both directions (including  jumbo frames - we increased the
MTU of the interfaces to 9000).  The issue could generally be brought
out in about 15 minutes and almost always within an hour.

If you'd like any additional details on the test setup or would like me
to try something on my end, let me know.





ucc_geth 10 Mbit/s locks up CPU even though NAPI is enabled

2007-12-06 Thread Joakim Tjernlund
Injecting a 10 MBit/s stream of 64-byte packets locks up my
MPC832x CPU even though I have NAPI enabled. Kernel 2.6.23

Any ideas?

 Jocke


Re: TCP event tracking via netlink...

2007-12-06 Thread Stephen Hemminger
On Thu, 06 Dec 2007 02:33:46 -0800 (PST)
David Miller [EMAIL PROTECTED] wrote:

 From: Ilpo Järvinen [EMAIL PROTECTED]
 Date: Thu, 6 Dec 2007 01:18:28 +0200 (EET)
 
  On Wed, 5 Dec 2007, David Miller wrote:
  
   I assume you're using something like carefully crafted printk's,
   kprobes, or even ad-hoc statistic counters.  That's what I used to do
   :-)
  
  No, that's not at all what I do :-). I usually look at time-seq graphs 
  except for the cases when I just find things out by reading code (or
  by just thinking of it).
 
 Can you briefly detail what graph tools and command lines
 you are using?
 
 The last time I did graphing to analyze things, the tools
 were hit-or-miss.
 
  Much of the info is available in tcpdump already, it's just hard to read 
  without graphing it first because there are so many overlapping things 
  to track in two-dimensional space.
  
  ...But yes, I have to admit that couple of problems come to my mind
  where having some variable from tcp_sock would have made the problem
  more obvious.
 
 The most important are the cwnd and ssthresh, which you could guess
 using graphs but it is important to know on a packet to packet
 basis why we might have sent a packet or not because this has
 rippling effects down the rest of the RTT.
 
  Not sure what is the benefit of having distributions with it because 
  those people hardly report problems anyway to here, they're just too 
  happy with TCP performance unless we print something to their logs,
  which implies that we must setup a *_ON() condition :-(.
 
 That may be true, but if we could integrate the information with
 tcpdumps, we could gather internal state using tools the user
 already has available.
 
 Imagine if tcpdump printed out:
 
 02:26:14.865805 IP $SRC  $DEST: . 11226:12686(1460) ack 0 win 108
   ss_thresh: 129 cwnd: 133 packets_out: 132
 
 or something like that.
 
  Some problems are simply such that things cannot be accurately verified 
  without high processing overhead until it's far too late (eg skb bits vs 
  *_out counters). Maybe we should start to build an expensive state 
  validator as well which would automatically check invariants of the write 
  queue and tcp_sock in a straightforward, unoptimized manner? That would 
  definitely do a lot of work for us, just ask people to turn it on and it 
  spits out everything that went wrong :-) (unless they really depend on 
  very high-speed things and are therefore unhappy if we scan thousands of 
  packets unnecessarily per ACK :-)). ...Early enough! ...That would work 
  also for distros but there's always human judgement needed to decide 
  whether the bug reporter will be happy when his TCP processing does no 
  longer scale ;-).
 
 I think it's useful as a TCP_DEBUG config option or similar, sure.
 
 But sometimes the algorithms are working as designed, it's just that
 they provide poor pipe utilization and CWND analysis embedded inside
 of a tcpdump would be one way to see that as well as determine the
 flaw in the algorithm.
 
  ...Hopefully you found any of my comments useful.
 
 Very much so, thanks.
 
 I put together a sample implementation anyways just to show the idea,
 against net-2.6.25 below.
 
 It is untested since I didn't write the userland app yet to see that
 proper things get logged.  Basically you could run a daemon that
 writes per-connection traces into files based upon the incoming
 netlink events.  Later, using the binary pcap file and these traces,
 you can piece together traces like the above using the timestamps
 etc. to match up pcap packets to ones from the TCP logger.
 
 The userland tools could do analysis and print pre-cooked state diff
 logs, like "this ACK raised CWND by one" or whatever else you wanted
 to know.
 
 It's nice that an expert like you can look at graphs and understand,
 but we'd like to create more experts and besides reading code one
 way to become an expert is to be able to extract live real data
 from the kernel's working state and try to understand how things
 got that way.  This information is permanently lost currently.


Tools and scripts for testing that generate graphs are at:
git://git.kernel.org/pub/scm/tcptest/tcptest
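
The timestamp matching David describes in the quoted text (pairing each
pcap packet with the TCP state logged around the same time) could look
roughly like this. The record layouts and the match_event name are
hypothetical, since the thread does not specify the logger's format.

```c
#include <stddef.h>

/* Hypothetical record layouts; the real logger format is not given. */
struct tcp_event {
	double ts;		/* time the state was logged */
	unsigned int cwnd;
	unsigned int ssthresh;
	unsigned int packets_out;
};

struct pkt {
	double ts;		/* capture timestamp from the pcap file */
};

/* For one captured packet, return the index of the latest logged
 * event whose timestamp is not after the packet's, or -1 if there is
 * none.  Events must be sorted by ascending timestamp. */
int match_event(const struct tcp_event *ev, size_t n, const struct pkt *p)
{
	int best = -1;

	for (size_t i = 0; i < n && ev[i].ts <= p->ts; i++)
		best = (int)i;
	return best;
}
```

A tool could then print the matched cwnd, ssthresh and packets_out values
under each tcpdump line, as in the sample output quoted above.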


[SCTP] Bug fixes to the migrate/accept code path.

2007-12-06 Thread Vlad Yasevich
Hi Dave

The following two patches fix some bugs in the SCTP accept code path.
The first one fixes a slab corruption bug that we found during stress
testing.  The second one is just a clean-up and the right way to do things.

You can also pull both from:
  master.kernel.org:/pub/scm/linux/kernel/git/lksctp-dev.git pending

Thanks
-vlad


[PATCH] SCTP: Fix the bind_addr info during migration.

2007-12-06 Thread Vlad Yasevich
During accept/migrate the code attempts to copy the addresses from
the parent endpoint to the new endpoint.   However, if the parent
was bound to a wildcard address, then we end up pointlessly copying
all of the current addresses on the system.

Signed-off-by: Vlad Yasevich [EMAIL PROTECTED]
---
 include/net/sctp/structs.h |3 +++
 net/sctp/bind_addr.c   |   26 ++
 net/sctp/socket.c  |   12 ++--
 3 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index eb3113c..002a00a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1184,6 +1184,9 @@ int sctp_bind_addr_copy(struct sctp_bind_addr *dest,
const struct sctp_bind_addr *src,
sctp_scope_t scope, gfp_t gfp,
int flags);
+int sctp_bind_addr_dup(struct sctp_bind_addr *dest,
+   const struct sctp_bind_addr *src,
+   gfp_t gfp);
 int sctp_add_bind_addr(struct sctp_bind_addr *, union sctp_addr *,
   __u8 use_as_src, gfp_t gfp);
 int sctp_del_bind_addr(struct sctp_bind_addr *, union sctp_addr *);
diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index cae95af..6a7d010 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -105,6 +105,32 @@ out:
return error;
 }
 
+/* Exactly duplicate the address lists.  This is necessary when doing
+ * peel-offs and accepts.  We don't want to put all the current system
+ * addresses into the endpoint.  That's useless.  But we do want to
+ * duplicate the list of bound addresses that the older endpoint used.
+ */
+int sctp_bind_addr_dup(struct sctp_bind_addr *dest,
+			const struct sctp_bind_addr *src,
+			gfp_t gfp)
+{
+	struct sctp_sockaddr_entry *addr;
+	struct list_head *pos;
+	int error = 0;
+
+	/* All addresses share the same port.  */
+	dest->port = src->port;
+
+	list_for_each(pos, &src->address_list) {
+		addr = list_entry(pos, struct sctp_sockaddr_entry, list);
+		error = sctp_add_bind_addr(dest, &addr->a, 1, gfp);
+		if (error < 0)
+			break;
+	}
+
+	return error;
+}
+
 /* Initialize the SCTP_bind_addr structure for either an endpoint or
  * an association.
  */
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 9f5d793..ea9649c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6326,7 +6326,6 @@ static void sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
 	struct sk_buff *skb, *tmp;
 	struct sctp_ulpevent *event;
 	struct sctp_bind_hashbucket *head;
-	int flags = 0;
 
 	/* Migrate socket buffer sizes and all the socket level options to the
 	 * new socket.
@@ -6356,15 +6355,8 @@ static void sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
 	/* Copy the bind_addr list from the original endpoint to the new
 	 * endpoint so that we can handle restarts properly
 	 */
-	if (PF_INET6 == assoc->base.sk->sk_family)
-		flags = SCTP_ADDR6_ALLOWED;
-	if (assoc->peer.ipv4_address)
-		flags |= SCTP_ADDR4_PEERSUPP;
-	if (assoc->peer.ipv6_address)
-		flags |= SCTP_ADDR6_PEERSUPP;
-	sctp_bind_addr_copy(&newsp->ep->base.bind_addr,
-			     &oldsp->ep->base.bind_addr,
-			     SCTP_SCOPE_GLOBAL, GFP_KERNEL, flags);
+	sctp_bind_addr_dup(&newsp->ep->base.bind_addr,
+				&oldsp->ep->base.bind_addr, GFP_KERNEL);
 
 	/* Move any messages in the old socket's receive queue that are for the
 	 * peeled off association to the new socket's receive queue.
-- 
1.5.3.5



[PATCH] SCTP: Add bind hash locking to the migrate code

2007-12-06 Thread Vlad Yasevich
SCTP accept code tries to add a newly created socket
to a bind bucket without holding a lock.   On a really
busy system, that can cause slab corruption.
Add a lock around this code.
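
The idea of guarding a hash-bucket insertion can be sketched in userspace.
Note the substitution: the kernel patch uses a spinlock with bottom halves
disabled, whereas this example uses a pthread mutex, and the bucket and node
types are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/* Hypothetical hash bucket: a list head plus the lock that guards it. */
struct bucket {
	pthread_mutex_t lock;
	struct node *head;
};

/* Insert under the bucket lock so that two concurrent inserters cannot
 * both read the same head and lose one of the updates. */
void bucket_add(struct bucket *b, struct node *n)
{
	pthread_mutex_lock(&b->lock);
	n->next = b->head;
	b->head = n;
	pthread_mutex_unlock(&b->lock);
}

/* Count the entries, also under the lock. */
size_t bucket_len(struct bucket *b)
{
	size_t len = 0;

	pthread_mutex_lock(&b->lock);
	for (struct node *n = b->head; n; n = n->next)
		len++;
	pthread_mutex_unlock(&b->lock);
	return len;
}
```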

Signed-off-by: Vlad Yasevich [EMAIL PROTECTED]
---
 net/sctp/socket.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ff8bc95..9f5d793 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6325,6 +6325,7 @@ static void sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
 	struct sctp_endpoint *newep = newsp->ep;
 	struct sk_buff *skb, *tmp;
 	struct sctp_ulpevent *event;
+	struct sctp_bind_hashbucket *head;
 	int flags = 0;
 
 	/* Migrate socket buffer sizes and all the socket level options to the
@@ -6342,10 +6343,15 @@ static void sctp_sock_migrate(struct sock *oldsk, struct sock *newsk,
 	newsp->hmac = NULL;
 
 	/* Hook this new socket in to the bind_hash list. */
+	head = &sctp_port_hashtable[sctp_phashfn(inet_sk(oldsk)->num)];
+	sctp_local_bh_disable();
+	sctp_spin_lock(&head->lock);
 	pp = sctp_sk(oldsk)->bind_hash;
 	sk_add_bind_node(newsk, &pp->owner);
 	sctp_sk(newsk)->bind_hash = pp;
 	inet_sk(newsk)->num = inet_sk(oldsk)->num;
+	sctp_spin_unlock(&head->lock);
+	sctp_local_bh_enable();
 
 	/* Copy the bind_addr list from the original endpoint to the new
 	 * endpoint so that we can handle restarts properly
-- 
1.5.3.5



Re: Reproducible data corruption with sendfile+vsftp - splice regression?

2007-12-06 Thread Francois Romieu
Holger Hoffstaette [EMAIL PROTECTED] :
[...]
> Maybe turning off sendfile or NAPI just lead to random success - so far it
> really looks like tso on the r8169 is the common cause.

TSO on the r8169 is the magic switch but the regression makes imvho more
sense from a VM pov:

- the corrupted file has the same size as the expected file
- the corrupted file exhibits holes which come as a multiple of 4096 bytes
  (8*4k, 2 places, there may be more)
- the r8169 driver does not know what a page is
- the 8169 hardware has a small 8192 bytes Tx buffer

It would be nice if someone could do a sendfile + vsftp test with TSO on a
different hardware. While I could not reproduce the corruption when simply
downloading a file that I had copied on the server with scp, it triggered
almost immediately after I copied it locally and tried to download the copy.

-- 
Ueimor


Re: [Patch] net/xfrm/xfrm_policy.c: Some small improvements

2007-12-06 Thread Herbert Xu
On Thu, Dec 06, 2007 at 03:37:46PM +0100, Richard Knutsson wrote:

> Is it not an improvement to distinct booleans from actual values? Do you
> use integers for ASCII characters too? It can also avoid some potential
> bugs like the 'if (i == TRUE)'...
> What is wrong with 'size_t' (since it is unsigned, compared to (some)
> 'int')?

I agree with Dave.  There are so many useful things that we can do
(and need to do) in IPsec that bool/size_t conversions just add
churn without adding much value.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net-2.6.25 10/11][INET] Eliminate difference in actions of sysctl and proc handler for conf.all.forwarding

2007-12-06 Thread Herbert Xu
On Thu, Dec 06, 2007 at 03:31:14PM +0300, Pavel Emelyanov wrote:

> BTW, this is not 100% true. Look, in rtm_to_ifaddr()
> I see the following code flow:
>
>         ipv4_devconf_setall(in_dev);
>
>         ifa = inet_alloc_ifa();
>         if (ifa == NULL) {
>                 /*
>                  * A potential indev allocation can be left alive, it stays
>                  * assigned to its device and is destroy with it.
>                  */
>                 err = -ENOBUFS;
>                 goto errout;
>         }
>
> if we fail to allocate the ifa (hard to happen, but), we will
> make this device not to accept the default propagation.

Yes that's unintentional.

> If this is a relevant note, I can prepare the patch.

It certainly seems easy enough to fix by just swapping the order.
Please do.

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH] s2io: fix inconsistent hardware VLAN tagging during driver init

2007-12-06 Thread Andy Gospodarek

The s2io driver keeps a local variable around (vlan_strip_flag) to keep
track of the current state of the hardware and whether or not it will
strip VLAN tags on incoming packets.  It seems as though the hardware
default is to strip them, but that variable is not set correctly during
initialization if the default setup is used.  This check ensures
vlan_strip_flag and the hardware setting are in sync.

These variables were introduced by this patch:

commit 926930b202d56c3dfb6aea0a0c6bfba2b87a8c03
Author: Sivakumar Subramani [EMAIL PROTECTED]
Date:   Sat Feb 24 01:59:39 2007 -0500

so this problem hasn't been around forever.

Recent patches from Ramkrishna Vepa [EMAIL PROTECTED] removed this
variable and would have worked around the problem, but they were not
accepted.

Signed-off-by: Andy Gospodarek [EMAIL PROTECTED]

---

 s2io.c |5 +
 1 files changed, 5 insertions(+)

diff --git a/drivers/net/s2io.c b/drivers/net/s2io.c
index 8b9f0ea..08c08de 100644
--- a/drivers/net/s2io.c
+++ b/drivers/net/s2io.c
@@ -2151,6 +2151,11 @@ static int start_nic(struct s2io_nic *nic)
                val64 &= ~RX_PA_CFG_STRIP_VLAN_TAG;
                writeq(val64, &bar0->rx_pa_cfg);
                vlan_strip_flag = 0;
+       } else {
+               val64 = readq(&bar0->rx_pa_cfg);
+               val64 |= RX_PA_CFG_STRIP_VLAN_TAG;
+               writeq(val64, &bar0->rx_pa_cfg);
+               vlan_strip_flag = 1;
        }
 
/*
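
The idea behind the fix is that every path that touches the strip bit also updates the software copy. A minimal standalone sketch of that pattern follows; it is not driver code, and the struct, the function name and the bit position are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical bit position; the real RX_PA_CFG_STRIP_VLAN_TAG may differ. */
#define STRIP_VLAN_TAG (1ULL << 15)

/* Stand-in for the device: a plain variable instead of a mapped register. */
struct fake_nic {
        uint64_t rx_pa_cfg;     /* models the hardware register */
        int vlan_strip_flag;    /* software copy of the strip setting */
};

/*
 * Read-modify-write the register and update the flag in the same place,
 * so the two cannot drift apart whichever branch is taken.
 */
static void set_vlan_strip(struct fake_nic *nic, int enable)
{
        uint64_t val = nic->rx_pa_cfg;          /* read */

        if (enable)
                val |= STRIP_VLAN_TAG;          /* set only the strip bit */
        else
                val &= ~STRIP_VLAN_TAG;         /* clear only the strip bit */

        nic->rx_pa_cfg = val;                   /* write back */
        nic->vlan_strip_flag = enable ? 1 : 0;
}
```

Other bits in the register are preserved because only the strip bit is masked in or out.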


Re: [PATCH] iproute2: support dotted-quad netmask notation.

2007-12-06 Thread Stephen Hemminger
On Tue, 4 Dec 2007 14:58:18 +0100
Andreas Henriksson [EMAIL PROTECTED] wrote:

> Suggested patch for allowing netmask to be specified in dotted quad format.
> See http://bugs.debian.org/357172
>
> (Known problem: this will not prevent some invalid syntaxes,
> ie. 255.0.255.0 will be treated as 255.255.255.0)
>
> Comments? Suggestions? Improvements?

Fix the bug you mentioned?

/* a valid netmask must be 2^n - 1 (n = 1..31) */
static int is_valid_netmask(const inet_prefix *addr)
{
        uint32_t host;

        if (addr->family != AF_INET)
                return 0;

        host = ~ntohl(addr->data[0]);
        if (host == 0 || ~host == 0)
                return 0;

        return (host & (host + 1)) == 0;
}
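
For readers who want to try the check outside iproute2, here is a minimal userspace sketch of the same bit trick. It takes the mask in host byte order instead of an inet_prefix, and both function names are made up for this example.

```c
#include <stdint.h>

/*
 * Returns 1 when the mask is a run of 1 bits followed by a run of 0 bits
 * with prefix length 1..31, and 0 otherwise. The inverted mask (the host
 * part) must then have the form 2^n - 1, and adding 1 to such a value
 * yields a single bit that shares no bits with it.
 */
static int netmask_is_contiguous(uint32_t mask)
{
        uint32_t host = ~mask;

        if (host == 0 || host == UINT32_MAX)    /* reject /32 and /0 */
                return 0;

        return (host & (host + 1)) == 0;
}

/* Hypothetical helper: prefix length of a mask already known to be valid. */
static int netmask_to_prefixlen(uint32_t mask)
{
        int len = 0;

        while (mask & 0x80000000u) {    /* count leading 1 bits */
                len++;
                mask <<= 1;
        }
        return len;
}
```

With this check, 255.0.255.0 (0xFF00FF00) is rejected rather than silently treated as 255.255.255.0.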


[PATCH 2/7] [DCCP]: Introduce generic function to test for `data packets'

2007-12-06 Thread Arnaldo Carvalho de Melo
From: Gerrit Renker [EMAIL PROTECTED]

as per  RFC 4340, sec. 7.7.

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Ian McDonald [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/dccp.h |   12 
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/net/dccp/dccp.h b/net/dccp/dccp.h
index ee97950..f4a5ea1 100644
--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -334,6 +334,7 @@ struct dccp_skb_cb {
 
 #define DCCP_SKB_CB(__skb) ((struct dccp_skb_cb *)&((__skb)->cb[0]))
 
+/* RFC 4340, sec. 7.7 */
 static inline int dccp_non_data_packet(const struct sk_buff *skb)
 {
        const __u8 type = DCCP_SKB_CB(skb)->dccpd_type;
@@ -346,6 +347,17 @@ static inline int dccp_non_data_packet(const struct sk_buff *skb)
               type == DCCP_PKT_SYNCACK;
 }
 
+/* RFC 4340, sec. 7.7 */
+static inline int dccp_data_packet(const struct sk_buff *skb)
+{
+       const __u8 type = DCCP_SKB_CB(skb)->dccpd_type;
+
+       return type == DCCP_PKT_DATA     ||
+              type == DCCP_PKT_DATAACK  ||
+              type == DCCP_PKT_REQUEST  ||
+              type == DCCP_PKT_RESPONSE;
+}
+
 static inline int dccp_packet_without_ack(const struct sk_buff *skb)
 {
        const __u8 type = DCCP_SKB_CB(skb)->dccpd_type;
-- 
1.5.3.4
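
The data / non-data split from RFC 4340, sec. 7.7 can be tried outside the kernel with a small sketch like the one below. The enum and function names are illustrative only and are not the kernel's; the numeric codes follow the packet type table in RFC 4340, sec. 5.1.

```c
#include <stdint.h>

/* DCCP packet type codes (RFC 4340, sec. 5.1); names are illustrative. */
enum pkt_type {
        PKT_REQUEST = 0,
        PKT_RESPONSE,
        PKT_DATA,
        PKT_ACK,
        PKT_DATAACK,
        PKT_CLOSEREQ,
        PKT_CLOSE,
        PKT_RESET,
        PKT_SYNC,
        PKT_SYNCACK
};

/* Data packets, mirroring the set tested by the patch above. */
static int is_data_packet(uint8_t type)
{
        return type == PKT_DATA    ||
               type == PKT_DATAACK ||
               type == PKT_REQUEST ||
               type == PKT_RESPONSE;
}

/* Non-data packets: the remaining six defined types. */
static int is_non_data_packet(uint8_t type)
{
        return type == PKT_ACK      ||
               type == PKT_CLOSE    ||
               type == PKT_CLOSEREQ ||
               type == PKT_RESET    ||
               type == PKT_SYNC     ||
               type == PKT_SYNCACK;
}
```

Every defined type falls into exactly one of the two sets, so the two predicates are complements over codes 0 to 9.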



[PATCH 6/7] [CCID3]: The receiver of a half-connection does not set window counter values

2007-12-06 Thread Arnaldo Carvalho de Melo
From: Gerrit Renker [EMAIL PROTECTED]

Only the sender sets window counters [RFC 4342, sections 5 and 8.1].

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Ian McDonald [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/ccid3.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index c95dca8..5ff5aab 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -733,7 +733,6 @@ static int ccid3_hc_rx_insert_options(struct sock *sk, 
struct sk_buff *skb)
return 0;
 
hcrx = ccid3_hc_rx_sk(sk);
-   DCCP_SKB_CB(skb)->dccpd_ccval = hcrx->ccid3hcrx_ccval_last_counter;
 
if (dccp_packet_without_ack(skb))
return 0;
-- 
1.5.3.4



[PATCH 4/7] [TFRC]: Make the rx history slab be global

2007-12-06 Thread Arnaldo Carvalho de Melo
This is in preparation for merging the new rx history code written by Gerrit 
Renker.

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/ccid3.c  |   35 ++---
 net/dccp/ccids/lib/packet_history.c |   95 ++-
 net/dccp/ccids/lib/packet_history.h |   43 ++--
 3 files changed, 60 insertions(+), 113 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index 5dea690..07920bb 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -49,8 +49,6 @@ static int ccid3_debug;
 #define ccid3_pr_debug(format, a...)
 #endif
 
-static struct dccp_rx_hist *ccid3_rx_hist;
-
 /*
  * Transmitter Half-Connection Routines
  */
@@ -807,9 +805,9 @@ static int ccid3_hc_rx_detect_loss(struct sock *sk,
}
 
 detect_out:
-   dccp_rx_hist_add_packet(ccid3_rx_hist, hcrx-ccid3hcrx_hist,
-  hcrx-ccid3hcrx_li_hist, packet,
-  hcrx-ccid3hcrx_seqno_nonloss);
+   dccp_rx_hist_add_packet(hcrx-ccid3hcrx_hist,
+   hcrx-ccid3hcrx_li_hist, packet,
+   hcrx-ccid3hcrx_seqno_nonloss);
return loss;
 }
 
@@ -852,8 +850,7 @@ static void ccid3_hc_rx_packet_recv(struct sock *sk, struct 
sk_buff *skb)
return;
}
 
-   packet = dccp_rx_hist_entry_new(ccid3_rx_hist, opt_recv-dccpor_ndp,
-   skb, GFP_ATOMIC);
+   packet = dccp_rx_hist_entry_new(opt_recv-dccpor_ndp, skb, GFP_ATOMIC);
if (unlikely(packet == NULL)) {
DCCP_WARN(%s(%p), Not enough mem to add rx packet 
  to history, consider it lost!\n, dccp_role(sk), sk);
@@ -936,7 +933,7 @@ static void ccid3_hc_rx_exit(struct sock *sk)
ccid3_hc_rx_set_state(sk, TFRC_RSTATE_TERM);
 
/* Empty packet history */
-   dccp_rx_hist_purge(ccid3_rx_hist, hcrx-ccid3hcrx_hist);
+   dccp_rx_hist_purge(hcrx-ccid3hcrx_hist);
 
/* Empty loss interval history */
dccp_li_hist_purge(hcrx-ccid3hcrx_li_hist);
@@ -1013,33 +1010,13 @@ MODULE_PARM_DESC(ccid3_debug, "Enable debug messages");
 
 static __init int ccid3_module_init(void)
 {
-       int rc = -ENOBUFS;
-
-       ccid3_rx_hist = dccp_rx_hist_new("ccid3");
-       if (ccid3_rx_hist == NULL)
-               goto out;
-
-       rc = ccid_register(&ccid3);
-       if (rc != 0)
-               goto out_free_rx;
-out:
-       return rc;
-
-out_free_rx:
-       dccp_rx_hist_delete(ccid3_rx_hist);
-       ccid3_rx_hist = NULL;
-       goto out;
+       return ccid_register(&ccid3);
 }
 module_init(ccid3_module_init);
 
 static __exit void ccid3_module_exit(void)
 {
        ccid_unregister(&ccid3);
-
-       if (ccid3_rx_hist != NULL) {
-               dccp_rx_hist_delete(ccid3_rx_hist);
-               ccid3_rx_hist = NULL;
-       }
 }
 module_exit(ccid3_module_exit);
 
diff --git a/net/dccp/ccids/lib/packet_history.c 
b/net/dccp/ccids/lib/packet_history.c
index b628714..e1ab853 100644
--- a/net/dccp/ccids/lib/packet_history.c
+++ b/net/dccp/ccids/lib/packet_history.c
@@ -114,48 +114,33 @@ EXPORT_SYMBOL_GPL(tfrc_tx_hist_rtt);
 /*
  * Receiver History Routines
  */
-struct dccp_rx_hist *dccp_rx_hist_new(const char *name)
+static struct kmem_cache *tfrc_rx_hist_slab;
+
+struct dccp_rx_hist_entry *dccp_rx_hist_entry_new(const u32 ndp,
+ const struct sk_buff *skb,
+ const gfp_t prio)
 {
-   struct dccp_rx_hist *hist = kmalloc(sizeof(*hist), GFP_ATOMIC);
-   static const char dccp_rx_hist_mask[] = rx_hist_%s;
-   char *slab_name;
-
-   if (hist == NULL)
-   goto out;
-
-   slab_name = kmalloc(strlen(name) + sizeof(dccp_rx_hist_mask) - 1,
-   GFP_ATOMIC);
-   if (slab_name == NULL)
-   goto out_free_hist;
-
-   sprintf(slab_name, dccp_rx_hist_mask, name);
-   hist-dccprxh_slab = kmem_cache_create(slab_name,
-sizeof(struct dccp_rx_hist_entry),
-0, SLAB_HWCACHE_ALIGN, NULL);
-   if (hist-dccprxh_slab == NULL)
-   goto out_free_slab_name;
-out:
-   return hist;
-out_free_slab_name:
-   kfree(slab_name);
-out_free_hist:
-   kfree(hist);
-   hist = NULL;
-   goto out;
-}
+   struct dccp_rx_hist_entry *entry = kmem_cache_alloc(tfrc_rx_hist_slab,
+   prio);
 
-EXPORT_SYMBOL_GPL(dccp_rx_hist_new);
+   if (entry != NULL) {
+   const struct dccp_hdr *dh = dccp_hdr(skb);
 
-void dccp_rx_hist_delete(struct dccp_rx_hist *hist)
-{
-   const char* name = kmem_cache_name(hist-dccprxh_slab);
+   entry-dccphrx_seqno = DCCP_SKB_CB(skb)-dccpd_seq;
+   entry-dccphrx_ccval = 

[PATCH 5/7] [TFRC]: Rename dccp_rx_ to tfrc_rx_

2007-12-06 Thread Arnaldo Carvalho de Melo
This is in preparation for merging the new rx history code written by Gerrit 
Renker.

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/ccid3.c  |   32 ++--
 net/dccp/ccids/lib/loss_interval.c  |   14 +++---
 net/dccp/ccids/lib/packet_history.c |   90 +-
 net/dccp/ccids/lib/packet_history.h |   48 +-
 4 files changed, 92 insertions(+), 92 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index 07920bb..c95dca8 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -677,7 +677,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk)
 {
struct ccid3_hc_rx_sock *hcrx = ccid3_hc_rx_sk(sk);
struct dccp_sock *dp = dccp_sk(sk);
-   struct dccp_rx_hist_entry *packet;
+   struct tfrc_rx_hist_entry *packet;
ktime_t now;
suseconds_t delta;
 
@@ -701,7 +701,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk)
return;
}
 
-   packet = dccp_rx_hist_find_data_packet(hcrx-ccid3hcrx_hist);
+   packet = tfrc_rx_hist_find_data_packet(hcrx-ccid3hcrx_hist);
if (unlikely(packet == NULL)) {
DCCP_WARN(%s(%p), no data packet in history!\n,
  dccp_role(sk), sk);
@@ -709,7 +709,7 @@ static void ccid3_hc_rx_send_feedback(struct sock *sk)
}
 
hcrx-ccid3hcrx_tstamp_last_feedback = now;
-   hcrx-ccid3hcrx_ccval_last_counter   = packet-dccphrx_ccval;
+   hcrx-ccid3hcrx_ccval_last_counter   = packet-tfrchrx_ccval;
hcrx-ccid3hcrx_bytes_recv   = 0;
 
if (hcrx-ccid3hcrx_p == 0)
@@ -752,12 +752,12 @@ static int ccid3_hc_rx_insert_options(struct sock *sk, 
struct sk_buff *skb)
 }
 
 static int ccid3_hc_rx_detect_loss(struct sock *sk,
-   struct dccp_rx_hist_entry *packet)
+   struct tfrc_rx_hist_entry *packet)
 {
struct ccid3_hc_rx_sock *hcrx = ccid3_hc_rx_sk(sk);
-   struct dccp_rx_hist_entry *rx_hist =
-   dccp_rx_hist_head(hcrx-ccid3hcrx_hist);
-   u64 seqno = packet-dccphrx_seqno;
+   struct tfrc_rx_hist_entry *rx_hist =
+   tfrc_rx_hist_head(hcrx-ccid3hcrx_hist);
+   u64 seqno = packet-tfrchrx_seqno;
u64 tmp_seqno;
int loss = 0;
u8 ccval;
@@ -766,9 +766,9 @@ static int ccid3_hc_rx_detect_loss(struct sock *sk,
tmp_seqno = hcrx-ccid3hcrx_seqno_nonloss;
 
if (!rx_hist ||
-  follows48(packet-dccphrx_seqno, hcrx-ccid3hcrx_seqno_nonloss)) {
+  follows48(packet-tfrchrx_seqno, hcrx-ccid3hcrx_seqno_nonloss)) {
hcrx-ccid3hcrx_seqno_nonloss = seqno;
-   hcrx-ccid3hcrx_ccval_nonloss = packet-dccphrx_ccval;
+   hcrx-ccid3hcrx_ccval_nonloss = packet-tfrchrx_ccval;
goto detect_out;
}
 
@@ -789,7 +789,7 @@ static int ccid3_hc_rx_detect_loss(struct sock *sk,
dccp_inc_seqno(tmp_seqno);
hcrx-ccid3hcrx_seqno_nonloss = tmp_seqno;
dccp_inc_seqno(tmp_seqno);
-   while (dccp_rx_hist_find_entry(hcrx-ccid3hcrx_hist,
+   while (tfrc_rx_hist_find_entry(hcrx-ccid3hcrx_hist,
   tmp_seqno, ccval)) {
hcrx-ccid3hcrx_seqno_nonloss = tmp_seqno;
hcrx-ccid3hcrx_ccval_nonloss = ccval;
@@ -799,13 +799,13 @@ static int ccid3_hc_rx_detect_loss(struct sock *sk,
 
/* FIXME - this code could be simplified with above while */
/* but works at moment */
-   if (follows48(packet-dccphrx_seqno, hcrx-ccid3hcrx_seqno_nonloss)) {
+   if (follows48(packet-tfrchrx_seqno, hcrx-ccid3hcrx_seqno_nonloss)) {
hcrx-ccid3hcrx_seqno_nonloss = seqno;
-   hcrx-ccid3hcrx_ccval_nonloss = packet-dccphrx_ccval;
+   hcrx-ccid3hcrx_ccval_nonloss = packet-tfrchrx_ccval;
}
 
 detect_out:
-   dccp_rx_hist_add_packet(hcrx-ccid3hcrx_hist,
+   tfrc_rx_hist_add_packet(hcrx-ccid3hcrx_hist,
hcrx-ccid3hcrx_li_hist, packet,
hcrx-ccid3hcrx_seqno_nonloss);
return loss;
@@ -815,7 +815,7 @@ static void ccid3_hc_rx_packet_recv(struct sock *sk, struct 
sk_buff *skb)
 {
struct ccid3_hc_rx_sock *hcrx = ccid3_hc_rx_sk(sk);
const struct dccp_options_received *opt_recv;
-   struct dccp_rx_hist_entry *packet;
+   struct tfrc_rx_hist_entry *packet;
u32 p_prev, r_sample, rtt_prev;
int loss, payload_size;
ktime_t now;
@@ -850,7 +850,7 @@ static void ccid3_hc_rx_packet_recv(struct sock *sk, struct 
sk_buff *skb)
return;
}
 
-   packet = dccp_rx_hist_entry_new(opt_recv-dccpor_ndp, skb, GFP_ATOMIC);
+   packet = tfrc_rx_hist_entry_new(opt_recv-dccpor_ndp, skb, 

[PATCH 7/7] [TFRC]: New rx history code

2007-12-06 Thread Arnaldo Carvalho de Melo
Credit here goes to Gerrit Renker, who provided the initial implementation for
this new codebase.

I modified it only to bring it closer to the existing API: renaming some
functions, adding namespacing, and fixing one bug where tfrc_rx_hist_alloc was
not freeing the allocated ring entries on the error path.

Original changeset comment from Gerrit:
  ---
This provides a new, self-contained and generic RX history service for TFRC
based protocols.

Details:
 * new data structure, initialisation and cleanup routines;
 * allocation of dccp_rx_hist entries local to packet_history.c,
   as a service exported by the dccp_tfrc_lib module.
 * interface to automatically track highest-received seqno;
 * receiver-based RTT estimation (needed for instance by RFC 3448, 6.3.1);
 * a generic function to test for `data packets' as per  RFC 4340, sec. 7.7.

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/ccid3.c  |  288 --
 net/dccp/ccids/ccid3.h  |   14 +-
 net/dccp/ccids/lib/loss_interval.c  |   13 ++-
 net/dccp/ccids/lib/packet_history.c |  290 +--
 net/dccp/ccids/lib/packet_history.h |   83 +--
 5 files changed, 330 insertions(+), 358 deletions(-)

diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c
index 5ff5aab..faacffa 100644
--- a/net/dccp/ccids/ccid3.c
+++ b/net/dccp/ccids/ccid3.c
@@ -641,6 +641,15 @@ static int ccid3_hc_tx_getsockopt(struct sock *sk, const 
int optname, int len,
 /*
  * Receiver Half-Connection Routines
  */
+
+/* CCID3 feedback types */
+enum ccid3_fback_type {
+   CCID3_FBACK_NONE = 0,
+   CCID3_FBACK_INITIAL,
+   CCID3_FBACK_PERIODIC,
+   CCID3_FBACK_PARAM_CHANGE
+};
+
 #ifdef CONFIG_IP_DCCP_CCID3_DEBUG
 static const char *ccid3_rx_state_name(enum ccid3_hc_rx_states state)
 {
@@ -667,59 +676,60 @@ static void ccid3_hc_rx_set_state(struct sock *sk,
hcrx-ccid3hcrx_state = state;
 }
 
-static inline void ccid3_hc_rx_update_s(struct ccid3_hc_rx_sock *hcrx, int len)
-{
-       if (likely(len > 0))    /* don't update on empty packets (e.g. ACKs) */
-               hcrx->ccid3hcrx_s = tfrc_ewma(hcrx->ccid3hcrx_s, len, 9);
-}
-
-static void ccid3_hc_rx_send_feedback(struct sock *sk)
+static void ccid3_hc_rx_send_feedback(struct sock *sk,
+ const struct sk_buff *skb,
+ enum ccid3_fback_type fbtype)
 {
        struct ccid3_hc_rx_sock *hcrx = ccid3_hc_rx_sk(sk);
        struct dccp_sock *dp = dccp_sk(sk);
-       struct tfrc_rx_hist_entry *packet;
        ktime_t now;
-       suseconds_t delta;
+       s64 delta = 0;
 
        ccid3_pr_debug("%s(%p) - entry \n", dccp_role(sk), sk);
 
+       if (unlikely(hcrx->ccid3hcrx_state == TFRC_RSTATE_TERM))
+               return;
+
        now = ktime_get_real();
 
-       switch (hcrx->ccid3hcrx_state) {
-       case TFRC_RSTATE_NO_DATA:
+       switch (fbtype) {
+       case CCID3_FBACK_INITIAL:
                hcrx->ccid3hcrx_x_recv = 0;
+               hcrx->ccid3hcrx_pinv   = ~0U;   /* see RFC 4342, 8.5 */
                break;
-       case TFRC_RSTATE_DATA:
-               delta = ktime_us_delta(now,
-                                      hcrx->ccid3hcrx_tstamp_last_feedback);
-               DCCP_BUG_ON(delta < 0);
-               hcrx->ccid3hcrx_x_recv =
-                       scaled_div32(hcrx->ccid3hcrx_bytes_recv, delta);
+       case CCID3_FBACK_PARAM_CHANGE:
+               /*
+                * When parameters change (new loss or p > p_prev), we do not
+                * have a reliable estimate for R_m of [RFC 3448, 6.2] and so
+                * need to reuse the previous value of X_recv. However, when
+                * X_recv was 0 (due to early loss), this would kill X down to
+                * s/t_mbi (i.e. one packet in 64 seconds).
+                * To avoid such drastic reduction, we approximate X_recv as
+                * the number of bytes since last feedback.
+                * This is a safe fallback, since X is bounded above by X_calc.
+                */
+               if (hcrx->ccid3hcrx_x_recv > 0)
+                       break;
+               /* fall through */
+       case CCID3_FBACK_PERIODIC:
+               delta = ktime_us_delta(now, hcrx->ccid3hcrx_tstamp_last_feedback);
+               if (delta <= 0)
+                       DCCP_BUG("delta (%ld) <= 0", (long)delta);
+               else
+                       hcrx->ccid3hcrx_x_recv =
+                               scaled_div32(hcrx->ccid3hcrx_bytes_recv, delta);
                break;
-       case TFRC_RSTATE_TERM:
-               DCCP_BUG("%s(%p) - Illegal state TERM", dccp_role(sk), sk);
+       default:
                return;
        }
 
-   packet = tfrc_rx_hist_find_data_packet(hcrx-ccid3hcrx_hist);
-   if (unlikely(packet == NULL)) {
-   DCCP_WARN(%s(%p), no data 
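
The X_recv update in the hunk above divides the bytes received since the last feedback by the elapsed time in microseconds. A rough standalone sketch of that arithmetic follows. It assumes scaled_div32() scales the numerator by 10^6 so that the result is in bytes per second; the function name and the "keep the previous value" convention for a non-positive interval are choices made for this example, not taken from the kernel.

```c
#include <stdint.h>

/*
 * Receive rate in bytes per second from a byte count and an interval in
 * microseconds. A non-positive interval leaves the previous estimate
 * unchanged, loosely matching the patch, which reports a bug and skips
 * the update in that case. The result is truncated to 32 bits.
 */
static uint32_t x_recv_estimate(uint32_t bytes_recv, int64_t delta_us,
                                uint32_t prev)
{
        if (delta_us <= 0)
                return prev;

        /* widen first so bytes * 10^6 cannot overflow */
        return (uint32_t)(((uint64_t)bytes_recv * 1000000u) /
                          (uint64_t)delta_us);
}
```

For instance, 1500 bytes received over 500000 microseconds gives an estimate of 3000 bytes per second.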

[PATCHES 0/7]: DCCP patches for 2.6.25

2007-12-06 Thread Arnaldo Carvalho de Melo
Hi David,

Please consider pulling from:

master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.25

Best Regards,

- Arnaldo

 b/net/dccp/ccids/Kconfig  |   13 
 b/net/dccp/ccids/ccid3.c  |   35 --
 b/net/dccp/ccids/ccid3.h  |   14 
 b/net/dccp/ccids/lib/Makefile |2 
 b/net/dccp/ccids/lib/loss_interval.c  |   14 
 b/net/dccp/ccids/lib/packet_history.c |   27 -
 b/net/dccp/ccids/lib/packet_history.h |3 
 b/net/dccp/ccids/lib/tfrc.c   |   48 +++
 b/net/dccp/ccids/lib/tfrc.h   |   18 -
 b/net/dccp/dccp.h |   13 
 net/dccp/ccids/ccid3.c|  322 --
 net/dccp/ccids/lib/loss_interval.c|   13 
 net/dccp/ccids/lib/packet_history.c   |  496 +++---
 net/dccp/ccids/lib/packet_history.h   |  177 
 14 files changed, 579 insertions(+), 616 deletions(-)


[PATCH 3/7] [TFRC]: Rename tfrc_tx_hist to tfrc_tx_hist_slab, for consistency

2007-12-06 Thread Arnaldo Carvalho de Melo
Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/lib/packet_history.c |   20 ++--
 1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/dccp/ccids/lib/packet_history.c 
b/net/dccp/ccids/lib/packet_history.c
index 1d4d6ee..b628714 100644
--- a/net/dccp/ccids/lib/packet_history.c
+++ b/net/dccp/ccids/lib/packet_history.c
@@ -53,7 +53,7 @@ struct tfrc_tx_hist_entry {
 /*
  * Transmitter History Routines
  */
-static struct kmem_cache *tfrc_tx_hist;
+static struct kmem_cache *tfrc_tx_hist_slab;
 
 static struct tfrc_tx_hist_entry *
tfrc_tx_hist_find_entry(struct tfrc_tx_hist_entry *head, u64 seqno)
@@ -66,7 +66,7 @@ static struct tfrc_tx_hist_entry *
 
 int tfrc_tx_hist_add(struct tfrc_tx_hist_entry **headp, u64 seqno)
 {
-   struct tfrc_tx_hist_entry *entry = kmem_cache_alloc(tfrc_tx_hist, 
gfp_any());
+   struct tfrc_tx_hist_entry *entry = kmem_cache_alloc(tfrc_tx_hist_slab, 
gfp_any());
 
if (entry == NULL)
return -ENOBUFS;
@@ -85,7 +85,7 @@ void tfrc_tx_hist_purge(struct tfrc_tx_hist_entry **headp)
while (head != NULL) {
struct tfrc_tx_hist_entry *next = head->next;
 
-   kmem_cache_free(tfrc_tx_hist, head);
+   kmem_cache_free(tfrc_tx_hist_slab, head);
head = next;
}
 
@@ -278,17 +278,17 @@ EXPORT_SYMBOL_GPL(dccp_rx_hist_purge);
 
 __init int packet_history_init(void)
 {
-       tfrc_tx_hist = kmem_cache_create("tfrc_tx_hist",
-                                        sizeof(struct tfrc_tx_hist_entry), 0,
-                                        SLAB_HWCACHE_ALIGN, NULL);
+       tfrc_tx_hist_slab = kmem_cache_create("tfrc_tx_hist",
+                                             sizeof(struct tfrc_tx_hist_entry), 0,
+                                             SLAB_HWCACHE_ALIGN, NULL);
 
-   return tfrc_tx_hist == NULL ? -ENOBUFS : 0;
+   return tfrc_tx_hist_slab == NULL ? -ENOBUFS : 0;
 }
 
 void packet_history_exit(void)
 {
-   if (tfrc_tx_hist != NULL) {
-   kmem_cache_destroy(tfrc_tx_hist);
-   tfrc_tx_hist = NULL;
+   if (tfrc_tx_hist_slab != NULL) {
+   kmem_cache_destroy(tfrc_tx_hist_slab);
+   tfrc_tx_hist_slab = NULL;
}
 }
-- 
1.5.3.4



[PATCH 1/7] [TFRC]: Provide central source file and debug facility

2007-12-06 Thread Arnaldo Carvalho de Melo
From: Gerrit Renker [EMAIL PROTECTED]

This patch changes the tfrc_lib module in the following manner:

 (1) a dedicated tfrc source file to call the packet history and
     loss interval init/exit functions.
 (2) a dedicated tfrc_pr_debug macro with toggle switch `tfrc_debug'.

Committer note: renamed tfrc_module.c to tfrc.c, and made CONFIG_IP_DCCP_CCID3
select IP_DCCP_TFRC_LIB.

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
Signed-off-by: Ian McDonald [EMAIL PROTECTED]
Signed-off-by: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
---
 net/dccp/ccids/Kconfig  |   13 ++---
 net/dccp/ccids/lib/Makefile |2 +-
 net/dccp/ccids/lib/packet_history.c |   27 ++-
 net/dccp/ccids/lib/packet_history.h |3 +-
 net/dccp/ccids/lib/tfrc.c   |   48 +++
 net/dccp/ccids/lib/tfrc.h   |   17 +---
 6 files changed, 75 insertions(+), 35 deletions(-)
 create mode 100644 net/dccp/ccids/lib/tfrc.c

diff --git a/net/dccp/ccids/Kconfig b/net/dccp/ccids/Kconfig
index 3d7d867..1227594 100644
--- a/net/dccp/ccids/Kconfig
+++ b/net/dccp/ccids/Kconfig
@@ -38,6 +38,7 @@ config IP_DCCP_CCID2_DEBUG
 config IP_DCCP_CCID3
tristate "CCID3 (TCP-Friendly) (EXPERIMENTAL)"
def_tristate IP_DCCP
+   select IP_DCCP_TFRC_LIB
---help---
  CCID 3 denotes TCP-Friendly Rate Control (TFRC), an equation-based
  rate-controlled congestion control mechanism.  TFRC is designed to
@@ -63,10 +64,6 @@ config IP_DCCP_CCID3
 
  If in doubt, say M.
 
-config IP_DCCP_TFRC_LIB
-   depends on IP_DCCP_CCID3
-   def_tristate IP_DCCP_CCID3
-
 config IP_DCCP_CCID3_DEBUG
  bool "CCID3 debugging messages"
  depends on IP_DCCP_CCID3
@@ -110,5 +107,13 @@ config IP_DCCP_CCID3_RTO
is serious network congestion: experimenting with larger values 
should
therefore not be performed on WANs.
 
+config IP_DCCP_TFRC_LIB
+   tristate
+   default n
+
+config IP_DCCP_TFRC_DEBUG
+   bool
+   depends on IP_DCCP_TFRC_LIB
+   default y if IP_DCCP_CCID3_DEBUG
 
 endmenu
diff --git a/net/dccp/ccids/lib/Makefile b/net/dccp/ccids/lib/Makefile
index 5f940a6..68c93e3 100644
--- a/net/dccp/ccids/lib/Makefile
+++ b/net/dccp/ccids/lib/Makefile
@@ -1,3 +1,3 @@
 obj-$(CONFIG_IP_DCCP_TFRC_LIB) += dccp_tfrc_lib.o
 
-dccp_tfrc_lib-y := loss_interval.o packet_history.o tfrc_equation.o
+dccp_tfrc_lib-y := tfrc.o tfrc_equation.o packet_history.o loss_interval.o
diff --git a/net/dccp/ccids/lib/packet_history.c 
b/net/dccp/ccids/lib/packet_history.c
index 4805de9..1d4d6ee 100644
--- a/net/dccp/ccids/lib/packet_history.c
+++ b/net/dccp/ccids/lib/packet_history.c
@@ -35,7 +35,6 @@
  *  Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  */
 
-#include <linux/module.h>
 #include <linux/string.h>
 #include "packet_history.h"
 
@@ -277,39 +276,19 @@ void dccp_rx_hist_purge(struct dccp_rx_hist *hist, struct 
list_head *list)
 
 EXPORT_SYMBOL_GPL(dccp_rx_hist_purge);
 
-extern int __init dccp_li_init(void);
-extern void dccp_li_exit(void);
-
-static __init int packet_history_init(void)
+__init int packet_history_init(void)
 {
-   if (dccp_li_init() != 0)
-   goto out;
-
tfrc_tx_hist = kmem_cache_create(tfrc_tx_hist,
 sizeof(struct tfrc_tx_hist_entry), 0,
 SLAB_HWCACHE_ALIGN, NULL);
-   if (tfrc_tx_hist == NULL)
-   goto out_li_exit;
 
-   return 0;
-out_li_exit:
-   dccp_li_exit();
-out:
-   return -ENOBUFS;
+   return tfrc_tx_hist == NULL ? -ENOBUFS : 0;
 }
-module_init(packet_history_init);
 
-static __exit void packet_history_exit(void)
+void packet_history_exit(void)
 {
if (tfrc_tx_hist != NULL) {
kmem_cache_destroy(tfrc_tx_hist);
tfrc_tx_hist = NULL;
}
-   dccp_li_exit();
 }
-module_exit(packet_history_exit);
-
-MODULE_AUTHOR("Ian McDonald [EMAIL PROTECTED], "
-             "Arnaldo Carvalho de Melo [EMAIL PROTECTED]");
-MODULE_DESCRIPTION("DCCP TFRC library");
-MODULE_LICENSE("GPL");
diff --git a/net/dccp/ccids/lib/packet_history.h 
b/net/dccp/ccids/lib/packet_history.h
index 0670f46..9a2642e 100644
--- a/net/dccp/ccids/lib/packet_history.h
+++ b/net/dccp/ccids/lib/packet_history.h
@@ -39,8 +39,7 @@
 #include <linux/ktime.h>
 #include <linux/list.h>
 #include <linux/slab.h>
-
-#include "../../dccp.h"
+#include "tfrc.h"
 
 /* Number of later packets received before one is considered lost */
 #define TFRC_RECV_NUM_LATE_LOSS 3
diff --git a/net/dccp/ccids/lib/tfrc.c b/net/dccp/ccids/lib/tfrc.c
new file mode 100644
index 000..3a7a183
--- /dev/null
+++ b/net/dccp/ccids/lib/tfrc.c
@@ -0,0 +1,48 @@
+/*
+ * TFRC: main module holding the pieces of the TFRC library together
+ *
+ * Copyright (c) 2007 The University of Aberdeen, Scotland, UK
+ * Copyright (c) 2007 Arnaldo Carvalho de Melo [EMAIL PROTECTED]
+ */
+#include 

Re: [PATCH] Reduce stack used by lib/hexdump.c

2007-12-06 Thread Joe Perches
On Wed, 2007-12-05 at 16:01 -0800, Andrew Morton wrote:
> No, I think print_hex_dump() is too low-level to be doing allocations.
> For example, one could easily choose to call print_hex_dump() at oops time,
> and then what happens if we oops in kmalloc() (as we often do...)?
>
> You could trim linebuf[] to 80 chars or so.  Extra points for making it
> very clear when someone tries to exceed that - strcpy(linebuf, "stop being
> stupid").

No extra points, but here's a revised patch to hexdump against
Linus' current:

hex_dump_to_buffer:
    Removes casts to type for non-1 group sizes
        Used by: fs/ext(3|4)/super.c, fs/jfs
        If someone really dislikes this change, please say so.
        I think casting to type in a hex dump is odd, especially
        for mixed type structures.
        If you want an array of type dumper, it probably
        shouldn't be called hex_dump_to_buffer.
    Groups by arbitrary size

print_hex_dump:
    Removes rowsize argument
    Reduces linebuf stack use to ~120 bytes
        (prefix:25 + address:20 + data:48 + ascii:20)
    Aligns multiline ascii output
    Changes return to size_t, number of bytes actually output

include/linux/kernel.h
Removes hex_asc define
Updates hex_dump prototypes

The rest are trivial conversions to new argument list.

size before:
   textdata bss dec hex filename
   1142   0   01142 476 lib/hexdump.o

size after:
   textdata bss dec hex filename
823   0   0 823 337 lib/hexdump.o

Signed-off-by: Joe Perches [EMAIL PROTECTED]
---
 include/linux/kernel.h  |   13 +-
 lib/hexdump.c   |  164 ---
 drivers/mtd/ubi/debug.c |2 +-
 drivers/mtd/ubi/io.c|2 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c |4 +-
 drivers/net/wireless/iwlwifi/iwl4965-base.c |4 +-
 drivers/scsi/ide-scsi.c |8 +-
 drivers/usb/gadget/file_storage.c   |4 +-
 fs/ext3/super.c |6 +-
 fs/ext4/super.c |6 +-
 fs/jffs2/wbuf.c |4 +-
 fs/jfs/jfs_imap.c   |2 +-
 fs/jfs/jfs_logmgr.c |6 +-
 fs/jfs/jfs_metapage.c   |2 +-
 fs/jfs/jfs_txnmgr.c |8 +-
 fs/jfs/xattr.c  |4 +-
 16 files changed, 110 insertions(+), 129 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 94bc996..ab45524 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -248,15 +248,14 @@ enum {
DUMP_PREFIX_ADDRESS,
DUMP_PREFIX_OFFSET
 };
-extern void hex_dump_to_buffer(const void *buf, size_t len,
-   int rowsize, int groupsize,
-   char *linebuf, size_t linebuflen, bool ascii);
+extern size_t hex_dump_to_buffer(const void *buf, size_t len,
+size_t rowsize, size_t groupsize,
+char *linebuf, size_t linebuflen, bool ascii);
 extern void print_hex_dump(const char *level, const char *prefix_str,
-   int prefix_type, int rowsize, int groupsize,
-   const void *buf, size_t len, bool ascii);
+  int prefix_type, size_t groupsize,
+  const void *buf, size_t len, bool ascii);
 extern void print_hex_dump_bytes(const char *prefix_str, int prefix_type,
-   const void *buf, size_t len);
-#define hex_asc(x) "0123456789abcdef"[x]
+const void *buf, size_t len);
 
 #define pr_emerg(fmt, arg...) \
printk(KERN_EMERG fmt, ##arg)
diff --git a/lib/hexdump.c b/lib/hexdump.c
index 3435465..df82012 100644
--- a/lib/hexdump.c
+++ b/lib/hexdump.c
@@ -12,18 +12,21 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 
+#define ROWSIZE ((size_t)16)
+#define MAX_PREFIX_LEN ((size_t)20)
+
 /**
  * hex_dump_to_buffer - convert a blob of data to hex ASCII in memory
  * @buf: data blob to dump
  * @len: number of bytes in the @buf
- * @rowsize: number of bytes to print per line; must be 16 or 32
+ * @rowsize: maximum number of bytes to output (aligns ascii)
  * @groupsize: number of bytes to print at a time (1, 2, 4, 8; default = 1)
  * @linebuf: where to put the converted data
  * @linebuflen: total size of @linebuf, including space for terminating NUL
  * @ascii: include ASCII after the hex output
  *
  * hex_dump_to_buffer() works on one line of output at a time, i.e.,
- * 16 or 32 bytes of input data converted to hex + ASCII output.
+ * input data converted to hex + ASCII output.
  *
  * Given a buffer of u8 data, hex_dump_to_buffer() converts the input data
  * to a 

[PATCH 2/3] [POWERPC] fsl_soc: add support for gianfar for fixed-link property

2007-12-06 Thread Vitaly Bordug

fixed-link says: register a new fixed/emulated PHY, i.e. a PHY that is
not connected to the real MDIO bus.
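
As a rough illustration of the five-cell fixed-link layout (phy id, duplex, speed, pause, asym_pause), the following userspace sketch decodes the cells into a status structure. The struct, the function name and the range checks are assumptions for the example; the patch itself reads the property without checking its length, and the sketch assumes the cells are already in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the fields the patch fills in. */
struct fixed_link_status {
    int link;
    int duplex;      /* 0 = half, 1 = full */
    int speed;       /* 10, 100 or 1000 */
    int pause;
    int asym_pause;
};

#define FIXED_LINK_CELLS 5

/*
 * Decode the five cells <a b c d e> of a fixed-link property.
 * Returns 0 on success, -1 if the cell count or a value is out of range.
 */
static int parse_fixed_link(const uint32_t *cells, size_t ncells,
                            uint32_t *phy_id, struct fixed_link_status *st)
{
    assert(cells != NULL && phy_id != NULL && st != NULL);

    if (ncells != FIXED_LINK_CELLS)
        return -1;
    if (cells[1] > 1 || cells[3] > 1 || cells[4] > 1)
        return -1;
    if (cells[2] != 10 && cells[2] != 100 && cells[2] != 1000)
        return -1;

    *phy_id = cells[0];
    st->link = 1;                     /* a fixed link is always up */
    st->duplex = (int)cells[1];
    st->speed = (int)cells[2];
    st->pause = (int)cells[3];
    st->asym_pause = (int)cells[4];
    return 0;
}
```

For the cells 1 1 1000 0 0 this yields phy id 1, full duplex, 1000 Mbit/s and no pause; a property with only four cells is rejected instead of being read past its end.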

Signed-off-by: Vitaly Bordug [EMAIL PROTECTED]
Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]

---

 Documentation/powerpc/booting-without-of.txt |4 +
 arch/powerpc/sysdev/fsl_soc.c|   79 --
 2 files changed, 66 insertions(+), 17 deletions(-)


diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt
index e9a3cb1..9dfd308 100644
--- a/Documentation/powerpc/booting-without-of.txt
+++ b/Documentation/powerpc/booting-without-of.txt
@@ -1254,6 +1254,10 @@ platforms are moved over to use the flattened-device-tree model.
   services interrupts for this device.
 - phy-handle : The phandle for the PHY connected to this ethernet
   controller.
+- fixed-link : <a b c d e> where a is emulated phy id - choose any,
+  but unique to the all specified fixed-links, b is duplex - 0 half,
+  1 full, c is link speed - d#10/d#100/d#1000, d is pause - 0 no
+  pause, 1 pause, e is asym_pause - 0 no asym_pause, 1 asym_pause.
 
   Recommended properties:
 
diff --git a/arch/powerpc/sysdev/fsl_soc.c b/arch/powerpc/sysdev/fsl_soc.c
index 3ace747..a008e32 100644
--- a/arch/powerpc/sysdev/fsl_soc.c
+++ b/arch/powerpc/sysdev/fsl_soc.c
@@ -24,6 +24,7 @@
 #include <linux/platform_device.h>
 #include <linux/of_platform.h>
 #include <linux/phy.h>
+#include <linux/phy_fixed.h>
 #include <linux/spi/spi.h>
 #include <linux/fsl_devices.h>
 #include <linux/fs_enet_pd.h>
@@ -130,6 +131,37 @@ u32 get_baudrate(void)
 EXPORT_SYMBOL(get_baudrate);
 #endif /* CONFIG_CPM2 */
 
+#ifdef CONFIG_FIXED_PHY
+static int __init of_add_fixed_phys(void)
+{
+   int ret;
+   struct device_node *np;
+   u32 *fixed_link;
+   struct fixed_phy_status status = {};
+
+   for_each_node_by_name(np, "ethernet") {
+   fixed_link  = (u32 *)of_get_property(np, "fixed-link", NULL);
+   if (!fixed_link)
+   continue;
+
+   status.link = 1;
+   status.duplex = fixed_link[1];
+   status.speed = fixed_link[2];
+   status.pause = fixed_link[3];
+   status.asym_pause = fixed_link[4];
+
+   ret = fixed_phy_add(PHY_POLL, fixed_link[0], &status);
+   if (ret) {
+   of_node_put(np);
+   return ret;
+   }
+   }
+
+   return 0;
+}
+arch_initcall(of_add_fixed_phys);
+#endif /* CONFIG_FIXED_PHY */
+
 static int __init gfar_mdio_of_init(void)
 {
struct device_node *np;
@@ -193,7 +225,6 @@ static const char *gfar_tx_intr = "tx";
 static const char *gfar_rx_intr = "rx";
 static const char *gfar_err_intr = "error";
 
-
 static int __init gfar_of_init(void)
 {
struct device_node *np;
@@ -277,29 +308,43 @@ static int __init gfar_of_init(void)
gfar_data.interface = PHY_INTERFACE_MODE_MII;
 
ph = of_get_property(np, "phy-handle", NULL);
-   phy = of_find_node_by_phandle(*ph);
+   if (ph == NULL) {
+   u32 *fixed_link;
 
-   if (phy == NULL) {
-   ret = -ENODEV;
-   goto unreg;
-   }
+   fixed_link = (u32 *)of_get_property(np, "fixed-link",
+  NULL);
+   if (!fixed_link) {
+   ret = -ENODEV;
+   goto unreg;
+   }
 
-   mdio = of_get_parent(phy);
+   gfar_data.bus_id = 0;
+   gfar_data.phy_id = fixed_link[0];
+   } else {
+   phy = of_find_node_by_phandle(*ph);
+
+   if (phy == NULL) {
+   ret = -ENODEV;
+   goto unreg;
+   }
+
+   mdio = of_get_parent(phy);
+
+   id = of_get_property(phy, "reg", NULL);
+   ret = of_address_to_resource(mdio, 0, &res);
+   if (ret) {
+   of_node_put(phy);
+   of_node_put(mdio);
+   goto unreg;
+   }
+
+   gfar_data.phy_id = *id;
+   gfar_data.bus_id = res.start;
 
-   id = of_get_property(phy, "reg", NULL);
-   ret = of_address_to_resource(mdio, 0, &res);
-   if (ret) {
of_node_put(phy);
of_node_put(mdio);
-   goto unreg;
}
 
-   gfar_data.phy_id = *id;
-   gfar_data.bus_id = res.start;
-
-   of_node_put(phy);
-   of_node_put(mdio);
-
ret =

[PATCH 3/3] [POWERPC] MPC8349E-mITX: Vitesse 7385 PHY is not connected to the MDIO bus

2007-12-06 Thread Vitaly Bordug

...thus use fixed-link to register proper Fixed PHY

Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
Signed-off-by: Vitaly Bordug [EMAIL PROTECTED]

---

 arch/powerpc/boot/dts/mpc8349emitx.dts |   11 ++-
 1 files changed, 2 insertions(+), 9 deletions(-)


diff --git a/arch/powerpc/boot/dts/mpc8349emitx.dts b/arch/powerpc/boot/dts/mpc8349emitx.dts
index 5072f6d..877ee6d 100644
--- a/arch/powerpc/boot/dts/mpc8349emitx.dts
+++ b/arch/powerpc/boot/dts/mpc8349emitx.dts
@@ -115,14 +115,6 @@
reg = <1c>;
device_type = "ethernet-phy";
};
-
-   /* Vitesse 7385 */
-   phy1f: [EMAIL PROTECTED] {
-   interrupt-parent = < &ipic >;
-   interrupts = <12 8>;
-   reg = <1f>;
-   device_type = "ethernet-phy";
-   };
};
 
[EMAIL PROTECTED] {
@@ -159,7 +151,8 @@
local-mac-address = [ 00 00 00 00 00 00 ];
interrupts = <23 8 24 8 25 8>;
interrupt-parent = < &ipic >;
-   phy-handle = < &phy1f >;
+   /* Vitesse 7385 isn't on the MDIO bus */
+   fixed-link = <1 1 d#1000 0 0>;
linux,network-index = 1;
};
 

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] [NET] phy/fixed.c: rework to not duplicate PHY layer functionality

2007-12-06 Thread Vitaly Bordug

With this patch, fixed.c now fully emulates an MDIO bus, so there is no
need to duplicate PHY layer functionality. That, in turn, drastically
simplifies the code and reduces the line count.

As an additional bonus, there is no longer any need to register an MDIO
bus for each PHY: all emulated PHYs are placed on the platform fixed MDIO
bus. There is also no more need to pre-allocate PHYs via a .config option;
this is all now handled dynamically.


Signed-off-by: Anton Vorontsov [EMAIL PROTECTED]
Signed-off-by: Vitaly Bordug [EMAIL PROTECTED]
Acked-by: Jeff Garzik [EMAIL PROTECTED]

---

 drivers/net/phy/Kconfig   |   32 +--
 drivers/net/phy/fixed.c   |  445 +
 include/linux/phy_fixed.h |   51 ++---
 3 files changed, 195 insertions(+), 333 deletions(-)


diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index 54b2ba9..7fe03ce 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -61,34 +61,12 @@ config ICPLUS_PHY
  Currently supports the IP175C PHY.
 
 config FIXED_PHY
-   tristate "Drivers for PHY emulation on fixed speed/link"
+   bool "Driver for MDIO Bus/PHY emulation with fixed speed/link PHYs"
---help---
- Adds the driver to PHY layer to cover the boards that do not have any PHY bound,
- but with the ability to manipulate the speed/link in software. The relevant MII
- speed/duplex parameters could be effectively handled in a user-specified function.
- Currently tested with mpc866ads.
-
-config FIXED_MII_10_FDX
-   bool "Emulation for 10M Fdx fixed PHY behavior"
-   depends on FIXED_PHY
-
-config FIXED_MII_100_FDX
-   bool "Emulation for 100M Fdx fixed PHY behavior"
-   depends on FIXED_PHY
-
-config FIXED_MII_1000_FDX
-   bool "Emulation for 1000M Fdx fixed PHY behavior"
-   depends on FIXED_PHY
-
-config FIXED_MII_AMNT
-int "Number of emulated PHYs to allocate "
-depends on FIXED_PHY
-default 1
----help---
-Sometimes it is required to have several independent emulated
-PHYs on the bus (in case of multi-eth but phy-less HW for instance).
-This control will have specified number allocated for each fixed
-PHY type enabled.
+ Adds the platform fixed MDIO Bus to cover the boards that use
+ PHYs that are not connected to the real MDIO bus.
+
+ Currently tested with mpc866ads and mpc8349e-mitx.
 
 config MDIO_BITBANG
tristate "Support for bitbanged MDIO buses"
diff --git a/drivers/net/phy/fixed.c b/drivers/net/phy/fixed.c
index 5619182..73b6d39 100644
--- a/drivers/net/phy/fixed.c
+++ b/drivers/net/phy/fixed.c
@@ -1,362 +1,253 @@
 /*
- * drivers/net/phy/fixed.c
+ * Fixed MDIO bus (MDIO bus emulation with fixed PHYs)
  *
- * Driver for fixed PHYs, when transceiver is able to operate in one fixed mode.
+ * Author: Vitaly Bordug [EMAIL PROTECTED]
+ * Anton Vorontsov [EMAIL PROTECTED]
  *
- * Author: Vitaly Bordug
- *
- * Copyright (c) 2006 MontaVista Software, Inc.
+ * Copyright (c) 2006-2007 MontaVista Software, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
  * Free Software Foundation;  either version 2 of the  License, or (at your
  * option) any later version.
- *
  */
+
 #include <linux/kernel.h>
-#include <linux/string.h>
-#include <linux/errno.h>
-#include <linux/unistd.h>
-#include <linux/slab.h>
-#include <linux/interrupt.h>
-#include <linux/init.h>
-#include <linux/delay.h>
-#include <linux/netdevice.h>
-#include <linux/etherdevice.h>
-#include <linux/skbuff.h>
-#include <linux/spinlock.h>
-#include <linux/mm.h>
 #include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/list.h>
 #include <linux/mii.h>
-#include <linux/ethtool.h>
 #include <linux/phy.h>
 #include <linux/phy_fixed.h>
 
-#include <asm/io.h>
-#include <asm/irq.h>
-#include <asm/uaccess.h>
+#define MII_REGS_NUM 29
 
-/* we need to track the allocated pointers in order to free them on exit */
-static struct fixed_info *fixed_phy_ptrs[CONFIG_FIXED_MII_AMNT*MAX_PHY_AMNT];
-
-/*-
- *  If something weird is required to be done with link/speed,
- * network driver is able to assign a function to implement this.
- * May be useful for PHY's that need to be software-driven.
- *-*/
-int fixed_mdio_set_link_update(struct phy_device *phydev,
-  int (*link_update) (struct net_device *,
-  struct fixed_phy_status *))
-{
-   struct fixed_info *fixed;
-
-   if (link_update == NULL)
-   return -EINVAL;
-
-   if (phydev) {
-   if (phydev->bus) {
-   fixed = phydev->bus->priv;
-   fixed->link_update = link_update;
-   return 0;
-

Re: [PATCH 0/2] cxgb3 - driver update

2007-12-06 Thread Divy Le Ray

Divy Le Ray wrote:

Jeff,

I'm submitting a patch series for inclusion in 2.6.25.
The patches are built against netdev#upstream.

Here is a brief description:
- Update GPIO pinning and MAC support for T3C adapters
- Enable parity error detection.

Jeff,

I posted a third patch to fix the EEH code and add a missing
softirq blocking call.

Cheers,
Divy


[PATCH 3/2] cxgb3 - Fix EEH, missing softirq blocking

2007-12-06 Thread Divy Le Ray
From: Divy Le Ray [EMAIL PROTECTED]

set_pci_drvdata() stores a pointer to the adapter,
not the net device.
Add missing softirq blocking in t3_mgmt_tx.
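
The first half of the fix comes down to reading driver data back as the same type that was stored. A minimal userspace sketch of that contract follows; struct pci_dev_sketch and the helper names are stand-ins invented for the example, not the PCI core API.

```c
#include <assert.h>
#include <stddef.h>

struct adapter {
    int nports;
};

/* Stand-in for struct pci_dev: only the opaque driver-data slot. */
struct pci_dev_sketch {
    void *drvdata;
};

static void set_drvdata(struct pci_dev_sketch *pdev, void *data)
{
    pdev->drvdata = data;
}

static void *get_drvdata(const struct pci_dev_sketch *pdev)
{
    return pdev->drvdata;
}

/*
 * The error handlers must interpret the slot as what probe stored.
 * Treating it as a net device and then following a pointer from it to
 * reach the adapter would read unrelated memory, because a void * carries
 * no type information for the compiler to check.
 */
static struct adapter *adapter_from_pdev(const struct pci_dev_sketch *pdev)
{
    struct adapter *adap = get_drvdata(pdev);

    assert(adap != NULL);
    return adap;
}
```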

Signed-off-by: Divy Le Ray [EMAIL PROTECTED]
---

 drivers/net/cxgb3/cxgb3_main.c |   14 --
 drivers/net/cxgb3/sge.c|7 ++-
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index d1aa777..0e3dcbf 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -2408,9 +2408,7 @@ void t3_fatal_err(struct adapter *adapter)
 static pci_ers_result_t t3_io_error_detected(struct pci_dev *pdev,
 pci_channel_state_t state)
 {
-   struct net_device *dev = pci_get_drvdata(pdev);
-   struct port_info *pi = netdev_priv(dev);
-   struct adapter *adapter = pi->adapter;
+   struct adapter *adapter = pci_get_drvdata(pdev);
int i;
 
/* Stop all ports */
@@ -2444,9 +2442,7 @@ static pci_ers_result_t t3_io_error_detected(struct pci_dev *pdev,
  */
 static pci_ers_result_t t3_io_slot_reset(struct pci_dev *pdev)
 {
-   struct net_device *dev = pci_get_drvdata(pdev);
-   struct port_info *pi = netdev_priv(dev);
-   struct adapter *adapter = pi->adapter;
+   struct adapter *adapter = pci_get_drvdata(pdev);
 
if (pci_enable_device(pdev)) {
dev_err(&pdev->dev,
@@ -2469,9 +2465,7 @@ static pci_ers_result_t t3_io_slot_reset(struct pci_dev *pdev)
  */
 static void t3_io_resume(struct pci_dev *pdev)
 {
-   struct net_device *dev = pci_get_drvdata(pdev);
-   struct port_info *pi = netdev_priv(dev);
-   struct adapter *adapter = pi->adapter;
+   struct adapter *adapter = pci_get_drvdata(pdev);
int i;
 
/* Restart the ports */
@@ -2491,7 +2485,7 @@ static void t3_io_resume(struct pci_dev *pdev)
 
if (is_offload(adapter)) {
__set_bit(OFFLOAD_DEVMAP_BIT, &adapter->registered_device_map);
-   if (offload_open(dev))
+   if (offload_open(adapter->port[0]))
printk(KERN_WARNING
   "Could not bring back offload capabilities\n");
}
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index cef153d..6367cee 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1364,7 +1364,12 @@ static void restart_ctrlq(unsigned long data)
  */
 int t3_mgmt_tx(struct adapter *adap, struct sk_buff *skb)
 {
-   return ctrl_xmit(adap, &adap->sge.qs[0].txq[TXQ_CTRL], skb);
+   int ret;
+   local_bh_disable();
+   ret = ctrl_xmit(adap, &adap->sge.qs[0].txq[TXQ_CTRL], skb);
+   local_bh_enable();
+
+   return ret;
 }
 
 /**


Re: [RFC] TCP illinois max rtt aging

2007-12-06 Thread Lachlan Andrew
Greetings Ilpo,

On 04/12/2007, Ilpo Järvinen [EMAIL PROTECTED] wrote:
 On Mon, 3 Dec 2007, Lachlan Andrew wrote:
 
  When SACK is active, the per-packet processing becomes more involved,
  tracking the list of lost/SACKed packets.  This causes a CPU spike
  just after a loss, which increases the RTTs, at least in my
  experience.

 I suspect that as long as old code was able to use hint, it wasn't doing
 that bad. But it was seriously lacking ability to take advantage of sack
 processing hint when e.g., a new hole appeared, or cumulative ACK arrived.

 ...Code available in net-2.6.25 might cure those.

We had been using one of your earlier patches, and still had the
problem.  I think you've cured the problem with SACK itself, but there
still seems to be something taking a lot of CPU while recovering from
the loss.  It is possible that it had to do with web100, which we
have also been running, but I cut out most of the statistics from that
and still had problems.

Cheers,
Lachlan

-- 
Lachlan Andrew  Dept of Computer Science, Caltech
1200 E California Blvd, Mail Code 256-80, Pasadena CA 91125, USA
Ph: +1 (626) 395-8820Fax: +1 (626) 568-3603
http://netlab.caltech.edu/~lachlan


Re: [Patch] net/xfrm/xfrm_policy.c: Some small improvements

2007-12-06 Thread David Miller
From: Richard Knutsson [EMAIL PROTECTED]
Date: Thu, 06 Dec 2007 15:37:46 +0100

 David Miller wrote:
  But this time I'll just let you know up front that I
  don't see much value in this patch.  It is not a clear
  improvement to replace int's with bool's in my mind and
  the other changes are just whitespace changes.

 Is it not an improvement to distinguish booleans from actual values? Do you 
 use integers for ASCII characters too? It can also avoid some potential 
 bugs like the 'if (i == TRUE)'...
 What is wrong with 'size_t' (since it is unsigned, compared to (some) 
 'int')?

When you say int found; is there any doubt in your mind that
this integer is going to hold a 1 or a 0 depending upon whether
we found something?

That's the problem I have with these kinds of patches, they do
not increase clarity, it's just pure mindless edits.

In new code, fine, use booleans if you want.

I would even accept that it helps to change to boolean for
arguments to functions that are global in scope.

But not for function local variables in cases like this.
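
For context on the 'if (i == TRUE)' concern raised in the quoted mail, a small sketch of the difference between an int flag and a bool. The helper names are made up for the example; it is neutral on whether converting existing local variables is worth the churn.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef TRUE
#define TRUE 1
#endif

/* An int "flag" can carry any nonzero value, here the masked bit itself. */
static int has_bit_int(unsigned int flags, unsigned int mask)
{
    return (int)(flags & mask);
}

/* Conversion to bool normalises every nonzero value to 1. */
static bool has_bit_bool(unsigned int flags, unsigned int mask)
{
    return flags & mask;
}
```

With flags 0x4 and mask 0x4, has_bit_int() returns 4, so a plain truth test passes but a comparison against TRUE fails; has_bit_bool() returns true, which compares equal to TRUE. A local "int found" that is only ever assigned 0 or 1 does not run into this, which is the case the reply is about.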


Re: sockets affected by IPsec always block (2.6.23)

2007-12-06 Thread David Miller
From: Stefan Rompf [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 15:31:53 +0100

 as far as I've understood Herbert's patch, at least TCP connect can be fixed 
 so that non blocking connect() will neither fail nor block, but just use the 
 first or second retransmission of the SYN packet to complete the handshake 
 after IPSEC is up.

If IPSEC takes a long time to resolve, and we don't block, the
connect() can hard fail (we will just keep dropping the outgoing SYN
packet send attempts, eventually hitting the retry limit) in cases
where if we did block it would not fail (because we wouldn't send
the first SYN until IPSEC resolved).
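
A back-of-the-envelope sketch of the retry limit mentioned above: if every SYN is dropped, the time until connect() fails is the sum of the doubling retransmission timeouts. The function name is invented for the example, and the values used in the note below (a 3 second initial timeout and 5 SYN retries) are assumptions about typical defaults of that era, not figures taken from this thread. No upper bound on the timeout is modelled.

```c
#include <assert.h>

/*
 * Seconds until a connect() gives up when every SYN is dropped: the
 * initial SYN plus `retries` retransmissions, with the timeout doubling
 * after each attempt.
 */
static unsigned int syn_give_up_after(unsigned int initial_rto,
                                      unsigned int retries)
{
    unsigned int total = 0;
    unsigned int rto = initial_rto;
    unsigned int i;

    assert(retries < 16);   /* keep the doubling well away from overflow */

    for (i = 0; i <= retries; i++) {
        total += rto;
        rto *= 2;
    }
    return total;
}
```

Under those assumed values, syn_give_up_after(3, 5) is 3 + 6 + 12 + 24 + 48 + 96 = 189 seconds, so an IPsec negotiation that takes longer than roughly three minutes would make a non-blocking connect fail where a blocking one would have waited.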


[PATCH] [IPv4] Add strict check for replying net unreachable message

2007-12-06 Thread Mitsuru Chinen
The patch `Reply net unreachable ICMP message' had a bug:
a route whose type is blackhole or prohibit is treated as an
unreachable route. err should be set to -ENETUNREACH only when
no route is found in the routing table.
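
The rule in the diff below can be read in isolation as a small mapping, sketched here in userspace C. no_route_err is a hypothetical helper name; only the -ESRCH case is rewritten, and every other lookup result is passed through untouched.

```c
#include <assert.h>
#include <errno.h>

/*
 * Translate the result of a failed route lookup. Only "nothing matched"
 * (-ESRCH) becomes -ENETUNREACH; errors that come from an explicit route
 * type keep their original value.
 */
static int no_route_err(int lookup_err)
{
    assert(lookup_err <= 0);    /* kernel-style negative errno */

    if (lookup_err == -ESRCH)
        return -ENETUNREACH;
    return lookup_err;
}
```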

Signed-off-by: Mitsuru Chinen [EMAIL PROTECTED]
---
 net/ipv4/route.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8a79f74..d2bc614 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1882,7 +1882,8 @@ no_route:
RT_CACHE_STAT_INC(in_no_route);
spec_dst = inet_select_addr(dev, 0, RT_SCOPE_UNIVERSE);
res.type = RTN_UNREACHABLE;
-   err = -ENETUNREACH;
+   if (err == -ESRCH)
+   err = -ENETUNREACH;
goto local_input;
 
/*
-- 
1.5.3.4



Re: [PATCH] [IPv4] Reply net unreachable ICMP message

2007-12-06 Thread Mitsuru Chinen
On Thu, 6 Dec 2007 09:47:33 +0100
Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 06-12-2007 09:14, Mitsuru Chinen wrote:
  On Thu, 6 Dec 2007 08:49:47 +0100
  Jarek Poplawski [EMAIL PROTECTED] wrote:
  
  On 06-12-2007 07:31, Mitsuru Chinen wrote:
  IPv4 stack doesn't reply any ICMP destination unreachable message
  with net unreachable code when IP detagrams are being discarded
  because of no route could be found in the forwarding path.
  Incidentally, IPv6 stack replies such ICMPv6 message in the similar
  situation.
 ...
  This patch seems to be wrong. It overrides err codes from
  fib_lookup, where such decisions should be made.
  
  fib_lookup() replies -ESRCH in this situation.
  It is necessary to override the variable by the suitable error
  number like the code under e_hostunreach label.
 
 Probably I miss something, but I can't see how can you be sure it's
 only -ESRCH possible here? Isn't opt-action() in fib_rules_lookup()
 supposed to return this -ENETUNREACH when needed?

Oh, excuse me. I made a mistake.
fib_rules_lookup() returns -ESRCH when no route is found. It returns
-ENETUNREACH only when the user has added an unreachable route.
However, if the err value is overridden without a check, a blackhole
or prohibit route is treated as an unreachable route.
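
Why the unchecked override matters becomes clearer when looking at how the stored error selects the ICMP reply. The sketch below follows the switch in ip_error() quoted in the original patch; the ICMP code values are the standard destination-unreachable codes written out locally, ICMP_NO_REPLY is a marker invented for the example, and the statement that blackhole and prohibit routes carry EINVAL and EACCES is my reading of the routing code rather than something stated in this thread.

```c
#include <assert.h>
#include <errno.h>

/* ICMP destination-unreachable codes, defined locally for the sketch. */
#define ICMP_NET_UNREACH    0
#define ICMP_HOST_UNREACH   1
#define ICMP_PKT_FILTERED   13
#define ICMP_NO_REPLY      (-1)  /* hypothetical marker: drop silently */

/* Pick the ICMP code for a route's stored (positive) error number. */
static int icmp_code_for(int route_err)
{
    assert(route_err >= 0);

    switch (route_err) {
    case EHOSTUNREACH:
        return ICMP_HOST_UNREACH;
    case ENETUNREACH:
        return ICMP_NET_UNREACH;
    case EACCES:
        return ICMP_PKT_FILTERED;
    default:
        return ICMP_NO_REPLY;
    }
}
```

If err is forced to ENETUNREACH for every lookup failure, a prohibit route would answer with "net unreachable" instead of "packet filtered", and a blackhole route would start sending replies at all.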

As the patch is already applied, I will send another patch to add
a check for it.
Thank you very much for pointing out the issue!

Best Regards,

Mitsuru Chinen [EMAIL PROTECTED]


Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage

2007-12-06 Thread David Woodhouse
On Mon, 2007-11-26 at 15:17 -0700, Eric W. Biederman wrote:
 Well I clearly goofed when I added the initial network namespace support
 for /proc/net.  Currently things work but there are odd details visible
 to user space, even when we have a single network namespace.
 
 Since we do not cache proc_dir_entry dentries at the moment we can
 just modify ->lookup to return a different directory inode depending
 on the network namespace of the process looking at /proc/net, replacing
 the current technique of using a magic and fragile follow_link method.
 
 To accomplish that this patch:
 - introduces a shadow_proc method to allow different dentries to
   be returned from proc_lookup.
 - Removes the old /proc/net follow_link magic
 - Fixes a weakness in our not caching of proc generic dentries.
 
 As shadow_proc uses a task struct to decide which dentry to return we
 can go back later and fix the proc generic caching without modifying any code
 that uses the shadow_proc method.
 
 Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
 ---
  fs/proc/generic.c   |   12 ++-
  fs/proc/proc_net.c  |   86 
 +++
  include/linux/proc_fs.h |3 ++
  3 files changed, 19 insertions(+), 82 deletions(-)

(commit 2b1e300a9dfc3196ccddf6f1d74b91b7af55e416)

This seems to have broken the use of /proc/bus/usb as a mountpoint. It
always appears empty now, whatever's supposed to be mounted there.

-- 
dwmw2



Re: [PATCH][VLAN] Lost rtnl_unlock() in vlan_ioctl()

2007-12-06 Thread David Miller
From: Patrick McHardy [EMAIL PROTECTED]
Date: Thu, 06 Dec 2007 14:59:24 +0100

 Pavel Emelyanov wrote:
  The SET_VLAN_NAME_TYPE_CMD command w/o CAP_NET_ADMIN capability
  doesn't release the rtnl lock.
 
 
 Thanks Pavel. I somehow recall that we already fixed this
 one, but can't find the patch :) Dave, please apply.

Applied and I'll push to -stable once Linus pulls it in.
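
For readers who have not seen this class of bug: an early return on a permission failure, taken while a lock is held, leaves the lock held forever. A userspace sketch of the usual single-exit pattern follows. The lock is a plain counter and all names are invented; this is not the actual vlan_ioctl() fix, which is not shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int lock_depth;  /* stand-in for the rtnl lock */

static void sketch_lock(void)
{
    lock_depth++;
}

static void sketch_unlock(void)
{
    assert(lock_depth > 0);
    lock_depth--;
}

/* Set a name type, but only for callers with the admin capability. */
static int sketch_ioctl(bool has_cap_net_admin, int *name_type, int new_type)
{
    int err = -EPERM;

    sketch_lock();
    if (!has_cap_net_admin)
        goto out;   /* a bare "return err" here would leak the lock */

    *name_type = new_type;
    err = 0;
out:
    sketch_unlock();
    return err;
}
```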


Re: TCP event tracking via netlink...

2007-12-06 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 6 Dec 2007 09:23:12 -0800

 Tools and scripts for testing that generate graphs are at:
   git://git.kernel.org/pub/scm/tcptest/tcptest

I know about this, I'm just curious what exactly Ilpo is
using :-)


Re: [PATCHES 0/7]: DCCP patches for 2.6.25

2007-12-06 Thread David Miller
From: Arnaldo Carvalho de Melo [EMAIL PROTECTED]
Date: Thu,  6 Dec 2007 19:02:47 -0200

   Please consider pulling from:
 
 master.kernel.org:/pub/scm/linux/kernel/git/acme/net-2.6.25

Pulled and pushed out to net-2.6.25, thanks!


Re: [PATCH] [IPv4] Add strict check for replying net unreachable message

2007-12-06 Thread David Miller
From: Mitsuru Chinen [EMAIL PROTECTED]
Date: Fri, 7 Dec 2007 13:24:18 +0900

 The patch `Reply net unreachable ICMP message' had a bug.
 A route whose type is blackhole or prohibit type is treated as
 unreachable type. The case where err is set to ENETUNREACH should
 be that no route is found in the routing table only.
 
 Signed-off-by: Mitsuru Chinen [EMAIL PROTECTED]

Applied, thanks.

I'll probably combine this with your original change before I
push these changes upstream.


Re: [SCTP] Bug fixes to the migrate/accept code path.

2007-12-06 Thread David Miller
From: Vlad Yasevich [EMAIL PROTECTED]
Date: Thu,  6 Dec 2007 12:48:22 -0500

 The following two patches fix some bugs in the SCTP accept code path.
 The first one fixes a slab corruption bug that we found during stress
 testing.  The second one is just a clean-up and the right way to do things.

Both patches applied to net-2.6, thanks Vlad!


[PATCH 5/8] bonding: Allow setting and querying xmit policy regardless of mode

2007-12-06 Thread Jay Vosburgh
From: Wagner Ferenc [EMAIL PROTECTED]

For consistency with the behaviour of the arp_ip_target option,
let /sys/class/net/bond0/bonding/xmit_hash_policy accept and report
current policy even if the bonding mode in effect does not use it.

Signed-off-by: Ferenc Wagner [EMAIL PROTECTED]
Acked-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_sysfs.c |   21 +++--
 1 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 9de2c52..11b76b3 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -456,17 +456,11 @@ static ssize_t bonding_show_xmit_hash(struct device *d,
  struct device_attribute *attr,
  char *buf)
 {
-   int count = 0;
struct bonding *bond = to_bond(d);
 
-   if ((bond->params.mode == BOND_MODE_XOR) ||
-   (bond->params.mode == BOND_MODE_8023AD)) {
-   count = sprintf(buf, "%s %d\n",
-   xmit_hashtype_tbl[bond->params.xmit_policy].modename,
-   bond->params.xmit_policy);
-   }
-
-   return count;
+   return sprintf(buf, "%s %d\n",
+  xmit_hashtype_tbl[bond->params.xmit_policy].modename,
+  bond->params.xmit_policy);
 }
 
 static ssize_t bonding_store_xmit_hash(struct device *d,
@@ -484,15 +478,6 @@ static ssize_t bonding_store_xmit_hash(struct device *d,
goto out;
}
 
-   if ((bond->params.mode != BOND_MODE_XOR) &&
-   (bond->params.mode != BOND_MODE_8023AD)) {
-   printk(KERN_ERR DRV_NAME
-  "%s: Transmit hash policy is irrelevant in this mode.\n",
-  bond->dev->name);
-   ret = -EPERM;
-   goto out;
-   }
-
new_value = bond_parse_parm((char *)buf, xmit_hashtype_tbl);
if (new_value < 0)  {
printk(KERN_ERR DRV_NAME
-- 
1.5.3.4.206.g58ba4-dirty



[PATCH 2/8] bonding: Return nothing for not applicable values

2007-12-06 Thread Jay Vosburgh
From: Wagner Ferenc [EMAIL PROTECTED]

The previous code returned '\n' (that is, a single empty line)
from most files, with one exception (xmit_hash_policy), where
it returned 'NA\n'.  This patch consolidates each file to return
nothing at all if not applicable, not even a '\n'.

I find this behaviour more usual, more useful, more efficient
and shorter to code from both sides.

Signed-off-by: Ferenc Wagner [EMAIL PROTECTED]
Acked-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_sysfs.c |   25 -
 1 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index a3f1b4a..6bb91e2 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -455,14 +455,11 @@ static ssize_t bonding_show_xmit_hash(struct device *d,
  struct device_attribute *attr,
  char *buf)
 {
-   int count;
+   int count = 0;
struct bonding *bond = to_bond(d);
 
-   if ((bond->params.mode != BOND_MODE_XOR) &&
-   (bond->params.mode != BOND_MODE_8023AD)) {
-   // Not Applicable
-   count = sprintf(buf, "NA\n");
-   } else {
+   if ((bond->params.mode == BOND_MODE_XOR) ||
+   (bond->params.mode == BOND_MODE_8023AD)) {
count = sprintf(buf, "%s %d\n",
xmit_hashtype_tbl[bond->params.xmit_policy].modename,
bond->params.xmit_policy);
@@ -1079,8 +1076,6 @@ static ssize_t bonding_show_primary(struct device *d,
 
if (bond->primary_slave)
count = sprintf(buf, "%s\n", bond->primary_slave->dev->name);
-   else
-   count = sprintf(buf, "\n");
 
return count;
 }
@@ -1186,7 +1181,7 @@ static ssize_t bonding_show_active_slave(struct device *d,
 {
struct slave *curr;
struct bonding *bond = to_bond(d);
-   int count;
+   int count = 0;
 
read_lock(&bond->curr_slave_lock);
curr = bond->curr_active_slave;
@@ -1194,8 +1189,6 @@ static ssize_t bonding_show_active_slave(struct device *d,
 
if (USES_PRIMARY(bond->params.mode) && curr)
count = sprintf(buf, "%s\n", curr->dev->name);
-   else
-   count = sprintf(buf, "\n");
return count;
 }
 
@@ -1309,8 +1302,6 @@ static ssize_t bonding_show_ad_aggregator(struct device *d,
struct ad_info ad_info;
count = sprintf(buf, "%d\n", (bond_3ad_get_active_agg_info(bond, &ad_info)) ?  0 : ad_info.aggregator_id);
}
-   else
-   count = sprintf(buf, "\n");
 
return count;
 }
@@ -1331,8 +1322,6 @@ static ssize_t bonding_show_ad_num_ports(struct device *d,
struct ad_info ad_info;
count = sprintf(buf, "%d\n", (bond_3ad_get_active_agg_info(bond, &ad_info)) ?  0: ad_info.ports);
}
-   else
-   count = sprintf(buf, "\n");
 
return count;
 }
@@ -1353,8 +1342,6 @@ static ssize_t bonding_show_ad_actor_key(struct device *d,
struct ad_info ad_info;
count = sprintf(buf, "%d\n", (bond_3ad_get_active_agg_info(bond, &ad_info)) ?  0 : ad_info.actor_key);
}
-   else
-   count = sprintf(buf, "\n");
 
return count;
 }
@@ -1375,8 +1362,6 @@ static ssize_t bonding_show_ad_partner_key(struct device *d,
struct ad_info ad_info;
count = sprintf(buf, "%d\n", (bond_3ad_get_active_agg_info(bond, &ad_info)) ?  0 : ad_info.partner_key);
}
-   else
-   count = sprintf(buf, "\n");
 
return count;
 }
@@ -1401,8 +1386,6 @@ static ssize_t bonding_show_ad_partner_mac(struct device *d,
print_mac(mac, ad_info.partner_system));
}
}
-   else
-   count = sprintf(buf, "\n");
 
return count;
 }
-- 
1.5.3.4.206.g58ba4-dirty



[PATCH 1/8] bonding: Remove trailing NULs from sysfs interface.

2007-12-06 Thread Jay Vosburgh
From: Wagner Ferenc [EMAIL PROTECTED]

Also remove trailing spaces from multivalued files.

This fixes output like for example:

$ od -c /sys/class/net/bond0/bonding/slaves
0000000   e   t   h   -   l   e   f   t       e   t   h   -   r   i   g
0000020   h   t      \n  \0
0000025

It mostly entails deleting '+1'-s after sprintf() calls: the return value
of sprintf is the number of characters printed, without the closing NUL,
i.e. exactly what the sysfs interface requires.  The three multivalue
cases are different, because they also have to swallow back a trailing
space.
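
To illustrate the two points above, here is a minimal userspace sketch
(not part of the patch).  The names show_names() and
show_names_selftest() are made up for the example; the function only
mimics what the multivalue show routines do: append each name followed
by a space, then overwrite the last space with a newline, so that the
returned length covers neither a trailing space nor a NUL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for a multivalue sysfs "show" routine.
 * Returns the number of bytes the caller should report.
 */
size_t show_names(char *buf, const char *const *names, size_t n)
{
	size_t res = 0;
	size_t i;

	/* sprintf() returns the characters written, not counting the NUL */
	for (i = 0; i < n; i++)
		res += (size_t)sprintf(buf + res, "%s ", names[i]);

	if (res)
		buf[res - 1] = '\n';	/* eat the leftover space */

	return res;
}

/* Small self-check using the interface names from the od output above. */
void show_names_selftest(void)
{
	static const char *const names[] = { "eth-left", "eth-right" };
	char buf[64];
	size_t len = show_names(buf, names, 2);

	/* 19 bytes: no trailing space and no NUL in the reported length */
	assert(len == strlen("eth-left eth-right\n"));
	assert(memcmp(buf, "eth-left eth-right\n", len) == 0);

	/* "100\n" is four characters; the closing NUL is not counted */
	assert(sprintf(buf, "%d\n", 100) == 4);
}
```

With the old code the same two names produced 21 bytes (trailing
space, newline and NUL), which matches the od dump shown earlier.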

Signed-off-by: Ferenc Wagner [EMAIL PROTECTED]
Acked-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_sysfs.c |   66 +
 1 files changed, 30 insertions(+), 36 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index b29330d..a3f1b4a 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -86,14 +86,13 @@ static ssize_t bonding_show_bonds(struct class *cls, char *buffer)
 			/* not enough space for another interface name */
 			if ((PAGE_SIZE - res) > 10)
 				res = PAGE_SIZE - 10;
-			res += sprintf(buffer + res, "++more++");
+			res += sprintf(buffer + res, "++more++ ");
 			break;
 		}
 		res += sprintf(buffer + res, "%s ",
 			       bond->dev->name);
 	}
-	res += sprintf(buffer + res, "\n");
-	res++;
+	if (res) buffer[res-1] = '\n'; /* eat the leftover space */
 	up_read(&(bonding_rwsem));
 	return res;
 }
@@ -235,14 +234,13 @@ static ssize_t bonding_show_slaves(struct device *d,
 			/* not enough space for another interface name */
 			if ((PAGE_SIZE - res) > 10)
 				res = PAGE_SIZE - 10;
-			res += sprintf(buf + res, "++more++");
+			res += sprintf(buf + res, "++more++ ");
 			break;
 		}
 		res += sprintf(buf + res, "%s ", slave->dev->name);
 	}
 	read_unlock(&bond->lock);
-	res += sprintf(buf + res, "\n");
-	res++;
+	if (res) buf[res-1] = '\n'; /* eat the leftover space */
 	return res;
 }
 
@@ -406,7 +404,7 @@ static ssize_t bonding_show_mode(struct device *d,
 
 	return sprintf(buf, "%s %d\n",
 			bond_mode_tbl[bond->params.mode].modename,
-			bond->params.mode) + 1;
+			bond->params.mode);
 }
 
 static ssize_t bonding_store_mode(struct device *d,
@@ -463,11 +461,11 @@ static ssize_t bonding_show_xmit_hash(struct device *d,
 	if ((bond->params.mode != BOND_MODE_XOR) &&
 	    (bond->params.mode != BOND_MODE_8023AD)) {
 		// Not Applicable
-		count = sprintf(buf, "NA\n") + 1;
+		count = sprintf(buf, "NA\n");
 	} else {
 		count = sprintf(buf, "%s %d\n",
 			xmit_hashtype_tbl[bond->params.xmit_policy].modename,
-			bond->params.xmit_policy) + 1;
+			bond->params.xmit_policy);
 	}
 
 	return count;
@@ -527,7 +525,7 @@ static ssize_t bonding_show_arp_validate(struct device *d,
 
 	return sprintf(buf, "%s %d\n",
 		       arp_validate_tbl[bond->params.arp_validate].modename,
-		       bond->params.arp_validate) + 1;
+		       bond->params.arp_validate);
 }
 
 static ssize_t bonding_store_arp_validate(struct device *d,
@@ -627,7 +625,7 @@ static ssize_t bonding_show_arp_interval(struct device *d,
 {
 	struct bonding *bond = to_bond(d);
 
-	return sprintf(buf, "%d\n", bond->params.arp_interval) + 1;
+	return sprintf(buf, "%d\n", bond->params.arp_interval);
 }
 
 static ssize_t bonding_store_arp_interval(struct device *d,
@@ -711,10 +709,7 @@ static ssize_t bonding_show_arp_targets(struct device *d,
 			res += sprintf(buf + res, "%u.%u.%u.%u ",
 			       NIPQUAD(bond->params.arp_targets[i]));
 	}
-	if (res)
-		res--;  /* eat the leftover space */
-	res += sprintf(buf + res, "\n");
-	res++;
+	if (res) buf[res-1] = '\n'; /* eat the leftover space */
 	return res;
 }
 
@@ -815,7 +810,7 @@ static ssize_t bonding_show_downdelay(struct device *d,
 {
 	struct bonding *bond = to_bond(d);
 
-	return sprintf(buf, "%d\n", bond->params.downdelay * bond->params.miimon) + 1;
+	return sprintf(buf, "%d\n", bond->params.downdelay * bond->params.miimon);
 }
 
 static ssize_t bonding_store_downdelay(struct device *d,
@@ -872,7 +867,7 @@ static ssize_t bonding_show_updelay(struct device *d,
 {
 	struct bonding *bond = to_bond(d);
 
-	return sprintf(buf, "%d\n", bond->params.updelay * 

[PATCH 0/8] bonding: Several fixes, new hash mode

2007-12-06 Thread Jay Vosburgh
Patch series to fix some bugs, fix coding style, and add a new
hash mode for balance-xor/802.3ad modes.

Jeff: please apply to upstream.

Patch 8 should arguably go into 2.6.24, as it fixes a bug in the
locking fixes added there that can cause an oops; is it too late for that?

[PATCH 1/8] bonding: Remove trailing NULs from sysfs interface.
[PATCH 2/8] bonding: Return nothing for not applicable values
[PATCH 3/8] bonding: Purely cosmetic: rename a local variable
[PATCH 4/8] bonding: Coding style: break line after the if condition
[PATCH 5/8] bonding: Allow setting and querying xmit policy regardless of mode
[PATCH 6/8] bonding: Fix time comparison
[PATCH 7/8] bonding: Add new layer2+3 hash for xor/802.3ad modes
[PATCH 8/8] bonding: Fix race at module unload

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


[PATCH 6/8] bonding: Fix time comparison

2007-12-06 Thread Jay Vosburgh
From: David Sterba [EMAIL PROTECTED]

Use macros for comparing jiffies. Jiffies' wrap caused missed events and hangs.
Module reinsert was needed to make bonding work again.
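
For readers unfamiliar with the macros, here is a minimal userspace
sketch of why wrap-aware comparison matters.  time_after_eq() and
time_before_eq() below are simplified stand-ins for the macros in
linux/jiffies.h (the kernel versions also type-check their arguments);
naive_expired(), wrapsafe_expired() and jiffies_selftest() are
hypothetical helpers.  The naive variant is a generic illustration of
the problem, not the exact expression this patch removes.

```c
#include <assert.h>
#include <limits.h>

/*
 * Simplified stand-ins for the kernel macros.  The signed cast relies on
 * two's complement conversion, as on every platform Linux supports.
 */
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)	time_after_eq(b, a)

/* Compares absolute counter values; wrong once last + delta wraps. */
int naive_expired(unsigned long now, unsigned long last, unsigned long delta)
{
	return now >= last + delta;
}

/* Compares the signed difference, so a wrapped deadline is still correct. */
int wrapsafe_expired(unsigned long now, unsigned long last, unsigned long delta)
{
	return time_after_eq(now, last + delta);
}

void jiffies_selftest(void)
{
	unsigned long last = ULONG_MAX - 5;	/* just before the wrap */
	unsigned long delta = 10;		/* last + delta wraps to 4 */
	unsigned long now = ULONG_MAX - 2;	/* only 3 ticks after last */

	assert(naive_expired(now, last, delta) == 1);	/* false positive */
	assert(wrapsafe_expired(now, last, delta) == 0);
	assert(time_before_eq(now, last + delta));

	now = 6;				/* 12 ticks after last */
	assert(wrapsafe_expired(now, last, delta) == 1);
}
```

The signed-difference trick works as long as the two timestamps are
less than half the counter range apart.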

Signed-off-by: David Sterba [EMAIL PROTECTED]
Acked-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_main.c |   25 +
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 423298c..e4a4714 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -74,6 +74,7 @@
 #include <linux/ethtool.h>
 #include <linux/if_vlan.h>
 #include <linux/if_bonding.h>
+#include <linux/jiffies.h>
 #include <net/route.h>
 #include <net/net_namespace.h>
 #include "bonding.h"
@@ -2722,8 +2723,8 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 */
 	bond_for_each_slave(bond, slave, i) {
 		if (slave->link != BOND_LINK_UP) {
-			if (((jiffies - slave->dev->trans_start) <= delta_in_ticks) &&
-			    ((jiffies - slave->dev->last_rx) <= delta_in_ticks)) {
+			if (time_before_eq(jiffies, slave->dev->trans_start + delta_in_ticks) &&
+			    time_before_eq(jiffies, slave->dev->last_rx + delta_in_ticks)) {
 
 				slave->link  = BOND_LINK_UP;
 				slave->state = BOND_STATE_ACTIVE;
@@ -2754,8 +2755,8 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 			 * when the source ip is 0, so don't take the link down
 			 * if we don't know our ip yet
 			 */
-			if (((jiffies - slave->dev->trans_start) >= (2*delta_in_ticks)) ||
-			    (((jiffies - slave->dev->last_rx) >= (2*delta_in_ticks)) &&
+			if (time_after_eq(jiffies, slave->dev->trans_start + 2*delta_in_ticks) ||
+			    (time_after_eq(jiffies, slave->dev->last_rx + 2*delta_in_ticks) &&
 			     bond_has_ip(bond))) {
 
 				slave->link  = BOND_LINK_DOWN;
@@ -2848,8 +2849,8 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 */
 	bond_for_each_slave(bond, slave, i) {
 		if (slave->link != BOND_LINK_UP) {
-			if ((jiffies - slave_last_rx(bond, slave)) <=
-			     delta_in_ticks) {
+			if (time_before_eq(jiffies,
+			    slave_last_rx(bond, slave) + delta_in_ticks)) {
 
 				slave->link = BOND_LINK_UP;
 
@@ -2858,7 +2859,7 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 				write_lock_bh(&bond->curr_slave_lock);
 
 				if ((!bond->curr_active_slave) &&
-				    ((jiffies - slave->dev->trans_start) <= delta_in_ticks)) {
+				    time_before_eq(jiffies, slave->dev->trans_start + delta_in_ticks)) {
 					bond_change_active_slave(bond, slave);
 					bond->current_arp_slave = NULL;
 				} else if (bond->curr_active_slave != slave) {
@@ -2897,7 +2898,7 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 
 			if ((slave != bond->curr_active_slave) &&
 			    (!bond->current_arp_slave) &&
-			    (((jiffies - slave_last_rx(bond, slave)) >= 3*delta_in_ticks) &&
+			    (time_after_eq(jiffies, slave_last_rx(bond, slave) + 3*delta_in_ticks) &&
 			     bond_has_ip(bond))) {
 				/* a backup slave has gone down; three times
 				 * the delta allows the current slave to be
@@ -2943,10 +2944,10 @@ void bond_activebackup_arp_mon(struct work_struct *work)
 		 * before being taken out. if a primary is being used, check
 		 * if it is up and needs to take over as the curr_active_slave
 		 */
-		if ((((jiffies - slave->dev->trans_start) >= (2*delta_in_ticks)) ||
-			(((jiffies - slave_last_rx(bond, slave)) >= (2*delta_in_ticks)) &&
-			 bond_has_ip(bond))) &&
-			((jiffies - slave->jiffies) >= 2*delta_in_ticks)) {
+		if ((time_after_eq(jiffies, slave->dev->trans_start + 2*delta_in_ticks) ||
+			(time_after_eq(jiffies, slave_last_rx(bond, slave) + 2*delta_in_ticks) &&
+			 bond_has_ip(bond))) &&
+			time_after_eq(jiffies, slave->jiffies + 2*delta_in_ticks)) {
 
 			slave->link  = BOND_LINK_DOWN;
 
-- 
1.5.3.4.206.g58ba4-dirty


[PATCH 8/8] bonding: Fix race at module unload

2007-12-06 Thread Jay Vosburgh
Fixes a race condition in module unload.  Without this change,
workqueue events may fire while bonding data structures are partially
freed but before bond_close() is invoked by unregister_netdevice().

Update version to 3.2.3.

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_main.c |   43 ---
 drivers/net/bonding/bonding.h   |2 +-
 2 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 08879d5..b0b2603 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4492,6 +4492,27 @@ static void bond_deinit(struct net_device *bond_dev)
 #endif
 }
 
+static void bond_work_cancel_all(struct bonding *bond)
+{
+	write_lock_bh(&bond->lock);
+	bond->kill_timers = 1;
+	write_unlock_bh(&bond->lock);
+
+	if (bond->params.miimon && delayed_work_pending(&bond->mii_work))
+		cancel_delayed_work(&bond->mii_work);
+
+	if (bond->params.arp_interval && delayed_work_pending(&bond->arp_work))
+		cancel_delayed_work(&bond->arp_work);
+
+	if (bond->params.mode == BOND_MODE_ALB &&
+	    delayed_work_pending(&bond->alb_work))
+		cancel_delayed_work(&bond->alb_work);
+
+	if (bond->params.mode == BOND_MODE_8023AD &&
+	    delayed_work_pending(&bond->ad_work))
+		cancel_delayed_work(&bond->ad_work);
+}
+
 /* Unregister and free all bond devices.
  * Caller must hold rtnl_lock.
  */
@@ -4502,6 +4523,7 @@ static void bond_free_all(void)
 	list_for_each_entry_safe(bond, nxt, &bond_dev_list, bond_list) {
 		struct net_device *bond_dev = bond->dev;
 
+		bond_work_cancel_all(bond);
 		bond_mc_list_destroy(bond);
 		/* Release the bonded slaves */
 		bond_release_all(bond_dev);
@@ -4902,27 +4924,6 @@ out_rtnl:
return res;
 }
 
-static void bond_work_cancel_all(struct bonding *bond)
-{
-	write_lock_bh(&bond->lock);
-	bond->kill_timers = 1;
-	write_unlock_bh(&bond->lock);
-
-	if (bond->params.miimon && delayed_work_pending(&bond->mii_work))
-		cancel_delayed_work(&bond->mii_work);
-
-	if (bond->params.arp_interval && delayed_work_pending(&bond->arp_work))
-		cancel_delayed_work(&bond->arp_work);
-
-	if (bond->params.mode == BOND_MODE_ALB &&
-	    delayed_work_pending(&bond->alb_work))
-		cancel_delayed_work(&bond->alb_work);
-
-	if (bond->params.mode == BOND_MODE_8023AD &&
-	    delayed_work_pending(&bond->ad_work))
-		cancel_delayed_work(&bond->ad_work);
-}
-
 static int __init bonding_init(void)
 {
int i;
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index ccafc74..e1e4734 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -22,7 +22,7 @@
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
-#define DRV_VERSION	"3.2.2"
+#define DRV_VERSION	"3.2.3"
 #define DRV_RELDATE	"December 6, 2007"
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
-- 
1.5.3.4.206.g58ba4-dirty



[PATCH 7/8] bonding: Add new layer2+3 hash for xor/802.3ad modes

2007-12-06 Thread Jay Vosburgh
Add new hash for balance-xor and 802.3ad modes.  Originally
 submitted by Glenn Griffin [EMAIL PROTECTED]; modified by
 Jay Vosburgh to move setting of hash policy out of line, tweak the
 documentation update and add version update to 3.2.2.

Glenn's original comment follows:

Included is a patch for a new xmit_hash_policy for the bonding driver
that selects slaves based on MAC and IP information.  This is a middle
ground between what currently exists in the layer2 only policy and the
layer3+4 policy.  This policy strives to be fully 802.3ad compliant by
transmitting every packet of any particular flow over the same link.
As documented the layer3+4 policy is not fully compliant for extreme
cases such as ip fragmentation, so this policy is a nice compromise
for environments that require full compliance but desire more than the
layer2 only policy.
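
As a rough illustration of the policy described above, a userspace
sketch follows.  l23_hash() and its parameters are hypothetical; the
function in the patch, bond_xmit_hash_policy_l23(), works on an
sk_buff instead.  The 0xffff mask is an assumption of this sketch (the
mask value appears truncated in this copy of the patch), and the IP
addresses are taken in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace sketch of a layer2+3 transmit hash.
 * Only the last byte of each MAC address takes part, as in the patch.
 */
unsigned int l23_hash(uint32_t saddr, uint32_t daddr,
		      uint8_t src_mac_last, uint8_t dst_mac_last,
		      unsigned int slave_count)
{
	uint32_t l3 = (saddr ^ daddr) & 0xffff;	/* assumed mask value */
	uint32_t l2 = (uint32_t)(src_mac_last ^ dst_mac_last);

	return (l3 ^ l2) % slave_count;
}

void l23_hash_selftest(void)
{
	uint32_t a = 0xC0A80001u;	/* 192.168.0.1 */
	uint32_t b = 0xC0A80002u;	/* 192.168.0.2 */

	/* (a ^ b) & 0xffff = 0x03, 0x10 ^ 0x20 = 0x30, 0x33 % 2 = 1 */
	assert(l23_hash(a, b, 0x10, 0x20, 2) == 1);

	/* XOR is symmetric, so both directions of a flow hash alike */
	assert(l23_hash(a, b, 0x10, 0x20, 2) == l23_hash(b, a, 0x20, 0x10, 2));
}
```

Because no layer4 field is used, every packet between the same pair of
hosts, fragments included, yields the same slave index.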

Signed-off-by: Glenn Griffin [EMAIL PROTECTED]
Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]
---
 Documentation/networking/bonding.txt |   29 +++-
 drivers/net/bonding/bond_main.c  |   48 ++---
 drivers/net/bonding/bonding.h|4 +-
 include/linux/if_bonding.h   |3 +-
 4 files changed, 69 insertions(+), 15 deletions(-)

diff --git a/Documentation/networking/bonding.txt 
b/Documentation/networking/bonding.txt
index eda0f06..a0cda06 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -559,6 +559,30 @@ xmit_hash_policy
 
This algorithm is 802.3ad compliant.
 
+   layer2+3
+
+   This policy uses a combination of layer2 and layer3
+   protocol information to generate the hash.
+
+   Uses XOR of hardware MAC addresses and IP addresses to
+   generate the hash.  The formula is
+
+   (((source IP XOR dest IP) AND 0x) XOR
+   ( source MAC XOR destination MAC ))
+   modulo slave count
+
+   This algorithm will place all traffic to a particular
+   network peer on the same slave.  For non-IP traffic,
+   the formula is the same as for the layer2 transmit
+   hash policy.
+
+   This policy is intended to provide a more balanced
+   distribution of traffic than layer2 alone, especially
+   in environments where a layer3 gateway device is
+   required to reach most destinations.
+
+   This algorithm is 802.3ad compliant.
+
layer3+4
 
This policy uses upper layer protocol information,
@@ -594,8 +618,9 @@ xmit_hash_policy
or may not tolerate this noncompliance.
 
The default value is layer2.  This option was added in bonding
-version 2.6.3.  In earlier versions of bonding, this parameter does
-not exist, and the layer2 policy is the only policy.
+   version 2.6.3.  In earlier versions of bonding, this parameter
+   does not exist, and the layer2 policy is the only policy.  The
+   layer2+3 value was added for bonding version 3.2.2.
 
 
 3. Configuring Bonding Devices
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e4a4714..08879d5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -175,6 +175,7 @@ struct bond_parm_tbl bond_mode_tbl[] = {
 struct bond_parm_tbl xmit_hashtype_tbl[] = {
 {	"layer2",		BOND_XMIT_POLICY_LAYER2},
 {	"layer3+4",		BOND_XMIT_POLICY_LAYER34},
+{	"layer2+3",		BOND_XMIT_POLICY_LAYER23},
 {	NULL,			-1},
 };
 
@@ -3605,6 +3606,24 @@ void bond_unregister_arp(struct bonding *bond)
 /* Hashing Policies -*/
 
 /*
+ * Hash for the output device based upon layer 2 and layer 3 data. If
+ * the packet is not IP mimic bond_xmit_hash_policy_l2()
+ */
+static int bond_xmit_hash_policy_l23(struct sk_buff *skb,
+				     struct net_device *bond_dev, int count)
+{
+	struct ethhdr *data = (struct ethhdr *)skb->data;
+	struct iphdr *iph = ip_hdr(skb);
+
+	if (skb->protocol == __constant_htons(ETH_P_IP)) {
+		return ((ntohl(iph->saddr ^ iph->daddr) & 0x) ^
+			(data->h_dest[5] ^ bond_dev->dev_addr[5])) % count;
+	}
+
+	return (data->h_dest[5] ^ bond_dev->dev_addr[5]) % count;
+}
+
+/*
  * Hash for the output device based upon layer 3 and layer 4 data. If
  * the packet is a frag or not TCP or UDP, just use layer 3 data.  If it is
  * altogether not IP, mimic bond_xmit_hash_policy_l2()
@@ -4306,6 +4325,22 @@ out:
 
 /*- Device initialization ---*/
 
+static void bond_set_xmit_hash_policy(struct bonding *bond)
+{
+	switch (bond->params.xmit_policy) {
+	case BOND_XMIT_POLICY_LAYER23:
+		bond->xmit_hash_policy = 

[PATCH 3/8] bonding: Purely cosmetic: rename a local variable

2007-12-06 Thread Jay Vosburgh
From: Wagner Ferenc [EMAIL PROTECTED]

Code for rendering multivalue sysfs files occurs three times
in this module.  Rename 'buffer' to 'buf' in the first, for
the sake of consistency.

Signed-off-by: Ferenc Wagner [EMAIL PROTECTED]
Acked-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_sysfs.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 6bb91e2..5c31f5c 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -74,7 +74,7 @@ struct rw_semaphore bonding_rwsem;
  * show function for the bond_masters attribute.
  * The class parameter is ignored.
  */
-static ssize_t bonding_show_bonds(struct class *cls, char *buffer)
+static ssize_t bonding_show_bonds(struct class *cls, char *buf)
 {
int res = 0;
struct bonding *bond;
@@ -86,13 +86,12 @@ static ssize_t bonding_show_bonds(struct class *cls, char *buffer)
 			/* not enough space for another interface name */
 			if ((PAGE_SIZE - res) > 10)
 				res = PAGE_SIZE - 10;
-			res += sprintf(buffer + res, "++more++ ");
+			res += sprintf(buf + res, "++more++ ");
 			break;
 		}
-		res += sprintf(buffer + res, "%s ",
-			       bond->dev->name);
+		res += sprintf(buf + res, "%s ", bond->dev->name);
 	}
-	if (res) buffer[res-1] = '\n'; /* eat the leftover space */
+	if (res) buf[res-1] = '\n'; /* eat the leftover space */
 	up_read(&(bonding_rwsem));
 	return res;
 }
-- 
1.5.3.4.206.g58ba4-dirty



[PATCH 4/8] bonding: Coding style: break line after the if condition

2007-12-06 Thread Jay Vosburgh
From: Wagner Ferenc [EMAIL PROTECTED]

Adhere to coding style: break line after the if condition

Signed-off-by: Ferenc Wagner [EMAIL PROTECTED]
Acked-by: Jay Vosburgh [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_sysfs.c |9 ++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 5c31f5c..9de2c52 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -91,7 +91,8 @@ static ssize_t bonding_show_bonds(struct class *cls, char *buf)
 		}
 		res += sprintf(buf + res, "%s ", bond->dev->name);
 	}
-	if (res) buf[res-1] = '\n'; /* eat the leftover space */
+	if (res)
+		buf[res-1] = '\n'; /* eat the leftover space */
 	up_read(&(bonding_rwsem));
 	return res;
 }
@@ -239,7 +240,8 @@ static ssize_t bonding_show_slaves(struct device *d,
 		res += sprintf(buf + res, "%s ", slave->dev->name);
 	}
 	read_unlock(&bond->lock);
-	if (res) buf[res-1] = '\n'; /* eat the leftover space */
+	if (res)
+		buf[res-1] = '\n'; /* eat the leftover space */
 	return res;
 }
 
@@ -705,7 +707,8 @@ static ssize_t bonding_show_arp_targets(struct device *d,
 			res += sprintf(buf + res, "%u.%u.%u.%u ",
 			       NIPQUAD(bond->params.arp_targets[i]));
 	}
-	if (res) buf[res-1] = '\n'; /* eat the leftover space */
+	if (res)
+		buf[res-1] = '\n'; /* eat the leftover space */
 	return res;
 }
 
-- 
1.5.3.4.206.g58ba4-dirty
