Re: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression

2007-10-01 Thread Eric Dumazet

Denys wrote:
Hi 


I got

pi linux-git # git bisect bad
Bisecting: 0 revisions left to test after this
[f85958151900f9d30fa5ff941b0ce71eaa45a7de] [NET]: random functions can use nsec resolution instead of usec


I will make sure and will try to reverse this patch on 2.6.22

But it seems that's it.


Well... that's interesting...

No problem here on bigger servers, so I CC David Miller and netdev on this one.

AFAIK do_gettimeofday() and ktime_get_real() should use the same underlying 
hardware functions on PC and no performance problem should happen here.


(relevant part of this patch:)

@@ -1521,7 +1515,6 @@ __u32 secure_ip_id(__be32 daddr)
 __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
                                  __be16 sport, __be16 dport)
 {
-       struct timeval tv;
        __u32 seq;
        __u32 hash[4];
        struct keydata *keyptr = get_keyptr();
@@ -1543,12 +1536,11 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,

         *  As close as possible to RFC 793, which
         *  suggests using a 250 kHz clock.
         *  Further reading shows this assumes 2 Mb/s networks.
-        *  For 10 Mb/s Ethernet, a 1 MHz clock is appropriate.
+        *  For 10 Gb/s Ethernet, a 1 GHz clock is appropriate.
         *  That's funny, Linux has one built in!  Use it!
         *  (Networks are faster now - should this be increased?)
         */
-       do_gettimeofday(&tv);
-       seq += tv.tv_usec + tv.tv_sec * 1000000;
+       seq += ktime_get_real().tv64;


Thank you for doing this research.
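
For a sense of scale, here is a minimal userspace sketch (not kernel code) of how quickly the low 32 bits of the clock term cycle at a given tick rate. seq is a __u32, so only those bits survive the addition. The helper name wrap_period_ms is hypothetical and exists only for this illustration.

```c
#include <stdint.h>

/*
 * Time, in whole milliseconds, for a free-running counter ticking at
 * clock_hz to run through all 2^32 values of a 32-bit field.
 * Hypothetical helper, not part of any kernel API.
 */
static uint64_t wrap_period_ms(uint64_t clock_hz)
{
    return ((UINT64_C(1) << 32) * 1000) / clock_hz;
}
```

With a 1 MHz clock (the old tv_usec based term) this gives about 4294967 ms, roughly 71 minutes; with a 1 GHz clock (the nanosecond tv64 term) it gives about 4294 ms.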




On Sun, 30 Sep 2007 14:25:37 +1000, Nick Piggin wrote
Hi Denys, thanks for reporting (btw. please reply-to-all when 
replying on lkml).


You say that SLAB is better than SLUB on an otherwise identical 
kernel, but I didn't see if you quantified the actual numbers? It 
sounds like there is still a regression with SLAB?


On Monday 01 October 2007 03:48, Eric Dumazet wrote:

Denys wrote:

I recently moved one of my proxies (squid and a compressing
application) from 2.6.21 to 2.6.22 and noticed a huge performance drop. I
think this is important, because it could cause a serious regression on
other workloads such as busy web servers.

After some analysis of different options I can give more exact numbers:

2.6.21 is able to process 500-550 requests/second and 15-20 Mbit/s of
traffic, and works great without any slowdown or instability.

2.6.22 is able to process only 250-300 requests/second and 8-10 Mbit/s of
traffic; ssh and the console freeze (there is a delay even when typing
characters).

Both proxies are on identical hardware (Sun Fire X4100) with an identical
configuration (small system, LFS-like, on USB flash); only the kernel
differs.

I tried disabling/enabling various options and optimisations; nothing
made a difference until I reached the SLUB/SLAB option.

I loaded the proxy configuration onto a gentoo PC with 2.6.22 (then upgraded
it to 2.6.23-rc8) and saw the same effect.
Additionally, when the load reaches its maximum I notice a whole-system
slowdown: for example, ssh and scp take much more time to run, even if I
use nice -n -5 for them.

But even with 2.6.23-rc8+SLAB I noticed the same freezing of ssh (and
surely it slows down other kinds of network performance), though much less
than with SLUB. On top of that, I am seeing ksoftirqd taking almost 100%
(sometimes ksoftirqd/0, sometimes ksoftirqd/1).

I also tried different tricks with the scheduler (/proc/sys/kernel/sched*),
but that didn't help either.

When it freezes it looks like:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    7 root      15  -5     0    0    0 R   64  0.0   2:47.48 ksoftirqd/1
 5819 root      20   0  134m 130m  596 R   57  3.3   4:36.78 globax
 5911 squid     20   0 1138m 1.1g 2124 R   26 28.9   2:24.87 squid
   10 root      15  -5     0    0    0 S    1  0.0   0:01.86 events/1
 6130 root      20   0  3960 2416 1592 S    0  0.1   0:08.02 oprofiled


Oprofile results:


That's oprofile with 2.6.23-rc8 - SLUB

73918    21.5521  check_bytes
38361    11.1848  acpi_pm_read
14077     4.1044  init_object
13632     3.9747  ip_send_reply
8486      2.4742  __slab_alloc
7199      2.0990  nf_iterate
6718      1.9588  page_address
6716      1.9582  tcp_v4_rcv
6425      1.8733  __slab_free
5604      1.6339  on_freelist


That's oprofile with 2.6.23-rc8 - SLAB

CPU: AMD64 processors, speed 2592.64 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a
unit mask of 0x00 (No unit mask) count 10
samples  %        symbol name
138991   14.0627  acpi_pm_read
52401 5.3018  tcp_v4_rcv
48466 4.9037  nf_iterate
38043 3.8491  __slab_alloc
34155 3.4557  ip_send_reply
20963 2.1210  ip_rcv
19475 1.9704  csum_partial
19084 1.9309  kfree
17434 1.7639  ip_output
17278 1.7481  netif_receive_skb
15248 1.5428  nf_hook_slow

My .config is at http://www.nuclearcat.com/.config (there is SPARSEMEM
enabled, it doesn't make any noticeable difference)

Please CC me on 

Re: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression

2007-10-01 Thread David Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 07:59:12 +0200

 No problem here on bigger servers, so I CC David Miller and netdev
 on this one.  AFAIK do_gettimeofday() and ktime_get_real() should
 use the same underlying hardware functions on PC and no performance
 problem should happen here.

One thing that jumps out at me is that on 32-bit (and to a certain
extent on 64-bit) there are a lot of stack accesses and missed
optimizations because all of the work occurs, and gets expanded,
inside of ktime_get_real().

The timespec_to_ktime() inside of there constructs the ktime_t return
value on the stack, then returns that as an aggregate to the caller.

That cannot be without some cost.

ktime_get_real() is definitely a candidate for inlining especially in
these kinds of cases where we'll happily get computations in local
registers instead of all of this on-stack nonsense.  And in several
cases (if the caller only needs the tv_sec value, for example)
computations can be elided entirely.

It would be constructive to experiment and see if this is in fact part
of the problem.
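
As a rough illustration of the inlining argument, here is a minimal userspace sketch. The types struct ts and union kt and the helpers ts_to_kt and ts_seconds are hypothetical stand-ins, not the kernel's timespec, ktime_t or timespec_to_ktime(); the sketch only shows the shape of the computation, assuming a scalar nanosecond representation.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical stand-ins for the kernel's timespec and ktime_t. */
struct ts { int64_t sec; int64_t nsec; };
union kt { int64_t tv64; };

/*
 * Inline conversion: when this is expanded at the call site, the
 * compiler is free to keep the result in registers instead of building
 * an aggregate return value in memory.
 */
static inline union kt ts_to_kt(struct ts t)
{
    union kt k = { .tv64 = t.sec * NSEC_PER_SEC + t.nsec };
    return k;
}

/* A caller that only needs whole seconds can skip the multiply entirely. */
static inline int64_t ts_seconds(struct ts t)
{
    return t.sec;
}
```

Whether this accounts for any of the measured regression is exactly what the experiment suggested above would have to show.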

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-10-01 Thread Denis V. Lunev
Patrick McHardy wrote:
 Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:


 Maybe I can save you some time: we used to do down_trylock()
 for the rtnl mutex, so senders would simply return if someone
 else was already processing the queue *or* the rtnl was locked
 for some other reason. In the first case the process already
 processing the queue would also process the new messages, but
 if the rtnl was locked for some other reason (for example
 during module registration) the message would sit in the
 queue until the next rtnetlink sendmsg call, which is why
 rtnl_unlock does queue processing. Commit 6756ae4b changed
 the down_trylock to mutex_lock, so senders will now simply wait
 until the mutex is released and then call netlink_run_queue
 themselves. This means it's not needed anymore.

 Sounds reasonable.

 I started looking through the code paths and I currently cannot
 see anything that would leave a message on a kernel rtnl socket.

 However I did a quick test adding a WARN_ON if there were any messages
 found in the queue during rtnl_unlock and I found this code path
 getting invoked from linkwatch_event.  So there is clearly something I
 don't understand, and it sounds at odds just a bit from your
 description.
 
 
 That sounds like a bug. Did you place the WARN_ON before or after
 the mutex_unlock()?

The presence of the message in the queue during rtnl_unlock is quite
possible as normal user-kernel message processing path for rtnl is the
following:

netlink_sendmsg
   netlink_unicast
      netlink_sendskb
         skb_queue_tail
         netlink_data_ready
            rtnetlink_rcv
               mutex_lock(&rtnl_mutex);
               netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
               mutex_unlock(&rtnl_mutex);

so, the presence of the packet in the rtnl queue on rtnl_unlock is
normal race with a rtnetlink_rcv for me.

Regards,
Den
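
The interleaving described above can be modelled with a small single-threaded sketch. Everything here is hypothetical (struct rtnl_model and the three helpers are illustration names, and a boolean stands in for rtnl_mutex); it only shows that a lock holder can legitimately see a queued message at unlock time, because the sender enqueues before it takes the mutex.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of the rtnl mutex and receive queue. */
struct rtnl_model {
    bool locked;        /* stands in for rtnl_mutex */
    unsigned int qlen;  /* messages waiting on the rtnetlink socket */
};

/* Sender, step 1: the message is queued before the mutex is taken. */
static void sender_enqueue(struct rtnl_model *m)
{
    m->qlen++;
}

/*
 * Sender, step 2: take the mutex and drain the queue.
 * Returns false if the mutex is still held; a real sender would sleep
 * in mutex_lock() and continue once the holder unlocks.
 */
static bool sender_process(struct rtnl_model *m)
{
    if (m->locked)
        return false;
    m->locked = true;
    m->qlen = 0;        /* every queued message is handled */
    m->locked = false;
    return true;
}

/* Another lock holder releases the mutex and reports what it saw queued. */
static unsigned int holder_unlock(struct rtnl_model *m)
{
    unsigned int seen = m->qlen;

    m->locked = false;
    return seen;
}
```

In this model the holder sees one queued message when it unlocks, and the sender then drains it on its own, which is why the unlock path does not need to process the queue itself.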


Re: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression

2007-10-01 Thread Denys
Well, I can experiment a bit more on live servers. I now have a hot-swap server
with full gentoo, where I can rebuild any kernel you want, with any patch applied.
But it does not look like plain overhead: the load becomes very spiky rather than
just permanently higher. Also, it is not normal that the whole system becomes
unresponsive (for example, ping 127.0.0.1 goes up to 300 ms for periods when
softirq usage jumps to 100%).

On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote
 From: Eric Dumazet [EMAIL PROTECTED]
 Date: Mon, 01 Oct 2007 07:59:12 +0200
 
  No problem here on bigger servers, so I CC David Miller and netdev
  on this one.  AFAIK do_gettimeofday() and ktime_get_real() should
  use the same underlying hardware functions on PC and no performance
  problem should happen here.
 
 One thing that jumps out at me is that on 32-bit (and to a certain
 extent on 64-bit) there is a lot of stack accesses and missed
 optimizations because all of the work occurs, and gets expanded,
 inside of ktime_get_real().
 
 The timespec_to_ktime() inside of there constructs the ktime_t return
 value on the stack, then returns that as an aggregate to the caller.
 
 That cannot be without some cost.
 
 ktime_get_real() is definitely a candidate for inlining especially in
 these kinds of cases where we'll happily get computations in local
 registers instead of all of this on-stack nonsense.  And in several
 cases (if the caller only needs the tv_sec value, for example)
 computations can be elided entirely.
 
 It would be constructive to experiment and see if this is in fact 
 part of the problem.


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.



Re: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression

2007-10-01 Thread Eric Dumazet

Eric Dumazet wrote:

Denys wrote:
Well, I can experiment a bit more on live servers. I now have a hot-swap server
with full gentoo, where I can rebuild any kernel you want, with any patch applied.
But it does not look like plain overhead: the load becomes very spiky rather than
just permanently higher. Also, it is not normal that the whole system becomes
unresponsive (for example, ping 127.0.0.1 goes up to 300 ms for periods when
softirq usage jumps to 100%).

  
Could you try a pristine 2.6.22.9 with a patch to
secure_tcp_sequence_number() like this:


--- drivers/char/random.c.orig 2007-10-01 10:18:42.0 +0200
+++ drivers/char/random.c 2007-10-01 10:19:58.0 +0200
@@ -1554,7 +1554,7 @@
  * That's funny, Linux has one built in! Use it!
  * (Networks are faster now - should this be increased?)
  */
- seq += ktime_get_real().tv64;
+ seq += ktime_get_real().tv64 / 1000;
 #if 0
  printk("init_seq(%lx, %lx, %d, %d) = %d\n",
  saddr, daddr, sport, dport, seq);

On a 32-bit machine, replace the divide with a shift to avoid a linker
error (undefined reference to `__divdi3'):


seq += ktime_get_real().tv64 >> 10;
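
The two variants are not numerically identical, which the following minimal userspace sketch makes concrete. The helper names are hypothetical. A signed 64-bit divide on a 32-bit target is normally compiled by gcc into a call to the libgcc helper __divdi3, which the kernel does not link against, hence the error; a shift needs no helper but divides by 1024 instead of 1000.

```c
#include <stdint.h>

/* Exact nanoseconds-to-microseconds conversion (needs a 64-bit divide). */
static int64_t ns_to_us_div(int64_t ns)
{
    return ns / 1000;
}

/* Approximation for non-negative values: a shift by 10 divides by 1024. */
static int64_t ns_to_us_shift(int64_t ns)
{
    return ns >> 10;
}
```

For one second of nanoseconds the divide yields 1000000 ticks and the shift yields 976562, so the shifted clock runs at roughly 0.977 MHz. For a test of whether the clock rate matters, that difference should be harmless.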







Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-10-01 Thread Eric W. Biederman
Denis V. Lunev [EMAIL PROTECTED] writes:

 The presence of the message in the queue during rtnl_unlock is quite
 possible as normal user-kernel message processing path for rtnl is the
 following:

 netlink_sendmsg
    netlink_unicast
       netlink_sendskb
          skb_queue_tail
          netlink_data_ready
             rtnetlink_rcv
                mutex_lock(&rtnl_mutex);
                netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
                mutex_unlock(&rtnl_mutex);

 so, the presence of the packet in the rtnl queue on rtnl_unlock is
 normal race with a rtnetlink_rcv for me.

Yes.  That is what I saw in practice as well.
Thanks for confirming this.

It happened to be reproducible because I had a dhcp client asking
for a list of links in parallel with the actual link coming up
during boot.

Looking at netlink_unicast and netlink_broadcast I am generally
convinced that we can remove the call of sk_data_ready in
rtnl_unlock.   I think those are the only two possible paths
through there and I don't see how we could miss processing a
packet on the way through there.

What would be nice is if we could figure out how to eliminate
this race.  As that would allow netlink packets to be processed
synchronously and we could actually use current for security
checks, and for getting the context of the calling process.

Right now we are 99% of the way there but because of the above
race the code must all be written as if netlink packets were coming
in completely asynchronously.  Which is unfortunate and a pain.

Eric


Re: 2.6.23-rc8-mm2 - tcp_fastretrans_alert() WARNING

2007-10-01 Thread Cedric Le Goater
Ilpo Järvinen wrote:
 On Sat, 29 Sep 2007, Cedric Le Goater wrote:
 
 Ilpo Järvinen wrote:
 On Fri, 28 Sep 2007, Ilpo Järvinen wrote:
 On Fri, 28 Sep 2007, Cedric Le Goater wrote:

 I just found that warning in my logs. It seems that it's been 
 happening since rc7-mm1 at least. 

 WARNING: at /home/legoater/linux/2.6.23-rc8-mm2/net/ipv4/tcp_input.c:2314 
 tcp_fastretrans_alert()

 Call Trace:
  <IRQ>  [8040fdc3] tcp_ack+0xcd6/0x1894
 ...snip...
 ...Thanks for the report, I'll have look what could still break 
 fackets_out...
 I think this one is now clear to me, tcp_fragment/collapse adjusts 
 fackets_out (incorrectly) also for reno flow when there were some dupACKs 
 that made sacked_out != 0. Could you please try if patch below proves all 
 them to be of non-SACK origin... In case that's true, it's rather 
 harmless, I'll send a fix on Monday or so (this would anyway be needed)... 
 If you find out that them occur with SACK enabled flow, that would be
 more interesting and requires more digging...
 I'm trying now to reproduce this WARNING. 

 It seems that the n/w behaves differently during the week ends. Probably
 taking a break. 
 
 Thanks.
 
 Of course there are other means too to determine if TCP flows do negotiate 
 SACK enabled or not. Depending on your test case (which is fully unknown 
 to me) they may or may not be usable... At least the value of tcp_sack 
 sysctl on both systems or tcpdump catching SYN packets should give that 
 detail. ...If you know to which hosts TCP could be connected (and active) 
 to, while the WARNING triggers, it's really easy to test what is being 
 negotiated as it's unlikely to change at short notice and any TCP flow to 
 that host will get us the same information though the WARNING would not be 
 triggered with it at this time. Obviously if at least one of the remotes 
 is not known or the set ends up being mixture of reno and SACK flows, then 
 we'll just have to wait and see which fish we get...
 
got it !

r3-06.test.meiosys.com login: WARNING: at 
/home/legoater/linux/2.6.23-rc8-mm2/net/ipv4/tcp_input.c:2314 
tcp_fastretrans_alert()

Call Trace:
 <IRQ>  [8040fdc3] tcp_ack+0xcd6/0x18af
 [80412b6f] tcp_rcv_established+0x61f/0x6df
 [80254146] __lock_acquire+0x8a1/0xf1b
 [80419d19] tcp_v4_do_rcv+0x3e/0x394
 [8041a68b] tcp_v4_rcv+0x61c/0x9a9
 [803ff1e3] ip_local_deliver+0x1da/0x2a4
 [803ffb4e] ip_rcv+0x583/0x5c9
 [8046d35b] packet_rcv_spkt+0x19a/0x1a8
 [803e081c] netif_receive_skb+0x2cf/0x2f5
 [88042505] :tg3:tg3_poll+0x65d/0x8a4
 [803e09e8] net_rx_action+0xb8/0x191
 [8023a927] __do_softirq+0x5f/0xe0
 [8020c98c] call_softirq+0x1c/0x28
 [8020e9c3] do_softirq+0x3b/0xb8
 [8023aa1e] irq_exit+0x4e/0x50
 [8020e7df] do_IRQ+0xbd/0xd7
 [80209cb9] mwait_idle+0x0/0x4d
 [8020bce6] ret_from_intr+0x0/0xf
 <EOI>  [80209cfc] mwait_idle+0x43/0x4d
 [802099fb] enter_idle+0x22/0x24
 [80209c4f] cpu_idle+0x9d/0xc0
 [80476aa1] rest_init+0x55/0x57
 [80630815] start_kernel+0x2d6/0x2e2
 [80630134] _sinittext+0x134/0x13b

TCP 0


I wasn't doing any particular test on n/w so it took me a while to figure 
out how I was triggering the WARNING. Apparently, this is happening when I 
run ketchup, but not always. This test machine is behind many firewall &
routers so it might be a reason.

tcpdump gave me this output for a wget on kernel.org :

10:51:14.835981 IP r3-06.test.meiosys.com.40322 > pub2.kernel.org.http: S 737836267:737836267(0) win 5840 <mss 1460,sackOK,timestamp 1309245 0,nop,wscale 7>
10:51:14.975153 IP pub2.kernel.org.http > r3-06.test.meiosys.com.40321: F 524:524(0) ack 166 win 5840
10:51:14.975177 IP r3-06.test.meiosys.com.40321 > pub2.kernel.org.http: . ack 525 win 7504

I'm trying to get the WARNING and the tcpdump output for it but for the
moment, it seems it's beyond my reach :/

Hope it helps !

C. 



[PATCH] rtnl_unlock cleanups

2007-10-01 Thread Denis V. Lunev
There is no need to process outstanding netlink user->kernel packets
during rtnl_unlock now. There is no rtnl_trylock in rtnetlink_rcv
anymore.

The normal code path is the following:
netlink_sendmsg
   netlink_unicast
      netlink_sendskb
         skb_queue_tail
         netlink_data_ready
            rtnetlink_rcv
               mutex_lock(&rtnl_mutex);
               netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
               mutex_unlock(&rtnl_mutex);

So it is possible that packets are present in rtnl->sk_receive_queue
during rtnl_unlock, but there is no need to process them at that moment, as
rtnetlink_rcv for that packet is pending.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
Acked-by: Alexey Kuznetsov [EMAIL PROTECTED]

--- ./net/core/rtnetlink.c.rtnl2  2007-08-26 19:30:38.0 +0400
+++ ./net/core/rtnetlink.c  2007-10-01 13:09:03.0 +0400
@@ -75,8 +75,6 @@ void __rtnl_unlock(void)
 void rtnl_unlock(void)
 {
        mutex_unlock(&rtnl_mutex);
-       if (rtnl && rtnl->sk_receive_queue.qlen)
-               rtnl->sk_data_ready(rtnl, 0);
        netdev_run_todo();
 }
 
@@ -1319,11 +1317,9 @@ static void rtnetlink_rcv(struct sock *s
        unsigned int qlen = 0;
 
        do {
-               mutex_lock(&rtnl_mutex);
+               rtnl_lock();
                qlen = netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
-               mutex_unlock(&rtnl_mutex);
-
-               netdev_run_todo();
+               rtnl_unlock();
        } while (qlen);
 }
 


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread Patrick McHardy
jamal wrote:
 +static inline int
 +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
 +struct Qdisc *q)
 +{
 +
 + struct sk_buff *skb;
 +
 + while ((skb = __skb_dequeue(skbs)) != NULL)
 + q->ops->requeue(skb, q);


->requeue queues at the head, so this looks like it would reverse
the order of the skbs.
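
The ordering concern can be checked with a small userspace model. struct pktq and its helpers are hypothetical stand-ins for sk_buff_head and the qdisc queue, holding plain integers instead of skbs, with no bounds checking beyond the fixed capacity.

```c
#include <stddef.h>
#include <string.h>

#define QCAP 8

/* Hypothetical fixed-size queue of packet ids; index 0 is the head. */
struct pktq {
    int item[QCAP];
    size_t len;
};

/* Insert at the head, shifting existing entries back by one. */
static void push_head(struct pktq *q, int v)
{
    memmove(&q->item[1], &q->item[0], q->len * sizeof(q->item[0]));
    q->item[0] = v;
    q->len++;
}

/* Remove and return the entry at the head. */
static int pop_head(struct pktq *q)
{
    int v = q->item[0];

    q->len--;
    memmove(&q->item[0], &q->item[1], q->len * sizeof(q->item[0]));
    return v;
}

/* Remove and return the entry at the tail. */
static int pop_tail(struct pktq *q)
{
    return q->item[--q->len];
}

/* Mirrors the quoted loop: dequeue from the head, requeue at the head. */
static void requeue_from_head(struct pktq *batch, struct pktq *qdisc)
{
    while (batch->len)
        push_head(qdisc, pop_head(batch));
}

/* One possible alternative: dequeue from the tail, so head insertion
 * puts the batch back in its original order. */
static void requeue_from_tail(struct pktq *batch, struct pktq *qdisc)
{
    while (batch->len)
        push_head(qdisc, pop_tail(batch));
}
```

Requeueing the batch 1, 2, 3 in front of a queue holding 4 gives 3, 2, 1, 4 with the first loop and 1, 2, 3, 4 with the second. Whether dequeuing from the tail is the right fix for the patch is a separate question; the sketch only confirms the reversal.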




[patch] dm9601: Fix receive MTU

2007-10-01 Thread Peter Korsgaard
Please apply to 2.6.23.
---
dm9601 didn't take the ethernet header into account when calculating
RX MTU, causing packets bigger than 1486 to fail.

Signed-off-by: Peter Korsgaard [EMAIL PROTECTED]
---
 drivers/net/usb/dm9601.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.23-rc8/drivers/net/usb/dm9601.c
===
--- linux-2.6.23-rc8.orig/drivers/net/usb/dm9601.c
+++ linux-2.6.23-rc8/drivers/net/usb/dm9601.c
@@ -405,7 +405,7 @@
        dev->net->ethtool_ops = &dm9601_ethtool_ops;
        dev->net->hard_header_len += DM_TX_OVERHEAD;
        dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
-       dev->rx_urb_size = dev->net->mtu + DM_RX_OVERHEAD;
+       dev->rx_urb_size = dev->net->mtu + ETH_HLEN + DM_RX_OVERHEAD;
 
        dev->mii.dev = dev->net;
        dev->mii.mdio_read = dm9601_mdio_read;

-- 
Bye, Peter Korsgaard
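
The 1486 figure follows from simple arithmetic, sketched below in userspace C. ETH_HLEN is the standard 14-byte Ethernet header; the DM_RX_OVERHEAD value of 7 is an assumption for this illustration (it cancels out of the result), and the helper names are hypothetical.

```c
#define ETH_HLEN        14  /* Ethernet header length in bytes */
#define DM_RX_OVERHEAD  7   /* assumed value; it cancels out below */

/* Receive buffer size before the fix: the Ethernet header is not counted. */
static int rx_urb_size_old(int mtu)
{
    return mtu + DM_RX_OVERHEAD;
}

/* Receive buffer size after the fix. */
static int rx_urb_size_new(int mtu)
{
    return mtu + ETH_HLEN + DM_RX_OVERHEAD;
}

/* Largest payload whose frame (payload + header + overhead) still fits. */
static int max_payload(int rx_urb_size)
{
    return rx_urb_size - DM_RX_OVERHEAD - ETH_HLEN;
}
```

With an MTU of 1500 the old size leaves room for 1486 payload bytes, and the new size leaves room for the full 1500.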


Re: [RFC][IPv6] Export userland ND options through netlink (RDNSS support)

2007-10-01 Thread YOSHIFUJI Hideaki / 吉藤英明
Hello.

In article [EMAIL PROTECTED] (at Sat, 29 Sep 2007 19:47:20 +0200), Pierre 
Ynard [EMAIL PROTECTED] says:

 As discussed before, this patch provides userland with a way to access
 relevant options in Router Advertisements, after they are processed and
 validated by the kernel. Extra options are processed in a generic way;
 this patch only exports RDNSS options described in RFC5006, but support
 to control which options are exported could be easily added.

I basically like this approach at first sight.

 which implies that a userland daemon processing RDNSS options needs a
 way to associate the option to the router that sent it, and fetch its
 lifetime. This kind of information could be included in a header in the
 rtnetlink message (in this version of the patch there is none).

 diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
 index dff3192..f69d415 100644
 --- a/include/linux/rtnetlink.h
 +++ b/include/linux/rtnetlink.h
 @@ -97,6 +97,9 @@ enum {
   RTM_SETNEIGHTBL,
  #define RTM_SETNEIGHTBL  RTM_SETNEIGHTBL
  
 + RTM_NEWNDUSEROPT = 68,
 +#define RTM_NEWNDUSEROPT RTM_NEWNDUSEROPT
 +
   __RTM_MAX,

Does this imply that we could extend (or reuse) this for all of
NS/NA/RS/RA/Redirect messages?  I think you need to include the
code, type and basic semantics of the message.

If this is only for RA, we should say RTM_NEWRAUSEROPT or something.

Regards,

--yoshfuji


Re: PROBLEM: 2.6.23-rc NETDEV WATCHDOG: eth0: transmit timed out

2007-10-01 Thread Karl Meyer
Hi,

after reading about issues with the NICs on Kontron boards I did a
BIOS upgrade, but this did not change anything.
However, yesterday the (onboard) NIC I used died. No link at all.
After switching to the next onboard NIC I got a NETDEV transmit timeout
with that one on kernel 2.6.22-r2.
It seems the whole thing is a hardware issue. I will try to figure it out
with Kontron.

Sorry :(

Karl

2007/9/12, Francois Romieu [EMAIL PROTECTED]:
 Karl Meyer [EMAIL PROTECTED] :
 [...]
  I have been looking into this issue for some time now, but there were no
  errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
  officially), I also ran git-bisect (for more information see the older
  messages in this thread).

 2.6.22-r2 in gentoo is based on 2.6.22.1. It is way before
 0e4851502f846b13b29b7f88f1250c980d57e944 that you reported to work.
 Thus it is not surprising that it works.

 Any update regarding the patchkit that I sent on 2007/08/16 ?

 It would help to narrow the culprit.

 --
 Ueimor



Re: Removing DAD in IPv6

2007-10-01 Thread YOSHIFUJI Hideaki / 吉藤英明
In article [EMAIL PROTECTED] (at Mon, 01 Oct 2007 11:53:27 +0800), Xia Yang 
[EMAIL PROTECTED] says:

 I would like to ask for help on how to remove or disable the DAD process
 properly, as long as the node can send, receive and forward packets
 immediately after a new IPv6 address is generated. Any pointer is
 appreciated. Thanks a lot in advance!

The IFA_F_NODAD address flag might help with this.

--yoshfuji


Re: [IPV6] Fix ICMPv6 redirect handling with target multicast address

2007-10-01 Thread YOSHIFUJI Hideaki / 吉藤英明
Hello.

In article [EMAIL PROTECTED] (at Sat, 29 Sep 2007 10:04:48 +0900 (JST)), 
YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] says:

 In article [EMAIL PROTECTED] (at Fri, 28 Sep 2007 17:50:38 -0700), David 
 Stevens [EMAIL PROTECTED] says:
 
  Brian,
  A multicast address should never be the target of a neighbor
  discovery request; the sender should use the mapping function for all
  multicasts. So, I'm not sure that your example can ever happen, and it
  certainly is ok to send ICMPv6 errors to multicast addresses in general.
  But I don't see that it hurts anything (since it should never
  happen :-)),
  so I don't particularly object, either.
  I think it'd also be better if you add the check to be:
  
  if (ipv6_addr_type(target) &
  (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST))
  
  or something along those lines, rather than reproducing ipv6_addr_type() 
  code
  separately in a new ipv6_addr_linklocal() function.

I'm fine with the idea of the fix itself.

Please use ipv6_addr_type() so far and convert other users as well
to ipv6_addr_linklocal() in another patch.

Regards,

--yoshfuji
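
How strict a flag test of this kind is depends on whether it requires any of the bits or all of them. The sketch below uses a deliberately simplified classifier: addr_type and the ADDR_* values are hypothetical stand-ins, and whether the kernel's ipv6_addr_type() sets the same bits for these addresses is an assumption.

```c
#include <stdint.h>

/* Hypothetical flag values for this illustration only. */
#define ADDR_UNICAST   0x0001
#define ADDR_MULTICAST 0x0002
#define ADDR_LINKLOCAL 0x0020

/* Simplified classifier that looks only at the first two address bytes. */
static int addr_type(const uint8_t a[16])
{
    if (a[0] == 0xff) {                         /* ff00::/8 multicast */
        int type = ADDR_MULTICAST;

        if ((a[1] & 0x0f) == 0x02)              /* link-local scope */
            type |= ADDR_LINKLOCAL;
        return type;
    }
    if (a[0] == 0xfe && (a[1] & 0xc0) == 0x80)  /* fe80::/10 */
        return ADDR_UNICAST | ADDR_LINKLOCAL;
    return ADDR_UNICAST;
}

/* "Any bit" test: true if either flag is set. */
static int matches_either_flag(const uint8_t a[16])
{
    return (addr_type(a) & (ADDR_LINKLOCAL | ADDR_UNICAST)) != 0;
}

/* "Exact" test: true only for a link-local unicast address. */
static int is_linklocal_unicast(const uint8_t a[16])
{
    return addr_type(a) == (ADDR_UNICAST | ADDR_LINKLOCAL);
}
```

In this model the "any bit" test also accepts a link-local-scope multicast address such as ff02::1 and a global unicast address, while the exact comparison accepts only fe80::/10 unicast addresses.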



Re: 2.6.21 -> 2.6.22 & 2.6.23-rc8 performance regression

2007-10-01 Thread Denys
Not able to compile kernel with patch

drivers/built-in.o: In function `secure_tcp_sequence_number':
(.text+0x3ad02): undefined reference to `__divdi3'
make: *** [.tmp_vmlinux1] Error 1

On Mon, 01 Oct 2007 10:20:07 +0200, Eric Dumazet wrote
 Denys wrote:
  Well, I can experiment a bit more on live servers. I now have a hot-swap server
  with full gentoo, where I can rebuild any kernel you want, with any patch applied.
  But it does not look like plain overhead: the load becomes very spiky rather than
  just permanently higher. Also, it is not normal that the whole system becomes
  unresponsive (for example, ping 127.0.0.1 goes up to 300 ms for periods when
  softirq usage jumps to 100%).
 

 Could you try a pristine 2.6.22.9 and some patch in 
 secure_tcp_sequence_number() like :
 
 --- drivers/char/random.c.orig 2007-10-01 10:18:42.0 +0200
 +++ drivers/char/random.c 2007-10-01 10:19:58.0 +0200
 @@ -1554,7 +1554,7 @@
 * That's funny, Linux has one built in! Use it!
 * (Networks are faster now - should this be increased?)
 */
 - seq += ktime_get_real().tv64;
 + seq += ktime_get_real().tv64 / 1000;
 #if 0
 printk(init_seq(%lx, %lx, %d, %d) = %d\n,
 saddr, daddr, sport, dport, seq);
 
 Thank you
 
  On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote

  From: Eric Dumazet [EMAIL PROTECTED]
  Date: Mon, 01 Oct 2007 07:59:12 +0200
 
  
  No problem here on bigger servers, so I CC David Miller and netdev
  on this one.  AFAIK do_gettimeofday() and ktime_get_real() should
  use the same underlying hardware functions on PC and no performance
  problem should happen here.

  One thing that jumps out at me is that on 32-bit (and to a certain
  extent on 64-bit) there is a lot of stack accesses and missed
  optimizations because all of the work occurs, and gets expanded,
  inside of ktime_get_real().
 
  The timespec_to_ktime() inside of there constructs the ktime_t return
  value on the stack, then returns that as an aggregate to the caller.
 
  That cannot be without some cost.
 
  ktime_get_real() is definitely a candidate for inlining especially in
  these kinds of cases where we'll happily get computations in local
  registers instead of all of this on-stack nonsense.  And in several
  cases (if the caller only needs the tv_sec value, for example)
  computations can be elided entirely.
 
  It would be constructive to experiment and see if this is in fact 
  part of the problem.
  
 
 
  --
  Denys Fedoryshchenko
  Technical Manager
  Virtual ISP S.A.L.
 
 
 


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.



Re: [RFC] Make TCP prequeue configurable

2007-10-01 Thread Andi Kleen
David Miller [EMAIL PROTECTED] writes:
 
 Furthermore, prequeue puts the stack input processing work into user
 context, which means that the users will be charged more fairly for
 the work that is done for them.

For more details on this people might want to read the old Lazy Receiver
Processing papers: http://www.cs.rice.edu/CS/Systems/LRP/

-Andi



[PATCH 2/4] [TCP]: fix comments that got messed up during code move

2007-10-01 Thread Ilpo Järvinen
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2286361..135f046 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1467,8 +1467,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
return flag;
 }
 
-/* F-RTO can only be used if TCP has never retransmitted anything other than
- * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
+/* If we receive more dupacks than we expected counting segments
+ * in assumption of absent reordering, interpret this as reordering.
+ * The only another reason could be bug in receiver TCP.
  */
 static void tcp_check_reno_reordering(struct sock *sk, const int addend)
 {
@@ -1516,6 +1517,9 @@ static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
        tp->sacked_out = 0;
 }
 
+/* F-RTO can only be used if TCP has never retransmitted anything other than
+ * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
+ */
 int tcp_use_frto(struct sock *sk)
 {
const struct tcp_sock *tp = tcp_sk(sk);
-- 
1.5.0.6



[PATCH net-2.6.24 0/4]: TCP fixes

2007-10-01 Thread Ilpo Järvinen
Hi Dave,

This fixes the newreno fackets_out case, which turned out not to be
related to Cedric's case that is under investigation. Also included are
two trivial comment patches, and frto with high-speed seqno
wrap-around protection. Compile tested. Please apply to
net-2.6.24.

-- 
 i.




[PATCH 1/4] [TCP]: No fackets_out/highest_sack tuning when SACK isn't enabled

2007-10-01 Thread Ilpo Järvinen
This was found due to a bug report from Cedric Le Goater, though
it turned out to be an unrelated bug.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_output.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 94c8011..6199abe 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -660,7 +660,7 @@ static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned
 static void tcp_adjust_fackets_out(struct tcp_sock *tp, struct sk_buff *skb,
                                    int decr)
 {
-       if (!tp->sacked_out)
+       if (!tp->sacked_out || tcp_is_reno(tp))
                return;
 
        if (!before(tp->highest_sack, TCP_SKB_CB(skb)->seq))
@@ -712,7 +712,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss
        TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
        TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
 
-       if (tp->sacked_out && (TCP_SKB_CB(skb)->seq == tp->highest_sack))
+       if (tcp_is_sack(tp) && tp->sacked_out &&
+           (TCP_SKB_CB(skb)->seq == tp->highest_sack))
                tp->highest_sack = TCP_SKB_CB(buff)->seq;
 
        /* PSH and FIN should only be set in the second packet. */
@@ -1718,7 +1719,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m
        BUG_ON(tcp_skb_pcount(skb) != 1 ||
               tcp_skb_pcount(next_skb) != 1);
 
-       if (WARN_ON(tp->sacked_out &&
+       if (WARN_ON(tcp_is_sack(tp) && tp->sacked_out &&
            (TCP_SKB_CB(next_skb)->seq == tp->highest_sack)))
                return;
 
-- 
1.5.0.6
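
The guards added by this patch can be illustrated with a small stand-alone sketch. It assumes, as the patch title implies, that sacked_out can be non-zero on a connection without SACK while highest_sack carries no meaningful value there. The names struct toy_tp, sack_enabled and toy_fragment_fixup() are hypothetical and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the connection state used by the checks. */
struct toy_tp {
	int sack_enabled;        /* stands in for tcp_is_sack(tp) */
	unsigned int sacked_out;
	uint32_t highest_sack;
};

/* After splitting a segment that started at skb_seq, move highest_sack
 * to the second half (buff_seq), but only when SACK is actually in use. */
static void toy_fragment_fixup(struct toy_tp *tp, uint32_t skb_seq,
			       uint32_t buff_seq)
{
	assert(tp != NULL);
	if (tp->sack_enabled && tp->sacked_out &&
	    skb_seq == tp->highest_sack)
		tp->highest_sack = buff_seq;
}
```

With sack_enabled cleared, the function leaves highest_sack alone even if sacked_out is non-zero, which is the behaviour the added tcp_is_sack()/tcp_is_reno() tests aim for.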



[PATCH 3/4] [TCP]: Update comment of SACK block validator

2007-10-01 Thread Ilpo Järvinen
I just came across what RFC2018 states about generating valid
SACK blocks in case of reneging. Alter the comment a bit to point
this out clearly.

IMHO, there isn't any reason to change the code because the
validation is there for a purpose (counters will inform the user
about the decision TCP made if this case ever surfaces).

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 135f046..cec2611 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1027,8 +1027,15 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
  * SACK block range validation checks that the received SACK block fits to
  * the expected sequence limits, i.e., it is between SND.UNA and SND.NXT.
  * Note that SND.UNA is not included to the range though being valid because
- * it means that the receiver is rather inconsistent with itself (reports
- * SACK reneging when it should advance SND.UNA).
+ * it means that the receiver is rather inconsistent with itself reporting
+ * SACK reneging when it should advance SND.UNA. Such SACK block this is
+ * perfectly valid, however, in light of RFC2018 which explicitly states
+ * that SACK block MUST reflect the newest segment.  Even if the newest
+ * segment is going to be discarded ..., not that it looks very clever
+ * in case of head skb. Due to potentional receiver driven attacks, we
+ * choose to avoid immediate execution of a walk in write queue due to
+ * reneging and defer head skb's loss recovery to standard loss recovery
+ * procedure that will eventually trigger (nothing forbids us doing this).
  *
  * Implements also blockage to start_seq wrap-around. Problem lies in the
  * fact that though start_seq (s) is before end_seq (i.e., not reversed),
-- 
1.5.0.6



[PATCH 4/4] [TCP]: Wrap-safed reordering detection FRTO check

2007-10-01 Thread Ilpo Järvinen
In case somebody has a suggestion for a better place for this
check, which must be guaranteed to execute early enough (i.e.,
before the wrap can occur), I'm very open to it.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cec2611..e22ffe7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3024,6 +3024,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, seq_rtt);
 
+	/* Guarantee sacktag reordering detection against wrap-arounds */
+	if (before(tp->frto_highmark, tp->snd_una))
+		tp->frto_highmark = 0;
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 
-- 
1.5.0.6
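
Why a stale mark has to be cleared before the sequence space wraps can be seen from how 32-bit sequence numbers are compared. The sketch below is written in the style of the kernel's before() helper. seq_before(), struct toy_tp and toy_clear_stale_highmark() are illustrative names, and the struct is a hypothetical stand-in for the two tcp_sock fields the patch touches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wrap-safe "seq1 is before seq2" test for 32-bit sequence numbers.
 * The answer is only meaningful while the two values are less than
 * 2^31 apart; beyond that distance the result flips. */
static int seq_before(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) < 0;
}

/* Hypothetical stand-in for the fields used by the added check. */
struct toy_tp {
	uint32_t snd_una;
	uint32_t frto_highmark;
};

/* Mirrors the added lines: once snd_una has passed the mark, clear it
 * so it is never compared against sequence numbers 2^31 or more later. */
static void toy_clear_stale_highmark(struct toy_tp *tp)
{
	assert(tp != NULL);
	if (seq_before(tp->frto_highmark, tp->snd_una))
		tp->frto_highmark = 0;
}
```

A mark left at 1000 still looks "before" a sequence number 0x7FFFFFFF bytes later, but no longer does once the distance exceeds 2^31, which is the situation the check is meant to rule out.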



[PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Gerrit Renker
[TCP]: break missing at end of switch statement

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
---
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
 		return;
 	default:
 		sk->sk_err = ECONNRESET;
+		break;
 	}
 
 	if (!sock_flag(sk, SOCK_DEAD))


Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Al Viro
On Mon, Oct 01, 2007 at 01:32:43PM +0100, Gerrit Renker wrote:
 [TCP]: break missing at end of switch statement
 
 Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
 ---
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
   return;
   default:
   sk->sk_err = ECONNRESET;
 + break;
   }

Huh?  Why on the Earth would that be a problem?


RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

2007-10-01 Thread Kanevsky, Arkady
Sean,
Not so simple.
How does a client application know where to connect?
Does this proposal force applications to choose
the right network?
Currently, MPA or the ULP handles it, not applications.
Why would we want to change that?

Sean,
I may be beating a dead horse,
but I recall that one of the main selling points
of RDMA is that it magically boosts performance with
no changes to applications. Just plug it in and voilà,
performance goes up and CPU utilization for the network
stack goes down. Win-win.

Thanks,

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Sean Hefty [mailto:[EMAIL PROTECTED] 
 Sent: Friday, September 28, 2007 5:35 PM
 To: Kanevsky, Arkady
 Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; 
 [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: 
 Support iwarp-only interfaces to avoid 4-tuple conflicts.
 
 Kanevsky, Arkady wrote:
  Exactly,
  it forces the burden on administrator.
  And one will be forced to try one mount for iWARP and it 
 does not work 
  issue another one TCP or UDP if it fails.
  Yack!
  
  And server will need to listen on different IP address and simple
  * will not work since it will need to listen in two 
 different domains.
 
 The server already has to call listen twice.  Once for the 
 rdma_cm and once for sockets.  Similarly on the client side, 
 connect must be made over rdma_cm or sockets.  I really don't 
 see any impact on the application for this approach.
 
 We just end up separating the port space based on networking 
 addresses, rather than keeping the problem at the transport 
 level.  If you have an alternate approach that will be 
 accepted upstream, feel free to post it.
 
 - Sean
 ___
 general mailing list
 [EMAIL PROTECTED]
 http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
 
 To unsubscribe, please visit 
 http://openib.org/mailman/listinfo/openib-general
 


[PATCH] make netlink processing routines semi-synchronious (inspired by rtnl)

2007-10-01 Thread Denis V. Lunev
The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
like outdated copy/paste from rtnetlink.c. Push them into sync with the
original.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

--- ./net/netfilter/nfnetlink.c.nlk32007-10-01 09:47:53.0 +0400
+++ ./net/netfilter/nfnetlink.c 2007-10-01 16:09:44.0 +0400
@@ -44,26 +44,14 @@ static struct sock *nfnl = NULL;
 static const struct nfnetlink_subsystem *subsys_table[NFNL_SUBSYS_COUNT];
 static DEFINE_MUTEX(nfnl_mutex);
 
-static void nfnl_lock(void)
+static inline void nfnl_lock(void)
 {
 	mutex_lock(&nfnl_mutex);
 }
 
-static int nfnl_trylock(void)
-{
-	return !mutex_trylock(&nfnl_mutex);
-}
-
-static void __nfnl_unlock(void)
-{
-	mutex_unlock(&nfnl_mutex);
-}
-
-static void nfnl_unlock(void)
+static inline void nfnl_unlock(void)
 {
 	mutex_unlock(&nfnl_mutex);
-	if (nfnl->sk_receive_queue.qlen)
-		nfnl->sk_data_ready(nfnl, 0);
 }
 
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
@@ -149,7 +137,7 @@ static int nfnetlink_rcv_msg(struct sk_b
 #ifdef CONFIG_KMOD
/* don't call nfnl_unlock, since it would reenter
 * with further packet processing */
-   __nfnl_unlock();
+   nfnl_unlock();
request_module("nfnetlink-subsys-%d", NFNL_SUBSYS_ID(type));
nfnl_lock();
ss = nfnetlink_get_subsys(type);
@@ -188,10 +176,9 @@ static void nfnetlink_rcv(struct sock *s
unsigned int qlen = 0;
 
do {
-   if (nfnl_trylock())
-   return;
+   nfnl_lock();
qlen = netlink_run_queue(sk, qlen, nfnetlink_rcv_msg);
-   __nfnl_unlock();
+   nfnl_unlock();
} while (qlen);
 }
 
--- ./net/netlink/genetlink.c.nlk3  2007-08-26 19:30:38.0 +0400
+++ ./net/netlink/genetlink.c   2007-10-01 16:05:29.0 +0400
@@ -22,22 +22,14 @@ struct sock *genl_sock = NULL;
 
 static DEFINE_MUTEX(genl_mutex); /* serialization of message processing */
 
-static void genl_lock(void)
+static inline void genl_lock(void)
 {
 	mutex_lock(&genl_mutex);
 }
 
-static int genl_trylock(void)
-{
-	return !mutex_trylock(&genl_mutex);
-}
-
-static void genl_unlock(void)
+static inline void genl_unlock(void)
 {
 	mutex_unlock(&genl_mutex);
-
-	if (genl_sock && genl_sock->sk_receive_queue.qlen)
-		genl_sock->sk_data_ready(genl_sock, 0);
 }
 
 #define GENL_FAM_TAB_SIZE  16
@@ -483,8 +475,7 @@ static void genl_rcv(struct sock *sk, in
unsigned int qlen = 0;
 
do {
-   if (genl_trylock())
-   return;
+   genl_lock();
qlen = netlink_run_queue(sk, qlen, genl_rcv_msg);
genl_unlock();
} while (qlen && genl_sock && genl_sock->sk_receive_queue.qlen);


Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Gerrit Renker
Quoting Al Viro:
|  On Mon, Oct 01, 2007 at 01:32:43PM +0100, Gerrit Renker wrote:
|   [TCP]: break missing at end of switch statement
|   
|   Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
|   ---
|   --- a/net/ipv4/tcp_input.c
|   +++ b/net/ipv4/tcp_input.c
|   @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
|  return;
|  default:
|  sk->sk_err = ECONNRESET;
|   +  break;
|  }
|  
|  Huh?  Why on the Earth would that be a problem?
|  
|  
Sorry what is your question?


Re: [PATCH] make netlink processing routines semi-synchronious (inspired by rtnl)

2007-10-01 Thread Patrick McHardy
Denis V. Lunev wrote:
 The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
 like outdated copy/paste from rtnetlink.c. Push them into sync with the
 original.
 

  int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
 @@ -149,7 +137,7 @@ static int nfnetlink_rcv_msg(struct sk_b
  #ifdef CONFIG_KMOD
   /* don't call nfnl_unlock, since it would reenter
* with further packet processing */
 - __nfnl_unlock();
 + nfnl_unlock();


That comment should be updated/deleted. Rest looks good to me.


Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Al Viro
On Mon, Oct 01, 2007 at 02:02:10PM +0100, Gerrit Renker wrote:
 Quoting Al Viro:
 |  On Mon, Oct 01, 2007 at 01:32:43PM +0100, Gerrit Renker wrote:
 |   [TCP]: break missing at end of switch statement
 |   
 |   Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
 |   ---
 |   --- a/net/ipv4/tcp_input.c
 |   +++ b/net/ipv4/tcp_input.c
 |   @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
 |return;
 |default:
 |sk->sk_err = ECONNRESET;
 |   +break;
 |}
 |  
 |  Huh?  Why on the Earth would that be a problem?
 |  
 |  
 Sorry what is your question?

Why the hell is $Subject a problem that warrants any patches whatsoever?


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread jamal
On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:
 jamal wrote:

  +   while ((skb = __skb_dequeue(skbs)) != NULL)
  +   q->ops->requeue(skb, q);
 
 
 ->requeue queues at the head, so this looks like it would reverse
 the order of the skbs.

Excellent catch!  thanks; i will fix.

As a side note: Any batching driver should _never_ have to requeue; if
it does it is buggy. And the non-batching ones if they ever requeue will
be a single packet, so not much reordering.

Thanks again Patrick.

cheers,
jamal
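
The ordering problem Patrick points out can be reproduced with a toy list. The sketch below uses packet ids in a small array instead of real skbs; struct toy_list and the toy_* helpers are hypothetical. Walking the batch from its tail is shown only as one possible fix; the thread does not say which fix will be used.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP 16

/* Tiny array-backed packet list; index 0 is the head. Hypothetical
 * stand-in for an sk_buff list, holding packet ids instead of skbs. */
struct toy_list {
	int pkt[TOY_CAP];
	size_t len;
};

static void toy_push_head(struct toy_list *l, int id)
{
	size_t i;

	assert(l->len < TOY_CAP);
	for (i = l->len; i > 0; i--)
		l->pkt[i] = l->pkt[i - 1];
	l->pkt[0] = id;
	l->len++;
}

static int toy_pop_head(struct toy_list *l)
{
	int id;
	size_t i;

	assert(l->len > 0);
	id = l->pkt[0];
	for (i = 1; i < l->len; i++)
		l->pkt[i - 1] = l->pkt[i];
	l->len--;
	return id;
}

static int toy_pop_tail(struct toy_list *l)
{
	assert(l->len > 0);
	return l->pkt[--l->len];
}

/* Pattern from the quoted hunk: dequeue at the head, requeue at the head. */
static void toy_requeue_from_head(struct toy_list *batch, struct toy_list *q)
{
	while (batch->len > 0)
		toy_push_head(q, toy_pop_head(batch));
}

/* One possible fix: walk the batch from its tail instead. */
static void toy_requeue_from_tail(struct toy_list *batch, struct toy_list *q)
{
	while (batch->len > 0)
		toy_push_head(q, toy_pop_tail(batch));
}
```

Requeueing the batch 1, 2, 3 in front of a queue holding 4, 5 gives 3, 2, 1, 4, 5 with the first loop and 1, 2, 3, 4, 5 with the second.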



Re: [PATCH 01/10] Preparatory refactoring part 1.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 Make a new function sfq_q_enqueue() that operates directly on the
 queue data. This will be useful for implementing sfq_change() in
 a later patch. A pleasant side-effect is reducing most of the
 duplicate code in sfq_enqueue() and sfq_requeue().
 
 Similarly, make a new function sfq_q_dequeue().
 
 Signed-off-by: Corey Hickey [EMAIL PROTECTED]
 ---
  net/sched/sch_sfq.c |   72 
 +++
  1 files changed, 38 insertions(+), 34 deletions(-)
 
 diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
 index 3a23e30..57485ef 100644
 --- a/net/sched/sch_sfq.c
 +++ b/net/sched/sch_sfq.c
 


The sfq_q_enqueue part looks fine.

  
 - sch->qstats.drops++;


A line in the changelog explaining that this was incremented twice
would have been nice.

   sfq_drop(sch);
   return NET_XMIT_CN;
  }
  
 -
 -
 -
 -static struct sk_buff *
 -sfq_dequeue(struct Qdisc* sch)
 +static struct
 +sk_buff *sfq_q_dequeue(struct sfq_sched_data *q)


What is this function needed for?


Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression

2007-10-01 Thread Denys
Resending for the mailing lists (the original was discarded as spam because of encoding issues).

Everything looks fine, for sure. Confirmed on a second server.

On Mon, 01 Oct 2007 10:20:07 +0200, Eric Dumazet wrote

  Well, I can play a bit more on live servers. I now have a hot-swap server
  with a full Gentoo install, where I can rebuild any kernel you want, with
  any patch applied.
  But it looks less like overhead: the load becomes high and too spiky, and
  it is not just permanently higher. Also, it is not normal that the whole
  system becomes unresponsive (for example, ping 127.0.0.1 rises to 300ms
  for the period when softirq usage jumps to 100%).
 

 Could you try a pristine 2.6.22.9 and some patch in 
 secure_tcp_sequence_number() like :
 
 --- drivers/char/random.c.orig 2007-10-01 10:18:42.0 +0200
 +++ drivers/char/random.c 2007-10-01 10:19:58.0 +0200
 @@ -1554,7 +1554,7 @@
 * That's funny, Linux has one built in! Use it!
 * (Networks are faster now - should this be increased?)
 */
 - seq += ktime_get_real().tv64;
 + seq += ktime_get_real().tv64 / 1000;
 #if 0
 printk("init_seq(%lx, %lx, %d, %d) = %d\n",
 saddr, daddr, sport, dport, seq);
 
 Thank you
 
  On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote

  From: Eric Dumazet [EMAIL PROTECTED]
  Date: Mon, 01 Oct 2007 07:59:12 +0200
 
  
  No problem here on bigger servers, so I CC David Miller and netdev
  on this one.  AFAIK do_gettimeofday() and ktime_get_real() should
  use the same underlying hardware functions on PC and no performance
  problem should happen here.

  One thing that jumps out at me is that on 32-bit (and to a certain
  extent on 64-bit) there is a lot of stack accesses and missed
  optimizations because all of the work occurs, and gets expanded,
  inside of ktime_get_real().
 
  The timespec_to_ktime() inside of there constructs the ktime_t return
  value on the stack, then returns that as an aggregate to the caller.
 
  That cannot be without some cost.
 
  ktime_get_real() is definitely a candidate for inlining especially in
  these kinds of cases where we'll happily get computations in local
  registers instead of all of this on-stack nonsense.  And in several
  cases (if the caller only needs the tv_sec value, for example)
  computations can be elided entirely.
 
  It would be constructive to experiment and see if this is in fact 
  part of the problem.
  
 
 
  --
  Denys Fedoryshchenko
  Technical Manager
  Virtual ISP S.A.L.
 
 
 


--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
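
For reference, the effect of the "/ 1000" in the test patch is plain arithmetic: it turns a nanosecond count into a microsecond count, so a value truncated to 32 bits takes 1000 times longer to wrap. The sketch below only works through those numbers. The thread has not established that the wrap period is what causes the regression, and toy_wrap_seconds() and toy_ns_to_us() are hypothetical helper names.

```c
#include <stdint.h>

/* Seconds until a 32-bit counter wraps when it advances at
 * ticks_per_sec ticks per second (2^32 / rate). */
static double toy_wrap_seconds(double ticks_per_sec)
{
	return 4294967296.0 / ticks_per_sec;
}

/* Truncating conversion from nanoseconds to microseconds, i.e. the
 * arithmetic effect of the "/ 1000" in the proposed test patch. */
static uint64_t toy_ns_to_us(uint64_t ns)
{
	return ns / 1000u;
}
```

At nanosecond resolution the 32-bit value wraps roughly every 4.3 seconds; at microsecond resolution it wraps roughly every 4295 seconds, a little under 72 minutes.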



Re: [PATCH 03/10] Move two functions.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 Move sfq_q_destroy() to above sfq_q_init() so that it can be used
 by an error case in a later patch.
 
 Move sfq_destroy() as well, for clarity.


This patch looks pointless, just put them where you need them
in the patch introducing them.


Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-01 Thread jamal
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:

 Have you done performance comparisons for the case of using 9000-byte
 jumbo frames?

I haven't, but I will try if any of the GigE cards I have support it.

As a side note: I have not seen any useful gains or losses as the packet
size approaches even a 1500B MTU. For example, beyond about 256B neither
batching nor non-batching makes much difference in either throughput
or CPU use. Below 256B, there's a noticeable gain for batching.
Note that in my tests all 4 CPUs are in full-throttle UDP, so
the occupancy of both the qdisc queue(s) and the ethernet ring is
constantly high. For example, at 512B the app is 80% idle on all 4 CPUs
and we are hitting close to wire speed. We are at 90% idle at
1024B. This is the case with or without batching. So my suspicion is
that, following that trend, a 9000B packet will show the same pattern.


cheers,
jamal



Re: [PATCH 02/10] Preparatory refactoring part 2.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 The sfq_destroy() --> sfq_q_destroy() change looks pointless here,
 but it's cleaner to split now and add code to sfq_q_destroy() in a
 later patch.
 
 +static void sfq_destroy(struct Qdisc *sch)
 +{
 + struct sfq_sched_data *q = qdisc_priv(sch);
 + sfq_q_destroy(q);
 +}


It does look pointless; after applying all patches, sfq_destroy still
remains a simple wrapper around sfq_q_destroy.


Re: [PATCH 05/10] Add divisor.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 Make hash divisor user-configurable.
 

 @@ -120,7 +121,7 @@ static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1
   /* Have we any rotation primitives? If not, WHY? */
   h ^= (h1<<pert) ^ (h1>>(0x1F - pert));
   h ^= h>>10;
 - return h & 0x3FF;
 + return h & (q->hash_divisor-1);


This assumes that hash_divisor is a power of two, but this is
not enforced anywhere.
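
The point about powers of two can be checked with a few lines of C. Masking with (divisor - 1) matches the remainder modulo divisor only when the divisor is a power of two; for other values some buckets can never be selected. The helper names below are hypothetical, and toy_check_divisor() is just one way such validation might look.

```c
#include <assert.h>
#include <stdint.h>

/* True when n is a non-zero power of two. */
static int toy_is_pow2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Masking as in the quoted hunk; this equals h % divisor only when
 * divisor is a power of two. */
static uint32_t toy_fold_mask(uint32_t h, uint32_t divisor)
{
	assert(divisor != 0);
	return h & (divisor - 1);
}

/* Hypothetical validation helper: 0 if the divisor is usable with the
 * mask, -1 otherwise. */
static int toy_check_divisor(uint32_t divisor)
{
	return toy_is_pow2(divisor) ? 0 : -1;
}
```

With a divisor of 1000 the mask is 999, which has bits 3 and 4 clear, so a hash value of 24 folds to bucket 0 rather than 24.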


Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Gerrit Renker
Quoting YOSHIFUJI Hideaki:
| 
|   [TCP]: break missing at end of switch statement
|   
|   Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
|   ---
|   --- a/net/ipv4/tcp_input.c
|   +++ b/net/ipv4/tcp_input.c
|   @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
|  return;
|  default:
|  sk->sk_err = ECONNRESET;
|   +  break;
|  }
|
|  if (!sock_flag(sk, SOCK_DEAD))
|  
|  NAK; it is not required at all.
|  
|  --yoshfuji
|  
If what you are saying were true, then the statement

   `sk->sk_err = ECONNRESET;'

could go as well, since it would always be overridden.
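
For reference, here is how C treats the last label of a switch, sketched with a toy function. The states and toy_reset_err() are hypothetical and only mimic the shape of the quoted hunk. After the final statement of the last label, control leaves the switch whether or not a break is present, and the value assigned there is kept.

```c
#include <errno.h>

/* Hypothetical connection states, standing in for the TCP states
 * that the switch in the quoted hunk dispatches on. */
enum toy_state { TOY_SYN_SENT, TOY_CLOSE, TOY_ESTABLISHED };

/* Returns the error code that would be left behind, or -1 for the
 * early-return case. with_break selects whether the default label
 * ends in an explicit break. */
static int toy_reset_err(enum toy_state state, int with_break)
{
	int err = 0;

	switch (state) {
	case TOY_SYN_SENT:
		err = ECONNREFUSED;
		break;
	case TOY_CLOSE:
		return -1;
	default:
		err = ECONNRESET;
		if (with_break)
			break;
	}
	/* Control arrives here from the last label with or without break. */
	return err;
}
```

Both variants return ECONNRESET for a state handled by the default label, so the two forms behave identically.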


Re: [PATCH 06/10] Make qdisc changeable.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 Re-implement sfq_change() and enable Qdisc_opts.change so tc qdisc
 change will work.
 

 +static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
 +{
 + ...
 +
 + /* finish up */
 + if (q->perturb_period) {
 + q->perturb_timer.expires = jiffies + q->perturb_period;
 + add_timer(&q->perturb_timer);
 + } else {
 + q->perturbation = 0;


It seems counter-productive to explicitly set it to zero, since
the old value was still used while transferring the packets.
So I'd suggest removing this or, alternatively,
setting it to the final value *before* transferring the packets.


Re: [PATCH 09/10] Change perturb_period to unsigned.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 perturb_period is currently a signed integer, but I can't see any good
 reason why this is so--a negative perturbation period will add a timer
 that expires in the past, causing constant perturbation, which makes
 hashing useless.
 
   if (q->perturb_period) {
   q->perturb_timer.expires = jiffies + q->perturb_period;
   add_timer(&q->perturb_timer);
   }
 
 Strictly speaking, this will break binary compatibility with older
 versions of tc, but that ought not to be a problem because (a) there's
 no valid use for a negative perturb_period, and (b) negative values
 will be seen as high values (> INT_MAX), which don't work anyway.
 
 If perturb_period is too large, (perturb_period * HZ) will overflow the
 size of an unsigned int and wrap around. So, check for that and reject
 values that are too high.


Sounds reasonable.

 --- a/net/sched/sch_sfq.c
 +++ b/net/sched/sch_sfq.c
 @@ -74,6 +74,9 @@
  typedef unsigned int sfq_index;
  #define SFQ_MAX_DEPTH (UINT_MAX / 2 - 1)
  
 +/* We don't want perturb_period * HZ to overflow an unsigned int. */
 +#define SFQ_MAX_PERTURB (UINT_MAX / HZ)


jiffies are unsigned long.
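
The overflow check described in the changelog can be sketched as follows. TOY_HZ is an assumed tick rate chosen for illustration, and toy_perturb_to_ticks() is a hypothetical helper. Which type the bound should be derived from depends on where the product ends up, so both the unsigned int and the unsigned long limits are shown.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed tick rate for illustration only; the kernel's HZ is a
 * build-time configuration value. */
#define TOY_HZ 1000u

/* Largest period in seconds whose tick count still fits the type. */
#define TOY_MAX_PERTURB_UINT  (UINT_MAX / TOY_HZ)
#define TOY_MAX_PERTURB_ULONG (ULONG_MAX / TOY_HZ)

/* Converts seconds to ticks, rejecting values that would wrap an
 * unsigned int. Returns 0 on success and -1 (standing in for
 * -EINVAL) when the product would overflow. */
static int toy_perturb_to_ticks(unsigned int seconds, unsigned int *ticks)
{
	assert(ticks != NULL);
	if (seconds > TOY_MAX_PERTURB_UINT)
		return -1;
	*ticks = seconds * TOY_HZ;
	return 0;
}
```

Comparing against the quotient before multiplying avoids relying on wrapped results to detect the overflow after the fact.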


Re: [RFC/PATCH 3/3] UDP memory usage accounting (take 2): measurement

2007-10-01 Thread Satoshi OSHIMA
Evgeniy Polyakov wrote:
 On Fri, Sep 28, 2007 at 10:41:31PM +0900, Satoshi OSHIMA
([EMAIL PROTECTED]) wrote:
 This patch introduces memory usage measurement for UDP.

 These 3 points were updated.

 - UDP specific codes in IP layer were removed.

 - atomic_sub() in a loop was removed

 - accounting during socket destruction

 Another approach is to account only at the highest UDP layer and having
 datagram skb destructor just like it is done in TCP, but this approach
 is also reasonable.


This patch set tries to introduce memory accounting by the page,
as TCP does. ip_append_data() merges payloads into an sk_buff
if the previous sk_buff has enough space. The problem is that
udp_append_data() doesn't know whether this merge happens or not.

If the accounting must be in the UDP layer, we need to change
the interface of ip_append_data() so the caller knows whether this
merge happened.

Once the interface is changed, we have to update the other
protocol stacks to keep up with the change.

But I didn't want to do that, in order to keep this patch set small
as a first step.


 I already told that patches 1 and 3 have broken indent, please fix that.

Oops! I will fix that.


 A hint: when you are about to submit something network related for
 inclusion and strongly believe it is ready, it may not be a bad idea to add
 David Miller [EMAIL PROTECTED] to the copy list. He can complain about
 backlog and so on, but will read your mail twice :) but do not tell anyone.

Thank you for your advice. I will do that!

Satoshi Oshima


Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread YOSHIFUJI Hideaki
In article [EMAIL PROTECTED] (at Mon, 1 Oct 2007 13:32:43 +0100), Gerrit 
Renker [EMAIL PROTECTED] says:

 [TCP]: break missing at end of switch statement
 
 Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
 ---
 --- a/net/ipv4/tcp_input.c
 +++ b/net/ipv4/tcp_input.c
 @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
   return;
   default:
   sk->sk_err = ECONNRESET;
 + break;
   }
  
   if (!sock_flag(sk, SOCK_DEAD))

NAK; it is not required at all.

--yoshfuji


Re: [PATCH 10/10] Use nested compat attributes to pass parameters.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 This fixes the ambiguity between, for example:
 tc qdisc change ... perturb 0
 tc qdisc change ...
 
 Without this patch, there is no way for SFQ to differentiate between
 a parameter specified to be 0 and a parameter that was omitted.
 

 diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
 index 170fd37..36197f6 100644
 --- a/net/sched/sch_sfq.c
 +++ b/net/sched/sch_sfq.c
 @@ -428,25 +428,31 @@ sfq_q_init(struct sfq_sched_data *q, struct rtattr *opt)
* the previous values (sfq_change). So, overwrite the parameters as
* specified. */
   if (opt) {
 - struct tc_sfq_qopt *ctl = RTA_DATA(opt);
 -
 - if (opt->rta_len < RTA_LENGTH(sizeof(*ctl)))
 - return -EINVAL;
 -
 - if (ctl->quantum)
 - q->quantum = ctl->quantum;
 - if (ctl->perturb_period)
 - q->perturb_period = ctl->perturb_period;
 - if (ctl->divisor)
 - q->hash_divisor = ctl->divisor;
 - if (ctl->flows)
 - q->depth = ctl->flows;
 - if (ctl->limit)
 - q->limit = ctl->limit;
 -
 + struct tc_sfq_qopt *ctl;
 + struct rtattr *tb[TCA_SFQ_MAX];
 +
 + if (rtattr_parse_nested_compat(tb, TCA_SFQ_MAX, opt, ctl,
 +sizeof(*ctl)))
 + goto rtattr_failure;
 +
 +#define GET_PARAM(dst, nest, compat) do { \
 + struct rtattr *rta = tb[(nest) - 1]; \
 + if (rta) \
 + (dst) = RTA_GET_U32(rta); \
 + else if ((compat)) \
 + (dst) = (compat); \
 +} while (0)


An inline function and a comment why this is done would increase
readability.

 +
 + GET_PARAM(q->quantum,        TCA_SFQ_QUANTUM, ctl->quantum);
 + GET_PARAM(q->perturb_period, TCA_SFQ_PERTURB,
 + ctl->perturb_period);
 + GET_PARAM(q->hash_divisor,   TCA_SFQ_DIVISOR, ctl->divisor);
 + GET_PARAM(q->depth,          TCA_SFQ_FLOWS,   ctl->flows);
 + GET_PARAM(q->limit,          TCA_SFQ_LIMIT,   ctl->limit);
 + 
   if (q->perturb_period > SFQ_MAX_PERTURB ||
   q->depth > SFQ_MAX_DEPTH)
 - return -EINVAL;
 + goto rtattr_failure;
   }
   q->limit = min_t(u32, q->limit, q->depth - 2);
   q->tail = q->depth;
 @@ -482,6 +488,8 @@ sfq_q_init(struct sfq_sched_data *q, struct rtattr *opt)
   for (i=0; i < q->depth; i++)
   sfq_link(q, i);
   return 0;
 +rtattr_failure:
 + return -EINVAL;
  err_case:
   sfq_q_destroy(q);
   return -ENOBUFS;
 @@ -559,17 +567,26 @@ static int sfq_dump(struct Qdisc *sch, struct sk_buff 
 *skb)
  {
   struct sfq_sched_data *q = qdisc_priv(sch);
   unsigned char *b = skb_tail_pointer(skb);
 + struct rtattr *nest;
   struct tc_sfq_qopt opt;
  
   opt.quantum = q->quantum;
   opt.perturb_period = q->perturb_period;
 -
   opt.limit = q->limit;
   opt.divisor = q->hash_divisor;
   opt.flows = q->depth;
  
 + nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
 +
 + RTA_PUT_U32(skb, TCA_SFQ_QUANTUM, q->quantum);
 + RTA_PUT_U32(skb, TCA_SFQ_PERTURB, q->perturb_period);
 + RTA_PUT_U32(skb, TCA_SFQ_LIMIT,   q->limit);
 + RTA_PUT_U32(skb, TCA_SFQ_DIVISOR, q->hash_divisor);
 + RTA_PUT_U32(skb, TCA_SFQ_FLOWS,   q->depth);
   RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);


This is wrong, RTA_NEST_COMPAT already dumps the structure.

  
 + RTA_NEST_COMPAT_END(skb, nest);
 +
   return skb->len;
  
  rtattr_failure:
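
The suggestion to replace GET_PARAM with a commented inline function could look roughly like the sketch below. It is not kernel code. struct toy_attr is a simplified, hypothetical stand-in for a parsed nested attribute, and the real struct rtattr and RTA_GET_U32() are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified attribute: a presence flag plus a 32-bit
 * payload. */
struct toy_attr {
	int present;
	uint32_t value;
};

/*
 * Pick a parameter value with this precedence:
 *   1. the nested attribute, if the sender supplied one (0 is valid);
 *   2. otherwise the legacy struct field, if it is non-zero;
 *   3. otherwise leave *dst untouched, keeping the previous setting.
 * This is what lets "perturb 0" be told apart from an omitted option.
 */
static inline void toy_get_param(uint32_t *dst, const struct toy_attr *nest,
				 uint32_t compat)
{
	assert(dst != NULL);
	if (nest != NULL && nest->present)
		*dst = nest->value;
	else if (compat != 0)
		*dst = compat;
}
```

A function also evaluates each argument exactly once, which the macro form does not guarantee for its compat argument.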



Re: [RFC/PATCH 0/3] UDP memory usage accounting

2007-10-01 Thread Satoshi OSHIMA
Herbert Xu wrote:
 On Fri, Sep 28, 2007 at 09:51:59PM -0700, David Miller wrote:
 There is a per-socket send buffer limit, and there is a per-user open
 file descriptor limit.  Multiply the two to determine how much system
 memory the user can consume using sockets.

 We do have these limits but they're per-process, not per-user.
 Unless you lock down the number of processes each user can have
 to no more than a handful then this is basically useless.

 For example, let's say each socket can lock down 64K of kernel
 memory (which is quite easy to do BTW, just open a TCP/UDP socket,
 send data to it from another socket but keep the data in the
 socket by not calling recvmsg), and that each process can have
 1024 file descriptors (the default), then each process can pin

 64K x 1024 = 64M

 of memory.  So if the user can have 10 processes, then that's
 640M of kernel memory that can be pinned down.  Usually the
 process limit is at least 10 times higher.

Thank you very much for your comment.

What you pointed out is my motivation for making this patch.
I think that per-process limits won't help solve this
problem.
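
Herbert's figures can be reproduced with a one-line calculation. The sketch below takes the quoted numbers (64K per socket, 1024 descriptors per process, 10 processes) as inputs; toy_pinned_bytes() is a hypothetical helper name.

```c
#include <stdint.h>

/* Upper bound on kernel memory pinned through unread socket buffers:
 * bytes per socket x descriptors per process x number of processes. */
static uint64_t toy_pinned_bytes(uint64_t per_socket_bytes,
				 uint64_t fds_per_process,
				 uint64_t processes)
{
	return per_socket_bytes * fds_per_process * processes;
}
```

One process gives 64K x 1024 = 64M, and ten processes give 640M, matching the quoted example. The bound grows linearly with the process count, which is why a per-process limit alone does not cap it.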


Re: [RFC/PATCH 0/3] UDP memory usage accounting

2007-10-01 Thread Satoshi OSHIMA
 On Fri, Sep 28, 2007 at 09:47:37PM -0700, David Miller wrote:
 There are two things we (might) need to guard against, one local and
 one remote.

 Right I was focusing on the local threat.

 If you do a per-user limit, apache would basically just stop at that
 redzone point.  In some sense making the attack more effective because
 then it's trivial to shut down an entire web server this way.

 Having a per-user limit doesn't necessarily mean that we have
 to apply the limit differently to how we apply the system-wide
 limits.  We could keep exactly the same code as we have now but
 check against a per-user limit instead of a system-wide one.

 In other words your apache scenario will continue to work as is
 even with a per-user limit.

I'm afraid that a per-user limit won't work for the system administrator,
because he can't know who the rogue user is in advance (before
such an attack is made). And once the attack is made, the system will
not respond because of the lack of memory for slab.

So if he only has a per-user limit, he needs to split the memory
budget for UDP among the users. The limit per user will be very
small if the number of users in the system is large.


 Now where it does become useful is when we have a rogue local
 user.  As it is that user can chew up all of the budgeted TCP
 memory by simply not calling recvmsg.  As I've stated in the
 other email, the existing rlimits don't help because they're
 per-process rather than per-user.

 BTW, this is not fatal for TCP because TCP provides a minimum
 amount of memory for each socket even when we are over the
 limit.  However, if this was implemented for UDP without
 a minimum guarantee then it'd be quite useless.
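
A minimal userspace sketch of the "limit plus per-socket minimum" idea described above, assuming a simple byte counter. The struct mem_budget type and budget_charge() are hypothetical names invented for this illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mem_budget {
    size_t limit;        /* protocol-wide cap, in bytes */
    size_t allocated;    /* bytes currently charged to all sockets */
    size_t per_sock_min; /* floor each socket may use even over the cap */
};

/* Returns true and charges the budget if the socket, which already
 * holds sock_used bytes, may queue size more bytes. */
static bool budget_charge(struct mem_budget *b, size_t sock_used, size_t size)
{
    assert(b != NULL);

    bool under_limit = b->allocated + size <= b->limit;
    bool under_floor = sock_used + size <= b->per_sock_min;

    if (!under_limit && !under_floor)
        return false;

    b->allocated += size;
    return true;
}
```

The floor is what keeps a well-behaved socket working while another user holds the shared budget at its limit.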

Hmm, I didn't realize that. Thank you for your good
suggestion. I will think of it.


 I see no valid argument against doing something similar for sockets.
 Such a register_shrinker() handler for TCP could, for example, look
 for TCP flows which haven't made forward progress in more than a
 certain amount of time and attempt to trim SKB memory from them.

 Yes I agree this would be quite useful for sending.  However, it'll
 be tough to shrink skbs that we've already acked for but the app
 for some reason has decided to leave in the socket by not calling
 recvmsg.

 UDP and other datagram sockets are troublesome because the memory
 gets wholly tied up immediately during the send call and it's not
 easy to liberate anything.  The nice part about datagram sockets,
 however, is that they make forward progress quickly and their
 memory is liberated as soon as the device transmits the packet.
 They don't have to wait for ACKs, windows opening up, or anything
 like that to happen.

 Agreed.  Also the recvmsg case I've described above is much
 simpler for UDP as we can just go through all the sockets and
 free skbs at random :)

 To be honest I don't even think UDP is much of a real problem for this
 reason.

 It's not a hard problem but we do need to have some code for it.

I believe so. Currently, a malicious user can easily stop the system
without root privilege. This may not be a serious problem, but
it is a problem that should be fixed.



[PATCH] make netlink processing routines semi-synchronious (inspired by rtnl) v2

2007-10-01 Thread Denis V. Lunev
The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
like outdated copy/paste from rtnetlink.c. Push them into sync with the
original.

Changes from v1:
- deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

--- ./net/netfilter/nfnetlink.c.nlk3	2007-10-01 09:47:53.0 +0400
+++ ./net/netfilter/nfnetlink.c	2007-10-01 17:13:09.0 +0400
@@ -44,26 +44,14 @@ static struct sock *nfnl = NULL;
 static const struct nfnetlink_subsystem *subsys_table[NFNL_SUBSYS_COUNT];
 static DEFINE_MUTEX(nfnl_mutex);
 
-static void nfnl_lock(void)
+static inline void nfnl_lock(void)
 {
 	mutex_lock(&nfnl_mutex);
 }
 
-static int nfnl_trylock(void)
-{
-	return !mutex_trylock(&nfnl_mutex);
-}
-
-static void __nfnl_unlock(void)
-{
-	mutex_unlock(&nfnl_mutex);
-}
-
-static void nfnl_unlock(void)
+static inline void nfnl_unlock(void)
 {
 	mutex_unlock(&nfnl_mutex);
-	if (nfnl->sk_receive_queue.qlen)
-		nfnl->sk_data_ready(nfnl, 0);
 }
 
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
@@ -147,9 +135,7 @@ static int nfnetlink_rcv_msg(struct sk_b
 	ss = nfnetlink_get_subsys(type);
 	if (!ss) {
 #ifdef CONFIG_KMOD
-		/* don't call nfnl_unlock, since it would reenter
-		 * with further packet processing */
-		__nfnl_unlock();
+		nfnl_unlock();
 		request_module("nfnetlink-subsys-%d", NFNL_SUBSYS_ID(type));
 		nfnl_lock();
 		ss = nfnetlink_get_subsys(type);
@@ -188,10 +174,9 @@ static void nfnetlink_rcv(struct sock *s
 	unsigned int qlen = 0;
 
 	do {
-		if (nfnl_trylock())
-			return;
+		nfnl_lock();
 		qlen = netlink_run_queue(sk, qlen, nfnetlink_rcv_msg);
-		__nfnl_unlock();
+		nfnl_unlock();
 	} while (qlen);
 }
 
--- ./net/netlink/genetlink.c.nlk3	2007-08-26 19:30:38.0 +0400
+++ ./net/netlink/genetlink.c	2007-10-01 16:05:29.0 +0400
@@ -22,22 +22,14 @@ struct sock *genl_sock = NULL;
 
 static DEFINE_MUTEX(genl_mutex); /* serialization of message processing */
 
-static void genl_lock(void)
+static inline void genl_lock(void)
 {
 	mutex_lock(&genl_mutex);
 }
 
-static int genl_trylock(void)
-{
-	return !mutex_trylock(&genl_mutex);
-}
-
-static void genl_unlock(void)
+static inline void genl_unlock(void)
 {
 	mutex_unlock(&genl_mutex);
-
-	if (genl_sock && genl_sock->sk_receive_queue.qlen)
-		genl_sock->sk_data_ready(genl_sock, 0);
 }
 
 #define GENL_FAM_TAB_SIZE	16
@@ -483,8 +475,7 @@ static void genl_rcv(struct sock *sk, in
 	unsigned int qlen = 0;
 
 	do {
-		if (genl_trylock())
-			return;
+		genl_lock();
 		qlen = netlink_run_queue(sk, qlen, genl_rcv_msg);
 		genl_unlock();
 	} while (qlen && genl_sock && genl_sock->sk_receive_queue.qlen);
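
The receive loop that the patch moves to can be sketched in userspace as follows. This is an analogy only: a pthread mutex stands in for the kernel mutex, and demo_run_queue() is a hypothetical stand-in for netlink_run_queue(); none of the demo_* names exist in the kernel.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned int pending;   /* messages waiting in the pretend queue */
static unsigned int handled;   /* messages processed so far */

/* Handle up to batch messages and report how many remain queued. */
static unsigned int demo_run_queue(unsigned int batch)
{
    assert(batch > 0);
    for (unsigned int i = 0; i < batch && pending; i++) {
        pending--;
        handled++;
    }
    return pending;
}

/* Receive routine in the style the patch switches to: always take the
 * mutex (blocking if needed) and loop until the queue is drained,
 * instead of returning early when a trylock fails. */
static void demo_rcv(void)
{
    unsigned int qlen;

    do {
        pthread_mutex_lock(&demo_mutex);
        qlen = demo_run_queue(4);
        pthread_mutex_unlock(&demo_mutex);
    } while (qlen);
}
```

Because every caller waits for the mutex and drains the queue itself, the unlock path no longer has to re-trigger processing, which is what the removed sk_data_ready() calls were doing.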


[PATCH] memory leak in netlink user-kernel processing

2007-10-01 Thread Denis V. Lunev
netlink_kernel_create can be called with NULL as an input callback in several
places, f.e. in kobject_uevent_init. This means that if one sends packet from
user to kernel for such a socket, the packet will be leaked in the socket
queue forever.

This patch adds a simple generic cleanup callback for these sockets.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

--- ./net/netlink/af_netlink.c.nlk4 2007-08-26 19:30:38.0 +0400
+++ ./net/netlink/af_netlink.c  2007-10-01 18:00:58.0 +0400
@@ -1301,6 +1301,13 @@ out:
 	return err ? : copied;
 }
 
+static void netlink_rcv_drop(struct sock *sk, int len)
+{
+	struct sk_buff *skb;
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL)
+		kfree_skb(skb);
+}
+
 static void netlink_data_ready(struct sock *sk, int len)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -1346,8 +1353,7 @@ netlink_kernel_create(struct net *net, i
 
 	sk = sock->sk;
 	sk->sk_data_ready = netlink_data_ready;
-	if (input)
-		nlk_sk(sk)->data_ready = input;
+	nlk_sk(sk)->data_ready = input != NULL ? input : netlink_rcv_drop;
 
 	if (netlink_insert(sk, net, 0))
 		goto out_sock_release;
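
The pattern in the patch (fall back to a draining callback when the caller supplies none) can be shown with a self-contained userspace sketch. The demo_* types and functions are hypothetical and only imitate the shape of the kernel code; a plain linked list replaces the skb queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct demo_msg {
    struct demo_msg *next;
};

struct demo_sock {
    struct demo_msg *queue;                 /* singly linked receive queue */
    void (*data_ready)(struct demo_sock *); /* consumer callback */
};

/* Fallback consumer: free everything that was queued. */
static void demo_rcv_drop(struct demo_sock *sk)
{
    while (sk->queue) {
        struct demo_msg *m = sk->queue;
        sk->queue = m->next;
        free(m);
    }
}

/* Creation helper: never leave data_ready unset. */
static void demo_sock_init(struct demo_sock *sk,
                           void (*input)(struct demo_sock *))
{
    assert(sk != NULL);
    sk->queue = NULL;
    sk->data_ready = input != NULL ? input : demo_rcv_drop;
}

/* Queue one message and notify the consumer; returns 0 on success. */
static int demo_send(struct demo_sock *sk)
{
    struct demo_msg *m = malloc(sizeof(*m));

    if (!m)
        return -1;
    m->next = sk->queue;
    sk->queue = m;
    sk->data_ready(sk);
    return 0;
}
```

With a NULL input the queue is emptied on every send, so nothing accumulates.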


Re: [PATCH] make netlink processing routines semi-synchronious (inspired by rtnl) v2

2007-10-01 Thread Patrick McHardy
Denis V. Lunev wrote:
 The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
 like outdated copy/paste from rtnetlink.c. Push them into sync with the
 original.
 
 Changes from v1:
 - deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy

Thanks.

 
 Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

Acked-by: Patrick McHardy [EMAIL PROTECTED]


Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Arnaldo Carvalho de Melo
On Mon, Oct 01, 2007 at 02:39:28PM +0100, Gerrit Renker wrote:
 Quoting YOSHIFUJI Hideaki:
 | 
 |   [TCP]: break missing at end of switch statement
 |   
 |   Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
 |   ---
 |   --- a/net/ipv4/tcp_input.c
 |   +++ b/net/ipv4/tcp_input.c
 |   @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
 |return;
 |default:
 |sk->sk_err = ECONNRESET;
 |   +break;
 |}
 |
 |if (!sock_flag(sk, SOCK_DEAD))
 |  
 |  NAK; it is not required at all.
 |  
 |  --yoshfuji
 |  
 If it were true what you are saying then the statement 
 
`sk->sk_err = ECONNRESET;' 
 
 can go as well since it will always be overridden.

Gerrit,

It is not required. The statement you mention will be executed
when the sk_state is not one of TCP_SYN_SENT, TCP_CLOSE_WAIT or
TCP_CLOSE.

A 'break' is only needed in a label block if it is not the last
one.

- Arnaldo
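
The point about the final label can be checked with a small standalone function loosely modeled on the switch under discussion. The DEMO_* constants are placeholders chosen for this sketch, not the kernel's TCP states or errno values.

```c
#include <assert.h>

enum demo_state { DEMO_SYN_SENT, DEMO_CLOSE_WAIT, DEMO_CLOSE, DEMO_ESTABLISHED };
enum { DEMO_ECONNREFUSED = 1, DEMO_EPIPE, DEMO_ECONNRESET };

/* The last label has no trailing break, yet control simply falls out
 * of the switch because nothing follows it. */
static int demo_reset_err(enum demo_state state)
{
    int err = 0;

    switch (state) {
    case DEMO_SYN_SENT:
        err = DEMO_ECONNREFUSED;
        break;
    case DEMO_CLOSE_WAIT:
        err = DEMO_EPIPE;
        break;
    case DEMO_CLOSE:
        return 0;
    default:
        err = DEMO_ECONNRESET;   /* no break needed here */
    }
    return err;
}

/* Usage: the default assignment is reached only for the other states. */
static void demo_check(void)
{
    assert(demo_reset_err(DEMO_ESTABLISHED) == DEMO_ECONNRESET);
    assert(demo_reset_err(DEMO_SYN_SENT) == DEMO_ECONNREFUSED);
    assert(demo_reset_err(DEMO_CLOSE) == 0);
}
```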


[PATCH] ehea: DLPAR memory add fix

2007-10-01 Thread Jan-Bernd Themann
Due to stability issues in high load situations the HW queue handling
has to be changed. The HW queues are now stopped and restarted again instead
of destroying and allocating new HW queues. 

Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]

---
 drivers/net/ehea/ehea.h  |4 +-
 drivers/net/ehea/ehea_main.c |  276 +-
 drivers/net/ehea/ehea_phyp.h |1 +
 drivers/net/ehea/ehea_qmr.c  |   20 ++--
 drivers/net/ehea/ehea_qmr.h  |4 +-
 5 files changed, 259 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index c0cbd94..3022089 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -40,13 +40,13 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0074"
+#define DRV_VERSION	"EHEA_0077"
 
 /* eHEA capability flags */
 #define DLPAR_PORT_ADD_REM 1
 #define DLPAR_MEM_ADD      2
 #define DLPAR_MEM_REM      4
-#define EHEA_CAPABILITIES  (DLPAR_PORT_ADD_REM)
+#define EHEA_CAPABILITIES  (DLPAR_PORT_ADD_REM | DLPAR_MEM_ADD)
 
 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
 	| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 62d6c1e..5bc0a15 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -97,6 +97,7 @@ u64 ehea_driver_flags = 0;
 struct workqueue_struct *ehea_driver_wq;
 struct work_struct ehea_rereg_mr_task;
 
+struct semaphore dlpar_mem_lock;
 
 static int __devinit ehea_probe_adapter(struct ibmebus_dev *dev,
 					const struct of_device_id *id);
@@ -177,16 +178,24 @@ static void ehea_refill_rq1(struct ehea_port_res *pr, int index, int nr_of_wqes)
 	struct sk_buff **skb_arr_rq1 = pr->rq1_skba.arr;
 	struct net_device *dev = pr->port->netdev;
 	int max_index_mask = pr->rq1_skba.len - 1;
+	int fill_wqes = pr->rq1_skba.os_skbs + nr_of_wqes;
+	int adder = 0;
 	int i;
 
-	if (!nr_of_wqes)
+	pr->rq1_skba.os_skbs = 0;
+
+	if (unlikely(test_bit(__EHEA_STOP_XFER, &ehea_driver_flags))) {
+		pr->rq1_skba.index = index;
+		pr->rq1_skba.os_skbs = fill_wqes;
 		return;
+	}
 
-	for (i = 0; i < nr_of_wqes; i++) {
+	for (i = 0; i < fill_wqes; i++) {
 		if (!skb_arr_rq1[index]) {
 			skb_arr_rq1[index] = netdev_alloc_skb(dev,
 							      EHEA_L_PKT_SIZE);
 			if (!skb_arr_rq1[index]) {
+				pr->rq1_skba.os_skbs = fill_wqes - i;
 				ehea_error("%s: no mem for skb/%d wqes filled",
 					   dev->name, i);
 				break;
@@ -194,9 +203,14 @@ static void ehea_refill_rq1(struct ehea_port_res *pr, int index, int nr_of_wqes)
 		}
 		index--;
 		index &= max_index_mask;
+		adder++;
 	}
+
+	if (adder == 0)
+		return;
+
 	/* Ring doorbell */
-	ehea_update_rq1a(pr->qp, i);
+	ehea_update_rq1a(pr->qp, adder);
 }
 
 static int ehea_init_fill_rq1(struct ehea_port_res *pr, int nr_rq1a)
@@ -230,16 +244,21 @@ static int ehea_refill_rq_def(struct ehea_port_res *pr,
 	struct sk_buff **skb_arr = q_skba->arr;
 	struct ehea_rwqe *rwqe;
 	int i, index, max_index_mask, fill_wqes;
+	int adder = 0;
 	int ret = 0;
 
 	fill_wqes = q_skba->os_skbs + num_wqes;
+	q_skba->os_skbs = 0;
 
-	if (!fill_wqes)
+	if (unlikely(test_bit(__EHEA_STOP_XFER, &ehea_driver_flags))) {
+		q_skba->os_skbs = fill_wqes;
 		return ret;
+	}
 
 	index = q_skba->index;
 	max_index_mask = q_skba->len - 1;
 	for (i = 0; i < fill_wqes; i++) {
+		u64 tmp_addr;
 		struct sk_buff *skb = netdev_alloc_skb(dev, packet_size);
 		if (!skb) {
 			ehea_error("%s: no mem for skb/%d wqes filled",
@@ -251,30 +270,37 @@ static int ehea_refill_rq_def(struct ehea_port_res *pr,
 		skb_reserve(skb, NET_IP_ALIGN);
 
 		skb_arr[index] = skb;
+		tmp_addr = ehea_map_vaddr(skb->data);
+		if (tmp_addr == -1) {
+			dev_kfree_skb(skb);
+			q_skba->os_skbs = fill_wqes - i;
+			ret = 0;
+			break;
+		}
 
 		rwqe = ehea_get_next_rwqe(qp, rq_nr);
 		rwqe->wr_id = EHEA_BMASK_SET(EHEA_WR_ID_TYPE, wqe_type)
 			    | EHEA_BMASK_SET(EHEA_WR_ID_INDEX, index);
 		rwqe->sg_list[0].l_key = pr->recv_mr.lkey;
-		rwqe->sg_list[0].vaddr = ehea_map_vaddr(skb->data);
+		rwqe->sg_list[0].vaddr = tmp_addr;
 		rwqe->sg_list[0].len = packet_size;

Re: [PATCH] memory leak in netlink user-kernel processing

2007-10-01 Thread Patrick McHardy
Denis V. Lunev wrote:
 netlink_kernel_create can be called with NULL as an input callback in several
 places, f.e. in kobject_uevent_init. This means that if one sends packet from
 user to kernel for such a socket, the packet will be leaked in the socket
 queue forever.
 
 This patch adds a simple generic cleanup callback for these sockets.


This should already be handled by netlink_getsockbypid:

	/* Don't bother queuing skb if kernel socket has no input function */
	nlk = nlk_sk(sock);
	if ((nlk->pid == 0 && !nlk->data_ready) ||
	    (sock->sk_state == NETLINK_CONNECTED &&
	     nlk->dst_pid != nlk_sk(ssk)->pid)) {
		sock_put(sock);
		return ERR_PTR(-ECONNREFUSED);
	}


Re: ehea work queues

2007-10-01 Thread Jan-Bernd Themann
Hi

On Sunday 30 September 2007 18:20, Anton Blanchard wrote:
 
 Hi,
 
 I booted 2.6.23-rc8 and noticed that ehea loves its workqueues:
 (notice also that the ehea_driver_wq/XXX exceeds TASK_COMM_LEN). 
 
 Since they are both infrequent events and not performance critical
 (memory hotplug and driver reset), can we just use schedule_work?
 
Yes. I'll provide a patch soon.

Thanks,
Jan-Bernd


Re: [PATCH] ehea: DLPAR memory add fix

2007-10-01 Thread Jeff Garzik

Jan-Bernd Themann wrote:

Due to stability issues in high load situations the HW queue handling
has to be changed. The HW queues are now stopped and restarted again instead
of destroying and allocating new HW queues. 


Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]


May I presume this is for 2.6.23?

Jeff





Re: [PATCH] ehea: DLPAR memory add fix

2007-10-01 Thread Jan-Bernd Themann
Hi,

On Monday 01 October 2007 16:44, Jeff Garzik wrote:
 Jan-Bernd Themann wrote:
  Due to stability issues in high load situations the HW queue handling
  has to be changed. The HW queues are now stopped and restarted again instead
  of destroying and allocating new HW queues. 
  
  Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]
 
 May I presume this is for 2.6.23?
 
   Jeff

No, the patch is built against 2.6.24 upstream (new NAPI interface).

Regards,
Jan-Bernd


Re: [PATCH] memory leak in netlink user-kernel processing

2007-10-01 Thread Denis V. Lunev
Patrick McHardy wrote:
 Denis V. Lunev wrote:
 netlink_kernel_create can be called with NULL as an input callback in several
 places, f.e. in kobject_uevent_init. This means that if one sends packet from
 user to kernel for such a socket, the packet will be leaked in the socket
 queue forever.

 This patch adds a simple generic cleanup callback for these sockets.
 
 
 This should already be handled by netlink_getsockbypid:
 
 	/* Don't bother queuing skb if kernel socket has no input function */
 	nlk = nlk_sk(sock);
 	if ((nlk->pid == 0 && !nlk->data_ready) ||
 	    (sock->sk_state == NETLINK_CONNECTED &&
 	     nlk->dst_pid != nlk_sk(ssk)->pid)) {
 		sock_put(sock);
 		return ERR_PTR(-ECONNREFUSED);
 	}
 

Looks so...

By the way, Patrick, it looks like nlk->pid == 0 if and only if this
is a kernel socket. Right?

I have talked with Alexey Kuznetsov and we have discovered a way to get
rid of
	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk, len);
in netlink_sendskb/etc for kernel sockets and make user->kernel packet
processing truly synchronous.

The idea is simple: we should queue/wakeup in the kernel->user direction and
simply call nlk->data_ready for the user->kernel direction. This will remove
all the crap we have now. But we need a mark to determine the direction.
Which one will be better? (nlk->data_ready) or (nlk->pid == 0)

Regards,
Den


Re: [PATCH] ehea: DLPAR memory add fix

2007-10-01 Thread Jeff Garzik

Jan-Bernd Themann wrote:

Hi,

On Monday 01 October 2007 16:44, Jeff Garzik wrote:

Jan-Bernd Themann wrote:

Due to stability issues in high load situations the HW queue handling
has to be changed. The HW queues are now stopped and restarted again instead
of destroying and allocating new HW queues. 


Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]

May I presume this is for 2.6.23?

Jeff


no, the patch is build against 2.6.24 upstream (new NAPI interface).


OK, thanks.

Since we typically have two streams, the current bug-fix stream and the 
for-next-kernel stream, please indicate to which kernel/git tree your 
patch applies, in the future.





Re: [PATCH] memory leak in netlink user-kernel processing

2007-10-01 Thread Patrick McHardy

Denis V. Lunev wrote:

By the way, Patrick, it looks like nlk->pid == 0 if and only if this
is a kernel socket. Right?


That's correct.


I have talked with Alexey Kuznetsov and we have discovered a way to get
rid of
	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk, len);
in netlink_sendskb/etc for kernel sockets and make user->kernel packet
processing truly synchronous.

The idea is simple: we should queue/wakeup in the kernel->user direction and
simply call nlk->data_ready for the user->kernel direction. This will remove
all the crap we have now. But we need a mark to determine the direction.
Which one will be better? (nlk->data_ready) or (nlk->pid == 0)



Both would work fine, but I think nlk->pid is better since it's
actually the address.




Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression

2007-10-01 Thread Eric Dumazet

Denys wrote:

Well, I can play a bit more on live servers. I now have a hot-swap server with
full Gentoo, where I can rebuild any kernel you want, with any patch applied.
But it looks less like overhead: the load becomes very spiky rather than
just permanently higher. Also, it is not normal that the whole system becomes
unresponsive (for example, ping 127.0.0.1 goes to 300ms during the period when
softirq usage jumps to 100%).

  
Could you try a pristine 2.6.22.9 and some patch in 
secure_tcp_sequence_number() like :


--- drivers/char/random.c.orig 2007-10-01 10:18:42.0 +0200
+++ drivers/char/random.c 2007-10-01 10:19:58.0 +0200
@@ -1554,7 +1554,7 @@
 	 * That's funny, Linux has one built in! Use it!
 	 * (Networks are faster now - should this be increased?)
 	 */
-	seq += ktime_get_real().tv64;
+	seq += ktime_get_real().tv64 / 1000;
 #if 0
 	printk("init_seq(%lx, %lx, %d, %d) = %d\n",
 	       saddr, daddr, sport, dport, seq);

Thank you



On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote
  

From: Eric Dumazet [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 07:59:12 +0200



No problem here on bigger servers, so I CC David Miller and netdev
on this one.  AFAIK do_gettimeofday() and ktime_get_real() should
use the same underlying hardware functions on PC and no performance
problem should happen here.
  

One thing that jumps out at me is that on 32-bit (and to a certain
extent on 64-bit) there is a lot of stack accesses and missed
optimizations because all of the work occurs, and gets expanded,
inside of ktime_get_real().

The timespec_to_ktime() inside of there constructs the ktime_t return
value on the stack, then returns that as an aggregate to the caller.

That cannot be without some cost.

ktime_get_real() is definitely a candidate for inlining especially in
these kinds of cases where we'll happily get computations in local
registers instead of all of this on-stack nonsense.  And in several
cases (if the caller only needs the tv_sec value, for example)
computations can be elided entirely.

It would be constructive to experiment and see if this is in fact 
part of the problem.
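
As a side note on the test patch earlier in this message: dividing the nanosecond value by 1000 brings it back to microsecond units. The following standalone sketch shows that relationship with plain integers; demo_ns() and demo_us() are illustrative helpers and do not call any kernel time function.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000LL
#define DEMO_USEC_PER_SEC 1000000LL

/* Nanosecond count built from seconds plus nanoseconds, the unit a
 * 64-bit nanosecond timestamp carries. */
static int64_t demo_ns(int64_t sec, int64_t nsec)
{
    assert(nsec >= 0 && nsec < DEMO_NSEC_PER_SEC);
    return sec * DEMO_NSEC_PER_SEC + nsec;
}

/* Microsecond count for the same instant. */
static int64_t demo_us(int64_t sec, int64_t nsec)
{
    return sec * DEMO_USEC_PER_SEC + nsec / 1000;
}
```

For non-negative times, demo_ns(sec, nsec) / 1000 equals demo_us(sec, nsec), so the divided value advances at the microsecond rate.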




--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.


  







Re: [PATCH] memory leak in netlink user-kernel processing

2007-10-01 Thread Eric W. Biederman
Patrick McHardy [EMAIL PROTECTED] writes:

 Denis V. Lunev wrote:
 By the way, Patrick, it looks like nlk->pid == 0 if and only if this
 is a kernel socket. Right?


 That's correct.

 I have talked with Alexey Kuznetsov and we have discovered a way to get
 rid of
 	skb_queue_tail(&sk->sk_receive_queue, skb);
 	sk->sk_data_ready(sk, len);
 in netlink_sendskb/etc for kernel sockets and make user->kernel packet
 processing truly synchronous.

 The idea is simple: we should queue/wakeup in the kernel->user direction and
 simply call nlk->data_ready for the user->kernel direction. This will remove
 all the crap we have now. But we need a mark to determine the direction.
 Which one will be better? (nlk->data_ready) or (nlk->pid == 0)


 Both would work fine, but I think nlk->pid is better since it's
 actually the address.

Maybe.  nlk->pid is also 0 before the socket is bound, so it does
not serve as a reliable indicator that you have a kernel socket.

My gut feel says the best test is:
(nlk->flags & NETLINK_KERNEL_SOCKET)

There is no confusion in that and it is dead obvious what we
are testing for.  Although we do still need to properly handle
the case when netlink_kernel_create is called with a NULL
input method.  As long as we get the proper -ECONNREFUSED, the
code path doesn't look like it matters.

Eric
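
The difference between the two candidate tests can be illustrated with a toy structure. Everything here is hypothetical: struct demo_nlsock and DEMO_KERNEL_SOCKET only imitate the fields being discussed and are not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_KERNEL_SOCKET 0x1u   /* hypothetical flag bit */

struct demo_nlsock {
    uint32_t pid;    /* 0 for kernel sockets, but also 0 before bind */
    uint32_t flags;
};

static bool demo_is_kernel_by_pid(const struct demo_nlsock *s)
{
    return s->pid == 0;
}

static bool demo_is_kernel_by_flag(const struct demo_nlsock *s)
{
    return (s->flags & DEMO_KERNEL_SOCKET) != 0;
}

/* Usage: an unbound user socket fools the pid test but not the flag test. */
static void demo_check(void)
{
    struct demo_nlsock unbound_user = { .pid = 0, .flags = 0 };
    struct demo_nlsock kernel = { .pid = 0, .flags = DEMO_KERNEL_SOCKET };

    assert(demo_is_kernel_by_pid(&unbound_user));   /* false positive */
    assert(!demo_is_kernel_by_flag(&unbound_user));
    assert(demo_is_kernel_by_flag(&kernel));
}
```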



Re: [PATCH 1/1][TCP]: break missing at end of switch statement

2007-10-01 Thread Gerrit Renker
Arnaldo, Al Viro, and Yoshifuji -

sorry for having wasted your time with this one. You are right, that was 
complete nonsense. 
I don't know where my mind was - even my test program used to `prove' this was 
screwed up.

So nothing wrong here and thank you very much for your clarifying comments.

|   |   --- a/net/ipv4/tcp_input.c
|   |   +++ b/net/ipv4/tcp_input.c
|   |   @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
|   | return;
|   | default:
|   | sk->sk_err = ECONNRESET;
|   |   + break;
|   | }
|   |
|   | if (!sock_flag(sk, SOCK_DEAD))
|   |  
|   |  NAK; it is not required at all.
|   |  
|   |  --yoshfuji
|   |  
|   If it were true what you are saying then the statement 
|   
|  `sk->sk_err = ECONNRESET;' 
|   
|   can go as well since it will always be overridden.
|  
|  Gerrit,
|  
|  It is not required. The statement you mention will be executed
|  when the sk_state is not one of TCP_SYN_SENT, TCP_CLOSE_WAIT or
|  TCP_CLOSE.
|  
|   A 'break' is only needed in a label block if it is not the last
|  one.
|  
|  - Arnaldo
|  
|  


Re: [IPV6] Fix ICMPv6 redirect handling with target multicast address

2007-10-01 Thread Brian Haley
Hi,

YOSHIFUJI Hideaki / 吉藤英明 wrote:
 I think it'd also be better if you add the check to be:

 if (ipv6_addr_type(target) &
     (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST))

 or something along those lines, rather than reproducing ipv6_addr_type() 
 code
 separately in a new ipv6_addr_linklocal() function.
 
 I'm fine with the idea of the fix itself.

Ok, in both the receive and send code?

 Please use ipv6_addr_type() so far and convert other users as well
 to ipv6_addr_linklocal() in another patch.

I'll re-do the patch.

-Brian


[PATCH] net-2.6.24: old ax25 driver fix

2007-10-01 Thread Stephen Hemminger
Recent change in hard header broke build of these old drivers.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
---
 drivers/net/hamradio/dmascc.c |2 +-
 drivers/net/hamradio/scc.c|2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hamradio/dmascc.c b/drivers/net/hamradio/dmascc.c
index b529e23..bc02e46 100644
--- a/drivers/net/hamradio/dmascc.c
+++ b/drivers/net/hamradio/dmascc.c
@@ -581,7 +581,7 @@ static int __init setup_adapter(int card_base, int type, int n)
 		dev->do_ioctl = scc_ioctl;
 		dev->hard_start_xmit = scc_send_packet;
 		dev->get_stats = scc_get_stats;
-		dev->header_ops = &ax25_hard_header_ops
+		dev->header_ops = &ax25_header_ops;
 		dev->set_mac_address = scc_set_mac_address;
 	}
 	if (register_netdev(info->dev[0])) {
diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
index 56cc523..353d13e 100644
--- a/drivers/net/hamradio/scc.c
+++ b/drivers/net/hamradio/scc.c
@@ -1551,7 +1551,7 @@ static void scc_net_setup(struct net_device *dev)
 	dev->stop            = scc_net_close;
 
 	dev->hard_start_xmit = scc_net_tx;
-	dev->header_ops      = &ax25_hard_header_ops;
+	dev->header_ops      = &ax25_header_ops;
 
 	dev->set_mac_address = scc_net_set_mac_address;
 	dev->get_stats       = scc_net_get_stats;
-- 
1.5.2.5



[PATCH 0/2] qla3xxx: receive path bugfixes.

2007-10-01 Thread Ron Mercer
Jeff,
This is the second submission... First was in August. Thanks, Ron

The following two patches fix:

An undocumented feature where the 4032 chip sets bit-7
of the opcode for an inbound completion if it's for a VLAN.

The access of stale data on a completion entry.

These patches were built and tested on 2.6.23-rc1.

Signed-off-by: Ron Mercer [EMAIL PROTECTED]



[PATCH 1/2] qla3xxx: bugfix: Add memory barrier before accessing rx completion.

2007-10-01 Thread Ron Mercer

Signed-off-by: Ron Mercer [EMAIL PROTECTED]
---
 drivers/net/qla3xxx.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index 69da95b..c3fe1c7 100755
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -2248,6 +2248,7 @@ static int ql_tx_rx_clean(struct ql3_adapter *qdev,
 		qdev->rsp_consumer_index) && (work_done < work_to_do)) {
 
 		net_rsp = qdev->rsp_current;
+		rmb();
 		switch (net_rsp->opcode) {
 
 		case OPCODE_OB_MAC_IOCB_FN0:
-- 
1.5.0.rc4.16.g9e258



[PATCH 2/2] qla3xxx: bugfix: Fix VLAN rx completion handling.

2007-10-01 Thread Ron Mercer
Fix 4032 chip undocumented feature where bit-8 is set
if the inbound completion is for a VLAN.

Signed-off-by: Ron Mercer [EMAIL PROTECTED]
---
 drivers/net/qla3xxx.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index c3fe1c7..ea15131 100755
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -2249,6 +2249,12 @@ static int ql_tx_rx_clean(struct ql3_adapter *qdev,
 
 		net_rsp = qdev->rsp_current;
 		rmb();
+		/*
+		 * Fix 4032 chip undocumented feature where bit-8 is set if the
+		 * inbound completion is for a VLAN.
+		 */
+		if (qdev->device_id == QL3032_DEVICE_ID)
+			net_rsp->opcode &= 0x7f;
 		switch (net_rsp->opcode) {
 
 		case OPCODE_OB_MAC_IOCB_FN0:
-- 
1.5.0.rc4.16.g9e258
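
The masking step in the patch above can be tried in isolation. This sketch assumes an 8-bit opcode whose top bit is the VLAN indicator, which is how the 0x7f mask reads; the DEMO_* names and the sample opcode values are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_VLAN_BIT    0x80u  /* top bit of the 8-bit opcode */
#define DEMO_OPCODE_MASK 0x7fu  /* same mask value as in the patch */

/* Drop the VLAN indicator so the opcode matches the switch labels. */
static uint8_t demo_strip_vlan_bit(uint8_t raw_opcode)
{
    return (uint8_t)(raw_opcode & DEMO_OPCODE_MASK);
}

static int demo_has_vlan_bit(uint8_t raw_opcode)
{
    return (raw_opcode & DEMO_VLAN_BIT) != 0;
}

/* Usage: a flagged and an unflagged completion map to the same opcode. */
static void demo_check(void)
{
    assert(demo_strip_vlan_bit(0x8f) == 0x0f);
    assert(demo_strip_vlan_bit(0x0f) == 0x0f);
    assert(demo_has_vlan_bit(0x8f));
    assert(!demo_has_vlan_bit(0x0f));
}
```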



[PATCH 1/9] fs_enet: Whitespace cleanup.

2007-10-01 Thread Scott Wood
Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
This patch series applies to the net-2.6.24 branch.

 drivers/net/fs_enet/fs_enet-main.c |   85 ---
 drivers/net/fs_enet/fs_enet.h  |4 +-
 drivers/net/fs_enet/mac-fcc.c  |1 -
 drivers/net/fs_enet/mii-bitbang.c  |3 -
 drivers/net/fs_enet/mii-fec.c  |1 -
 5 files changed, 41 insertions(+), 53 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index ebdcf3f..2a1b150 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -343,7 +343,6 @@ static void fs_enet_tx(struct net_device *dev)
 
do_wake = do_restart = 0;
while (((sc = CBDR_SC(bdp)) & BD_ENET_TX_READY) == 0) {
-
dirtyidx = bdp - fep->tx_bd_base;
 
if (fep->tx_free == fep->tx_ring)
@@ -444,7 +443,6 @@ fs_enet_interrupt(int irq, void *dev_id)
 
nr = 0;
while ((int_events = (*fep->ops->get_int_events)(dev)) != 0) {
-
nr++;
 
int_clr_events = int_events;
@@ -700,45 +698,43 @@ static void fs_timeout(struct net_device *dev)
  
*-*/
 static void generic_adjust_link(struct  net_device *dev)
 {
-   struct fs_enet_private *fep = netdev_priv(dev);
-   struct phy_device *phydev = fep->phydev;
-   int new_state = 0;
-
-   if (phydev->link) {
-
-   /* adjust to duplex mode */
-   if (phydev->duplex != fep->oldduplex){
-   new_state = 1;
-   fep->oldduplex = phydev->duplex;
-   }
-
-   if (phydev->speed != fep->oldspeed) {
-   new_state = 1;
-   fep->oldspeed = phydev->speed;
-   }
-
-   if (!fep->oldlink) {
-   new_state = 1;
-   fep->oldlink = 1;
-   netif_schedule(dev);
-   netif_carrier_on(dev);
-   netif_start_queue(dev);
-   }
-
-   if (new_state)
-   fep->ops->restart(dev);
-
-   } else if (fep->oldlink) {
-   new_state = 1;
-   fep->oldlink = 0;
-   fep->oldspeed = 0;
-   fep->oldduplex = -1;
-   netif_carrier_off(dev);
-   netif_stop_queue(dev);
-   }
-
-   if (new_state && netif_msg_link(fep))
-   phy_print_status(phydev);
+   struct fs_enet_private *fep = netdev_priv(dev);
+   struct phy_device *phydev = fep->phydev;
+   int new_state = 0;
+
+   if (phydev->link) {
+   /* adjust to duplex mode */
+   if (phydev->duplex != fep->oldduplex) {
+   new_state = 1;
+   fep->oldduplex = phydev->duplex;
+   }
+
+   if (phydev->speed != fep->oldspeed) {
+   new_state = 1;
+   fep->oldspeed = phydev->speed;
+   }
+
+   if (!fep->oldlink) {
+   new_state = 1;
+   fep->oldlink = 1;
+   netif_schedule(dev);
+   netif_carrier_on(dev);
+   netif_start_queue(dev);
+   }
+
+   if (new_state)
+   fep->ops->restart(dev);
+   } else if (fep->oldlink) {
+   new_state = 1;
+   fep->oldlink = 0;
+   fep->oldspeed = 0;
+   fep->oldduplex = -1;
+   netif_carrier_off(dev);
+   netif_stop_queue(dev);
+   }
+
+   if (new_state && netif_msg_link(fep))
+   phy_print_status(phydev);
 }
 
 
@@ -782,7 +778,6 @@ static int fs_init_phy(struct net_device *dev)
return 0;
 }
 
-
 static int fs_enet_open(struct net_device *dev)
 {
struct fs_enet_private *fep = netdev_priv(dev);
@@ -971,7 +966,7 @@ static struct net_device *fs_init_instance(struct device 
*dev,
 #endif
 
 #ifdef CONFIG_FS_ENET_HAS_SCC
-   if (fs_get_scc_index(fpi->fs_no) >=0 )
+   if (fs_get_scc_index(fpi->fs_no) >=0)
fep->ops = &fs_scc_ops;
 #endif
 
@@ -1066,9 +1061,8 @@ static struct net_device *fs_init_instance(struct device 
*dev,
 
return ndev;
 
-  err:
+err:
if (ndev != NULL) {
-
if (registered)
unregister_netdev(ndev);
 
@@ -1259,7 +1253,6 @@ static int __init fs_init(void)
 err:
cleanup_immap();
return r;
-
 }
 
 static void __exit fs_cleanup(void)
diff --git a/drivers/net/fs_enet/fs_enet.h b/drivers/net/fs_enet/fs_enet.h
index 46d0606..fbe2087 100644
--- a/drivers/net/fs_enet/fs_enet.h
+++ b/drivers/net/fs_enet/fs_enet.h
@@ -15,8 +15,8 @@
 #include <asm/commproc.h>
 
 struct fec_info {
-fec_t*  fecp;
-   u32 mii_speed;
+   fec_t *fecp;
+   u32 mii_speed;
 };
 #endif
 
diff 

[PATCH 5/9] fs_enet: Align receive buffers.

2007-10-01 Thread Scott Wood
At least some hardware driven by this driver needs receive buffers
to be aligned on a 16-byte boundary.  This usually happens by chance,
but it breaks if slab debugging is enabled.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |   21 +++--
 drivers/net/fs_enet/fs_enet.h  |3 ++-
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c 
b/drivers/net/fs_enet/fs_enet-main.c
index a15345b..7a02986 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -70,6 +70,14 @@ static void fs_set_multicast_list(struct net_device *dev)
(*fep->ops->set_multicast_list)(dev);
 }
 
+static void skb_align(struct sk_buff *skb, int align)
+{
+   int off = ((unsigned long)skb->data) & (align - 1);
+
+   if (off)
+   skb_reserve(skb, align - off);
+}
+
 /* NAPI receive function */
 static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
 {
@@ -159,9 +167,13 @@ static int fs_enet_rx_napi(struct napi_struct *napi, int 
budget)
skb = skbn;
skbn = skbt;
}
-   } else
+   } else {
skbn = dev_alloc_skb(ENET_RX_FRSIZE);
 
+   if (skbn)
+   skb_align(skbn, ENET_RX_ALIGN);
+   }
+
if (skbn != NULL) {
skb_put(skb, pkt_len);  /* Make room */
skb->protocol = eth_type_trans(skb, dev);
@@ -290,9 +302,13 @@ static int fs_enet_rx_non_napi(struct net_device *dev)
skb = skbn;
skbn = skbt;
}
-   } else
+   } else {
skbn = dev_alloc_skb(ENET_RX_FRSIZE);
 
+   if (skbn)
+   skb_align(skbn, ENET_RX_ALIGN);
+   }
+
if (skbn != NULL) {
skb_put(skb, pkt_len);  /* Make room */
skb->protocol = eth_type_trans(skb, dev);
@@ -502,6 +518,7 @@ void fs_init_bds(struct net_device *dev)
   dev->name);
break;
}
+   skb_align(skb, ENET_RX_ALIGN);
fep->rx_skbuff[i] = skb;
CBDW_BUFADDR(bdp,
dma_map_single(fep->dev, skb->data,
diff --git a/drivers/net/fs_enet/fs_enet.h b/drivers/net/fs_enet/fs_enet.h
index fbe2087..85571e4 100644
--- a/drivers/net/fs_enet/fs_enet.h
+++ b/drivers/net/fs_enet/fs_enet.h
@@ -82,7 +82,8 @@ struct phy_info {
 /* Must be a multiple of 32 (to cover both FEC & FCC) */
 #define PKT_MAXBLR_SIZE ((PKT_MAXBUF_SIZE + 31) & ~31)
 /* This is needed so that invalidate_xxx wont invalidate too much */
-#define ENET_RX_FRSIZE L1_CACHE_ALIGN(PKT_MAXBUF_SIZE)
+#define ENET_RX_ALIGN  16
+#define ENET_RX_FRSIZE L1_CACHE_ALIGN(PKT_MAXBUF_SIZE + ENET_RX_ALIGN - 1)
 
 struct fs_enet_mii_bus {
struct list_head list;
-- 
1.5.3.2

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
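
The alignment arithmetic in skb_align() above can be tried out in user space. The following is only a sketch, not driver code: align_pad() is a hypothetical helper that takes a plain address instead of an sk_buff, and it assumes the alignment is a power of two, as ENET_RX_ALIGN (16) is.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: number of bytes to skip so
 * that addr lands on the next `align` boundary. For a power-of-two align,
 * (addr & (align - 1)) is the same as addr % align. */
static unsigned align_pad(uintptr_t addr, unsigned align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    unsigned off = (unsigned)(addr & (align - 1));

    return off ? align - off : 0;
}
```

The largest possible pad is align - 1 bytes, which appears to be why the patch also grows ENET_RX_FRSIZE by ENET_RX_ALIGN - 1.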


[PATCH 4/9] fs_enet: mac-fcc: Eliminate __fcc-* macros.

2007-10-01 Thread Scott Wood
These macros accomplish nothing other than defeating type checking.

This patch also fixes one instance of the wrong register size being
used that was revealed by enabling type checking.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/mac-fcc.c |   25 -
 1 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/drivers/net/fs_enet/mac-fcc.c b/drivers/net/fs_enet/mac-fcc.c
index 1e1024a..e990f72 100644
--- a/drivers/net/fs_enet/mac-fcc.c
+++ b/drivers/net/fs_enet/mac-fcc.c
@@ -48,28 +48,19 @@
 
 /* FCC access macros */
 
-#define __fcc_out32(addr, x)   out_be32((unsigned *)addr, x)
-#define __fcc_out16(addr, x)   out_be16((unsigned short *)addr, x)
-#define __fcc_out8(addr, x)    out_8((unsigned char *)addr, x)
-#define __fcc_in32(addr)   in_be32((unsigned *)addr)
-#define __fcc_in16(addr)   in_be16((unsigned short *)addr)
-#define __fcc_in8(addr)    in_8((unsigned char *)addr)
-
-/* parameter space */
-
 /* write, read, set bits, clear bits */
-#define W32(_p, _m, _v)    __fcc_out32(&(_p)->_m, (_v))
-#define R32(_p, _m)    __fcc_in32(&(_p)->_m)
+#define W32(_p, _m, _v)    out_be32(&(_p)->_m, (_v))
+#define R32(_p, _m)    in_be32(&(_p)->_m)
 #define S32(_p, _m, _v)    W32(_p, _m, R32(_p, _m) | (_v))
 #define C32(_p, _m, _v)    W32(_p, _m, R32(_p, _m) & ~(_v))
 
-#define W16(_p, _m, _v)    __fcc_out16(&(_p)->_m, (_v))
-#define R16(_p, _m)    __fcc_in16(&(_p)->_m)
+#define W16(_p, _m, _v)    out_be16(&(_p)->_m, (_v))
+#define R16(_p, _m)    in_be16(&(_p)->_m)
 #define S16(_p, _m, _v)    W16(_p, _m, R16(_p, _m) | (_v))
 #define C16(_p, _m, _v)    W16(_p, _m, R16(_p, _m) & ~(_v))
 
-#define W8(_p, _m, _v) __fcc_out8(&(_p)->_m, (_v))
-#define R8(_p, _m) __fcc_in8(&(_p)->_m)
+#define W8(_p, _m, _v) out_8(&(_p)->_m, (_v))
+#define R8(_p, _m) in_8(&(_p)->_m)
 #define S8(_p, _m, _v) W8(_p, _m, R8(_p, _m) | (_v))
 #define C8(_p, _m, _v) W8(_p, _m, R8(_p, _m) & ~(_v))
 
@@ -290,7 +281,7 @@ static void restart(struct net_device *dev)
 
/* clear everything (slow & steady does it) */
for (i = 0; i < sizeof(*ep); i++)
-   __fcc_out8((char *)ep + i, 0);
+   out_8((char *)ep + i, 0);
 
/* get physical address */
rx_bd_base_phys = fep->ring_mem_addr;
@@ -495,7 +486,7 @@ static void tx_kickstart(struct net_device *dev)
struct fs_enet_private *fep = netdev_priv(dev);
fcc_t *fccp = fep->fcc.fccp;
 
-   S32(fccp, fcc_ftodr, 0x80);
+   S16(fccp, fcc_ftodr, 0x8000);
 }
 
 static u32 get_int_events(struct net_device *dev)
-- 
1.5.3.2



[PATCH 3/9] fs_enet: Include linux/string.h from linux/fs_enet_pd.h

2007-10-01 Thread Scott Wood
It is needed for strstr().

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 include/linux/fs_enet_pd.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs_enet_pd.h b/include/linux/fs_enet_pd.h
index 543cd3c..815c6f9 100644
--- a/include/linux/fs_enet_pd.h
+++ b/include/linux/fs_enet_pd.h
@@ -16,6 +16,7 @@
 #ifndef FS_ENET_PD_H
 #define FS_ENET_PD_H
 
+#include <linux/string.h>
 #include <asm/types.h>
 
 #define FS_ENET_NAME   "fs_enet"
-- 
1.5.3.2



[PATCH 8/9] fs_enet: Convert mii-bitbang to use the generic bitbang MDIO code.

2007-10-01 Thread Scott Wood
Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/mii-bitbang.c |  270 -
 1 files changed, 54 insertions(+), 216 deletions(-)

diff --git a/drivers/net/fs_enet/mii-bitbang.c 
b/drivers/net/fs_enet/mii-bitbang.c
index 7cf132f..b8e4a73 100644
--- a/drivers/net/fs_enet/mii-bitbang.c
+++ b/drivers/net/fs_enet/mii-bitbang.c
@@ -15,15 +15,13 @@
 #include <linux/module.h>
 #include <linux/ioport.h>
 #include <linux/slab.h>
-#include <linux/interrupt.h>
 #include <linux/init.h>
-#include <linux/delay.h>
+#include <linux/interrupt.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/mii.h>
-#include <linux/ethtool.h>
-#include <linux/bitops.h>
 #include <linux/platform_device.h>
+#include <linux/mdio-bitbang.h>
 
 #ifdef CONFIG_PPC_CPM_NEW_BINDING
 #include <linux/of_platform.h>
@@ -32,11 +30,11 @@
 #include "fs_enet.h"
 
 struct bb_info {
+   struct mdiobb_ctrl ctrl;
__be32 __iomem *dir;
__be32 __iomem *dat;
u32 mdio_msk;
u32 mdc_msk;
-   int delay;
 };
 
 /* FIXME: If any other users of GPIO crop up, then these will have to
@@ -59,212 +57,58 @@ static inline int bb_read(u32 __iomem *p, u32 m)
return (in_be32(p) & m) != 0;
 }
 
-static inline void mdio_active(struct bb_info *bitbang)
+static inline void mdio_dir(struct mdiobb_ctrl *ctrl, int dir)
 {
-   bb_set(bitbang->dir, bitbang->mdio_msk);
-}
+   struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
 
-static inline void mdio_tristate(struct bb_info *bitbang)
-{
-   bb_clr(bitbang->dir, bitbang->mdio_msk);
+   if (dir)
+   bb_set(bitbang->dir, bitbang->mdio_msk);
+   else
+   bb_clr(bitbang->dir, bitbang->mdio_msk);
+
+   /* Read back to flush the write. */
+   in_be32(bitbang->dir);
 }
 
-static inline int mdio_read(struct bb_info *bitbang)
+static inline int mdio_read(struct mdiobb_ctrl *ctrl)
 {
+   struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
return bb_read(bitbang->dat, bitbang->mdio_msk);
 }
 
-static inline void mdio(struct bb_info *bitbang, int what)
+static inline void mdio(struct mdiobb_ctrl *ctrl, int what)
 {
+   struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
+
if (what)
bb_set(bitbang->dat, bitbang->mdio_msk);
else
bb_clr(bitbang->dat, bitbang->mdio_msk);
+
+   /* Read back to flush the write. */
+   in_be32(bitbang->dat);
 }
 
-static inline void mdc(struct bb_info *bitbang, int what)
+static inline void mdc(struct mdiobb_ctrl *ctrl, int what)
 {
+   struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl);
+
if (what)
bb_set(bitbang->dat, bitbang->mdc_msk);
else
bb_clr(bitbang->dat, bitbang->mdc_msk);
-}
-
-static inline void mii_delay(struct bb_info *bitbang)
-{
-   udelay(bitbang->delay);
-}
-
-/* Utility to send the preamble, address, and register (common to read and write). */
-static void bitbang_pre(struct bb_info *bitbang , int read, u8 addr, u8 reg)
-{
-   int j;
-
-   /*
-* Send a 32 bit preamble ('1's) with an extra '1' bit for good measure.
-* The IEEE spec says this is a PHY optional requirement.  The AMD
-* 79C874 requires one after power up and one after a MII communications
-* error.  This means that we are doing more preambles than we need,
-* but it is safer and will be much more robust.
-*/
-
-   mdio_active(bitbang);
-   mdio(bitbang, 1);
-   for (j = 0; j < 32; j++) {
-   mdc(bitbang, 0);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-   }
-
-   /* send the start bit (01) and the read opcode (10) or write (10) */
-   mdc(bitbang, 0);
-   mdio(bitbang, 0);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-   mdc(bitbang, 0);
-   mdio(bitbang, 1);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-   mdc(bitbang, 0);
-   mdio(bitbang, read);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-   mdc(bitbang, 0);
-   mdio(bitbang, !read);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-
-   /* send the PHY address */
-   for (j = 0; j < 5; j++) {
-   mdc(bitbang, 0);
-   mdio(bitbang, (addr & 0x10) != 0);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-   addr <<= 1;
-   }
 
-   /* send the register address */
-   for (j = 0; j < 5; j++) {
-   mdc(bitbang, 0);
-   mdio(bitbang, (reg & 0x10) != 0);
-   mii_delay(bitbang);
-   mdc(bitbang, 1);
-   mii_delay(bitbang);
-   reg <<= 1;
-   }
+   /* Read back to flush the write. */

[PATCH 9/9] fs_enet: sparse fixes

2007-10-01 Thread Scott Wood
Mostly a bunch of __iomem annotations.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |   18 +-
 drivers/net/fs_enet/fs_enet.h  |   30 
 drivers/net/fs_enet/mac-fcc.c  |   71 
 drivers/net/fs_enet/mac-fec.c  |   34 +-
 drivers/net/fs_enet/mac-scc.c  |   37 ++-
 drivers/net/fs_enet/mii-fec.c  |8 ++--
 6 files changed, 103 insertions(+), 95 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c 
b/drivers/net/fs_enet/fs_enet-main.c
index a2dee7d..d1eb6dd 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -60,7 +60,7 @@ MODULE_DESCRIPTION("Freescale Ethernet Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_MODULE_VERSION);
 
-int fs_enet_debug = -1;    /* -1 == use FS_ENET_DEF_MSG_ENABLE as value */
+static int fs_enet_debug = -1; /* -1 == use FS_ENET_DEF_MSG_ENABLE as value */
 module_param(fs_enet_debug, int, 0);
 MODULE_PARM_DESC(fs_enet_debug,
 "Freescale bitmapped debugging message enable value");
@@ -90,7 +90,7 @@ static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
struct fs_enet_private *fep = container_of(napi, struct fs_enet_private, napi);
struct net_device *dev = to_net_dev(fep->dev);
const struct fs_platform_info *fpi = fep->fpi;
-   cbd_t *bdp;
+   cbd_t __iomem *bdp;
struct sk_buff *skb, *skbn, *skbt;
int received = 0;
u16 pkt_len, sc;
@@ -230,7 +230,7 @@ static int fs_enet_rx_non_napi(struct net_device *dev)
 {
struct fs_enet_private *fep = netdev_priv(dev);
const struct fs_platform_info *fpi = fep->fpi;
-   cbd_t *bdp;
+   cbd_t __iomem *bdp;
struct sk_buff *skb, *skbn, *skbt;
int received = 0;
u16 pkt_len, sc;
@@ -355,7 +355,7 @@ static int fs_enet_rx_non_napi(struct net_device *dev)
 static void fs_enet_tx(struct net_device *dev)
 {
struct fs_enet_private *fep = netdev_priv(dev);
-   cbd_t *bdp;
+   cbd_t __iomem *bdp;
struct sk_buff *skb;
int dirtyidx, do_wake, do_restart;
u16 sc;
@@ -503,7 +503,7 @@ fs_enet_interrupt(int irq, void *dev_id)
 void fs_init_bds(struct net_device *dev)
 {
struct fs_enet_private *fep = netdev_priv(dev);
-   cbd_t *bdp;
+   cbd_t __iomem *bdp;
struct sk_buff *skb;
int i;
 
@@ -557,7 +557,7 @@ void fs_cleanup_bds(struct net_device *dev)
 {
struct fs_enet_private *fep = netdev_priv(dev);
struct sk_buff *skb;
-   cbd_t *bdp;
+   cbd_t __iomem *bdp;
int i;
 
/*
@@ -598,7 +598,7 @@ void fs_cleanup_bds(struct net_device *dev)
 static int fs_enet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct fs_enet_private *fep = netdev_priv(dev);
-   cbd_t *bdp;
+   cbd_t __iomem *bdp;
int curidx;
u16 sc;
unsigned long flags;
@@ -1121,7 +1121,7 @@ static int fs_cleanup_instance(struct net_device *ndev)
unregister_netdev(ndev);
 
dma_free_coherent(fep->dev, (fpi->tx_ring + fpi->rx_ring) * sizeof(cbd_t),
- fep->ring_base, fep->ring_mem_addr);
+ (void __force *)fep->ring_base, fep->ring_mem_addr);
 
/* reset it */
(*fep->ops->cleanup_data)(ndev);
@@ -1141,7 +1141,7 @@ static int fs_cleanup_instance(struct net_device *ndev)
 
/**/
 
 /* handy pointer to the immap */
-void *fs_enet_immap = NULL;
+void __iomem *fs_enet_immap = NULL;
 
 static int setup_immap(void)
 {
diff --git a/drivers/net/fs_enet/fs_enet.h b/drivers/net/fs_enet/fs_enet.h
index 5a5c9d1..baf6477 100644
--- a/drivers/net/fs_enet/fs_enet.h
+++ b/drivers/net/fs_enet/fs_enet.h
@@ -15,7 +15,7 @@
 #include <asm/commproc.h>
 
 struct fec_info {
-   fec_t *fecp;
+   fec_t __iomem *fecp;
u32 mii_speed;
 };
 #endif
@@ -81,14 +81,14 @@ struct fs_enet_private {
const struct fs_ops *ops;
int rx_ring, tx_ring;
dma_addr_t ring_mem_addr;
-   void *ring_base;
+   void __iomem *ring_base;
struct sk_buff **rx_skbuff;
struct sk_buff **tx_skbuff;
-   cbd_t *rx_bd_base;  /* Address of Rx and Tx buffers.*/
-   cbd_t *tx_bd_base;
-   cbd_t *dirty_tx;/* ring entries to be free()ed. */
-   cbd_t *cur_rx;
-   cbd_t *cur_tx;
+   cbd_t __iomem *rx_bd_base;  /* Address of Rx and Tx buffers.*/
+   cbd_t __iomem *tx_bd_base;
+   cbd_t __iomem *dirty_tx;/* ring entries to be free()ed. */
+   cbd_t __iomem *cur_rx;
+   cbd_t __iomem *cur_tx;
int tx_free;
struct net_device_stats stats;
struct timer_list phy_timer_list;
@@ -113,23 +113,23 @@ struct fs_enet_private {
union {
struct {
int 

[PATCH 6/9] fs_enet: Be an of_platform device when CONFIG_PPC_CPM_NEW_BINDING is set.

2007-10-01 Thread Scott Wood
The existing OF glue code was crufty and broken.  Rather than fix it, it
will be removed, and the ethernet driver now talks to the device tree
directly.

The old, non-CONFIG_PPC_CPM_NEW_BINDING code can go away once CPM
platforms are dropped from arch/ppc (which will hopefully be soon), and
existing arch/powerpc boards that I wasn't able to test on for this
patchset get converted (which should be even sooner).

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/Kconfig|1 +
 drivers/net/fs_enet/fs_enet-main.c |  258 ---
 drivers/net/fs_enet/fs_enet.h  |   55 +---
 drivers/net/fs_enet/mac-fcc.c  |   89 +
 drivers/net/fs_enet/mac-fec.c  |   19 +++-
 drivers/net/fs_enet/mac-scc.c  |   53 +--
 drivers/net/fs_enet/mii-bitbang.c  |  269 +++-
 drivers/net/fs_enet/mii-fec.c  |  143 +++-
 include/linux/fs_enet_pd.h |5 +
 9 files changed, 714 insertions(+), 178 deletions(-)

diff --git a/drivers/net/fs_enet/Kconfig b/drivers/net/fs_enet/Kconfig
index e27ee21..2765e49 100644
--- a/drivers/net/fs_enet/Kconfig
+++ b/drivers/net/fs_enet/Kconfig
@@ -11,6 +11,7 @@ config FS_ENET_HAS_SCC
 config FS_ENET_HAS_FCC
bool "Chip has an FCC usable for ethernet"
depends on FS_ENET && CPM2
+   select MDIO_BITBANG
default y
 
 config FS_ENET_HAS_FEC
diff --git a/drivers/net/fs_enet/fs_enet-main.c 
b/drivers/net/fs_enet/fs_enet-main.c
index 7a02986..a2dee7d 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -42,12 +42,18 @@
 #include <asm/irq.h>
 #include <asm/uaccess.h>
 
+#ifdef CONFIG_PPC_CPM_NEW_BINDING
+#include <asm/of_platform.h>
+#endif
+
 #include "fs_enet.h"
 
 /*/
 
+#ifndef CONFIG_PPC_CPM_NEW_BINDING
 static char version[] __devinitdata =
 DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")" "\n";
+#endif
 
 MODULE_AUTHOR("Pantelis Antoniou [EMAIL PROTECTED]");
 MODULE_DESCRIPTION("Freescale Ethernet Driver");
@@ -948,6 +954,7 @@ static int fs_ioctl(struct net_device *dev, struct ifreq 
*rq, int cmd)
 extern int fs_mii_connect(struct net_device *dev);
 extern void fs_mii_disconnect(struct net_device *dev);
 
+#ifndef CONFIG_PPC_CPM_NEW_BINDING
 static struct net_device *fs_init_instance(struct device *dev,
struct fs_platform_info *fpi)
 {
@@ -1129,6 +1136,7 @@ static int fs_cleanup_instance(struct net_device *ndev)
 
return 0;
 }
+#endif
 
 
/**/
 
@@ -1137,35 +1145,250 @@ void *fs_enet_immap = NULL;
 
 static int setup_immap(void)
 {
-   phys_addr_t paddr = 0;
-   unsigned long size = 0;
-
 #ifdef CONFIG_CPM1
-   paddr = IMAP_ADDR;
-   size = 0x10000; /* map 64K */
-#endif
-
-#ifdef CONFIG_CPM2
-   paddr = CPM_MAP_ADDR;
-   size = 0x40000; /* map 256 K */
+   fs_enet_immap = ioremap(IMAP_ADDR, 0x4000);
+   WARN_ON(!fs_enet_immap);
+#elif defined(CONFIG_CPM2)
+   fs_enet_immap = cpm2_immr;
 #endif
-   fs_enet_immap = ioremap(paddr, size);
-   if (fs_enet_immap == NULL)
-   return -EBADF;  /* XXX ahem; maybe just BUG_ON? */
 
return 0;
 }
 
 static void cleanup_immap(void)
 {
-   if (fs_enet_immap != NULL) {
-   iounmap(fs_enet_immap);
-   fs_enet_immap = NULL;
-   }
+#if defined(CONFIG_CPM1)
+   iounmap(fs_enet_immap);
+#endif
 }
 
 
/**/
 
+#ifdef CONFIG_PPC_CPM_NEW_BINDING
+static int __devinit find_phy(struct device_node *np,
+  struct fs_platform_info *fpi)
+{
+   struct device_node *phynode, *mdionode;
+   struct resource res;
+   int ret = 0, len;
+
+   const u32 *data = of_get_property(np, "phy-handle", &len);
+   if (!data || len != 4)
+   return -EINVAL;
+
+   phynode = of_find_node_by_phandle(*data);
+   if (!phynode)
+   return -EINVAL;
+
+   mdionode = of_get_parent(phynode);
+   if (!phynode)
+   goto out_put_phy;
+
+   ret = of_address_to_resource(mdionode, 0, &res);
+   if (ret)
+   goto out_put_mdio;
+
+   data = of_get_property(phynode, "reg", &len);
+   if (!data || len != 4)
+   goto out_put_mdio;
+
+   snprintf(fpi->bus_id, 16, PHY_ID_FMT, res.start, *data);
+
+out_put_mdio:
+   of_node_put(mdionode);
+out_put_phy:
+   of_node_put(phynode);
+   return ret;
+}
+
+#ifdef CONFIG_FS_ENET_HAS_FEC
+#define IS_FEC(match) ((match)->data == &fs_fec_ops)
+#else
+#define IS_FEC(match) 0
+#endif
+
+static int __devinit fs_enet_probe(struct of_device *ofdev,
+   const struct of_device_id *match)
+{
+   struct net_device *ndev;
+   struct fs_enet_private *fep;
+   struct 

[PATCH 2/9] fs_enet: Fix build breakage.

2007-10-01 Thread Scott Wood
Commit 4fa57c9ea9f36f9ca852f3a88ca5d2f1aebbc960
(Make NAPI polling independent of struct net_device objects.)
introduced some build breakage in the napi rx function.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c 
b/drivers/net/fs_enet/fs_enet-main.c
index 2a1b150..a15345b 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -73,8 +73,8 @@ static void fs_set_multicast_list(struct net_device *dev)
 /* NAPI receive function */
 static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
 {
-   struct fs_enet_private *fep = container_of(napi, struct fec_enet_private, napi);
-   struct net_device *dev = fep->dev;
+   struct fs_enet_private *fep = container_of(napi, struct fs_enet_private, napi);
+   struct net_device *dev = to_net_dev(fep->dev);
const struct fs_platform_info *fpi = fep->fpi;
cbd_t *bdp;
struct sk_buff *skb, *skbn, *skbt;
-- 
1.5.3.2
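
The fix above depends on container_of() being handed the struct type that actually embeds the napi member. As a rough user-space illustration of the idiom (a simplified macro without the kernel's type check, and with hypothetical stand-in structs napi_ctx and priv rather than the real napi_struct and fs_enet_private):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of: step back from a member pointer to the
 * struct that embeds it. The kernel macro adds a type check on top. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct napi_ctx { int weight; };                /* hypothetical stand-in */
struct priv { int id; struct napi_ctx napi; };  /* hypothetical stand-in */

static struct priv *priv_from_napi(struct napi_ctx *napi)
{
    assert(napi != NULL);
    return container_of(napi, struct priv, napi);
}
```

Naming a struct type that is not defined, or that has no such member, makes offsetof() fail to compile, which is consistent with the build breakage described in this patch.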



[PATCH 7/9] Generic bitbanged MDIO library

2007-10-01 Thread Scott Wood
Previously, bitbanged MDIO was only supported in individual
hardware-specific drivers.  This code factors out the higher level
protocol implementation, reducing the hardware-specific portion to
functions setting direction, data, and clock.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/phy/Kconfig|9 ++
 drivers/net/phy/Makefile   |1 +
 drivers/net/phy/mdio-bitbang.c |  187 
 include/linux/mdio-bitbang.h   |   42 +
 4 files changed, 239 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/phy/mdio-bitbang.c
 create mode 100644 include/linux/mdio-bitbang.h

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index dd09011..72a98dd 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -76,4 +76,13 @@ config FIXED_MII_100_FDX
bool "Emulation for 100M Fdx fixed PHY behavior"
depends on FIXED_PHY
 
+config MDIO_BITBANG
+   tristate "Support for bitbanged MDIO buses"
+   help
+ This module implements the MDIO bus protocol in software,
+ for use by low level drivers that export the ability to
+ drive the relevant pins.
+
+ If in doubt, say N.
+
 endif # PHYLIB
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 8885650..3d6cc7b 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -13,3 +13,4 @@ obj-$(CONFIG_VITESSE_PHY) += vitesse.o
 obj-$(CONFIG_BROADCOM_PHY) += broadcom.o
 obj-$(CONFIG_ICPLUS_PHY)   += icplus.o
 obj-$(CONFIG_FIXED_PHY)+= fixed.o
+obj-$(CONFIG_MDIO_BITBANG) += mdio-bitbang.o
diff --git a/drivers/net/phy/mdio-bitbang.c b/drivers/net/phy/mdio-bitbang.c
new file mode 100644
index 0000000..8cd243d
--- /dev/null
+++ b/drivers/net/phy/mdio-bitbang.c
@@ -0,0 +1,187 @@
+/*
+ * Bitbanged MDIO support.
+ *
+ * Author: Scott Wood [EMAIL PROTECTED]
+ * Copyright (c) 2007 Freescale Semiconductor
+ *
+ * Based on CPM2 MDIO code which is:
+ *
+ * Copyright (c) 2003 Intracom S.A.
+ *  by Pantelis Antoniou [EMAIL PROTECTED]
+ *
+ * 2005 (c) MontaVista Software, Inc.
+ * Vitaly Bordug [EMAIL PROTECTED]
+ *
+ * This file is licensed under the terms of the GNU General Public License
+ * version 2. This program is licensed as is without any warranty of any
+ * kind, whether express or implied.
+ */
+
+#include <linux/module.h>
+#include <linux/mdio-bitbang.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/delay.h>
+
+#define MDIO_READ 1
+#define MDIO_WRITE 0
+
+#define MDIO_SETUP_TIME 10
+#define MDIO_HOLD_TIME 10
+
+/* Minimum MDC period is 400 ns, plus some margin for error.  MDIO_DELAY
+ * is done twice per period.
+ */
+#define MDIO_DELAY 250
+
+/* The PHY may take up to 300 ns to produce data, plus some margin
+ * for error.
+ */
+#define MDIO_READ_DELAY 350
+
+/* MDIO must already be configured as output. */
+static void mdiobb_send_bit(struct mdiobb_ctrl *ctrl, int val)
+{
+   const struct mdiobb_ops *ops = ctrl->ops;
+
+   ops->set_mdio_data(ctrl, val);
+   ndelay(MDIO_DELAY);
+   ops->set_mdc(ctrl, 1);
+   ndelay(MDIO_DELAY);
+   ops->set_mdc(ctrl, 0);
+}
+
+/* MDIO must already be configured as input. */
+static int mdiobb_get_bit(struct mdiobb_ctrl *ctrl)
+{
+   const struct mdiobb_ops *ops = ctrl->ops;
+
+   ndelay(MDIO_DELAY);
+   ops->set_mdc(ctrl, 1);
+   ndelay(MDIO_READ_DELAY);
+   ops->set_mdc(ctrl, 0);
+
+   return ops->get_mdio_data(ctrl);
+}
+
+/* MDIO must already be configured as output. */
+static void mdiobb_send_num(struct mdiobb_ctrl *ctrl, u16 val, int bits)
+{
+   int i;
+
+   for (i = bits - 1; i >= 0; i--)
+   mdiobb_send_bit(ctrl, (val >> i) & 1);
+}
+
+/* MDIO must already be configured as input. */
+static u16 mdiobb_get_num(struct mdiobb_ctrl *ctrl, int bits)
+{
+   int i;
+   u16 ret = 0;
+
+   for (i = bits - 1; i >= 0; i--) {
+   ret <<= 1;
+   ret |= mdiobb_get_bit(ctrl);
+   }
+
+   return ret;
+}
+
+/* Utility to send the preamble, address, and
+ * register (common to read and write).
+ */
+static void mdiobb_cmd(struct mdiobb_ctrl *ctrl, int read, u8 phy, u8 reg)
+{
+   const struct mdiobb_ops *ops = ctrl->ops;
+   int i;
+
+   ops->set_mdio_dir(ctrl, 1);
+
+   /*
+* Send a 32 bit preamble ('1's) with an extra '1' bit for good
+* measure.  The IEEE spec says this is a PHY optional
+* requirement.  The AMD 79C874 requires one after power up and
+* one after a MII communications error.  This means that we are
+* doing more preambles than we need, but it is safer and will be
+* much more robust.
+*/
+
+   for (i = 0; i < 32; i++)
+   mdiobb_send_bit(ctrl, 1);
+
+   /* send the start bit (01) and the read opcode (10) or write (10) */
+   mdiobb_send_bit(ctrl, 0);
+   mdiobb_send_bit(ctrl, 1);
+   mdiobb_send_bit(ctrl, read);
+   
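
Since the diff is cut off here, a small user-space sketch may help show what the command phase puts on the wire. It is not the patch's code: mdio_cmd_bits() and push_bits() are hypothetical helpers that record the bits in an array instead of toggling pins, and the opcode values assume the usual clause 22 framing (start 01, read 10, write 01).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MDIO_CMD_BITS 46    /* 32 preamble + 2 start + 2 opcode + 5 + 5 */

/* Append the low `bits` bits of val to out[], most significant bit first,
 * mirroring the loop order of mdiobb_send_num(). Returns the new length. */
static size_t push_bits(uint8_t *out, size_t n, unsigned val, int bits)
{
    for (int i = bits - 1; i >= 0; i--)
        out[n++] = (val >> i) & 1;
    return n;
}

/* Hypothetical helper: fill out[] (at least MDIO_CMD_BITS entries) with
 * the preamble, start bits, opcode, PHY address and register address. */
static size_t mdio_cmd_bits(uint8_t *out, int read, unsigned phy, unsigned reg)
{
    size_t n = 0;

    assert(phy < 32 && reg < 32);

    for (int i = 0; i < 32; i++)        /* preamble: 32 ones */
        out[n++] = 1;
    n = push_bits(out, n, 1, 2);                /* start: 01 */
    n = push_bits(out, n, read ? 2 : 1, 2);     /* read 10, write 01 */
    n = push_bits(out, n, phy, 5);
    n = push_bits(out, n, reg, 5);
    return n;
}
```

Recording the bits rather than driving hardware keeps the framing testable on its own, which is roughly the split the patch makes between protocol code and the set_mdc/set_mdio_data callbacks.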

Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression

2007-10-01 Thread Eric Dumazet

So maybe the following patch is necessary...

I believe IPV6 & DCCP are immune to this problem.

Thanks again Denys for spotting this.

Eric

[PATCH] TCP : secure_tcp_sequence_number() should not use a too fast clock

TCP V4 sequence numbers are 32bits, and RFC 793 assumed a 250 KHz clock.
In order to follow network speed increase, we can use a faster clock, but
we should limit this clock so that the delay between two rollovers is
greater than MSL (TCP Maximum Segment Lifetime : 2 minutes)

Choosing a 64 nsec clock should be OK, since the rollovers occur every
274 seconds.

Problem spotted by Denys Fedoryshchenko

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

--- linux-2.6.22/drivers/char/random.c  2007-10-01 10:18:42.0 +0200
+++ linux-2.6.22-ed/drivers/char/random.c   2007-10-01 21:47:58.0 
+0200
@@ -1550,11 +1550,13 @@ __u32 secure_tcp_sequence_number(__be32 
 *  As close as possible to RFC 793, which
 *  suggests using a 250 kHz clock.
 *  Further reading shows this assumes 2 Mb/s networks.
-*  For 10 Gb/s Ethernet, a 1 GHz clock is appropriate.
-*  That's funny, Linux has one built in!  Use it!
-*  (Networks are faster now - should this be increased?)
+*  For 10 Mb/s Ethernet, a 1 MHz clock is appropriate.
+*  For 10 Gb/s Ethernet, a 1 GHz clock should be ok, but
+*  we also need to limit the resolution so that the u32 seq
+*  overlaps less than one time per MSL (2 minutes).
+*  Choosing a clock of 64 ns period is OK. (period of 274 s)
 */
-   seq += ktime_get_real().tv64;
+   seq += ktime_get_real().tv64 >> 6;
 #if 0
printk("init_seq(%lx, %lx, %d, %d) = %d\n",
   saddr, daddr, sport, dport, seq);
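
The rollover figures in the changelog above can be checked with a few lines of arithmetic. This is a user-space sketch only; rollover_seconds() and seq_clock() are hypothetical names, and the timestamp is taken as a plain unsigned nanosecond count rather than a ktime_t.

```c
#include <assert.h>
#include <stdint.h>

#define MSL_SECONDS 120.0   /* TCP Maximum Segment Lifetime: 2 minutes */

/* Seconds until a 32-bit counter wraps when it ticks once every
 * (1 << shift) nanoseconds. */
static double rollover_seconds(unsigned shift)
{
    assert(shift < 32);     /* keeps the 64-bit product from overflowing */

    uint64_t ticks = (uint64_t)1 << 32;
    uint64_t period_ns = (uint64_t)1 << shift;

    return (double)(ticks * period_ns) / 1e9;
}

/* Clock contribution to the sequence number: nanoseconds scaled down to
 * (1 << shift) ns ticks, then truncated to 32 bits. */
static uint32_t seq_clock(uint64_t now_ns, unsigned shift)
{
    return (uint32_t)(now_ns >> shift);
}
```

With shift 0 (raw nanoseconds) the 32-bit value wraps after about 4.3 seconds, far below MSL_SECONDS; with shift 6 (64 ns ticks) it wraps after about 274.9 seconds, which matches the 274 s quoted in the patch.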


sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread Chris Friesen


Hi all,

We're considering some hardware that uses the sk98lin network hardware, 
and we'll be using jumbo frames.  Looking at the driver, when using a 
9KB MTU it seems like it would end up trying to atomically allocate a 
16KB buffer.


Has anyone heard of this being a problem?  It would seem like trying to 
atomically allocate four physically contiguous pages could become tricky 
after the system has been running for a while.


The reason I ask is that we ran into this with the e1000.  Before they 
added the new jumbo frame code it was trying to atomically allocate 32KB 
buffers and we would start getting allocation failures after a month or 
so of uptime.


Any information anyone can provide would be appreciated.


Thanks,

Chris
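
To make the sizes in this message concrete, here is a small sketch of how a buffer size maps to a number of contiguous pages. It assumes 4 KB pages and that the request is rounded up to a power-of-two number of pages; alloc_order() and PAGE_SIZE_ASSUMED are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_ASSUMED 4096u     /* assumption: 4 KB pages */

/* Smallest order such that (1 << order) pages hold `bytes`. */
static unsigned alloc_order(size_t bytes)
{
    assert(bytes > 0);

    size_t pages = (bytes + PAGE_SIZE_ASSUMED - 1) / PAGE_SIZE_ASSUMED;
    unsigned order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Under these assumptions a 9000-byte frame needs 3 pages, which rounds up to order 2, i.e. four physically contiguous pages (16 KB); a 32 KB buffer is order 3, eight contiguous pages.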


Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread John Heffner

Yes it has this problem.  I've observed it in practice on a busy firewall.

  -John


Chris Friesen wrote:
>
> Hi all,
>
> We're considering some hardware that uses the sk98lin network hardware,
> and we'll be using jumbo frames.  Looking at the driver, when using a
> 9KB MTU it seems like it would end up trying to atomically allocate a
> 16KB buffer.
>
> Has anyone heard of this being a problem?  It would seem like trying to
> atomically allocate four physically contiguous pages could become tricky
> after the system has been running for a while.
>
> The reason I ask is that we ran into this with the e1000.  Before they
> added the new jumbo frame code it was trying to atomically allocate 32KB
> buffers and we would start getting allocation failures after a month or
> so of uptime.
>
> Any information anyone can provide would be appreciated.
>
> Thanks,
>
> Chris




Re: [PATCH] net-2.6.24: old ax25 driver fix

2007-10-01 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Mon, 1 Oct 2007 11:24:17 -0700

> Recent change in hard header broke build of these old drivers.
>
> Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Applied, thanks Stephen.


Re: [PATCH 01/10] Preparatory refactoring part 1.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

Make a new function sfq_q_enqueue() that operates directly on the
queue data. This will be useful for implementing sfq_change() in
a later patch. A pleasant side-effect is reducing most of the
duplicate code in sfq_enqueue() and sfq_requeue().

Similarly, make a new function sfq_q_dequeue().

Signed-off-by: Corey Hickey [EMAIL PROTECTED]
---
 net/sched/sch_sfq.c |   72 +++
 1 files changed, 38 insertions(+), 34 deletions(-)

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 3a23e30..57485ef 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c




The sfq_q_enqueue part looks fine.

 
-	sch-qstats.drops++;



A line in the changelog explaining that this was increased twice
would have been nice.


Certainly; I think I didn't realize, when you originally pointed out the 
duplicate incrementing, that it was a bug in the original version and 
not in my patch. Otherwise, I would have sent it as a separate patch.


If a note in this patch will suffice, though, I'll definitely do so.


sfq_drop(sch);
return NET_XMIT_CN;
 }
 
-

-
-
-static struct sk_buff *
-sfq_dequeue(struct Qdisc* sch)
+static struct
+sk_buff *sfq_q_dequeue(struct sfq_sched_data *q)



What is this function needed for?


It gets used in sfq_change for moving packets from the old queue into 
the new one. In this case, we don't want to modify sch->q.qlen or 
sch->qstats.backlog, since those don't actually change.


 while ((skb = sfq_q_dequeue(q)) != NULL)
 sfq_q_enqueue(skb, tmp, SFQ_TAIL);


I'll improve the description of this patch to make that more clear.

-Corey


Re: [PATCH 03/10] Move two functions.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

Move sfq_q_destroy() to above sfq_q_init() so that it can be used
by an error case in a later patch.

Move sfq_destroy() as well, for clarity.



This patch looks pointless, just put them where you need them
in the patch introducing them.


As you wish. I thought having a separate patch would ease reviewing.

-Corey


Re: [PATCH 05/10] Add divisor.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

Make hash divisor user-configurable.




@@ -120,7 +121,7 @@ static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1
 	/* Have we any rotation primitives? If not, WHY? */
 	h ^= (h1<<pert) ^ (h1>>(0x1F - pert));
 	h ^= h>>10;
-	return h & 0x3FF;
+	return h & (q->hash_divisor-1);



This assumes that hash_divisor is a power of two, but this is
not enforced anywhere.


Ok. I'll move that part from userspace to the kernel. That should be 
better anyway.
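
The power-of-two requirement raised above can be illustrated with a small
userspace sketch. The helper names is_power_of_two() and fold_with_mask()
are made up for this illustration and are not part of the patch; the point
is only that masking with (divisor - 1) matches a modulo when the divisor
has exactly one bit set, and silently skips buckets otherwise.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when n is a non-zero power of two (exactly one bit set). */
static bool is_power_of_two(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Fold a hash value into a bucket index using a mask.
 * This equals h % divisor only when divisor is a power of two.
 */
static uint32_t fold_with_mask(uint32_t h, uint32_t divisor)
{
	return h & (divisor - 1);
}
```

A kernel-side check along these lines would presumably reject a divisor
such as 1000 before it ever reaches the hash function.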


-Corey


Re: [PATCH 06/10] Make qdisc changeable.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

Re-implement sfq_change() and enable Qdisc_opts.change so tc qdisc
change will work.




+static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
+{
+   ...
+
+   /* finish up */
+	if (q->perturb_period) {
+		q->perturb_timer.expires = jiffies + q->perturb_period;
+		add_timer(&q->perturb_timer);
+	} else {
+		q->perturbation = 0;



Seems counter-productive to explicitly set it to zero since
it was still used while transferring the packets with the
old value. So I'd suggest removing this or, alternatively,
setting it to the final value *before* transferring the packets.


I suppose so; you're right. I'll adapt that part to fit before 
transferring the packets.


-Corey


Re: [PATCH 09/10] Change perturb_period to unsigned.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

perturb_period is currently a signed integer, but I can't see any good
reason why this is so--a negative perturbation period will add a timer
that expires in the past, causing constant perturbation, which makes
hashing useless.

if (q->perturb_period) {
	q->perturb_timer.expires = jiffies + q->perturb_period;
	add_timer(&q->perturb_timer);
}

Strictly speaking, this will break binary compatibility with older
versions of tc, but that ought not to be a problem because (a) there's
no valid use for a negative perturb_period, and (b) negative values
will be seen as high values (> INT_MAX), which don't work anyway.

If perturb_period is too large, (perturb_period * HZ) will overflow the
size of an unsigned int and wrap around. So, check for that and reject
values that are too high.



Sounds reasonable.


--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -74,6 +74,9 @@
 typedef unsigned int sfq_index;
 #define SFQ_MAX_DEPTH (UINT_MAX / 2 - 1)
 
+/* We don't want perturb_period * HZ to overflow an unsigned int. */

+#define SFQ_MAX_PERTURB (UINT_MAX / HZ)



jiffies are unsigned long.


Hmm. You're right. It looks like my previous patch obviated the need for 
this part. I'll remove it.
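
For reference, the guard being discussed can be sketched in userspace as
follows. This is only an illustration under assumptions: the hz parameter
stands in for the kernel's HZ, and both function names are hypothetical.
It shows the unsigned int bound from the patch and why a negative value
passed by an older tc shows up as a value above INT_MAX.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * True when period_secs * hz fits in an unsigned int.
 * hz is a stand-in for the kernel's HZ and must be non-zero.
 */
static bool perturb_period_fits(unsigned int period_secs, unsigned int hz)
{
	return period_secs <= UINT_MAX / hz;
}

/* A negative int reinterpreted as unsigned lands above INT_MAX. */
static bool looks_negative(unsigned int period_secs)
{
	return period_secs > (unsigned int)INT_MAX;
}
```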


-Corey


Re: [PATCH 10/10] Use nested compat attributes to pass parameters.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

+
+#define GET_PARAM(dst, nest, compat) do { \
+   struct rtattr *rta = tb[(nest) - 1]; \
+   if (rta) \
+   (dst) = RTA_GET_U32(rta); \
+   else if ((compat)) \
+   (dst) = (compat); \
+} while (0)



An inline function and a comment why this is done would increase
readability.


Well, I had a reason for making a macro, but it probably wasn't a good 
reason. Looking now, I don't see why not to make a function. I'll see 
what I can do.
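
A rough idea of what the macro could look like as an inline function, shown
here as a userspace sketch rather than kernel code. The name sfq_get_param()
and the pointer-or-NULL convention for the nested attribute are assumptions
made for this illustration; the real code would read the value out of a
struct rtattr.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for GET_PARAM.
 * nested: value from the nested attribute, or NULL when it is absent.
 * compat: value from the legacy struct; zero is treated as "not set".
 * The nested attribute wins; otherwise a non-zero compat value is used;
 * otherwise *dst keeps its current (default) value.
 */
static inline void sfq_get_param(uint32_t *dst, const uint32_t *nested,
				 uint32_t compat)
{
	if (nested)
		*dst = *nested;
	else if (compat)
		*dst = compat;
}
```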



+	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	RTA_PUT_U32(skb, TCA_SFQ_QUANTUM, q->quantum);
+	RTA_PUT_U32(skb, TCA_SFQ_PERTURB, q->perturb_period);
+	RTA_PUT_U32(skb, TCA_SFQ_LIMIT,   q->limit);
+	RTA_PUT_U32(skb, TCA_SFQ_DIVISOR, q->hash_divisor);
+	RTA_PUT_U32(skb, TCA_SFQ_FLOWS,   q->depth);
 	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);



This is wrong, RTA_NEST_COMPAT already dumps the structure.


You mean that last line (RTA_PUT) is superfluous, right? I can't see a 
reason for it to be there, so I must have just forgotten to delete it 
from the original code.


If I'm wrong, I might need a little hand-holding here. My understanding 
of all the RTA stuff is a bit shaky.



Much thanks for the review. I'll make a new set of patches soon.

-Corey


Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread Jeff Garzik

Chris Friesen wrote:
We're considering some hardware that uses the sk98lin network hardware, 
and we'll be using jumbo frames.  Looking at the driver, when using a 
9KB MTU it seems like it would end up trying to atomically allocate a 
16KB buffer.


The sk98lin driver is going away, please don't use it.

It's unmaintained and full of known bugs.

Jeff





Re: [PATCH 02/10] Preparatory refactoring part 2.

2007-10-01 Thread Corey Hickey

Patrick McHardy wrote:

Corey Hickey wrote:

The sfq_destroy() --> sfq_q_destroy() change looks pointless here,
but it's cleaner to split now and add code to sfq_q_destroy() in a
later patch.

+static void sfq_destroy(struct Qdisc *sch)
+{
+   struct sfq_sched_data *q = qdisc_priv(sch);
+   sfq_q_destroy(q);
+}



It does look pointless; after applying all patches sfq_destroy still
remains a simple wrapper around sfq_q_destroy.


It does remain a wrapper, but both functions are used. It doesn't have 
to be this way, but I wanted to avoid duplicating code and I didn't see 
a better layout.


sfq_q_destroy is used in sfq_q_init if a kcalloc fails. sfq_q_init knows 
nothing about struct Qdisc *sch, so it can't call sfq_destroy.


sfq_destroy is still marked as the destroy function in sfq_qdisc_ops.

-Corey


Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression

2007-10-01 Thread David Miller
From: Eric Dumazet [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 22:10:03 +0200

 So maybe the following patch is necessary...
 
 I believe IPV6 & DCCP are immune to this problem.
 
 Thanks again Denys for spotting this.
 
 Eric
 
 [PATCH] TCP : secure_tcp_sequence_number() should not use a too fast clock
 
 TCP V4 sequence numbers are 32bits, and RFC 793 assumed a 250 KHz clock.
 In order to follow network speed increase, we can use a faster clock, but
 we should limit this clock so that the delay between two rollovers is
 greater than MSL (TCP Maximum Segment Lifetime : 2 minutes)
 
 Choosing a 64 nsec clock should be OK, since the rollovers occur every
 274 seconds.
 
 Problem spotted by Denys Fedoryshchenko
 
 Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

Thanks a lot Eric for bringing closure to this.

I'll apply this and add a reference in the commit message to the
changeset that introduced this problem, since it might help
others who look at this.
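
The rollover figures in the changelog above can be checked with a short
userspace sketch. The function names are invented for this illustration,
and the shift by 6 is just one way of deriving 64 ns ticks from a
nanosecond counter (64 == 1 << 6), not necessarily what the applied patch
does. A 32-bit counter wraps after 2^32 ticks, so a 1 ns tick wraps in
about 4.3 seconds, far below the 2 minute MSL, while a 64 ns tick wraps in
about 274.9 seconds.

```c
#include <stdint.h>

/* Seconds until a 32-bit counter wraps when it advances once per tick_ns. */
static double seq_rollover_seconds(uint64_t tick_ns)
{
	const double ticks = 4294967296.0;	/* 2^32 */

	return ticks * (double)tick_ns / 1e9;
}

/* Convert a nanosecond timestamp into 64 ns ticks, truncated to 32 bits. */
static uint32_t ns_to_64ns_ticks(uint64_t now_ns)
{
	return (uint32_t)(now_ns >> 6);
}
```

By the same arithmetic, the 250 kHz clock from RFC 793 (one tick every
4000 ns) wraps only after roughly 4.8 hours.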


Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread Stephen Hemminger
On Mon, 01 Oct 2007 14:35:48 -0600
Chris Friesen [EMAIL PROTECTED] wrote:

 
 Hi all,
 
 We're considering some hardware that uses the sk98lin network hardware, 
 and we'll be using jumbo frames.  Looking at the driver, when using a 
 9KB MTU it seems like it would end up trying to atomically allocate a 
 16KB buffer.
 
 Has anyone heard of this being a problem?  It would seem like trying to 
 atomically allocate four physically contiguous pages could become tricky 
 after the system has been running for a while.
 
 The reason I ask is that we ran into this with the e1000.  Before they 
 added the new jumbo frame code it was trying to atomically allocate 32KB 
 buffers and we would start getting allocation failures after a month or 
 so of uptime.
 
 Any information anyone can provide would be appreciated.

Adding fragmentation support to skge driver is on my list of
possible extensions. sky2 driver already supports it (yet one
more feature that the vendor sk98lin driver doesn't do).
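
The buffer sizes mentioned in this thread line up with power-of-two page
allocations. A minimal userspace sketch, assuming 4 KB pages and an
allocator that hands out 2^order contiguous pages (alloc_order() is a
made-up helper, not a kernel function): a buffer of a little over 9 KB
already needs order 2, which is the 16 KB, four-page allocation described
above, and the 32 KB e1000 case is order 3.

```c
#include <stddef.h>

/* Smallest order such that (page_size << order) >= size. */
static unsigned int alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

A driver that receives into page-sized fragments stays at order 0, which
is why fragmentation support sidesteps the problem.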

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread Chris Friesen

Stephen Hemminger wrote:


Adding fragmentation support to skge driver is on my list of
possible extensions. sky2 driver already supports it (yet one
more feature that the vendor sk98lin driver doesn't do).


Thanks for speaking up.  As I mentioned in my email to Jeff it looks 
like the sky2 driver is what I need (Marvell Yukon 88E8062).  However, 
I'm on 2.6.14 and it doesn't exist there...do you anticipate any issues 
if I were to backport it?


Thanks,

Chris


Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread Chris Friesen

Jeff Garzik wrote:


The sk98lin driver is going away, please don't use it.

It's unmaintained and full of known bugs.


Okay...so it looks like the proper driver for the Marvell Yukon 88E8062 
is the sky2 driver, and this one does avoid order>0 allocations.  Am I 
on track?


Chris


Re: sk98lin, jumbo frames, and memory fragmentation

2007-10-01 Thread Stephen Hemminger
On Mon, 01 Oct 2007 15:15:59 -0600
Chris Friesen [EMAIL PROTECTED] wrote:

 Stephen Hemminger wrote:
 
  Adding fragmentation support to skge driver is on my list of
  possible extensions. sky2 driver already supports it (yet one
  more feature that the vendor sk98lin driver doesn't do).
 
 Thanks for speaking up.  As I mentioned in my email to Jeff it looks 
 like the sky2 driver is what I need (Marvell Yukon 88E8062).  However, 
 I'm on 2.6.14 and it doesn't exist there...do you anticipate any issues 
 if I were to backport it?

Nothing but usual annoying kernel API changes..


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [PATCH net-2.6.24 0/4]: TCP fixes

2007-10-01 Thread David Miller
From: Ilpo_Järvinen [EMAIL PROTECTED]
Date: Mon,  1 Oct 2007 15:29:40 +0300

 This fixes the newreno fackets_out case, which turned out to be
 not related to the Cedric's case being under investigation. Two
 trivial comment patches, and frto with high-speed seqno
 wrap-around protection. Compile tested. Please apply to
 net-2.6.24.

I've applied them all to net-2.6.24, thanks Ilpo!


[PATCH] mv643xx_eth: Do not modify struct netdev tx_queue_len

2007-10-01 Thread Dale Farnsworth
From: Dale Farnsworth [EMAIL PROTECTED]

This driver erroneously zeros dev->tx_queue_len, since
mp->tx_ring_size has not yet been initialized.  Actually,
the driver shouldn't modify tx_queue_len at all and should
leave the value set by alloc_etherdev(), currently 1000.

Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
---
Jeff, this bug was just reported today, or I would have batched
it with the one I sent you last week.  It's an obvious bugfix,
so I'm not going to hold it in my queue.

 drivers/net/mv643xx_eth.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
index 34288fe..3153356 100644
--- a/drivers/net/mv643xx_eth.c
+++ b/drivers/net/mv643xx_eth.c
@@ -1357,7 +1357,6 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 #endif
 
 	dev->watchdog_timeo = 2 * HZ;
-	dev->tx_queue_len = mp->tx_ring_size;
 	dev->base_addr = 0;
 	dev->change_mtu = mv643xx_eth_change_mtu;
 	dev->do_ioctl = mv643xx_eth_do_ioctl;


Re: [PATCH][TG3]Some cleanups

2007-10-01 Thread Michael Chan
On Sun, 2007-09-30 at 14:11 -0400, jamal wrote:
 Here are some non-batching related changes that i have in my batching
 tree. Like the e1000e, they make the xmit code more readable.
 I wouldnt mind if you take them over.
 

Jamal, in tg3_enqueue_buggy(), we may have to call tg3_tso_bug() which
will recursively call tg3_start_xmit_dma_bug() after segmenting the TSO
packet into normal packets.  We need to restore the VLAN tag so that the
GSO code will create the chain of segmented SKBs with the proper VLAN
tag.



How do queue-less virtual devices wake higher level senders?

2007-10-01 Thread Ben Greear

Hello!

I am having some trouble figuring out how virtual interfaces
(such as mac-vlans) can wake up writers (such as udp sockets).

For 'real' hardware, it seems that the netif_stop_queue and
netif_wake_queue methods handle stopping and waking the
higher level senders, but for virtual devices with no
queues, how does this work?

In my case, I'm using a virtual Station interface that sits on
top of a wifi radio interface (hacked up madwifi).  I notice
that UDP connections set up for high speed, unidirectional
sends are stalling after a few minutes.  netstat -an shows
a write-buffer that is quite full, but nothing is transmitted.

If I ping or start any other type of traffic on these interfaces,
the udp recovers.  It seems like the udp send logic is just
getting stuck and needs a kick.

I do not see any problems with TCP connections, and if I keep
a slow-speed tcp connection running, the UDP will not hang.

It's likely the bug is in my driver and/or code, so this is
not a bug report..just a question to hopefully help me debug
it further :)

Thanks,
Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: How do queue-less virtual devices wake higher level senders?

2007-10-01 Thread David Miller
From: Ben Greear [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 16:49:06 -0700

 For 'real' hardware, it seems that the netif_stop_queue and
 netif_wake_queue methods handle stopping and waking the
 higher level senders, but for virtual devices with no
 queues, how does this work?

They don't queue, there is nothing to stop or wakeup.


Re: How do queue-less virtual devices wake higher level senders?

2007-10-01 Thread Ben Greear

David Miller wrote:

From: Ben Greear [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 16:49:06 -0700


For 'real' hardware, it seems that the netif_stop_queue and
netif_wake_queue methods handle stopping and waking the
higher level senders, but for virtual devices with no
queues, how does this work?


They don't queue, there is nothing to stop or wakeup.


Ok, so if I have a UDP socket bound to an interface that has
no queue, and yet I see the send portion of the queue being
full in netstat, what does this mean?

Maybe the device I think has no queue somehow does?

I added some debugging to print out dev->state in sysfs, and
the state of the virtual interface is always 0x6, which appears right
to me.  Its underlying device goes back and forth between 0x7 and 0x6,
which also seems right to me.

When the thing is in the hung state, phys and virtual interface have 0x6
state, and yet the udp tx queue remains full.  The physical NIC also
prints out some errors about being low on buffers right before the
hang, but it seems to recover since just doing a ping or starting
a second udp connection brings everything back to life.

Other than IFF_UP and dev->state, are there other things that
can make the tx logic stop sending to a device?

Thanks,
Ben






--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com



Re: tcp bw in 2.6

2007-10-01 Thread Larry McVoy
On Sat, Sep 29, 2007 at 11:02:32AM -0700, Linus Torvalds wrote:
 On Sat, 29 Sep 2007, Larry McVoy wrote:
  I haven't kept up on switch technology but in the past they were much
  better than you are thinking.  The Kalpana switch that I had modified
  to support vlans (invented by yours truly), did not store and forward,
  it was cut through and could handle any load that was theoretically
  possible within about 1%.
 
 Hey, you may well be right. Maybe my assumptions about cutting corners are 
 just cynical and pessimistic. 

So I got a netgear switch and it works fine.  But my tests are busted.  
Catching netdev up, I'm trying to optimize traffic to a server that has
a gbit interface; I moved to a 24 port netgear that is all 10/100/1000
and I have a pile of clients to act as load generators.

I can do this on each of the clients 

dd if=/dev/zero bs=1024000 | rsh work dd of=/dev/null

and that cranks up to about 47K packets/second which is about 70MB/sec.
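
As a sanity check on those two numbers, here is the arithmetic as a tiny
sketch. The per-packet sizes are assumptions, not measurements from this
setup: 1448 bytes is the TCP payload of a full segment with timestamps on
a 1500-byte MTU, and 1500 bytes counts the whole IP packet.
payload_bytes_per_sec() is a made-up helper name.

```c
/* Bytes per second for a given packet rate and per-packet size. */
static double payload_bytes_per_sec(double packets_per_sec, double bytes_per_packet)
{
	return packets_per_sec * bytes_per_packet;
}
```

With 47,000 packets per second this gives roughly 68 MB/sec of payload at
1448 bytes per packet and 70.5 MB/sec at 1500 bytes per packet, so the two
figures are consistent with full-sized segments.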

One of my clients also has gigabit so I played around with just that
one and it (itanium running hpux w/ broadcom gigabit) can push the load
as well.  One weird thing is that it is dependent on the direction the
data is flowing.  If the hp is sending then I get 46MB/sec, if linux is
sending then I get 18MB/sec.  Weird.  Linux is debian, running 

Linux work 2.6.18-5-k7 #1 SMP Thu Aug 30 02:52:31 UTC 2007 i686 

and dual e1000 cards:

e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection

I wrote a tiny little program to try and emulate this and I can't get
it to do as well.  I've tracked it down, I think, to the read side.
The server sources, the client sinks, the server looks like:

11689 accept(3, {sa_family=AF_INET, sin_port=htons(49376), sin_addr=inet_addr("10.3.1.38")}, [16]) = 4
11689 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
11689 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
11689 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7ddf708) = 11694
11689 close(4)  = 0
11689 accept(3,  <unfinished ...>
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
...

but the client looks like

connect(3, {sa_family=AF_INET, sin_port=htons(31235), 
sin_addr=inet_addr(10.3.9.1)}, 16) = 0
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
1448
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
1448
read(3, \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..., 1048576) = 
2896

which I suspect may be the problem.

I played around with SO_RCVBUF/SO_SNDBUF and that didn't help.  So any ideas why
a simple dd piped through rsh is kicking my ass?  It must be something simple
but my test program is tiny and does nothing weird that I can see.
-- 
---
Larry McVoy            lm at bitmover.com           http://www.bitkeeper.com


Re: tcp bw in 2.6

2007-10-01 Thread Linus Torvalds


On Mon, 1 Oct 2007, Larry McVoy wrote:
 
 but the client looks like
 
 connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
..

This is exactly what I'd expect if the machine is *not* under excessive 
load.

The system calls are fast enough that the latency for the TCP stack is 
roughly on the same scale as the time it takes to receive one new packet, 
so since a socket read will always return when it has any data (not until 
it has filled the whole buffer), you get exactly that one or two packets 
pattern.

If you'd be really CPU-limited or under load from other programs, you'd 
have more packets come in while you're in the read path, and you'd get 
bigger reads.
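
The read() behaviour described above (return as soon as any data is
available rather than waiting for the whole buffer) is why code that wants
a fixed amount of data has to loop. A minimal sketch using only POSIX
read(); the helper name read_full() is made up for this illustration, and
it is not meant as a throughput fix, only as a picture of the semantics.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Keep calling read() until len bytes have arrived, end of file is
 * reached, or an error other than EINTR occurs.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_full(int fd, void *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = read(fd, (char *)buf + done, len - done);

		if (n == 0)		/* peer closed the connection */
			break;
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

On an unloaded receiver each inner read() would still return one or two
segments at a time, exactly as in the trace; the loop only hides that from
the caller.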

But do a tcpdump both ways, and see (for example) if the TCP window is 
much bigger going the other way.

Linus


Re: tcp bw in 2.6

2007-10-01 Thread Larry McVoy
On Mon, Oct 01, 2007 at 07:14:37PM -0700, Linus Torvalds wrote:
 
 
 On Mon, 1 Oct 2007, Larry McVoy wrote:
  
  but the client looks like
  
  connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
  read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
 ..
 
 This is exactly what I'd expect if the machine is *not* under excessive 
 load.

That's fine, but why is it that my trivial program can't do as well as 
dd | rsh dd?

A short summary is: can someone please post a test program that sources
and sinks data at the wire speed?  Because apparently I'm too old and
clueless to write such a thing.
-- 
---
Larry McVoy            lm at bitmover.com           http://www.bitkeeper.com


Re: [PATCH 01/10] Preparatory refactoring part 1.

2007-10-01 Thread Patrick McHardy
Corey Hickey wrote:
 Patrick McHardy wrote:
 
 -	sch->qstats.drops++;

 A line in the changelog explaining that this was increased twice
 would have been nice.
 
 
 Certainly; I think I didn't realize, when you originally pointed out the
 duplicate incrementing, that it was a bug in the original version and
 not in my patch. Otherwise, I would have sent it as a separate patch.


I didn't remember that :)

 If a note in this patch will suffice, though, I'll definitely do so.


Sure, a note in the changelog will be fine.

 +static struct
 +sk_buff *sfq_q_dequeue(struct sfq_sched_data *q)



 What is this function needed for?
 
 
 It gets used in sfq_change for moving packets from the old queue into
 the new one. In this case, we don't want to modify sch->q.qlen or
 sch->qstats.backlog, since those don't actually change.
 
  while ((skb = sfq_q_dequeue(q)) != NULL)
  sfq_q_enqueue(skb, tmp, SFQ_TAIL);


I missed that, thanks for the explanation.



Re: Removing DAD in IPv6

2007-10-01 Thread Xia Yang
Hi,

I just found out that the IFA_F_NODAD flag is not in the kernel used in my
test bed, which is 2.6.17. So I tried to modify the code in ipv6/addrconf.c
myself to remove the DAD:

	if (!max_addresses ||
	    ipv6_count_addresses(in6_dev) < max_addresses)
		ifp = ipv6_add_addr(in6_dev, &addr, pinfo->prefix_len,
				    addr_type&IPV6_ADDR_SCOPE_MASK, 0);

	if (!ifp || IS_ERR(ifp)) {
		in6_dev_put(in6_dev);
		return;
	}
	// New code
	if (!IS_ERR(ifp)) {
		spin_lock_bh(&ifp->lock);
		ifp->flags &= ~IFA_F_TENTATIVE;
		spin_unlock_bh(&ifp->lock);

		addrconf_join_solict(ifp->idev->dev, &ifp->addr);
		ipv6_ifa_notify(RTM_NEWADDR, ifp);
		//in6_ifa_put(ifp);
		printk("New address configured.\n");
	}
	// --end ---
	update_lft = create = 1;
	ifp->cstamp = jiffies;

	// addrconf_dad_start(ifp, RTF_ADDRCONF|RTF_PREFIX_RT);

However, even though the new address is generated and assigned to the
interface, and I can read the address from the /proc interface, my first
few packets are eaten by the kernel. Only after about 1 second can my
packets make their way out. Is the kernel doing anything that blocks the
sending and receiving of packets during the time DAD would normally run?
Thanks a lot!

Best Regards,

Xia Yang



On Mon, 2007-10-01 at 20:44 +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote:
 In article [EMAIL PROTECTED] (at Mon, 01 Oct 2007 11:53:27 +0800), Xia Yang 
 [EMAIL PROTECTED] says:
 
  I would like to ask for help on how to remove or disable the DAD process
  properly, as long as the node can send, receive and forward packets
  immediately after a new IPv6 address is generated. Any pointer is
  appreciated. Thanks a lot in advance!
 
 IFA_F_NODAD address flag might help this.
 
 --yoshfuji

