Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
Denys wrote:

Hi, I got:

pi linux-git # git bisect bad
Bisecting: 0 revisions left to test after this
[f85958151900f9d30fa5ff941b0ce71eaa45a7de] [NET]: random functions can use nsec resolution instead of usec

I will make sure and will try to reverse this patch on 2.6.22. But it seems that's it.

Well... that's interesting...

No problem here on bigger servers, so I CC David Miller and netdev on this one. AFAIK do_gettimeofday() and ktime_get_real() should use the same underlying hardware functions on PC and no performance problem should happen here.

Relevant part of this patch:

@@ -1521,7 +1515,6 @@ __u32 secure_ip_id(__be32 daddr)
 __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
 				 __be16 sport, __be16 dport)
 {
-	struct timeval tv;
 	__u32 seq;
 	__u32 hash[4];
 	struct keydata *keyptr = get_keyptr();
@@ -1543,12 +1536,11 @@ __u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr,
 	 * As close as possible to RFC 793, which
 	 * suggests using a 250 kHz clock.
 	 * Further reading shows this assumes 2 Mb/s networks.
-	 * For 10 Mb/s Ethernet, a 1 MHz clock is appropriate.
+	 * For 10 Gb/s Ethernet, a 1 GHz clock is appropriate.
 	 * That's funny, Linux has one built in! Use it!
 	 * (Networks are faster now - should this be increased?)
 	 */
-	do_gettimeofday(&tv);
-	seq += tv.tv_usec + tv.tv_sec * 1000000;
+	seq += ktime_get_real().tv64;

Thank you for doing this research.

On Sun, 30 Sep 2007 14:25:37 +1000, Nick Piggin wrote:

Hi Denys, thanks for reporting (btw. please reply-to-all when replying on lkml). You say that SLAB is better than SLUB on an otherwise identical kernel, but I didn't see if you quantified the actual numbers? It sounds like there is still a regression with SLAB?

On Monday 01 October 2007 03:48, Eric Dumazet wrote:

Denys wrote:

I recently moved one of my proxies (squid and some compressing application) from 2.6.21 to 2.6.22, and noticed a huge performance drop.
I think this is important, cause it can cause serious regression on some other workloads like busy web-servers and etc. After some analysis of different options i can bring more exact numbers: 2.6.21 able to process 500-550 requests/second and 15-20 Mbit/s of traffic, and working great without any slowdown or instability. 2.6.22 able to process only 250-300 requests and 8-10 Mbit/s of traffic, ssh and console is freezing (there is delay even for typing characters). Both proxies is on identical hardware(Sun Fire X4100), configuration(small system, LFS-like, on USB flash), different only kernel. I tried to disable/enable various options and optimisations - it doesn't change anything, till i reach SLUB/SLAB option. I've loaded proxy configuration to gentoo PC with 2.6.22 (then upgraded it to 2.6.23-rc8), and having same effect. Additionally, when load reaching maximum i can notice whole system slowdown, for example ssh and scp takes much more time to run, even i do nice -n -5 for them. But even choosing 2.6.23-rc8+SLAB i noticed same freezing of ssh (and sure it slowdown other kind of network performance), but much less comparing with SLUB. On top i am seeing ksoftirqd taking almost 100% (sometimes ksoftirqd/0, sometimes ksoftirqd/1). I tried also different tricks with scheduler (/proc/sys/kernel/sched*), but it's also didn't help. 
When it freezes it looks like:

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
    7 root   15  -5     0    0    0 R   64  0.0  2:47.48 ksoftirqd/1
 5819 root   20   0  134m 130m  596 R   57  3.3  4:36.78 globax
 5911 squid  20   0 1138m 1.1g 2124 R   26 28.9  2:24.87 squid
   10 root   15  -5     0    0    0 S    1  0.0  0:01.86 events/1
 6130 root   20   0  3960 2416 1592 S    0  0.1  0:08.02 oprofiled

Oprofile results:

That's oprofile with 2.6.23-rc8 - SLUB

73918    21.5521  check_bytes
38361    11.1848  acpi_pm_read
14077     4.1044  init_object
13632     3.9747  ip_send_reply
 8486     2.4742  __slab_alloc
 7199     2.0990  nf_iterate
 6718     1.9588  page_address
 6716     1.9582  tcp_v4_rcv
 6425     1.8733  __slab_free
 5604     1.6339  on_freelist

That's oprofile with 2.6.23-rc8 - SLAB

CPU: AMD64 processors, speed 2592.64 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 10
samples  %        symbol name
138991   14.0627  acpi_pm_read
 52401    5.3018  tcp_v4_rcv
 48466    4.9037  nf_iterate
 38043    3.8491  __slab_alloc
 34155    3.4557  ip_send_reply
 20963    2.1210  ip_rcv
 19475    1.9704  csum_partial
 19084    1.9309  kfree
 17434    1.7639  ip_output
 17278    1.7481  netif_receive_skb
 15248    1.5428  nf_hook_slow

My .config is at http://www.nuclearcat.com/.config (SPARSEMEM is enabled there; it doesn't make any noticeable difference)

Please CC me on
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
From: Eric Dumazet [EMAIL PROTECTED] Date: Mon, 01 Oct 2007 07:59:12 +0200 No problem here on bigger servers, so I CC David Miller and netdev on this one. AFAIK do_gettimeofday() and ktime_get_real() should use the same underlying hardware functions on PC and no performance problem should happen here. One thing that jumps out at me is that on 32-bit (and to a certain extent on 64-bit) there is a lot of stack accesses and missed optimizations because all of the work occurs, and gets expanded, inside of ktime_get_real(). The timespec_to_ktime() inside of there constructs the ktime_t return value on the stack, then returns that as an aggregate to the caller. That cannot be without some cost. ktime_get_real() is definitely a candidate for inlining especially in these kinds of cases where we'll happily get computations in local registers instead of all of this on-stack nonsense. And in several cases (if the caller only needs the tv_sec value, for example) computations can be elided entirely. It would be constructive to experiment and see if this is in fact part of the problem. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Patrick McHardy wrote:
> Eric W. Biederman wrote:
>> Patrick McHardy [EMAIL PROTECTED] writes:
>>> Maybe I can save you some time: we used to do down_trylock() for the
>>> rtnl mutex, so senders would simply return if someone else was already
>>> processing the queue *or* the rtnl was locked for some other reason.
>>> In the first case the process already processing the queue would also
>>> process the new messages, but if the rtnl was locked for some other
>>> reason (for example during module registration) the message would sit
>>> in the queue until the next rtnetlink sendmsg call, which is why
>>> rtnl_unlock does queue processing. Commit 6756ae4b changed the
>>> down_trylock to mutex_lock, so senders will now simply wait until the
>>> mutex is released and then call netlink_run_queue themselves. This
>>> means it's not needed anymore.
>>
>> Sounds reasonable. I started looking through the code paths and I
>> currently cannot see anything that would leave a message on a kernel
>> rtnl socket. However I did a quick test adding a WARN_ON if there were
>> any messages found in the queue during rtnl_unlock and I found this
>> code path getting invoked from linkwatch_event. So there is clearly
>> something I don't understand, and it sounds at odds just a bit from
>> your description.
>
> That sounds like a bug. Did you place the WARN_ON before or after the
> mutex_unlock()?

The presence of the message in the queue during rtnl_unlock is quite possible, as the normal user->kernel message processing path for rtnl is the following:

netlink_sendmsg
  netlink_unicast
    netlink_sendskb
      skb_queue_tail
      netlink_data_ready
        rtnetlink_rcv
          mutex_lock(&rtnl_mutex);
          netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
          mutex_unlock(&rtnl_mutex);

so the presence of the packet in the rtnl queue on rtnl_unlock looks to me like a normal race with rtnetlink_rcv.

Regards,
Den
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
Well, I can play a bit more on live servers. I now have a hot-swap server with full Gentoo, where I can rebuild any kernel you want, with any patch applied.

But it looks less like plain overhead: the load becomes very spiky rather than just permanently higher. It is also not normal that the whole system becomes unresponsive (for example, ping 127.0.0.1 goes up to 300 ms for a period, when softirq usage jumps to 100%).

On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote
> From: Eric Dumazet [EMAIL PROTECTED]
> Date: Mon, 01 Oct 2007 07:59:12 +0200
>
>> No problem here on bigger servers, so I CC David Miller and netdev on
>> this one. AFAIK do_gettimeofday() and ktime_get_real() should use the
>> same underlying hardware functions on PC and no performance problem
>> should happen here.
>
> One thing that jumps out at me is that on 32-bit (and to a certain
> extent on 64-bit) there is a lot of stack accesses and missed
> optimizations because all of the work occurs, and gets expanded, inside
> of ktime_get_real(). The timespec_to_ktime() inside of there constructs
> the ktime_t return value on the stack, then returns that as an
> aggregate to the caller. That cannot be without some cost.
>
> ktime_get_real() is definitely a candidate for inlining especially in
> these kinds of cases where we'll happily get computations in local
> registers instead of all of this on-stack nonsense. And in several
> cases (if the caller only needs the tv_sec value, for example)
> computations can be elided entirely.
>
> It would be constructive to experiment and see if this is in fact part
> of the problem.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
Eric Dumazet wrote:
> Denys wrote:
>> Well, I can play a bit more on live servers. I now have a hot-swap
>> server with full Gentoo, where I can rebuild any kernel you want, with
>> any patch applied.
>>
>> But it looks less like plain overhead: the load becomes very spiky
>> rather than just permanently higher. It is also not normal that the
>> whole system becomes unresponsive (for example, ping 127.0.0.1 goes up
>> to 300 ms for a period, when softirq usage jumps to 100%).
>
> Could you try a pristine 2.6.22.9 and some patch in
> secure_tcp_sequence_number() like:
>
> --- drivers/char/random.c.orig 2007-10-01 10:18:42.0 +0200
> +++ drivers/char/random.c 2007-10-01 10:19:58.0 +0200
> @@ -1554,7 +1554,7 @@
>  	 * That's funny, Linux has one built in! Use it!
>  	 * (Networks are faster now - should this be increased?)
>  	 */
> -	seq += ktime_get_real().tv64;
> +	seq += ktime_get_real().tv64 / 1000;
>  #if 0
>  	printk("init_seq(%lx, %lx, %d, %d) = %d\n",
>  	       saddr, daddr, sport, dport, seq);

On 32-bit machines, replace the divide by a shift to avoid a linker error (undefined reference to `__divdi3'):

	seq += ktime_get_real().tv64 >> 10;
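As a side note on the divide-versus-shift suggestion above, here is a minimal userspace sketch (not kernel code) of the two conversions. The function names ns_to_usec_exact and ns_to_usec_shift are made up for illustration. The point is only that shifting right by 10 divides by 1024 rather than 1000, so the result is roughly 2% smaller than the exact microsecond count, which should not matter for a clock that merely has to tick at about 1 MHz.

```c
#include <assert.h>
#include <stdint.h>

/* Exact nanoseconds -> microseconds. On a 32-bit build this 64-bit
 * division is what typically becomes a call to the libgcc helper
 * __divdi3, which is the symbol the linker error complains about. */
int64_t ns_to_usec_exact(int64_t ns)
{
	return ns / 1000;
}

/* Approximate conversion: a right shift by 10 divides by 1024.
 * Hypothetical helper; it assumes a non-negative timestamp. */
int64_t ns_to_usec_shift(int64_t ns)
{
	assert(ns >= 0);
	return ns >> 10;
}
```

For one millisecond (1000000 ns) the exact version gives 1000 and the shift version gives 976, so the shifted clock runs slightly slow but stays in the same order of magnitude.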
Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Denis V. Lunev [EMAIL PROTECTED] writes:

> The presence of the message in the queue during rtnl_unlock is quite
> possible, as the normal user->kernel message processing path for rtnl
> is the following:
>
> netlink_sendmsg
>   netlink_unicast
>     netlink_sendskb
>       skb_queue_tail
>       netlink_data_ready
>         rtnetlink_rcv
>           mutex_lock(&rtnl_mutex);
>           netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
>           mutex_unlock(&rtnl_mutex);
>
> so the presence of the packet in the rtnl queue on rtnl_unlock looks to
> me like a normal race with rtnetlink_rcv.

Yes. That is what I saw in practice as well. Thanks for confirming this. It happened to be reproducible because I had a dhcp client asking for a list of links in parallel with the actual link coming up during boot.

Looking at netlink_unicast and netlink_broadcast I am generally convinced that we can remove the call of sk_data_ready in rtnl_unlock. I think those are the only two possible paths through there and I don't see how we could miss processing a packet on the way through there.

What would be nice is if we could figure out how to eliminate this race, as that would allow netlink packets to be processed synchronously and we could actually use current for security checks, and for getting the context of the calling process.

Right now we are 99% of the way there, but because of the above race the code must all be written as if netlink packets were coming in completely asynchronously. Which is unfortunate and a pain.

Eric
Re: 2.6.23-rc8-mm2 - tcp_fastretrans_alert() WARNING
Ilpo Järvinen wrote: On Sat, 29 Sep 2007, Cedric Le Goater wrote: Ilpo Järvinen wrote: On Fri, 28 Sep 2007, Ilpo Järvinen wrote: On Fri, 28 Sep 2007, Cedric Le Goater wrote: I just found that warning in my logs. It seems that it's been happening since rc7-mm1 at least. WARNING: at /home/legoater/linux/2.6.23-rc8-mm2/net/ipv4/tcp_input.c:2314 tcp_fastretrans_alert() Call Trace: IRQ [8040fdc3] tcp_ack+0xcd6/0x1894 ...snip... ...Thanks for the report, I'll have look what could still break fackets_out... I think this one is now clear to me, tcp_fragment/collapse adjusts fackets_out (incorrectly) also for reno flow when there were some dupACKs that made sacked_out != 0. Could you please try if patch below proves all them to be of non-SACK origin... In case that's true, it's rather harmless, I'll send a fix on Monday or so (this would anyway be needed)... If you find out that them occur with SACK enabled flow, that would be more interesting and requires more digging... I'm trying now to reproduce this WARNING. It seems that the n/w behaves differently during the week ends. Probably taking a break. Thanks. Of course there are other means too to determine if TCP flows do negotiate SACK enabled or not. Depending on your test case (which is fully unknown to me) they may or may not be usable... At least the value of tcp_sack sysctl on both systems or tcpdump catching SYN packets should give that detail. ...If you know to which hosts TCP could be connected (and active) to, while the WARNING triggers, it's really easy to test what is being negotiated as it's unlikely to change at short notice and any TCP flow to that host will get us the same information though the WARNING would not be triggered with it at this time. Obviously if at least one of the remotes is not known or the set ends up being mixture of reno and SACK flows, then we'll just have to wait and see which fish we get... got it ! 
r3-06.test.meiosys.com login: WARNING: at /home/legoater/linux/2.6.23-rc8-mm2/net/ipv4/tcp_input.c:2314 tcp_fastretrans_alert() Call Trace: IRQ [8040fdc3] tcp_ack+0xcd6/0x18af [80412b6f] tcp_rcv_established+0x61f/0x6df [80254146] __lock_acquire+0x8a1/0xf1b [80419d19] tcp_v4_do_rcv+0x3e/0x394 [8041a68b] tcp_v4_rcv+0x61c/0x9a9 [803ff1e3] ip_local_deliver+0x1da/0x2a4 [803ffb4e] ip_rcv+0x583/0x5c9 [8046d35b] packet_rcv_spkt+0x19a/0x1a8 [803e081c] netif_receive_skb+0x2cf/0x2f5 [88042505] :tg3:tg3_poll+0x65d/0x8a4 [803e09e8] net_rx_action+0xb8/0x191 [8023a927] __do_softirq+0x5f/0xe0 [8020c98c] call_softirq+0x1c/0x28 [8020e9c3] do_softirq+0x3b/0xb8 [8023aa1e] irq_exit+0x4e/0x50 [8020e7df] do_IRQ+0xbd/0xd7 [80209cb9] mwait_idle+0x0/0x4d [8020bce6] ret_from_intr+0x0/0xf EOI [80209cfc] mwait_idle+0x43/0x4d [802099fb] enter_idle+0x22/0x24 [80209c4f] cpu_idle+0x9d/0xc0 [80476aa1] rest_init+0x55/0x57 [80630815] start_kernel+0x2d6/0x2e2 [80630134] _sinittext+0x134/0x13b TCP 0 I wasn't doing any particular test on n/w so it took me a while to figure out how I was triggering the WARNING. Apparently, this is happening when I run ketchup, but not always. This test machine is behind many firewall routers so it might be a reason. tcpdump gave me this output for a wget on kernel.org : 10:51:14.835981 IP r3-06.test.meiosys.com.40322 pub2.kernel.org.http: S 737836267:737836267(0) win 5840 mss 1460,sackOK,timestamp 1309245 0,nop,wscale 7 10:51:14.975153 IP pub2.kernel.org.http r3-06.test.meiosys.com.40321: F 524:524(0) ack 166 win 5840 10:51:14.975177 IP r3-06.test.meiosys.com.40321 pub2.kernel.org.http: . ack 525 win 7504 I'm trying to get the WARNING and the tcpdump output for it but for the moment, it seems it's beyond my reach :/ Hope it helps ! C. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] rtnl_unlock cleanups
There is no need to process outstanding netlink user->kernel packets during rtnl_unlock now. There is no rtnl_trylock in the rtnetlink_rcv anymore.

Normal code path is the following:

netlink_sendmsg
  netlink_unicast
    netlink_sendskb
      skb_queue_tail
      netlink_data_ready
        rtnetlink_rcv
          mutex_lock(&rtnl_mutex);
          netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
          mutex_unlock(&rtnl_mutex);

So, it is possible that packets can be present in the rtnl->sk_receive_queue during rtnl_unlock, but there is no need to process them at that moment as rtnetlink_rcv for that packet is pending.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
Acked-by: Alexey Kuznetsov [EMAIL PROTECTED]

--- ./net/core/rtnetlink.c.rtnl2 2007-08-26 19:30:38.0 +0400
+++ ./net/core/rtnetlink.c 2007-10-01 13:09:03.0 +0400
@@ -75,8 +75,6 @@ void __rtnl_unlock(void)
 void rtnl_unlock(void)
 {
 	mutex_unlock(&rtnl_mutex);
-	if (rtnl && rtnl->sk_receive_queue.qlen)
-		rtnl->sk_data_ready(rtnl, 0);
 	netdev_run_todo();
 }
 
@@ -1319,11 +1317,9 @@ static void rtnetlink_rcv(struct sock *s
 	unsigned int qlen = 0;
 
 	do {
-		mutex_lock(&rtnl_mutex);
+		rtnl_lock();
 		qlen = netlink_run_queue(sk, qlen, rtnetlink_rcv_msg);
-		mutex_unlock(&rtnl_mutex);
-
-		netdev_run_todo();
+		rtnl_unlock();
 	} while (qlen);
 }
Re: [PATCH 2/3][NET_BATCH] net core use batching
jamal wrote:
> +static inline int
> +dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
> +		 struct Qdisc *q)
> +{
> +
> +	struct sk_buff *skb;
> +
> +	while ((skb = __skb_dequeue(skbs)) != NULL)
> +		q->ops->requeue(skb, q);

->requeue queues at the head, so this looks like it would reverse the order of the skbs.
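The ordering concern above can be illustrated with a small userspace sketch. It does not use the kernel's sk_buff_head or Qdisc types; struct intq and the helpers below are hypothetical stand-ins, with integers in place of skbs. It shows that taking items from the head of the pending list and pushing each one onto the head of the queue reverses them, whereas taking them from the tail keeps the original order.

```c
#include <assert.h>
#include <stddef.h>

#define QCAP 16

/* Tiny array-backed queue; item[0] is the head. Illustrative only. */
struct intq {
	int item[QCAP];
	size_t len;
};

void push_head(struct intq *q, int v)
{
	assert(q->len < QCAP);
	for (size_t i = q->len; i > 0; i--)
		q->item[i] = q->item[i - 1];
	q->item[0] = v;
	q->len++;
}

int pop_head(struct intq *q)
{
	assert(q->len > 0);
	int v = q->item[0];
	for (size_t i = 1; i < q->len; i++)
		q->item[i - 1] = q->item[i];
	q->len--;
	return v;
}

int pop_tail(struct intq *q)
{
	assert(q->len > 0);
	return q->item[--q->len];
}

/* Same shape as the quoted loop: dequeue from the head of the pending
 * list, requeue at the head of the queue. Reverses the pending items. */
void requeue_from_head(struct intq *pending, struct intq *q)
{
	while (pending->len > 0)
		push_head(q, pop_head(pending));
}

/* One possible alternative: walk the pending list from its tail, so
 * that head insertion restores the original order. */
void requeue_from_tail(struct intq *pending, struct intq *q)
{
	while (pending->len > 0)
		push_head(q, pop_tail(pending));
}
```

With pending items 1, 2, 3 and a queue already holding 4, 5, the first loop leaves 3, 2, 1, 4, 5 and the second leaves 1, 2, 3, 4, 5. Whether dequeuing from the tail is the right fix for the actual patch is a separate question; the sketch only demonstrates the reversal.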
[patch] dm9601: Fix receive MTU
Please apply to 2.6.23.
---
dm9601 didn't take the ethernet header into account when calculating RX MTU, causing packets bigger than 1486 to fail.

Signed-off-by: Peter Korsgaard [EMAIL PROTECTED]
---
 drivers/net/usb/dm9601.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.23-rc8/drivers/net/usb/dm9601.c
===================================================================
--- linux-2.6.23-rc8.orig/drivers/net/usb/dm9601.c
+++ linux-2.6.23-rc8/drivers/net/usb/dm9601.c
@@ -405,7 +405,7 @@
 	dev->net->ethtool_ops = &dm9601_ethtool_ops;
 	dev->net->hard_header_len += DM_TX_OVERHEAD;
 	dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
-	dev->rx_urb_size = dev->net->mtu + DM_RX_OVERHEAD;
+	dev->rx_urb_size = dev->net->mtu + ETH_HLEN + DM_RX_OVERHEAD;
 
 	dev->mii.dev = dev->net;
 	dev->mii.mdio_read = dm9601_mdio_read;

--
Bye, Peter Korsgaard
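The 1486 figure in the description follows from simple arithmetic, sketched below as standalone C. ETH_HLEN is the standard 14-byte Ethernet header. The value 7 for DM_RX_OVERHEAD is an assumption made for this example (it cancels out of the result anyway), and the helper names are hypothetical, not part of the driver.

```c
#include <assert.h>

#define ETH_HLEN       14u /* Ethernet header: 6 + 6 + 2 bytes */
#define DM_RX_OVERHEAD  7u /* assumed value, for illustration only */

/* Receive buffer size before the fix: no room for the Ethernet header. */
unsigned int rx_urb_size_old(unsigned int mtu)
{
	return mtu + DM_RX_OVERHEAD;
}

/* Receive buffer size after the fix. */
unsigned int rx_urb_size_fixed(unsigned int mtu)
{
	return mtu + ETH_HLEN + DM_RX_OVERHEAD;
}

/* Largest payload (bytes after the Ethernet header) that still fits
 * into a receive buffer of the given size. */
unsigned int max_payload(unsigned int rx_urb_size)
{
	assert(rx_urb_size >= ETH_HLEN + DM_RX_OVERHEAD);
	return rx_urb_size - ETH_HLEN - DM_RX_OVERHEAD;
}
```

With the default MTU of 1500, the old sizing leaves room for only 1500 - 14 = 1486 payload bytes, while the fixed sizing fits the full 1500.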
Re: [RFC][IPv6] Export userland ND options through netlink (RDNSS support)
Hello.

In article [EMAIL PROTECTED] (at Sat, 29 Sep 2007 19:47:20 +0200), Pierre Ynard [EMAIL PROTECTED] says:

> As discussed before, this patch provides userland with a way to access
> relevant options in Router Advertisements, after they are processed and
> validated by the kernel. Extra options are processed in a generic way;
> this patch only exports RDNSS options described in RFC5006, but support
> to control which options are exported could be easily added.

I basically like this approach at first sight.

> which implies that a userland daemon processing RDNSS options needs a
> way to associate the option to the router that sent it, and fetch its
> lifetime. This kind of information could be included in a header in the
> rtnetlink message (in this version of the patch there is none).
>
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index dff3192..f69d415 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -97,6 +97,9 @@ enum {
>  	RTM_SETNEIGHTBL,
>  #define RTM_SETNEIGHTBL RTM_SETNEIGHTBL
>  
> +	RTM_NEWNDUSEROPT = 68,
> +#define RTM_NEWNDUSEROPT RTM_NEWNDUSEROPT
> +
>  	__RTM_MAX,

Does this imply that we could extend (or reuse) this for all of NS/NA/RS/RA/Redirect messages? I think you need to include the code, type and basic semantics of the message. If this is only for RA, we should say RTM_NEWRAUSEROPT or something.

Regards,

--yoshfuji
Re: PROBLEM: 2.6.23-rc NETDEV WATCHDOG: eth0: transmit timed out
Hi,

after reading about issues with the NICs on Kontron boards I did a BIOS upgrade, but this did not change anything. However, yesterday the (onboard) NIC I used died. No link at all. After switching to the next onboard NIC I got a NETDEV transmit timeout with that one on kernel 2.6.22-r2. It seems the whole thing is a hardware issue. I will try to figure it out with Kontron.

Sorry :(

Karl

2007/9/12, Francois Romieu [EMAIL PROTECTED]:
> Karl Meyer [EMAIL PROTECTED] :
> [...]
>> I have been looking into this issue for some time now, but there were
>> no errors in 2.6.22-r2 (gentoo speak, I guess this is 2.6.22.2
>> officially), I also ran git-bisect (for more information see the older
>> messages in this thread).
>
> 2.6.22-r2 in gentoo is based on 2.6.22.1. It is way before
> 0e4851502f846b13b29b7f88f1250c980d57e944 that you reported to work.
> Thus it is not surprising that it works.
>
> Any update regarding the patchkit that I sent on 2007/08/16? It would
> help to narrow down the culprit.
>
> --
> Ueimor
Re: Removing DAD in IPv6
In article [EMAIL PROTECTED] (at Mon, 01 Oct 2007 11:53:27 +0800), Xia Yang [EMAIL PROTECTED] says:

> I would like to ask for help on how to remove or disable the DAD
> process properly, as long as the node can send, receive and forward
> packets immediately after a new IPv6 address is generated. Any pointer
> is appreciated. Thanks a lot in advance!

IFA_F_NODAD address flag might help this.

--yoshfuji
Re: [IPV6] Fix ICMPv6 redirect handling with target multicast address
Hello.

In article [EMAIL PROTECTED] (at Sat, 29 Sep 2007 10:04:48 +0900 (JST)), YOSHIFUJI Hideaki / 吉藤英明 [EMAIL PROTECTED] says:

In article [EMAIL PROTECTED] (at Fri, 28 Sep 2007 17:50:38 -0700), David Stevens [EMAIL PROTECTED] says:

> Brian,
> A multicast address should never be the target of a neighbor discovery
> request; the sender should use the mapping function for all multicasts.
> So, I'm not sure that your example can ever happen, and it certainly is
> ok to send ICMPv6 errors to multicast addresses in general. But I don't
> see that it hurts anything, either (since it should never happen :-)),
> so I don't particularly object, either.
>
> I think it'd also be better if you add the check to be:
>
> if (ipv6_addr_type(target) & (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST))
>
> or something along those lines, rather than reproducing
> ipv6_addr_type() code separately in a new ipv6_addr_linklocal()
> function.

I'm fine with the idea of the fix itself.

Please use ipv6_addr_type() so far and convert other users as well to ipv6_addr_linklocal() in another patch.

Regards,

--yoshfuji
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
Not able to compile kernel with patch drivers/built-in.o: In function `secure_tcp_sequence_number': (.text+0x3ad02): undefined reference to `__divdi3' make: *** [.tmp_vmlinux1] Error 1 On Mon, 01 Oct 2007 10:20:07 +0200, Eric Dumazet wrote Denys a : Well, i can play a bit more on live servers. I have now hot-swap server with full gentoo, where i can rebuild any kernel you want, with any applied patch. But it looks more like not overhead, load becoming high too spiky, and it is not just permantenly higher. Also it is not normal that all system becoming unresposive (for example ping 127.0.0.1 becoming 300ms for period, when usage softirq jumps to 100%). Could you try a pristine 2.6.22.9 and some patch in secure_tcp_sequence_number() like : --- drivers/char/random.c.orig 2007-10-01 10:18:42.0 +0200 +++ drivers/char/random.c 2007-10-01 10:19:58.0 +0200 @@ -1554,7 +1554,7 @@ * That's funny, Linux has one built in! Use it! * (Networks are faster now - should this be increased?) */ - seq += ktime_get_real().tv64; + seq += ktime_get_real().tv64 / 1000; #if 0 printk(init_seq(%lx, %lx, %d, %d) = %d\n, saddr, daddr, sport, dport, seq); Thank you On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote From: Eric Dumazet [EMAIL PROTECTED] Date: Mon, 01 Oct 2007 07:59:12 +0200 No problem here on bigger servers, so I CC David Miller and netdev on this one. AFAIK do_gettimeofday() and ktime_get_real() should use the same underlying hardware functions on PC and no performance problem should happen here. One thing that jumps out at me is that on 32-bit (and to a certain extent on 64-bit) there is a lot of stack accesses and missed optimizations because all of the work occurs, and gets expanded, inside of ktime_get_real(). The timespec_to_ktime() inside of there constructs the ktime_t return value on the stack, then returns that as an aggregate to the caller. That cannot be without some cost. 
ktime_get_real() is definitely a candidate for inlining especially in these kinds of cases where we'll happily get computations in local registers instead of all of this on-stack nonsense. And in several cases (if the caller only needs the tv_sec value, for example) computations can be elided entirely. It would be constructive to experiment and see if this is in fact part of the problem. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. -- Denys Fedoryshchenko Technical Manager Virtual ISP S.A.L. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Make TCP prequeue configurable
David Miller [EMAIL PROTECTED] writes:

> Furthermore, prequeue puts the stack input processing work into user
> context, which means that the users will be charged more fairly for the
> work that is done for them.

For more details on this people might want to read the old Lazy Receiver Processing papers: http://www.cs.rice.edu/CS/Systems/LRP/

-Andi
[PATCH 2/4] [TCP]: fix comments that got messed up during code move
Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2286361..135f046 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1467,8 +1467,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 	return flag;
 }
 
-/* F-RTO can only be used if TCP has never retransmitted anything other than
- * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
+/* If we receive more dupacks than we expected counting segments
+ * in assumption of absent reordering, interpret this as reordering.
+ * The only another reason could be bug in receiver TCP.
  */
 static void tcp_check_reno_reordering(struct sock *sk, const int addend)
 {
@@ -1516,6 +1517,9 @@ static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
 	tp->sacked_out = 0;
 }
 
+/* F-RTO can only be used if TCP has never retransmitted anything other than
+ * head (SACK enhanced variant from Appendix B of RFC4138 is more robust here)
+ */
 int tcp_use_frto(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
--
1.5.0.6
[PATCH net-2.6.24 0/4]: TCP fixes
Hi Dave,

This fixes the newreno fackets_out case, which turned out not to be related to Cedric's case that is still under investigation. Two trivial comment patches, and frto with high-speed seqno wrap-around protection. Compile tested.

Please apply to net-2.6.24.

--
 i.
[PATCH 1/4] [TCP]: No fackets_out/highest_sack tuning when SACK isn't enabled
This was found due to a bug report from Cedric Le Goater, though this turned out to be an unrelated bug.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_output.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 94c8011..6199abe 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -660,7 +660,7 @@ static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned
 
 static void tcp_adjust_fackets_out(struct tcp_sock *tp, struct sk_buff *skb, int decr)
 {
-	if (!tp->sacked_out)
+	if (!tp->sacked_out || tcp_is_reno(tp))
 		return;
 
 	if (!before(tp->highest_sack, TCP_SKB_CB(skb)->seq))
@@ -712,7 +712,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss
 	TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
 	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
 
-	if (tp->sacked_out && (TCP_SKB_CB(skb)->seq == tp->highest_sack))
+	if (tcp_is_sack(tp) && tp->sacked_out &&
+	    (TCP_SKB_CB(skb)->seq == tp->highest_sack))
 		tp->highest_sack = TCP_SKB_CB(buff)->seq;
 
 	/* PSH and FIN should only be set in the second packet. */
@@ -1718,7 +1719,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m
 
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
-	if (WARN_ON(tp->sacked_out &&
+	if (WARN_ON(tcp_is_sack(tp) && tp->sacked_out &&
 	    (TCP_SKB_CB(next_skb)->seq == tp->highest_sack)))
 		return;
 
--
1.5.0.6
[PATCH 3/4] [TCP]: Update comment of SACK block validator
Just came across what RFC2018 states about generation of valid SACK blocks in case of reneging. Alter comment a bit to point out clearly.

IMHO, there isn't any reason to change code because the validation is there for a purpose (counters will inform user about decision TCP made if this case ever surfaces).

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 135f046..cec2611 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1027,8 +1027,15 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
  * SACK block range validation checks that the received SACK block fits to
  * the expected sequence limits, i.e., it is between SND.UNA and SND.NXT.
  * Note that SND.UNA is not included to the range though being valid because
- * it means that the receiver is rather inconsistent with itself (reports
- * SACK reneging when it should advance SND.UNA).
+ * it means that the receiver is rather inconsistent with itself reporting
+ * SACK reneging when it should advance SND.UNA. Such SACK block this is
+ * perfectly valid, however, in light of RFC2018 which explicitly states
+ * that SACK block MUST reflect the newest segment. Even if the newest
+ * segment is going to be discarded ..., not that it looks very clever
+ * in case of head skb. Due to potentional receiver driven attacks, we
+ * choose to avoid immediate execution of a walk in write queue due to
+ * reneging and defer head skb's loss recovery to standard loss recovery
+ * procedure that will eventually trigger (nothing forbids us doing this).
  *
  * Implements also blockage to start_seq wrap-around. Problem lies in the
  * fact that though start_seq (s) is before end_seq (i.e., not reversed),
--
1.5.0.6
[PATCH 4/4] [TCP]: Wrap-safed reordering detection FRTO check
In case somebody has a suggestion about a better place for this check, which must guarantee execution early enough (i.e., before the wrap can occur), I'm very open to it.

Signed-off-by: Ilpo Järvinen [EMAIL PROTECTED]
---
 net/ipv4/tcp_input.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cec2611..e22ffe7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3024,6 +3024,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, &seq_rtt);
 
+	/* Guarantee sacktag reordering detection against wrap-arounds */
+	if (before(tp->frto_highmark, tp->snd_una))
+		tp->frto_highmark = 0;
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 
-- 
1.5.0.6
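To make the wrap-around concern concrete, here is a minimal user-space sketch of a wrap-safe sequence comparison in the style of the kernel's before() helper. The names seq_before() and clamp_highmark() are hypothetical stand-ins for illustration, not the kernel's code; the sketch assumes the usual two's-complement conversion from uint32_t to int32_t.

```c
#include <stdint.h>

/*
 * Wrap-safe "a is before b" for 32-bit sequence numbers: subtract modulo
 * 2^32 and look at the sign of the result. This is only meaningful while
 * the two values are less than 2^31 apart.
 */
static int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/*
 * Hypothetical stand-in for the check added by the patch: once the
 * cumulative ACK point has passed the stored mark, clear the mark so a
 * stale value cannot look "ahead" again after the sequence space wraps.
 */
static uint32_t clamp_highmark(uint32_t highmark, uint32_t snd_una)
{
	return seq_before(highmark, snd_una) ? 0 : highmark;
}
```

The third property worth noticing is the failure mode: if a stale mark is left alone until snd_una has advanced by more than 2^31, seq_before() no longer reports it as old, which is why the check has to run early enough.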
[PATCH 1/1][TCP]: break missing at end of switch statement
[TCP]: break missing at end of switch statement

Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
---
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
 		return;
 	default:
 		sk->sk_err = ECONNRESET;
+		break;
 	}
 
 	if (!sock_flag(sk, SOCK_DEAD))
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
On Mon, Oct 01, 2007 at 01:32:43PM +0100, Gerrit Renker wrote:
> [TCP]: break missing at end of switch statement
>
> Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
> ---
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
>  		return;
>  	default:
>  		sk->sk_err = ECONNRESET;
> +		break;
>  	}

Huh?  Why on the Earth would that be a problem?
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
Sean,

Not so simple. How does the client application know where to connect? Does this proposal force applications to choose the right network? Currently MPA or the ULP, and not the application, handles it. Why would we want to change that?

I may be beating a dead horse, but I recall that one of the main selling points of RDMA is that it gives a magical boost to performance with no changes to applications. Just plug it in and voilà: performance goes up and CPU utilization for the network stack goes down. Win-win.

Thanks,

Arkady Kanevsky              email: [EMAIL PROTECTED]
Network Appliance Inc.       phone: 781-768-5395
1601 Trapelo Rd. - Suite 16. Fax: 781-895-1195
Waltham, MA 02451            central phone: 781-768-5300

-----Original Message-----
From: Sean Hefty [mailto:[EMAIL PROTECTED]
Sent: Friday, September 28, 2007 5:35 PM
To: Kanevsky, Arkady
Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.

Kanevsky, Arkady wrote:
> Exactly, it forces the burden on administrator.
> And one will be forced to try one mount for iWARP and it does not
> work issue another one TCP or UDP if it fails.
> Yack!
>
> And server will need to listen on different IP address and simple
> * will not work since it will need to listen in two different domains.

The server already has to call listen twice. Once for the rdma_cm and once for sockets. Similarly on the client side, connect must be made over rdma_cm or sockets. I really don't see any impact on the application for this approach.

We just end up separating the port space based on networking addresses, rather than keeping the problem at the transport level.

If you have an alternate approach that will be accepted upstream, feel free to post it.

- Sean
[PATCH] make netlink processing routines semi-synchronious (inspired by rtnl)
The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks like outdated copy/paste from rtnetlink.c. Push them into sync with the original.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
--- ./net/netfilter/nfnetlink.c.nlk3	2007-10-01 09:47:53.0 +0400
+++ ./net/netfilter/nfnetlink.c	2007-10-01 16:09:44.0 +0400
@@ -44,26 +44,14 @@ static struct sock *nfnl = NULL;
 static const struct nfnetlink_subsystem *subsys_table[NFNL_SUBSYS_COUNT];
 static DEFINE_MUTEX(nfnl_mutex);
 
-static void nfnl_lock(void)
+static inline void nfnl_lock(void)
 {
 	mutex_lock(&nfnl_mutex);
 }
 
-static int nfnl_trylock(void)
-{
-	return !mutex_trylock(&nfnl_mutex);
-}
-
-static void __nfnl_unlock(void)
-{
-	mutex_unlock(&nfnl_mutex);
-}
-
-static void nfnl_unlock(void)
+static inline void nfnl_unlock(void)
 {
 	mutex_unlock(&nfnl_mutex);
-	if (nfnl->sk_receive_queue.qlen)
-		nfnl->sk_data_ready(nfnl, 0);
 }
 
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
@@ -149,7 +137,7 @@ static int nfnetlink_rcv_msg(struct sk_b
 #ifdef CONFIG_KMOD
 		/* don't call nfnl_unlock, since it would reenter
 		 * with further packet processing */
-		__nfnl_unlock();
+		nfnl_unlock();
 		request_module("nfnetlink-subsys-%d", NFNL_SUBSYS_ID(type));
 		nfnl_lock();
 		ss = nfnetlink_get_subsys(type);
@@ -188,10 +176,9 @@ static void nfnetlink_rcv(struct sock *s
 	unsigned int qlen = 0;
 
 	do {
-		if (nfnl_trylock())
-			return;
+		nfnl_lock();
 		qlen = netlink_run_queue(sk, qlen, nfnetlink_rcv_msg);
-		__nfnl_unlock();
+		nfnl_unlock();
 	} while (qlen);
 }
 
--- ./net/netlink/genetlink.c.nlk3	2007-08-26 19:30:38.0 +0400
+++ ./net/netlink/genetlink.c	2007-10-01 16:05:29.0 +0400
@@ -22,22 +22,14 @@ struct sock *genl_sock = NULL;
 
 static DEFINE_MUTEX(genl_mutex); /* serialization of message processing */
 
-static void genl_lock(void)
+static inline void genl_lock(void)
 {
 	mutex_lock(&genl_mutex);
 }
 
-static int genl_trylock(void)
-{
-	return !mutex_trylock(&genl_mutex);
-}
-
-static void genl_unlock(void)
+static inline void genl_unlock(void)
 {
 	mutex_unlock(&genl_mutex);
-
-	if (genl_sock && genl_sock->sk_receive_queue.qlen)
-		genl_sock->sk_data_ready(genl_sock, 0);
 }
 
 #define GENL_FAM_TAB_SIZE	16
@@ -483,8 +475,7 @@ static void genl_rcv(struct sock *sk, in
 	unsigned int qlen = 0;
 
 	do {
-		if (genl_trylock())
-			return;
+		genl_lock();
 		qlen = netlink_run_queue(sk, qlen, genl_rcv_msg);
 		genl_unlock();
 	} while (qlen && genl_sock && genl_sock->sk_receive_queue.qlen);
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
Quoting Al Viro:
|  On Mon, Oct 01, 2007 at 01:32:43PM +0100, Gerrit Renker wrote:
|  > [TCP]: break missing at end of switch statement
|  >
|  > Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
|  > ---
|  > --- a/net/ipv4/tcp_input.c
|  > +++ b/net/ipv4/tcp_input.c
|  > @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
|  >  		return;
|  >  	default:
|  >  		sk->sk_err = ECONNRESET;
|  > +		break;
|  >  	}
|
|  Huh?  Why on the Earth would that be a problem?
|
Sorry what is your question?
Re: [PATCH] make netlink processing routines semi-synchronious (inspired by rtnl)
Denis V. Lunev wrote:
> The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
> like outdated copy/paste from rtnetlink.c. Push them into sync with the
> original.
>
>  int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
> @@ -149,7 +137,7 @@ static int nfnetlink_rcv_msg(struct sk_b
>  #ifdef CONFIG_KMOD
>  		/* don't call nfnl_unlock, since it would reenter
>  		 * with further packet processing */
> -		__nfnl_unlock();
> +		nfnl_unlock();

That comment should be updated/deleted. Rest looks good to me.
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
On Mon, Oct 01, 2007 at 02:02:10PM +0100, Gerrit Renker wrote:
> Quoting Al Viro:
> |  On Mon, Oct 01, 2007 at 01:32:43PM +0100, Gerrit Renker wrote:
> |  > [TCP]: break missing at end of switch statement
> |  >
> |  > Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
> |  > ---
> |  > --- a/net/ipv4/tcp_input.c
> |  > +++ b/net/ipv4/tcp_input.c
> |  > @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
> |  >  		return;
> |  >  	default:
> |  >  		sk->sk_err = ECONNRESET;
> |  > +		break;
> |  >  	}
> |
> |  Huh?  Why on the Earth would that be a problem?
> |
> Sorry what is your question?

Why the hell is $Subject a problem that warrants any patches whatsoever?
Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-01-10 at 12:42 +0200, Patrick McHardy wrote:
> jamal wrote:
> > +	while ((skb = __skb_dequeue(skbs)) != NULL)
> > +		q->ops->requeue(skb, q);
>
> ->requeue queues at the head, so this looks like it would reverse
> the order of the skbs.

Excellent catch! thanks; i will fix.

As a side note: Any batching driver should _never_ have to requeue; if it does it is buggy. And the non-batching ones if they ever requeue will be a single packet, so not much reordering.

Thanks again Patrick.

cheers,
jamal
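The ordering problem raised in this exchange can be reproduced in a few lines of user-space C. The sketch below uses a toy array-backed queue (struct toyq and its helpers are hypothetical, not the skb list API) and shows one possible remedy, draining the batch from its tail; the thread does not say which fix was eventually chosen.

```c
#include <stddef.h>

#define TOYQ_CAP 16

/* Toy queue of packet ids; index 0 is the head. Hypothetical stand-in
 * for an skb list, with no bounds checking beyond what the demo needs. */
struct toyq {
	int item[TOYQ_CAP];
	size_t len;
};

static void push_head(struct toyq *q, int v)
{
	size_t i;

	for (i = q->len; i > 0; i--)
		q->item[i] = q->item[i - 1];
	q->item[0] = v;
	q->len++;
}

static int pop_head(struct toyq *q)
{
	int v = q->item[0];
	size_t i;

	q->len--;
	for (i = 0; i < q->len; i++)
		q->item[i] = q->item[i + 1];
	return v;
}

static int pop_tail(struct toyq *q)
{
	return q->item[--q->len];
}

/* Pattern under review: dequeue from the head, requeue at the head.
 * The batch ends up in front of the queue in reversed order. */
static void requeue_from_head(struct toyq *batch, struct toyq *q)
{
	while (batch->len > 0)
		push_head(q, pop_head(batch));
}

/* One possible fix: take packets from the tail of the batch, so that
 * head insertion restores the original order. */
static void requeue_from_tail(struct toyq *batch, struct toyq *q)
{
	while (batch->len > 0)
		push_head(q, pop_tail(batch));
}
```

With a batch of 1, 2, 3 and a queue holding 4, 5, the first variant leaves the queue as 3, 2, 1, 4, 5, while the second leaves it as 1, 2, 3, 4, 5.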
Re: [PATCH 01/10] Preparatory refactoring part 1.
Corey Hickey wrote:
> Make a new function sfq_q_enqueue() that operates directly on the
> queue data. This will be useful for implementing sfq_change() in
> a later patch. A pleasant side-effect is reducing most of the
> duplicate code in sfq_enqueue() and sfq_requeue().
>
> Similarly, make a new function sfq_q_dequeue().
>
> Signed-off-by: Corey Hickey [EMAIL PROTECTED]
> ---
>  net/sched/sch_sfq.c |   72 +++
>  1 files changed, 38 insertions(+), 34 deletions(-)
>
> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> index 3a23e30..57485ef 100644
> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c

The sfq_q_enqueue part looks fine.

> -	sch->qstats.drops++;

A line in the changelog explaining that this was increased twice would have been nice.

>  	sfq_drop(sch);
>  	return NET_XMIT_CN;
>  }
> -
> -
> -
> -static struct sk_buff *
> -sfq_dequeue(struct Qdisc* sch)
> +static struct
> +sk_buff *sfq_q_dequeue(struct sfq_sched_data *q)

What is this function needed for?
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
Resend for the mailing lists (the original was discarded as SPAM because of encoding issues).

Everything looks fine, for sure. Confirmed on second server.

On Mon, 01 Oct 2007 10:20:07 +0200, Eric Dumazet wrote
> > Well, i can play a bit more on live servers. I have now hot-swap server
> > with full gentoo, where i can rebuild any kernel you want, with any
> > applied patch.
> > But it looks more like not overhead, load becoming high too spiky, and
> > it is not just permanently higher. Also it is not normal that all system
> > becoming unresponsive (for example ping 127.0.0.1 becoming 300ms for
> > period, when usage softirq jumps to 100%).
>
> Could you try a pristine 2.6.22.9 and some patch in
> secure_tcp_sequence_number() like :
>
> --- drivers/char/random.c.orig	2007-10-01 10:18:42.0 +0200
> +++ drivers/char/random.c	2007-10-01 10:19:58.0 +0200
> @@ -1554,7 +1554,7 @@
>  	 * That's funny, Linux has one built in!  Use it!
>  	 * (Networks are faster now - should this be increased?)
>  	 */
> -	seq += ktime_get_real().tv64;
> +	seq += ktime_get_real().tv64 / 1000;
>  #if 0
>  	printk("init_seq(%lx, %lx, %d, %d) = %d\n",
>  	       saddr, daddr, sport, dport, seq);
>
> Thank you
>
> > On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote
> > > From: Eric Dumazet [EMAIL PROTECTED]
> > > Date: Mon, 01 Oct 2007 07:59:12 +0200
> > >
> > > > No problem here on bigger servers, so I CC David Miller and netdev
> > > > on this one. AFAIK do_gettimeofday() and ktime_get_real() should
> > > > use the same underlying hardware functions on PC and no performance
> > > > problem should happen here.
> > >
> > > One thing that jumps out at me is that on 32-bit (and to a certain
> > > extent on 64-bit) there is a lot of stack accesses and missed
> > > optimizations because all of the work occurs, and gets expanded,
> > > inside of ktime_get_real().
> > >
> > > The timespec_to_ktime() inside of there constructs the ktime_t return
> > > value on the stack, then returns that as an aggregate to the caller.
> > >
> > > That cannot be without some cost.
> > >
> > > ktime_get_real() is definitely a candidate for inlining especially in
> > > these kinds of cases where we'll happily get computations in local
> > > registers instead of all of this on-stack nonsense. And in several
> > > cases (if the caller only needs the tv_sec value, for example)
> > > computations can be elided entirely.
> > >
> > > It would be constructive to experiment and see if this is in fact part
> > > of the problem.

--
Denys Fedoryshchenko
Technical Manager
Virtual ISP S.A.L.
Re: [PATCH 03/10] Move two functions.
Corey Hickey wrote:
> Move sfq_q_destroy() to above sfq_q_init() so that it can be used
> by an error case in a later patch.
>
> Move sfq_destroy() as well, for clarity.

This patch looks pointless, just put them where you need them in the patch introducing them.
Re: [PATCH 2/3][NET_BATCH] net core use batching
On Mon, 2007-01-10 at 00:11 -0400, Bill Fink wrote:
> Have you done performance comparisons for the case of using 9000-byte
> jumbo frames?

I haven't, but will try if any of the gige cards i have support it.

As a side note: I have not seen any useful gains or losses as the packet size approaches even 1500B MTU. For example, past about 256B neither batching nor non-batching makes much difference in either throughput or cpu use. Below 256B, there's a noticeable gain for batching.

Note, in my tests all 4 CPUs are in full-throttle UDP and so the occupancy of both the qdisc queue(s) and ethernet ring is constantly high. For example at 512B, the app is 80% idle on all 4 CPUs and we are hitting in the range of wire speed. We are at 90% idle at 1024B. This is the case with or without batching. So my suspicion is that with that trend a 9000B packet will just follow the same pattern.

cheers,
jamal
Re: [PATCH 02/10] Preparatory refactoring part 2.
Corey Hickey wrote:
> The sfq_destroy() --> sfq_q_destroy() change looks pointless here,
> but it's cleaner to split now and add code to sfq_q_destroy() in a
> later patch.
>
> +static void sfq_destroy(struct Qdisc *sch)
> +{
> +	struct sfq_sched_data *q = qdisc_priv(sch);
> +	sfq_q_destroy(q);
> +}

It does look pointless, after applying all patches sfq_destroy still remains a simple wrapper around sfq_q_destroy.
Re: [PATCH 05/10] Add divisor.
Corey Hickey wrote:
> Make hash divisor user-configurable.
>
> @@ -120,7 +121,7 @@ static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1
>  	/* Have we any rotation primitives? If not, WHY? */
>  	h ^= (h1<<pert) ^ (h1>>(0x1F - pert));
>  	h ^= h>>10;
> -	return h & 0x3FF;
> +	return h & (q->hash_divisor-1);

This assumes that hash_divisor is a power of two, but this is not enforced anywhere.
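A small user-space sketch of why the mask trick needs a power-of-two divisor, and of the kind of validation the review comment asks for. The helper names here (is_pow2(), fold_mask(), check_divisor()) are hypothetical; in kernel code one would more likely reach for an existing helper such as is_power_of_2(), and where exactly to reject bad values is a design choice the thread leaves open.

```c
#include <stdint.h>

/* True when n has exactly one bit set. */
static int is_pow2(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Mask-based fold as in the quoted hunk: equivalent to h % divisor only
 * when divisor is a power of two. */
static uint32_t fold_mask(uint32_t h, uint32_t divisor)
{
	return h & (divisor - 1);
}

/* Hypothetical validation step: refuse divisors the mask cannot handle. */
static int check_divisor(uint32_t divisor)
{
	return is_pow2(divisor) ? 0 : -1;
}
```

With a divisor of 1024 the mask is 0x3FF and every bucket is reachable. With a divisor of 1000 the mask is 999 (binary 1111100111), so any hash with only bits 3 or 4 set collapses to bucket 0 and many buckets are never used.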
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
Quoting YOSHIFUJI Hideaki:
|  > [TCP]: break missing at end of switch statement
|  >
|  > Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
|  > ---
|  > --- a/net/ipv4/tcp_input.c
|  > +++ b/net/ipv4/tcp_input.c
|  > @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
|  >  		return;
|  >  	default:
|  >  		sk->sk_err = ECONNRESET;
|  > +		break;
|  >  	}
|  >
|  >  	if (!sock_flag(sk, SOCK_DEAD))
|
|  NAK; it is not required at all.
|
|  --yoshfuji
|
If it were true what you are saying then the statement `sk->sk_err = ECONNRESET;' can go as well since it will always be overridden.
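The C semantics at the heart of this disagreement can be checked with a simplified user-space sketch. The enum and function names below are hypothetical and the switch is only loosely modelled on the quoted hunk; the point is that a break after the last statement of the final case changes nothing, and that the assignment in the default branch is still visible after the switch.

```c
#include <errno.h>

enum toy_state { TOY_SYN_SENT, TOY_CLOSE, TOY_OTHER };	/* hypothetical */

/* Final case falls off the end of the switch body. */
static int reset_err_without_break(enum toy_state st)
{
	int err = 0;

	switch (st) {
	case TOY_SYN_SENT:
		err = ECONNREFUSED;
		break;
	case TOY_CLOSE:
		return 0;
	default:
		err = ECONNRESET;
	}
	return err;
}

/* Same function with the extra break from the patch. */
static int reset_err_with_break(enum toy_state st)
{
	int err = 0;

	switch (st) {
	case TOY_SYN_SENT:
		err = ECONNREFUSED;
		break;
	case TOY_CLOSE:
		return 0;
	default:
		err = ECONNRESET;
		break;
	}
	return err;
}
```

Both variants return the same value for every state, so the added break is purely a matter of style.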
Re: [PATCH 06/10] Make qdisc changeable.
Corey Hickey wrote:
> Re-implement sfq_change() and enable Qdisc_opts.change so tc qdisc
> change will work.
>
> +static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
> +{
> +	...
> +
> +	/* finish up */
> +	if (q->perturb_period) {
> +		q->perturb_timer.expires = jiffies + q->perturb_period;
> +		add_timer(&q->perturb_timer);
> +	} else {
> +		q->perturbation = 0;

Seems counter-productive to explicitly set it to zero since it was still used during transferring the packets with the old value. So I'd suggest to remove this or alternatively set it to the final value *before* transferring the packets.
Re: [PATCH 09/10] Change perturb_period to unsigned.
Corey Hickey wrote:
> perturb_period is currently a signed integer, but I can't see any good
> reason why this is so--a negative perturbation period will add a timer
> that expires in the past, causing constant perturbation, which makes
> hashing useless.
>
> 	if (q->perturb_period) {
> 		q->perturb_timer.expires = jiffies + q->perturb_period;
> 		add_timer(&q->perturb_timer);
> 	}
>
> Strictly speaking, this will break binary compatibility with older
> versions of tc, but that ought not to be a problem because (a) there's
> no valid use for a negative perturb_period, and (b) negative values
> will be seen as high values (> INT_MAX), which don't work anyway.
>
> If perturb_period is too large, (perturb_period * HZ) will overflow the
> size of an unsigned int and wrap around. So, check for that and reject
> values that are too high.

Sounds reasonable.

> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c
> @@ -74,6 +74,9 @@ typedef unsigned int sfq_index;
>  #define SFQ_MAX_DEPTH (UINT_MAX / 2 - 1)
>
> +/* We don't want perturb_period * HZ to overflow an unsigned int. */
> +#define SFQ_MAX_PERTURB (UINT_MAX / HZ)

jiffies are unsigned long.
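The overflow being guarded against can be shown with a short user-space sketch. TOY_HZ is an assumed tick rate chosen for illustration, and period_fits() and period_to_ticks() are hypothetical helpers, not code from the patch. The sketch keeps the unsigned int bound from the quoted hunk; the review remark suggests the bound could instead be derived from unsigned long, since that is the type of jiffies.

```c
#include <limits.h>

#define TOY_HZ 1000u	/* assumed tick rate, for illustration only */

/* Unchecked conversion: unsigned arithmetic silently wraps modulo
 * UINT_MAX + 1 when seconds * TOY_HZ does not fit. */
static unsigned int period_to_ticks(unsigned int seconds)
{
	return seconds * TOY_HZ;
}

/* Range check in the spirit of SFQ_MAX_PERTURB: accept a period only if
 * the multiplication above cannot wrap. */
static int period_fits(unsigned int seconds)
{
	return seconds <= UINT_MAX / TOY_HZ;
}
```

The first period that fails the check, UINT_MAX / TOY_HZ + 1, wraps to a tick count smaller than TOY_HZ, which would turn a very long requested period into one shorter than a second.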
Re: [RFC/PATCH 3/3] UDP memory usage accounting (take 2): measurement
Evgeniy Polyakov wrote:
> On Fri, Sep 28, 2007 at 10:41:31PM +0900, Satoshi OSHIMA ([EMAIL PROTECTED]) wrote:
> > This patch introduces memory usage measurement for UDP.
> >
> > These 3 points were updated.
> >
> > - UDP specific codes in IP layer were removed.
> > - atomic_sub() in a loop was removed
> > - accounting during socket destruction
>
> Another approach is to account only at the highest UDP layer and
> having datagram skb destructor just like it is done in TCP, but this
> approach is also reasonable.

This patch set tries to introduce memory accounting by the page because TCP does. And ip_append_data() merges payloads into a sk_buff if the previous sk_buff has enough space. The problem is that udp_append_data() doesn't recognize whether this merge happens or not.

If the accounting must be in the UDP layer, we need to change the interface of ip_append_data() to know whether this merge happens. Once the interface is changed, we have to maintain other protocol stacks to keep up with the change. But I didn't want to do that, in order to keep this patch set small in the first step.

> I already told that patches 1 and 3 have broken indent, please fix that.

Oops! I will fix that.

> A hint: when you are about to submit something network related for
> inclusion, and strongly believes it is ready, it can be a not that bad
> idea to add David Miller [EMAIL PROTECTED] to copy list, he can
> complain about backlog and so on, but will read you mail twice :)
> but do not tell anyone.

Thank you for your advice. I will do that!

Satoshi Oshima
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
In article [EMAIL PROTECTED] (at Mon, 1 Oct 2007 13:32:43 +0100), Gerrit Renker [EMAIL PROTECTED] says:

> [TCP]: break missing at end of switch statement
>
> Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
> ---
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
>  		return;
>  	default:
>  		sk->sk_err = ECONNRESET;
> +		break;
>  	}
>
>  	if (!sock_flag(sk, SOCK_DEAD))

NAK; it is not required at all.

--yoshfuji
Re: [PATCH 10/10] Use nested compat attributes to pass parameters.
Corey Hickey wrote:
> This fixes the ambiguity between, for example:
> 	tc qdisc change ... perturb 0
> 	tc qdisc change ...
>
> Without this patch, there is no way for SFQ to differentiate between
> a parameter specified to be 0 and a parameter that was omitted.
>
> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
> index 170fd37..36197f6 100644
> --- a/net/sched/sch_sfq.c
> +++ b/net/sched/sch_sfq.c
> @@ -428,25 +428,31 @@ sfq_q_init(struct sfq_sched_data *q, struct rtattr *opt)
>  	 * the previous values (sfq_change). So, overwrite the parameters as
>  	 * specified. */
>  	if (opt) {
> -		struct tc_sfq_qopt *ctl = RTA_DATA(opt);
> -
> -		if (opt->rta_len < RTA_LENGTH(sizeof(*ctl)))
> -			return -EINVAL;
> -
> -		if (ctl->quantum)
> -			q->quantum = ctl->quantum;
> -		if (ctl->perturb_period)
> -			q->perturb_period = ctl->perturb_period;
> -		if (ctl->divisor)
> -			q->hash_divisor = ctl->divisor;
> -		if (ctl->flows)
> -			q->depth = ctl->flows;
> -		if (ctl->limit)
> -			q->limit = ctl->limit;
> -
> +		struct tc_sfq_qopt *ctl;
> +		struct rtattr *tb[TCA_SFQ_MAX];
> +
> +		if (rtattr_parse_nested_compat(tb, TCA_SFQ_MAX, opt, ctl,
> +					       sizeof(*ctl)))
> +			goto rtattr_failure;
> +
> +#define GET_PARAM(dst, nest, compat) do { \
> +	struct rtattr *rta = tb[(nest) - 1]; \
> +	if (rta) \
> +		(dst) = RTA_GET_U32(rta); \
> +	else if ((compat)) \
> +		(dst) = (compat); \
> +} while (0)

An inline function and a comment why this is done would increase readability.

> +
> +		GET_PARAM(q->quantum, TCA_SFQ_QUANTUM, ctl->quantum);
> +		GET_PARAM(q->perturb_period, TCA_SFQ_PERTURB,
> +			  ctl->perturb_period);
> +		GET_PARAM(q->hash_divisor, TCA_SFQ_DIVISOR, ctl->divisor);
> +		GET_PARAM(q->depth, TCA_SFQ_FLOWS, ctl->flows);
> +		GET_PARAM(q->limit, TCA_SFQ_LIMIT, ctl->limit);
> +
>  		if (q->perturb_period > SFQ_MAX_PERTURB ||
>  		    q->depth > SFQ_MAX_DEPTH)
> -			return -EINVAL;
> +			goto rtattr_failure;
>  	}
>  	q->limit = min_t(u32, q->limit, q->depth - 2);
>  	q->tail = q->depth;
> @@ -482,6 +488,8 @@ sfq_q_init(struct sfq_sched_data *q, struct rtattr *opt)
>  	for (i=0; i < q->depth; i++)
>  		sfq_link(q, i);
>  	return 0;
> +rtattr_failure:
> +	return -EINVAL;
>  err_case:
>  	sfq_q_destroy(q);
>  	return -ENOBUFS;
> @@ -559,17 +567,26 @@ static int sfq_dump(struct Qdisc *sch, struct sk_buff *skb)
>  {
>  	struct sfq_sched_data *q = qdisc_priv(sch);
>  	unsigned char *b = skb_tail_pointer(skb);
> +	struct rtattr *nest;
>  	struct tc_sfq_qopt opt;
>
>  	opt.quantum = q->quantum;
>  	opt.perturb_period = q->perturb_period;
> -
>  	opt.limit = q->limit;
>  	opt.divisor = q->hash_divisor;
>  	opt.flows = q->depth;
>
> +	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
> +
> +	RTA_PUT_U32(skb, TCA_SFQ_QUANTUM, q->quantum);
> +	RTA_PUT_U32(skb, TCA_SFQ_PERTURB, q->perturb_period);
> +	RTA_PUT_U32(skb, TCA_SFQ_LIMIT, q->limit);
> +	RTA_PUT_U32(skb, TCA_SFQ_DIVISOR, q->hash_divisor);
> +	RTA_PUT_U32(skb, TCA_SFQ_FLOWS, q->depth);
>  	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);

This is wrong, RTA_NEST_COMPAT already dumps the structure.

> +	RTA_NEST_COMPAT_END(skb, nest);
> +
>  	return skb->len;
>
>  rtattr_failure:
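As a rough illustration of the "inline function plus comment" suggestion, here is a user-space sketch of the same selection logic. get_param() is a hypothetical helper that takes an optional pointer in place of a parsed netlink attribute; it is not the rtattr API and not necessarily what a revised patch would look like.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the new value for one parameter.
 *
 * A nested attribute, when present, always wins, even if its value is 0;
 * this is what lets "perturb 0" be told apart from "perturb omitted".
 * Without a nested attribute, fall back to the legacy struct field, where
 * 0 has to be read as "not specified". If neither is given, the previous
 * value in *dst is kept.
 */
static inline void get_param(uint32_t *dst, const uint32_t *nested,
			     uint32_t compat)
{
	if (nested)
		*dst = *nested;
	else if (compat)
		*dst = compat;
}
```

A caller would invoke it once per parameter, passing NULL when the corresponding nested attribute was absent from the request.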
Re: [RFC/PATCH 0/3] UDP memory usage accounting
Herbert Xu wrote:
> On Fri, Sep 28, 2007 at 09:51:59PM -0700, David Miller wrote:
> > There is a per-socket send buffer limit, and there is a per-user
> > open file descriptor limit. Multiply the two to determine how
> > much system memory the user can consume using sockets.
>
> We do have these limits but they're per-process, not per-user.
> Unless you lock down the number of processes each user can have
> to no more than a handful then this is basically useless.
>
> For example, let's say each socket can lock down 64K of kernel
> memory (which is quite easy to do BTW, just open a TCP/UDP socket,
> send data to it from another socket but keep the data in the
> socket by not calling recvmsg), and that each process can have
> 1024 file descriptors (the default), then each process can pin
> 64K x 1024 = 64M of memory.
>
> So if the user can have 10 processes, then that's 640M of kernel
> memory that can be pinned down. Usually the process limit is at
> least 10 times higher.

Thank you very much for your comment.

What you pointed out is my motivation for making this patch. I think that per-process limits won't help to solve this problem.
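The arithmetic in the quoted example can be written down as a tiny helper, which also makes it easy to plug in other limits. pinned_bytes() is a hypothetical function for illustration; the 64K, 1024 and 10 figures are the ones used in the message above.

```c
#include <stdint.h>

/* Worst-case kernel memory one user can pin: per-socket buffer size times
 * descriptors per process times processes per user. */
static uint64_t pinned_bytes(uint64_t per_socket, uint64_t fds_per_process,
			     uint64_t processes)
{
	return per_socket * fds_per_process * processes;
}
```

For example, pinned_bytes(64 * 1024, 1024, 1) is 64 MiB and pinned_bytes(64 * 1024, 1024, 10) is 640 MiB; with a process limit ten times higher the total reaches 6400 MiB.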
Re: [RFC/PATCH 0/3] UDP memory usage accounting
On Fri, Sep 28, 2007 at 09:47:37PM -0700, David Miller wrote:
> > There are two things we (might) need to guard against, one local
> > and one remote.
>
> Right I was focusing on the local threat.
>
> > If you do a per-user limit, apache would basically just stop at that
> > redzone point. In some sense making the attack more effective because
> > then it's trivial to shut down an entire web server this way.
>
> Having a per-user limit doesn't necessarily mean that we have
> to apply the limit differently to how we apply the system-wide
> limits. We could keep exactly the same code as we have now but
> check against a per-user limit instead of a system-wide one.
>
> In other words your apache scenario will continue to work as
> is even with a per-user limit.

I'm afraid that a per-user limit won't work for the system administrator, because he can't know who the rogue user is in advance (before such an attack is made). And once the attack is made, the system will not respond because of the lack of memory for slab.

So if he only has a per-user limit, he needs to split the memory budget for UDP among the users. The limit per user will be very small if the number of users in the system is large.

> Now where it does become useful is when we have a rogue local
> user. As it is that user can chew up all of the budgeted TCP
> memory by simply not calling recvmsg. As I've stated in the
> other email, the existing rlimits don't help because they're
> per-process rather than per-user.
>
> BTW, this is not fatal for TCP because TCP provides a minimum
> amount of memory for each socket even when we are over the limit.
> However, if this was implemented for UDP without a minimum
> guarantee then it'd be quite useless.

Hmm, I didn't realize that. Thank you for your good suggestion. I will think of it.

> > I see no valid argument against doing something similar for sockets.
> >
> > Such a register_shrinker() handler for TCP could, for example, look
> > for TCP flows which haven't made forward progress in more than a
> > certain amount of time and attempt to trim SKB memory from them.
>
> Yes I agree this would be quite useful for sending. However,
> it'll be tough to shrink skbs that we've already acked for but
> the app for some reason has decided to leave in the socket by
> not calling recvmsg.
>
> > UDP and other datagram sockets are troublesome because the memory gets
> > wholly tied up immediately during the send call and it's not easy to
> > liberate anything. The nice part about datagram sockets, however, is
> > that they make forward progress quickly and their memory is liberated
> > as soon as the device transmits the packet. They don't have to wait
> > for ACKs, windows opening up, or anything like that to happen.
>
> Agreed. Also the recvmsg case I've described above is much
> simpler for UDP as we can just go through all the sockets and
> free skbs at random :)
>
> > To be honest I don't even think UDP is much of a real problem for this
> > reason.
>
> It's not a hard problem but we do need to have some code for it.

I believe so. Currently, a nasty user can easily stop the system without root privilege. This may not be a serious problem, but it is a problem that should be fixed.
[PATCH] make netlink processing routines semi-synchronious (inspired by rtnl) v2
The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks like outdated copy/paste from rtnetlink.c. Push them into sync with the original.

Changes from v1:
- deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
--- ./net/netfilter/nfnetlink.c.nlk3	2007-10-01 09:47:53.0 +0400
+++ ./net/netfilter/nfnetlink.c	2007-10-01 17:13:09.0 +0400
@@ -44,26 +44,14 @@ static struct sock *nfnl = NULL;
 static const struct nfnetlink_subsystem *subsys_table[NFNL_SUBSYS_COUNT];
 static DEFINE_MUTEX(nfnl_mutex);
 
-static void nfnl_lock(void)
+static inline void nfnl_lock(void)
 {
 	mutex_lock(&nfnl_mutex);
 }
 
-static int nfnl_trylock(void)
-{
-	return !mutex_trylock(&nfnl_mutex);
-}
-
-static void __nfnl_unlock(void)
-{
-	mutex_unlock(&nfnl_mutex);
-}
-
-static void nfnl_unlock(void)
+static inline void nfnl_unlock(void)
 {
 	mutex_unlock(&nfnl_mutex);
-	if (nfnl->sk_receive_queue.qlen)
-		nfnl->sk_data_ready(nfnl, 0);
 }
 
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n)
@@ -147,9 +135,7 @@ static int nfnetlink_rcv_msg(struct sk_b
 	ss = nfnetlink_get_subsys(type);
 	if (!ss) {
 #ifdef CONFIG_KMOD
-		/* don't call nfnl_unlock, since it would reenter
-		 * with further packet processing */
-		__nfnl_unlock();
+		nfnl_unlock();
 		request_module("nfnetlink-subsys-%d", NFNL_SUBSYS_ID(type));
 		nfnl_lock();
 		ss = nfnetlink_get_subsys(type);
@@ -188,10 +174,9 @@ static void nfnetlink_rcv(struct sock *s
 	unsigned int qlen = 0;
 
 	do {
-		if (nfnl_trylock())
-			return;
+		nfnl_lock();
 		qlen = netlink_run_queue(sk, qlen, nfnetlink_rcv_msg);
-		__nfnl_unlock();
+		nfnl_unlock();
 	} while (qlen);
 }
 
--- ./net/netlink/genetlink.c.nlk3	2007-08-26 19:30:38.0 +0400
+++ ./net/netlink/genetlink.c	2007-10-01 16:05:29.0 +0400
@@ -22,22 +22,14 @@ struct sock *genl_sock = NULL;
 
 static DEFINE_MUTEX(genl_mutex); /* serialization of message processing */
 
-static void genl_lock(void)
+static inline void genl_lock(void)
 {
 	mutex_lock(&genl_mutex);
 }
 
-static int genl_trylock(void)
-{
-	return !mutex_trylock(&genl_mutex);
-}
-
-static void genl_unlock(void)
+static inline void genl_unlock(void)
 {
 	mutex_unlock(&genl_mutex);
-
-	if (genl_sock && genl_sock->sk_receive_queue.qlen)
-		genl_sock->sk_data_ready(genl_sock, 0);
 }
 
 #define GENL_FAM_TAB_SIZE	16
@@ -483,8 +475,7 @@ static void genl_rcv(struct sock *sk, in
 	unsigned int qlen = 0;
 
 	do {
-		if (genl_trylock())
-			return;
+		genl_lock();
 		qlen = netlink_run_queue(sk, qlen, genl_rcv_msg);
 		genl_unlock();
 	} while (qlen && genl_sock && genl_sock->sk_receive_queue.qlen);
[PATCH] memory leak in netlink user-kernel processing
netlink_kernel_create can be called with NULL as an input callback in
several places, e.g. in kobject_uevent_init. This means that if one sends
a packet from user space to the kernel for such a socket, the packet will
be leaked in the socket queue forever.

This patch adds a simple generic cleanup callback for these sockets.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
--- ./net/netlink/af_netlink.c.nlk4	2007-08-26 19:30:38.0 +0400
+++ ./net/netlink/af_netlink.c	2007-10-01 18:00:58.0 +0400
@@ -1301,6 +1301,13 @@ out:
 	return err ? : copied;
 }
 
+static void netlink_rcv_drop(struct sock *sk, int len)
+{
+	struct sk_buff *skb;
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL)
+		kfree_skb(skb);
+}
+
 static void netlink_data_ready(struct sock *sk, int len)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -1346,8 +1353,7 @@ netlink_kernel_create(struct net *net, i
 
 	sk = sock->sk;
 	sk->sk_data_ready = netlink_data_ready;
-	if (input)
-		nlk_sk(sk)->data_ready = input;
+	nlk_sk(sk)->data_ready = input != NULL ? input : netlink_rcv_drop;
 
 	if (netlink_insert(sk, net, 0))
 		goto out_sock_release;
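The fallback idea in this patch can be sketched in plain userspace C. This is only an illustration under stated assumptions: `demo_queue`, `demo_input_fn`, `demo_rcv_drop` and `demo_pick_handler` are hypothetical names, and the counter pair stands in for a real socket receive queue. It shows the same "use the caller's handler, else drain and discard" selection that the ternary in the patch performs.

```c
#include <stddef.h>

/* Hypothetical stand-in for a receive queue: counts only, no real skbs. */
struct demo_queue {
	int pending;	/* packets waiting in the queue */
	int dropped;	/* packets discarded by the fallback */
};

typedef void (*demo_input_fn)(struct demo_queue *q);

/* Fallback handler: empty the queue so nothing is left behind. */
void demo_rcv_drop(struct demo_queue *q)
{
	while (q->pending > 0) {
		q->pending--;
		q->dropped++;
	}
}

/* Pick the caller's handler if there is one, else the drop fallback. */
demo_input_fn demo_pick_handler(demo_input_fn input)
{
	return input != NULL ? input : demo_rcv_drop;
}
```

Calling `demo_pick_handler(NULL)` on a queue with three pending entries leaves it empty with three entries counted as dropped.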
Re: [PATCH] make netlink processing routines semi-synchronous (inspired by rtnl) v2
Denis V. Lunev wrote:
> The code in netfilter/nfnetlink.c and in ./net/netlink/genetlink.c looks
> like outdated copy/paste from rtnetlink.c. Push them into sync with the
> original.
>
> Changes from v1:
> - deleted comment in nfnetlink_rcv_msg by request of Patrick McHardy

Thanks.

> Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

Acked-by: Patrick McHardy [EMAIL PROTECTED]
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
On Mon, Oct 01, 2007 at 02:39:28PM +0100, Gerrit Renker wrote:
Quoting YOSHIFUJI Hideaki:
|
| [TCP]: break missing at end of switch statement
|
| Signed-off-by: Gerrit Renker [EMAIL PROTECTED]
| ---
| --- a/net/ipv4/tcp_input.c
| +++ b/net/ipv4/tcp_input.c
| @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
|  		return;
|  	default:
|  		sk->sk_err = ECONNRESET;
| +		break;
|  	}
|
|  	if (!sock_flag(sk, SOCK_DEAD))
|
| NAK; it is not required at all.
|
| --yoshfuji
|
If it were true what you are saying then the statement
`sk->sk_err = ECONNRESET;' can go as well since it will always be
overridden.

Gerrit,

It is not required. The statement you mention will be executed when the
sk_state is not one of TCP_SYN_SENT, TCP_CLOSE_WAIT or TCP_CLOSE.

A 'break' is only needed in a label block if it is not the last one.

- Arnaldo
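Arnaldo's point can be checked with a small standalone sketch. The enum and function names below are hypothetical stand-ins, not kernel code, and the error values are only chosen to resemble the switch under discussion. The `default` block is the last label, so control leaves the switch after its assignment whether or not a `break` follows.

```c
#include <errno.h>

/* Hypothetical stand-ins for the socket states named in the thread. */
enum demo_state {
	DEMO_SYN_SENT,
	DEMO_CLOSE_WAIT,
	DEMO_CLOSE,
	DEMO_ESTABLISHED
};

/*
 * Returns the error a reset would record for the given state,
 * or -1 when the function bails out early (the "return;" case).
 */
int demo_reset_err(enum demo_state state)
{
	int err;

	switch (state) {
	case DEMO_SYN_SENT:
		err = ECONNREFUSED;
		break;
	case DEMO_CLOSE_WAIT:
		err = EPIPE;
		break;
	case DEMO_CLOSE:
		return -1;
	default:
		err = ECONNRESET;
		/* no break: nothing follows this label block */
	}

	return err;
}
```

Any state not listed explicitly reaches `default`, keeps `ECONNRESET`, and continues with the code after the switch.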
[PATCH] ehea: DLPAR memory add fix
Due to stability issues in high load situations the HW queue handling has
to be changed. The HW queues are now stopped and restarted again instead
of destroying and allocating new HW queues.

Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]
---
 drivers/net/ehea/ehea.h      |    4 +-
 drivers/net/ehea/ehea_main.c |  276 +-
 drivers/net/ehea/ehea_phyp.h |    1 +
 drivers/net/ehea/ehea_qmr.c  |   20 ++--
 drivers/net/ehea/ehea_qmr.h  |    4 +-
 5 files changed, 259 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ehea/ehea.h b/drivers/net/ehea/ehea.h
index c0cbd94..3022089 100644
--- a/drivers/net/ehea/ehea.h
+++ b/drivers/net/ehea/ehea.h
@@ -40,13 +40,13 @@
 #include <asm/io.h>
 
 #define DRV_NAME	"ehea"
-#define DRV_VERSION	"EHEA_0074"
+#define DRV_VERSION	"EHEA_0077"
 
 /* eHEA capability flags */
 #define DLPAR_PORT_ADD_REM 1
 #define DLPAR_MEM_ADD      2
 #define DLPAR_MEM_REM      4
-#define EHEA_CAPABILITIES  (DLPAR_PORT_ADD_REM)
+#define EHEA_CAPABILITIES  (DLPAR_PORT_ADD_REM | DLPAR_MEM_ADD)
 
 #define EHEA_MSG_DEFAULT (NETIF_MSG_LINK | NETIF_MSG_TIMER \
 	| NETIF_MSG_RX_ERR | NETIF_MSG_TX_ERR)
diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 62d6c1e..5bc0a15 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -97,6 +97,7 @@ u64 ehea_driver_flags = 0;
 struct workqueue_struct *ehea_driver_wq;
 struct work_struct ehea_rereg_mr_task;
 
+struct semaphore dlpar_mem_lock;
 
 static int __devinit ehea_probe_adapter(struct ibmebus_dev *dev,
 					const struct of_device_id *id);
@@ -177,16 +178,24 @@ static void ehea_refill_rq1(struct ehea_port_res *pr, int index, int nr_of_wqes)
 	struct sk_buff **skb_arr_rq1 = pr->rq1_skba.arr;
 	struct net_device *dev = pr->port->netdev;
 	int max_index_mask = pr->rq1_skba.len - 1;
+	int fill_wqes = pr->rq1_skba.os_skbs + nr_of_wqes;
+	int adder = 0;
 	int i;
 
-	if (!nr_of_wqes)
+	pr->rq1_skba.os_skbs = 0;
+
+	if (unlikely(test_bit(__EHEA_STOP_XFER, &ehea_driver_flags))) {
+		pr->rq1_skba.index = index;
+		pr->rq1_skba.os_skbs = fill_wqes;
 		return;
+	}
 
-	for (i = 0; i < nr_of_wqes; i++) {
+	for (i = 0; i < fill_wqes; i++) {
 		if (!skb_arr_rq1[index]) {
 			skb_arr_rq1[index] = netdev_alloc_skb(dev,
 							      EHEA_L_PKT_SIZE);
 			if (!skb_arr_rq1[index]) {
+				pr->rq1_skba.os_skbs = fill_wqes - i;
 				ehea_error("%s: no mem for skb/%d wqes filled",
 					   dev->name, i);
 				break;
@@ -194,9 +203,14 @@ static void ehea_refill_rq1(struct ehea_port_res *pr, int index, int nr_of_wqes)
 		}
 		index--;
 		index &= max_index_mask;
+		adder++;
 	}
+
+	if (adder == 0)
+		return;
+
 	/* Ring doorbell */
-	ehea_update_rq1a(pr->qp, i);
+	ehea_update_rq1a(pr->qp, adder);
 }
 
 static int ehea_init_fill_rq1(struct ehea_port_res *pr, int nr_rq1a)
@@ -230,16 +244,21 @@ static int ehea_refill_rq_def(struct ehea_port_res *pr,
 	struct sk_buff **skb_arr = q_skba->arr;
 	struct ehea_rwqe *rwqe;
 	int i, index, max_index_mask, fill_wqes;
+	int adder = 0;
 	int ret = 0;
 
 	fill_wqes = q_skba->os_skbs + num_wqes;
+	q_skba->os_skbs = 0;
 
-	if (!fill_wqes)
+	if (unlikely(test_bit(__EHEA_STOP_XFER, &ehea_driver_flags))) {
+		q_skba->os_skbs = fill_wqes;
 		return ret;
+	}
 
 	index = q_skba->index;
 	max_index_mask = q_skba->len - 1;
 	for (i = 0; i < fill_wqes; i++) {
+		u64 tmp_addr;
 		struct sk_buff *skb = netdev_alloc_skb(dev, packet_size);
 		if (!skb) {
 			ehea_error("%s: no mem for skb/%d wqes filled",
@@ -251,30 +270,37 @@ static int ehea_refill_rq_def(struct ehea_port_res *pr,
 		skb_reserve(skb, NET_IP_ALIGN);
 
 		skb_arr[index] = skb;
+		tmp_addr = ehea_map_vaddr(skb->data);
+		if (tmp_addr == -1) {
+			dev_kfree_skb(skb);
+			q_skba->os_skbs = fill_wqes - i;
+			ret = 0;
+			break;
+		}
 
 		rwqe = ehea_get_next_rwqe(qp, rq_nr);
 		rwqe->wr_id = EHEA_BMASK_SET(EHEA_WR_ID_TYPE, wqe_type)
 			    | EHEA_BMASK_SET(EHEA_WR_ID_INDEX, index);
 		rwqe->sg_list[0].l_key = pr->recv_mr.lkey;
-		rwqe->sg_list[0].vaddr = ehea_map_vaddr(skb->data);
+		rwqe->sg_list[0].vaddr = tmp_addr;
 		rwqe->sg_list[0].len = packet_size;
Re: [PATCH] memory leak in netlink user-kernel processing
Denis V. Lunev wrote:
> netlink_kernel_create can be called with NULL as an input callback in
> several places, f.e. in kobject_uevent_init. This means that if one
> sends packet from user to kernel for such a socket, the packet will be
> leaked in the socket queue forever.
>
> This patch adds a simple generic cleanup callback for these sockets.

This should already be handled by netlink_getsockbypid:

	/* Don't bother queuing skb if kernel socket has no input function */
	nlk = nlk_sk(sock);
	if ((nlk->pid == 0 && !nlk->data_ready) ||
	    (sock->sk_state == NETLINK_CONNECTED &&
	     nlk->dst_pid != nlk_sk(ssk)->pid)) {
		sock_put(sock);
		return ERR_PTR(-ECONNREFUSED);
	}
Re: ehea work queues
Hi,

On Sunday 30 September 2007 18:20, Anton Blanchard wrote:
> Hi,
>
> I booted 2.6.23-rc8 and noticed that ehea loves its workqueues:
> (notice also that the ehea_driver_wq/XXX exceeds TASK_COMM_LEN).
>
> Since they are both infrequent events and not performance critical
> (memory hotplug and driver reset), can we just use schedule_work?

Yes. I'll provide a patch soon.

Thanks,
Jan-Bernd
Re: [PATCH] ehea: DLPAR memory add fix
Jan-Bernd Themann wrote:
> Due to stability issues in high load situations the HW queue handling
> has to be changed. The HW queues are now stopped and restarted again
> instead of destroying and allocating new HW queues.
>
> Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]

May I presume this is for 2.6.23?

	Jeff
Re: [PATCH] ehea: DLPAR memory add fix
Hi,

On Monday 01 October 2007 16:44, Jeff Garzik wrote:
> Jan-Bernd Themann wrote:
> > Due to stability issues in high load situations the HW queue handling
> > has to be changed. The HW queues are now stopped and restarted again
> > instead of destroying and allocating new HW queues.
> >
> > Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]
>
> May I presume this is for 2.6.23?
>
> 	Jeff

No, the patch is built against 2.6.24 upstream (new NAPI interface).

Regards,
Jan-Bernd
Re: [PATCH] memory leak in netlink user-kernel processing
Patrick McHardy wrote:
> Denis V. Lunev wrote:
> > netlink_kernel_create can be called with NULL as an input callback in
> > several places, f.e. in kobject_uevent_init. This means that if one
> > sends packet from user to kernel for such a socket, the packet will be
> > leaked in the socket queue forever.
> >
> > This patch adds a simple generic cleanup callback for these sockets.
>
> This should already be handled by netlink_getsockbypid:
>
> 	/* Don't bother queuing skb if kernel socket has no input function */
> 	nlk = nlk_sk(sock);
> 	if ((nlk->pid == 0 && !nlk->data_ready) ||
> 	    (sock->sk_state == NETLINK_CONNECTED &&
> 	     nlk->dst_pid != nlk_sk(ssk)->pid)) {
> 		sock_put(sock);
> 		return ERR_PTR(-ECONNREFUSED);
> 	}

Looks so... By the way, Patrick, this looks like nlk->pid == 0 if and only
if this is a kernel socket. Right?

I have talked with Alexey Kuznetsov and we have discovered a way to get
rid of
	skb_queue_tail(&sk->sk_receive_queue, skb);
	sk->sk_data_ready(sk, len);
in netlink_sendskb/etc for kernel sockets and make user->kernel packet
processing truly synchronous.

The idea is simple: we should queue/wakeup in the kernel->user direction
and simply call nlk->data_ready for the user->kernel direction. This will
remove all the crap we have now.

But we need a mark to determine the direction. Which one will be better?
	(nlk->data_ready) or (nlk->pid == 0)

Regards,
	Den
Re: [PATCH] ehea: DLPAR memory add fix
Jan-Bernd Themann wrote:
> Hi,
>
> On Monday 01 October 2007 16:44, Jeff Garzik wrote:
> > Jan-Bernd Themann wrote:
> > > Due to stability issues in high load situations the HW queue
> > > handling has to be changed. The HW queues are now stopped and
> > > restarted again instead of destroying and allocating new HW queues.
> > >
> > > Signed-off-by: Jan-Bernd Themann [EMAIL PROTECTED]
> >
> > May I presume this is for 2.6.23?
> >
> > 	Jeff
>
> no, the patch is built against 2.6.24 upstream (new NAPI interface).

OK, thanks. Since we typically have two streams, the current bug-fix
stream and the for-next-kernel stream, please indicate in the future
which kernel/git tree your patch applies to.
Re: [PATCH] memory leak in netlink user-kernel processing
Denis V. Lunev wrote:
> By the way, Patrick, this looks like nlk->pid == 0 if and only if this
> is a kernel socket. Right?

That's correct.

> I have talked with Alexey Kuznetsov and we have discovered a way to get
> rid of
> 	skb_queue_tail(&sk->sk_receive_queue, skb);
> 	sk->sk_data_ready(sk, len);
> in netlink_sendskb/etc for kernel sockets and make user->kernel packet
> processing truly synchronous.
>
> The idea is simple: we should queue/wakeup in the kernel->user direction
> and simply call nlk->data_ready for the user->kernel direction. This
> will remove all the crap we have now.
>
> But we need a mark to determine the direction. Which one will be better?
> 	(nlk->data_ready) or (nlk->pid == 0)

Both would work fine, but I think nlk->pid is better since it's actually
the address.
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
Denys wrote:
> Well, I can play a bit more on live servers. I now have a hot-swap
> server with full Gentoo, where I can rebuild any kernel you want, with
> any patch applied. But it looks less like overhead: the load becomes
> high and too spiky, it is not just permanently higher. Also, it is not
> normal that the whole system becomes unresponsive (for example, ping
> 127.0.0.1 goes to 300 ms for the period when softirq usage jumps to
> 100%).

Could you try a pristine 2.6.22.9 and some patch in
secure_tcp_sequence_number() like:

--- drivers/char/random.c.orig	2007-10-01 10:18:42.0 +0200
+++ drivers/char/random.c	2007-10-01 10:19:58.0 +0200
@@ -1554,7 +1554,7 @@
 	 * That's funny, Linux has one built in! Use it!
 	 * (Networks are faster now - should this be increased?)
 	 */
-	seq += ktime_get_real().tv64;
+	seq += ktime_get_real().tv64 / 1000;
 #if 0
 	printk("init_seq(%lx, %lx, %d, %d) = %d\n",
 	       saddr, daddr, sport, dport, seq);

Thank you

> On Mon, 01 Oct 2007 00:12:59 -0700 (PDT), David Miller wrote
> > From: Eric Dumazet [EMAIL PROTECTED]
> > Date: Mon, 01 Oct 2007 07:59:12 +0200
> >
> > > No problem here on bigger servers, so I CC David Miller and netdev
> > > on this one. AFAIK do_gettimeofday() and ktime_get_real() should use
> > > the same underlying hardware functions on PC and no performance
> > > problem should happen here.
> >
> > One thing that jumps out at me is that on 32-bit (and to a certain
> > extent on 64-bit) there are a lot of stack accesses and missed
> > optimizations because all of the work occurs, and gets expanded,
> > inside of ktime_get_real().
> >
> > The timespec_to_ktime() inside of there constructs the ktime_t return
> > value on the stack, then returns that as an aggregate to the caller.
> >
> > That cannot be without some cost.
> >
> > ktime_get_real() is definitely a candidate for inlining especially in
> > these kinds of cases where we'll happily get computations in local
> > registers instead of all of this on-stack nonsense. And in several
> > cases (if the caller only needs the tv_sec value, for example)
> > computations can be elided entirely.
> >
> > It would be constructive to experiment and see if this is in fact
> > part of the problem.
>
> --
> Denys Fedoryshchenko
> Technical Manager
> Virtual ISP S.A.L.
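To make the effect of the proposed `/ 1000` concrete, here is a minimal userspace sketch, assuming the 64-bit nanosecond value is simply added into a 32-bit sequence number as in the hunk above. The function names are hypothetical. It only shows the arithmetic: at nanosecond resolution the 32-bit contribution wraps every 2^32 ns (about 4.3 seconds), while at microsecond resolution it wraps every 2^32 us (about 71.6 minutes).

```c
#include <stdint.h>

/* Contribution to a 32-bit sequence number at nanosecond resolution. */
uint32_t demo_seq_increment_nsec(int64_t ns)
{
	/* Conversion to uint32_t keeps the low 32 bits (modulo 2^32). */
	return (uint32_t)ns;
}

/* Contribution at microsecond resolution, as in the test patch. */
uint32_t demo_seq_increment_usec(int64_t ns)
{
	return (uint32_t)(ns / 1000);
}
```

For a clock reading of 5 seconds, the nanosecond variant has already wrapped once, while the microsecond variant is still at 5,000,000. The sketch says nothing about the cost of the clock read itself, which is what the thread is trying to isolate.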
Re: [PATCH] memory leak in netlink user-kernel processing
Patrick McHardy [EMAIL PROTECTED] writes:

> Denis V. Lunev wrote:
> > By the way, Patrick, this looks like nlk->pid == 0 if and only if this
> > is a kernel socket. Right?
>
> That's correct.
>
> > I have talked with Alexey Kuznetsov and we have discovered a way to
> > get rid of
> > 	skb_queue_tail(&sk->sk_receive_queue, skb);
> > 	sk->sk_data_ready(sk, len);
> > in netlink_sendskb/etc for kernel sockets and make user->kernel packet
> > processing truly synchronous.
> >
> > The idea is simple: we should queue/wakeup in the kernel->user
> > direction and simply call nlk->data_ready for the user->kernel
> > direction. This will remove all the crap we have now.
> >
> > But we need a mark to determine the direction. Which one will be
> > better?
> > 	(nlk->data_ready) or (nlk->pid == 0)
>
> Both would work fine, but I think nlk->pid is better since it's
> actually the address.

Maybe. nlk->pid is also 0 before the socket is bound, so it does not
serve as a reliable indicator that you have a kernel socket.

My gut feel says the best test is:
	(nlk->flags & NETLINK_KERNEL_SOCKET)
There is no confusion in that and it is dead obvious what we are testing
for.

Although we do still need to properly handle the case when
netlink_kernel_create is called with a NULL input method. As long as we
get the proper -ECONNREFUSED the code path doesn't look like it matters.

Eric
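Eric's objection can be illustrated with a small model. Everything here is hypothetical: `demo_nlk` is not the kernel's `struct netlink_sock`, and `DEMO_KERNEL_SOCKET` is an arbitrary bit chosen for the sketch. It shows why a dedicated flag separates the two cases that `pid == 0` lumps together.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for NETLINK_KERNEL_SOCKET. */
#define DEMO_KERNEL_SOCKET 0x1u

/* Hypothetical, stripped-down socket state. */
struct demo_nlk {
	uint32_t pid;	/* 0 for kernel sockets AND for unbound user sockets */
	uint32_t flags;
};

/* Test by address: cannot tell "kernel" from "not bound yet". */
bool demo_is_kernel_by_pid(const struct demo_nlk *n)
{
	return n->pid == 0;
}

/* Test by flag: true only when the socket was marked as kernel-side. */
bool demo_is_kernel_by_flag(const struct demo_nlk *n)
{
	return (n->flags & DEMO_KERNEL_SOCKET) != 0;
}
```

An unbound user socket has `pid == 0` and no flag set, so the first test misclassifies it while the second does not.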
Re: [PATCH 1/1][TCP]: break missing at end of switch statement
Arnaldo, Al Viro, and Yoshifuji -

sorry for having wasted your time with this one. You are right, that was
complete nonsense. I don't know where my mind was - even my test program
used to `prove' this was screwed up.

So nothing wrong here and thank you very much for your clarifying
comments.

| | --- a/net/ipv4/tcp_input.c
| | +++ b/net/ipv4/tcp_input.c
| | @@ -3129,6 +3129,7 @@ static void tcp_reset(struct sock *sk)
| |  		return;
| |  	default:
| |  		sk->sk_err = ECONNRESET;
| | +		break;
| |  	}
| |
| |  	if (!sock_flag(sk, SOCK_DEAD))
| |
| | NAK; it is not required at all.
| |
| | --yoshfuji
| |
| If it were true what you are saying then the statement
| `sk->sk_err = ECONNRESET;'
| can go as well since it will always be overridden.
|
| Gerrit,
|
| It is not required. The statement you mention will be executed
| when the sk_state is not one of TCP_SYN_SENT, TCP_CLOSE_WAIT or
| TCP_CLOSE.
|
| A 'break' is only needed in a label block if it is not the last
| one.
|
| - Arnaldo
Re: [IPV6] Fix ICMPv6 redirect handling with target multicast address
Hi,

YOSHIFUJI Hideaki / 吉藤英明 wrote:

I think it'd also be better if you add the check to be:

	if (ipv6_addr_type(target) & (IPV6_ADDR_LINKLOCAL|IPV6_ADDR_UNICAST))

or something along those lines, rather than reproducing ipv6_addr_type()
code separately in a new ipv6_addr_linklocal() function.

I'm fine with the idea of the fix itself.

Ok, in both the receive and send code?

Please use ipv6_addr_type() so far and convert other users as well to
ipv6_addr_linklocal() in another patch.

I'll re-do the patch.

-Brian
[PATCH] net-2.6.24: old ax25 driver fix
Recent change in hard header broke build of these old drivers.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
---
 drivers/net/hamradio/dmascc.c |    2 +-
 drivers/net/hamradio/scc.c    |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hamradio/dmascc.c b/drivers/net/hamradio/dmascc.c
index b529e23..bc02e46 100644
--- a/drivers/net/hamradio/dmascc.c
+++ b/drivers/net/hamradio/dmascc.c
@@ -581,7 +581,7 @@ static int __init setup_adapter(int card_base, int type, int n)
 		dev->do_ioctl = scc_ioctl;
 		dev->hard_start_xmit = scc_send_packet;
 		dev->get_stats = scc_get_stats;
-		dev->header_ops = &ax25_hard_header_ops
+		dev->header_ops = &ax25_header_ops;
 		dev->set_mac_address = scc_set_mac_address;
 	}
 	if (register_netdev(info->dev[0])) {
diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
index 56cc523..353d13e 100644
--- a/drivers/net/hamradio/scc.c
+++ b/drivers/net/hamradio/scc.c
@@ -1551,7 +1551,7 @@ static void scc_net_setup(struct net_device *dev)
 	dev->stop            = scc_net_close;
 
 	dev->hard_start_xmit = scc_net_tx;
-	dev->header_ops      = &ax25_hard_header_ops;
+	dev->header_ops      = &ax25_header_ops;
 	dev->set_mac_address = scc_net_set_mac_address;
 	dev->get_stats       = scc_net_get_stats;
 
-- 
1.5.2.5
[PATCH 0/2] qla3xxx: receive path bugfixes.
Jeff,

This is the second submission... First was in August.

Thanks,
Ron

The following two patches fix:

- An undocumented feature where the 4032 chip sets bit-7 of the opcode
  for an inbound completion if it's for a VLAN.
- The access of stale data on a completion entry.

These patches were built and tested on 2.6.23-rc1.

Signed-off-by: Ron Mercer [EMAIL PROTECTED]
[PATCH 1/2] qla3xxx: bugfix: Add memory barrier before accessing rx completion.
Signed-off-by: Ron Mercer [EMAIL PROTECTED]
---
 drivers/net/qla3xxx.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index 69da95b..c3fe1c7 100755
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -2248,6 +2248,7 @@ static int ql_tx_rx_clean(struct ql3_adapter *qdev,
 		qdev->rsp_consumer_index) && (work_done < work_to_do)) {
 
 		net_rsp = qdev->rsp_current;
+		rmb();
 		switch (net_rsp->opcode) {
 
 		case OPCODE_OB_MAC_IOCB_FN0:
-- 
1.5.0.rc4.16.g9e258
[PATCH 2/2] qla3xxx: bugfix: Fix VLAN rx completion handling.
Fix 4032 chip undocumented feature where bit-8 is set if the inbound
completion is for a VLAN.

Signed-off-by: Ron Mercer [EMAIL PROTECTED]
---
 drivers/net/qla3xxx.c |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/drivers/net/qla3xxx.c b/drivers/net/qla3xxx.c
index c3fe1c7..ea15131 100755
--- a/drivers/net/qla3xxx.c
+++ b/drivers/net/qla3xxx.c
@@ -2249,6 +2249,12 @@ static int ql_tx_rx_clean(struct ql3_adapter *qdev,
 
 		net_rsp = qdev->rsp_current;
 		rmb();
+		/*
+		 * Fix 4032 chip undocumented feature where bit-8 is set if the
+		 * inbound completion is for a VLAN.
+		 */
+		if (qdev->device_id == QL3032_DEVICE_ID)
+			net_rsp->opcode &= 0x7f;
 		switch (net_rsp->opcode) {
 
 		case OPCODE_OB_MAC_IOCB_FN0:
-- 
1.5.0.rc4.16.g9e258
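The masking step in this patch can be tried in isolation. The sketch below is a userspace illustration with hypothetical names and made-up opcode values; it does not use the driver's real opcode constants. Masking with 0x7f clears the most significant bit of an 8-bit opcode, which is bit 7 when counting from zero and the eighth bit when counting from one, so both descriptions in the series refer to the same bit.

```c
#include <stdint.h>

/* Hypothetical name for the top bit of an 8-bit opcode. */
#define DEMO_VLAN_FLAG 0x80u

/* Clear the top bit so the remaining 7 bits can be matched in a switch. */
uint8_t demo_strip_vlan_flag(uint8_t opcode)
{
	return (uint8_t)(opcode & 0x7f);
}
```

An opcode that arrives as 0x81 is matched as 0x01 after masking, and an opcode without the flag is left unchanged.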
[PATCH 1/9] fs_enet: Whitespace cleanup.
Signed-off-by: Scott Wood [EMAIL PROTECTED] --- This patch series applies to the net-2.6.24 branch. drivers/net/fs_enet/fs_enet-main.c | 85 --- drivers/net/fs_enet/fs_enet.h |4 +- drivers/net/fs_enet/mac-fcc.c |1 - drivers/net/fs_enet/mii-bitbang.c |3 - drivers/net/fs_enet/mii-fec.c |1 - 5 files changed, 41 insertions(+), 53 deletions(-) diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c index ebdcf3f..2a1b150 100644 --- a/drivers/net/fs_enet/fs_enet-main.c +++ b/drivers/net/fs_enet/fs_enet-main.c @@ -343,7 +343,6 @@ static void fs_enet_tx(struct net_device *dev) do_wake = do_restart = 0; while (((sc = CBDR_SC(bdp)) BD_ENET_TX_READY) == 0) { - dirtyidx = bdp - fep-tx_bd_base; if (fep-tx_free == fep-tx_ring) @@ -444,7 +443,6 @@ fs_enet_interrupt(int irq, void *dev_id) nr = 0; while ((int_events = (*fep-ops-get_int_events)(dev)) != 0) { - nr++; int_clr_events = int_events; @@ -700,45 +698,43 @@ static void fs_timeout(struct net_device *dev) *-*/ static void generic_adjust_link(struct net_device *dev) { - struct fs_enet_private *fep = netdev_priv(dev); - struct phy_device *phydev = fep-phydev; - int new_state = 0; - - if (phydev-link) { - - /* adjust to duplex mode */ - if (phydev-duplex != fep-oldduplex){ - new_state = 1; - fep-oldduplex = phydev-duplex; - } - - if (phydev-speed != fep-oldspeed) { - new_state = 1; - fep-oldspeed = phydev-speed; - } - - if (!fep-oldlink) { - new_state = 1; - fep-oldlink = 1; - netif_schedule(dev); - netif_carrier_on(dev); - netif_start_queue(dev); - } - - if (new_state) - fep-ops-restart(dev); - - } else if (fep-oldlink) { - new_state = 1; - fep-oldlink = 0; - fep-oldspeed = 0; - fep-oldduplex = -1; - netif_carrier_off(dev); - netif_stop_queue(dev); - } - - if (new_state netif_msg_link(fep)) - phy_print_status(phydev); + struct fs_enet_private *fep = netdev_priv(dev); + struct phy_device *phydev = fep-phydev; + int new_state = 0; + + if (phydev-link) { + /* adjust to duplex mode */ + if (phydev-duplex 
!= fep-oldduplex) { + new_state = 1; + fep-oldduplex = phydev-duplex; + } + + if (phydev-speed != fep-oldspeed) { + new_state = 1; + fep-oldspeed = phydev-speed; + } + + if (!fep-oldlink) { + new_state = 1; + fep-oldlink = 1; + netif_schedule(dev); + netif_carrier_on(dev); + netif_start_queue(dev); + } + + if (new_state) + fep-ops-restart(dev); + } else if (fep-oldlink) { + new_state = 1; + fep-oldlink = 0; + fep-oldspeed = 0; + fep-oldduplex = -1; + netif_carrier_off(dev); + netif_stop_queue(dev); + } + + if (new_state netif_msg_link(fep)) + phy_print_status(phydev); } @@ -782,7 +778,6 @@ static int fs_init_phy(struct net_device *dev) return 0; } - static int fs_enet_open(struct net_device *dev) { struct fs_enet_private *fep = netdev_priv(dev); @@ -971,7 +966,7 @@ static struct net_device *fs_init_instance(struct device *dev, #endif #ifdef CONFIG_FS_ENET_HAS_SCC - if (fs_get_scc_index(fpi-fs_no) =0 ) + if (fs_get_scc_index(fpi-fs_no) =0) fep-ops = fs_scc_ops; #endif @@ -1066,9 +1061,8 @@ static struct net_device *fs_init_instance(struct device *dev, return ndev; - err: +err: if (ndev != NULL) { - if (registered) unregister_netdev(ndev); @@ -1259,7 +1253,6 @@ static int __init fs_init(void) err: cleanup_immap(); return r; - } static void __exit fs_cleanup(void) diff --git a/drivers/net/fs_enet/fs_enet.h b/drivers/net/fs_enet/fs_enet.h index 46d0606..fbe2087 100644 --- a/drivers/net/fs_enet/fs_enet.h +++ b/drivers/net/fs_enet/fs_enet.h @@ -15,8 +15,8 @@ #include asm/commproc.h struct fec_info { -fec_t* fecp; - u32 mii_speed; + fec_t *fecp; + u32 mii_speed; }; #endif diff
[PATCH 5/9] fs_enet: Align receive buffers.
At least some hardware driven by this driver needs receive buffers to be
aligned on a 16-byte boundary. This usually happens by chance, but it
breaks if slab debugging is enabled.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |   21 +++--
 drivers/net/fs_enet/fs_enet.h      |    3 ++-
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index a15345b..7a02986 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -70,6 +70,14 @@ static void fs_set_multicast_list(struct net_device *dev)
 	(*fep->ops->set_multicast_list)(dev);
 }
 
+static void skb_align(struct sk_buff *skb, int align)
+{
+	int off = ((unsigned long)skb->data) & (align - 1);
+
+	if (off)
+		skb_reserve(skb, align - off);
+}
+
 /* NAPI receive function */
 static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
 {
@@ -159,9 +167,13 @@ static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
 					skb = skbn;
 					skbn = skbt;
 				}
-			} else
+			} else {
 				skbn = dev_alloc_skb(ENET_RX_FRSIZE);
 
+				if (skbn)
+					skb_align(skbn, ENET_RX_ALIGN);
+			}
+
 			if (skbn != NULL) {
 				skb_put(skb, pkt_len);	/* Make room */
 				skb->protocol = eth_type_trans(skb, dev);
@@ -290,9 +302,13 @@ static int fs_enet_rx_non_napi(struct net_device *dev)
 					skb = skbn;
 					skbn = skbt;
 				}
-			} else
+			} else {
 				skbn = dev_alloc_skb(ENET_RX_FRSIZE);
 
+				if (skbn)
+					skb_align(skbn, ENET_RX_ALIGN);
+			}
+
 			if (skbn != NULL) {
 				skb_put(skb, pkt_len);	/* Make room */
 				skb->protocol = eth_type_trans(skb, dev);
@@ -502,6 +518,7 @@ void fs_init_bds(struct net_device *dev)
 			       dev->name);
 			break;
 		}
+		skb_align(skb, ENET_RX_ALIGN);
 		fep->rx_skbuff[i] = skb;
 		CBDW_BUFADDR(bdp,
 			dma_map_single(fep->dev, skb->data,
diff --git a/drivers/net/fs_enet/fs_enet.h b/drivers/net/fs_enet/fs_enet.h
index fbe2087..85571e4 100644
--- a/drivers/net/fs_enet/fs_enet.h
+++ b/drivers/net/fs_enet/fs_enet.h
@@ -82,7 +82,8 @@ struct phy_info {
 /* Must be a multiple of 32 (to cover both FEC & FCC) */
 #define PKT_MAXBLR_SIZE		((PKT_MAXBUF_SIZE + 31) & ~31)
 /* This is needed so that invalidate_xxx wont invalidate too much */
-#define ENET_RX_FRSIZE		L1_CACHE_ALIGN(PKT_MAXBUF_SIZE)
+#define ENET_RX_ALIGN  16
+#define ENET_RX_FRSIZE L1_CACHE_ALIGN(PKT_MAXBUF_SIZE + ENET_RX_ALIGN - 1)
 
 struct fs_enet_mii_bus {
 	struct list_head list;
-- 
1.5.3.2
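The arithmetic behind skb_align() can be checked outside the kernel. This is a minimal sketch with a hypothetical helper name, assuming the alignment is a power of two as it is for ENET_RX_ALIGN. It computes how many bytes have to be skipped so that an address lands on the next boundary, which is the amount the patch passes to skb_reserve().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes to skip so that addr reaches the next align-byte boundary.
 * align must be a power of two; the mask trick relies on that.
 */
size_t demo_align_pad(uintptr_t addr, size_t align)
{
	size_t off;

	assert(align != 0 && (align & (align - 1)) == 0);

	off = (size_t)(addr & (align - 1));	/* distance past the last boundary */
	return off ? align - off : 0;
}
```

The padding is never more than `align - 1` bytes, which matches the `ENET_RX_ALIGN - 1` added to the buffer size in the fs_enet.h hunk so that the usable length is preserved after alignment.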
[PATCH 4/9] fs_enet: mac-fcc: Eliminate __fcc-* macros.
These macros accomplish nothing other than defeating type checking.

This patch also fixes one instance of the wrong register size being used
that was revealed by enabling type checking.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/mac-fcc.c |   25 -
 1 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/drivers/net/fs_enet/mac-fcc.c b/drivers/net/fs_enet/mac-fcc.c
index 1e1024a..e990f72 100644
--- a/drivers/net/fs_enet/mac-fcc.c
+++ b/drivers/net/fs_enet/mac-fcc.c
@@ -48,28 +48,19 @@
 
 /* FCC access macros */
 
-#define __fcc_out32(addr, x)	out_be32((unsigned *)addr, x)
-#define __fcc_out16(addr, x)	out_be16((unsigned short *)addr, x)
-#define __fcc_out8(addr, x)	out_8((unsigned char *)addr, x)
-#define __fcc_in32(addr)	in_be32((unsigned *)addr)
-#define __fcc_in16(addr)	in_be16((unsigned short *)addr)
-#define __fcc_in8(addr)		in_8((unsigned char *)addr)
-
-/* parameter space */
-
 /* write, read, set bits, clear bits */
-#define W32(_p, _m, _v)	__fcc_out32(&(_p)->_m, (_v))
-#define R32(_p, _m)	__fcc_in32(&(_p)->_m)
+#define W32(_p, _m, _v)	out_be32(&(_p)->_m, (_v))
+#define R32(_p, _m)	in_be32(&(_p)->_m)
 #define S32(_p, _m, _v)	W32(_p, _m, R32(_p, _m) | (_v))
 #define C32(_p, _m, _v)	W32(_p, _m, R32(_p, _m) & ~(_v))
 
-#define W16(_p, _m, _v)	__fcc_out16(&(_p)->_m, (_v))
-#define R16(_p, _m)	__fcc_in16(&(_p)->_m)
+#define W16(_p, _m, _v)	out_be16(&(_p)->_m, (_v))
+#define R16(_p, _m)	in_be16(&(_p)->_m)
 #define S16(_p, _m, _v)	W16(_p, _m, R16(_p, _m) | (_v))
 #define C16(_p, _m, _v)	W16(_p, _m, R16(_p, _m) & ~(_v))
 
-#define W8(_p, _m, _v)	__fcc_out8(&(_p)->_m, (_v))
-#define R8(_p, _m)	__fcc_in8(&(_p)->_m)
+#define W8(_p, _m, _v)	out_8(&(_p)->_m, (_v))
+#define R8(_p, _m)	in_8(&(_p)->_m)
 #define S8(_p, _m, _v)	W8(_p, _m, R8(_p, _m) | (_v))
 #define C8(_p, _m, _v)	W8(_p, _m, R8(_p, _m) & ~(_v))
 
@@ -290,7 +281,7 @@ static void restart(struct net_device *dev)
 
 	/* clear everything (slow & steady does it) */
 	for (i = 0; i < sizeof(*ep); i++)
-		__fcc_out8((char *)ep + i, 0);
+		out_8((char *)ep + i, 0);
 
 	/* get physical address */
 	rx_bd_base_phys = fep->ring_mem_addr;
@@ -495,7 +486,7 @@ static void tx_kickstart(struct net_device *dev)
 	struct fs_enet_private *fep = netdev_priv(dev);
 	fcc_t *fccp = fep->fcc.fccp;
 
-	S32(fccp, fcc_ftodr, 0x80);
+	S16(fccp, fcc_ftodr, 0x8000);
 }
 
 static u32 get_int_events(struct net_device *dev)
-- 
1.5.3.2
[PATCH 3/9] fs_enet: Include linux/string.h from linux/fs_enet_pd.h
It is needed for strstr().

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 include/linux/fs_enet_pd.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs_enet_pd.h b/include/linux/fs_enet_pd.h
index 543cd3c..815c6f9 100644
--- a/include/linux/fs_enet_pd.h
+++ b/include/linux/fs_enet_pd.h
@@ -16,6 +16,7 @@
 #ifndef FS_ENET_PD_H
 #define FS_ENET_PD_H
 
+#include <linux/string.h>
 #include <asm/types.h>
 
 #define FS_ENET_NAME	"fs_enet"
-- 
1.5.3.2
[PATCH 8/9] fs_enet: Convert mii-bitbang to use the generic bitbang MDIO code.
Signed-off-by: Scott Wood [EMAIL PROTECTED] --- drivers/net/fs_enet/mii-bitbang.c | 270 - 1 files changed, 54 insertions(+), 216 deletions(-) diff --git a/drivers/net/fs_enet/mii-bitbang.c b/drivers/net/fs_enet/mii-bitbang.c index 7cf132f..b8e4a73 100644 --- a/drivers/net/fs_enet/mii-bitbang.c +++ b/drivers/net/fs_enet/mii-bitbang.c @@ -15,15 +15,13 @@ #include linux/module.h #include linux/ioport.h #include linux/slab.h -#include linux/interrupt.h #include linux/init.h -#include linux/delay.h +#include linux/interrupt.h #include linux/netdevice.h #include linux/etherdevice.h #include linux/mii.h -#include linux/ethtool.h -#include linux/bitops.h #include linux/platform_device.h +#include linux/mdio-bitbang.h #ifdef CONFIG_PPC_CPM_NEW_BINDING #include linux/of_platform.h @@ -32,11 +30,11 @@ #include fs_enet.h struct bb_info { + struct mdiobb_ctrl ctrl; __be32 __iomem *dir; __be32 __iomem *dat; u32 mdio_msk; u32 mdc_msk; - int delay; }; /* FIXME: If any other users of GPIO crop up, then these will have to @@ -59,212 +57,58 @@ static inline int bb_read(u32 __iomem *p, u32 m) return (in_be32(p) m) != 0; } -static inline void mdio_active(struct bb_info *bitbang) +static inline void mdio_dir(struct mdiobb_ctrl *ctrl, int dir) { - bb_set(bitbang-dir, bitbang-mdio_msk); -} + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); -static inline void mdio_tristate(struct bb_info *bitbang) -{ - bb_clr(bitbang-dir, bitbang-mdio_msk); + if (dir) + bb_set(bitbang-dir, bitbang-mdio_msk); + else + bb_clr(bitbang-dir, bitbang-mdio_msk); + + /* Read back to flush the write. 
*/ + in_be32(bitbang-dir); } -static inline int mdio_read(struct bb_info *bitbang) +static inline int mdio_read(struct mdiobb_ctrl *ctrl) { + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); return bb_read(bitbang-dat, bitbang-mdio_msk); } -static inline void mdio(struct bb_info *bitbang, int what) +static inline void mdio(struct mdiobb_ctrl *ctrl, int what) { + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); + if (what) bb_set(bitbang-dat, bitbang-mdio_msk); else bb_clr(bitbang-dat, bitbang-mdio_msk); + + /* Read back to flush the write. */ + in_be32(bitbang-dat); } -static inline void mdc(struct bb_info *bitbang, int what) +static inline void mdc(struct mdiobb_ctrl *ctrl, int what) { + struct bb_info *bitbang = container_of(ctrl, struct bb_info, ctrl); + if (what) bb_set(bitbang-dat, bitbang-mdc_msk); else bb_clr(bitbang-dat, bitbang-mdc_msk); -} - -static inline void mii_delay(struct bb_info *bitbang) -{ - udelay(bitbang-delay); -} - -/* Utility to send the preamble, address, and register (common to read and write). */ -static void bitbang_pre(struct bb_info *bitbang , int read, u8 addr, u8 reg) -{ - int j; - - /* -* Send a 32 bit preamble ('1's) with an extra '1' bit for good measure. -* The IEEE spec says this is a PHY optional requirement. The AMD -* 79C874 requires one after power up and one after a MII communications -* error. This means that we are doing more preambles than we need, -* but it is safer and will be much more robust. 
-*/ - - mdio_active(bitbang); - mdio(bitbang, 1); - for (j = 0; j 32; j++) { - mdc(bitbang, 0); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - } - - /* send the start bit (01) and the read opcode (10) or write (10) */ - mdc(bitbang, 0); - mdio(bitbang, 0); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - mdc(bitbang, 0); - mdio(bitbang, 1); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - mdc(bitbang, 0); - mdio(bitbang, read); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - mdc(bitbang, 0); - mdio(bitbang, !read); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - - /* send the PHY address */ - for (j = 0; j 5; j++) { - mdc(bitbang, 0); - mdio(bitbang, (addr 0x10) != 0); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - addr = 1; - } - /* send the register address */ - for (j = 0; j 5; j++) { - mdc(bitbang, 0); - mdio(bitbang, (reg 0x10) != 0); - mii_delay(bitbang); - mdc(bitbang, 1); - mii_delay(bitbang); - reg = 1; - } + /* Read back to flush the write. */
[PATCH 9/9] fs_enet: sparse fixes
Mostly a bunch of __iomem annotations. Signed-off-by: Scott Wood [EMAIL PROTECTED] --- drivers/net/fs_enet/fs_enet-main.c | 18 +- drivers/net/fs_enet/fs_enet.h | 30 drivers/net/fs_enet/mac-fcc.c | 71 drivers/net/fs_enet/mac-fec.c | 34 +- drivers/net/fs_enet/mac-scc.c | 37 ++- drivers/net/fs_enet/mii-fec.c |8 ++-- 6 files changed, 103 insertions(+), 95 deletions(-) diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c index a2dee7d..d1eb6dd 100644 --- a/drivers/net/fs_enet/fs_enet-main.c +++ b/drivers/net/fs_enet/fs_enet-main.c @@ -60,7 +60,7 @@ MODULE_DESCRIPTION(Freescale Ethernet Driver); MODULE_LICENSE(GPL); MODULE_VERSION(DRV_MODULE_VERSION); -int fs_enet_debug = -1;/* -1 == use FS_ENET_DEF_MSG_ENABLE as value */ +static int fs_enet_debug = -1; /* -1 == use FS_ENET_DEF_MSG_ENABLE as value */ module_param(fs_enet_debug, int, 0); MODULE_PARM_DESC(fs_enet_debug, Freescale bitmapped debugging message enable value); @@ -90,7 +90,7 @@ static int fs_enet_rx_napi(struct napi_struct *napi, int budget) struct fs_enet_private *fep = container_of(napi, struct fs_enet_private, napi); struct net_device *dev = to_net_dev(fep-dev); const struct fs_platform_info *fpi = fep-fpi; - cbd_t *bdp; + cbd_t __iomem *bdp; struct sk_buff *skb, *skbn, *skbt; int received = 0; u16 pkt_len, sc; @@ -230,7 +230,7 @@ static int fs_enet_rx_non_napi(struct net_device *dev) { struct fs_enet_private *fep = netdev_priv(dev); const struct fs_platform_info *fpi = fep-fpi; - cbd_t *bdp; + cbd_t __iomem *bdp; struct sk_buff *skb, *skbn, *skbt; int received = 0; u16 pkt_len, sc; @@ -355,7 +355,7 @@ static int fs_enet_rx_non_napi(struct net_device *dev) static void fs_enet_tx(struct net_device *dev) { struct fs_enet_private *fep = netdev_priv(dev); - cbd_t *bdp; + cbd_t __iomem *bdp; struct sk_buff *skb; int dirtyidx, do_wake, do_restart; u16 sc; @@ -503,7 +503,7 @@ fs_enet_interrupt(int irq, void *dev_id) void fs_init_bds(struct net_device *dev) { struct fs_enet_private 
*fep = netdev_priv(dev); - cbd_t *bdp; + cbd_t __iomem *bdp; struct sk_buff *skb; int i; @@ -557,7 +557,7 @@ void fs_cleanup_bds(struct net_device *dev) { struct fs_enet_private *fep = netdev_priv(dev); struct sk_buff *skb; - cbd_t *bdp; + cbd_t __iomem *bdp; int i; /* @@ -598,7 +598,7 @@ void fs_cleanup_bds(struct net_device *dev) static int fs_enet_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct fs_enet_private *fep = netdev_priv(dev); - cbd_t *bdp; + cbd_t __iomem *bdp; int curidx; u16 sc; unsigned long flags; @@ -1121,7 +1121,7 @@ static int fs_cleanup_instance(struct net_device *ndev) unregister_netdev(ndev); dma_free_coherent(fep-dev, (fpi-tx_ring + fpi-rx_ring) * sizeof(cbd_t), - fep-ring_base, fep-ring_mem_addr); + (void __force *)fep-ring_base, fep-ring_mem_addr); /* reset it */ (*fep-ops-cleanup_data)(ndev); @@ -1141,7 +1141,7 @@ static int fs_cleanup_instance(struct net_device *ndev) /**/ /* handy pointer to the immap */ -void *fs_enet_immap = NULL; +void __iomem *fs_enet_immap = NULL; static int setup_immap(void) { diff --git a/drivers/net/fs_enet/fs_enet.h b/drivers/net/fs_enet/fs_enet.h index 5a5c9d1..baf6477 100644 --- a/drivers/net/fs_enet/fs_enet.h +++ b/drivers/net/fs_enet/fs_enet.h @@ -15,7 +15,7 @@ #include asm/commproc.h struct fec_info { - fec_t *fecp; + fec_t __iomem *fecp; u32 mii_speed; }; #endif @@ -81,14 +81,14 @@ struct fs_enet_private { const struct fs_ops *ops; int rx_ring, tx_ring; dma_addr_t ring_mem_addr; - void *ring_base; + void __iomem *ring_base; struct sk_buff **rx_skbuff; struct sk_buff **tx_skbuff; - cbd_t *rx_bd_base; /* Address of Rx and Tx buffers.*/ - cbd_t *tx_bd_base; - cbd_t *dirty_tx;/* ring entries to be free()ed. */ - cbd_t *cur_rx; - cbd_t *cur_tx; + cbd_t __iomem *rx_bd_base; /* Address of Rx and Tx buffers.*/ + cbd_t __iomem *tx_bd_base; + cbd_t __iomem *dirty_tx;/* ring entries to be free()ed. 
*/ + cbd_t __iomem *cur_rx; + cbd_t __iomem *cur_tx; int tx_free; struct net_device_stats stats; struct timer_list phy_timer_list; @@ -113,23 +113,23 @@ struct fs_enet_private { union { struct { int
[PATCH 6/9] fs_enet: Be an of_platform device when CONFIG_PPC_CPM_NEW_BINDING is set.
The existing OF glue code was crufty and broken. Rather than fix it, it will be removed, and the ethernet driver now talks to the device tree directly. The old, non-CONFIG_PPC_CPM_NEW_BINDING code can go away once CPM platforms are dropped from arch/ppc (which will hopefully be soon), and existing arch/powerpc boards that I wasn't able to test on for this patchset get converted (which should be even sooner). Signed-off-by: Scott Wood [EMAIL PROTECTED] --- drivers/net/fs_enet/Kconfig|1 + drivers/net/fs_enet/fs_enet-main.c | 258 --- drivers/net/fs_enet/fs_enet.h | 55 +--- drivers/net/fs_enet/mac-fcc.c | 89 + drivers/net/fs_enet/mac-fec.c | 19 +++- drivers/net/fs_enet/mac-scc.c | 53 +-- drivers/net/fs_enet/mii-bitbang.c | 269 +++- drivers/net/fs_enet/mii-fec.c | 143 +++- include/linux/fs_enet_pd.h |5 + 9 files changed, 714 insertions(+), 178 deletions(-) diff --git a/drivers/net/fs_enet/Kconfig b/drivers/net/fs_enet/Kconfig index e27ee21..2765e49 100644 --- a/drivers/net/fs_enet/Kconfig +++ b/drivers/net/fs_enet/Kconfig @@ -11,6 +11,7 @@ config FS_ENET_HAS_SCC config FS_ENET_HAS_FCC bool Chip has an FCC usable for ethernet depends on FS_ENET CPM2 + select MDIO_BITBANG default y config FS_ENET_HAS_FEC diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c index 7a02986..a2dee7d 100644 --- a/drivers/net/fs_enet/fs_enet-main.c +++ b/drivers/net/fs_enet/fs_enet-main.c @@ -42,12 +42,18 @@ #include asm/irq.h #include asm/uaccess.h +#ifdef CONFIG_PPC_CPM_NEW_BINDING +#include asm/of_platform.h +#endif + #include fs_enet.h /*/ +#ifndef CONFIG_PPC_CPM_NEW_BINDING static char version[] __devinitdata = DRV_MODULE_NAME .c:v DRV_MODULE_VERSION ( DRV_MODULE_RELDATE ) \n; +#endif MODULE_AUTHOR(Pantelis Antoniou [EMAIL PROTECTED]); MODULE_DESCRIPTION(Freescale Ethernet Driver); @@ -948,6 +954,7 @@ static int fs_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) extern int fs_mii_connect(struct net_device *dev); extern void fs_mii_disconnect(struct 
net_device *dev); +#ifndef CONFIG_PPC_CPM_NEW_BINDING static struct net_device *fs_init_instance(struct device *dev, struct fs_platform_info *fpi) { @@ -1129,6 +1136,7 @@ static int fs_cleanup_instance(struct net_device *ndev) return 0; } +#endif /**/ @@ -1137,35 +1145,250 @@ void *fs_enet_immap = NULL; static int setup_immap(void) { - phys_addr_t paddr = 0; - unsigned long size = 0; - #ifdef CONFIG_CPM1 - paddr = IMAP_ADDR; - size = 0x1; /* map 64K */ -#endif - -#ifdef CONFIG_CPM2 - paddr = CPM_MAP_ADDR; - size = 0x4; /* map 256 K */ + fs_enet_immap = ioremap(IMAP_ADDR, 0x4000); + WARN_ON(!fs_enet_immap); +#elif defined(CONFIG_CPM2) + fs_enet_immap = cpm2_immr; #endif - fs_enet_immap = ioremap(paddr, size); - if (fs_enet_immap == NULL) - return -EBADF; /* XXX ahem; maybe just BUG_ON? */ return 0; } static void cleanup_immap(void) { - if (fs_enet_immap != NULL) { - iounmap(fs_enet_immap); - fs_enet_immap = NULL; - } +#if defined(CONFIG_CPM1) + iounmap(fs_enet_immap); +#endif } /**/ +#ifdef CONFIG_PPC_CPM_NEW_BINDING +static int __devinit find_phy(struct device_node *np, + struct fs_platform_info *fpi) +{ + struct device_node *phynode, *mdionode; + struct resource res; + int ret = 0, len; + + const u32 *data = of_get_property(np, phy-handle, len); + if (!data || len != 4) + return -EINVAL; + + phynode = of_find_node_by_phandle(*data); + if (!phynode) + return -EINVAL; + + mdionode = of_get_parent(phynode); + if (!phynode) + goto out_put_phy; + + ret = of_address_to_resource(mdionode, 0, res); + if (ret) + goto out_put_mdio; + + data = of_get_property(phynode, reg, len); + if (!data || len != 4) + goto out_put_mdio; + + snprintf(fpi-bus_id, 16, PHY_ID_FMT, res.start, *data); + +out_put_mdio: + of_node_put(mdionode); +out_put_phy: + of_node_put(phynode); + return ret; +} + +#ifdef CONFIG_FS_ENET_HAS_FEC +#define IS_FEC(match) ((match)-data == fs_fec_ops) +#else +#define IS_FEC(match) 0 +#endif + +static int __devinit fs_enet_probe(struct of_device *ofdev, + const 
struct of_device_id *match) +{ + struct net_device *ndev; + struct fs_enet_private *fep; + struct
[PATCH 2/9] fs_enet: Fix build breakage.
Commit 4fa57c9ea9f36f9ca852f3a88ca5d2f1aebbc960 (Make NAPI polling independent of struct net_device objects.) introduced some build breakage in the napi rx function.

Signed-off-by: Scott Wood [EMAIL PROTECTED]
---
 drivers/net/fs_enet/fs_enet-main.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index 2a1b150..a15345b 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -73,8 +73,8 @@ static void fs_set_multicast_list(struct net_device *dev)
 /* NAPI receive function */
 static int fs_enet_rx_napi(struct napi_struct *napi, int budget)
 {
-	struct fs_enet_private *fep = container_of(napi, struct fec_enet_private, napi);
-	struct net_device *dev = fep->dev;
+	struct fs_enet_private *fep = container_of(napi, struct fs_enet_private, napi);
+	struct net_device *dev = to_net_dev(fep->dev);
 	const struct fs_platform_info *fpi = fep->fpi;
 	cbd_t *bdp;
 	struct sk_buff *skb, *skbn, *skbt;
-- 
1.5.3.2
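For readers unfamiliar with the idiom behind this fix: container_of() recovers a pointer to the enclosing structure from a pointer to one of its members, so the struct type named in the call must be the one that actually embeds the member. The following is a minimal userspace sketch of that idea, not the kernel macro itself. The names sketch_container_of, sketch_napi, sketch_priv and priv_from_napi are hypothetical stand-ins invented for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of(); the real macro
 * also type-checks the member pointer. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct napi_struct and struct fs_enet_private. */
struct sketch_napi {
	int weight;
};

struct sketch_priv {
	int id;
	struct sketch_napi napi;	/* embedded member, as in the driver */
};

/* Walk back from the embedded member to the structure that contains it. */
static struct sketch_priv *priv_from_napi(struct sketch_napi *napi)
{
	return sketch_container_of(napi, struct sketch_priv, napi);
}
```

Given a struct sketch_priv p, priv_from_napi(&p.napi) yields &p. Naming a struct type that does not exist, as the old line did with fec_enet_private, cannot compile, which matches the build breakage described above.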
[PATCH 7/9] Generic bitbanged MDIO library
Previously, bitbanged MDIO was only supported in individual hardware-specific drivers. This code factors out the higher level protocol implementation, reducing the hardware-specific portion to functions setting direction, data, and clock. Signed-off-by: Scott Wood [EMAIL PROTECTED] --- drivers/net/phy/Kconfig|9 ++ drivers/net/phy/Makefile |1 + drivers/net/phy/mdio-bitbang.c | 187 include/linux/mdio-bitbang.h | 42 + 4 files changed, 239 insertions(+), 0 deletions(-) create mode 100644 drivers/net/phy/mdio-bitbang.c create mode 100644 include/linux/mdio-bitbang.h diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig index dd09011..72a98dd 100644 --- a/drivers/net/phy/Kconfig +++ b/drivers/net/phy/Kconfig @@ -76,4 +76,13 @@ config FIXED_MII_100_FDX bool Emulation for 100M Fdx fixed PHY behavior depends on FIXED_PHY +config MDIO_BITBANG + tristate Support for bitbanged MDIO buses + help + This module implements the MDIO bus protocol in software, + for use by low level drivers that export the ability to + drive the relevant pins. + + If in doubt, say N. + endif # PHYLIB diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile index 8885650..3d6cc7b 100644 --- a/drivers/net/phy/Makefile +++ b/drivers/net/phy/Makefile @@ -13,3 +13,4 @@ obj-$(CONFIG_VITESSE_PHY) += vitesse.o obj-$(CONFIG_BROADCOM_PHY) += broadcom.o obj-$(CONFIG_ICPLUS_PHY) += icplus.o obj-$(CONFIG_FIXED_PHY)+= fixed.o +obj-$(CONFIG_MDIO_BITBANG) += mdio-bitbang.o diff --git a/drivers/net/phy/mdio-bitbang.c b/drivers/net/phy/mdio-bitbang.c new file mode 100644 index 000..8cd243d --- /dev/null +++ b/drivers/net/phy/mdio-bitbang.c @@ -0,0 +1,187 @@ +/* + * Bitbanged MDIO support. + * + * Author: Scott Wood [EMAIL PROTECTED] + * Copyright (c) 2007 Freescale Semiconductor + * + * Based on CPM2 MDIO code which is: + * + * Copyright (c) 2003 Intracom S.A. + * by Pantelis Antoniou [EMAIL PROTECTED] + * + * 2005 (c) MontaVista Software, Inc. 
+ * Vitaly Bordug [EMAIL PROTECTED] + * + * This file is licensed under the terms of the GNU General Public License + * version 2. This program is licensed as is without any warranty of any + * kind, whether express or implied. + */ + +#include linux/module.h +#include linux/mdio-bitbang.h +#include linux/slab.h +#include linux/types.h +#include linux/delay.h + +#define MDIO_READ 1 +#define MDIO_WRITE 0 + +#define MDIO_SETUP_TIME 10 +#define MDIO_HOLD_TIME 10 + +/* Minimum MDC period is 400 ns, plus some margin for error. MDIO_DELAY + * is done twice per period. + */ +#define MDIO_DELAY 250 + +/* The PHY may take up to 300 ns to produce data, plus some margin + * for error. + */ +#define MDIO_READ_DELAY 350 + +/* MDIO must already be configured as output. */ +static void mdiobb_send_bit(struct mdiobb_ctrl *ctrl, int val) +{ + const struct mdiobb_ops *ops = ctrl-ops; + + ops-set_mdio_data(ctrl, val); + ndelay(MDIO_DELAY); + ops-set_mdc(ctrl, 1); + ndelay(MDIO_DELAY); + ops-set_mdc(ctrl, 0); +} + +/* MDIO must already be configured as input. */ +static int mdiobb_get_bit(struct mdiobb_ctrl *ctrl) +{ + const struct mdiobb_ops *ops = ctrl-ops; + + ndelay(MDIO_DELAY); + ops-set_mdc(ctrl, 1); + ndelay(MDIO_READ_DELAY); + ops-set_mdc(ctrl, 0); + + return ops-get_mdio_data(ctrl); +} + +/* MDIO must already be configured as output. */ +static void mdiobb_send_num(struct mdiobb_ctrl *ctrl, u16 val, int bits) +{ + int i; + + for (i = bits - 1; i = 0; i--) + mdiobb_send_bit(ctrl, (val i) 1); +} + +/* MDIO must already be configured as input. */ +static u16 mdiobb_get_num(struct mdiobb_ctrl *ctrl, int bits) +{ + int i; + u16 ret = 0; + + for (i = bits - 1; i = 0; i--) { + ret = 1; + ret |= mdiobb_get_bit(ctrl); + } + + return ret; +} + +/* Utility to send the preamble, address, and + * register (common to read and write). 
+ */ +static void mdiobb_cmd(struct mdiobb_ctrl *ctrl, int read, u8 phy, u8 reg) +{ + const struct mdiobb_ops *ops = ctrl-ops; + int i; + + ops-set_mdio_dir(ctrl, 1); + + /* +* Send a 32 bit preamble ('1's) with an extra '1' bit for good +* measure. The IEEE spec says this is a PHY optional +* requirement. The AMD 79C874 requires one after power up and +* one after a MII communications error. This means that we are +* doing more preambles than we need, but it is safer and will be +* much more robust. +*/ + + for (i = 0; i 32; i++) + mdiobb_send_bit(ctrl, 1); + + /* send the start bit (01) and the read opcode (10) or write (10) */ + mdiobb_send_bit(ctrl, 0); + mdiobb_send_bit(ctrl, 1); + mdiobb_send_bit(ctrl, read); +
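As a reading aid for the command phase that mdiobb_cmd() clocks out after the preamble, here is a small sketch that packs the same bits into an integer: start bits 0 then 1, the opcode (read followed by !read, as in the older fs_enet code removed elsewhere in this series), then the 5-bit PHY address and the 5-bit register address, most significant bit first. The helper names mdio_cmd_bits and bit_msb_first are hypothetical and exist only for this illustration; no hardware access or timing is modelled.

```c
#include <stdint.h>

/* Pack the 14 command bits that follow the 32-bit preamble.
 * Layout, MSB first: start (0, 1), opcode (read, !read),
 * 5-bit PHY address, 5-bit register address. */
static uint16_t mdio_cmd_bits(int read, uint8_t phy, uint8_t reg)
{
	uint16_t bits = 0x1;				/* start: 0 then 1 */

	bits = (uint16_t)((bits << 1) | (read ? 1 : 0));	/* opcode, first bit */
	bits = (uint16_t)((bits << 1) | (read ? 0 : 1));	/* opcode, second bit */
	bits = (uint16_t)((bits << 5) | (phy & 0x1f));	/* PHY address */
	bits = (uint16_t)((bits << 5) | (reg & 0x1f));	/* register address */
	return bits;
}

/* Bit number 'index' of an nbits-wide value in transmit order (MSB first),
 * mirroring the loop direction used by mdiobb_send_num(). */
static int bit_msb_first(uint16_t val, int nbits, int index)
{
	return (val >> (nbits - 1 - index)) & 1;
}
```

With this packing a read of PHY 0, register 0 is 0x1800 and a write to PHY 1, register 2 is 0x1422, so the opcode sent for a write is 01 rather than the 10 stated in the quoted comment.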
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
So maybe the following patch is necessary... I believe IPV6 & DCCP are immune to this problem.

Thanks again Denys for spotting this.

Eric

[PATCH] TCP : secure_tcp_sequence_number() should not use a too fast clock

TCP V4 sequence numbers are 32bits, and RFC 793 assumed a 250 KHz clock. In order to follow network speed increase, we can use a faster clock, but we should limit this clock so that the delay between two rollovers is greater than MSL (TCP Maximum Segment Lifetime : 2 minutes)

Choosing a 64 nsec clock should be OK, since the rollovers occur every 274 seconds.

Problem spotted by Denys Fedoryshchenko

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

--- linux-2.6.22/drivers/char/random.c	2007-10-01 10:18:42.0 +0200
+++ linux-2.6.22-ed/drivers/char/random.c	2007-10-01 21:47:58.0 +0200
@@ -1550,11 +1550,13 @@ __u32 secure_tcp_sequence_number(__be32
 	 *	As close as possible to RFC 793, which
 	 *	suggests using a 250 kHz clock.
 	 *	Further reading shows this assumes 2 Mb/s networks.
-	 *	For 10 Gb/s Ethernet, a 1 GHz clock is appropriate.
-	 *	That's funny, Linux has one built in!  Use it!
-	 *	(Networks are faster now - should this be increased?)
+	 *	For 10 Mb/s Ethernet, a 1 MHz clock is appropriate.
+	 *	For 10 Gb/s Ethernet, a 1 GHz clock should be ok, but
+	 *	we also need to limit the resolution so that the u32 seq
+	 *	overlaps less than one time per MSL (2 minutes).
+	 *	Choosing a clock of 64 ns period is OK. (period of 274 s)
 	 */
-	seq += ktime_get_real().tv64;
+	seq += ktime_get_real().tv64 >> 6;
 #if 0
 	printk("init_seq(%lx, %lx, %d, %d) = %d\n",
 	       saddr, daddr, sport, dport, seq);
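The 274 second figure can be checked with a few lines of arithmetic. This is a standalone sketch, not kernel code: a 32-bit sequence counter that advances once every 2^shift nanoseconds wraps after 2^(32+shift) nanoseconds. The helper names rollover_ns and outlives_msl, and the SKETCH_MSL_NS constant, are made up for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* TCP Maximum Segment Lifetime, taken as 2 minutes as in the mail above. */
#define SKETCH_MSL_NS (120ULL * 1000000000ULL)

/* Nanoseconds until a 32-bit counter wraps when it ticks once every
 * (1 << shift) nanoseconds. Valid for shift < 32. */
static uint64_t rollover_ns(unsigned int shift)
{
	return ((uint64_t)1 << shift) << 32;
}

/* True when the counter takes longer than one MSL to wrap. */
static bool outlives_msl(unsigned int shift)
{
	return rollover_ns(shift) > SKETCH_MSL_NS;
}
```

With shift 0 (a raw nanosecond clock) the counter wraps after 2^32 ns, about 4.3 seconds, far below the MSL. With shift 6 (the 64 ns tick chosen in the patch) it wraps after 2^38 ns = 274877906944 ns, about 274.9 seconds.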
sk98lin, jumbo frames, and memory fragmentation
Hi all,

We're considering some hardware that uses the sk98lin network hardware, and we'll be using jumbo frames. Looking at the driver, when using a 9KB MTU it seems like it would end up trying to atomically allocate a 16KB buffer.

Has anyone heard of this being a problem? It would seem like trying to atomically allocate four physically contiguous pages could become tricky after the system has been running for a while.

The reason I ask is that we ran into this with the e1000. Before they added the new jumbo frame code it was trying to atomically allocate 32KB buffers and we would start getting allocation failures after a month or so of uptime.

Any information anyone can provide would be appreciated.

Thanks,

Chris
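To make the "four physically contiguous pages" arithmetic concrete, here is a rough sketch of how a buffer size maps to a power-of-two page allocation. It assumes 4 KB pages; SKETCH_PAGE_SIZE, alloc_order and alloc_pages_needed are hypothetical names for this illustration and are not the kernel's own helpers.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KB pages */

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Number of physically contiguous pages an allocation of that order needs. */
static unsigned int alloc_pages_needed(size_t size)
{
	return 1u << alloc_order(size);
}
```

Under that assumption a 9000-byte frame no longer fits in two pages, so it rounds up to a 16 KB buffer of order 2 (four contiguous pages), and a 32 KB buffer is order 3 (eight contiguous pages). The higher the order, the harder it is to satisfy atomically once memory has fragmented.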
Re: sk98lin, jumbo frames, and memory fragmentation
Yes it has this problem. I've observed it in practice on a busy firewall.

-John

Chris Friesen wrote:
> Hi all,
>
> We're considering some hardware that uses the sk98lin network hardware, and we'll be using jumbo frames. Looking at the driver, when using a 9KB MTU it seems like it would end up trying to atomically allocate a 16KB buffer.
>
> Has anyone heard of this being a problem? It would seem like trying to atomically allocate four physically contiguous pages could become tricky after the system has been running for a while.
>
> The reason I ask is that we ran into this with the e1000. Before they added the new jumbo frame code it was trying to atomically allocate 32KB buffers and we would start getting allocation failures after a month or so of uptime.
>
> Any information anyone can provide would be appreciated.
>
> Thanks,
>
> Chris
Re: [PATCH] net-2.6.24: old ax25 driver fix
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Mon, 1 Oct 2007 11:24:17 -0700

> Recent change in hard header broke build of these old drivers.
>
> Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Applied, thanks Stephen.
Re: [PATCH 01/10] Preparatory refactoring part 1.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> Make a new function sfq_q_enqueue() that operates directly on the queue data. This will be useful for implementing sfq_change() in a later patch. A pleasant side-effect is reducing most of the duplicate code in sfq_enqueue() and sfq_requeue().
>>
>> Similarly, make a new function sfq_q_dequeue().
>>
>> Signed-off-by: Corey Hickey [EMAIL PROTECTED]
>> ---
>>  net/sched/sch_sfq.c |   72 +++
>>  1 files changed, 38 insertions(+), 34 deletions(-)
>>
>> diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
>> index 3a23e30..57485ef 100644
>> --- a/net/sched/sch_sfq.c
>> +++ b/net/sched/sch_sfq.c
>
> The sfq_q_enqueue part looks fine.
>
>> -	sch->qstats.drops++;
>
> A line in the changelog explaining that this was increased twice would have been nice.

Certainly; I think I didn't realize, when you originally pointed out the duplicate incrementing, that it was a bug in the original version and not in my patch. Otherwise, I would have sent it as a separate patch. If a note in this patch will suffice, though, I'll definitely do so.

>>  	sfq_drop(sch);
>>  	return NET_XMIT_CN;
>>  }
>>
>> -
>> -
>> -
>> -static struct sk_buff *
>> -sfq_dequeue(struct Qdisc* sch)
>> +static struct
>> +sk_buff *sfq_q_dequeue(struct sfq_sched_data *q)
>
> What is this function needed for?

It gets used in sfq_change for moving packets from the old queue into the new one. In this case, we don't want to modify sch->q.qlen or sch->qstats.backlog, since those don't actually change.

	while ((skb = sfq_q_dequeue(q)) != NULL)
		sfq_q_enqueue(skb, &tmp, SFQ_TAIL);

I'll improve the description of this patch to make that more clear.

-Corey
Re: [PATCH 03/10] Move two functions.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> Move sfq_q_destroy() to above sfq_q_init() so that it can be used by an error case in a later patch. Move sfq_destroy() as well, for clarity.
>
> This patch looks pointless, just put them where you need them in the patch introducing them.

As you wish. I thought having a separate patch would ease reviewing.

-Corey
Re: [PATCH 05/10] Add divisor.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> Make hash divisor user-configurable.
>>
>> @@ -120,7 +121,7 @@ static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1
>>  	/* Have we any rotation primitives? If not, WHY? */
>>  	h ^= (h1<<pert) ^ (h1>>(0x1F - pert));
>>  	h ^= h>>10;
>> -	return h & 0x3FF;
>> +	return h & (q->hash_divisor-1);
>
> This assumes that hash_divisor is a power of two, but this is not enforced anywhere.

Ok. I'll move that part from userspace to the kernel. That should be better anyway.

-Corey
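The review point can be illustrated outside the kernel. Masking with (divisor - 1) only selects a bucket in [0, divisor) uniformly when the divisor is a power of two. The sketch below shows one way such a check could look; is_power_of_two and fold_mask are hypothetical helper names for this illustration and are not taken from the patch series.

```c
#include <stdbool.h>
#include <stdint.h>

/* A non-zero value is a power of two exactly when it has a single bit set. */
static bool is_power_of_two(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Fold a hash into a bucket index by masking. Only meaningful when
 * divisor is a power of two, which the caller is expected to verify. */
static uint32_t fold_mask(uint32_t h, uint32_t divisor)
{
	return h & (divisor - 1);
}
```

With a divisor of 1024 the mask is 0x3FF, which matches the constant the patch replaces. With a divisor of 1000 the mask is 999 (binary 1111100111), so bits 3 and 4 are always cleared and, for example, hash value 8 lands in bucket 0 instead of bucket 8; a quarter of the intended buckets are never used.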
Re: [PATCH 06/10] Make qdisc changeable.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> Re-implement sfq_change() and enable Qdisc_opts.change so tc qdisc change will work.
>>
>> +static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
>> +{
>> +	...
>> +
>> +	/* finish up */
>> +	if (q->perturb_period) {
>> +		q->perturb_timer.expires = jiffies + q->perturb_period;
>> +		add_timer(&q->perturb_timer);
>> +	} else {
>> +		q->perturbation = 0;
>
> Seems counter-productive to explicitly set it to zero since it was still used during transferring the packets with the old value. So I'd suggest to remove this or alternatively set it to the final value *before* transferring the packets.

I suppose so; you're right. I'll adapt that part to fit before transferring the packets.

-Corey
Re: [PATCH 09/10] Change perturb_period to unsigned.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> perturb_period is currently a signed integer, but I can't see any good reason why this is so--a negative perturbation period will add a timer that expires in the past, causing constant perturbation, which makes hashing useless.
>>
>> 	if (q->perturb_period) {
>> 		q->perturb_timer.expires = jiffies + q->perturb_period;
>> 		add_timer(&q->perturb_timer);
>> 	}
>>
>> Strictly speaking, this will break binary compatibility with older versions of tc, but that ought not to be a problem because (a) there's no valid use for a negative perturb_period, and (b) negative values will be seen as high values (> INT_MAX), which don't work anyway.
>>
>> If perturb_period is too large, (perturb_period * HZ) will overflow the size of an unsigned int and wrap around. So, check for that and reject values that are too high.
>
> Sounds reasonable.
>
>> --- a/net/sched/sch_sfq.c
>> +++ b/net/sched/sch_sfq.c
>> @@ -74,6 +74,9 @@ typedef unsigned int sfq_index;
>>  #define SFQ_MAX_DEPTH (UINT_MAX / 2 - 1)
>>
>> +/* We don't want perturb_period * HZ to overflow an unsigned int. */
>> +#define SFQ_MAX_PERTURB (UINT_MAX / HZ)
>
> jiffies are unsigned long.

Hmm. You're right. It looks like my previous patch obviated the need for this part. I'll remove it.

-Corey
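For reference, the wraparound being discussed looks like this when the multiplication is done in unsigned int. This is a sketch under stated assumptions: SKETCH_HZ stands in for the kernel's HZ and is set to 1000 here, and perturb_fits and perturb_ticks are hypothetical helpers, not code from the series.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_HZ 1000u	/* assumption: 1000 ticks per second */

/* True when seconds * SKETCH_HZ still fits in an unsigned int,
 * i.e. the bound the proposed SFQ_MAX_PERTURB would express. */
static bool perturb_fits(unsigned int seconds)
{
	return seconds <= UINT_MAX / SKETCH_HZ;
}

/* The product computed in unsigned int; wraps modulo UINT_MAX + 1
 * (well-defined for unsigned types) once the bound is exceeded. */
static unsigned int perturb_ticks(unsigned int seconds)
{
	return seconds * SKETCH_HZ;
}
```

One second past the bound, the product wraps to a value smaller than SKETCH_HZ, so the timer would fire almost immediately rather than far in the future. As the review notes, jiffies are unsigned long, so where unsigned long is wider than unsigned int and the product is computed in that type, this particular bound is not the relevant one.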
Re: [PATCH 10/10] Use nested compat attributes to pass parameters.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> +
>> +#define GET_PARAM(dst, nest, compat) do { \
>> +	struct rtattr *rta = tb[(nest) - 1]; \
>> +	if (rta) \
>> +		(dst) = RTA_GET_U32(rta); \
>> +	else if ((compat)) \
>> +		(dst) = (compat); \
>> +} while (0)
>
> An inline function and a comment why this is done would increase readability.

Well, I had a reason for making a macro, but it probably wasn't a good reason. Looking now, I don't see why not to make a function. I'll see what I can do.

>> +	nest = RTA_NEST_COMPAT(skb, TCA_OPTIONS, sizeof(opt), &opt);
>> +
>> +	RTA_PUT_U32(skb, TCA_SFQ_QUANTUM, q->quantum);
>> +	RTA_PUT_U32(skb, TCA_SFQ_PERTURB, q->perturb_period);
>> +	RTA_PUT_U32(skb, TCA_SFQ_LIMIT, q->limit);
>> +	RTA_PUT_U32(skb, TCA_SFQ_DIVISOR, q->hash_divisor);
>> +	RTA_PUT_U32(skb, TCA_SFQ_FLOWS, q->depth);
>>  	RTA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
>
> This is wrong, RTA_NEST_COMPAT already dumps the structure.

You mean that last line (RTA_PUT) is superfluous, right? I can't see a reason for it to be there, so I must have just forgotten to delete it from the original code. If I'm wrong, I might need a little hand-holding here. My understanding of all the RTA stuff is a bit shaky.

Much thanks for the review. I'll make a new set of patches soon.

-Corey
Re: sk98lin, jumbo frames, and memory fragmentation
Chris Friesen wrote:
> We're considering some hardware that uses the sk98lin network hardware, and we'll be using jumbo frames. Looking at the driver, when using a 9KB MTU it seems like it would end up trying to atomically allocate a 16KB buffer.

The sk98lin driver is going away, please don't use it. It's unmaintained and full of known bugs.

Jeff
Re: [PATCH 02/10] Preparatory refactoring part 2.
Patrick McHardy wrote:
> Corey Hickey wrote:
>> The sfq_destroy() --> sfq_q_destroy() change looks pointless here, but it's cleaner to split now and add code to sfq_q_destroy() in a later patch.
>>
>> +static void sfq_destroy(struct Qdisc *sch)
>> +{
>> +	struct sfq_sched_data *q = qdisc_priv(sch);
>> +	sfq_q_destroy(q);
>> +}
>
> It does look pointless, after applying all patches sfq_destroy still remains a simple wrapper around sfq_q_destroy.

It does remain a wrapper, but both functions are used. It doesn't have to be this way, but I wanted to avoid duplicating code and I didn't see a better layout.

sfq_q_destroy is used in sfq_q_init if a kcalloc fails. sfq_q_init knows nothing about struct Qdisc *sch, so it can't call sfq_destroy.

sfq_destroy is still marked as the destroy function in sfq_qdisc_ops.

-Corey
Re: 2.6.21 - 2.6.22 2.6.23-rc8 performance regression
From: Eric Dumazet [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 22:10:03 +0200

> So maybe the following patch is necessary... I believe IPV6 & DCCP are immune to this problem.
>
> Thanks again Denys for spotting this.
>
> Eric
>
> [PATCH] TCP : secure_tcp_sequence_number() should not use a too fast clock
>
> TCP V4 sequence numbers are 32bits, and RFC 793 assumed a 250 KHz clock. In order to follow network speed increase, we can use a faster clock, but we should limit this clock so that the delay between two rollovers is greater than MSL (TCP Maximum Segment Lifetime : 2 minutes)
>
> Choosing a 64 nsec clock should be OK, since the rollovers occur every 274 seconds.
>
> Problem spotted by Denys Fedoryshchenko
>
> Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

Thanks a lot Eric for bringing closure to this.

I'll apply this and add a reference in the commit message to the changeset that introduced this problem, since it might help others who look at this.
Re: sk98lin, jumbo frames, and memory fragmentation
On Mon, 01 Oct 2007 14:35:48 -0600 Chris Friesen [EMAIL PROTECTED] wrote:

> Hi all,
>
> We're considering some hardware that uses the sk98lin network hardware, and we'll be using jumbo frames. Looking at the driver, when using a 9KB MTU it seems like it would end up trying to atomically allocate a 16KB buffer.
>
> Has anyone heard of this being a problem? It would seem like trying to atomically allocate four physically contiguous pages could become tricky after the system has been running for a while.
>
> The reason I ask is that we ran into this with the e1000. Before they added the new jumbo frame code it was trying to atomically allocate 32KB buffers and we would start getting allocation failures after a month or so of uptime.
>
> Any information anyone can provide would be appreciated.

Adding fragmentation support to the skge driver is on my list of possible extensions. The sky2 driver already supports it (yet one more feature that the vendor sk98lin driver doesn't do).

-- 
Stephen Hemminger [EMAIL PROTECTED]
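
To make the "9KB MTU becomes a 16KB buffer" arithmetic concrete, here is a small sketch of power-of-two page rounding. It assumes 4 KB pages and an allocator that hands out 2^order contiguous pages. The helper name alloc_order is made up for the example, and the exact per-driver overhead added to the MTU is not modelled.

```python
PAGE_SIZE = 4096  # assumed page size; varies by architecture


def alloc_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order such that 2**order contiguous pages hold nbytes."""
    pages = -(-nbytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


for nbytes in (1500, 9000, 32 * 1024):
    order = alloc_order(nbytes)
    print(f"{nbytes:>6} bytes -> order {order}, "
          f"{(1 << order) * PAGE_SIZE // 1024} KB contiguous")
```

A 9000-byte frame needs three pages, which rounds up to four (order 2, 16 KB). That is the physically contiguous, atomic allocation that becomes hard to satisfy once memory is fragmented.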
Re: sk98lin, jumbo frames, and memory fragmentation
Stephen Hemminger wrote:
> Adding fragmentation support to skge driver is on my list of possible extensions. sky2 driver already supports it (yet one more feature that the vendor sk98lin driver doesn't do).

Thanks for speaking up. As I mentioned in my email to Jeff it looks like the sky2 driver is what I need (Marvell Yukon 88E8062). However, I'm on 2.6.14 and it doesn't exist there... do you anticipate any issues if I were to backport it?

Thanks,
Chris
Re: sk98lin, jumbo frames, and memory fragmentation
Jeff Garzik wrote:
> The sk98lin driver is going away, please don't use it. It's unmaintained and full of known bugs.

Okay... so it looks like the proper driver for the Marvell Yukon 88E8062 is the sky2 driver, and this one does avoid order>0 allocations. Am I on track?

Chris
Re: sk98lin, jumbo frames, and memory fragmentation
On Mon, 01 Oct 2007 15:15:59 -0600 Chris Friesen [EMAIL PROTECTED] wrote:

> Stephen Hemminger wrote:
>> Adding fragmentation support to skge driver is on my list of possible extensions. sky2 driver already supports it (yet one more feature that the vendor sk98lin driver doesn't do).
>
> Thanks for speaking up. As I mentioned in my email to Jeff it looks like the sky2 driver is what I need (Marvell Yukon 88E8062). However, I'm on 2.6.14 and it doesn't exist there... do you anticipate any issues if I were to backport it?

Nothing but the usual annoying kernel API changes.

-- 
Stephen Hemminger [EMAIL PROTECTED]
Re: [PATCH net-2.6.24 0/4]: TCP fixes
From: Ilpo Järvinen [EMAIL PROTECTED]
Date: Mon, 1 Oct 2007 15:29:40 +0300

> This fixes the newreno fackets_out case, which turned out to be not related to the Cedric's case being under investigation. Two trivial comment patches, and frto with high-speed seqno wrap-around protection.
>
> Compile tested. Please apply to net-2.6.24.

I've applied them all to net-2.6.24, thanks Ilpo!
[PATCH] mv643xx_eth: Do not modify struct netdev tx_queue_len
From: Dale Farnsworth [EMAIL PROTECTED]

This driver erroneously zeros dev->tx_queue_len, since mp->tx_ring_size has not yet been initialized. Actually, the driver shouldn't modify tx_queue_len at all and should leave the value set by alloc_etherdev(), currently 1000.

Signed-off-by: Dale Farnsworth [EMAIL PROTECTED]
---
Jeff, this bug was just reported today, or I would have batched it with the one I sent you last week. It's an obvious bugfix, so I'm not going to hold it in my queue.

 drivers/net/mv643xx_eth.c |    1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c
index 34288fe..3153356 100644
--- a/drivers/net/mv643xx_eth.c
+++ b/drivers/net/mv643xx_eth.c
@@ -1357,7 +1357,6 @@ static int mv643xx_eth_probe(struct platform_device *pdev)
 #endif
 
 	dev->watchdog_timeo = 2 * HZ;
-	dev->tx_queue_len = mp->tx_ring_size;
 	dev->base_addr = 0;
 	dev->change_mtu = mv643xx_eth_change_mtu;
 	dev->do_ioctl = mv643xx_eth_do_ioctl;
Re: [PATCH][TG3]Some cleanups
On Sun, 2007-09-30 at 14:11 -0400, jamal wrote:
> Here are some non-batching related changes that I have in my batching tree. Like the e1000e, they make the xmit code more readable. I wouldn't mind if you take them over.

Jamal, in tg3_enqueue_buggy(), we may have to call tg3_tso_bug(), which will recursively call tg3_start_xmit_dma_bug() after segmenting the TSO packet into normal packets. We need to restore the VLAN tag so that the GSO code will create the chain of segmented SKBs with the proper VLAN tag.
How do queue-less virtual devices wake higher level senders?
Hello!

I am having some trouble figuring out how virtual interfaces (such as mac-vlans) can wake up writers (such as udp sockets).

For 'real' hardware, it seems that the netif_stop_queue and netif_wake_queue methods handle stopping and waking the higher level senders, but for virtual devices with no queues, how does this work?

In my case, I'm using a virtual Station interface that sits on top of a wifi radio interface (hacked up madwifi). I notice that UDP connections set up for high speed, unidirectional sends are stalling after a few minutes. netstat -an shows a write-buffer that is quite full, but nothing is transmitted. If I ping or start any other type of traffic on these interfaces, the udp recovers. It seems like the udp send logic is just getting stuck and needs a kick.

I do not see any problems with TCP connections, and if I keep a slow-speed tcp connection running, the UDP will not hang.

It's likely the bug is in my driver and/or code, so this is not a bug report, just a question to hopefully help me debug it further :)

Thanks,
Ben

-- 
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
Re: How do queue-less virtual devices wake higher level senders?
From: Ben Greear [EMAIL PROTECTED]
Date: Mon, 01 Oct 2007 16:49:06 -0700

> For 'real' hardware, it seems that the netif_stop_queue and netif_wake_queue methods handle stopping and waking the higher level senders, but for virtual devices with no queues, how does this work?

They don't queue, there is nothing to stop or wakeup.
Re: How do queue-less virtual devices wake higher level senders?
David Miller wrote:
> From: Ben Greear [EMAIL PROTECTED]
> Date: Mon, 01 Oct 2007 16:49:06 -0700
>
>> For 'real' hardware, it seems that the netif_stop_queue and netif_wake_queue methods handle stopping and waking the higher level senders, but for virtual devices with no queues, how does this work?
>
> They don't queue, there is nothing to stop or wakeup.

Ok, so if I have a UDP socket bound to an interface that has no queue, and yet I see the send portion of the queue being full in netstat, what does this mean? Maybe the device I think has no queue somehow does?

I added some debugging to print out dev->state in sysfs, and the state of the virtual is always 0x6, which appears right to me. Its underlying device goes back and forth between 0x7 and 0x6, which also seems right to me.

When the thing is in the hung state, phys and virtual interface have 0x6 state, and yet the udp tx queue remains full. The physical NIC also prints out some errors about being low on buffers right before the hang, but it seems to recover since just doing a ping or starting a second udp connection brings everything back to life.

Other than IFF_UP and dev->state, are there other things that can make the tx logic stop sending to a device?

Thanks,
Ben

-- 
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc http://www.candelatech.com
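
For readers following the dev->state values, here is a rough decoding sketch. It assumes that the low bits of the state word in kernels of this era are XOFF (bit 0, set while the queue is stopped), START (bit 1) and PRESENT (bit 2). That numbering is recalled from enum netdev_state_t in include/linux/netdevice.h and is not stated anywhere in this thread, so check it against the tree actually in use. The table and function names are made up for the example.

```python
# Assumed bit layout (verify against include/linux/netdevice.h for your kernel).
STATE_BITS = {
    0: "XOFF",     # queue stopped, e.g. via netif_stop_queue()
    1: "START",    # interface is running
    2: "PRESENT",  # device is present
}


def decode_state(state):
    """Return the names of the known low bits that are set in state."""
    return [name for bit, name in sorted(STATE_BITS.items()) if state & (1 << bit)]


for value in (0x6, 0x7):
    print(f"0x{value:x}: {' | '.join(decode_state(value))}")
```

Under that assumption, 0x6 is a running, present device with its queue open, and 0x7 is the same device with its queue stopped, which matches the observation that the underlying device alternates between the two while the virtual one never leaves 0x6.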
Re: tcp bw in 2.6
On Sat, Sep 29, 2007 at 11:02:32AM -0700, Linus Torvalds wrote:
> On Sat, 29 Sep 2007, Larry McVoy wrote:
>> I haven't kept up on switch technology but in the past they were much better than you are thinking. The Kalpana switch that I had modified to support vlans (invented by yours truly), did not store and forward, it was cut through and could handle any load that was theoretically possible within about 1%.
>
> Hey, you may well be right. Maybe my assumptions about cutting corners are just cynical and pessimistic.

So I got a netgear switch and it works fine. But my tests are busted. Catching netdev up, I'm trying to optimize traffic to a server that has a gbit interface; I moved to a 24 port netgear that is all 10/100/1000 and I have a pile of clients to act as load generators.

I can do this on each of the clients

    dd if=/dev/zero bs=1024000 | rsh work dd of=/dev/null

and that cranks up to about 47K packets/second which is about 70MB/sec.

One of my clients also has gigabit so I played around with just that one and it (itanium running hpux w/ broadcom gigabit) can push the load as well. One weird thing is that it is dependent on the direction the data is flowing. If the hp is sending then I get 46MB/sec, if linux is sending then I get 18MB/sec. Weird. Linux is debian, running

    Linux work 2.6.18-5-k7 #1 SMP Thu Aug 30 02:52:31 UTC 2007 i686

and dual e1000 cards:

    e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
    e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection

I wrote a tiny little program to try and emulate this and I can't get it to do as well. I've tracked it down, I think, to the read side.
The server sources, the client sinks, the server looks like:

11689 accept(3, {sa_family=AF_INET, sin_port=htons(49376), sin_addr=inet_addr("10.3.1.38")}, [16]) = 4
11689 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [1048576], 4) = 0
11689 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
11689 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7ddf708) = 11694
11689 close(4) = 0
11689 accept(3, <unfinished ...>
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
11694 write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1048576
...

but the client looks like

connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896

which I suspect may be the problem.

I played around with SO_RCVBUF/SO_SNDBUF and that didn't help.
So any ideas why a simple dd piped through rsh is kicking my ass? It must be something simple but my test program is tiny and does nothing weird that I can see.

-- 
---
Larry McVoy    lm at bitmover.com    http://www.bitkeeper.com
Re: tcp bw in 2.6
On Mon, 1 Oct 2007, Larry McVoy wrote:
> but the client looks like
>
> connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
> ..

This is exactly what I'd expect if the machine is *not* under excessive load.

The system calls are fast enough that the latency for the TCP stack is roughly on the same scale as the time it takes to receive one new packet, so since a socket read will always return when it has any data (not until it has filled the whole buffer), you get exactly that one or two packets pattern.

If you'd be really CPU-limited or under load from other programs, you'd have more packets come in while you're in the read path, and you'd get bigger reads.

But do a tcpdump both ways, and see (for example) if the TCP window is much bigger going the other way.

Linus
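
The point about a stream read returning as soon as any data is available can be seen in user space with a short sketch. This is not the test program from the thread, only an illustration using a local socket pair. The helper name recv_exact and the two write sizes (borrowed from the strace output above) are assumptions for the example.

```python
import socket


def recv_exact(sock, nbytes):
    """Call recv() repeatedly until nbytes have arrived or the peer closes.

    Returns the data and the number of recv() calls that returned data.
    """
    chunks = []
    calls = 0
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # empty result means the peer closed the connection
            break
        chunks.append(chunk)
        remaining -= len(chunk)
        calls += 1
    return b"".join(chunks), calls


sender, receiver = socket.socketpair()
sender.sendall(b"\0" * 1448)
sender.sendall(b"\0" * 2896)
sender.close()

# Ask for 1 MB, as the client in the trace does; only 4344 bytes ever arrive.
data, calls = recv_exact(receiver, 1048576)
receiver.close()
print(f"received {len(data)} bytes in {calls} recv() call(s)")
```

Each recv() hands back whatever is queued at that moment rather than waiting to fill the 1 MB request, so the number of calls depends on timing. A sink that needs a fixed amount of data has to loop as above, and the small return values by themselves do not indicate a problem.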
Re: tcp bw in 2.6
On Mon, Oct 01, 2007 at 07:14:37PM -0700, Linus Torvalds wrote:
> On Mon, 1 Oct 2007, Larry McVoy wrote:
>> but the client looks like
>>
>> connect(3, {sa_family=AF_INET, sin_port=htons(31235), sin_addr=inet_addr("10.3.9.1")}, 16) = 0
>> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
>> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 1448
>> read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1048576) = 2896
>> ..
>
> This is exactly what I'd expect if the machine is *not* under excessive load.

That's fine, but why is it that my trivial program can't do as well as dd | rsh dd?

A short summary is "can someone please post a test program that sources and sinks data at the wire speed?" because apparently I'm too old and clueless to write such a thing.

-- 
---
Larry McVoy    lm at bitmover.com    http://www.bitkeeper.com
Re: [PATCH 01/10] Preparatory refactoring part 1.
Corey Hickey wrote:
> Patrick McHardy wrote:
>>> -	sch->qstats.drops++;
>>
>> A line in the changelog explaining that this was increased twice would have been nice.
>
> Certainly; I think I didn't realize, when you originally pointed out the duplicate incrementing, that it was a bug in the original version and not in my patch. Otherwise, I would have sent it as a separate patch.

I didn't remember that :)

> If a note in this patch will suffice, though, I'll definitely do so.

Sure, a note in the changelog will be fine.

>>> +static struct
>>> +sk_buff *sfq_q_dequeue(struct sfq_sched_data *q)
>>
>> What is this function needed for?
>
> It gets used in sfq_change for moving packets from the old queue into the new one. In this case, we don't want to modify sch->q.qlen or sch->qstats.backlog, since those don't actually change.
>
> 	while ((skb = sfq_q_dequeue(q)) != NULL)
> 		sfq_q_enqueue(skb, tmp, SFQ_TAIL);

I missed that, thanks for the explanation.
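
The reason for a separate inner dequeue can be shown with a toy user-space model. Everything here is hypothetical: ToyQdisc, its fields and the helper names only mimic the roles of sch->q.qlen, sch->qstats.backlog, sfq_q_enqueue and sfq_q_dequeue described above, and none of it is the actual SFQ code.

```python
from collections import deque


class ToyQdisc:
    """Hypothetical stand-in for a qdisc: an inner queue plus outer counters."""

    def __init__(self):
        self.inner = deque()  # plays the role of struct sfq_sched_data
        self.qlen = 0         # plays the role of sch->q.qlen
        self.backlog = 0      # plays the role of sch->qstats.backlog


def q_enqueue(inner, pkt):
    """Inner enqueue: touches only the queue, never the counters."""
    inner.append(pkt)


def q_dequeue(inner):
    """Inner dequeue: touches only the queue, never the counters."""
    return inner.popleft() if inner else None


def enqueue(sch, pkt):
    """Outer enqueue: a packet really enters the qdisc, so account for it."""
    q_enqueue(sch.inner, pkt)
    sch.qlen += 1
    sch.backlog += len(pkt)


def dequeue(sch):
    """Outer dequeue: a packet really leaves the qdisc, so account for it."""
    pkt = q_dequeue(sch.inner)
    if pkt is not None:
        sch.qlen -= 1
        sch.backlog -= len(pkt)
    return pkt


def change(sch):
    """Move every packet into a fresh inner queue without touching the counters."""
    tmp = deque()
    while True:
        pkt = q_dequeue(sch.inner)
        if pkt is None:
            break
        q_enqueue(tmp, pkt)
    sch.inner = tmp


sch = ToyQdisc()
for size in (100, 200, 300):
    enqueue(sch, b"x" * size)
change(sch)
print(sch.qlen, sch.backlog)
```

Because change() uses only the inner helpers, the packet count and byte backlog seen from outside are the same before and after the move, which is the behaviour the quoted explanation asks for.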
Re: Removing DAD in IPv6
Hi,

I just found out that IFA_F_NODAD is not in the kernel used in my test bed, which is 2.6.17. So I tried to modify the code in ipv6/addrconf.c myself to remove the DAD:

	if (!max_addresses ||
	    ipv6_count_addresses(in6_dev) < max_addresses)
		ifp = ipv6_add_addr(in6_dev, &addr, pinfo->prefix_len,
				    addr_type&IPV6_ADDR_SCOPE_MASK, 0);

	if (!ifp || IS_ERR(ifp)) {
		in6_dev_put(in6_dev);
		return;
	}

	// New code
	if (!IS_ERR(ifp)) {
		spin_lock_bh(&ifp->lock);
		ifp->flags &= ~IFA_F_TENTATIVE;
		spin_unlock_bh(&ifp->lock);
		addrconf_join_solict(ifp->idev->dev, &ifp->addr);
		ipv6_ifa_notify(RTM_NEWADDR, ifp);
		//in6_ifa_put(ifp);
		printk("New address configured.\n");
	}
	// --end ---

	update_lft = create = 1;
	ifp->cstamp = jiffies;
	// addrconf_dad_start(ifp, RTF_ADDRCONF|RTF_PREFIX_RT);

However, even though the new address is generated and assigned to the interface, and I can read the address from the /proc interface, my first few packets are eaten by the kernel. Only after about 1 second can my packets make their way out. Is the kernel doing anything that blocks the sending and receiving of packets during the time of DAD?

Thanks a lot!

Best Regards,
Xia Yang

On Mon, 2007-10-01 at 20:44 +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> In article [EMAIL PROTECTED] (at Mon, 01 Oct 2007 11:53:27 +0800), Xia Yang [EMAIL PROTECTED] says:
>
>> I would like to ask for help on how to remove or disable the DAD process properly, as long as the node can send, receive and forward packets immediately after a new IPv6 address is generated. Any pointer is appreciated. Thanks a lot in advance!
>
> IFA_F_NODAD address flag might help this.
>
> --yoshfuji