Re: [PATCH] ipvs: Add sysctl documentation
On Mon, Jul 03, 2006 at 07:36:47PM -0700, David Miller wrote:
> From: Horms [EMAIL PROTECTED]
> Date: Mon, 3 Jul 2006 11:31:30 +0900
>
> > * Derived from http://www.linuxvirtualserver.org/docs/sysctl.html, v1.4 maintained by Wensong Zhang
> > * Adjusted preamble to match ip-sysctl.txt
> > * Sorted options into alphabetical order
> > * Added expire_quiescent_template
> > * Removed timeout_* which are no longer present
> > * Incorporated doc/debug-levels.txt from IPVS source tree into description of ipvs_debug
> > * Minor spelling fixes
> > * Further editing more than welcome
> >
> > Signed-Off-By: Horms [EMAIL PROTECTED]
>
> Applied, thanks Simon.
>
> > * DaveM, do you need a 2.4 version of this document, it will likely be a slightly different list of options?
>
> I don't think it's really worthwhile, we should be touching 2.4.x as little as possible at this point. Changes we make in 2.4.x should be in the absolutely necessary category.

Understood, I'm more than happy to let that sleeping dog lie.

--
Horms
H: http://www.vergenet.net/~horms/
W: http://www.valinux.co.jp/en/

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/7] net_device list cleanup: core
Christoph,

On Mon, Jul 03, 2006 at 06:46:50PM +0100, Christoph Hellwig wrote:
> On Mon, Jul 03, 2006 at 12:18:51PM +0400, Andrey Savochkin wrote:
> > Cleanup of net_device list use in net_dev core and IP.
> > The cleanup consists of
> >  - converting the list to list_head, to make the list double-linked (thus making the remove operation O(1)), and list walks more readable;
> >  - introducing a for_each_netdev wrapper over list_for_each.
>
> When you change all this please make sure dev_base_head is never directly accessed anymore, not even through macros, and dev_base_head is not exported anymore. That's the only way to keep drivers from messing with it.
>
> Yes, it's a little more work as you need to audit all drivers to see what they are doing and find suitable abstractions but it's a must have that should have been done a lot earlier.

Hiding dev_base_head can be done by converting first_netdev/next_netdev into functions and implementing the for_each_netdev loop through them. Or are you talking about abstractions like functions for_each_netdev/find_netdev with callbacks?

Do you think that hiding the list internals is worth the additional complexity and substantial increase of the patch size?

Andrey
tiacx - don't use UTS_RELEASE
Hi,

patch below removes the use of UTS_RELEASE from the tiacx driver; there is absolutely no reason for a driver to print the kernel version or use the UTS_RELEASE field; in addition this field changes all the time so this causes spurious rebuilds.

Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

---
 drivers/net/wireless/tiacx/pci.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.17-mm4/drivers/net/wireless/tiacx/pci.c
===================================================================
--- linux-2.6.17-mm4.orig/drivers/net/wireless/tiacx/pci.c
+++ linux-2.6.17-mm4/drivers/net/wireless/tiacx/pci.c
@@ -1705,8 +1705,8 @@ acxpci_e_probe(struct pci_dev *pdev, con
 	/* acx_sem_unlock(adev); */
 
 	printk("acx "ACX_RELEASE": net device %s, driver compiled "
-		"against wireless extensions %d and Linux %s\n",
-		ndev->name, WIRELESS_EXT, UTS_RELEASE);
+		"against wireless extensions %d\n",
+		ndev->name, WIRELESS_EXT);
 
 #if CMD_DISCOVERY
 	great_inquisitor(adev);
Re: [patch 1/7] net_device list cleanup: core
On Tue, Jul 04, 2006 at 11:24:05AM +0400, Andrey Savochkin wrote:
> > Yes, it's a little more work as you need to audit all drivers to see what they are doing and find suitable abstractions but it's a must have that should have been done a lot earlier.
>
> Hiding dev_base_head can be done by converting first_netdev/next_netdev into functions and implementing for_each_netdev loop through them. Or are you talking about abstractions like functions for_each_netdev/find_netdev with callbacks?

A for_each_netdev with a callback makes sense and gives a cleaner abstraction, yes. I don't think you should need a callback for the lookup structure.

> Do you think that hiding the list internals is worth the additional complexity and substantial increase of the patch size?

Yes, absolutely. We've converted scsi hosts and devices from a model where drivers could directly access the list to strict iterators in the 2.5 series. It's quite a lot of work as you have to understand what the drivers actually do (and at least 50% of them were doing something really stupid) and convert them to the right abstractions.
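
For readers following the thread, the "strict iterator with a callback" idea can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct netdev, netdev_add(), netdev_for_each() and count_prefix() are hypothetical names invented here, not kernel interfaces, and the locking a real implementation would need is left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device; only what the sketch needs. */
struct netdev {
	const char *name;
	struct netdev *next;
};

/* The list head is file-local, so callers cannot walk it directly. */
static struct netdev *dev_list;

void netdev_add(struct netdev *dev)
{
	dev->next = dev_list;
	dev_list = dev;
}

/*
 * Call fn(dev, arg) for each device. A non-zero return value from fn
 * stops the walk and is passed back to the caller.
 */
int netdev_for_each(int (*fn)(struct netdev *dev, void *arg), void *arg)
{
	struct netdev *dev;
	int ret;

	for (dev = dev_list; dev != NULL; dev = dev->next) {
		ret = fn(dev, arg);
		if (ret != 0)
			return ret;
	}
	return 0;
}

/* Example callback: count devices whose name starts with a prefix. */
struct prefix_count {
	const char *prefix;
	int count;
};

int count_prefix(struct netdev *dev, void *arg)
{
	struct prefix_count *pc = arg;

	if (strncmp(dev->name, pc->prefix, strlen(pc->prefix)) == 0)
		pc->count++;
	return 0;
}
```

Because the head is static, a driver-like caller can only reach the devices through netdev_for_each(), which is the property being asked for in the review.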
Re: tiacx - don't use UTS_RELEASE
On Tue, 2006-07-04 at 02:25 -0700, Andrew Morton wrote:
> On Tue, 04 Jul 2006 11:07:59 +0200 Arjan van de Ven [EMAIL PROTECTED] wrote:
>
> > patch below removes the use of UTS_RELEASE from the tiacx driver; there is absolutely no reason for a driver to print the kernel version or use the UTS_RELEASE field; in addition this field changes all the time so this causes spurious rebuilds..
>
> http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch did it too.
>
> UTS_RELEASE doesn't change much. It's 2.6.17.

no, but the header that it's in changes all the time iirc, at least it used to (one of those kbuild regenerated files)
Re: tiacx - don't use UTS_RELEASE
On Tue, 04 Jul 2006 11:07:59 +0200 Arjan van de Ven [EMAIL PROTECTED] wrote:

> patch below removes the use of UTS_RELEASE from the tiacx driver; there is absolutely no reason for a driver to print the kernel version or use the UTS_RELEASE field; in addition this field changes all the time so this causes spurious rebuilds..

http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch did it too.

UTS_RELEASE doesn't change much. It's 2.6.17.
Re: tiacx - don't use UTS_RELEASE
On Tue, Jul 04, 2006 at 11:27:27AM +0200, Arjan van de Ven wrote:
> On Tue, 2006-07-04 at 02:25 -0700, Andrew Morton wrote:
> > On Tue, 04 Jul 2006 11:07:59 +0200 Arjan van de Ven [EMAIL PROTECTED] wrote:
> >
> > > patch below removes the use of UTS_RELEASE from the tiacx driver; there is absolutely no reason for a driver to print the kernel version or use the UTS_RELEASE field; in addition this field changes all the time so this causes spurious rebuilds..
> >
> > http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch did it too.
> >
> > UTS_RELEASE doesn't change much. It's 2.6.17.
>
> no but the header that it's in changes all the time iirc, at least it used to (one of those kbuild regenerated files)

Yesterday I pushed a change that split include/linux/version.h in two parts.

Now include/linux/version.h only contains:

#define LINUX_VERSION_CODE 132625
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

And the file will only be regenerated when the file content actually changes.

And UTS_RELEASE has moved to include/linux/utsrelease.h which contains:

#define UTS_RELEASE "2.6.17-g05668381-dirty"

This is the file that will change often - at least for git users. But with the patch only users of UTS_RELEASE will be rebuilt, which is far fewer than the users of version.h.

	Sam
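
As a worked check of the numbers quoted in this message: with the usual shift-based definition of KERNEL_VERSION, 2.6.17 packs to (2 << 16) + (6 << 8) + 17 = 131072 + 1536 + 17 = 132625, which matches the LINUX_VERSION_CODE shown. The small wrapper function kernel_version() below is a hypothetical helper added only so the macro can be exercised as a function.

```c
/* Same packing as the macro quoted from include/linux/version.h. */
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical wrapper so the macro can be called like a function. */
unsigned int kernel_version(unsigned int a, unsigned int b, unsigned int c)
{
	return KERNEL_VERSION(a, b, c);
}
```

Since the code depends only on the three numeric components, it stays the same across rebuilds of one release, while the UTS_RELEASE string carries the git suffix and changes often.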
Re: tiacx - don't use UTS_RELEASE
On Tue, 2006-07-04 at 11:51 +0200, Sam Ravnborg wrote:
> On Tue, Jul 04, 2006 at 11:27:27AM +0200, Arjan van de Ven wrote:
> > On Tue, 2006-07-04 at 02:25 -0700, Andrew Morton wrote:
> > > On Tue, 04 Jul 2006 11:07:59 +0200 Arjan van de Ven [EMAIL PROTECTED] wrote:
> > >
> > > > patch below removes the use of UTS_RELEASE from the tiacx driver; there is absolutely no reason for a driver to print the kernel version or use the UTS_RELEASE field; in addition this field changes all the time so this causes spurious rebuilds..
> > >
> > > http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch did it too.
> > >
> > > UTS_RELEASE doesn't change much. It's 2.6.17.
> >
> > no but the header that it's in changes all the time iirc, at least it used to (one of those kbuild regenerated files)
>
> Yesterday I pushed a change that split include/linux/version.h in two parts.
>
> Now include/linux/version.h only contains:
>
> #define LINUX_VERSION_CODE 132625
> #define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
>
> And the file will only be regenerated when the file content actually changes.
>
> And UTS_RELEASE has moved to include/linux/utsrelease.h which contains:
>
> #define UTS_RELEASE "2.6.17-g05668381-dirty"
>
> This is the file that will change often - at least for git users. But with the patch only users of UTS_RELEASE will be rebuilt, which is far fewer than the users of version.h.

which is a good thing, and we should keep users of utsrelease.h to a minimum... hence my patch to eliminate a user ;)
(which used it to do a printk.. but if you use a kernel the version is already in dmesg, no need to printk it again :)
Re: [VLAN]: translate IF_OPER_DORMANT to netif_dormant_on()
commit ddd7bf9fe4e59afc0a041378f82b6e1aa88f714b
tree 98764adba1bae7d128d2e7db7d9fc1e2fe5826d8
parent b00055aacdb172c05067612278ba27265fcd05ce
author Stefan Rompf [EMAIL PROTECTED] Tue, 21 Mar 2006 09:11:41 -0800
committer David S. Miller [EMAIL PROTECTED] Tue, 21 Mar 2006 09:11:41 -0800

[VLAN]: translate IF_OPER_DORMANT to netif_dormant_on()

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index fa76220..3948949 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -69,7 +69,7 @@ static struct packet_type vlan_packet_ty
 
 /* Bits of netdev state that are propagated from real device to virtual */
 #define VLAN_LINK_STATE_MASK \
-	((1<<__LINK_STATE_PRESENT)|(1<<__LINK_STATE_NOCARRIER))
+	((1<<__LINK_STATE_PRESENT)|(1<<__LINK_STATE_NOCARRIER)|(1<<__LINK_STATE_DORMANT))
 
 /* End of global variables definitions. */
 
@@ -450,7 +470,7 @@ static struct net_device *register_vlan_
 	new_dev->flags = real_dev->flags;
 	new_dev->flags &= ~IFF_UP;
 
-	new_dev->state = real_dev->state & VLAN_LINK_STATE_MASK;
+	new_dev->state = real_dev->state & ~(1<<__LINK_STATE_START);
 
 	/* need 4 bytes for extra VLAN header info,
 	 * hope the underlying device can handle it.

This introduced a regression by propagating the __LINK_STATE_XOFF flag: when the queue of the underlying device is stopped, it will be stopped for the VLAN device too and never be woken up. Since you changed VLAN_LINK_STATE_MASK, I assume the intention was to just add __LINK_STATE_DORMANT to the propagated flags and keep using it here?
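
The difference between the two expressions in the second hunk can be shown with a small standalone sketch. The bit numbers and names below are illustrative assumptions made for this example (the kernel's __LINK_STATE_* values live in netdevice.h and are not reproduced here), and both helper functions are hypothetical.

```c
/* Illustrative bit numbers only; not the kernel's actual enum. */
enum {
	LINK_STATE_XOFF = 0,
	LINK_STATE_START,
	LINK_STATE_PRESENT,
	LINK_STATE_NOCARRIER,
	LINK_STATE_DORMANT,
};

#define VLAN_LINK_STATE_MASK \
	((1UL << LINK_STATE_PRESENT) | (1UL << LINK_STATE_NOCARRIER) | \
	 (1UL << LINK_STATE_DORMANT))

/* Allow-list: copy only the bits that are meant to be propagated. */
unsigned long vlan_state_masked(unsigned long real_state)
{
	return real_state & VLAN_LINK_STATE_MASK;
}

/* Deny-list, as in the quoted hunk: everything except START is copied. */
unsigned long vlan_state_all_but_start(unsigned long real_state)
{
	return real_state & ~(1UL << LINK_STATE_START);
}
```

With a real device whose queue is stopped, the deny-list form carries the XOFF bit over to the new device while the allow-list form drops it, which is the behaviour the message describes as the regression.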
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
On Mon, 26 Jun 2006, Andi Kleen wrote:

> > I encountered the same problem on a dual core opteron equipped with a broadcom NIC (tg3) under 2.4. It could receive 1 Mpps when using TSC as the clock source, but the time jumped back and forth, so I changed it to 'notsc', then the performance dropped dramatically to around the same value as above with one CPU saturated. I suspect that the clock precision is needed by the tg3 driver to correctly decide to switch to polling mode, but unfortunately, the performance drop rendered the solution so unusable that I finally decided to use it only in uniprocessor with TSC enabled.
>
> 2.6 is more clever at this than 2.4. In particular it does the timestamp for each packet only when actually needed, which is relatively rare. Old experiences do not always apply to new kernels.

Note that I experienced this problem on 2.6.

Actually the change happens between kernel version 2.6.15 and 2.6.16, and is a result of Andi's changes to arch/x86_64/Kconfig and drivers/acpi/Kconfig, which allows/activates the use of the timer on x86_64.

Cheers,
  Jesper Brouer

--
---
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
---
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
On Tuesday 04 July 2006 13:41, Jesper Dangaard Brouer wrote:
> On Mon, 26 Jun 2006, Andi Kleen wrote:
>
> > > I encountered the same problem on a dual core opteron equipped with a broadcom NIC (tg3) under 2.4. It could receive 1 Mpps when using TSC as the clock source, but the time jumped back and forth, so I changed it to 'notsc', then the performance dropped dramatically to around the same value as above with one CPU saturated. I suspect that the clock precision is needed by the tg3 driver to correctly decide to switch to polling mode, but unfortunately, the performance drop rendered the solution so unusable that I finally decided to use it only in uniprocessor with TSC enabled.
> >
> > 2.6 is more clever at this than 2.4. In particular it does the timestamp for each packet only when actually needed, which is relatively rare. Old experiences do not always apply to new kernels.
>
> Note that I experienced this problem on 2.6.
>
> Actually the change happens between kernel version 2.6.15 and 2.6.16.

The timestamp optimizations are older. Don't remember the exact release, but earlier 2.6.

> And is a result of Andi's changes to arch/x86_64/Kconfig and drivers/acpi/Kconfig, which allows/activates the use of the timer on x86_64.

Not sure what you mean here? 2.6.18 will likely be more aggressive at using the TSC on i386 on Intel systems where possible, but x86-64 has done this for a long time already. When x86-64 uses a non-TSC clock source, it's because using the TSC is not safe.

-Andi
Re: strict isolation of net interfaces
Andrey Savochkin wrote:

> I still can't completely understand your direction of thoughts. Could you elaborate on IP address assignment in your diagram, please? For example, guest0 wants 127.0.0.1 and 192.168.0.1 addresses on its lo interface, and 10.1.1.1 on its eth0 interface. Does this diagram assume any local IP addresses on v* interfaces in the host?
>
> And the second question. Are vlo0, veth0, etc. devices supposed to have hard_xmit routines?

Andrey,

some people are interested in full network isolation/virtualization, like you did with the layer 2 isolation, and other people are interested in a light network isolation done at layer 3. The latter is intended to implement application containers, aka "lightweight containers".

In the case of a layer 3 isolation, the network interface is not totally isolated and the debate here is to find an intuitive way to manage the network devices.

IMHO, all the discussion we had convinced me of the need to be able to choose between a layer 2 and a layer 3 isolation.

If it is ok for you, we can collaborate to merge the two solutions into one. I will focus on layer 3 isolation and you on layer 2.

Regards

  - Daniel
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
> On Mon, 03 Jul 2006 20:54:37 -0400 Shailabh Nagar [EMAIL PROTECTED] wrote:
>
> > What happens when a listener exits without doing deregistration (or if the listener attempts to register another cpumask while a current registration is still active).
> >
> > ( Jamal, your thoughts on this problem would be appreciated)
> >
> > Problem is that we have a listener task which has registered with taskstats and caused its pid to be stored in various per-cpu lists of listeners. Later, when some other task exits on a given cpu, its exit data is sent using genlmsg_unicast on each pid present on that cpu's list.
> >
> > If the listener exits without doing a deregister, its pid continues to be kept around, obviously not a good thing. So we need some way of detecting the situation (task is no longer listening on these cpus events) that is efficient.
>
> Also need to address the case where the listener has closed off his file descriptor but continues to run.
>
> So hooking into listener's exit() isn't appropriate - the teardown is associated with the lifetime of the fd, not of the process. If we do that, exit() gets handled for free.

If you are always going to send unicast messages, then -ECONNREFUSED will tell you the listener has closed their fd - this doesn't mean it has exited. Besides that, one process could open several sockets. I know that would not be the app you would write - but it doesn't stop other people from doing it.

I think I may not follow what you are doing - for some reason I thought you may have many listeners in user space and these messages get multicast to them? Does the user space program somehow communicate its pid to the kernel?

cheers,
jamal
Re: [IPROUTE]: Introduce tc monitor
On Mon, 2006-03-07 at 12:13 +0200, Patrick McHardy wrote:
> Speaking of actions, do you have any plans to add help-texts? Currently the output is very confusing, whenever I use them I need to google for examples.

Thanks for reminding me. There are examples in the doc/ directory of iproute2, but they may be insufficient. In any case, I won't have time today or the rest of the week but will get some patch after that.

[Actually, I have about half a day off but I want to spend time reviewing the qdisc_is_running thing in a test environment (it takes me at least 2 hours to steal hardware and set it up)].

cheers,
jamal
Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
Russell Stuart wrote:
> On 26/06/2006 9:10 PM, Patrick McHardy wrote:
>
> > > 5. We still did have to modify the kernel for ATM. That was because of its rather unusual characteristics. However, if you look at the size of modifications made to the kernel versus the size made to the user space tool, (37 lines versus 303 lines,) the bulk of the work was done in user space.
> >
> > I'm sorry, but arguing that a limited special case solution is better because it needs slightly less code is just not reasonable.
>
> Without seeing your actual proposal it is difficult to judge whether this is a reasonable trade-off or not. Hopefully we will see your code soon. Do you have any idea when?

Unfortunately I still didn't get to cleaning them up, so I'm sending them in their preliminary state. It's not much that is missing, but the netem usage of skb->cb needs to be integrated better, I failed to move it to the qdisc_skb_cb so far because of circular includes. But nothing unfixable. I'm mostly interested in whether the current size-tables can express what you need for ATM, I wasn't able to understand the big comment in tc_core.c in your patch.

[NET_SCHED]: Add accessor function for packet length for qdiscs

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 2a6508576111d82246ee018edbcc4b0f0d18acad
tree 8be27ab6040ea90ed11728763e5b8fcf9e221b67
parent 31304c909e6945b005af62cd55a582e9c010a0b4
author Patrick McHardy [EMAIL PROTECTED] Tue, 04 Jul 2006 15:03:01 +0200
committer Patrick McHardy [EMAIL PROTECTED] Tue, 04 Jul 2006 15:03:01 +0200

 include/net/sch_generic.h |    9 +++--
 net/sched/sch_atm.c       |    4 ++--
 net/sched/sch_cbq.c       |   12 ++--
 net/sched/sch_dsmark.c    |    2 +-
 net/sched/sch_fifo.c      |    2 +-
 net/sched/sch_gred.c      |   12 ++--
 net/sched/sch_hfsc.c      |    8
 net/sched/sch_htb.c       |    8
 net/sched/sch_netem.c     |    6 +++---
 net/sched/sch_prio.c      |    2 +-
 net/sched/sch_red.c       |    2 +-
 net/sched/sch_sfq.c       |   14 +++---
 net/sched/sch_tbf.c       |    6 +++---
 net/sched/sch_teql.c      |    4 ++--
 14 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index b0e9108..75d7a55 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -184,12 +184,17 @@ tcf_destroy(struct tcf_proto *tp)
 	kfree(tp);
 }
 
+static inline unsigned int qdisc_tx_len(struct sk_buff *skb)
+{
+	return skb->len;
+}
+
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
 				       struct sk_buff_head *list)
 {
 	__skb_queue_tail(list, skb);
-	sch->qstats.backlog += skb->len;
-	sch->bstats.bytes += skb->len;
+	sch->qstats.backlog += qdisc_tx_len(skb);
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
 
 	return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index dbf44da..4df305e 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -453,9 +453,9 @@ #endif
 		if (flow) flow->qstats.drops++;
 		return ret;
 	}
-	sch->bstats.bytes += skb->len;
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
-	flow->bstats.bytes += skb->len;
+	flow->bstats.bytes += qdisc_tx_len(skb);
 	flow->bstats.packets++;
 	/*
 	 * Okay, this may seem weird. We pretend we've dropped the packet if
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 80b7f6a..5d705e2 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -404,7 +404,7 @@ static int
 cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct cbq_sched_data *q = qdisc_priv(sch);
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	int ret;
 	struct cbq_class *cl = cbq_classify(skb, sch, &ret);
 
@@ -688,7 +688,7 @@ #ifdef CONFIG_NET_CLS_POLICE
 
 static int cbq_reshape_fail(struct sk_buff *skb, struct Qdisc *child)
 {
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	struct Qdisc *sch = child->__parent;
 	struct cbq_sched_data *q = qdisc_priv(sch);
 	struct cbq_class *cl = q->rx_class;
@@ -915,7 +915,7 @@ cbq_dequeue_prio(struct Qdisc *sch, int
 			if (skb == NULL)
 				goto skip_class;
 
-			cl->deficit -= skb->len;
+			cl->deficit -= qdisc_tx_len(skb);
 			q->tx_class = cl;
 			q->tx_borrowed = borrow;
 			if (borrow != cl) {
@@ -923,11 +923,11 @@ #ifndef CBQ_XSTATS_BORROWS_BYTES
 				borrow->xstats.borrows++;
 				cl->xstats.borrows++;
 #else
-				borrow->xstats.borrows += skb->len;
-				cl->xstats.borrows += skb->len;
+
Re: strict isolation of net interfaces
Sam Vilain wrote:
> Daniel Lezcano wrote:
>
> > If it is ok for you, we can collaborate to merge the two solutions in one. I will focus on layer 3 isolation and you on the layer 2.
>
> So, you're writing a LSM module or adapting the BSD Jail LSM, right? :)
>
> Sam.

No. I am adapting a prototype of network application container we did.

  -- Daniel
Re: [patch 1/7] net_device list cleanup: core
On Tue, Jul 04, 2006 at 10:10:03AM +0100, Christoph Hellwig wrote:
> On Tue, Jul 04, 2006 at 11:24:05AM +0400, Andrey Savochkin wrote:
> > > Yes, it's a little more work as you need to audit all drivers to see what they are doing and find suitable abstractions but it's a must have that should have been done a lot earlier.
> >
> > Hiding dev_base_head can be done by converting first_netdev/next_netdev into functions and implementing for_each_netdev loop through them. Or are you talking about abstractions like functions for_each_netdev/find_netdev with callbacks?
>
> an for_each_netdev with a callback makes sense and gives a cleaner abstraction, yes. I don't think you should need a callback for the lookup structure.

Different modules want different kinds of lookup. So, I'm thinking about something like ilookup5.

> > Do you think that hiding the list internals is worth the additional complexity and substantial increase of the patch size?
>
> Yes, absolutely. We've converted scsi hosts and devices from a model where drivers could directly access the list to strict iterators in the 2.5 series. It's quite a lot of work as you have to understand what the drivers actually do (and to at least 50% they were doing something really stupid) and convert them to the right abstractions.

The next question: would people agree to review a patch doing this for net_devices? :)

Andrey
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
jamal wrote:
> On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
> > On Mon, 03 Jul 2006 20:54:37 -0400 Shailabh Nagar [EMAIL PROTECTED] wrote:
> >
> > > What happens when a listener exits without doing deregistration (or if the listener attempts to register another cpumask while a current registration is still active).
> > >
> > > ( Jamal, your thoughts on this problem would be appreciated)
> > >
> > > Problem is that we have a listener task which has registered with taskstats and caused its pid to be stored in various per-cpu lists of listeners. Later, when some other task exits on a given cpu, its exit data is sent using genlmsg_unicast on each pid present on that cpu's list.
> > >
> > > If the listener exits without doing a deregister, its pid continues to be kept around, obviously not a good thing. So we need some way of detecting the situation (task is no longer listening on these cpus events) that is efficient.
> >
> > Also need to address the case where the listener has closed off his file descriptor but continues to run.
> >
> > So hooking into listener's exit() isn't appropriate - the teardown is associated with the lifetime of the fd, not of the process. If we do that, exit() gets handled for free.
>
> If you are always going to send unicast messages, then -ECONNREFUSED will tell you the listener has closed their fd - this doesnt meant it has exited.

That's good. So we have at least one way of detecting the closed fd without deregistering within taskstats itself.

> Besides that one process could open several sockets. I know that would not be the app you would write - but it doesnt stop other people from doing it.

As far as the API is concerned, even a taskstats listener is not being prevented from opening multiple sockets. As Andrew also pointed out, everything needs to be done per-socket.

> I think i may not follow what you are doing - for some reason i thought you may have many listeners in user space and these messages get multicast to them?

That was the design earlier. In the past week, the design has changed to one where there are still many listeners in user space but messages get unicast to each of them. Earlier, listeners would get messages generated on task exit from every cpu; now they get them only from cpus for which they have explicitly registered interest (via a cpumask passed in through another genetlink command).

> Does the user space program somehow communicate its pid to the kernel?

Yes. When the listener registers interest in a set of cpus, as described above, its pid (genl_info->pid) is stored in the per-cpu list of listeners for those cpus. When a task exits on one of those cpus, the exit data is only sent via genetlink_unicast to those pids (really, nl_pids) which are on that cpu's listener list.

Now that I think more about it, netlink is really maintaining a pidhash of nl_pids, not process pids, right? So if one userapp were to open multiple sockets using the NETLINK_GENERIC protocol (regardless of how many of those are for taskstats), each of them would have to use a different nl_pid. Hence, it would be valid for the taskstats layer to use netlink_lookup() at any time to see if the corresponding socket were closed?

--Shailabh
Re: possible recursive locking in ATM layer
From: Arjan van de Ven [EMAIL PROTECTED]

> Linux version 2.6.17-git22 ([EMAIL PROTECTED]) (gcc version 4.0.3 (Ubuntu 4.0.3-1ubuntu5)) #20 PREEMPT Tue Jul 4 10:35:04 CEST 2006
>
> [ 2381.598609] =============================================
> [ 2381.619314] [ INFO: possible recursive locking detected ]
> [ 2381.635497] ---------------------------------------------
> [ 2381.651706] atmarpd/2696 is trying to acquire lock:
> [ 2381.666354]  (skb_queue_lock_key){-+..}, at: [<c028c540>] skb_migrate+0x24/0x6c
> [ 2381.688848]

ok this is a real potential deadlock in a way, it takes two locks of 2 skbuffs without doing any kind of lock ordering; I think the following patch should fix it. Just sort the lock taking order by address of the skb.. it's not pretty but it's the best this can do in a minimally invasive way.

I still agree with the comment that this code shouldn't live in the atm layer...

Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

---
 net/atm/ipcommon.c |   13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.17-mm6/net/atm/ipcommon.c
===================================================================
--- linux-2.6.17-mm6.orig/net/atm/ipcommon.c
+++ linux-2.6.17-mm6/net/atm/ipcommon.c
@@ -25,8 +25,8 @@
 /*
  * skb_migrate appends the list at "from" to "to", emptying "from" in the
  * process. skb_migrate is atomic with respect to all other skb operations on
- * "from" and "to". Note that it locks both lists at the same time, so beware
- * of potential deadlocks.
+ * "from" and "to". Note that it locks both lists at the same time, so to deal
+ * with the lock ordering, the locks are taken in address order.
  *
  * This function should live in skbuff.c or skbuff.h.
  */
@@ -39,8 +39,13 @@ void skb_migrate(struct sk_buff_head *fr
 	struct sk_buff *skb_to = (struct sk_buff *) to;
 	struct sk_buff *prev;
 
-	spin_lock_irqsave(&from->lock,flags);
-	spin_lock(&to->lock);
+	if (from<to) {
+		spin_lock_irqsave(&from->lock,flags);
+		spin_lock_nested(&to->lock, SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock_irqsave(&to->lock, flags);
+		spin_lock_nested(&from->lock, SINGLE_DEPTH_NESTING);
+	}
 	prev = from->prev;
 	from->next->prev = to->prev;
 	prev->next = skb_to;
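
The address-ordering idea in this patch is a general technique, and a userspace sketch with pthread mutexes may make it easier to see. Everything named here (struct queue, queue_lock_pair(), queue_unlock_pair(), queue_migrate()) is a hypothetical stand-in invented for the example; it does not model interrupts, lockdep annotations or the actual sk_buff_head layout.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a locked queue; len replaces the real list. */
struct queue {
	pthread_mutex_t lock;
	int len;
};

/* Take both locks, lower address first, so every caller agrees on the order. */
void queue_lock_pair(struct queue *a, struct queue *b)
{
	if ((uintptr_t)a < (uintptr_t)b) {
		pthread_mutex_lock(&a->lock);
		pthread_mutex_lock(&b->lock);
	} else {
		pthread_mutex_lock(&b->lock);
		pthread_mutex_lock(&a->lock);
	}
}

/* Unlock order does not matter for deadlock avoidance. */
void queue_unlock_pair(struct queue *a, struct queue *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}

/* Move everything from "from" to "to"; the two must be different queues. */
void queue_migrate(struct queue *from, struct queue *to)
{
	queue_lock_pair(from, to);
	to->len += from->len;
	from->len = 0;
	queue_unlock_pair(from, to);
}
```

Two threads calling queue_migrate(a, b) and queue_migrate(b, a) at the same time both try the lower-addressed lock first, so neither can end up holding one lock while waiting for the other.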
Re: [patch 1/7] net_device list cleanup: core
Hello!

> Different modules want different kinds of lookup. So, I'm thinking about something like ilookup5.
>
> The next question: would people agree to review a patch doing this for net_devices? :)

One suggestion, not original, but which has not been voiced here yet: implement netdev_iterate_list() or whatever, update only the core and a few of the devices, and deprecate dev_base_head with __deprecated_for_modules, adding it to Documentation/feature-removal-schedule.txt.

Alexey
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh Nagar wrote: jamal wrote: On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote: On Mon, 03 Jul 2006 20:54:37 -0400 Shailabh Nagar [EMAIL PROTECTED] wrote: What happens when a listener exits without doing deregistration (or if the listener attempts to register another cpumask while a current registration is still active). ( Jamal, your thoughts on this problem would be appreciated) Problem is that we have a listener task which has registered with taskstats and caused its pid to be stored in various per-cpu lists of listeners. Later, when some other task exits on a given cpu, its exit data is sent using genlmsg_unicast on each pid present on that cpu's list. If the listener exits without doing a deregister, its pid continues to be kept around, obviously not a good thing. So we need some way of detecting the situation (task is no longer listening on these cpus events) that is efficient. Also need to address the case where the listener has closed off his file descriptor but continues to run. So hooking into listener's exit() isn't appropriate - the teardown is associated with the lifetime of the fd, not of the process. If we do that, exit() gets handled for free. If you are always going to send unicast messages, then -ECONNREFUSED will tell you the listener has closed their fd - this doesnt meant it has exited. Thats good. So we have atleast one way of detecting the closed fd without deregistering within taskstats itself. Besides that one process could open several sockets. I know that would not be the app you would write - but it doesnt stop other people from doing it. As far as API is concerned, even a taskstats listener is not being prevented from opening multiple sockets. As Andrew also pointed out, everything needs to be done per-socket. I think i may not follow what you are doing - for some reason i thought you may have many listeners in user space and these messages get multicast to them? That was the design earlier. 
In the past week, the design has changed to one where there are still many listeners in user space but messages get unicast to each of them. Earlier, listeners would get messages generated on task exit from every cpu; now they get them only from cpus for which they have explicitly registered interest (via a cpumask passed in through another genetlink command).

Does the user space program somehow communicate its pid to the kernel?

Yes. When the listener registers interest in a set of cpus, as described above, its pid (genl_info->pid) is stored in the per-cpu list of listeners for those cpus. When a task exits on one of those cpus, the exit data is only sent via genetlink_unicast to those pids (really, nl_pids) that are on that cpu's listener list.

Now that I think more about it, netlink is really maintaining a pidhash of nl_pids, not process pids, right? So if one userapp were to open multiple sockets using the NETLINK_GENERIC protocol (regardless of how many of those are for taskstats), each of them would have to use a different nl_pid. Hence, it would be valid for the taskstats layer to use netlink_lookup() at any time to see if the corresponding socket were closed?

Here's a strawman for the problem we're trying to solve: get notification of the close of a NETLINK_GENERIC socket that had been used to register interest for some cpus within taskstats. From looking at the netlink code, the way to go seems to be:

- it maintains a pidhash of nl_pids that are currently registered to listen to at least one cpu. It also stores the cpumask used.
- taskstats registers a notifier block within netlink_chain and receives a callback on the NETLINK_URELEASE event, similar to drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event()
- the callback checks to see that the protocol is NETLINK_GENERIC and that the nl_pid for the socket is in taskstats' pidhash. If so, it does a cleanup using the stored cpumask and releases the nl_pid from the pidhash.
We can even do away with the deregister command altogether and simply rely on this autocleanup. --Shailabh
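The strawman above can be sketched in user-space C. This is only a model of the bookkeeping, not kernel code: the fixed-size table stands in for the pidhash, `listener_release()` stands in for the NETLINK_URELEASE callback, and all names here are invented for illustration. Treating a second registration from the same nl_pid as a union of cpumasks is an assumption of the sketch, not something the thread has settled.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CPUS      32
#define MAX_LISTENERS 16

/* One registered listener: its netlink port id (nl_pid) and the cpumask
 * it registered with. nl_pid == 0 marks a free slot in this sketch. */
struct listener {
    uint32_t nl_pid;
    uint32_t cpumask;   /* bit n set => wants exit data from cpu n */
};

static struct listener listeners[MAX_LISTENERS];

/* Register interest in the cpus of 'mask'. A repeated registration from
 * the same nl_pid adds to the stored mask (assumption of this sketch). */
static int listener_register(uint32_t nl_pid, uint32_t mask)
{
    struct listener *free_slot = NULL;

    if (nl_pid == 0)
        return -1;
    for (size_t i = 0; i < MAX_LISTENERS; i++) {
        if (listeners[i].nl_pid == nl_pid) {
            listeners[i].cpumask |= mask;
            return 0;
        }
        if (listeners[i].nl_pid == 0 && free_slot == NULL)
            free_slot = &listeners[i];
    }
    if (free_slot == NULL)
        return -1;
    free_slot->nl_pid = nl_pid;
    free_slot->cpumask = mask;
    return 0;
}

/* Stand-in for the socket-release callback: forget everything stored for
 * the closed socket. Returns the cpumask that was cleaned up, 0 if the
 * nl_pid was not registered. */
static uint32_t listener_release(uint32_t nl_pid)
{
    for (size_t i = 0; i < MAX_LISTENERS; i++) {
        if (nl_pid != 0 && listeners[i].nl_pid == nl_pid) {
            uint32_t mask = listeners[i].cpumask;
            listeners[i].nl_pid = 0;
            listeners[i].cpumask = 0;
            return mask;
        }
    }
    return 0;
}

/* Number of listeners that would be sent exit data for a task exiting
 * on 'cpu'. */
static int listeners_on_cpu(unsigned int cpu)
{
    int n = 0;

    if (cpu >= MAX_CPUS)
        return 0;
    for (size_t i = 0; i < MAX_LISTENERS; i++)
        if (listeners[i].nl_pid != 0 && (listeners[i].cpumask & (1u << cpu)))
            n++;
    return n;
}
```

Because the cleanup is keyed on nl_pid rather than on the process, a process that opens several sockets gets one entry per socket, and closing one socket leaves the others registered.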
[e1000]: flow control on by default - good idea really?
CCing anybody who may have stakes on this. Ignore the email if this doesnt interest you. Ok, folks - i had deferred this discussion but it bit me in the ass. I just spend an hour debugging it (and in the process blew up a gbic i borrowed, so my day aint going well since i actually have to pay for this and cant really do the testing i was planning to;-). I have a device connected to a e1000 that was erroneously advertising both tx/rx flow control but wasnt properly reacting to it. The default setup on the e1000 has rx flow control turned on. I was sending at wire rate gige from the device - which is about 1.48Mpps. The e1000 was in turn sending me flow control packets as per default/expected behavior. Unfortunately, it was sending a very large amount of packets. At one point i was seeing upto 1Mpps and on average, the flow control packets were consuming 60-70% of the bandwidth. Even when i fixed this behavior to act properly, allowing flow control on consumed up to 15% of the bandwidth. Clearly, this is a bad thing. Yes, the device in the first instance was at fault. But i have argued in the past that NAPI does just fine without flow control being turned on, so even chewing 5% of bandwidth on flow control is a bad thing.. As a compromise, can we declare flow control as an advanced feature and turn it off by default? People who feel it is valuable and know what they are doing can turn it off. If you want more details just shoot. cheers, jamal PS:- BTW, even turning off flow control on e1000 didnt give as good performance as in the old days on this machine - but i dont want to go into that discussion. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
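For reference, the 1.48Mpps wire-rate figure can be reproduced with a little arithmetic. The sketch below assumes minimum-size 64-byte Ethernet frames and 20 bytes of per-frame wire overhead (8 bytes of preamble plus a 12-byte inter-frame gap); the helper name is made up for illustration.

```c
#include <stdint.h>

/* Wire overhead per frame that is not part of the frame itself:
 * 8 bytes of preamble/SFD plus a 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Hypothetical helper: maximum frames per second for a link speed in
 * bits per second and a frame size in bytes (including the FCS). */
static uint64_t max_frames_per_sec(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_per_frame =
        (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;

    return link_bps / bits_per_frame;
}
```

`max_frames_per_sec(1000000000ULL, 64)` gives 1488095, i.e. about 1.48Mpps. Under the same assumptions, 1Mpps of minimum-size flow control frames would be roughly two thirds of that rate, which is in the same range as the 60-70% reported above.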
Re: [PATCH 2/4] d80211: fix receiving through virtual interfaces
On Mon, 3 Jul 2006 19:24:08 +0200 (CEST), Jiri Benc wrote: - Packet type (PACKET_HOST and PACKET_OTHER_HOST) is set correctly now. Uhm, not really. @@ -3057,7 +3048,9 @@ ieee80211_rx_h_check(struct ieee80211_tx return TXRX_DROP; } - if (memcmp(rx-dev-dev_addr, hdr-addr1, ETH_ALEN) == 0) + if (rx-fc WLAN_FC_TODS) + rx-skb-pkt_type = PACKET_OTHERHOST; I'm not sure how something so obviously wrong slipped there. The corrected version of the patch follows. --- net/d80211/ieee80211.c | 171 +++ net/d80211/ieee80211_i.h |5 + net/d80211/wpa.c |4 + 3 files changed, 124 insertions(+), 56 deletions(-) --- dscape.orig/net/d80211/ieee80211.c +++ dscape/net/d80211/ieee80211.c @@ -2463,27 +2463,15 @@ ieee80211_rx_h_data(struct ieee80211_txr memcpy(ehdr-h_source, src, ETH_ALEN); ehdr-h_proto = len; } - -if (rx-sta !rx-sta-assoc_ap - !(rx-sta (rx-sta-flags WLAN_STA_WDS))) -skb-dev = rx-sta-dev; -else -skb-dev = dev; + skb-dev = dev; skb2 = NULL; -sdata = IEEE80211_DEV_TO_SUB_IF(dev); -/* - * don't count the master since the low level code - * counts it already for us. 
- */ -if (skb-dev != sdata-master) { - sdata-stats.rx_packets++; - sdata-stats.rx_bytes += skb-len; -} + sdata-stats.rx_packets++; + sdata-stats.rx_bytes += skb-len; if (local-bridge_packets (sdata-type == IEEE80211_IF_TYPE_AP - || sdata-type == IEEE80211_IF_TYPE_VLAN)) { + || sdata-type == IEEE80211_IF_TYPE_VLAN) rx-u.rx.ra_match) { if (is_multicast_ether_addr(skb-data)) { /* send multicast frames both to higher layers in * local net stack and back to the wireless media */ @@ -2760,13 +2748,14 @@ static int ap_sta_ps_end(struct net_devi static ieee80211_txrx_result -ieee80211_rx_h_ieee80211_rx_h_ps_poll(struct ieee80211_txrx_data *rx) +ieee80211_rx_h_ps_poll(struct ieee80211_txrx_data *rx) { struct sk_buff *skb; int no_pending_pkts; if (likely(!rx-sta || WLAN_FC_GET_TYPE(rx-fc) != WLAN_FC_TYPE_CTRL || - WLAN_FC_GET_STYPE(rx-fc) != WLAN_FC_STYPE_PSPOLL)) + WLAN_FC_GET_STYPE(rx-fc) != WLAN_FC_STYPE_PSPOLL || + !rx-u.rx.ra_match)) return TXRX_CONTINUE; skb = skb_dequeue(rx-sta-tx_filtered); @@ -3042,8 +3031,10 @@ ieee80211_rx_h_check(struct ieee80211_tx if (unlikely(rx-fc WLAN_FC_RETRY rx-sta-last_seq_ctrl[rx-u.rx.queue] == hdr-seq_ctrl)) { - rx-local-dot11FrameDuplicateCount++; - rx-sta-num_duplicates++; + if (rx-u.rx.ra_match) { + rx-local-dot11FrameDuplicateCount++; + rx-sta-num_duplicates++; + } return TXRX_DROP; } else rx-sta-last_seq_ctrl[rx-u.rx.queue] = hdr-seq_ctrl; @@ -3057,7 +3048,9 @@ ieee80211_rx_h_check(struct ieee80211_tx return TXRX_DROP; } - if (memcmp(rx-dev-dev_addr, hdr-addr1, ETH_ALEN) == 0) + if (!rx-u.rx.ra_match) + rx-skb-pkt_type = PACKET_OTHERHOST; + else if (memcmp(rx-dev-dev_addr, hdr-addr1, ETH_ALEN) == 0) rx-skb-pkt_type = PACKET_HOST; else if (is_multicast_ether_addr(hdr-addr1)) { if (is_broadcast_ether_addr(hdr-addr1)) @@ -3080,8 +3073,10 @@ ieee80211_rx_h_check(struct ieee80211_tx WLAN_FC_GET_STYPE(rx-fc) == WLAN_FC_STYPE_PSPOLL)) rx-sdata-type != IEEE80211_IF_TYPE_IBSS (!rx-sta || !(rx-sta-flags WLAN_STA_ASSOC { - if (!(rx-fc 
WLAN_FC_FROMDS) !(rx-fc WLAN_FC_TODS)) { - /* Drop IBSS frames silently. */ + if ((!(rx-fc WLAN_FC_FROMDS) !(rx-fc WLAN_FC_TODS)) || + !rx-u.rx.ra_match) { + /* Drop IBSS frames and frames for other hosts +* silently. */ return TXRX_DROP; } @@ -3113,6 +3108,8 @@ ieee80211_rx_h_check(struct ieee80211_tx rx-key = rx-sdata-keys[keyidx]; } if (!rx-key) { + if (!rx-u.rx.ra_match) + return TXRX_DROP; printk(KERN_DEBUG %s: RX WEP frame with unknown keyidx %d (A1= MACSTR A2= MACSTR A3= MACSTR )\n, @@ -3128,7
[2.6.17-git22] lock debugging output
Hoping gmail doesn't mess it too badly... eth0: tg3 (BCM5751 Gbit Ethernet) eth1: ipw2200 (Intel PRO/Wireless 2200BG) Sequence: 1. boot with eth0 disconnected (eth1 doesn't come up on boot) 2. ifup eth1, bring wpa-supplicant up 3. run 'dig' --- lock debug info gets printed on console Note that due to my very variable network setup, I had no /etc/resolv.conf in place at the moment I ran 'dig'. Second execution of 'dig' did not print any lock debug output but just (properly) stalled; then I realized I didn't put my home resolv.conf in place, did that and 'dig' just worked. System appears to work and I'm actually typing this report from the same kernel that reported the following upon invoking 'dig' : = [ INFO: inconsistent lock state ] - inconsistent {softirq-on-W} - {in-softirq-R} usage. dig/2373 [HC0[0]:SC1[2]:HE1:SE0] takes: (sk-sk_dst_lock){---?}, at: [c028cf72] sk_dst_check+0x1b/0xe6 {softirq-on-W} state was registered at: [c0127a6a] lock_acquire+0x60/0x80 [c02e151d] _write_lock+0x19/0x28 [c028c0af] sock_setsockopt+0x351/0x49c [c0289d0d] sys_setsockopt+0x5b/0x8d [c028ac22] sys_socketcall+0x148/0x186 [c0102699] sysenter_past_esp+0x56/0x8d irq event stamp: 1130 hardirqs last enabled at (1130): [c01161ed] local_bh_enable_ip+0xb2/0xbb hardirqs last disabled at (1129): [c011618e] local_bh_enable_ip+0x53/0xbb softirqs last enabled at (1120): [c029423c] dev_queue_xmit+0x205/0x211 softirqs last disabled at (1121): [c01040e6] do_softirq+0x4d/0xac other info that might help us debug this: 2 locks held by dig/2373: #0: (sk_lock-AF_INET6){--..}, at: [f8cf1168] udpv6_sendmsg+0x546/0x818 [ipv6] #1: (slock-AF_INET6){-...}, at: [f8cf3228] icmpv6_send+0x222/0x549 [ipv6] stack backtrace: [c0102e44] show_trace+0xd/0x10 [c010335e] dump_stack+0x19/0x1b [c01260e1] print_usage_bug+0x1cc/0x1d9 [c01265e2] mark_lock+0x193/0x360 [c01271ee] __lock_acquire+0x3b7/0x969 [c0127a6a] lock_acquire+0x60/0x80 [c02e15ff] _read_lock+0x19/0x28 [c028cf72] sk_dst_check+0x1b/0xe6 [f8ce1305] 
ip6_dst_lookup+0x31/0x16d [ipv6] [f8cf3338] icmpv6_send+0x332/0x549 [ipv6] [f8cf09a1] udpv6_rcv+0x4ab/0x4d6 [ipv6] [f8ce2900] ip6_input+0x19c/0x228 [ipv6] [f8ce2d61] ipv6_rcv+0x188/0x1b7 [ipv6] [c02925b7] netif_receive_skb+0x18d/0x1d8 [c0293d6a] process_backlog+0x80/0xf9 [c0293f43] net_rx_action+0x80/0x174 [c01162fd] __do_softirq+0x46/0x9c [c01040e6] do_softirq+0x4d/0xac === [c0116117] local_bh_enable+0xc8/0xec [c029423c] dev_queue_xmit+0x205/0x211 [c0298a8b] neigh_resolve_output+0x1db/0x207 [f8ce0bee] ip6_output2+0x1e4/0x202 [ipv6] [f8ce12aa] ip6_output+0x69e/0x6c8 [ipv6] [f8ce1706] ip6_push_pending_frames+0x2c5/0x377 [ipv6] [f8cefd8e] udp_v6_push_pending_frames+0x154/0x176 [ipv6] [f8cf122a] udpv6_sendmsg+0x608/0x818 [ipv6] [c02c6b1d] inet_sendmsg+0x3b/0x48 [c02894f9] sock_sendmsg+0xe8/0x103 [c0289b18] sys_sendmsg+0x14f/0x1aa [c028ac45] sys_socketcall+0x16b/0x186 [c0102699] sysenter_past_esp+0x56/0x8d Hope this may be useful to lock debug devs / netdev folks... Ciao, --alessandro I can't change what makes me high and I can't change what I believe in (Heather Nova, My Fidelity) - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [2.6.17-git22] lock debugging output
From: Arjan van de Ven [EMAIL PROTECTED]

On Tue, 2006-07-04 at 20:13 +0200, Alessandro Suardi wrote: Hoping gmail doesn't mess it too badly... eth0: tg3 (BCM5751 Gbit Ethernet) eth1: ipw2200 (Intel PRO/Wireless 2200BG) Sequence: 1. boot with eth0 disconnected (eth1 doesn't come up on boot) 2. ifup eth1, bring wpa-supplicant up 3. run 'dig' --- lock debug info gets printed on console

this appears to be a real deadlock: the SO_BINDTODEVICE setsockopt calls sk_dst_reset(), which looks like this:

static inline void sk_dst_reset(struct sock *sk)
{
	write_lock(&sk->sk_dst_lock);
	__sk_dst_reset(sk);
	write_unlock(&sk->sk_dst_lock);
}

now... ipv6 does this in softirq context:

[c028cf72] sk_dst_check+0x1b/0xe6
[f8ce1305] ip6_dst_lookup+0x31/0x16d [ipv6]
[f8cf3338] icmpv6_send+0x332/0x549 [ipv6]
[f8cf09a1] udpv6_rcv+0x4ab/0x4d6 [ipv6]
[f8ce2900] ip6_input+0x19c/0x228 [ipv6]
[f8ce2d61] ipv6_rcv+0x188/0x1b7 [ipv6]
[c02925b7] netif_receive_skb+0x18d/0x1d8
[c0293d6a] process_backlog+0x80/0xf9
[c0293f43] net_rx_action+0x80/0x174
[c01162fd] __do_softirq+0x46/0x9c
[c01040e6] do_softirq+0x4d/0xac

where sk_dst_check() takes the same lock for read. that looks like a real deadlock to me...

the most obvious low impact solution is to make sk_dst_reset use an irqsave variant; patch for that is attached below. I'll leave it to the networking people to say if that's the real right approach

Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

---
 include/net/sock.h |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.17-mm6/include/net/sock.h
===================================================================
--- linux-2.6.17-mm6.orig/include/net/sock.h
+++ linux-2.6.17-mm6/include/net/sock.h
@@ -1025,9 +1025,10 @@ __sk_dst_reset(struct sock *sk)
 static inline void
 sk_dst_reset(struct sock *sk)
 {
-	write_lock(&sk->sk_dst_lock);
+	unsigned long flags;
+	write_lock_irqsave(&sk->sk_dst_lock, flags);
 	__sk_dst_reset(sk);
-	write_unlock(&sk->sk_dst_lock);
+	write_unlock_irqrestore(&sk->sk_dst_lock, flags);
 }
 
 extern struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
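The rule behind the "inconsistent lock state" report can be illustrated with a small user-space model. This is a deliberately simplified sketch, not how the lock validator is actually implemented: it only tracks in which contexts a read/write lock has been taken and flags the combinations where a softirq could interrupt a holder and then spin on the same lock. All names are invented for the example.

```c
#include <stdbool.h>

/* Context in which a lock acquisition happens (sketch-only model). */
enum lock_ctx {
    CTX_SOFTIRQ,         /* taken from softirq context */
    CTX_PROCESS_BH_ON,   /* process context, softirqs enabled (plain lock) */
    CTX_PROCESS_BH_OFF   /* process context, softirqs disabled (_bh/_irqsave) */
};

/* Usage bits remembered per lock. */
#define IN_SOFTIRQ_R  (1u << 0)   /* like {in-softirq-R} in the report */
#define IN_SOFTIRQ_W  (1u << 1)
#define SOFTIRQ_ON_R  (1u << 2)
#define SOFTIRQ_ON_W  (1u << 3)   /* like {softirq-on-W} in the report */

struct lock_state {
    unsigned int usage;
};

static bool lock_consistent(const struct lock_state *l)
{
    bool in_softirq_any = (l->usage & (IN_SOFTIRQ_R | IN_SOFTIRQ_W)) != 0;

    /* A writer with softirqs enabled can be interrupted by a softirq
     * that wants the same lock, for read or for write. */
    if ((l->usage & SOFTIRQ_ON_W) && in_softirq_any)
        return false;
    /* A reader with softirqs enabled can be interrupted by a softirq
     * that wants the lock for write. */
    if ((l->usage & SOFTIRQ_ON_R) && (l->usage & IN_SOFTIRQ_W))
        return false;
    return true;
}

/* Record one acquisition; returns false once the accumulated usage
 * contains a potentially deadlocking combination. */
static bool lock_mark(struct lock_state *l, enum lock_ctx ctx, bool write)
{
    if (ctx == CTX_SOFTIRQ)
        l->usage |= write ? IN_SOFTIRQ_W : IN_SOFTIRQ_R;
    else if (ctx == CTX_PROCESS_BH_ON)
        l->usage |= write ? SOFTIRQ_ON_W : SOFTIRQ_ON_R;
    /* CTX_PROCESS_BH_OFF records nothing: no softirq can interrupt. */
    return lock_consistent(l);
}
```

In this model the reported sequence is a write with `CTX_PROCESS_BH_ON` (the setsockopt path) followed by a read with `CTX_SOFTIRQ` (the icmpv6_send path), which is flagged. Switching the writer to `CTX_PROCESS_BH_OFF`, as the irqsave variant in the patch effectively does, leaves the usage consistent.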
Re: RDMA will be reverted
On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote: On Saturday 01 July 2006 01:01, Tom Tucker wrote: On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote: The TOE folks have tried to submit their hooks and drivers on several occaisions, and we've rejected it every time. iWARP != TOE Perhaps a good start of that discussion David asked for would be if you could give us an overview of the differences and how you avoid the TOE problems. Interesting thread, I hope someone replies to Andi's request. I've actually no real idea what RDMA, IWARP TOE are, so I may be barking up completely the wrong tree here. If so, apologies. But it sounds like we're talking about technologies that offload some part of the network/transport layer processing to the interface device? And the primary objection to that is that it may bypass some of the cool features of the Linux stack? Stuff like iptables and ... what exactly? Presumably the reason why such offloading would be a Good Thing are to do with very high speed network processing, 10G ethernet and the like. Which sounds to me very like the way dedicated networking kit would do that. So if you have a device that needs to be a very high performance router, you dedicate it to that function and don't try to do clever per-packet or -flow processing at the same time. In the Cisco world, there's a network design approach where you consider your equipment in three 'layers', I think they call them the core, distribution and access layers, or something like that. The idea is that the core layer handles the real high speed stuff, you don't do anything much except routing/switching in there. The other layers aggregate flows for the core, if you need extra processing (firewalls etc) you do it somewhere in there. So, for example, the packet capture functions (sort of like tcpdump) don't work if fast switching is in use, which it would be in the core. 
Users accept these tradeoffs, because if you design it right you can do the extra processing on some other device in the outer layers. So perhaps there's a good argument to make that a Linux system with the right hardware could be considered a core device. Likely any place you have such a system it would be dedicated to just moving data as well as possible, and let other systems do the other stuff. You wouldn't want your server farm systems to also be your firewalls. Bottom line - these technologies seem to me to have a place in a well designed network. Just my 2c... - Andy -Andi
Re: [e1000]: flow control on by default - good idea really?
On Tue, 2006-04-07 at 13:11 -0400, jamal wrote: CCing anybody who may have stakes on this. Ignore the email if this doesnt interest you. Ok, folks - i had deferred this discussion but it bit me in the ass. I just spend an hour debugging it (and in the process blew up a gbic i borrowed, so my day aint going well since i actually have to pay for this and cant really do the testing i was planning to;-). I have a device connected to a e1000 that was erroneously advertising both tx/rx flow control but wasnt properly reacting to it. The default setup on the e1000 has rx flow control turned on. I was sending at wire rate gige from the device - which is about 1.48Mpps. The e1000 was in turn sending me flow control packets as per default/expected behavior. Unfortunately, it was sending a very large amount of packets. At one point i was seeing upto 1Mpps and on average, the flow control packets were consuming 60-70% of the bandwidth. Even when i fixed this behavior to act properly, allowing flow control on consumed up to 15% of the bandwidth. Clearly, this is a bad thing. Yes, the device in the first instance was at fault. But i have argued in the past that NAPI does just fine without flow control being turned on, so even chewing 5% of bandwidth on flow control is a bad thing.. As a compromise, can we declare flow control as an advanced feature and turn it off by default? People who feel it is valuable and know what they are doing can turn it off. I meant turn it on. BTW, As an addendum this default behavior changed around 2.6.16 it seems. cheers, jamal - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote: Russell Stuart wrote: [..] Without seeing your actual proposal it is difficult to judge whether this is a reasonable trade-off or not. Hopefully we will see your code soon. Do you have any idea when?

Unfortunately I still didn't get to cleaning them up, so I'm sending them in their preliminary state. It's not much that is missing, but the netem usage of skb->cb needs to be integrated better, I failed to move it to the qdisc_skb_cb so far because of circular includes. But nothing unfixable. I'm mostly interested if the current size-tables can express what you need for ATM, I wasn't able to understand the big comment in tc_core.c in your patch.

Looks good from within the range of change within reason of addressed problem. The cb on the qdisc seems only usable for netem, correct? Also while not unreasonable, i wasn't sure how qdisc_enqueue_root() fit in the grand scheme of things for this change (it seemed out of place).

cheers, jamal
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh,

On Tue, 2006-04-07 at 12:37 -0400, Shailabh Nagar wrote: [..] Here's a strawman for the problem we're trying to solve: get notification of the close of a NETLINK_GENERIC socket that had been used to register interest for some cpus within taskstats. From looking at the netlink code, the way to go seems to be - it maintains a pidhash of nl_pids that are currently registered to listen to at least one cpu. It also stores the cpumask used. - taskstats registers a notifier block within netlink_chain and receives a callback on the NETLINK_URELEASE event, similar to drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event() - the callback checks to see that the protocol is NETLINK_GENERIC and that the nl_pid for the socket is in taskstats' pidhash. If so, it does a cleanup using the stored cpumask and releases the nl_pid from the pidhash.

Sounds quite reasonable. I am beginning to wonder whether we should do the NETLINK_URELEASE in general for NETLINK_GENERIC.

We can even do away with the deregister command altogether and simply rely on this autocleanup.

I think you may still need the register if you are going to allow multiple sockets per listener process, no? The other question is how do you correlate pid - fd?

cheers, jamal
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Shailabh wrote: Perhaps I should use the other ascii format for specifying cpumasks since it's more amenable to specifying an upper bound for the length of the ascii string and is more compact?

Eh - basically - I don't have a strong opinion either way. I have a slight esthetic preference toward using the list of ranges format from shell scripts and shell prompts, and using the 32-bit hex words from C code:

    17-26,44-47   # shell - list of ranges
    f000,07fe     # C - 32-bit hex words

Since the primary interface you are working with is C code, that would mean I'd slightly prefer the 32-bit hex word variant.

From what I've seen, neither of the reasons you gave for preferring the 32-bit hex word format is persuasive (even though they both lead to the same conclusion as I preferred ;):

Which is more compact depends on the particular bit pattern you need to represent. See for example the examples above.

The lack of a perfect upper bound on the list of ranges format is a theoretical problem that I have never seen in practice. Only pathological constructs exceed six ascii characters per set bit.

-- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.925.600.0401
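To make the two cpumask notations concrete, here is a rough user-space sketch that parses a list of ranges into a bitmask and prints it as 32-bit hex words. It is not the kernel's parser; the function names, the 64-cpu limit and the choice to zero-pad every word to eight hex digits are assumptions of the sketch.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MASK_WORDS 2   /* 64 cpus in this sketch; mask[0] holds cpus 0-31 */

/* Parse a list such as "17-26,44-47" into mask[]. Returns 0 on success,
 * -1 on malformed input or a cpu number outside the mask. */
static int parse_cpulist(const char *s, uint32_t mask[MASK_WORDS])
{
    memset(mask, 0, MASK_WORDS * sizeof(mask[0]));
    while (*s != '\0') {
        char *end;
        unsigned long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;
        if (*end == '-') {
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (lo > hi || hi >= MASK_WORDS * 32u)
            return -1;
        for (unsigned long cpu = lo; cpu <= hi; cpu++)
            mask[cpu / 32] |= (uint32_t)1 << (cpu % 32);
        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1;      /* trailing comma */
        } else if (*end != '\0') {
            return -1;
        }
        s = end;
    }
    return 0;
}

/* Format mask[] as comma-separated 32-bit hex words, most significant
 * word first, each word zero-padded to eight digits. */
static void format_hexwords(const uint32_t mask[MASK_WORDS],
                            char *buf, size_t len)
{
    size_t off = 0;

    if (len == 0)
        return;
    buf[0] = '\0';
    for (int w = MASK_WORDS - 1; w >= 0; w--) {
        int n = snprintf(buf + off, len - off, "%s%08x",
                         w == MASK_WORDS - 1 ? "" : ",",
                         (unsigned int)mask[w]);
        if (n < 0 || (size_t)n >= len - off)
            return;             /* output truncated */
        off += (size_t)n;
    }
}
```

With this sketch's padding convention, the list `17-26,44-47` sets bits 17-26 in the low word and bits 12-15 in the high word, and formats as `0000f000,07fe0000`. The hex form has a fixed length for a given number of cpus, which is the upper-bound property discussed above.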
Re: 2.6.17-mm6
this is one for the networking people, and thus netdev On Tue, 2006-07-04 at 21:53 +0200, Rafael J. Wysocki wrote: On Monday 03 July 2006 12:03, Andrew Morton wrote: ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17/2.6.17-mm6/ - A major update to the e1000 driver. - 1394 updates Just found this in dmesg: = [ INFO: inconsistent lock state ] - inconsistent {in-hardirq-W} - {hardirq-on-W} usage. nscd/4929 [HC0[0]:SC0[1]:HE1:SE0] takes: (skb_queue_lock_key){++..}, at: [8044fe40] udp_ioctl+0x50/0xa0 {in-hardirq-W} state was registered at: [8024b4fa] lock_acquire+0x8a/0xc0 [80476e3f] _spin_lock_irqsave+0x3f/0x60 [80408c25] skb_queue_tail+0x25/0x60 ok so skb_queue_lock is used in a hardirq context [881c9517] queue_packet_complete+0x27/0x40 [ieee1394] [881c9d6b] hpsb_packet_sent+0xab/0x100 [ieee1394] [8822a4b5] dma_trm_reset+0x115/0x140 [ohci1394] [8822c512] ohci_devctl+0x1c2/0x540 [ohci1394] [881c9673] hpsb_bus_reset+0x43/0xb0 [ieee1394] [8822d7f6] ohci_irq_handler+0x416/0x830 [ohci1394] [802631ab] handle_IRQ_event+0x2b/0x70 [80264dd4] handle_level_irq+0xc4/0x130 [8020c762] do_IRQ+0x112/0x130 [80209d90] common_interrupt+0x64/0x65 irq event stamp: 4280 hardirqs last enabled at (4279): [8047606a] trace_hardirqs_on_thunk+0x35/0x37 hardirqs last disabled at (4278): [804760a1] trace_hardirqs_off_thunk+0x35/0x67 softirqs last enabled at (4258): [804065b5] release_sock+0xd5/0xe0 softirqs last disabled at (4280): [804764d1] _spin_lock_bh+0x11/0x50 other info that might help us debug this: no locks held by nscd/4929. 
stack backtrace: Call Trace: [8020ab9f] show_trace+0x9f/0x240 [8020af75] dump_stack+0x15/0x20 [80249e52] print_usage_bug+0x272/0x290 [8024a0d7] mark_lock+0x267/0x5f0 [8024a9a6] __lock_acquire+0x546/0xd10 [8024b4fb] lock_acquire+0x8b/0xc0 [804764f4] _spin_lock_bh+0x34/0x50 [8044fe40] udp_ioctl+0x50/0xa0 yet udp_ioctl takes it only for _bh [80457359] inet_ioctl+0x69/0x70 [804033ac] sock_ioctl+0x22c/0x270 [802a32b1] do_ioctl+0x31/0xa0 [802a35db] vfs_ioctl+0x2bb/0x2e0 [802a366a] sys_ioctl+0x6a/0xa0 [8020985a] system_call+0x7e/0x83 [2b2d76ab98a9] is this a real scenario, or is this a case of firewire is special and needs it's own rules? - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
Andrew wrote: OK, so we're passing in an ASCII string. Fair enough, I think. Paul would know better. Not sure if I know better - just got stronger opinions. I like the ASCII here - but this is one of those he who writes the code gets to -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.925.600.0401
Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats
pj wrote: writes the code gets to Never mind that last incomplete post - I hit Send when I meant to hit Cancel. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson [EMAIL PROTECTED] 1.925.600.0401
Re: RDMA will be reverted
Andi> Perhaps a good start of that discussion David asked for
Andi> would be if you could give us an overview of the differences
Andi> and how you avoid the TOE problems.

Well, here's a quick overview, leaving out some of the details. The difference between TOE and iWARP/RDMA is really the interface that they present.

A TOE (TCP Offload Engine) is a piece of hardware that offloads TCP processing from the main system to handle regular sockets. There is either some way to hand off a socket from the host stack to the TOE, or a socket is created on the TOE to start with, but in both cases, the TOE is handling processing for normal TCP sockets. This means that the TOE has some hardware and/or firmware to do stateful TCP processing.

An iWARP device, or RNIC (RDMA NIC), also usually has hardware and/or firmware TCP processing, but this isn't exposed through the BSD socket interface. Instead, an RNIC presents an interface more like an InfiniBand HCA: work requests (sends, receives, RDMA operations) are passed to the RNIC via work queues, and completion notification is returned asynchronously via completion queues.

An iWARP connection can handle both send/receive (two-sided) and get/put (RDMA or one-sided) operations. For full details of the protocol used for this, you can look at the drafts from the IETF rddp working group, but basically an RDMA protocol is layered on top of a connected stream protocol -- usually TCP, but SCTP could be used as well. A lot of the performance of iWARP comes from the RDMA/direct placement capabilities -- for example an NFS/RDMA server can process requests out of order and put data directly into the buffer that's waiting for it, without using any CPU on the destination -- but even send/receive operations can be useful.

As a side note, an RNIC will also typically support the same sort of kernel bypass as an IB HCA, where work queues can be safely mapped into a userspace process's memory so that work requests can be posted without a system call. In fact, people usually use RDMA as a shorthand for the combination of the three features I described: asynchronous work queues and completion queues, connections that support both send/receive and RDMA, and kernel bypass.

In any case, RNIC support can be added to the existing IB stack with fairly minor modifications -- you can search the netdev archives for the patchsets posted by Steve Wise, but nearly all of the new code is in the low-level hardware driver for the specific iWARP devices.

The real issues for netdev are things like Steve Wise's patch to add route change notifiers, which could be used to tell RNICs when to update the next hop for a connection they're handling. More generally, it would be interesting to see if it's possible to tie an RNIC into the kernel's packet filtering, so that disallowed connections don't get set up. This seems very similar in spirit to the problems around packet filtering that were raised for VJ netchannels.

- Roland
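The work queue / completion queue interface described above can be pictured with a toy model in plain C. This is purely illustrative and is not the InfiniBand or iWARP API: the structure and function names are invented, and `device_process_one()` stands in for whatever the hardware does asynchronously.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 4

enum wr_opcode { WR_SEND, WR_RECV, WR_RDMA_WRITE, WR_RDMA_READ };

struct work_request {
    uint64_t       wr_id;    /* caller-chosen tag, echoed in the completion */
    enum wr_opcode opcode;
};

struct completion {
    uint64_t wr_id;
    int      status;         /* 0 = success in this sketch */
};

/* A work queue and a completion queue, both simple rings. head and tail
 * are free-running counters; tail - head is the number of queued entries. */
struct queue_pair {
    struct work_request wq[QUEUE_DEPTH];
    size_t wq_head, wq_tail;
    struct completion cq[QUEUE_DEPTH];
    size_t cq_head, cq_tail;
};

/* Consumer side: post a work request. Returns -1 if the queue is full. */
static int post_work(struct queue_pair *qp, uint64_t wr_id, enum wr_opcode op)
{
    if (qp->wq_tail - qp->wq_head == QUEUE_DEPTH)
        return -1;
    qp->wq[qp->wq_tail % QUEUE_DEPTH] =
        (struct work_request){ .wr_id = wr_id, .opcode = op };
    qp->wq_tail++;
    return 0;
}

/* Device side (simulated): take one work request and report completion.
 * Returns 1 if a request was processed, 0 if there was nothing to do or
 * the completion queue had no room. */
static int device_process_one(struct queue_pair *qp)
{
    struct work_request wr;

    if (qp->wq_head == qp->wq_tail)
        return 0;
    if (qp->cq_tail - qp->cq_head == QUEUE_DEPTH)
        return 0;
    wr = qp->wq[qp->wq_head % QUEUE_DEPTH];
    qp->wq_head++;
    qp->cq[qp->cq_tail % QUEUE_DEPTH] =
        (struct completion){ .wr_id = wr.wr_id, .status = 0 };
    qp->cq_tail++;
    return 1;
}

/* Consumer side: poll for a completion. Returns 1 and fills *out if one
 * is available, 0 otherwise. */
static int poll_completion(struct queue_pair *qp, struct completion *out)
{
    if (qp->cq_head == qp->cq_tail)
        return 0;
    *out = qp->cq[qp->cq_head % QUEUE_DEPTH];
    qp->cq_head++;
    return 1;
}
```

The point of the model is the decoupling: posting returns immediately, and the result shows up later on a separate queue, which is what distinguishes this interface from a blocking socket call.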
Re: RDMA will be reverted
Roland stated that it has never been the case that we have rejected adding support for a certain class of devices on the kinds of merits being discussed in this thread. And I'm saying that TOE is such a case where we have emphatically done so.

Well, in the past it's seemed more like patches have been rejected not because of a blanket refusal to consider support for certain hardware, but rather because of issues with the patches themselves. E.g. last year when Chelsio submitted some TOE hooks, you wrote the following (http://marc.theaimsgroup.com/?l=linux-netdev&m=112382991506811&w=2):

There is no way you're going to be allowed to call such deep TCP internals from your driver. This would mean that every time we wish to change the data structures and interfaces for TCP socket lookup, your drivers would need to change.

which looks like a very good reason to reject the changes.

So I am not saying iWARP or RDMA is equal to TOE, and if you had actually read this thread you would have understood that.

There's definitely been quite a bit of conflation between the two in this thread, even if you're not responsible...

- R.
Re: RDMA will be reverted
> So perhaps there's a good argument to make that a Linux system with the right hardware could be considered a core device. Likely any place you have such a system it would be dedicated to just moving data as well as possible, and let other systems do the other stuff. You wouldn't want your server farm systems to also be your firewalls.

Why not? While Linux firewall performance is not flawless, its problems (e.g. slow conntrack) seem to be mostly in an area where TOE cannot do much.

> Bottom line - these technologies seem to me to have a place in a well designed network.

I think there is a web page listing why it's bad, but here is a quick summary:

One worry is debugging it all together. Currently we have a single stack to debug, although it's already difficult to control the complexity as it grows more bells and whistles. Just take a look at Cisco IOS release notes to see how hard it is to get it all to work together.

Another reason is that there are general doubts that TOE can keep up with the ever growing performance of CPUs. Even if Linux added it today it would likely be slower again a few months later. That is also a big difference to Cisco hardware. Linux usually runs on fast main CPUs (or if you run it on slow CPUs you normally don't expect the best network performance). And they get faster and faster constantly.

Admittedly 10Gb NICs are still a bit too fast for mainstream systems, but that seems to be mostly a problem outside the CPUs and it looks like the next generation of systems will catch up with enough bandwidth in this area.

Also it tends to accelerate the wrong thing. On a lot of workloads the main problem is keeping a lot of different connections under control, and TOE tends to be slow at keeping connection information synchronized with the host.

That is why the Linux strategy has been to ask for useful stateless offloads instead. Examples of this are checksum offload (long time classic), TSO (TCP segmentation offload), UFO (UDP fragmentation offload), Intel I/OAT (memcpy offload), and RX hashing with MSI-X (not implemented yet, but basically it allows load balancing of incoming streams across CPUs).

Note that all these are more or less stateless offloads.

It is not clear yet what iWARP is. From the meager bits of information about it that have reached netdev so far, it at least sounds like it does RDMA, needs far more state than any of the other offloads we have so far, and likely has the usual TOE scaling issues. It's also likely on the wrong side of Moore's law.

-Andi
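To make "stateless" concrete, here is a minimal user-space sketch of the work that checksum offload moves to the NIC: the RFC 1071 Internet checksum over one packet's bytes. This is not the kernel's implementation, and inet_checksum is a name made up for this illustration. The point is only that the computation needs nothing beyond the packet in hand, with no per-connection state to keep synchronized with the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: RFC 1071 ones' complement checksum over a buffer.
 * Only the bytes of this one packet are needed, no connection state.
 * The 32-bit accumulator is ample for packet-sized buffers. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	assert(data != NULL || len == 0);

	/* Sum the data as big-endian 16-bit words. */
	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}

	/* An odd trailing byte is treated as the high byte of a final word. */
	if (len == 1)
		sum += (uint32_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Connection-oriented offloads such as TOE differ exactly here: the device would also have to track sequence numbers, windows and timers per connection, which is the state Andi's message is worried about.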
[mini-RFT] via-velocity cleanup
Against 2.6.17:

http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.17/via-velocity/

The mii operations now look more familiar. There should be no functional change. The patches do not clash with Jeff's netdev-2.6#upstream.

Please report if I have broken something.

-- 
Ueimor
[PATCH 0/3] Action API fixes
Dave,

Fixes for some rather serious action API bugs. Please apply.

 net/sched/act_api.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)
[PATCH 1/3] [PKT_SCHED]: Fix illegal memory dereferences when dumping actions
The TCA_ACT_KIND attribute is used without checking its availability when dumping actions, therefore leading to a value of 0x4 being dereferenced.

The use of strcmp() in tc_lookup_action_n() isn't safe when fed with a string from an attribute without enforcing proper NUL termination.

Both bugs can be triggered with a malformed netlink message and don't require any privileges.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6.git/net/sched/act_api.c
===================================================================
--- net-2.6.git.orig/net/sched/act_api.c
+++ net-2.6.git/net/sched/act_api.c
@@ -776,7 +776,7 @@ replay:
 	return ret;
 }
 
-static char *
+static struct rtattr *
 find_dump_kind(struct nlmsghdr *n)
 {
 	struct rtattr *tb1, *tb2[TCA_ACT_MAX+1];
@@ -804,7 +804,7 @@ find_dump_kind(struct nlmsghdr *n)
 		return NULL;
 	kind = tb2[TCA_ACT_KIND-1];
 
-	return (char *) RTA_DATA(kind);
+	return kind;
 }
 
 static int
@@ -817,16 +817,15 @@ tc_dump_action(struct sk_buff *skb, stru
 	struct tc_action a;
 	int ret = 0;
 	struct tcamsg *t = (struct tcamsg *) NLMSG_DATA(cb->nlh);
-	char *kind = find_dump_kind(cb->nlh);
+	struct rtattr *kind = find_dump_kind(cb->nlh);
 
 	if (kind == NULL) {
 		printk("tc_dump_action: action bad kind\n");
 		return 0;
 	}
 
-	a_o = tc_lookup_action_n(kind);
+	a_o = tc_lookup_action(kind);
 	if (a_o == NULL) {
-		printk("failed to find %s\n", kind);
 		return 0;
 	}
 
@@ -834,7 +833,7 @@ tc_dump_action(struct sk_buff *skb, stru
 	a.ops = a_o;
 
 	if (a_o->walk == NULL) {
-		printk("tc_dump_action: %s !capable of dumping table\n", kind);
+		printk("tc_dump_action: %s !capable of dumping table\n", a_o->kind);
 		goto rtattr_failure;
 	}
 
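Both bugs described in this patch come from treating an attribute payload as an ordinary C string. The following user-space sketch illustrates the safer pattern under stated assumptions: struct attr and attr_strcmp are hypothetical stand-ins invented here, not the kernel's struct rtattr or its lookup helpers. The comparison is bounded by the attribute's own length, and a missing attribute is rejected before anything is read. (The 0x4 in the description is consistent with RTA_DATA() adding the 4-byte attribute header size to a NULL pointer.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a netlink attribute: a payload and its length.
 * The payload comes from user space and may lack a terminating NUL. */
struct attr {
	const char *data;
	size_t len;
};

/* Returns 0 when the payload equals str including its terminating NUL,
 * nonzero otherwise. Never reads more than a->len bytes of the payload. */
int attr_strcmp(const struct attr *a, const char *str)
{
	size_t need;

	assert(a != NULL && str != NULL);

	need = strlen(str) + 1;		/* bytes of str including the NUL */
	if (a->data == NULL || need > a->len)
		return 1;		/* absent or too short: no match */

	return memcmp(a->data, str, need) != 0;
}
```

With this shape, a payload of exactly the four bytes "gact" and no NUL simply fails to match "gact" instead of letting strcmp() run off the end of the message.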
[PATCH 2/3] [PKT_SCHED]: Return ENOENT if action module is unavailable
Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6.git/net/sched/act_api.c
===================================================================
--- net-2.6.git.orig/net/sched/act_api.c
+++ net-2.6.git/net/sched/act_api.c
@@ -305,6 +305,7 @@ struct tc_action *tcf_action_init_1(stru
 			goto err_mod;
 		}
 #endif
+		*err = -ENOENT;
 		goto err_out;
 	}
 
Re: RDMA will be reverted
On Tue, 2006-07-04 at 22:47 +0200, Andi Kleen wrote:
>> So perhaps there's a good argument to make that a Linux system with the right hardware could be considered a core device. Likely any place you have such a system it would be dedicated to just moving data as well as possible, and let other systems do the other stuff. You wouldn't want your server farm systems to also be your firewalls.
>
> Why not? While Linux firewall performance is not flawless, its problems (e.g. slow conntrack) seem to be mostly in an area where TOE cannot do much.

No doubt you *can* do this, but would you want to? My point wasn't really about performance here, more that systems needing this level of performance (server farm is just an example) will probably be on an 'inside' network with firewalling being done elsewhere (at the access layer, to use the Cisco paradigm). It's just not good design to attach such systems directly to an untrusted network, IMHO. So these systems just don't need netfilter capabilities.

>> Bottom line - these technologies seem to me to have a place in a well designed network.
>
> I think there is a web page listing why it's bad, but here is a quick summary:
>
> One worry is debugging it all together. Currently we have a single stack to debug, although it's already difficult to control the complexity as it grows more bells and whistles. Just take a look at Cisco IOS release notes to see how hard it is to get it all to work together.

No argument there!

> Another reason is that there are general doubts that TOE can keep up with the ever growing performance of CPUs. Even if Linux added it today it would likely be slower again a few months later. That is also a big difference to Cisco hardware. Linux usually runs on fast main CPUs (or if you run it on slow CPUs you normally don't expect the best network performance). And they get faster and faster constantly.
>
> Admittedly 10Gb NICs are still a bit too fast for mainstream systems, but that seems to be mostly a problem outside the CPUs and it looks like the next generation of systems will catch up with enough bandwidth in this area.
>
> Also it tends to accelerate the wrong thing. On a lot of workloads the main problem is keeping a lot of different connections under control, and TOE tends to be slow at keeping connection information synchronized with the host.
>
> That is why the Linux strategy has been to ask for useful stateless offloads instead. Examples of this are checksum offload (long time classic), TSO (TCP segmentation offload), UFO (UDP fragmentation offload), Intel I/OAT (memcpy offload), and RX hashing with MSI-X (not implemented yet, but basically it allows load balancing of incoming streams across CPUs).
>
> Note that all these are more or less stateless offloads.
>
> It is not clear yet what iWARP is. From the meager bits of information about it that have reached netdev so far, it at least sounds like it does RDMA, needs far more state than any of the other offloads we have so far, and likely has the usual TOE scaling issues. It's also likely on the wrong side of Moore's law.
>
> -Andi
Re: [RFC] change netdevice to use struct device instead of struct class_device
On Mon, Jul 03, 2006 at 06:57:47PM -0700, David Miller wrote:
> From: Greg KH [EMAIL PROTECTED]
> Date: Mon, 3 Jul 2006 16:16:10 -0700
>
>> No, not really. According to Documentation/ABI/testing/sysfs-class, all code that uses /sys/class/foo/ needs to be able to handle the fact that those entries might be symlinks and not just directories. Everything that I know of already works properly because the input layer has had symlinks in /sys/class/input for quite some time now. Do you know of any tools that use /sys/class/net/ that cannot handle symlinks there?
>>
>> I've been running this on my boxes for about a week now with no noticeable issues. Renaming interfaces works just fine too.
>
> I do not think this change will cause any problems.

Great, thanks for looking. Do you mind if I keep this in my tree, due to the dependencies on the other driver core changes?

thanks,

greg k-h
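For tool authors wondering what "handle symlinks" means in practice, a minimal sketch follows. It assumes a tool that wants to know whether an entry such as /sys/class/net/eth0 (an example name only) is usable as a directory; is_dir_following_links is a hypothetical helper invented here. The key choice is stat(), which follows symlinks, rather than lstat() or a directory-entry type check, both of which describe the link itself.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Returns 1 if path is a directory or a symlink that resolves to one,
 * 0 otherwise. stat() follows symlinks; lstat() would not. */
int is_dir_following_links(const char *path)
{
	struct stat st;

	assert(path != NULL);

	if (stat(path, &st) != 0)
		return 0;	/* missing entry or dangling link */

	return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

A tool written this way gives the same answer whether the class entries are real directories or symlinks into the device tree.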
Re: RDMA will be reverted
> My point wasn't really about performance here, more that systems needing this level of performance (server farm is just an example) will probably be on an 'inside' network with firewalling being done elsewhere (at the access layer, to use the Cisco paradigm). It's just not good design to attach such systems directly to an untrusted network, IMHO. So these systems just don't need netfilter capabilities.

Don't think of the high end. It is exotic and rare.

Think of the ordinary single Linux box somewhere at a rackspace provider, which represents the majority of Linux boxes around, with a not too skilled admin who mostly uses the default settings of his configuration. For that, running firewalling on the same box makes a lot of sense.

Normally it is not that loaded and it doesn't matter much how it performs, but it might occasionally be slashdotted and then it should still hold up.

BTW basic firewalling is not really that bad as long as you don't have too many rules. Mostly conntrack is painful right now. I'm sure at some point it will be fixed too.

-Andi
Re: RDMA will be reverted
On Wed, 2006-07-05 at 01:01 +0200, Andi Kleen wrote:
>> My point wasn't really about performance here, more that systems needing this level of performance (server farm is just an example) will probably be on an 'inside' network with firewalling being done elsewhere (at the access layer, to use the Cisco paradigm). It's just not good design to attach such systems directly to an untrusted network, IMHO. So these systems just don't need netfilter capabilities.
>
> Don't think of the highend. It is exotic and rare.

Sure. But isn't the high end exactly where these new technologies are intended to fit?

> Think of the ordinary single linux box somewhere at a rackspace provider which represents the majority of Linux boxes around.

How many of those need 10G nics?

> With a not too skilled admin who mostly uses the default settings of his configuration. For that running firewalling on the same box makes a lot of sense.

Yup. I run a few of those. And I run firewalls on them. But they're on 1.5M T1 pipes at best. I probably fit into your 'not too skilled' category, too :)

> Normally it is not that loaded and it doesn't matter much how it performs, but it might be occasionally slashdotted and then it should still hold up.
>
> BTW basic firewalling is not really that bad as long as you don't have too many rules. Mostly conntrack is painful right now. I'm sure at some point it will be fixed too.

Actually, I wasn't aware of any pain with conntrack, it works great for me. But like I said, I don't run any real high speed connections.

We're focusing on netfilter here. Is breaking netfilter really the only issue with this stuff?

I know you mentioned some other concerns (about TOE specifically), they were really scalability things though weren't they - like you're not convinced this really solves any performance issues long term. I'm certainly not qualified to discuss that, hopefully some of the others will weigh in here.

> -Andi
Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL
jamal wrote:
> On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote:
>> Russell Stuart wrote:
>>> [..]
>>> Without seeing your actual proposal it is difficult to judge whether this is a reasonable trade-off or not. Hopefully we will see your code soon. Do you have any idea when?
>>
>> Unfortunately I still haven't got to cleaning them up, so I'm sending them in their preliminary state. It's not much that is missing, but the netem usage of skb->cb needs to be integrated better; I failed to move it to the qdisc_skb_cb so far because of circular includes. But nothing unfixable. I'm mostly interested in whether the current size-tables can express what you need for ATM; I wasn't able to understand the big comment in tc_core.c in your patch.
>
> Looks good from within the range of change within reason of addressed problem. The cb on the qdisc seems only usable for netem, correct?

Yes, it has the same limitations as the current netem cb usage. Really making it usable for all qdiscs would require reserving a few bytes for every level; so far that isn't necessary and I would prefer to just add a time_to_send field for netem. The problem with this is that it currently requires sch_generic.h and pkt_sched.h to include one another, so I did the qdisc_skb_cb() thing to at least get it to compile for now.

> Also while not unreasonable, I wasn't sure how qdisc_enqueue_root() fit in the grand scheme of things for this change (it seemed out of place).

It's there as a spot to do the initial time calculations and store them in the cb. I didn't want to put this in net/core/dev.c.
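As background on why ATM needs special size handling at all, here is a rough sketch of the arithmetic a size table would have to capture. It is not code from either patch set. It assumes AAL5 framing: an 8-byte trailer is appended, the frame is padded to a whole number of 48-byte cell payloads, and each cell occupies 53 bytes on the wire. The encap_overhead parameter is a placeholder for whatever link-layer encapsulation a given ADSL setup adds; its value is deliberately left to the caller, and atm_wire_size is a name invented here.

```c
#include <assert.h>

#define ATM_CELL_SIZE    53u	/* bytes on the wire per cell */
#define ATM_CELL_PAYLOAD 48u	/* payload bytes carried per cell */
#define AAL5_TRAILER      8u	/* trailer appended to every AAL5 frame */

/* Hypothetical helper: bytes actually transmitted for one packet. */
unsigned int atm_wire_size(unsigned int pkt_len, unsigned int encap_overhead)
{
	unsigned int frame, cells;

	/* Keep the arithmetic far away from unsigned overflow. */
	assert(pkt_len <= 65535u && encap_overhead <= 65535u);

	frame = pkt_len + encap_overhead + AAL5_TRAILER;
	cells = (frame + ATM_CELL_PAYLOAD - 1) / ATM_CELL_PAYLOAD; /* round up */

	return cells * ATM_CELL_SIZE;
}
```

The result is a step function of packet length: a 40-byte packet with no extra encapsulation fits one cell (53 bytes), while a 41-byte packet needs two (106 bytes). A scheduler that charges by IP length alone will therefore misjudge the link, which is the problem this thread is trying to solve.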
Re: RDMA will be reverted
>> Think of the ordinary single linux box somewhere at a rackspace provider which represents the majority of Linux boxes around.
>
> How many of those need 10G nics?

Most of them already have gigabit. At some point they will have 10G too. Admittedly the iThingy under discussion here seems to be Infiniband only, which will probably not appear in such a use case.

> We're focusing on netfilter here. Is breaking netfilter really the only issue with this stuff?

Another concern is that it will just not be able to keep up with a high rate of new connections or a high number of them (because the hardware has too limited state).

And then there are the other issues I listed, like subtle TCP bugs (TSO is already a nightmare in this area and it's still not quite right), etc.

> I know you mentioned some other concerns (about TOE specifically), they were really scalability things though weren't they

There was more than just scalability. Reread it.

Anyway, the thread is already getting off topic - I'm not actually that much interested in a generic TOE discussion because the issue is pretty much settled already with broad consensus. You can refer to the netdev archives or the respective web pages if you want more details.

It would need someone who can describe how this new RDMA device avoids all the problems, but so far its advocates don't seem to be interested in doing that and I cannot contribute more.

-Andi
possible dos / wsize affected frozen connection length (was: Re: 2.6.17.1: fails to fully get webpage)
On Fri, Jun 30, 2006 at 08:50:39AM +1000, CaT wrote:
> Another datapoint to this is that I've had my netcat web test running since 8:42pm yesterday. It's 8:37am now. It hasn't progressed in any way. It hasn't quit. It hasn't timed out. It just sits there, hung.
>
> This leads me to consider the possibility of a DoS, either intentional or accidental (think about 2.6.17.x running on a mail server and someone mails/spams from a broken place). I'm just wondering if connections hanging around this long are normal.

The above has now been running for 6 days. netstat is still reporting an established session. netcat has not timed out. It's all just sitting there doing nothing.

-- 
To the extent that we overreact, we proffer the terrorists the greatest tribute. - High Court Judge Michael Kirby
Re: [PATCH 1/3] [PKT_SCHED]: Fix illegal memory dereferences when dumping actions
On Wed, 2006-05-07 at 00:00 +0200, Thomas Graf wrote:
> plain text document attachment (act_fix_dump_null_deref)
> The TCA_ACT_KIND attribute is used without checking its availability when dumping actions, therefore leading to a value of 0x4 being dereferenced.
>
> The use of strcmp() in tc_lookup_action_n() isn't safe when fed with a string from an attribute without enforcing proper NUL termination.
>
> Both bugs can be triggered with a malformed netlink message and don't require any privileges.
>
> Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Good catch.

Acked-by: Jamal Hadi Salim [EMAIL PROTECTED]

cheers,
jamal
Re: [PATCH 2/3] [PKT_SCHED]: Return ENOENT if action module is unavailable
On Wed, 2006-05-07 at 00:00 +0200, Thomas Graf wrote:
> plain text document attachment (act_fix_init_ret_val)
> Signed-off-by: Thomas Graf [EMAIL PROTECTED]
>
> Index: net-2.6.git/net/sched/act_api.c
> ===================================================================
> --- net-2.6.git.orig/net/sched/act_api.c
> +++ net-2.6.git/net/sched/act_api.c
> @@ -305,6 +305,7 @@ struct tc_action *tcf_action_init_1(stru
>  			goto err_mod;
>  		}
>  #endif
> +		*err = -ENOENT;
>  		goto err_out;
>  	}

Ok, this falls under the LinuxWay(tm). Quick inspection of the qdisc code reveals the same bug. The cls side seems fine, but I didn't spend more than 30 seconds on it. So why don't you fix the qdisc one while you are at it?

Acked-by: Jamal Hadi Salim [EMAIL PROTECTED]
Re: [PATCH 3/3] [PKT_SCHED]: Fix error handling while dumping actions
I need to stare at this one for longer than 1 minute and I don't have time right now; it does look strange (I am unsure what my thoughts were at that point with -err, or maybe that was a change made by someone else). I don't have time until tomorrow, but I would think the better fix would be to change return -err to return -1?

cheers,
jamal

On Wed, 2006-05-07 at 00:00 +0200, Thomas Graf wrote:
> plain text document attachment (act_fix_dump_err_handling)
> "return -err" and blindly inheriting the error code in the netlink failure exception handler cause error codes to be returned as positive values, so they are ignored by the caller. This may lead to sending out incomplete netlink messages.
>
> Signed-off-by: Thomas Graf [EMAIL PROTECTED]
>
> Index: net-2.6.git/net/sched/act_api.c
> ===================================================================
> --- net-2.6.git.orig/net/sched/act_api.c
> +++ net-2.6.git/net/sched/act_api.c
> @@ -250,15 +250,17 @@ tcf_action_dump(struct sk_buff *skb, str
>  		RTA_PUT(skb, a->order, 0, NULL);
>  		err = tcf_action_dump_1(skb, a, bind, ref);
>  		if (err < 0)
> -			goto rtattr_failure;
> +			goto errout;
>  		r->rta_len = skb->tail - (u8*)r;
>  	}
>  
>  	return 0;
>  
>  rtattr_failure:
> +	err = -EINVAL;
> +errout:
>  	skb_trim(skb, b - skb->data);
> -	return -err;
> +	return err;
>  }
>  
>  struct tc_action *tcf_action_init_1(struct rtattr *rta, struct rtattr *est,
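The sign problem discussed in this message is easy to reproduce outside the kernel. The sketch below uses made-up function names (dump_one, dump_buggy, dump_fixed, caller_sees_error) and is only meant to show the mechanism: an inner step returns a negative errno, the old pattern negates it again, and a caller that tests for "less than zero" no longer notices the failure.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical inner step that reports failure as a negative errno. */
int dump_one(int fail)
{
	return fail ? -EINVAL : 0;
}

/* Old pattern: negating an already negative code yields a positive value. */
int dump_buggy(int fail)
{
	int err = dump_one(fail);

	if (err < 0)
		return -err;	/* becomes +EINVAL */
	return 0;
}

/* Fixed pattern: pass the negative code through unchanged. */
int dump_fixed(int fail)
{
	int err = dump_one(fail);

	assert(err <= 0);	/* the inner step never returns positive values */
	if (err < 0)
		return err;
	return 0;
}

/* A caller that treats only negative return values as errors. */
int caller_sees_error(int ret)
{
	return ret < 0;
}
```

Here caller_sees_error(dump_buggy(1)) is false even though the dump failed, while caller_sees_error(dump_fixed(1)) is true. Returning a fixed -1, as suggested in the reply, would also be seen by such a caller, but it would lose the specific error code.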
Re: [PATCH 1/3] [PKT_SCHED]: Fix illegal memory dereferences when dumping actions
On Wed, 2006-05-07 at 01:42 +0200, Patrick McHardy wrote:
> Thomas Graf wrote:
>>  	if (a_o->walk == NULL) {
>> -		printk("tc_dump_action: %s !capable of dumping table\n", kind);
>> +		printk("tc_dump_action: %s !capable of dumping table\n", a_o->kind);
>>  		goto rtattr_failure;
>>  	}
>
> Can't we just get rid of these printks? This seems like a good opportunity.

Perhaps convert to DPRINTKs instead.

cheers,
jamal