date:20060704

Hi,

The patch below removes the use of UTS_RELEASE from the tiacx driver; there
is absolutely no reason for a driver to print the kernel version or use
the UTS_RELEASE field; in addition, this field changes all the time, so
this causes spurious rebuilds.

Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

---
 drivers/net/wireless/tiacx/pci.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.17-mm4/drivers/net/wireless/tiacx/pci.c
===
--- linux-2.6.17-mm4.orig/drivers/net/wireless/tiacx/pci.c
+++ linux-2.6.17-mm4/drivers/net/wireless/tiacx/pci.c
@@ -1705,8 +1705,8 @@ acxpci_e_probe(struct pci_dev *pdev, con
 	/* acx_sem_unlock(adev); */
 
 	printk("acx "ACX_RELEASE": net device %s, driver compiled "
-		"against wireless extensions %d and Linux %s\n",
-		ndev->name, WIRELESS_EXT, UTS_RELEASE);
+		"against wireless extensions %d\n",
+		ndev->name, WIRELESS_EXT);
 
 #if CMD_DISCOVERY
 	great_inquisitor(adev);


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/7] net_device list cleanup: core

2006-07-04 Thread Christoph Hellwig

On Tue, Jul 04, 2006 at 11:24:05AM +0400, Andrey Savochkin wrote:
  Yes, it's a little more work as you need to audit all drivers to see what
  they are doing and find suitable abstractions but it's a must have that
  should have been done a lot earlier.
 
 Hiding dev_base_head can be done by converting first_netdev/next_netdev into
 functions and implementing for_each_netdev loop through them.
 
 Or are you talking about abstractions like functions
 for_each_netdev/find_netdev with callbacks?

A for_each_netdev with a callback makes sense and gives a cleaner
abstraction, yes.  I don't think you should need a callback for the lookup
structure.

 Do you think that hiding the list internals is worth the additional
 complexity and substantial increase of the patch size?

Yes, absolutely.  We've converted scsi hosts and devices from a model
where drivers could directly access the list to strict iterators in the
2.5 series.  It's quite a lot of work as you have to understand what
the drivers actually do (and to at least 50% they were doing something
really stupid) and convert them to the right abstractions.
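
For illustration only, here is a minimal userspace sketch of the callback
style of iteration discussed above. It is not the kernel's code: struct
netdev, netdev_register() and for_each_netdev_cb() are hypothetical names
invented for this example, and a real version would also need to hold the
appropriate lock while walking the list. The point is that the list head
stays private and callers only ever see the iterator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a network device; not the kernel structure. */
struct netdev {
	const char *name;
	struct netdev *next;
};

/* The list head is private to this file; callers only get the iterator. */
static struct netdev *dev_list_head;

/* Add a device at the front of the private list. */
static void netdev_register(struct netdev *dev)
{
	dev->next = dev_list_head;
	dev_list_head = dev;
}

/*
 * Call fn(dev, arg) for every registered device.  Iteration stops early
 * if the callback returns nonzero, and that value is passed back.
 */
static int for_each_netdev_cb(int (*fn)(struct netdev *, void *), void *arg)
{
	struct netdev *dev;
	int ret;

	for (dev = dev_list_head; dev != NULL; dev = dev->next) {
		ret = fn(dev, arg);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count the devices whose name starts with a prefix. */
struct count_ctx {
	const char *prefix;
	int count;
};

static int count_prefix(struct netdev *dev, void *arg)
{
	struct count_ctx *ctx = arg;

	if (strncmp(dev->name, ctx->prefix, strlen(ctx->prefix)) == 0)
		ctx->count++;
	return 0;
}
```

A driver-like caller would fill in a struct count_ctx and pass count_prefix
to for_each_netdev_cb() instead of touching the list pointers itself.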

Re: tiacx - don't use UTS_RELEASE

On Tue, 2006-07-04 at 02:25 -0700, Andrew Morton wrote:
 On Tue, 04 Jul 2006 11:07:59 +0200
 Arjan van de Ven [EMAIL PROTECTED] wrote:
 
  patch below removes the use of UTS_RELEASE from the tiacx driver; there
  is absolutely no reason for a driver to print the kernel version or use
  the UTS_RELEASE field; in addition this field changes all the time so
  this causes spurious rebuilds..
 
 http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch
  did it too.
 
 UTS_RELEASE doesn't change much.  It's 2.6.17.

no but the header that it's in changes all the time iirc, at least it
used to (one of those kbuild regenerated files)



Re: tiacx - don't use UTS_RELEASE

2006-07-04 Thread Andrew Morton

On Tue, 04 Jul 2006 11:07:59 +0200
Arjan van de Ven [EMAIL PROTECTED] wrote:

 patch below removes the use of UTS_RELEASE from the tiacx driver; there
 is absolutely no reason for a driver to print the kernel version or use
 the UTS_RELEASE field; in addition this field changes all the time so
 this causes spurious rebuilds..

http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch
 did it too.

UTS_RELEASE doesn't change much.  It's 2.6.17.

Re: tiacx - don't use UTS_RELEASE

2006-07-04 Thread Sam Ravnborg

On Tue, Jul 04, 2006 at 11:27:27AM +0200, Arjan van de Ven wrote:
 On Tue, 2006-07-04 at 02:25 -0700, Andrew Morton wrote:
  On Tue, 04 Jul 2006 11:07:59 +0200
  Arjan van de Ven [EMAIL PROTECTED] wrote:
  
   patch below removes the use of UTS_RELEASE from the tiacx driver; there
   is absolutely no reason for a driver to print the kernel version or use
   the UTS_RELEASE field; in addition this field changes all the time so
   this causes spurious rebuilds..
  
  http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch
   did it too.
  
  UTS_RELEASE doesn't change much.  It's 2.6.17.
 
 no but the header that it's in changes all the time iirc, at least it
 used to (one of those kbuild regenerated files)
Yesterday I pushed a change that split include/linux/version.h in two
parts.
Now include/linux/version.h only contains:
#define LINUX_VERSION_CODE 132625
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

And the file will only be regenerated when the file content actually
changes.

And UTS_RELEASE has moved to include/linux/utsrelease.h which contains:
#define UTS_RELEASE "2.6.17-g05668381-dirty"

This is the file that will change often - at least for git users.
But with the patch only users of UTS_RELEASE will be rebuilt, which is
far fewer than the users of version.h.

Sam
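
As a worked check of the numbers quoted above, the following small sketch
shows how KERNEL_VERSION packs a version triple and why 2.6.17 comes out as
132625. The macro and the value are taken from the message; the helper
running_at_least() is a hypothetical name added only for this example.

```c
#include <assert.h>

/* Pack major/minor/patch into one integer, as in version.h. */
#define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))

/* 2.6.17: (2 << 16) + (6 << 8) + 17 = 131072 + 1536 + 17 = 132625 */
#define LINUX_VERSION_CODE 132625

/* Hypothetical helper: is the built-for kernel at least a.b.c? */
static int running_at_least(int a, int b, int c)
{
	return LINUX_VERSION_CODE >= KERNEL_VERSION(a, b, c);
}
```

Because the packed value only changes when the version triple changes, a
header holding just these two defines rarely needs regenerating, unlike a
header carrying a string such as 2.6.17-g05668381-dirty.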

Re: tiacx - don't use UTS_RELEASE

On Tue, 2006-07-04 at 11:51 +0200, Sam Ravnborg wrote:
 On Tue, Jul 04, 2006 at 11:27:27AM +0200, Arjan van de Ven wrote:
  On Tue, 2006-07-04 at 02:25 -0700, Andrew Morton wrote:
   On Tue, 04 Jul 2006 11:07:59 +0200
   Arjan van de Ven [EMAIL PROTECTED] wrote:
   
    patch below removes the use of UTS_RELEASE from the tiacx driver; there
    is absolutely no reason for a driver to print the kernel version or use
    the UTS_RELEASE field; in addition this field changes all the time so
    this causes spurious rebuilds..
   
   http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-04-usb/usb-storage-uname-in-pr-sc-unneeded-message.patch
    did it too.
   
   UTS_RELEASE doesn't change much.  It's 2.6.17.
  
  no but the header that it's in changes all the time iirc, at least it
  used to (one of those kbuild regenerated files)
 Yesterday I pushed a change that split include/linux/version.h in two
 parts.
 Now include/linux/version.h only contains:
 #define LINUX_VERSION_CODE 132625
 #define KERNEL_VERSION(a,b,c) (((a) << 16) + ((b) << 8) + (c))
 
 And the file will only be regenerated when the file content actually
 changes.
 
 And UTS_RELEASE has moved to include/linux/utsrelease.h which contains:
 #define UTS_RELEASE "2.6.17-g05668381-dirty"
 
 This is the file that will change often - at least for git users.
 But with the patch only users of UTS_RELEASE will be rebuilt, which is
 far fewer than the users of version.h.

which is a good thing, and we should keep users of utsrelease.h to a
minimum... hence my patch to eliminate a user ;) (which used it to do a
printk.. but if you use a kernel the version is already in dmesg, no
need to printk it again :)



Re: [VLAN]: translate IF_OPER_DORMANT to netif_dormant_on()

2006-07-04 Thread Patrick McHardy

 commit ddd7bf9fe4e59afc0a041378f82b6e1aa88f714b
 tree 98764adba1bae7d128d2e7db7d9fc1e2fe5826d8
 parent b00055aacdb172c05067612278ba27265fcd05ce
 author Stefan Rompf [EMAIL PROTECTED] Tue, 21 Mar 2006 09:11:41 -0800
 committer David S. Miller [EMAIL PROTECTED] Tue, 21 Mar 2006 09:11:41 -0800
 
 [VLAN]: translate IF_OPER_DORMANT to netif_dormant_on()

 diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
 index fa76220..3948949 100644
 --- a/net/8021q/vlan.c
 +++ b/net/8021q/vlan.c
 @@ -69,7 +69,7 @@ static struct packet_type vlan_packet_ty
  
  /* Bits of netdev state that are propagated from real device to virtual */
  #define VLAN_LINK_STATE_MASK \
 - ((1<<__LINK_STATE_PRESENT)|(1<<__LINK_STATE_NOCARRIER))
 + ((1<<__LINK_STATE_PRESENT)|(1<<__LINK_STATE_NOCARRIER)|(1<<__LINK_STATE_DORMANT))
  
  /* End of global variables definitions. */
  
 @@ -450,7 +470,7 @@ static struct net_device *register_vlan_
   new_dev->flags = real_dev->flags;
   new_dev->flags &= ~IFF_UP;
  
 - new_dev->state = real_dev->state & VLAN_LINK_STATE_MASK;
 + new_dev->state = real_dev->state & ~(1<<__LINK_STATE_START);
  
   /* need 4 bytes for extra VLAN header info,
* hope the underlying device can handle it.

This introduced a regression by propagating the __LINK_STATE_XOFF flag:
when the queue of the underlying device is stopped, it will be stopped
for the VLAN device too and never be woken up. Since you changed
VLAN_LINK_STATE_MASK, I assume the intention was to just add
__LINK_STATE_DORMANT to the propagated flags and keep using it here?
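
To make the difference concrete, here is a small standalone sketch of the
two ways of copying state bits. The bit positions in the enum are
illustrative assumptions, not the kernel's actual values, and the two
function names are invented for this example. Masking with a whitelist
copies only the named bits; masking out a single bit copies everything
else, including XOFF.

```c
#include <assert.h>

/* Illustrative bit positions only; the kernel's enum values may differ. */
enum {
	LINK_STATE_XOFF,
	LINK_STATE_START,
	LINK_STATE_PRESENT,
	LINK_STATE_NOCARRIER,
	LINK_STATE_DORMANT
};

/* Bits meant to be propagated from the real device to the VLAN device. */
#define VLAN_LINK_STATE_MASK \
	((1UL << LINK_STATE_PRESENT) | (1UL << LINK_STATE_NOCARRIER) | \
	 (1UL << LINK_STATE_DORMANT))

/* Whitelist: copy only the bits named in the mask. */
static unsigned long vlan_state_whitelist(unsigned long real_state)
{
	return real_state & VLAN_LINK_STATE_MASK;
}

/* As in the quoted hunk: copy everything except START (XOFF leaks). */
static unsigned long vlan_state_all_but_start(unsigned long real_state)
{
	return real_state & ~(1UL << LINK_STATE_START);
}
```

With a real device whose queue is stopped (XOFF set), the second variant
hands the XOFF bit to the new device, which is the regression described
above; the first variant does not.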


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-07-04 Thread Jesper Dangaard Brouer



On Mon, 26 Jun 2006, Andi Kleen wrote:

  I encountered the same problem on a dual core opteron equipped with a
  broadcom NIC (tg3) under 2.4. It could receive 1 Mpps when using TSC
  as the clock source, but the time jumped back and forth, so I changed
  it to 'notsc', then the performance dropped dramatically to around the
  same value as above with one CPU saturated. I suspect that the clock
  precision is needed by the tg3 driver to correctly decide to switch to
  polling mode, but unfortunately, the performance drop rendered the
  solution so much unusable that I finally decided to use it only in
  uniprocessor with TSC enabled.
 
 2.6 is more clever at this than 2.4. In particular it does the timestamp
 for each packet only when actually needed, which is relatively rare.
 
 Old experiences do not always apply to new kernels.

Note that I experienced this problem on 2.6.

Actually the change happens between kernel version 2.6.15 and 2.6.16. And 
is a result of Andi's changes to arch/x86_64/Kconfig and 
drivers/acpi/Kconfig, which allows/activates the use of the timer on 
x86_64.


Cheers,
  Jesper Brouer

--
---
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
---

Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

On Tuesday 04 July 2006 13:41, Jesper Dangaard Brouer wrote:
 
 On Mon, 26 Jun 2006, Andi Kleen wrote:
 
  I encountered the same problem on a dual core opteron equipped with a
  broadcom NIC (tg3) under 2.4. It could receive 1 Mpps when using TSC
  as the clock source, but the time jumped back and forth, so I changed
  it to 'notsc', then the performance dropped dramatically to around the
  same value as above with one CPU saturated. I suspect that the clock
  precision is needed by the tg3 driver to correctly decide to switch to
  polling mode, but unfortunately, the performance drop rendered the
  solution so much unusable that I finally decided to use it only in
  uniprocessor with TSC enabled.
 
  2.6 is more clever at this than 2.4. In particular it does the timestamp
  for each packet only when actually needed, which is relatively rare.
 
  Old experiences do not always apply to new kernels.
 
 Note that I experienced this problem on 2.6.
 
 Actually the change happens between kernel version 2.6.15 and 2.6.16.

The timestamp optimizations are older. Don't remember the exact release,
but earlier 2.6.

 And  
 is a result of Andi's changes to arch/x86_64/Kconfig and 
 drivers/acpi/Kconfig, which allows/activates the use of the timer on 
 x86_64.

Not sure what you mean here?

2.6.18 will likely be more aggressive at using the TSC on i386 on
Intel systems where possible, but x86-64 did this already for a long time. 
When x86-64 uses non TSC then it's because using the TSC is not safe.

-Andi

Re: strict isolation of net interfaces

2006-07-04 Thread Daniel Lezcano


Andrey Savochkin wrote:


 I still can't completely understand your direction of thoughts.
 Could you elaborate on IP address assignment in your diagram, please?  For
 example, guest0 wants 127.0.0.1 and 192.168.0.1 addresses on its lo
 interface, and 10.1.1.1 on its eth0 interface.
 Does this diagram assume any local IP addresses on v* interfaces in the
 host?
 
 And the second question.
 Are vlo0, veth0, etc. devices supposed to have hard_xmit routines?



Andrey,

Some people are interested in full network isolation/virtualization,
like you did with the layer 2 isolation, and some other people are
interested in a light network isolation done at layer 3. The latter is
intended to implement application containers, aka lightweight containers.

In the case of layer 3 isolation, the network interface is not totally
isolated, and the debate here is to find an intuitive way to manage the
network devices.

IMHO, all the discussion we had convinced me of the need to be able to
choose between layer 2 and layer 3 isolation.

If it is OK with you, we can collaborate to merge the two solutions into
one. I will focus on layer 3 isolation and you on layer 2.


Regards

  - Daniel

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
 On Mon, 03 Jul 2006 20:54:37 -0400
 Shailabh Nagar [EMAIL PROTECTED] wrote:
 
   What happens when a listener exits without doing deregistration
   (or if the listener attempts to register another cpumask while a current
   registration is still active).
  
  ( Jamal, your thoughts on this problem would be appreciated)
  
  Problem is that we have a listener task which has registered with 
  taskstats and caused
  its pid to be stored in various per-cpu lists of listeners. Later, when 
  some other task exits on a given cpu, its exit data is sent using 
  genlmsg_unicast on each pid present on that cpu's list.
  
  If the listener exits without doing a deregister, its pid continues to 
  be kept around, obviously not a good thing. So we need some way of 
  detecting the situation (task is no longer listening on
  these cpus events) that is efficient.
 
 Also need to address the case where the listener has closed off his file
 descriptor but continues to run.
 
 So hooking into listener's exit() isn't appropriate - the teardown is
 associated with the lifetime of the fd, not of the process.  If we do that,
 exit() gets handled for free.  

If you are always going to send unicast messages, then -ECONNREFUSED
will tell you the listener has closed their fd - this doesn't mean it
has exited. Besides that, one process could open several sockets. I know
that would not be the app you would write - but it doesn't stop other
people from doing it.
I think I may not follow what you are doing - for some reason I thought
you may have many listeners in user space and these messages get
multicast to them?
Does the user space program somehow communicate its pid to the kernel?

cheers,
jamal


Re: [IPROUTE]: Introduce tc monitor

On Mon, 2006-03-07 at 12:13 +0200, Patrick McHardy wrote:
 Speaking of actions, do you have any plans to
 add help-texts? Currently the output is very confusing, whenever
 I use them I need to google for examples.
 

Thanks for reminding me. There are examples in the doc/ directory of
iproute2, but they may be insufficient.
In any case, I won't have time today or the rest of the week but will get
some patch out after that.

[Actually, I have about half a day off but I want to spend time
reviewing the qdisc_is_running thing in a test environment (it takes me
at least 2 hours to steal hardware and set it up).]

cheers,
jamal


Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-07-04 Thread Patrick McHardy

Russell Stuart wrote:
 On 26/06/2006 9:10 PM, Patrick McHardy wrote:
 
   5.  We still did have to modify the kernel for ATM.  That was
   because of its rather unusual characteristics.  However,
   if you look at the size of modifications made to the kernel
   versus the size made to the user space tool, (37 lines
   versus 303 lines,) the bulk of the work was done in user
   space.
  
  I'm sorry, but arguing that a limited special case solution is
  better because it needs slightly less code is just not reasonable.
 
 
 Without seeing your actual proposal it is difficult to
 judge whether this is a reasonable trade-off or not.
 Hopefully we will see your code soon.  Do you have any
 idea when?

Unfortunately I still haven't gotten around to cleaning them up, so I'm
sending them in their preliminary state. Not much is missing, but
the netem usage of skb->cb needs to be integrated better; I failed
to move it to the qdisc_skb_cb so far because of circular includes.
But nothing unfixable. I'm mostly interested in whether the current
size-tables can express what you need for ATM; I wasn't able to
understand the big comment in tc_core.c in your patch.

[NET_SCHED]: Add accessor function for packet length for qdiscs

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 2a6508576111d82246ee018edbcc4b0f0d18acad
tree 8be27ab6040ea90ed11728763e5b8fcf9e221b67
parent 31304c909e6945b005af62cd55a582e9c010a0b4
author Patrick McHardy [EMAIL PROTECTED] Tue, 04 Jul 2006 15:03:01 +0200
committer Patrick McHardy [EMAIL PROTECTED] Tue, 04 Jul 2006 15:03:01 +0200

 include/net/sch_generic.h |9 +++--
 net/sched/sch_atm.c   |4 ++--
 net/sched/sch_cbq.c   |   12 ++--
 net/sched/sch_dsmark.c|2 +-
 net/sched/sch_fifo.c  |2 +-
 net/sched/sch_gred.c  |   12 ++--
 net/sched/sch_hfsc.c  |8 
 net/sched/sch_htb.c   |8 
 net/sched/sch_netem.c |6 +++---
 net/sched/sch_prio.c  |2 +-
 net/sched/sch_red.c   |2 +-
 net/sched/sch_sfq.c   |   14 +++---
 net/sched/sch_tbf.c   |6 +++---
 net/sched/sch_teql.c  |4 ++--
 14 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index b0e9108..75d7a55 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -184,12 +184,17 @@ tcf_destroy(struct tcf_proto *tp)
 	kfree(tp);
 }
 
+static inline unsigned int qdisc_tx_len(struct sk_buff *skb)
+{
+	return skb->len;
+}
+
 static inline int __qdisc_enqueue_tail(struct sk_buff *skb, struct Qdisc *sch,
 				       struct sk_buff_head *list)
 {
 	__skb_queue_tail(list, skb);
-	sch->qstats.backlog += skb->len;
-	sch->bstats.bytes += skb->len;
+	sch->qstats.backlog += qdisc_tx_len(skb);
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
 
 	return NET_XMIT_SUCCESS;
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index dbf44da..4df305e 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -453,9 +453,9 @@ #endif
 		if (flow) flow->qstats.drops++;
 		return ret;
 	}
-	sch->bstats.bytes += skb->len;
+	sch->bstats.bytes += qdisc_tx_len(skb);
 	sch->bstats.packets++;
-	flow->bstats.bytes += skb->len;
+	flow->bstats.bytes += qdisc_tx_len(skb);
 	flow->bstats.packets++;
 	/*
 	 * Okay, this may seem weird. We pretend we've dropped the packet if
diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 80b7f6a..5d705e2 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -404,7 +404,7 @@ static int
 cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
 	struct cbq_sched_data *q = qdisc_priv(sch);
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	int ret;
 	struct cbq_class *cl = cbq_classify(skb, sch, &ret);
 
@@ -688,7 +688,7 @@ #ifdef CONFIG_NET_CLS_POLICE
 
 static int cbq_reshape_fail(struct sk_buff *skb, struct Qdisc *child)
 {
-	int len = skb->len;
+	int len = qdisc_tx_len(skb);
 	struct Qdisc *sch = child->__parent;
 	struct cbq_sched_data *q = qdisc_priv(sch);
 	struct cbq_class *cl = q->rx_class;
@@ -915,7 +915,7 @@ cbq_dequeue_prio(struct Qdisc *sch, int 
 			if (skb == NULL)
 				goto skip_class;
 
-			cl->deficit -= skb->len;
+			cl->deficit -= qdisc_tx_len(skb);
 			q->tx_class = cl;
 			q->tx_borrowed = borrow;
 			if (borrow != cl) {
@@ -923,11 +923,11 @@ #ifndef CBQ_XSTATS_BORROWS_BYTES
 				borrow->xstats.borrows++;
 				cl->xstats.borrows++;
 #else
-				borrow->xstats.borrows += skb->len;
-				cl->xstats.borrows += skb->len;
+
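
(The patch is cut off here in the archive.) As background for why a single
length accessor matters in this thread, here is a rough sketch of how the
on-the-wire size of a packet over ATM/AAL5 can differ from skb->len. It
assumes plain AAL5 framing only (8-byte trailer, payload padded to whole
48-byte cells, 53 bytes per cell on the wire) and ignores any extra
encapsulation headers; atm_wire_len() is a hypothetical helper, not part of
the patch above.

```c
#include <assert.h>

#define ATM_CELL_SIZE    53	/* bytes on the wire per cell */
#define ATM_CELL_PAYLOAD 48	/* payload bytes carried per cell */
#define AAL5_TRAILER     8	/* AAL5 trailer appended before padding */

/* Bytes on the wire for a packet of 'len' bytes carried over AAL5. */
static unsigned int atm_wire_len(unsigned int len)
{
	/* Round (len + trailer) up to a whole number of cell payloads. */
	unsigned int cells = (len + AAL5_TRAILER + ATM_CELL_PAYLOAD - 1)
			     / ATM_CELL_PAYLOAD;

	return cells * ATM_CELL_SIZE;
}
```

Under these assumptions a 40-byte packet fits in one cell, while a 41-byte
packet needs two, so the wire cost is a step function of the packet length.
If every qdisc reads the length through one accessor, such a mapping could
in principle be applied in a single place.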

Re: strict isolation of net interfaces

2006-07-04 Thread Daniel Lezcano


Sam Vilain wrote:

 Daniel Lezcano wrote:
 
  If it is ok for you, we can collaborate to merge the two solutions in
  one. I will focus on layer 3 isolation and you on the layer 2.
 
 So, you're writing a LSM module or adapting the BSD Jail LSM, right? :)
 
 Sam.


No. I am adapting a prototype of network application container we did.

  -- Daniel

Re: [patch 1/7] net_device list cleanup: core

2006-07-04 Thread Andrey Savochkin

On Tue, Jul 04, 2006 at 10:10:03AM +0100, Christoph Hellwig wrote:
 On Tue, Jul 04, 2006 at 11:24:05AM +0400, Andrey Savochkin wrote:
   Yes, it's a little more work as you need to audit all drivers to see what
   they are doing and find suitable abstractions but it's a must have that
   should have been done a lot earlier.
  
  Hiding dev_base_head can be done by converting first_netdev/next_netdev into
  functions and implementing for_each_netdev loop through them.
  
  Or are you talking about abstractions like functions
  for_each_netdev/find_netdev with callbacks?
 
 A for_each_netdev with a callback makes sense and gives a cleaner
 abstraction, yes.  I don't think you should need a callback for the lookup
 structure.

Different modules want different kinds of lookup.
So, I'm thinking about something like ilookup5.

 
  Do you think that hiding the list internals is worth the additional
  complexity and substantial increase of the patch size?
 
 Yes, absolutely.  We've converted scsi hosts and devices from a model
 where drivers could directly access the list to strict iterators in the
 2.5 series.  It's quite a lot of work as you have to understand what
 the drivers actually do (and to at least 50% they were doing something
 really stupid) and convert them to the right abstractions.

The next question: would people agree to review a patch doing this for
net_devices? :)

Andrey

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Shailabh Nagar


jamal wrote:

 On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:
  On Mon, 03 Jul 2006 20:54:37 -0400
  Shailabh Nagar [EMAIL PROTECTED] wrote:
  
    What happens when a listener exits without doing deregistration
    (or if the listener attempts to register another cpumask while a current
    registration is still active).
   
   ( Jamal, your thoughts on this problem would be appreciated)
   
   Problem is that we have a listener task which has registered with
   taskstats and caused
   its pid to be stored in various per-cpu lists of listeners. Later, when
   some other task exits on a given cpu, its exit data is sent using
   genlmsg_unicast on each pid present on that cpu's list.
   
   If the listener exits without doing a deregister, its pid continues to
   be kept around, obviously not a good thing. So we need some way of
   detecting the situation (task is no longer listening on
   these cpus events) that is efficient.
  
  Also need to address the case where the listener has closed off his file
  descriptor but continues to run.
  
  So hooking into listener's exit() isn't appropriate - the teardown is
  associated with the lifetime of the fd, not of the process.  If we do that,
  exit() gets handled for free.
 
 If you are always going to send unicast messages, then  -ECONNREFUSED
 will tell you the listener has closed their fd - this doesn't mean it
 has exited.


That's good. So we have at least one way of detecting the closed fd without
deregistering within taskstats itself.


 Besides that one process could open several sockets. I know
 that would not be the app you would write - but it doesnt stop other
 people from doing it.


As far as API is concerned, even a taskstats listener is not being
prevented from opening multiple sockets. As Andrew also pointed out,
everything needs to be done per-socket.


 I think i may not follow what you are doing - for some reason i thought
 you may have many listeners in user space and these messages get
 multicast to them?


That was the design earlier. In the past week, the design has changed to
one where there are still many listeners in user space but messages
get unicast to each of them. Earlier listeners would get messages generated
on task exit from every cpu, now they get it only from cpus for which
they have explicitly registered interest (via a cpumask passed in through
another genetlink command).


 Does the user space program somehow communicate its pid to the kernel?


Yes. When the listener registers interest in a set of cpus, as described
above, its pid (genl_info->pid) is stored in the per-cpu list of
listeners for those cpus. When a task exits on one of those cpus, the
exit data is only sent via genlmsg_unicast to those pids
(really, nl_pids) that are on that cpu's listener list.

Now that I think more about it, netlink is really maintaining a pidhash
of nl_pids, not process pids, right? So if one userapp were to open
multiple sockets using the NETLINK_GENERIC protocol (regardless of how many
of those are for taskstats), each of them would have to use a
different nl_pid. Hence, it would be valid for the taskstats layer to use
netlink_lookup() at any time to see if the corresponding socket was
closed?


--Shailabh
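
As a rough illustration of the -ECONNREFUSED idea discussed above, the
following userspace sketch walks one cpu's listener list, sends to each
registered id, and drops any listener whose send is refused. Everything
here is hypothetical: struct cpu_listeners, send_fn, send_cpu_listeners()
and the stub fake_send() are invented for the example and are not the
taskstats or netlink API; a kernel version would also need locking around
the list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_LISTENERS 8

/* Listener ids registered for one cpu (stand-ins for netlink pids). */
struct cpu_listeners {
	unsigned int pid[MAX_LISTENERS];
	size_t count;
};

/* Hypothetical unicast hook: returns 0 on success or a negative errno. */
typedef int (*send_fn)(unsigned int pid, const void *data, size_t len);

/*
 * Send exit data to every listener on this cpu.  A listener whose send
 * fails with -ECONNREFUSED is treated as having closed its socket and is
 * dropped from the list.  Returns the number of successful sends.
 */
static size_t send_cpu_listeners(struct cpu_listeners *l, send_fn send,
				 const void *data, size_t len)
{
	size_t i = 0, sent = 0;

	while (i < l->count) {
		int ret = send(l->pid[i], data, len);

		if (ret == -ECONNREFUSED) {
			/* Swap-remove: move the last entry here, recheck slot i. */
			l->pid[i] = l->pid[--l->count];
			continue;
		}
		if (ret == 0)
			sent++;
		i++;
	}
	return sent;
}

/* Stub sender for the example: pretends pid 42 has closed its socket. */
static int fake_send(unsigned int pid, const void *data, size_t len)
{
	(void)data;
	(void)len;
	return pid == 42 ? -ECONNREFUSED : 0;
}
```

This cleans up lazily, at the next send, rather than at the moment the
socket is closed; whether that is acceptable is exactly the design question
raised in the thread.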





Re: possible recursive locking in ATM layer


From: Arjan van de Ven [EMAIL PROTECTED]

 Linux version 2.6.17-git22 ([EMAIL PROTECTED]) (gcc version 4.0.3 (Ubuntu 
 4.0.3-1ubuntu5)) #20 PREEMPT Tue Jul 4 10:35:04 CEST 2006
 
 [ 2381.598609] =
 [ 2381.619314] [ INFO: possible recursive locking detected ]
 [ 2381.635497] -
 [ 2381.651706] atmarpd/2696 is trying to acquire lock:
 [ 2381.666354]  (skb_queue_lock_key){-+..}, at: [<c028c540>] skb_migrate+0x24/0x6c
 [ 2381.688848]


OK, this is a real potential deadlock in a way: it takes the locks of two
skb queues without doing any kind of lock ordering. I think the following
patch should fix it: just sort the lock-taking order by address of the
queue. It's not pretty, but it's the best this can do in a minimally
invasive way.

I still agree with the comment that this code shouldn't live in the atm
layer...

Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

---
 net/atm/ipcommon.c |   13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux-2.6.17-mm6/net/atm/ipcommon.c
===
--- linux-2.6.17-mm6.orig/net/atm/ipcommon.c
+++ linux-2.6.17-mm6/net/atm/ipcommon.c
@@ -25,8 +25,8 @@
 /*
  * skb_migrate appends the list at from to to, emptying from in the
  * process. skb_migrate is atomic with respect to all other skb operations on
- * from and to. Note that it locks both lists at the same time, so beware
- * of potential deadlocks.
+ * from and to. Note that it locks both lists at the same time, so to deal
+ * with the lock ordering, the locks are taken in address order.
  *
  * This function should live in skbuff.c or skbuff.h.
  */
@@ -39,8 +39,13 @@ void skb_migrate(struct sk_buff_head *fr
 	struct sk_buff *skb_to = (struct sk_buff *) to;
 	struct sk_buff *prev;
 
-	spin_lock_irqsave(&from->lock,flags);
-	spin_lock(&to->lock);
+	if (from < to) {
+		spin_lock_irqsave(&from->lock,flags);
+		spin_lock_nested(&to->lock, SINGLE_DEPTH_NESTING);
+	} else {
+		spin_lock_irqsave(&to->lock, flags);
+		spin_lock_nested(&from->lock, SINGLE_DEPTH_NESTING);
+	}
 	prev = from->prev;
 	from->next->prev = to->prev;
 	prev->next = skb_to;
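
The same address-ordering idea can be sketched in userspace with POSIX
mutexes. This is only an analogy to the patch above, not kernel code:
struct queue, lock_pair(), unlock_pair() and queue_migrate() are names
made up for the example, and the "queue" is reduced to a counter. Because
every caller takes the lower-addressed lock first, two threads migrating
in opposite directions cannot each hold one lock while waiting for the
other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Toy queue: a lock plus an item count standing in for the list. */
struct queue {
	pthread_mutex_t lock;
	int len;
};

/* Take both locks in ascending address order (one lock if a == b). */
static void lock_pair(struct queue *a, struct queue *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
	} else if ((uintptr_t)a < (uintptr_t)b) {
		pthread_mutex_lock(&a->lock);
		pthread_mutex_lock(&b->lock);
	} else {
		pthread_mutex_lock(&b->lock);
		pthread_mutex_lock(&a->lock);
	}
}

/* Release the locks taken by lock_pair(). */
static void unlock_pair(struct queue *a, struct queue *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}

/* Move everything from 'from' to 'to' while holding both locks. */
static void queue_migrate(struct queue *from, struct queue *to)
{
	lock_pair(from, to);
	if (from != to) {
		to->len += from->len;
		from->len = 0;
	}
	unlock_pair(from, to);
}
```

Calling queue_migrate(&a, &b) and queue_migrate(&b, &a) both end up
locking the same queue first, whichever direction the data moves.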



Re: [patch 1/7] net_device list cleanup: core

2006-07-04 Thread Alexey Kuznetsov

Hello!

 Different modules want different kinds of lookup.
 So, I'm thinking about something like ilookup5.

 The next question: would people agree to review a patch doing this for
 net_devices? :)

One unoriginal suggestion, which nevertheless has not been voiced yet:
implement netdev_iterate_list() or whatever, update only the core
and a few devices, and deprecate dev_base_head
with __deprecated_for_modules, adding it to
Documentation/feature-removal-schedule.txt.

Alexey

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Shailabh Nagar


Shailabh Nagar wrote:

jamal wrote:


On Mon, 2006-03-07 at 18:01 -0700, Andrew Morton wrote:


On Mon, 03 Jul 2006 20:54:37 -0400
Shailabh Nagar [EMAIL PROTECTED] wrote:



What happens when a listener exits without doing deregistration
(or if the listener attempts to register another cpumask while a 
current

registration is still active).



( Jamal, your thoughts on this problem would be appreciated)

Problem is that we have a listener task which has registered with 
taskstats and caused
its pid to be stored in various per-cpu lists of listeners. Later, 
when some other task exits on a given cpu, its exit data is sent 
using genlmsg_unicast on each pid present on that cpu's list.


If the listener exits without doing a deregister, its pid 
continues to be kept around, obviously not a good thing. So we need 
some way of detecting the situation (task is no longer listening on

these cpus events) that is efficient.



Also need to address the case where the listener has closed off his file
descriptor but continues to run.

So hooking into listener's exit() isn't appropriate - the teardown is
associated with the lifetime of the fd, not of the process.  If we do 
that,
exit() gets handled for free.  




If you are always going to send unicast messages, then  -ECONNREFUSED
will tell you the listener has closed their fd - this doesnt meant it
has exited. 



Thats good. So we have atleast one way of detecting the closed fd without
deregistering within taskstats itself.


Besides that one process could open several sockets. I know
that would not be the app you would write - but it doesnt stop other
people from doing it.



As far as API is concerned, even a taskstats listener is not being
prevented from opening multiple sockets. As Andrew also pointed out,
everything needs to be done per-socket.


I think i may not follow what you are doing - for some reason i thought
you may have many listeners in user space and these messages get
multicast to them?



That was the design earlier. In the past week, the design has changed to
one where there are still many listeners in user space but messages
get unicast to each of them. Earlier listeners would get messages generated
on task exit from every cpu, now they get it only from cpus for which
they have explicitly registered interest (via a cpumask passed in through
another genetlink command).


Does the user space program somehow communicate its pid to the kernel?



Yes. When the listener registers interest in a set of cpus, as described
above, its pid (genl_info->pid) is stored in the per-cpu list of
listeners for those cpus. When a task exits on one of those cpus, the
exit data is only sent via genlmsg_unicast to those pids
(really, nl_pids) that are on that cpu's listener list.


Now that I think more about it, netlink is really maintaining a pidhash
of nl_pids, not process pids, right? So if one userapp were to open
multiple sockets using NETLINK_GENERIC protocol (regardless of how many
of those are for the taskstats), each of them would have to use a
different nl_pid. Hence, it would be valid for the taskstats layer to
use netlink_lookup() at any time to see if the corresponding socket were
closed?



Here's a strawman for the problem we're trying to solve: get
notification of the close of a NETLINK_GENERIC socket that had
been used to register interest for some cpus within taskstats.

From looking at the netlink code, the way to go seems to be

- taskstats maintains a pidhash of nl_pids that are currently
registered to listen to at least one cpu. It also stores the
cpumask used.
- taskstats registers a notifier block within netlink_chain
and receives a callback on the NETLINK_URELEASE event, similar
to drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event()

- the callback checks to see that the protocol is NETLINK_GENERIC
and that the nl_pid for the socket is in taskstats' pidhash. If so, it
does a cleanup using the stored cpumask and releases the nl_pid
from the pidhash.

We can even do away with the deregister command altogether and
simply rely on this autocleanup.
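
A minimal userspace sketch of the bookkeeping in the strawman above, not
kernel code. Every name in it (listener_register, on_socket_release,
the SKETCH_* constants, the fixed-size table standing in for the pidhash)
is hypothetical and exists only for illustration; the real thing would
hang off a notifier block on netlink_chain and react to NETLINK_URELEASE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS        8
#define SKETCH_MAX_LISTENERS  16
#define SKETCH_PROTO_GENERIC  16  /* stand-in for NETLINK_GENERIC */

/* One registered listener socket: its nl_pid plus the cpumask it asked for. */
struct listener {
	uint32_t nl_pid;   /* netlink port id of the socket, not a process pid */
	uint32_t cpumask;  /* bit n set: socket wants exit data from cpu n */
	int in_use;
};

/* Stand-in for the pidhash; a linear table keeps the sketch short. */
static struct listener listeners[SKETCH_MAX_LISTENERS];
/* Stand-in for the per-cpu listener lists: only the length is tracked. */
static int listeners_on_cpu[SKETCH_NR_CPUS];

static struct listener *find_listener(uint32_t nl_pid)
{
	for (int i = 0; i < SKETCH_MAX_LISTENERS; i++)
		if (listeners[i].in_use && listeners[i].nl_pid == nl_pid)
			return &listeners[i];
	return NULL;
}

static void adjust_cpu_counts(uint32_t cpumask, int delta)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (cpumask & (1u << cpu))
			listeners_on_cpu[cpu] += delta;
}

/* Register nl_pid for the cpus in cpumask. Returns 0 on success, -1 if the
 * nl_pid already has an active registration or the table is full. */
int listener_register(uint32_t nl_pid, uint32_t cpumask)
{
	if (find_listener(nl_pid))
		return -1;
	for (int i = 0; i < SKETCH_MAX_LISTENERS; i++) {
		if (!listeners[i].in_use) {
			listeners[i].nl_pid = nl_pid;
			listeners[i].cpumask = cpumask;
			listeners[i].in_use = 1;
			adjust_cpu_counts(cpumask, +1);
			return 0;
		}
	}
	return -1;
}

/* Models the release callback: ignore other protocols and unknown nl_pids,
 * otherwise undo the registration using the stored cpumask.
 * Returns 1 if a listener was cleaned up, 0 otherwise. */
int on_socket_release(int protocol, uint32_t nl_pid)
{
	struct listener *l;

	if (protocol != SKETCH_PROTO_GENERIC)
		return 0;
	l = find_listener(nl_pid);
	if (!l)
		return 0;
	adjust_cpu_counts(l->cpumask, -1);
	l->in_use = 0;
	return 1;
}
```

Rejecting a second registration for the same nl_pid is just one possible
policy for the "register another cpumask while a registration is active"
case; merging the masks would be another.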

--Shailabh

[e1000]: flow control on by default - good idea really?


CCing anybody who may have stakes on this. Ignore the email if this
doesnt interest you.
Ok, folks - i had deferred this discussion but it bit me in the ass. 
I just spent an hour debugging it (and in the process blew up a gbic i
borrowed, so my day aint going well since i actually have to pay for
this and cant really do the testing i was planning to;-).
 
I have a device connected to an e1000 that was erroneously advertising
both tx/rx flow control but wasnt properly reacting to it.
The default setup on the e1000 has rx flow control turned on.
I was sending at wire rate gige from the device - which is about
1.48Mpps. The e1000 was in turn sending me flow control packets
as per default/expected behavior. Unfortunately, it was sending
a very large number of packets. At one point i was seeing up to
1Mpps and on average, the flow control packets were consuming
60-70% of the bandwidth. Even when i fixed this behavior to act
properly, leaving flow control on consumed up to 15% of the bandwidth.
Clearly, this is a bad thing. Yes, the device in the first instance was
at fault. But i have argued in the past that NAPI does just fine without
flow control being turned on, so even chewing 5% of bandwidth on flow
control is a bad thing..
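
For reference, the arithmetic behind those numbers can be sketched as
below. It assumes minimum-size Ethernet frames (64 bytes including FCS,
plus 8 bytes of preamble and 12 bytes of inter-frame gap on the wire),
and that flow control (pause) frames are also minimum-size frames. The
function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* On-wire cost of one minimum-size Ethernet frame, in bytes:
 * 64 (frame including FCS) + 8 (preamble/SFD) + 12 (inter-frame gap). */
enum { MIN_FRAME_WIRE_BYTES = 64 + 8 + 12 };

/* Maximum number of minimum-size frames per second on a link of the
 * given bit rate. */
uint64_t max_min_frames_per_sec(uint64_t link_bps)
{
	return link_bps / (MIN_FRAME_WIRE_BYTES * 8);
}

/* Whole-number percentage of one direction of the link consumed by
 * `pps` minimum-size frames per second. */
unsigned link_share_percent(uint64_t pps, uint64_t link_bps)
{
	return (unsigned)(pps * MIN_FRAME_WIRE_BYTES * 8 * 100 / link_bps);
}
```

With a 1 Gbit/s link this gives 1488095 frames per second, i.e. the
"about 1.48Mpps" figure, and 1Mpps of minimum-size frames works out to
67% of one direction of the link, which is in line with the 60-70%
observed.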

As a compromise, can we declare flow control as an advanced feature
and turn it off by default? People who feel it is valuable and know
what they are doing can turn it off.

If you want more details just shoot.

cheers,
jamal 

PS:- BTW, even turning off flow control on e1000 didnt give as good
performance as in the old days on this machine - but i dont want to go
into that discussion.


Re: [PATCH 2/4] d80211: fix receiving through virtual interfaces

2006-07-04 Thread Jiri Benc

On Mon,  3 Jul 2006 19:24:08 +0200 (CEST), Jiri Benc wrote:
 - Packet type (PACKET_HOST and PACKET_OTHER_HOST) is set correctly now.

Uhm, not really.

 @@ -3057,7 +3048,9 @@ ieee80211_rx_h_check(struct ieee80211_tx
   return TXRX_DROP;
   }
  
 - if (memcmp(rx->dev->dev_addr, hdr->addr1, ETH_ALEN) == 0)
 + if (rx->fc & WLAN_FC_TODS)
 + rx->skb->pkt_type = PACKET_OTHERHOST;

I'm not sure how something so obviously wrong slipped there.

The corrected version of the patch follows.

---
 net/d80211/ieee80211.c   |  171 +++
 net/d80211/ieee80211_i.h |5 +
 net/d80211/wpa.c |4 +
 3 files changed, 124 insertions(+), 56 deletions(-)

--- dscape.orig/net/d80211/ieee80211.c
+++ dscape/net/d80211/ieee80211.c
@@ -2463,27 +2463,15 @@ ieee80211_rx_h_data(struct ieee80211_txr
 		memcpy(ehdr->h_source, src, ETH_ALEN);
 		ehdr->h_proto = len;
 	}
-
-	if (rx->sta && !rx->sta->assoc_ap &&
-	    !(rx->sta && (rx->sta->flags & WLAN_STA_WDS)))
-		skb->dev = rx->sta->dev;
-	else
-		skb->dev = dev;
+	skb->dev = dev;
 
 	skb2 = NULL;
-	sdata = IEEE80211_DEV_TO_SUB_IF(dev);
 
-	/*
-	 * don't count the master since the low level code
-	 * counts it already for us.
-	 */
-	if (skb->dev != sdata->master) {
-		sdata->stats.rx_packets++;
-		sdata->stats.rx_bytes += skb->len;
-	}
+	sdata->stats.rx_packets++;
+	sdata->stats.rx_bytes += skb->len;
 
 	if (local->bridge_packets && (sdata->type == IEEE80211_IF_TYPE_AP
-	    || sdata->type == IEEE80211_IF_TYPE_VLAN)) {
+	    || sdata->type == IEEE80211_IF_TYPE_VLAN) && rx->u.rx.ra_match) {
 		if (is_multicast_ether_addr(skb->data)) {
 			/* send multicast frames both to higher layers in
 			 * local net stack and back to the wireless media */
@@ -2760,13 +2748,14 @@ static int ap_sta_ps_end(struct net_devi
 
 
 static ieee80211_txrx_result
-ieee80211_rx_h_ieee80211_rx_h_ps_poll(struct ieee80211_txrx_data *rx)
+ieee80211_rx_h_ps_poll(struct ieee80211_txrx_data *rx)
 {
 	struct sk_buff *skb;
 	int no_pending_pkts;
 
 	if (likely(!rx->sta || WLAN_FC_GET_TYPE(rx->fc) != WLAN_FC_TYPE_CTRL ||
-		   WLAN_FC_GET_STYPE(rx->fc) != WLAN_FC_STYPE_PSPOLL))
+		   WLAN_FC_GET_STYPE(rx->fc) != WLAN_FC_STYPE_PSPOLL ||
+		   !rx->u.rx.ra_match))
 		return TXRX_CONTINUE;
 
 	skb = skb_dequeue(&rx->sta->tx_filtered);
@@ -3042,8 +3031,10 @@ ieee80211_rx_h_check(struct ieee80211_tx
 		if (unlikely(rx->fc & WLAN_FC_RETRY &&
 			     rx->sta->last_seq_ctrl[rx->u.rx.queue] ==
 			     hdr->seq_ctrl)) {
-			rx->local->dot11FrameDuplicateCount++;
-			rx->sta->num_duplicates++;
+			if (rx->u.rx.ra_match) {
+				rx->local->dot11FrameDuplicateCount++;
+				rx->sta->num_duplicates++;
+			}
 			return TXRX_DROP;
 		} else
 			rx->sta->last_seq_ctrl[rx->u.rx.queue] = hdr->seq_ctrl;
@@ -3057,7 +3048,9 @@ ieee80211_rx_h_check(struct ieee80211_tx
 		return TXRX_DROP;
 	}
 
-	if (memcmp(rx->dev->dev_addr, hdr->addr1, ETH_ALEN) == 0)
+	if (!rx->u.rx.ra_match)
+		rx->skb->pkt_type = PACKET_OTHERHOST;
+	else if (memcmp(rx->dev->dev_addr, hdr->addr1, ETH_ALEN) == 0)
 		rx->skb->pkt_type = PACKET_HOST;
 	else if (is_multicast_ether_addr(hdr->addr1)) {
 		if (is_broadcast_ether_addr(hdr->addr1))
@@ -3080,8 +3073,10 @@ ieee80211_rx_h_check(struct ieee80211_tx
 		       WLAN_FC_GET_STYPE(rx->fc) == WLAN_FC_STYPE_PSPOLL)) &&
 		     rx->sdata->type != IEEE80211_IF_TYPE_IBSS &&
 		     (!rx->sta || !(rx->sta->flags & WLAN_STA_ASSOC)))) {
-		if (!(rx->fc & WLAN_FC_FROMDS) && !(rx->fc & WLAN_FC_TODS)) {
-			/* Drop IBSS frames silently. */
+		if ((!(rx->fc & WLAN_FC_FROMDS) && !(rx->fc & WLAN_FC_TODS)) ||
+		    !rx->u.rx.ra_match) {
+			/* Drop IBSS frames and frames for other hosts
+			 * silently. */
 			return TXRX_DROP;
 		}
 
@@ -3113,6 +3108,8 @@ ieee80211_rx_h_check(struct ieee80211_tx
 			rx->key = rx->sdata->keys[keyidx];
 		}
 		if (!rx->key) {
+			if (!rx->u.rx.ra_match)
+				return TXRX_DROP;
 			printk(KERN_DEBUG "%s: RX WEP frame with "
 			       "unknown keyidx %d (A1=" MACSTR " A2="
 			       MACSTR " A3=" MACSTR ")\n",
@@ -3128,7

[2.6.17-git22] lock debugging output

2006-07-04 Thread Alessandro Suardi


Hoping gmail doesn't mess it too badly...

eth0: tg3 (BCM5751 Gbit Ethernet)
eth1: ipw2200 (Intel PRO/Wireless 2200BG)

Sequence:
1. boot with eth0 disconnected (eth1 doesn't come up on boot)
2. ifup eth1, bring wpa-supplicant up
3. run 'dig' --- lock debug info gets printed on console

Note that due to my very variable network setup, I had no /etc/resolv.conf
in place at the moment I ran 'dig'. Second execution of 'dig' did not print
any lock debug output but just (properly) stalled; then I realized I didn't
put my home resolv.conf in place, did that and 'dig' just worked.

System appears to work and I'm actually typing this report from the
same kernel that reported the following upon invoking 'dig' :

=================================
[ INFO: inconsistent lock state ]
---------------------------------
inconsistent {softirq-on-W} -> {in-softirq-R} usage.
dig/2373 [HC0[0]:SC1[2]:HE1:SE0] takes:
 (&sk->sk_dst_lock){---?}, at: [c028cf72] sk_dst_check+0x1b/0xe6
{softirq-on-W} state was registered at:
  [c0127a6a] lock_acquire+0x60/0x80
  [c02e151d] _write_lock+0x19/0x28
  [c028c0af] sock_setsockopt+0x351/0x49c
  [c0289d0d] sys_setsockopt+0x5b/0x8d
  [c028ac22] sys_socketcall+0x148/0x186
  [c0102699] sysenter_past_esp+0x56/0x8d
irq event stamp: 1130
hardirqs last  enabled at (1130): [c01161ed] local_bh_enable_ip+0xb2/0xbb
hardirqs last disabled at (1129): [c011618e] local_bh_enable_ip+0x53/0xbb
softirqs last  enabled at (1120): [c029423c] dev_queue_xmit+0x205/0x211
softirqs last disabled at (1121): [c01040e6] do_softirq+0x4d/0xac

other info that might help us debug this:
2 locks held by dig/2373:
 #0:  (sk_lock-AF_INET6){--..}, at: [f8cf1168]
udpv6_sendmsg+0x546/0x818 [ipv6]
 #1:  (slock-AF_INET6){-...}, at: [f8cf3228] icmpv6_send+0x222/0x549 [ipv6]

stack backtrace:
 [c0102e44] show_trace+0xd/0x10
 [c010335e] dump_stack+0x19/0x1b
 [c01260e1] print_usage_bug+0x1cc/0x1d9
 [c01265e2] mark_lock+0x193/0x360
 [c01271ee] __lock_acquire+0x3b7/0x969
 [c0127a6a] lock_acquire+0x60/0x80
 [c02e15ff] _read_lock+0x19/0x28
 [c028cf72] sk_dst_check+0x1b/0xe6
 [f8ce1305] ip6_dst_lookup+0x31/0x16d [ipv6]
 [f8cf3338] icmpv6_send+0x332/0x549 [ipv6]
 [f8cf09a1] udpv6_rcv+0x4ab/0x4d6 [ipv6]
 [f8ce2900] ip6_input+0x19c/0x228 [ipv6]
 [f8ce2d61] ipv6_rcv+0x188/0x1b7 [ipv6]
 [c02925b7] netif_receive_skb+0x18d/0x1d8
 [c0293d6a] process_backlog+0x80/0xf9
 [c0293f43] net_rx_action+0x80/0x174
 [c01162fd] __do_softirq+0x46/0x9c
 [c01040e6] do_softirq+0x4d/0xac
 ===
 [c0116117] local_bh_enable+0xc8/0xec
 [c029423c] dev_queue_xmit+0x205/0x211
 [c0298a8b] neigh_resolve_output+0x1db/0x207
 [f8ce0bee] ip6_output2+0x1e4/0x202 [ipv6]
 [f8ce12aa] ip6_output+0x69e/0x6c8 [ipv6]
 [f8ce1706] ip6_push_pending_frames+0x2c5/0x377 [ipv6]
 [f8cefd8e] udp_v6_push_pending_frames+0x154/0x176 [ipv6]
 [f8cf122a] udpv6_sendmsg+0x608/0x818 [ipv6]
 [c02c6b1d] inet_sendmsg+0x3b/0x48
 [c02894f9] sock_sendmsg+0xe8/0x103
 [c0289b18] sys_sendmsg+0x14f/0x1aa
 [c028ac45] sys_socketcall+0x16b/0x186
 [c0102699] sysenter_past_esp+0x56/0x8d


Hope this may be useful to lock debug devs / netdev folks...


Ciao,

--alessandro

I can't change what makes me high and I can't change what I believe in
(Heather Nova, My Fidelity)

Re: [2.6.17-git22] lock debugging output

From: Arjan van de Ven [EMAIL PROTECTED]

On Tue, 2006-07-04 at 20:13 +0200, Alessandro Suardi wrote:
 Hoping gmail doesn't mess it too badly...
 
 eth0: tg3 (BCM5751 Gbit Ethernet)
 eth1: ipw2200 (Intel PRO/Wireless 2200BG)
 
 Sequence:
  1. boot with eth0 disconnected (eth1 doesn't come up on boot)
  2. ifup eth1, bring wpa-supplicant up
  3. run 'dig' --- lock debug info gets printed on console


this appears to be a real deadlock:

the SO_BINDTODEVICE setsockopt calls sk_dst_reset(), which looks like this:
static inline void
sk_dst_reset(struct sock *sk)
{
	write_lock(&sk->sk_dst_lock);
	__sk_dst_reset(sk);
	write_unlock(&sk->sk_dst_lock);
}

now... ipv6 does this in softirq context:
  [c028cf72] sk_dst_check+0x1b/0xe6
  [f8ce1305] ip6_dst_lookup+0x31/0x16d [ipv6]
  [f8cf3338] icmpv6_send+0x332/0x549 [ipv6]
  [f8cf09a1] udpv6_rcv+0x4ab/0x4d6 [ipv6]
  [f8ce2900] ip6_input+0x19c/0x228 [ipv6]
  [f8ce2d61] ipv6_rcv+0x188/0x1b7 [ipv6]
  [c02925b7] netif_receive_skb+0x18d/0x1d8
  [c0293d6a] process_backlog+0x80/0xf9
  [c0293f43] net_rx_action+0x80/0x174
  [c01162fd] __do_softirq+0x46/0x9c
  [c01040e6] do_softirq+0x4d/0xac

where sk_dst_check() takes the same lock for read.

that looks like a real deadlock to me... 
the most obvious low impact solution is to make sk_dst_reset use an
irqsave variant; patch for that is attached below. I'll leave it to the
networking people to say if that's the real right approach
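
To make the reasoning concrete, here is a deliberately simplified
userspace model of the kind of usage tracking behind the report above.
It is not lockdep's actual implementation; the bit names, struct and
functions are invented for the sketch. It only encodes the rule that a
lock taken inside a softirq must never also be write-held with softirqs
enabled (and vice versa for a softirq writer against a softirq-enabled
reader).

```c
#include <assert.h>

/* Hypothetical usage bits recorded per lock class. */
enum {
	USED_IN_SOFTIRQ_R = 1 << 0, /* read-locked from softirq context  */
	USED_IN_SOFTIRQ_W = 1 << 1, /* write-locked from softirq context */
	SOFTIRQ_ON_R      = 1 << 2, /* read-locked with softirqs enabled  */
	SOFTIRQ_ON_W      = 1 << 3, /* write-locked with softirqs enabled */
};

struct lock_class_model {
	unsigned usage;
};

/* A softirq can interrupt a holder that left softirqs enabled. That is
 * fatal if the holder is a writer, or if the softirq side wants to write
 * while a reader holds the lock. Reader against reader is harmless. */
int usage_is_inconsistent(unsigned usage)
{
	if ((usage & (USED_IN_SOFTIRQ_R | USED_IN_SOFTIRQ_W)) &&
	    (usage & SOFTIRQ_ON_W))
		return 1;
	if ((usage & USED_IN_SOFTIRQ_W) && (usage & SOFTIRQ_ON_R))
		return 1;
	return 0;
}

/* Record one acquisition and report whether the class is now inconsistent.
 * Acquisitions from process context with softirqs disabled add no bits. */
int mark_usage(struct lock_class_model *c, int in_softirq,
	       int softirqs_enabled, int is_write)
{
	if (in_softirq)
		c->usage |= is_write ? USED_IN_SOFTIRQ_W : USED_IN_SOFTIRQ_R;
	else if (softirqs_enabled)
		c->usage |= is_write ? SOFTIRQ_ON_W : SOFTIRQ_ON_R;
	return usage_is_inconsistent(c->usage);
}
```

In this model the reported sequence (write lock from the setsockopt path
with softirqs enabled, then read lock from the receive softirq) trips
the check, while taking the write lock with softirqs disabled does not.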

Signed-off-by: Arjan van de Ven [EMAIL PROTECTED]

---
 include/net/sock.h |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux-2.6.17-mm6/include/net/sock.h
===
--- linux-2.6.17-mm6.orig/include/net/sock.h
+++ linux-2.6.17-mm6/include/net/sock.h
@@ -1025,9 +1025,10 @@ __sk_dst_reset(struct sock *sk)
 static inline void
 sk_dst_reset(struct sock *sk)
 {
-	write_lock(&sk->sk_dst_lock);
+	unsigned long flags;
+	write_lock_irqsave(&sk->sk_dst_lock, flags);
 	__sk_dst_reset(sk);
-	write_unlock(&sk->sk_dst_lock);
+	write_unlock_irqrestore(&sk->sk_dst_lock, flags);
 }
 
 extern struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);



Re: RDMA will be reverted

2006-07-04 Thread Andy Gay

On Sat, 2006-07-01 at 16:26 +0200, Andi Kleen wrote:
 On Saturday 01 July 2006 01:01, Tom Tucker wrote:
  On Fri, 2006-06-30 at 14:16 -0700, David Miller wrote:
  
   The TOE folks have tried to submit their hooks and drivers
   on several occasions, and we've rejected it every time.
  
  iWARP != TOE
 
 Perhaps a good start of that discussion David asked for would 
 be if you could give us an overview of the differences
 and how you avoid the TOE problems.

Interesting thread, I hope someone replies to Andi's request.
I've actually no real idea what RDMA, iWARP & TOE are, so I may be
barking up completely the wrong tree here. If so, apologies.

But it sounds like we're talking about technologies that offload some
part of the network/transport layer processing to the interface device?

And the primary objection to that is that it may bypass some of the cool
features of the Linux stack? Stuff like iptables and ... what exactly?

Presumably the reasons why such offloading would be a Good Thing are to
do with very high speed network processing, 10G ethernet and the like.
Which sounds to me very like the way dedicated networking kit would do
that. So if you have a device that needs to be a very high performance
router, you dedicate it to that function and don't try to do clever
per-packet or -flow processing at the same time.

In the Cisco world, there's a network design approach where you consider
your equipment in three 'layers', I think they call them the core,
distribution and access layers, or something like that. The idea is that
the core layer handles the real high speed stuff, you don't do anything
much except routing/switching in there. The other layers aggregate flows
for the core, if you need extra processing (firewalls etc) you do it
somewhere in there. So, for example, the packet capture functions (sort
of like tcpdump) don't work if fast switching is in use, which it would
be in the core. Users accept these tradeoffs, because if you design it
right you can do the extra processing on some other device in the outer
layers.

So perhaps there's a good argument to make that a Linux system with the
right hardware could be considered a core device. Likely any place you
have such a system it would be dedicated to just moving data as well as
possible, and let other systems do the other stuff. You wouldn't want
your server farm systems to also be your firewalls.

Bottom line - these technologies seem to me to have a place in a well
designed network.

Just my 2c...

- Andy

 
 -Andi

Re: [e1000]: flow control on by default - good idea really?

On Tue, 2006-04-07 at 13:11 -0400, jamal wrote:
 CCing anybody who may have stakes on this. Ignore the email if this
 doesnt interest you.
 Ok, folks - i had deferred this discussion but it bit me in the ass. 
 I just spend an hour debugging it (and in the process blew up a gbic i
 borrowed, so my day aint going well since i actually have to pay for
 this and cant really do the testing i was planning to;-).
  
 I have a device connected to a e1000 that was erroneously advertising
 both tx/rx flow control but wasnt properly reacting to it. 
 The default setup on the e1000 has rx flow control turned on.
 I was sending at wire rate gige from the device - which is about
 1.48Mpps. The e1000 was in turn sending me flow control packets
 as per default/expected behavior. Unfortunately, it was sending
 a very large amount of packets. At one point i was seeing upto
 1Mpps and on average, the flow control packets were consuming
 60-70% of the bandwidth. Even when i fixed this behavior to act
 properly, allowing flow control on consumed up to 15% of the bandwidth. 
 Clearly, this is a bad thing. Yes, the device in the first instance was
 at fault. But i have argued in the past that NAPI does just fine without
 flow control being turned on, so even chewing 5% of bandwidth on flow
 control is a bad thing..
 
 As a compromise, can we declare flow control as an advanced feature
 and turn it off by default? People who feel it is valuable and know
 what they are doing can turn it off.

I meant turn it on.

BTW, As an addendum this default behavior changed around 2.6.16 it
seems.

cheers,
jamal


Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote:
 Russell Stuart wrote:
[..]
  Without seeing your actual proposal it is difficult to
  judge whether this is a reasonable trade-off or not.
  Hopefully we will see your code soon.  Do you have any
  idea when?
 
 Unfortunately I still didn't get to cleaning them up, so I'm sending
 them in their preliminary state. It's not much that is missing, but
 the netem usage of skb->cb needs to be integrated better, I failed
 to move it to the qdisc_skb_cb so far because of circular includes.
 But nothing unfixable. I'm mostly interested if the current size-tables
 can express what you need for ATM, I wasn't able to understand the
 big comment in tc_core.c in your patch.
 

Looks good; the scope of the change seems reasonable for the problem
being addressed. The cb on the qdisc seems only usable for netem, correct?
Also while not unreasonable, i wasnt sure how qdisc_enqueue_root()
fit in the grand scheme of things for this change (it seemed out of
place).

cheers,
jamal


Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

Shailabh,

On Tue, 2006-04-07 at 12:37 -0400, Shailabh Nagar wrote:
[..]
 Here's a strawman for the problem we're trying to solve: get
 notification of the close of a NETLINK_GENERIC socket that had
 been used to register interest for some cpus within taskstats.
 
  From looking at the netlink code, the way to go seems to be
 
 - it maintains a pidhash of nl_pids that are currently
 registered to listen to at least one cpu. It also stores the
 cpumask used.
 - taskstats registers a notifier block within netlink_chain
 and receives a callback on the NETLINK_URELEASE event, similar
 to drivers/scsi/scsi_transport_iscsi.c: iscsi_rcv_nl_event()
 
 - the callback checks to see that the protocol is NETLINK_GENERIC
 and that the nl_pid for the socket is in taskstat's pidhash. If so, it
 does a cleanup using the stored cpumask and releases the nl_pid
 from the pidhash.
 

Sounds quite reasonable.  I am beginning to wonder whether we should
do the NETLINK_URELEASE in general for NETLINK_GENERIC.

 We can even do away with the deregister command altogether and
 simply rely on this autocleanup.

I think you may still need the register if you are going to allow
multiple sockets per listener process, no?
The other question is how do you correlate pid - fd?

cheers,
jamal




Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Paul Jackson

Shailabh wrote:
 Perhaps I should use the other ascii format for specifying cpumasks
 since it's more amenable to specifying an upper bound for the length
 of the ascii string and is more compact?

Eh - basically - I don't have a strong opinion either way.

I have a slight esthetic preference toward using list of ranges format
from shell scripts and shell prompts, and using the 32-bit hex words
from C code:

17-26,44-47         # shell - list of ranges
0000f000,07fe0000   # C - 32-bit hex words

Since the primary interface you are working with is C code, that would
mean I'd slightly prefer the 32-bit hex word variant.

From what I've seen neither of the reasons you gave for preferring
the 32-bit hex word format is persuasive (even though they both
lead to the same conclusion as I preferred ;):

Which is more compact depends on the particular bit pattern
you need to represent.  See for example the examples above.

The lack of a perfect upper bound on the list of ranges format
is a theoretical problem that I have never seen in practice.
Only pathological constructs exceed six ascii characters per
set bit.
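
As a rough illustration of how the two formats relate, the sketch below
parses a list of ranges into a 64-bit mask and prints it as comma
separated 32-bit hex words, most significant word first. The function
names are made up for the example, and the 64-cpu limit is an assumption
that keeps it short; it is not the kernel's own parsing code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a list of ranges such as "17-26,44-47" into a 64-bit mask.
 * Returns 0 on success, -1 on malformed input or a cpu number above 63. */
int parse_cpulist(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	while (*s) {
		char *end;
		unsigned long lo = strtoul(s, &end, 10);
		unsigned long hi = lo;

		if (end == s)
			return -1;
		s = end;
		if (*s == '-') {        /* "lo-hi" range */
			s++;
			hi = strtoul(s, &end, 10);
			if (end == s)
				return -1;
			s = end;
		}
		if (lo > hi || hi > 63)
			return -1;
		for (unsigned long bit = lo; bit <= hi; bit++)
			m |= (uint64_t)1 << bit;
		if (*s == ',')
			s++;
		else if (*s != '\0')
			return -1;
	}
	*mask = m;
	return 0;
}

/* Format the mask as two 32-bit hex words, high word first. */
void format_hexwords(uint64_t mask, char *buf, size_t len)
{
	snprintf(buf, len, "%08x,%08x",
		 (unsigned)(mask >> 32), (unsigned)(mask & 0xffffffffu));
}
```

Fed the list-of-ranges example above, this produces the corresponding
pair of hex words with all leading and trailing zeros written out,
0000f000,07fe0000.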

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401

Re: 2.6.17-mm6

this is one for the networking people, and thus netdev


On Tue, 2006-07-04 at 21:53 +0200, Rafael J. Wysocki wrote:
 On Monday 03 July 2006 12:03, Andrew Morton wrote:
  
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.17/2.6.17-mm6/
  
  
  - A major update to the e1000 driver.
  
  - 1394 updates
 
 Just found this in dmesg:
 
 =================================
 [ INFO: inconsistent lock state ]
 ---------------------------------
 inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
 nscd/4929 [HC0[0]:SC0[1]:HE1:SE0] takes:
  (skb_queue_lock_key){++..}, at: [8044fe40] udp_ioctl+0x50/0xa0
 {in-hardirq-W} state was registered at:
   [8024b4fa] lock_acquire+0x8a/0xc0
   [80476e3f] _spin_lock_irqsave+0x3f/0x60
   [80408c25] skb_queue_tail+0x25/0x60

ok so skb_queue_lock is used in a hardirq context

   [881c9517] queue_packet_complete+0x27/0x40 [ieee1394]
   [881c9d6b] hpsb_packet_sent+0xab/0x100 [ieee1394]
   [8822a4b5] dma_trm_reset+0x115/0x140 [ohci1394]
   [8822c512] ohci_devctl+0x1c2/0x540 [ohci1394]
   [881c9673] hpsb_bus_reset+0x43/0xb0 [ieee1394]
   [8822d7f6] ohci_irq_handler+0x416/0x830 [ohci1394]
   [802631ab] handle_IRQ_event+0x2b/0x70
   [80264dd4] handle_level_irq+0xc4/0x130
   [8020c762] do_IRQ+0x112/0x130
   [80209d90] common_interrupt+0x64/0x65
 irq event stamp: 4280
 hardirqs last  enabled at (4279): [8047606a] 
 trace_hardirqs_on_thunk+0x35/0x37
 hardirqs last disabled at (4278): [804760a1] 
 trace_hardirqs_off_thunk+0x35/0x67
 softirqs last  enabled at (4258): [804065b5] release_sock+0xd5/0xe0
 softirqs last disabled at (4280): [804764d1] _spin_lock_bh+0x11/0x50
 
 other info that might help us debug this:
 no locks held by nscd/4929.
 
 stack backtrace:
 
 Call Trace:
  [8020ab9f] show_trace+0x9f/0x240
  [8020af75] dump_stack+0x15/0x20
  [80249e52] print_usage_bug+0x272/0x290
  [8024a0d7] mark_lock+0x267/0x5f0
  [8024a9a6] __lock_acquire+0x546/0xd10
  [8024b4fb] lock_acquire+0x8b/0xc0
  [804764f4] _spin_lock_bh+0x34/0x50
  [8044fe40] udp_ioctl+0x50/0xa0

yet udp_ioctl takes it only for _bh

  [80457359] inet_ioctl+0x69/0x70
  [804033ac] sock_ioctl+0x22c/0x270
  [802a32b1] do_ioctl+0x31/0xa0
  [802a35db] vfs_ioctl+0x2bb/0x2e0
  [802a366a] sys_ioctl+0x6a/0xa0
  [8020985a] system_call+0x7e/0x83
  [2b2d76ab98a9]


is this a real scenario, or is this a case of firewire being special and
needing its own rules?



Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Paul Jackson

Andrew wrote:
 OK, so we're passing in an ASCII string.  Fair enough, I think.  Paul would
 know better.

Not sure if I know better - just got stronger opinions.

I like the ASCII here - but this is one of those he who
writes the code gets to 

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401

Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

2006-07-04 Thread Paul Jackson

pj wrote:
 writes the code gets to 

Never mind that last incomplete post - I hit Send
when I meant to hit Cancel.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.925.600.0401

Re: RDMA will be reverted

2006-07-04 Thread Roland Dreier

    Andi> Perhaps a good start of that discussion David asked for
    Andi> would be if you could give us an overview of the differences
    Andi> and how you avoid the TOE problems.

Well, here's a quick overview, leaving out some of the details.  The
difference between TOE and iWARP/RDMA is really the interface that
they present.

A TOE (TCP Offload Engine) is a piece of hardware that offloads TCP
processing from the main system to handle regular sockets.  There is
either some way to hand off a socket from the host stack to the TOE,
or a socket is created on the TOE to start with, but in both cases,
the TOE is handling processing for normal TCP sockets.  This means
that the TOE has some hardware and/or firmware to do stateful TCP
processing.

An iWARP device, or RNIC (RDMA NIC), also usually has hardware and/or
firmware TCP processing, but this isn't exposed through the BSD socket
interface.  Instead, an RNIC presents an interface more like an
InfiniBand HCA: work requests (sends, receives, RDMA operations) are
passed to the RNIC via work queues, and completion notification is
returned asynchronously via completion queues.  An iWARP connection
can handle both send/receive (two-sided) and get/put (RDMA or
one-sided) operations.
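
To give a feel for that interface, here is a toy model of a work queue
paired with a completion queue. It is only a sketch: the types and
functions are hypothetical, the "device" is a function call, and nothing
here corresponds to a real verbs API.

```c
#include <assert.h>
#include <stdint.h>

#define QDEPTH 4 /* power of two, so the free-running indices wrap cleanly */

enum wr_opcode { WR_SEND, WR_RECV, WR_RDMA_WRITE, WR_RDMA_READ };

struct work_request { uint64_t wr_id; enum wr_opcode op; };
struct completion   { uint64_t wr_id; int status; };

/* One work queue and the completion queue its results are reported on. */
struct queue_pair {
	struct work_request wq[QDEPTH];
	unsigned wq_head, wq_tail; /* head: next to consume, tail: next free */
	struct completion cq[QDEPTH];
	unsigned cq_head, cq_tail;
};

/* Consumer side: post a work request. Returns 0, or -1 if the queue is full.
 * Posting returns immediately; the result shows up later on the CQ. */
int post_work(struct queue_pair *qp, uint64_t wr_id, enum wr_opcode op)
{
	if (qp->wq_tail - qp->wq_head == QDEPTH)
		return -1;
	qp->wq[qp->wq_tail % QDEPTH] = (struct work_request){ wr_id, op };
	qp->wq_tail++;
	return 0;
}

/* Stand-in for the hardware: take one work request and report completion.
 * Returns 1 if a request was processed, 0 if there was nothing to do or
 * no room on the completion queue. */
int device_process_one(struct queue_pair *qp)
{
	struct work_request wr;

	if (qp->wq_head == qp->wq_tail)
		return 0;
	if (qp->cq_tail - qp->cq_head == QDEPTH)
		return 0;
	wr = qp->wq[qp->wq_head % QDEPTH];
	qp->wq_head++;
	qp->cq[qp->cq_tail % QDEPTH] = (struct completion){ wr.wr_id, 0 };
	qp->cq_tail++;
	return 1;
}

/* Consumer side: poll for one completion. Returns 1 if *out was filled. */
int poll_completion(struct queue_pair *qp, struct completion *out)
{
	if (qp->cq_head == qp->cq_tail)
		return 0;
	*out = qp->cq[qp->cq_head % QDEPTH];
	qp->cq_head++;
	return 1;
}
```

The point of the shape is that posting and completion are decoupled:
the consumer never blocks in post_work(), and learns the outcome only by
looking at the completion queue.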

For full details of the protocol used for this, you can look at the
drafts from the IETF rddp working group, but basically an RDMA protocol
is layered on top of a connected stream protocol -- usually TCP, but
SCTP could be used as well.

A lot of the performance of iWARP comes from the RDMA/direct placement
capabilities -- for example an NFS/RDMA server can process requests
out of order and put data directly into the buffer that's waiting for
it, without using any CPU on the destination -- but even send/receive
operations can be useful.

As a side note, an RNIC will also typically support the same sort of
kernel bypass as an IB HCA, where work queues can be safely mapped
into a userspace process's memory so that work requests can be posted
without a system call.  In fact, people usually use RDMA as a
shorthand for the combination of the three features I described:
asynchronous work queues and completion queues, connections that
support both send/receive and RDMA, and kernel bypass.

In any case, RNIC support can be added to the existing IB stack with
fairly minor modifications -- you can search the netdev archives for
the patchsets posted by Steve Wise, but nearly all of the new code is
in the low-level hardware driver for the specific iWARP devices.

The real issues for netdev are things like Steve Wise's patch to add
route change notifiers, which could be used to tell RNICs when to
update the next hop for a connection they're handling.  More
generally, it would be interesting to see if it's possible to tie an
RNIC into the kernel's packet filtering, so that disallowed
connections don't get set up.  This seems very similar in spirit to
the problems around packet filtering that were raised for VJ netchannels.

 - Roland

Re: RDMA will be reverted

2006-07-04 Thread Roland Dreier

  Roland stated that it has never been the case that we have
  rejected adding support for a certain class of devices on the
  kinds of merits being discussed in this thread.  And I'm saying
  that TOE is such a case where we have emphatically done so.

Well, in the past it's seemed more like patches have been rejected not
because of a blanket refusal to consider support for certain hardware,
but rather because of issues with the patches themselves.  eg last
year when Chelsio submitted some TOE hooks, you wrote the following
http://marc.theaimsgroup.com/?l=linux-netdev&m=112382991506811&w=2

   There is no way you're going to be allowed to call such deep TCP
   internals from your driver.

   This would mean that every time we wish to change the data structures
   and interfaces for TCP socket lookup, your drivers would need to
   change.

which looks like a very good reason to reject the changes.

  So I am not saying iWARP or RDMA is equal to TOE, and if you had
  actually read this thread you would have understood that.

There's definitely been quite a bit of conflation between the two in
this thread, even if you're not responsible...

 - R.

Re: RDMA will be reverted


 So perhaps there's a good argument to make that a Linux system with the
 right hardware could be considered a core device. Likely any place you
 have such a system it would be dedicated to just moving data as well as
 possible, and let other systems do the other stuff. You wouldn't want
 your server farm systems to also be your firewalls.

Why not? While Linux firewall performance is not flawless, its problems
(e.g. slow conntrack) seem to be mostly in areas where TOE cannot
do much.

 Bottom line - these technologies seem to me to have a place in a well
 designed network.

I think there is a web page listing why it's bad, but here is
a quick summary:

One worry is to debug it all together. Currently we have a single stack
to debug, although it's already difficult to control the complexity as it 
grows more bells and whistles.

Just take a look at Cisco IOS release notes to see how hard
and difficult it is to get it all to work together.

Another reason is that there are general doubts that TOE can
keep up with the ever growing performance of CPUs. Even if Linux
added it today it would be likely slower again a few months later.
That is also a big difference to Cisco hardware. Linux usually
runs on fast main CPUs (or if you run it on slow CPUs you normally
don't expect the best network performance). And they get faster
and faster constantly.

Admittedly 10Gb NICs are still a bit too fast for
mainstream systems, but that seems to be mostly a problem
outside the CPUs, and it looks like the next generation
of systems will catch up with enough bandwidth in this area.

Also it tends to accelerate the wrong thing. On a lot of workloads
the main problem is keeping a lot of different connections under 
control, and TOE tends to be slow at keeping connection
information synchronized with the host.

That is why the Linux strategy has been to ask for useful stateless offloads
instead. Examples of this are checksum offload (long time classic), TSO (TCP
segmentation offload), UFO (UDP fragmentation offload), Intel iOAT (memcpy
offload), and RX hashing with MSI-X (not implemented yet, but basically
it allows load balancing of incoming streams across CPUs).

Note that all these are more or less stateless offloads.
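
To make "stateless" concrete, here is a minimal sketch of the 16-bit
one's-complement Internet checksum (in the style of RFC 1071), which is the
computation that checksum offload moves from the CPU to the NIC. The name
inet_checksum is made up for this illustration and is not a kernel API. The
result depends only on the bytes of a single packet, so the hardware needs no
per-connection state to compute it.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: 16-bit one's-complement checksum over a buffer,
 * treating the data as big-endian 16-bit words (RFC 1071 style).
 * inet_checksum is a hypothetical name, not a kernel function.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	/* Add up all complete 16-bit words. */
	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}

	/* A trailing odd byte is padded with a zero byte on the right. */
	if (len)
		sum += (uint32_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Since the carries are folded only at the end, this sketch is meant for
packet-sized buffers; the 32-bit accumulator would overflow on inputs much
larger than about 128 KiB.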

It is not yet clear what iWARP is. From the meager bits of information
about it that have reached netdev so far, it at least sounds like it does
RDMA and needs far more state than any of the other offloads we have so far,
and it likely has the usual TOE scaling issues. It's also likely on the
wrong side of Moore's law.

-Andi

[mini-RFT] via-velocity cleanup

2006-07-04 Thread Francois Romieu

Against 2.6.17:

http://www.fr.zoreil.com/linux/kernel/2.6.x/2.6.17/via-velocity/

The mii operations now look more familiar. There should be no functional
change. The patches do not clash with Jeff's netdev-2.6#upstream.

Please report if I have broken something.

-- 
Ueimor

[PATCH 0/3] Action API fixes

2006-07-04 Thread Thomas Graf

Dave,

Fixes for some rather serious action API bugs. Please apply.

 net/sched/act_api.c |   18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)


[PATCH 1/3] [PKT_SCHED]: Fix illegal memory dereferences when dumping actions

2006-07-04 Thread Thomas Graf

The TCA_ACT_KIND attribute is used without checking its
availability when dumping actions therefore leading to a
value of 0x4 being dereferenced.

The use of strcmp() in tc_lookup_action_n() isn't safe
when fed with string from an attribute without enforcing
proper NUL termination.

Both bugs can be triggered with a malformed netlink message
and don't require any privileges.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6.git/net/sched/act_api.c
===
--- net-2.6.git.orig/net/sched/act_api.c
+++ net-2.6.git/net/sched/act_api.c
@@ -776,7 +776,7 @@ replay:
 	return ret;
 }
 
-static char *
+static struct rtattr *
 find_dump_kind(struct nlmsghdr *n)
 {
 	struct rtattr *tb1, *tb2[TCA_ACT_MAX+1];
@@ -804,7 +804,7 @@ find_dump_kind(struct nlmsghdr *n)
 		return NULL;
 	kind = tb2[TCA_ACT_KIND-1];
 
-	return (char *) RTA_DATA(kind);
+	return kind;
 }
 
 static int
@@ -817,16 +817,15 @@ tc_dump_action(struct sk_buff *skb, stru
 	struct tc_action a;
 	int ret = 0;
 	struct tcamsg *t = (struct tcamsg *) NLMSG_DATA(cb->nlh);
-	char *kind = find_dump_kind(cb->nlh);
+	struct rtattr *kind = find_dump_kind(cb->nlh);
 
 	if (kind == NULL) {
 		printk("tc_dump_action: action bad kind\n");
 		return 0;
 	}
 
-	a_o = tc_lookup_action_n(kind);
+	a_o = tc_lookup_action(kind);
 	if (a_o == NULL) {
-		printk("failed to find %s\n", kind);
 		return 0;
 	}
 
@@ -834,7 +833,7 @@ tc_dump_action(struct sk_buff *skb, stru
 	a.ops = a_o;
 
 	if (a_o->walk == NULL) {
-		printk("tc_dump_action: %s !capable of dumping table\n", kind);
+		printk("tc_dump_action: %s !capable of dumping table\n", a_o->kind);
 		goto rtattr_failure;
 	}
 

--
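
The strcmp() hazard described in the changelog can be illustrated in plain
userspace C. The sketch below is not the kernel's helper and does not use the
rtattr API; struct attr_payload and attr_name_equals() are hypothetical
stand-ins for "an attribute payload with a known length". The idea is simply
to never read past the length the message declares, whether or not the sender
included a terminating NUL.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an attribute payload: raw bytes plus a length. */
struct attr_payload {
	const char *data;
	size_t len;
};

/*
 * Return 1 if the payload holds exactly the string `name`, 0 otherwise.
 * The payload may or may not be NUL terminated; at most a->len bytes
 * of it are ever examined.
 */
static int attr_name_equals(const struct attr_payload *a, const char *name)
{
	const char *nul;
	size_t n;

	if (a == NULL || a->data == NULL)
		return 0;

	/* Length of the string inside the payload, bounded by a->len. */
	nul = memchr(a->data, '\0', a->len);
	n = nul ? (size_t)(nul - a->data) : a->len;

	return n == strlen(name) && memcmp(a->data, name, n) == 0;
}
```

A plain strcmp() on the same payload would keep reading until it happened to
find a zero byte, which for an unterminated attribute lies somewhere outside
the message.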


[PATCH 2/3] [PKT_SCHED]: Return ENOENT if action module is unavailable

2006-07-04 Thread Thomas Graf

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6.git/net/sched/act_api.c
===
--- net-2.6.git.orig/net/sched/act_api.c
+++ net-2.6.git/net/sched/act_api.c
@@ -305,6 +305,7 @@ struct tc_action *tcf_action_init_1(stru
goto err_mod;
}
 #endif
+   *err = -ENOENT;
goto err_out;
}
 

--


Re: RDMA will be reverted

2006-07-04 Thread Andy Gay

On Tue, 2006-07-04 at 22:47 +0200, Andi Kleen wrote:
  So perhaps there's a good argument to make that a Linux system with the
  right hardware could be considered a core device. Likely any place you
  have such a system it would be dedicated to just moving data as well as
  possible, and let other systems do the other stuff. You wouldn't want
  your server farm systems to also be your firewalls.
 
 Why not? While Linux firewall performance is not flawless its problems
 (e.g. slow conntrack) seems to be mostly in an area where TOE cannot
 do much about.
No doubt you *can* do this, but would you want to?
My point wasn't really about performance here, more that systems needing
this level of performance (server farm is just an example) will probably
be on an 'inside' network with firewalling being done elsewhere (at the
access layer, to use the Cisco paradigm). It's just not good design to
attach such systems directly to an untrusted network, IMHO. So these
systems just don't need netfilter capabilities.

 
  Bottom line - these technologies seem to me to have a place in a well
  designed network.
 
 I think there is a web page listing why it's bad, but here 
 a quick summary:
 
 One worry is to debug it all together. Currently we have a single stack
 to debug, although it's already difficult to control the complexity as it 
 grows more bells and whistles.
 
 Just take a look at Cisco IOS release notes to see how hard
 and difficult it is to get it all to work together.
No argument there!

 
 Another reason is that there are general doubts that TOE can
 keep up with the ever growing performance of CPUs. Even if Linux
 added it today it would be likely slower again a few months later.
 That is also a big difference to Cisco hardware. Linux usually
 runs on fast main CPUs (or if you run it on slow CPUs you normally
 don't expect the best network performance). And they get faster
 and faster constantly.
 
 Admittedly 10GB NICs are still a bit too fast for
 mainstream systems, but that seems to be mostly a problem
 outside the CPUs and it looks like the next generation
 of systems will catch up with enough bandwidth in this area.
 
 Also it tends to accelerate the wrong thing. On a lot of workloads
 the main problem is keeping a lot of different connections under 
 control, and TOE tends to be slow at keeping connection
 information synchronized with the host.
 
 That is why the Linux strategy has been to ask for useful stateless offloads
 instead. Examples of this are checksum offload (long time classic), TSO (TCP
 segmentation offload), UFO (UDP fragmentation offload), Intel iOAT (memcpy
 offload), and RX hashing with MSI-X (not implemented yet, but basically
 it allows load balancing of incoming streams across CPUs).
 
 Note that all these are more or less stateless offloads.
 
 iWARP is not clear yet what it is. From the meager bits of information
 about it that reached netdev so far it at least sounds it does RDMA and needs 
 far more state than any of the other offloads we got so far and likely
 got the usual TOE scaling issues. It's also likely on the wrong side 
 of Moore's law.
 
 -Andi


Re: [RFC] change netdevice to use struct device instead of struct class_device

2006-07-04 Thread Greg KH

On Mon, Jul 03, 2006 at 06:57:47PM -0700, David Miller wrote:
 From: Greg KH [EMAIL PROTECTED]
 Date: Mon, 3 Jul 2006 16:16:10 -0700

  No, not really.  According to Documentation/ABI/testing/sysfs-class all
  code that uses /sys/class/foo/ needs to be able to handle the fact that
  those entries might be symlinks and not just directories.  Everything
  that I know of already works properly because the input layer has had
  symlinks in /sys/class/input for quite some time now.

  Do you know of any tools that use /sys/class/net/ that can not handle
  symlinks there?  I've been running this on my boxes for about a week now
  with no noticeable issues.  Renaming interfaces works just fine too.

 I do not think this change will cause any problems.

Great, thanks for looking.

Do you mind if I keep this in my tree, due to the dependencies on the
other driver core changes?

thanks,

greg k-h
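
For tool authors wondering what "handle symlinks" means in practice, here is a
small userspace sketch. It assumes only POSIX stat(); the helper name
is_dir_following_links() is invented for this example. The point is that a
tool walking /sys/class/net should ask what an entry resolves to, rather than
checking whether the entry itself is a directory (which is what lstat() or a
d_type check would report).

```c
#include <sys/stat.h>

/*
 * Return 1 if `path` is a directory or a symlink that resolves to one,
 * 0 otherwise (including when the path does not exist).
 * Hypothetical helper for illustration, not part of any existing tool.
 */
static int is_dir_following_links(const char *path)
{
	struct stat st;

	/* stat() follows symlinks; lstat() would describe the link itself. */
	if (stat(path, &st) != 0)
		return 0;

	return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

A tool written this way gives the same answer for an entry such as
/sys/class/net/eth0 whether that entry is a real directory or a symlink into
the device tree.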

[no subject]

2006-07-04 Thread Neal Sidhwaney


subscribe linux-netdev
---

Re: RDMA will be reverted


 My point wasn't really about performance here, more that systems needing
 this level of performance (server farm is just an example) will probably
 be on an 'inside' network with firewalling being done elsewhere (at the
 access layer, to use the Cisco paradigm). It's just not good design to
 attach such systems directly to an untrusted network, IMHO. So these
 systems just don't need netfilter capabilities.

Don't think of the highend. It is exotic and rare.

Think of the ordinary single linux box somewhere at a rackspace provider which 
represents the majority of Linux boxes around. 

With a not too skilled admin who mostly uses the default settings of his 
configuration.
For that running firewalling on the same box makes a lot of sense.

Normally it is not that loaded and it doesn't matter much how it performs,
but it might be occasionally slashdotted and then it should still hold up.

BTW basic firewalling is not really that bad as long as you don't have too many
rules. Mostly conntrack is painful right now. I'm sure at some point it will
be fixed too.

-Andi

Re: RDMA will be reverted

2006-07-04 Thread Andy Gay

On Wed, 2006-07-05 at 01:01 +0200, Andi Kleen wrote:
  My point wasn't really about performance here, more that systems needing
  this level of performance (server farm is just an example) will probably
  be on an 'inside' network with firewalling being done elsewhere (at the
  access layer, to use the Cisco paradigm). It's just not good design to
  attach such systems directly to an untrusted network, IMHO. So these
  systems just don't need netfilter capabilities.
 
 Don't think of the highend. It is exotic and rare.
Sure. But isn't the high end exactly where these new technologies are
intended to fit?

 
 Think of the ordinary single linux box somewhere at a rackspace provider 
 which 
 represents the majority of Linux boxes around. 
How many of those need 10G nics?

 
 With a not too skilled admin who mostly uses the default settings of his 
 configuration.
 For that running firewalling on the same box makes a lot of sense.
Yup. I run a few of those. And I run firewalls on them. But they're on
1.5M T1 pipes at best.
I probably fit into your 'not too skilled' category, too :) 

 
 Normally it is not that loaded and it doesn't matter much how it performs,
 but it might be occasionally slashdotted and then it should still hold up.
 
 BTW basic firewalling is not really that bad as long as you don't have too 
 many
 rules. Mostly conntrack is painful right now. I'm sure at some point it will
 be fixed too.
Actually, I wasn't aware of any pain with conntrack, it works great for
me. But like I said, I don't run any real high speed connections.

We're focusing on netfilter here. Is breaking netfilter really the only
issue with this stuff? I know you mentioned some other concerns (about
TOE specifically), they were really scalability things though weren't
they - like you're not convinced this really solves any performance
issues long term. I'm certainly not qualified to discuss that, hopefully
some of the others will weigh in here.

 
 -Andi

Re: [PATCH 0/2] NET: Accurate packet scheduling for ATM/ADSL

2006-07-04 Thread Patrick McHardy

jamal wrote:
 On Tue, 2006-04-07 at 15:29 +0200, Patrick McHardy wrote:
 
Russell Stuart wrote:
 
 [..]
 
Without seeing your actual proposal it is difficult to
judge whether this is a reasonable trade-off or not.
Hopefully we will see your code soon.  Do you have any
idea when?

Unfortunately I still haven't gotten around to cleaning them up, so I'm
sending them in their preliminary state. Not much is missing, but
the netem usage of skb->cb needs to be integrated better; I failed
to move it to the qdisc_skb_cb so far because of circular includes.
But nothing unfixable. I'm mostly interested in whether the current
size-tables can express what you need for ATM; I wasn't able to understand
the big comment in tc_core.c in your patch.

 
 
 Looks good from within the range of change within reason of addressed
 problem. The cb on the qdisc seems only usable for netem, correct?

Yes, it has the same limitations as the current netem cb usage. Really
making it usable for all qdiscs would require reserving a few bytes
for every level; so far that isn't necessary, and I would prefer to
just add a time_to_send field for netem. The problem with this is
that it currently requires sch_generic.h and pkt_sched.h to include
one another, so I did the qdisc_skb_cb() thing to at least get it to
compile for now.

 Also while not unreasonable, i wasnt sure how qdisc_enqueue_root()
 fit in the grand scheme of things for this change (it seemed out of
 place).

It's there as a spot to do the initial time calculations and store them
in the cb. I didn't want to put this in net/core/dev.c.
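
For readers who have not met the cb idea before, the following userspace
sketch shows the general pattern being discussed: a packet carries a small,
fixed-size scratch area, and one layer stores its private per-packet data
there. Everything here is a stand-in invented for illustration: struct
fake_skb, struct sched_cb, CB_SIZE and the accessor names are not the kernel
definitions, and the 48-byte size is an assumption. Only the time_to_send
field name is taken from the discussion above.

```c
#include <stdint.h>
#include <string.h>

#define CB_SIZE 48	/* assumed scratch size, for illustration only */

/* Hypothetical packet with a fixed-size private scratch area. */
struct fake_skb {
	unsigned char cb[CB_SIZE];
};

/* Hypothetical per-packet scheduler data stored in the scratch area. */
struct sched_cb {
	uint64_t time_to_send;
};

/* Fail the build, not the machine, if the private data outgrows cb[]. */
_Static_assert(sizeof(struct sched_cb) <= CB_SIZE,
	       "sched_cb does not fit in cb[]");

/* Store the send time in the packet's scratch area. */
static void cb_set_time(struct fake_skb *skb, uint64_t t)
{
	struct sched_cb cb = { .time_to_send = t };

	memcpy(skb->cb, &cb, sizeof(cb));
}

/* Read the send time back out of the scratch area. */
static uint64_t cb_get_time(const struct fake_skb *skb)
{
	struct sched_cb cb;

	memcpy(&cb, skb->cb, sizeof(cb));
	return cb.time_to_send;
}
```

The sketch also shows the limitation mentioned above: there is one scratch
area per packet, so two nested layers that both want to keep data in it would
need to agree on how to divide the bytes between them.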


Re: RDMA will be reverted


  Think of the ordinary single linux box somewhere at a rackspace provider 
  which 
  represents the majority of Linux boxes around. 
 How many of those need 10G nics?

Most of them already have gigabit. At some point they will have 10G too.

Admittedly the iThingy under discussion here seems to be Infiniband only which
will probably not appear in such a use case.

 We're focusing on netfilter here. Is breaking netfilter really the only
 issue with this stuff?

Another concern is that it will just not be able to keep 
up with a high rate of new connections or a high number of them
(because the hardware has too limited state)

And then there are the other issues I listed like subtle TCP bugs
(TSO is already a nightmare in this area and it's still not quite
right) etc. 

 I know you mentioned some other concerns (about 
 TOE specifically), they were really scalability things though weren't
 they 

There was more than just scalability. Reread it.

Anyways the thread is already getting off topic - i'm not actually
that much interested in a generic TOE discussion because the issue
is pretty much settled already with broad consensus. You can refer
to the netdev archives or the respective web pages if you want more
details.

It would need someone who can describe how this new RDMA device avoids
all the problems, but so far its advocates don't seem to be interested
in doing that and I cannot contribute more.

-Andi

possible dos / wsize affected frozen connection length (was: Re: 2.6.17.1: fails to fully get webpage)

2006-07-04 Thread CaT

On Fri, Jun 30, 2006 at 08:50:39AM +1000, CaT wrote:
 Another datapoint to this is that I've had this my netcat web test
 running since 8:42pm yesterday. It's 8:37am now. It hasn't progressed
 in any way. It hasn't quit. It hasn't timed out. It just sits there,
 hung. This leads me to consider the possibility of a DOS, either
 intentional or accidental (think about 2.6.17.x running on a mail server
 and someone mails/spams from a broken place).

I'm just wondering if connections hanging around this long are normal.
The above has now been running for 6 days. netstat is still reporting an
established session. netcat has not timed out. It's all just sitting
there doing nothing.

-- 
To the extent that we overreact, we proffer the terrorists the
greatest tribute.
- High Court Judge Michael Kirby

Re: [PATCH 1/3] [PKT_SCHED]: Fix illegal memory dereferences when dumping actions

On Wed, 2006-05-07 at 00:00 +0200, Thomas Graf wrote:
 plain text document attachment (act_fix_dump_null_deref)
 The TCA_ACT_KIND attribute is used without checking its
 availability when dumping actions therefore leading to a
 value of 0x4 being dereferenced.
 
 The use of strcmp() in tc_lookup_action_n() isn't safe
 when fed with string from an attribute without enforcing
 proper NUL termination.
 
 Both bugs can be triggered with malformed netlink message
 and don't require any privileges.
 
 Signed-off-by: Thomas Graf [EMAIL PROTECTED]
 

Good catch.

Acked-by: Jamal Hadi Salim [EMAIL PROTECTED]


cheers,
jamal


Re: [PATCH 2/3] [PKT_SCHED]: Return ENOENT if action module is unavailable

On Wed, 2006-05-07 at 00:00 +0200, Thomas Graf wrote:
 plain text document attachment (act_fix_init_ret_val)
 Signed-off-by: Thomas Graf [EMAIL PROTECTED]
 
 Index: net-2.6.git/net/sched/act_api.c
 ===
 --- net-2.6.git.orig/net/sched/act_api.c
 +++ net-2.6.git/net/sched/act_api.c
 @@ -305,6 +305,7 @@ struct tc_action *tcf_action_init_1(stru
   goto err_mod;
   }
  #endif
 + *err = -ENOENT;
   goto err_out;
   }
  

Ok, this falls under the LinuxWay(tm). Quick inspection of the qdisc
code reveals the same bug. The cls side seems fine - but i didnt spend
more than 30 secs. So why dont you fix the qdisc one while you are at
it?

Acked-by: Jamal Hadi Salim [EMAIL PROTECTED]


Re: [PATCH 3/3] [PKT_SCHED]: Fix error handling while dumping actions


I need to stare at this one for longer than 1 minute and I don't have
time right now; it does look strange (I am unsure what my thoughts were
at that point with -err, or maybe that was a change made by someone
else).
I don't have time until tomorrow, but I would think the better fix would
be to change return -err to return -1?

cheers,
jamal 

On Wed, 2006-05-07 at 00:00 +0200, Thomas Graf wrote:
 plain text document attachment (act_fix_dump_err_handling)
 Using return -err and blindly inheriting the error code in the netlink
 failure exception handler causes error codes to be returned as
 positive values, so the caller ignores them.
 
 May lead to sending out incomplete netlink messages.
 
 Signed-off-by: Thomas Graf [EMAIL PROTECTED]
 
 
 Index: net-2.6.git/net/sched/act_api.c
 ===
 --- net-2.6.git.orig/net/sched/act_api.c
 +++ net-2.6.git/net/sched/act_api.c
 @@ -250,15 +250,17 @@ tcf_action_dump(struct sk_buff *skb, str
  RTA_PUT(skb, a->order, 0, NULL);
  err = tcf_action_dump_1(skb, a, bind, ref);
  if (err < 0)
 - goto rtattr_failure;
 + goto errout;
  r->rta_len = skb->tail - (u8*)r;
  }
  
  return 0;
  
  rtattr_failure:
 + err = -EINVAL;
 +errout:
  skb_trim(skb, b - skb->data);
 - return -err;
 + return err;
  }
  
  struct tc_action *tcf_action_init_1(struct rtattr *rta, struct rtattr *est,
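
The sign problem in the quoted changelog can be reproduced in a few lines of
userspace C. This is only a sketch of the convention, with made-up function
names (dump_one, dump_all_buggy, dump_all_fixed, caller_sees_failure); it is
not the act_api.c code. It assumes the usual rule that a negative return value
means failure and anything else means success.

```c
#include <errno.h>

/* Hypothetical inner step: 0 on success, negative errno on failure. */
static int dump_one(int ok)
{
	return ok ? 0 : -EINVAL;
}

/* Buggy pattern: negating an already negative code makes it positive. */
static int dump_all_buggy(int ok)
{
	int err = dump_one(ok);

	if (err < 0)
		return -err;
	return 0;
}

/* Fixed pattern: pass the negative error code through unchanged. */
static int dump_all_fixed(int ok)
{
	int err = dump_one(ok);

	if (err < 0)
		return err;
	return 0;
}

/* Hypothetical caller that treats only negative values as failure. */
static int caller_sees_failure(int ret)
{
	return ret < 0;
}
```

With the buggy variant the caller receives a positive EINVAL, treats it as
success, and carries on with a partially built message, which matches the
"incomplete netlink messages" symptom described in the changelog.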
 

Re: [PATCH 1/3] [PKT_SCHED]: Fix illegal memory dereferences when dumping actions