Re: new NAPI interface broken

2007-10-17 Thread Christoph Raisch

Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote on 16.10.2007
11:01:49:

>
>
> Christoph, have any of you tried it on powerpc ?

No we didn't try this (yet).
This approach makes a lot of sense.

Why is this not installed by both large distros on PPC by default?
how mature is this for larger SMPs on PPC?

>
> Cheers,
> Ben.
>
>
Gruss / Regards
Christoph R.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread Anton Blanchard
 
Hi,

> As far as I know, the x86 in-kernel thingy doesn't but yeah, the
> userland one seems much more evolved and does things based on the
> "class" of the device.
> 
> Christoph, have any of you tried it on powerpc ?

FYI It works fine on PowerPC and its installed by default on some
distros (eg RHEL5).

Anton
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread Arjan van de Ven

David Miller wrote:

From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
Date: Tue, 16 Oct 2007 18:28:56 +1000


Allright, so that's an out of tree userland thingy... (which may well
work on ppc too I suppose). Definitely not installed by default by my
distro so IRQs from the network cards on all x86's using ubuntu gutsy at
least are spread to all CPUs :-)


But the thing does treat network interfaces differently from
other devices.


and it works on various architectures
The in-kernel x86 thing is going away (it's actually highly 
suboptimal)


Yes it's done in userland, this is the right place so far out of 
tree... well... there's no good place for such userspace tools in the 
kernel tree currently otherwise I'd love to have it there.
If your distro doesn't install this by default, please file a bug 
against the distro; I know we (Intel) worked with Fedora, RHEL, SuSE 
and Ubuntu to get it included maybe others don't?

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread Benjamin Herrenschmidt

On Tue, 2007-10-16 at 01:31 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
> Date: Tue, 16 Oct 2007 18:28:56 +1000
> 
> > Allright, so that's an out of tree userland thingy... (which may well
> > work on ppc too I suppose). Definitely not installed by default by my
> > distro so IRQs from the network cards on all x86's using ubuntu gutsy at
> > least are spread to all CPUs :-)
> 
> But the thing does treat network interfaces differently from
> other devices.
> 
> Arjan ran over to my table at kernel summit when I presented this
> topic to the audience, to remind me of all of this and he should be
> included in on any discussions about this topic. :-)
> 
> I've CC:'d him.

As far as I know, the x86 in-kernel thingy doesn't but yeah, the
userland one seems much more evolved and does things based on the
"class" of the device.

Christoph, have any of you tried it on powerpc ?

Cheers,
Ben.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread David Miller
From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
Date: Tue, 16 Oct 2007 18:28:56 +1000

> Allright, so that's an out of tree userland thingy... (which may well
> work on ppc too I suppose). Definitely not installed by default by my
> distro so IRQs from the network cards on all x86's using ubuntu gutsy at
> least are spread to all CPUs :-)

But the thing does treat network interfaces differently from
other devices.

Arjan ran over to my table at kernel summit when I presented this
topic to the audience, to remind me of all of this and he should be
included in on any discussions about this topic. :-)

I've CC:'d him.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread Benjamin Herrenschmidt

On Tue, 2007-10-16 at 00:44 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
> Date: Tue, 16 Oct 2007 17:29:47 +1000
> 
> > Do you have any pointer to how that is done on x86 or sparc64 ?
> 
> Sparc64 does it statically in the kernel.
> 
> For x86, see http://irqbalance.org/

Allright, so that's an out of tree userland thingy... (which may well
work on ppc too I suppose). Definitely not installed by default by my
distro so IRQs from the network cards on all x86's using ubuntu gutsy at
least are spread to all CPUs :-)

There's also a balance kernel thread in x86 with has a notion of irq
that isn't moveable but that flag isn't set by any driver.

Cheers,
Ben.



-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread David Miller
From: Benjamin Herrenschmidt <[EMAIL PROTECTED]>
Date: Tue, 16 Oct 2007 17:29:47 +1000

> Do you have any pointer to how that is done on x86 or sparc64 ?

Sparc64 does it statically in the kernel.

For x86, see http://irqbalance.org/
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread Benjamin Herrenschmidt
> So the powerpc platform just honors the affinity mask, and depending on
> the PIC does things that range from nothing to spreading interrupts to
> CPUs in the affinity mask.
> 
> All interrupts by defaults are spread to all CPUs (full balancing).
> 
> At this stage, it's afaik userland business to enforce different
> policies by changing the affinities via /proc/irq/*.
> 
> Do you have any pointer to how that is done on x86 or sparc64 ? On my
> x86 laptop using ubuntu gutsy, I definitely see the IRQ on which the
> network card is connected (e1000) happily spread between the 2 cores
> just like powerpc would do.

More specifically, IRQF_NOBALANCING doesn't seem to be set anywhere
except a few arch specific timer interrupts etc... nowhere I can see in
network drivers or the network stack (the stack wouldn't know what IRQ
anyway since not all drivers set netdev->irq).

We currently don't have a balance kthread like x86 has, though I wonder
if we should move this one out of x86 and make it generic (hell, it's
even hidden in the IO_APIC code :-) But at this stage, HW balancing by
the PIC is the norm and seems to be happening.

Ben.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-10-16 Thread Benjamin Herrenschmidt
Jumping on an old train ...

On Wed, 2007-09-12 at 05:50 -0700, David Miller wrote:
> 
> > This would mean we have a problem on all SMP machines right now.
> 
> This is not a correct statement.
> 
> Only on your platform do network device interrupts get moved
> around, no other platform does this.
> 
> Sparc64 doesn't, all interrupts stay in one location after
> the cpu is initially choosen.
> 
> x86 and x86_64 specifically do not move around network
> device interrupts, even though other device types do
> get dynamic IRQ cpu distribution.
> 
> That's why you are the only person seeing this problem.
> 
> I agree that it should be fixed, but we should also fix the IRQ
> distribution scheme used on powerpc platforms which is totally
> broken in these cases.

So the powerpc platform just honors the affinity mask, and depending on
the PIC does things that range from nothing to spreading interrupts to
CPUs in the affinity mask.

All interrupts by defaults are spread to all CPUs (full balancing).

At this stage, it's afaik userland business to enforce different
policies by changing the affinities via /proc/irq/*.

Do you have any pointer to how that is done on x86 or sparc64 ? On my
x86 laptop using ubuntu gutsy, I definitely see the IRQ on which the
network card is connected (e1000) happily spread between the 2 cores
just like powerpc would do.

Cheers,
Ben.


-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-19 Thread Jan-Bernd Themann
On Wednesday 19 September 2007 17:33, Roland Dreier wrote:
>  > One other thing I observed is that I can not unload the module as the
>  > ref counter of the eth device is too low. I haven't tracked down the 
>  > source of this problem yet.
> 
> I suspect that this is because netif_rx_reschedule() was missing a
> dev_hold() to match the dev_put() in netif_rx_complete().  Dave merged
> a fix for that problem yesterday, so current net-2.6.24 should be OK.
> Let us know if it's not OK, because then there's another bug...
> 
>  - R.
> 

When I applied this netif_rx_reschedule fix this problem did not occur
anymore (module could be unloaded). So I guess I hit the problem you
described.

Thanks,
Jan-Bernd
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-19 Thread Roland Dreier
 > One other thing I observed is that I can not unload the module as the
 > ref counter of the eth device is too low. I haven't tracked down the 
 > source of this problem yet.

I suspect that this is because netif_rx_reschedule() was missing a
dev_hold() to match the dev_put() in netif_rx_complete().  Dave merged
a fix for that problem yesterday, so current net-2.6.24 should be OK.
Let us know if it's not OK, because then there's another bug...

 - R.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-18 Thread David Miller
From: Jan-Bernd Themann <[EMAIL PROTECTED]>
Date: Tue, 18 Sep 2007 18:15:45 +0200

> One other thing I observed is that I can not unload the module as the
> ref counter of the eth device is too low. I haven't tracked down the 
> source of this problem yet.

This is probably because of the resched device refcounting
bug that Roland Dreier just posted a fix for.

Look for subject "Fix refcounting problem with netif_rx_reschedule()"

I merge that in shortly to net-2.6.24 as well.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-18 Thread Jan-Bernd Themann
Hi,

On Saturday 15 September 2007 00:12, David Miller wrote:
> Ok, for now I'm going to try and deal with this by reverting
> the list handling to something approximating the old NAPI
> code, as per the patch below.
> 
> I've only quickly test booted into this kernel on my workstation.
> So take care when trying it out.
> 
> I took the liberty to add some assertions and comments all over
> wrt. list and quota handling.  One thing to note that (and this
> was true previously too) that if ->poll() uses the whole quota
> it _MUST_ _NOT_ modify the NAPI state by doing a complete or
> reschedule.  The net_rx_action code still owns the NAPI state in
> that case, and therefore is allowed to make modifications to the
> ->poll_list.

I'm currently updating the eHEA polling function to work with this
scheme. Up to now we had sort of a quota for or TX refill part as well,
which was also done in the poll function. This was possible as the device
could be scheduled again even if the quota has not been used
completely. If I got it right this is not longer possible. 
The idea was to benefit from the same "fairness" scheme of NAPI as it is
done for the RX side.

One other thing I observed is that I can not unload the module as the
ref counter of the eth device is too low. I haven't tracked down the 
source of this problem yet.

Regards,
Jan-Bernd
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-14 Thread David Miller
From: Jan-Bernd Themann <[EMAIL PROTECTED]>
Date: Fri, 7 Sep 2007 11:37:02 +0200

> Its about the question who inserts and removes devices from the poll list.
> 
> netif_rx_schedule: sets NAPI_STATE_SCHED flag, insert device in poll list.
> netif_rx_complete: clears NAPI_STATE_SCHED
> netif_rx_reschedule: sets NAPI_STATE_SCHED, insert device in poll list.
> net_rx_action: 
>  -removes dev from poll list
>  -calls poll function
>  -adds dev to poll list if NAPI_STATE_SCHED still set

Ok, for now I'm going to try and deal with this by reverting
the list handling to something approximating the old NAPI
code, as per the patch below.

I've only quickly test booted into this kernel on my workstation.
So take care when trying it out.

I took the liberty to add some assertions and comments all over
wrt. list and quota handling.  One thing to note that (and this
was true previously too) that if ->poll() uses the whole quota
it _MUST_ _NOT_ modify the NAPI state by doing a complete or
reschedule.  The net_rx_action code still owns the NAPI state in
that case, and therefore is allowed to make modifications to the
->poll_list.

I realize this adds a lot of IRQ enable/disable overhead back
into the code, but we can't get rid of that cleanly without
solving this list ownership and modification issue properly.

Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0106fa6..9837ed3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -284,7 +284,14 @@ extern int __init netdev_boot_setup(char *str);
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
 struct napi_struct {
+   /* The poll_list must only be managed by the entity which
+* changes the state of the NAPI_STATE_SCHED bit.  This means
+* whoever atomically sets that bit can add this napi_struct
+* to the per-cpu poll_list, and whoever clears that bit
+* can remove from the list right before clearing the bit.
+*/
struct list_headpoll_list;
+
unsigned long   state;
int weight;
int quota;
@@ -336,13 +343,21 @@ static inline void napi_schedule(struct napi_struct *n)
  *
  * Mark NAPI processing as complete.
  */
-static inline void napi_complete(struct napi_struct *n)
+static inline void __napi_complete(struct napi_struct *n)
 {
BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
+   list_del(&n->poll_list);
smp_mb__before_clear_bit();
clear_bit(NAPI_STATE_SCHED, &n->state);
 }
 
+static inline void napi_complete(struct napi_struct *n)
+{
+   local_irq_disable();
+   __napi_complete(n);
+   local_irq_enable();
+}
+
 /**
  * napi_disable - prevent NAPI from scheduling
  * @n: napi context
@@ -1184,19 +1199,11 @@ static inline u32 netif_msg_init(int debug_value, int 
default_msg_enable_bits)
return (1 << debug_value) - 1;
 }
 
-/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete().
- * Do not inline this?
- */
+/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete().  */
 static inline int netif_rx_reschedule(struct napi_struct *n)
 {
if (napi_schedule_prep(n)) {
-   unsigned long flags;
-
-   local_irq_save(flags);
-   list_add_tail(&n->poll_list,
- &__get_cpu_var(softnet_data).poll_list);
-   __raise_softirq_irqoff(NET_RX_SOFTIRQ);
-   local_irq_restore(flags);
+   __napi_schedule(n);
return 1;
}
return 0;
@@ -1234,7 +1241,7 @@ static inline void netif_rx_schedule(struct net_device 
*dev,
 static inline void __netif_rx_complete(struct net_device *dev,
   struct napi_struct *napi)
 {
-   napi_complete(napi);
+   __napi_complete(napi);
dev_put(dev);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f119dc0..2897352 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2120,6 +2120,13 @@ static int process_backlog(struct napi_struct *napi, int 
quota)
return work;
 }
 
+static void napi_check_quota_bug(struct napi_struct *n)
+{
+   /* It is illegal to consume more than the given quota.  */
+   if (WARN_ON_ONCE(n->quota < 0))
+   n->quota = 0;
+}
+
 /**
  * __napi_schedule - schedule for receive
  * @napi: entry to schedule
@@ -2130,10 +2137,8 @@ void fastcall __napi_schedule(struct napi_struct *n)
 {
unsigned long flags;
 
-   if (n->quota < 0)
-   n->quota += n->weight;
-   else
-   n->quota = n->weight;
+   napi_check_quota_bug(n);
+   n->quota = n->weight;
 
local_irq_save(flags);
list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list);
@@ -2145,32 +2150,36 @@ EXPORT_SYMBOL(__napi_schedule);
 
 static void net_rx_action(struct softirq_ac

Re: new NAPI interface broken for POWER architecture?

2007-09-12 Thread Arnd Bergmann
On Wednesday 12 September 2007, Christoph Raisch wrote:
> David Miller <[EMAIL PROTECTED]> wrote on 12.09.2007 14:50:04:
>
> > I agree that it should be fixed, but we should also fix the IRQ
> > distribution scheme used on powerpc platforms which is totally
> > broken in these cases.
> 
> This is definitely not something we can change in the HEA device driver
> alone.
> It could also affect any other networking cards on POWER (e1000,s2io...).
> 
> Paul, Michael, Arndt, what is your opinion here?

The situation on Cell with the existing south bridge chips is that
interrupts _never_ get moved around, but are routed to specific
SMT threads by the firmware, while Linux does not interfere with
this.

We have been thinking about changing this so we can distribute
interrupts over all SMT threads in a given NUMA node, or even
over all logical CPUs in the system by reprogramming the
interrupt controller after each interrupt, but the current Axon bridge
chip will always have all devices routed to the same target CPU,
so it's unclear whether that is even an advantage.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken for POWER architecture?

2007-09-12 Thread David Miller
From: Christoph Raisch <[EMAIL PROTECTED]>
Date: Wed, 12 Sep 2007 15:10:08 +0200

> This is definitely not something we can change in the HEA device driver
> alone.

And it shouldn't be, x86 implements the policy in irq balance
daemon, powerpc should do it wherever it would be appropriate
there.

> Paul, Michael, Arndt, what is your opinion here?

I'm all ears too :)
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken for POWER architecture?

2007-09-12 Thread Christoph Raisch


David Miller <[EMAIL PROTECTED]> wrote on 12.09.2007 14:50:04:

> From: Jan-Bernd Themann <[EMAIL PROTECTED]>
> Date: Fri, 7 Sep 2007 11:37:02 +0200
>
> > 2) On SMP systems: after netif_rx_complete has been called on CPU1
> >(+interruts enabled), netif_rx_schedule could be called on CPU2
> >(irq handler) before net_rx_action on CPU1 has checked
NAPI_STATE_SCHED.
> >In that case the device would be added to poll lists of CPU1 and
CPU2
> >as net_rx_action would see NAPI_STATE_SCHED set.
> >This must not happen. It will be caught when netif_rx_complete is
> >called the second time (BUG() called)
> >
> > This would mean we have a problem on all SMP machines right now.
>
> This is not a correct statement.
>
> Only on your platform do network device interrupts get moved
> around, no other platform does this.
>
> Sparc64 doesn't, all interrupts stay in one location after
> the cpu is initially choosen.
>
> x86 and x86_64 specifically do not move around network
> device interrupts, even though other device types do
> get dynamic IRQ cpu distribution.
>
> That's why you are the only person seeing this problem.
>
> I agree that it should be fixed, but we should also fix the IRQ
> distribution scheme used on powerpc platforms which is totally
> broken in these cases.

This is definitely not something we can change in the HEA device driver
alone.
It could also affect any other networking cards on POWER (e1000,s2io...).

Paul, Michael, Arndt, what is your opinion here?

Gruss / Regards
Christoph Raisch

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: new NAPI interface broken

2007-09-12 Thread David Miller
From: Jan-Bernd Themann <[EMAIL PROTECTED]>
Date: Fri, 7 Sep 2007 11:37:02 +0200

> 2) On SMP systems: after netif_rx_complete has been called on CPU1
>(+interruts enabled), netif_rx_schedule could be called on CPU2 
>(irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. 
>In that case the device would be added to poll lists of CPU1 and CPU2
>as net_rx_action would see NAPI_STATE_SCHED set.
>This must not happen. It will be caught when netif_rx_complete is
>called the second time (BUG() called)
> 
> This would mean we have a problem on all SMP machines right now.

This is not a correct statement.

Only on your platform do network device interrupts get moved
around, no other platform does this.

Sparc64 doesn't, all interrupts stay in one location after
the cpu is initially choosen.

x86 and x86_64 specifically do not move around network
device interrupts, even though other device types do
get dynamic IRQ cpu distribution.

That's why you are the only person seeing this problem.

I agree that it should be fixed, but we should also fix the IRQ
distribution scheme used on powerpc platforms which is totally
broken in these cases.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


new NAPI interface broken

2007-09-07 Thread Jan-Bernd Themann
Hi Stephen,

I saw that you developed most of the new NAPI interface.
I already addressed this issue a while ago. Please correct me if I got
it wrong. I think there is still a serious problem with the NAPI
changes to make NAPI polling independent of struct net_device objects.
Its about the question who inserts and removes devices from the poll list.

netif_rx_schedule: sets NAPI_STATE_SCHED flag, insert device in poll list.
netif_rx_complete: clears NAPI_STATE_SCHED
netif_rx_reschedule: sets NAPI_STATE_SCHED, insert device in poll list.
net_rx_action: 
 -removes dev from poll list
 -calls poll function
 -adds dev to poll list if NAPI_STATE_SCHED still set

1) netif_rx_complete and netif_rx_reschedule don't work together
2) On SMP systems: after netif_rx_complete has been called on CPU1
   (+interruts enabled), netif_rx_schedule could be called on CPU2 
   (irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. 
   In that case the device would be added to poll lists of CPU1 and CPU2
   as net_rx_action would see NAPI_STATE_SCHED set.
   This must not happen. It will be caught when netif_rx_complete is
   called the second time (BUG() called)

This would mean we have a problem on all SMP machines right now.

If I got all this right then we probably need a further flag to tell
net_rx_action whether to poll again or to stop (with the possibility
that the device has been scheduled on a different CPU in between).
The "old" NAPI interface uses the return value of poll to determine
if the device has to be polled again or not. 
We can either switch back or in case we want to stick to
the new return value, we might have to add something similar to 
the NAPI_STATE_SCHED flag or a new parameter...

Regards,
Jan-Bernd
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html