date:20060724

Re: Patch [PKT_SCHED]: PSCHED_TADD() and PSCHED_TADD2() can result,tv_usec = 1000000 seems wrong

2006-07-24 Thread David Miller

From: Guillaume Chazarain [EMAIL PROTECTED]
Date: Wed, 19 Jul 2006 14:47:34 +0200

> Shuya MAEDA wrote :
> > while (__delta > USEC_PER_SEC){ ... }, but I think it should be
> > while (__delta >= USEC_PER_SEC){ ... }. Is it right?
>
> I agree, good catch :-)

Applied, thanks a lot.
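
The boundary case discussed above can be sketched in plain C. This is only an illustration, not the PSCHED_TADD() macro itself: struct tv_model and tv_add_usec() are hypothetical names, and the sketch assumes a non-negative delta.

```c
#define USEC_PER_SEC 1000000L

/* Hypothetical stand-in for struct timeval, for illustration only. */
struct tv_model {
	long tv_sec;
	long tv_usec;
};

/* Add a non-negative number of microseconds and normalize the result. */
static struct tv_model tv_add_usec(struct tv_model tv, long delta)
{
	long usec = tv.tv_usec + delta;

	/*
	 * ">=" is the point of the fix: with ">" a sum of exactly
	 * 1000000 would be stored in tv_usec without carrying.
	 */
	while (usec >= USEC_PER_SEC) {
		usec -= USEC_PER_SEC;
		tv.tv_sec++;
	}
	tv.tv_usec = usec;
	return tv;
}
```

Adding 1 usec to { 1, 999999 } should then yield { 2, 0 } rather than { 1, 1000000 }.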
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH dscape] d80211: Switch to IEEE80211_ style naming in d80211.h

2006-07-24 Thread Michael Wu

Christoph Hellwig made a comment about how the names in ieee80211.h make more 
sense, and I also agree. This should also make patches for migrating fullmac 
drivers to d80211 smaller.

Hopefully I didn't miss anything in the transition. Patch is bzip2ed since a 
49kb patch seems a little big for inline.

Thanks,
-Michael Wu


switch-to-new-names.patch.bz2
Description: BZip2 compressed data



[NET] initialisation cleanup for ULI526x-net-driver

2006-07-24 Thread Henne

From: Henrik Kretzschmar [EMAIL PROTECTED]

[NET] initialisation cleanup for ULI526x-net-driver

removes the unneeded local variable rc
replace pci_module_init() with pci_register_driver()
two coding style issues on switch

Signed-off-by: Henrik Kretzschmar [EMAIL PROTECTED]

---

diff -ruN linux-2.6.18-rc2-git2/drivers/net/tulip/uli526x.c linux/drivers/net/tulip/uli526x.c
--- linux-2.6.18-rc2-git2/drivers/net/tulip/uli526x.c   2006-07-24 13:58:05.0 +0200
+++ linux/drivers/net/tulip/uli526x.c   2006-07-24 14:21:43.0 +0200
@@ -1702,7 +1702,6 @@
 
 static int __init uli526x_init_module(void)
 {
-	int rc;
 
 	printk(version);
 	printed_version = 1;
@@ -1714,22 +1713,19 @@
 	if (cr6set)
 		uli526x_cr6_user_set = cr6set;
 
-	switch(mode) {
+	switch (mode) {
 	case ULI526X_10MHF:
 	case ULI526X_100MHF:
 	case ULI526X_10MFD:
 	case ULI526X_100MFD:
 		uli526x_media_mode = mode;
 		break;
-	default:uli526x_media_mode = ULI526X_AUTO;
+	default:
+		uli526x_media_mode = ULI526X_AUTO;
 		break;
 	}
 
-	rc = pci_module_init(&uli526x_driver);
-	if (rc < 0)
-		return rc;
-
-	return 0;
+	return pci_register_driver(&uli526x_driver);
 }
 
 


Re: [PATCH dscape] d80211: Switch to IEEE80211_ style naming in d80211.h

2006-07-24 Thread Jiri Benc

On Mon, 24 Jul 2006 01:39:30 -0700, Michael Wu wrote:
> Christoph Hellwig made a comment about how the names in ieee80211.h make more
> sense, and I also agree. This should also make patches for migrating fullmac
> drivers to d80211 smaller.

Could you split the patch to two patches (one for the d80211 stack and
one for drivers) and add some description to both of them? I will take
care of pushing both patches to John then.

Thanks,

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: [PATCH wireless-dev] d80211: Make MACSTR/MAC2STR macro available to drivers

2006-07-24 Thread Jiri Benc

On Sun, 23 Jul 2006 01:43:25 -0700, Michael Wu wrote:
> This patch moves the MACSTR/MAC2STR macros to d80211.h
> so that they are available to drivers. It also converts the adm8211
> and bcm43xx drivers to use this macro.

I really dislike those MACSTR/MAC2STR names. I always fail to remember
which one is which. What about renaming them when we are touching them?

And why not to use MAC_FMT/MAC_ARG names as used in net/ieee80211.h? ;-)

Thanks,

 Jiri

-- 
Jiri Benc
SUSE Labs

Re: wireless-2.6 git repos broken

2006-07-24 Thread John W. Linville

On Sun, Jul 23, 2006 at 11:13:52AM -0500, Larry Finger wrote:
> Michael Buesch wrote:
> > On Sunday 23 July 2006 17:59, Larry Finger wrote:
> > > > Do you have the same problem on other git trees?
> > > > I saw some people running into this error when cloning Linus'
> > > > linux-2.6.git. I couldn't reproduce it, using exactly the same git
> > > > version.
> > > I had the same error when pulling from Linus's tree. It was fixed with a
> > > 'git fsck-objects --full' command.
> >
> > Uhm, that tells me the git tree on kernel.org is broken
> > and John has to run this command, right?
>
> I think so.

Hmmm...well, I'll look into it.  FWIW, I cloned my git tree from
kernel.org onto my laptop while I was at OLS w/o any problems.

John
-- 
John W. Linville
[EMAIL PROTECTED]

Re: [PATCH 0/2] PHYLIB: Fix forcing mode reduction

2006-07-24 Thread Kumar Gala


Jeff,

Any status on accepting this patch? I've got some additional fixes
that are based on having access to genphy_update_link().


- kumar

On Jun 5, 2006, at 6:45 PM, Nathaniel Case wrote:


On Mon, 2006-06-05 at 17:08 -0500, Andy Fleming wrote:

Looks good.  Feel free to send these patches to
netdev@vger.kernel.org (you may need to subscribe), and copy Jeff
Garzik [EMAIL PROTECTED].


This fixes a problem seen when a port without a cable connected would
repeatedly print out "Trying 1000/HALF".  While in the PHY_FORCING
state, the call to phy_read_status() was resetting the value of
phydev->speed and phydev->duplex, preventing it from incrementally
trying the speed/duplex variations.

Since we really just want the link status updated for the PHY_FORCING
state, calling genphy_update_link() instead of phy_read_status() fixes
this issue.
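
The failure mode described above can be modelled in a few lines of C. This is a rough sketch, not phylib code: struct phy_model, force_reduction() and poll_forcing() are hypothetical names, the step-down order is an assumption, and the "full status read" is modelled simply as resetting speed and duplex to their starting values.

```c
enum duplex_model { DUPLEX_HALF_M, DUPLEX_FULL_M };

/* Hypothetical, simplified view of the PHY state. */
struct phy_model {
	int speed;                 /* 1000, 100 or 10 */
	enum duplex_model duplex;
	int link;                  /* 0: no link */
};

/* Step down one setting: 1000/FULL, 1000/HALF, 100/FULL, ... 10/HALF. */
static void force_reduction(struct phy_model *p)
{
	if (p->duplex == DUPLEX_FULL_M) {
		p->duplex = DUPLEX_HALF_M;
	} else if (p->speed == 1000) {
		p->speed = 100;
		p->duplex = DUPLEX_FULL_M;
	} else if (p->speed == 100) {
		p->speed = 10;
		p->duplex = DUPLEX_FULL_M;
	}
	/* 10/HALF is the last setting; nothing left to try. */
}

/* One poll in the forcing state, with no cable attached. */
static void poll_forcing(struct phy_model *p, int full_status_read)
{
	if (full_status_read) {
		/* Models the reset of speed/duplex reported in the patch. */
		p->speed = 1000;
		p->duplex = DUPLEX_FULL_M;
	}
	p->link = 0; /* link-only update: no cable, so link stays down */
	if (!p->link)
		force_reduction(p);
}
```

With the link-only update, three polls walk from 1000/FULL down to 100/HALF; with the full status read, every poll ends at 1000/HALF again, which matches the repeated message in the report.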

Patch tested on a MPC8540 platform with a BCM5421 PHY.

Signed-off-by: Nate Case [EMAIL PROTECTED]
Signed-off-by: Andy Fleming [EMAIL PROTECTED]

---

--- a/drivers/net/phy/phy.c 2006-06-04 16:01:59.0 -0500
+++ b/drivers/net/phy/phy.c 2006-06-05 10:55:31.0 -0500
@@ -767,7 +783,7 @@
 			}
 			break;
 		case PHY_FORCING:
-			err = phy_read_status(phydev);
+			err = genphy_update_link(phydev);
 
 			if (err)
 				break;
--- a/drivers/net/phy/phy_device.c  2006-06-04 16:02:08.0 -0500
+++ b/drivers/net/phy/phy_device.c  2006-06-04 19:12:26.0 -0500
@@ -417,6 +417,7 @@
 
 	return 0;
 }
+EXPORT_SYMBOL(genphy_update_link);
 
 /* genphy_read_status
  *



Re: [PATCH wireless-dev] d80211: Make MACSTR/MAC2STR macro available to drivers

2006-07-24 Thread Michael Wu

On Monday 24 July 2006 06:54, Jiri Benc wrote:
> I really dislike those MACSTR/MAC2STR names. I always fail to remember
> which one is which. What about renaming them when we are touching them?
>
> And why not to use MAC_FMT/MAC_ARG names as used in net/ieee80211.h? ;-)

I dislike MACSTR/MAC2STR too, but I was trying to minimize changes to the 
d80211 code. I'll switch to MAC_FMT/MAC_ARG.

-Michael Wu



Re: [PATCH wireless-dev 0/5] Switch drivers to d80211

2006-07-24 Thread Luis R. Rodriguez


On 7/23/06, Michael Wu [EMAIL PROTECTED] wrote:
> Hi,
> This patch series converts a number of fullmac wireless drivers to use
> d80211.h instead of ieee80211.h.

Nice work

> The remaining drivers are hostap, atmel, zd1211rw and ipw*.

I've been working on zd1211rw, give me a week. Anyone started ipw yet?

 Luis

Re: Help with bugfix for bond active-backup mode + vlans

2006-07-24 Thread Jay Vosburgh

Christophe Devriese [EMAIL PROTECTED] wrote:

> [...]
> Would it be acceptable to have an interface flag IFF_SILENT that can be set on
> an interface, which prevents it from receiving packets in both forwarding
> paths ?

	Starting with kernel version 2.6.17, there is code in skb_bond()
to suppress traffic on inactive slaves, but it looks like that will be
bypassed by hardware accelerated VLAN packets (which, if I'm reading the
code correctly, have their skb->dev directly assigned to the VLAN
interface, so they go into netif_receive_skb with skb->dev not set to
the slave device, which will bypass the stuff in skb_bond).

	An IFF_SILENT type of flag would work fine (if checked in both
input paths) for the active-backup mode, but the 802.3ad and balance-alb
modes need differing types of traffic dropping, e.g., the balance-alb
mode just needs to suppress broadcasts and multicasts.

	One possible solution for this would be to have bonding remove
the vlan registration for inactive slaves, which would cause the errant
packets to pass through skb_bond() normally and presumably be dropped.
That would work for the active-backup case, but would cause balance-alb
mode to lose VLAN acceleration on all interfaces except for one.

	Another possibility would be to have __vlan_hwaccel_rx check the
VLAN_DEV_INFO(skb->dev)->real_dev, and if that's a bonding device, apply
the same logic found in skb_bond().  Or, if there's some way to ask the
question "is dev a VLAN device?", then that same test could be put into
skb_bond() and all of the packet suppression fru fru could stay there.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Re: Problems with sky2 driver.

2006-07-24 Thread Daniel Drake


Todd Showalter wrote:
> I've been having trouble with the sky2 driver.  It appears to work
> most of the time, but it will quite often wedge during transfers.  The
> 2.6.17.* kernels actually seem worse than 2.6.16.19, but none of them
> work perfectly.
>
> What typically happens is that after working perfectly for a while,
> existing net connections hang, and subsequent net connections don't
> seem to start at all.  firefox gets stuck with a bunch of half-loaded
> pages, for instance, and I've watched an scp of a large file to a
> colleague's machine stall and remain stalled.

Please test with the very latest git snapshot. A critical fix was
applied after 2.6.18-rc2 was released.

> Once the machine is behaving this way, a reboot is the only way I
> have found of recovering it.
>
> We have two identical machines here that are both behaving this
> way, so I'm assuming it's not a hardware problem per se.  The machines
> are Intel Pentium D 940 (3GHz) processors.  They have ASUS P5LD2
> motherboards, with builtin Marvell PCIe 88E8053 gigabit ethernet
> controllers.
>
> I'm not running any binary modules; it's an untainted kernel.  I'm
> running a Gentoo system, but I'm using the vanilla-sources kernel (ie:
> a pure kernel.org release, not the Gentoo-specific patched version).
>
> What can I do to help solve this?



Re: Problems with sky2 driver.

2006-07-24 Thread Stephen Hemminger

On Mon, 24 Jul 2006 14:38:39 -0400
Todd Showalter [EMAIL PROTECTED] wrote:

> On Mon, 24 Jul 2006 10:53:03 -0700, Stephen Hemminger
> [EMAIL PROTECTED] wrote:
>
> > There is a receive problem that shows up under load, that is fixed
> > in the latest version (2.6.18 git), the patch is queued for the stable
> > tree as well.
>
> I have hand-patched this in my 2.6.17.6 kernel.  It seems better
> (no hard wedge yet), but there are still definitely problems.
>
> The most obvious place is in firefox; for example, the front page
> of slashdot half-renders (all the borders, no stories) and then sits
> loading for eternity.  Ditto the online package database at
> gentoo.org.  I'm seeing similar behavior with other websites as well.
> It's consistent, too; I haven't been able to view the slashdot front
> page since booting a 2.6.17 kernel.


I suspect that probably isn't a sky2 driver problem.
Does it go away if you turn off TCP window scaling:
sysctl -w net.ipv4.tcp_window_scaling=0

If so, you probably have a middlebox in your path that is not correctly
handling TCP window scaling. OpenBSD seems to be particularly bad.
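
For readers unfamiliar with the option, the arithmetic behind window scaling can be sketched as follows. This is an illustration only; effective_window() is a hypothetical helper, not a kernel function. The 16-bit window field in the TCP header is shifted left by the scale factor negotiated on the SYN packets, so a device in the path that ignores the negotiated shift sees a far smaller window than the endpoints are actually using.

```c
#include <stdint.h>

/*
 * Effective receive window in bytes: the 16-bit header field shifted
 * by the negotiated scale. The shift count is limited to 14 by the
 * TCP window scale option specification.
 */
static uint32_t effective_window(uint16_t advertised, unsigned int shift)
{
	if (shift > 14)
		shift = 14;
	return (uint32_t)advertised << shift;
}
```

With an advertised value of 1024 and a shift of 7 the endpoints agree on a 131072-byte window, while a box that assumes no scaling believes the window is only 1024 bytes. Setting net.ipv4.tcp_window_scaling=0 makes both views agree again.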

 
> If I boot with the 2.6.16.9 kernel, I don't seem to get that
> problem until the network actually hangs.
>
> Todd.
>
> --
>   Todd Showalter
>   Silverbirch Studios


-- 
Stephen Hemminger [EMAIL PROTECTED]
And in the Packet there writ down that doome

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-24 Thread Stephen Hemminger

On Wed, 19 Jul 2006 13:01:50 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Wed, 19 Jul 2006 15:52:04 -0400
>
> > As a related note, I am looking into fixing inet hash tables to use RCU.
>
> IBM had posted a patch a long time ago, which would be not
> so hard to munge into the current tree.  See if you can
> spot it in the archives :)

Srivatsa Vaddagiri from IBM did a patch: http://lkml.org/lkml/2004/8/31/129

And Ben had a patch: http://lwn.net/Articles/174596/

Srivatsa's was more complete but pre-dates Acme's rearrangement.
Also, there is some code for refcnt's in it that looks wrong.
Or at minimum is masking underlying design flaws.

 /* Ungrab socket and destroy it, if it was the last reference. */
 static inline void sock_put(struct sock *sk)
 {
-	if (atomic_dec_and_test(&sk->sk_refcnt))
-		sk_free(sk);
+sp_loop:
+	if (atomic_dec_and_test(&sk->sk_refcnt)) {
+		/* Restore ref count and schedule callback.
+		 * If we don't restore ref count, then the callback can be
+		 * scheduled by more than one CPU.
+		 */
+		atomic_inc(&sk->sk_refcnt);
+
+		if (atomic_read(&sk->sk_refcnt) == 1)
+			call_rcu(&sk->sk_rcu, sk_free_rcu);
+		else
+			goto sp_loop;
+	}
 }

Ben's still left reader writer locks, and needed IPV6 work. He said he
plans to get back to it.

-- 
Stephen Hemminger [EMAIL PROTECTED]
And in the Packet there writ down that doome

Re: Problems with sky2 driver.

2006-07-24 Thread Todd Showalter

On Mon, 24 Jul 2006 11:45:33 -0700, Stephen Hemminger
[EMAIL PROTECTED] wrote:

> I suspect that probably isn't a sky2 driver problem.
> Does it go away if you turn off TCP window scaling:
>   sysctl -w net.ipv4.tcp_window_scaling=0
>
> If so, you probably have a middlebox in your path that is not
> correctly handling TCP window scaling. OpenBSD seems to be
> particularly bad.

This seems to be the case.  The combination of the patch and
shutting off tcp window scaling seems to have fixed the box.  Thanks!

I'll ask around locally about network structure.

   Todd.

--
  Todd Showalter
  Silverbirch Studios

Re: [PATCH] kthread: airo.c

2006-07-24 Thread Sukadev Bhattiprolu

Sukadev Bhattiprolu [EMAIL PROTECTED] wrote:

| Andrew,
| 
| Javier Achirica, one of the major contributors to drivers/net/wireless/airo.c
| took a look at this patch, and doesn't have any problems with it. It doesn't
| fix any bugs and is just a cleanup, so it certainly isn't a candidate
| for this mainline cycle

Here is the same patch, merged up to 2.6.18-rc2. Christoph's patch (see
http://lkml.org/lkml/2006/7/13/332) still applies cleanly on top of this.

-
The airo driver is currently caching a pid for later use, but with the
implementation of containers, pids themselves do not uniquely identify
a task. The driver is also using kernel_thread() which is deprecated in
drivers.

This patch essentially replaces the kernel_thread() with kthread_create().
It also stores the task_struct of the airo_thread rather than its pid.
Since this introduces a second task_struct in struct airo_info, the patch
renames airo_info.task to airo_info.list_bss_task.

As an extension of these changes, the patch further:

 - replaces kill_proc() with kthread_stop()
 - replaces signal_pending() with kthread_should_stop()
 - removes thread completion synchronisation which is handled by
   kthread_stop().

Signed-off-by: Sukadev Bhattiprolu [EMAIL PROTECTED]
Cc: Javier Achirica [EMAIL PROTECTED]
Cc: Christoph Hellwig [EMAIL PROTECTED]
Cc: John Linville [EMAIL PROTECTED]

 drivers/net/wireless/airo.c |   38 +++---
 1 files changed, 15 insertions(+), 23 deletions(-)

Index: linux-2.6.18-rc1-mm2/drivers/net/wireless/airo.c
===
--- linux-2.6.18-rc1-mm2.orig/drivers/net/wireless/airo.c   2006-07-14 14:04:01.0 -0700
+++ linux-2.6.18-rc1-mm2/drivers/net/wireless/airo.c    2006-07-20 19:44:50.0 -0700
@@ -47,6 +47,7 @@
 #include <linux/pci.h>
 #include <asm/uaccess.h>
 #include <net/ieee80211.h>
+#include <linux/kthread.h>
 
 #include "airo.h"
 
@@ -1187,11 +1188,10 @@ struct airo_info {
 			int whichbap);
 	unsigned short *flash;
 	tdsRssiEntry *rssi;
-	struct task_struct *task;
+	struct task_struct *list_bss_task;
+	struct task_struct *airo_thread_task;
 	struct semaphore sem;
-	pid_t thr_pid;
 	wait_queue_head_t thr_wait;
-	struct completion thr_exited;
 	unsigned long expires;
 	struct {
 		struct sk_buff *skb;
@@ -1736,9 +1736,9 @@ static int readBSSListRid(struct airo_in
 		issuecommand(ai, &cmd, &rsp);
 		up(&ai->sem);
 		/* Let the command take effect */
-		ai->task = current;
+		ai->list_bss_task = current;
 		ssleep(3);
-		ai->task = NULL;
+		ai->list_bss_task = NULL;
 	}
 	rc = PC4500_readrid(ai, first ? ai->bssListFirst : ai->bssListNext,
 			    list, ai->bssListRidLen, 1);
@@ -2400,8 +2400,7 @@ void stop_airo_card( struct net_device *
 		clear_bit(FLAG_REGISTERED, &ai->flags);
 	}
 	set_bit(JOB_DIE, &ai->jobs);
-	kill_proc(ai->thr_pid, SIGTERM, 1);
-	wait_for_completion(&ai->thr_exited);
+	kthread_stop(ai->airo_thread_task);
 
 	/*
 	 * Clean out tx queue
@@ -2811,9 +2810,8 @@ static struct net_device *_init_airo_car
 	ai->config.len = 0;
 	ai->pci = pci;
 	init_waitqueue_head (&ai->thr_wait);
-	init_completion (&ai->thr_exited);
-	ai->thr_pid = kernel_thread(airo_thread, dev, CLONE_FS | CLONE_FILES);
-	if (ai->thr_pid < 0)
+	ai->airo_thread_task = kthread_run(airo_thread, dev, dev->name);
+	if (IS_ERR(ai->airo_thread_task))
 		goto err_out_free;
 	ai->tfm = NULL;
 	rc = add_airo_dev( dev );
@@ -2930,8 +2928,7 @@ err_out_unlink:
 	del_airo_dev(dev);
 err_out_thr:
 	set_bit(JOB_DIE, &ai->jobs);
-	kill_proc(ai->thr_pid, SIGTERM, 1);
-	wait_for_completion(&ai->thr_exited);
+	kthread_stop(ai->airo_thread_task);
 err_out_free:
 	free_netdev(dev);
 	return NULL;
@@ -3063,13 +3060,7 @@ static int airo_thread(void *data) {
 	struct airo_info *ai = dev->priv;
 	int locked;
 
-	daemonize("%s", dev->name);
-	allow_signal(SIGTERM);
-
 	while(1) {
-		if (signal_pending(current))
-			flush_signals(current);
-
 		/* make swsusp happy with our thread */
 		try_to_freeze();
 
@@ -3097,7 +3088,7 @@ static int airo_thread(void *data) {
 					set_bit(JOB_AUTOWEP, &ai->jobs);
 					break;
 				}
-				if (!signal_pending(current)) {
+				if (!kthread_should_stop()) {
 					unsigned long wake_at;

Re: linux-2.6.17(.6): bnx2.c:(.text+0xd741e): undefined reference to `crc32_le'

2006-07-24 Thread Michael Chan

On Fri, 2006-07-21 at 05:06 -0700, Toralf Förster wrote:
> Compiling  (an exotic ?) config I got:
>
> ...
>   CC      init/version.o
>   LD      init/built-in.o
>   LD      .tmp_vmlinux1
> drivers/built-in.o: In function `bnx2_set_rx_mode':
> bnx2.c:(.text+0xd741e): undefined reference to `crc32_le'
> drivers/built-in.o: In function `bnx2_test_nvram':
> bnx2.c:(.text+0xd9a5f): undefined reference to `crc32_le'
> bnx2.c:(.text+0xd9a83): undefined reference to `crc32_le'
> make: *** [.tmp_vmlinux1] Error 1
>
 
BNX2 requires the CRC32 library and the current kernels do not have that
dependency in the Kconfig.  This has been fixed and will be in 2.6.18.

For now, you can turn on CONFIG_CRC32 (to y) and that should fix the
problem.
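
For context on the missing symbol: crc32_le() is the kernel library routine for the little-endian (bit-reflected) CRC-32, and it is only built when CONFIG_CRC32 is enabled. A bit-at-a-time user-space sketch of the same calculation is shown below. The name crc32_le_sketch() is hypothetical, and this is an illustration of the algorithm, not the kernel's table-driven implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY_LE 0xedb88320u /* reflected CRC-32 polynomial */

/*
 * Bitwise little-endian CRC-32 over len bytes, starting from the
 * seed crc. Any initial or final inversion is left to the caller.
 */
static uint32_t crc32_le_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ ((crc & 1) ? CRC32_POLY_LE : 0);
	}
	return crc;
}
```

Seeding with 0xffffffff and inverting the result gives the familiar CRC-32 check value 0xcbf43926 for the ASCII string "123456789".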

Thanks.


Re: Repost: Re: [VLAN]: translate IF_OPER_DORMANT to netif_dormant_on()

2006-07-24 Thread David Miller

From: Patrick McHardy [EMAIL PROTECTED]
Date: Wed, 19 Jul 2006 14:42:35 +0200

> Stefan Rompf wrote:
> > [VLAN]: Fix link state propagation
> >
> > When the queue of the underlying device is stopped at initialization time
> > or the device is marked not present, the state will be propagated to the
> > vlan device and never change. Based on an analysis by Patrick McHardy.
>
> ACKed-by: Patrick McHardy [EMAIL PROTECTED]

Applied, and I will queue this up for -stable.
Thanks everyone.

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-24 Thread Alexey Kuznetsov

Hello!

> Also, there is some code for refcnt's in it that looks wrong.

Yes, it is disgusting. RCU does not allow increasing the socket refcnt
in the lookup routine.

Ben's version looks cleaner here, it does not touch refcnt
in rcu lookups. But it is dubious too:

 do_time_wait:
+	sock_hold(sk);

is obviously in violation of the rule. Probably, rcu lookup should do something
like:

	if (!atomic_inc_not_zero(&sk->sk_refcnt))
		pretend_it_is_not_found;

It is clear Ben did not look into IBM patch, because one known place
of trouble is missed: when socket moves from established to timewait,
timewait bucket must be inserted before established socket is removed.
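
The "increment unless already zero" idea suggested above can be sketched in user space with C11 atomics. This is a minimal model under stated assumptions, not the kernel's atomic_inc_not_zero(): refcount_inc_not_zero() here is a hypothetical helper operating on a plain atomic_int.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still live. A count of zero
 * means the last reference is gone and the object is being freed, so
 * a lockless lookup must treat it as "not found".
 */
static bool refcount_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}
```

The compare-and-swap loop is what makes this safe against a concurrent final put: the count can never be raised from zero back to one behind the back of the code that is freeing the object.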

Alexey

Re: [PATCH] SNMPv2 tcpAttemptFails counter error

2006-07-24 Thread David Miller

From: Wei Yongjun [EMAIL PROTECTED]
Date: Wed, 05 Jul 2006 05:19:54 -0400

> In my test, those direct state transition can not be counted to
> tcpAttemptFails. Following is my patch:
>
> Signed-off-by: Wei Yongjun [EMAIL PROTECTED]

This change can be implemented more simply, I believe.

Except for the tcp_minisocks.c change, all the paths
changed go to tcp_done() which is what actually transfers
the state to TCP_CLOSE.  Therefore, tcp_done() can
simply be modified to check if the current state is
TCP_SYN_RECV, and if so bump the counter.

Once you implement it this way, please audit all call paths
to make sure we don't now bump this counter twice.
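
The suggested approach can be modelled as below. This is a sketch only: the enum, struct sock_model, the attempt_fails counter and tcp_done_model() are hypothetical stand-ins, not the real tcp_done() or the SNMP MIB macros.

```c
/* Hypothetical, reduced set of connection states. */
enum tcp_state_model {
	ST_SYN_SENT,
	ST_SYN_RECV,
	ST_ESTABLISHED,
	ST_CLOSE
};

struct sock_model {
	enum tcp_state_model state;
};

/* Stand-in for the tcpAttemptFails counter. */
static unsigned long attempt_fails;

/*
 * Single choke point: count the failed attempt while the old state is
 * still visible, then move to CLOSE. Because the state is changed
 * here, a second call for the same socket cannot bump the counter
 * again.
 */
static void tcp_done_model(struct sock_model *sk)
{
	if (sk->state == ST_SYN_RECV)
		attempt_fails++;
	sk->state = ST_CLOSE;
}
```

Checking the state in one place covers every path that funnels into it, which is why the audit for double counting then only needs to look at callers that also touch the counter themselves.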

Thank you.

Re: Help with bugfix for bond active-backup mode + vlans

2006-07-24 Thread Jay Vosburgh

Ben Greear [EMAIL PROTECTED] wrote:

> Jay Vosburgh wrote:
>
> > 	Another possibility would be to have __vlan_hwaccel_rx check the
> > VLAN_DEV_INFO(skb->dev)->real_dev, and if that's a bonding device, apply
> > the same logic found in skb_bond().  Or, if there's some way to ask the
> > question "is dev a VLAN device?", then that same test could be put into
> > skb_bond() and all of the packet suppression fru fru could stay there.
>
> There is a flag in if.h to denote VLAN devices:

	Thanks, I missed that.

	Sadly, elegance remains elusive, since by the time skb_bond
is called, the slave device the packet arrived on isn't available
(vlan->real_dev points to 'bond0' by this point), and that information
is needed to decide whether to drop the packet or not.

The least grotty solution that comes to mind is to have
__vlan_hwaccel_rx call some skb_bond_suppress_dups() function directly,
and change skb_bond() to also call that function.
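
One way to picture the shared-helper idea is the user-space model below. It is a sketch under assumptions: struct netdev_model and its fields are hypothetical, skb_bond_suppress_dups() is only the name floated above and not an existing kernel function, and only the active-backup rule (drop everything on an inactive slave) is modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a network device. */
struct netdev_model {
	struct netdev_model *master; /* bonding master, or NULL */
	bool slave_inactive;         /* backup slave in active-backup mode */
};

/* Shared decision: should a frame that arrived on dev be dropped? */
static bool skb_bond_suppress_dups(const struct netdev_model *dev)
{
	return dev->master != NULL && dev->slave_inactive;
}

/* Normal receive path: the arrival device is still the slave here. */
static bool rx_normal_drops(const struct netdev_model *slave)
{
	return skb_bond_suppress_dups(slave);
}

/*
 * Accelerated VLAN path: the check has to run with the slave the
 * frame arrived on, before the device pointer is rewritten to the
 * VLAN interface.
 */
static bool rx_vlan_hwaccel_drops(const struct netdev_model *arrival_dev)
{
	return skb_bond_suppress_dups(arrival_dev);
}
```

Both paths then share one rule, so the per-mode policy (for example, dropping only broadcasts and multicasts for balance-alb) would need to be written once.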

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]

Re: Help with bugfix for bond active-backup mode + vlans

2006-07-24 Thread Ben Greear


Jay Vosburgh wrote:
> Ben Greear [EMAIL PROTECTED] wrote:
>
> > Jay Vosburgh wrote:
> >
> > > Another possibility would be to have __vlan_hwaccel_rx check the
> > > VLAN_DEV_INFO(skb->dev)->real_dev, and if that's a bonding device, apply
> > > the same logic found in skb_bond().  Or, if there's some way to ask the
> > > question "is dev a VLAN device?", then that same test could be put into
> > > skb_bond() and all of the packet suppression fru fru could stay there.
> >
> > There is a flag in if.h to denote VLAN devices:
>
> Thanks, I missed that.
>
> Sadly, elegance remains elusive, since by the time skb_bond
> is called, the slave device the packet arrived on isn't available
> (vlan->real_dev points to 'bond0' by this point), and that information
> is needed to decide whether to drop the packet or not.
>
> The least grotty solution that comes to mind is to have
> __vlan_hwaccel_rx call some skb_bond_suppress_dups() function directly,
> and change skb_bond() to also call that function.

Can you use skb->input_dev?

Ben

--
Ben Greear [EMAIL PROTECTED]
Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH] SNMPv2 tcpOutSegs counter error

2006-07-24 Thread David Miller

From: Wei Yongjun [EMAIL PROTECTED]
Date: Thu, 06 Jul 2006 04:01:18 -0400

> -	TCP_INC_STATS(TCP_MIB_OUTSEGS);
> +	if (!(tcb->sacked & TCPCB_LOST))
> +		TCP_INC_STATS(TCP_MIB_OUTSEGS);

This test is not accurate enough.  For example, timer based
retransmits will not set the TCPCB_LOST bit.

I'm tempted to say to pass a flag to tcp_transmit_skb()
which says whether it is a retransmit or not, but that
function already takes way too many arguments.

Re: [PATCH] skge: chip clock rate typo

2006-07-24 Thread Stephen Hemminger

On Mon, 24 Jul 2006 16:34:46 -0500
Larry Finger [EMAIL PROTECTED] wrote:

> Stephen Hemminger wrote:
> > Michael Buesch spotted this typo.  The impact is that the incorrect value
> > was being computed for blinking LED and interrupt moderation values.
> >
> > Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
> >
> > --- skge-2.6.orig/drivers/net/skge.c
> > +++ skge-2.6/drivers/net/skge.c
> > @@ -519,7 +519,7 @@ static inline u32 hwkhz(const struct skg
> > 	if (hw->chip_id == CHIP_ID_GENESIS)
> > 		return 53215; /* or:  53.125 MHz */
> -^   Should this be 53125?
>
> Larry
>
 
yup

[PATCH] skge: chip clock rate typo

2006-07-24 Thread Stephen Hemminger

Okay, fix both typos in one patch. The impact is that the incorrect value
was being computed for blinking LED and interrupt moderation values.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- skge-2.6.orig/drivers/net/skge.c
+++ skge-2.6/drivers/net/skge.c
@@ -516,10 +516,7 @@ static int skge_set_pauseparam(struct ne
 /* Chip internal frequency for clock calculations */
 static inline u32 hwkhz(const struct skge_hw *hw)
 {
-	if (hw->chip_id == CHIP_ID_GENESIS)
-		return 53215; /* or:  53.125 MHz */
-	else
-		return 78215; /* or:  78.125 MHz */
+	return (hw->chip_id == CHIP_ID_GENESIS) ? 53125 : 78125;
 }
 
 /* Chip HZ to microseconds */
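
To get a feel for the size of the error, the effect of the transposed digits on a microseconds-to-clock-ticks conversion can be worked through as below. This is an illustrative sketch: usecs_to_clk() is a hypothetical helper and the driver's own conversion routines may be written differently.

```c
#include <stdint.h>

/* Convert microseconds to chip clock ticks for a clock given in kHz. */
static uint32_t usecs_to_clk(uint32_t khz, uint32_t usec)
{
	/* 64-bit intermediate avoids overflow for large intervals. */
	return (uint32_t)((uint64_t)khz * usec / 1000);
}
```

For a 100 usec interval the correct 53125 kHz clock gives 5312 ticks, while the mistyped 53215 kHz gives 5321, so timer values derived from it were off by a little under 0.2 percent.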




Re: RDMA will be reverted

2006-07-24 Thread David Miller

From: Roland Dreier [EMAIL PROTECTED]
Date: Tue, 04 Jul 2006 13:34:27 -0700

> Well, here's a quick overview, leaving out some of the details.  The
> difference between TOE and iWARP/RDMA is really the interface that
> they present.

Thanks for the description Roland.  It helps me understand the
situation better.

> The real issues for netdev are things like Steve Wise's patch to add
> route change notifiers, which could be used to tell RNICs when to
> update the next hop for a connection they're handling.

I'll probably put Steve's patches in soon.

> More generally, it would be interesting to see if it's possible to
> tie an RNIC into the kernel's packet filtering, so that disallowed
> connections don't get set up.  This seems very similar in spirit to
> the problems around packet filtering that were raised for VJ
> netchannels.

Don't get too excited about VJ netchannels, more and more roadblocks
to their practicality are being found every day.

For example, my idea to allow ESTABLISHED TCP socket demux to be done
before netfilter is flawed.  Connection tracking and NAT can change
the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
socket, therefore we must always hit netfilter first.

All the original costs of route, netfilter, TCP socket lookup all
reappear as we make VJ netchannels fit all the rules of real practical
systems, eliminating their gains entirely.  I will also note in
passing that papers on related ideas, such as the Exokernel stuff, are
very careful to not address the issue of how practical 1) their demux
engine is and 2) the negative side effects of userspace TCP
implementations.  For an example of the latter, if you have some 1GB
JAVA process you do not want to wake that monster up just to do some
ACK processing or TCP window updates, yet if you don't you violate
TCP's rules and risk spurious unnecessary retransmits.

Furthermore, the VJ netchannel gains can be partially obtained from
generic stateless facilities that we are going to get anyways.
Networking chips supporting multiple MSI-X vectors, chosen by hashing
the flow ID, can move TCP processing to end nodes which are cpu
threads in this case, by having each such MSI-X vector target a
different cpu thread.
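
As a rough illustration of the hashing idea above, here is a minimal
sketch in C.  The struct, the mixing function and all names are made up
for this example and do not describe any real NIC; the only point is
that a flow always maps to the same vector without any per-flow state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flow identifier: the TCP/IPv4 4-tuple. */
struct flow_id {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Illustrative mixing function, not what any real hardware uses. */
static uint32_t flow_hash(const struct flow_id *f)
{
    uint32_t h = f->saddr;

    h = h * 31u + f->daddr;
    h = h * 31u + (((uint32_t)f->sport << 16) | f->dport);
    h ^= h >> 16;
    return h;
}

/* Pick one of nvec MSI-X vectors.  Each vector is assumed to be
 * bound to exactly one cpu thread, so this also picks the cpu. */
static unsigned int flow_to_vector(const struct flow_id *f, unsigned int nvec)
{
    assert(nvec > 0);
    return flow_hash(f) % nvec;
}
```

Because the mapping depends only on the packet headers, the chip needs
no connection table, which is what keeps the facility stateless.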

The good news is that we've survived a long time without revolutions
like VJ net channels, and the existing TCP stack can be improved
dramatically and in ways that people will see benefits from in a
shorter amount of time.  For example, Alexey Kuznetsov and I have some
ideas on how to make the most expensive TCP function for a sender,
tcp_ack(), more efficient by using different data structures for the
retransmit queue and the loss/recovery packet SACK state.

Re: RDMA will be reverted

2006-07-24 Thread David Miller

From: Tom Tucker [EMAIL PROTECTED]
Date: Wed, 05 Jul 2006 12:09:42 -0500

 A TOE net stack is closed source firmware. Linux engineers have no way
 to fix security issues that arise. As a result, only non-TOE users will
 receive security updates, leaving random windows of vulnerability for
 each TOE NIC's users.

 - A Linux security update may or may not be relevant to a vendor's
 implementation. 

 - If a vendor's implementation has a security issue then the customer
 must rely on the vendor to fix it. This is no less true for iWARP than
 for any adapter.

This isn't how things actually work.

Users have a computer, and they can rightly expect the community
to help them solve problems that occur in the upstream kernel.

When a bug is found and the person is using NIC X, we don't
necessarily forward the bug report to the vendor of NIC X.
Instead we try to fix the bug.  Many chip drivers are maintained
by people who do not work for the company that makes the chip,
and this works just fine.

If only the chip vendor can fix a security problem, this makes Linux
less agile to fix.  Every aspect of a problem on a Linux system that
cannot be fixed entirely by the community is a net negative for Linux.

 - iWARP needs to do protocol processing in order to validate and
 evaluate TCP payload in advance of direct data placement. This
 requirement is independent of CPU speed. 

Yet, RDMA itself is just an optimization meant to deal with
limitations of cpu and memory speed.  You can rephrase the
situation in whatever way suits your argument, but it does not
make the core issue go away :)

 - I suspect that connection rates for RDMA adapters fall well-below the
 rates attainable with a dumb device. That said, all of the RDMA
 applications that I know of are not connection intensive. Even for TOE,
 the later HTTP versions make connection rates less of an issue.

This is a very naive evaluation of the situation.  Yes, newer
versions of protocols such as HTTP make the per-client connection
burden lower, but the number of clients will increase in time to
more than make up for whatever gains are seen due to this.

And then you have protocols which by design are connection heavy,
and rightly so, such as bittorrent.

Being able to handle enormous numbers of connections, with extreme
scalability and low latency, is an absolute requirement of any modern
day serious TCP stack.  And this requirement is not going away.
Wishing this requirement away due to HTTP persistent connections is
very unrealistic, at best.

 - This is the problem we're trying to solve...incrementally and
 responsibly.

You can't.  See my email to Roland about why even VJ net channels
are found to be impractical.  To support netfilter properly, you
must traverse the whole netfilter stack, because NAT can rewrite
packets, yet still make them destined for the local system, and
thus they will have a different identity for connection demux
by the time the TCP stack sees the packet.

All of these transformations occur between normal packet receive
and the TCP stack.  You would therefore need to put your card
between netfilter and TCP in the packet input path, and at that
point why bother with the stateful card at all?

The fact is that stateless approaches will always be better than
stateful things because you cannot replicate the functionality we
have in the Linux stack without replicating 10 years of work into
your chip's firmware.  At that point you should just run Linux
on your NIC since that is what you are effectively doing :)

In conversations such as these, it helps us a lot if you can be frank
and honest about the true absolute limitations of your technology.  I
can see that your viewpoint is tainted when I hear things such as HTTP
persistent connections being used as a reason why high TCP connection
rates won't matter in the future.  Such assertions are understood to
be patently false by anyone who understands TCP and how it is used in
the real world.

Re: RDMA will be reverted

2006-07-24 Thread David Miller

From: Steve Wise [EMAIL PROTECTED]
Date: Wed, 05 Jul 2006 12:50:34 -0500

 However, iWARP devices _could_ integrate with netfilter.  For most
 devices the connection request event (SYN) gets passed up to the host
 driver.  So the driver can enforce filter rules then.

This doesn't work.  In order to handle things like NAT and connection
tracking properly you must even allow ESTABLISHED state packets to
pass through netfilter.

Netfilter can have rules such as NAT port 200 to 300, leave the other
fields alone and your suggested scheme cannot handle this.
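
To make the port-rewrite case concrete, here is a small sketch in C.
The struct and the helper are hypothetical stand-ins for this example,
not netfilter code; they only model a rule that rewrites destination
port 200 to 300 and leaves the other fields alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Connection identity as used for socket demux (IPv4 4-tuple). */
struct tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

static bool tuple_eq(const struct tuple *a, const struct tuple *b)
{
    assert(a && b);
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Hypothetical stand-in for the rule "NAT port 200 to 300":
 * rewrite the destination port, leave the other fields alone. */
static struct tuple nat_200_to_300(struct tuple t)
{
    if (t.dport == 200)
        t.dport = 300;
    return t;
}
```

A socket listening on port 300 only matches the packet after the
rewrite.  A demux keyed on the on-the-wire tuple (port 200) would look
for the wrong identity on every ESTABLISHED packet, not just the SYN.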

Re: [PATCH 1/2] remove CONFIG_HAVE_ARCH_DEV_ALLOC_SKB

2006-07-24 Thread David Miller

From: Christoph Hellwig [EMAIL PROTECTED]
Date: Fri, 7 Jul 2006 11:10:08 +0200

 skbuff.h has an #ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB to allow
 architectures to reimplement __dev_alloc_skb.  It's not set on any
 architecture and now that we have an architecture-overrideable
 NET_SKB_PAD there is no point at all to have one either.

 Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Applied, thanks Christoph.

Re: [PATCH 2/2] correct dev_alloc_skb kerneldoc

2006-07-24 Thread David Miller

From: Christoph Hellwig [EMAIL PROTECTED]
Date: Fri, 7 Jul 2006 11:09:57 +0200

 dev_alloc_skb is designated for RX descriptors, not TX.  (Some drivers
 use it for the latter anyway, but that's a different story)

 Signed-off-by: Christoph Hellwig [EMAIL PROTECTED]

Also applied, thanks a lot.

Re: What is RDMA

2006-07-24 Thread Rick Jones

That TOE/iWARP could end-up being precluded by NAT seems so ironic from a POE2E 
standpoint.


rick jones

Purity Of End To End

[PATCH] ip multicast route bug fix

2006-07-24 Thread Stephen Hemminger

This should fix the problem reported in
http://bugzilla.kernel.org/show_bug.cgi?id=6186
where the skb is used after being freed in the IP multicast routing code.

The code was reusing an skb, which could lead to a use after free or a
double free.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
---
 net/ipv4/ipmr.c |   20 ++--
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index ba33f86..d336104 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1580,6 +1580,7 @@ int ipmr_get_route(struct sk_buff *skb, 
     cache = ipmr_cache_find(rt->rt_src, rt->rt_dst);
 
     if (cache==NULL) {
+        struct sk_buff *iskb;
         struct net_device *dev;
         int vif;
 
@@ -1593,12 +1594,19 @@ int ipmr_get_route(struct sk_buff *skb, 
             read_unlock(&mrt_lock);
             return -ENODEV;
         }
-        skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
-        skb->nh.iph->ihl = sizeof(struct iphdr)>>2;
-        skb->nh.iph->saddr = rt->rt_src;
-        skb->nh.iph->daddr = rt->rt_dst;
-        skb->nh.iph->version = 0;
-        err = ipmr_cache_unresolved(vif, skb);
+
+        iskb = alloc_skb(sizeof(struct iphdr), GFP_KERNEL);
+        if (!iskb) {
+            read_unlock(&mrt_lock);
+            return -ENOMEM;
+        }
+        memset(iskb->data, 0, sizeof(struct iphdr));
+        iskb->nh.raw = iskb->data;
+        iskb->nh.iph->ihl = sizeof(struct iphdr)>>2;
+        iskb->nh.iph->saddr = rt->rt_src;
+        iskb->nh.iph->daddr = rt->rt_dst;
+
+        err = ipmr_cache_unresolved(vif, iskb);
         read_unlock(&mrt_lock);
         return err;
     }
-- 
1.4.0


Re: What is RDMA

2006-07-24 Thread David Miller

From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 24 Jul 2006 15:34:30 -0700

 That TOE/iWARP could end-up being precluded by NAT seems so ironic
 from a POE2E standpoint.

To be honest we do not have a pure end to end internet, and some of
our failed experiments in the past prove this :-)

For example, we have an optimization that allows much earlier
termination of TIME_WAIT connections.  It relies upon TCP timestamps
and attributes we can determine about end hosts using that information
(it is yet another Van Jacobson idea btw).  But NAT means that IP+Port
does not necessarily equate to the same host over time, not even over
short periods of time.  A NAT box could be using Port X for host A and
then host B some short time later.

Therefore we had to disable the early timewait recycling trick by
default.
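
A simplified model of why NAT breaks this, sketched in C.  This is not
the kernel's code; the struct and function are invented for the
example, assuming only that the trick remembers the newest timestamp
seen per IP address and rejects a SYN whose timestamp goes backwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-address record: the newest TCP timestamp value
 * seen from this IP address. */
struct peer_state {
    uint32_t addr;
    uint32_t last_tsval;
};

/* Accept a new SYN only if its timestamp has not gone backwards,
 * using a wraparound-safe 32-bit comparison. */
static bool syn_acceptable(const struct peer_state *p, uint32_t tsval)
{
    assert(p);
    return (int32_t)(tsval - p->last_tsval) >= 0;
}
```

If host A behind the NAT box last sent timestamp 5000 and host B, with
an unrelated clock, then sends a SYN with timestamp 1200 from the same
public address, the check rejects B even though B did nothing wrong.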

Re: [PATCH 1/2] remove CONFIG_HAVE_ARCH_DEV_ALLOC_SKB

2006-07-24 Thread Roland Dreier

 skbuff.h has an #ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB to allow
 architectures to reimplement __dev_alloc_skb.  It's not set on any
 architecture and now that we have an architecture-overrideable
 NET_SKB_PAD there is no point at all to have one either.

I missed this when hch first posted it, sorry.

But my impression was that the intent of the config option was to let
Xen hook __dev_alloc_skb() to allocate special receive skbs to handle
their page-flipping virtual network device.  Which goes beyond
NET_SKB_PAD.

So the real question is about Xen hooks I guess -- and given where the
rest of Xen is, it probably does make sense to go ahead and strip this
out.

 - R.

RE: RDMA will be reverted

2006-07-24 Thread Caitlin Bestler

[EMAIL PROTECTED] wrote:
 From: Steve Wise [EMAIL PROTECTED]
 Date: Wed, 05 Jul 2006 12:50:34 -0500

 However, iWARP devices _could_ integrate with netfilter.  For most
 devices the connection request event (SYN) gets passed up to the host
 driver.  So the driver can enforce filter rules then.

 This doesn't work.  In order to handle things like NAT and
 connection tracking properly you must even allow ESTABLISHED
 state packets to pass through netfilter.

 Netfilter can have rules such as NAT port 200 to 300, leave
 the other fields alone and your suggested scheme cannot handle this.

This is totally irrelevant. But it does work.

First, an RDMA connection once established associates a
TCP connection *as identified external to the box* with
an RDMA endpoint (conventionally a QP).

Performing a NAT translation on a TCP packet would certainly
be within the capabilities of an RNIC, but it would be pointless.
The relabeled TCP segment would be associated with the same QP.

Once an RDMA connection is established, the individual TCP segments
are only of interest to the RDMA endpoint. Payload is delivered
through the RDMA interface (the same one already used for
InfiniBand). The purpose of integration with netfilter would
be to ensure that no RDMA Connection could exist, or persist,
if netfilter would not allow the TCP connection to be created.

That is not a matter of packet filtering, it is a matter of
administrative consistency. If someone uses netfilter to block
connections from a given IP netmask then they reasonably expect
that there will be no connections with any host within that
IP netmask. They do not expect exceptions for RDMA, iSCSI
or InfiniBand.

The existing connection management interfaces in openfabrics,
designed to support both InfiniBand and iWARP, could naturally
be extended to validate all RDMA connections using an IP address
with netfilter. This would be of real value.

The only real value of a rule such as NAT port 200 to 300 is
to allow a remote peer to establish a connection to port 200
with a local listener using port 300. That *can* be supported
without actually manipulating the header in each TCP packet.

It is also possible to discuss other netfilter functionality
that serves a valid end-user purpose, such as counting packets.


Re: What is RDMA

2006-07-24 Thread Andi Kleen

On Tuesday 25 July 2006 00:34, Rick Jones wrote:
 That TOE/iWARP could end-up being precluded by NAT seems so ironic
 from a POE2E standpoint.

Yes, it's sad, but reality unfortunately. 

There is even precedent: the VJ stateless TW recycling scheme also
turned out to not work because of NAT considerations.

-Andi

RE: RDMA will be reverted

2006-07-24 Thread Caitlin Bestler

[EMAIL PROTECTED] wrote:
 From: Tom Tucker [EMAIL PROTECTED]
 Date: Wed, 05 Jul 2006 12:09:42 -0500

 A TOE net stack is closed source firmware. Linux engineers have no
 way to fix security issues that arise. As a result, only non-TOE
 users will receive security updates, leaving random windows of
 vulnerability for each TOE NIC's users. 

 - A Linux security update may or may not be relevant to a vendor's
 implementation. 

 - If a vendor's implementation has a security issue then the customer
 must rely on the vendor to fix it. This is no less true for iWARP
 than for any adapter.

 This isn't how things actually work.

 Users have a computer, and they can rightly expect the
 community to help them solve problems that occur in the
 upstream kernel.

 When a bug is found and the person is using NIC X, we don't
 necessarily forward the bug report to the vendor of NIC X.
 Instead we try to fix the bug.  Many chip drivers are
 maintained by people who do not work for the company that
 makes the chip, and this works just fine.

 If only the chip vendor can fix a security problem, this
 makes Linux less agile to fix.  Every aspect of a problem on a
 Linux system that cannot be fixed entirely by the community
 is a net negative for Linux.

 - iWARP needs to do protocol processing in order to validate and
 evaluate TCP payload in advance of direct data placement. This
 requirement is independent of CPU speed.

 Yet, RDMA itself is just an optimization meant to deal with
 limitations of cpu and memory speed.  You can rephrase the
 situation in whatever way suits your argument, but it does
 not make the core issue go away :)

RDMA is a protocol that allows the application to more
precisely state the actual ordering requirements. It
improves the end-to-end interactions and has value
over a protocol with only byte or message stream
semantics regardless of local interface efficiencies.
See http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt

In any event, isn't the value of an RDMA interface to applications
already settled? The question is how best to integrate the
usage of IP addresses with the kernel. The inability to inspect
the low-level packet validation in open source code is a limitation
of *all* RDMA solutions; the transport layer of InfiniBand is just
as offloaded as it is for iWARP.

The patches proposed are intended to support integrated connection
management for RDMA connections using IP addresses, no matter what
the underlying transport is. The only difference is that *all* iWARP
connections use IP addresses.


Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen


 For example, my idea to allow ESTABLISHED TCP socket demux to be done
 before netfilter is flawed.  Connection tracking and NAT can change
 the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
 socket, therefore we must always hit netfilter first.

Hmm, how does this happen?

I guess either when a connection is masqueraded and an application did a bind()
on a local port that is used by the masquerading engine.  That could be handled
by just disallowing it.

Or when you have a transparent proxy setup with the proxy on the local host.
Perhaps in that case netfilter could be taught to reinject packets
in a way that they hit another ESTABLISHED lookup.

Did I miss a case?

 All the original costs of route, netfilter, TCP socket lookup all
 reappear as we make VJ netchannels fit all the rules of real practical
 systems, eliminating their gains entirely.

At least most of the optimizations from the early demux scheme could
probably be had more simply by adding a fast path to iptables/conntrack/etc.
that checks whether all rules only match SYN etc. packets and, if so, doesn't
walk the full rule set (or, more generally, a fast TCP flag mask check similar
to what TCP does). With that, ESTABLISHED packets would reach TCP with only
relatively small overhead.
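
A sketch in C of what such a flag mask check could look like.  The
rule representation and every name here are hypothetical and not taken
from iptables; the assumption is that a rule matches when
(flags & mask) == match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_FIN 0x01
#define FLAG_SYN 0x02
#define FLAG_RST 0x04
#define FLAG_ACK 0x10

/* Hypothetical rule: matches when (flags & mask) == match. */
struct rule {
    uint8_t mask;
    uint8_t match;
};

struct ruleset {
    const struct rule *rules;
    size_t n;
    uint8_t summary; /* OR of the flag bits some rule requires set */
    bool fast_ok;    /* true if every rule requires at least one bit */
};

/* Run once when the rules change, not per packet. */
static void ruleset_prepare(struct ruleset *rs)
{
    assert(rs);
    rs->summary = 0;
    rs->fast_ok = true;
    for (size_t i = 0; i < rs->n; i++) {
        uint8_t required = rs->rules[i].mask & rs->rules[i].match;

        if (required == 0)
            rs->fast_ok = false; /* rule can match "plain" packets */
        rs->summary |= required;
    }
}

/* Per packet: one AND decides whether the rules must be walked. */
static bool needs_full_walk(const struct ruleset *rs, uint8_t flags)
{
    return !rs->fast_ok || (flags & rs->summary) != 0;
}
```

With a single SYN-only rule, a plain ACK on an established connection
skips the walk, while a SYN still takes the slow path.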

 I will also note in 
 passing that papers on related ideas, such as the Exokernel stuff, are
 very careful to not address the issue of how practical 1) their demux
 engine is and 2) the negative side effects of userspace TCP
 implementations.  For an example of the latter, if you have some 1GB
 JAVA process you do not want to wake that monster up just to do some
 ACK processing or TCP window updates, yet if you don't you violate
 TCP's rules and risk spurious unnecessary retransmits.

I don't quite get why the size of the process matters here - if only
some user space TCP library is called directly then it shouldn't
really matter how big or small the rest of the process is.

Or did you mean migration costs as described below?

But on the other hand full user space TCP seems to me of little gain
compared to a process context implementation.

I somehow like it better to hide these implementation details in 
the kernel.
 
 Furthermore, the VJ netchannel gains can be partially obtained from
 generic stateless facilities that we are going to get anyways.
 Networking chips supporting multiple MSI-X vectors, chosen by hashing
 the flow ID, can move TCP processing to end nodes which are cpu
 threads in this case, by having each such MSI-X vector target a
 different cpu thread.

The problem with the scheme is that to do process context processing
effectively you would need to teach the scheduler to aggressively
migrate on wake up (so that the process ends up on the CPU that 
was selected by the hash function in the NIC).

But what do you do when you have lots of different connections
with different target CPU hash values, or when this would require
you to move multiple compute-intensive processes onto a single core?

Without user context TCP, but using softirqs instead, it looks a bit better 
because you can at least use different CPUs to do the ACK processing etc.
and the hash function spreading out connections over your CPUs doesn't harm.

But you still have relatively high cache line transfer costs in handing
over these packets from the softirq CPUs to the final process consumer. I liked
VJ's idea of using arrays-of-something instead of lists for that to avoid
some cache line transfers.  OK, at least it sounds nice in theory; I haven't seen
any hard numbers on this scheme compared to a traditional doubly linked list.
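
For illustration, an array-based hand-off queue could look roughly like
the sketch below.  It is a single-threaded toy in C with invented
names; a real producer/consumer version would need memory barriers and
cache line padding between head and tail, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u /* must be a power of two */

/* Fixed array of packet pointers instead of a linked list: the
 * producer only writes head, the consumer only writes tail. */
struct ring {
    void *slot[RING_SIZE];
    unsigned int head; /* next slot to fill */
    unsigned int tail; /* next slot to drain */
};

static bool ring_put(struct ring *r, void *pkt)
{
    assert(r);
    if (r->head - r->tail == RING_SIZE)
        return false; /* full */
    r->slot[r->head % RING_SIZE] = pkt;
    r->head++;
    return true;
}

static void *ring_get(struct ring *r)
{
    void *pkt;

    assert(r);
    if (r->head == r->tail)
        return NULL; /* empty */
    pkt = r->slot[r->tail % RING_SIZE];
    r->tail++;
    return pkt;
}
```

Unlike a list, queueing a packet here does not touch the packet's own
memory to link it in, which is where the saved cache line transfers
would come from.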

-Andi

[patch 1/2] s2io driver bug fixes

2006-07-24 Thread Ananda Raju

Hi,
This patch contains some of the bug fixes and enhancements done to the
s2io driver. The following is a brief description of the changes:

1. Introduced macro S2IO_PARM_INT for declaring integer load parameter
2. UDP_RR test failure, memset txdl after Tx completion 
3. PXE boot may leave adapter in unknown state so do reset in probe.
4. Add Tx completion code in netpoll
5. In s2io_vpd_read() move array vpd_data[] to pointer, saves stack memory
6. Fix bug in ethtool online test

Signed-off-by: Ananda Raju [EMAIL PROTECTED]
---
diff -upNr netdev/drivers/net/s2io.c bug_fixes_1/drivers/net/s2io.c
--- netdev/drivers/net/s2io.c   2006-07-14 07:58:06.0 -0700
+++ bug_fixes_1/drivers/net/s2io.c  2006-07-14 09:26:09.0 -0700
@@ -370,38 +370,50 @@ static const u64 fix_mac[] = {
END_SIGN
 };
 
+MODULE_AUTHOR("Raghavendra Koushik [EMAIL PROTECTED]");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(DRV_VERSION);
+
+
 /* Module Loadable parameters. */
-static unsigned int tx_fifo_num = 1;
-static unsigned int tx_fifo_len[MAX_TX_FIFOS] =
-{DEFAULT_FIFO_0_LEN, [1 ...(MAX_TX_FIFOS - 1)] = DEFAULT_FIFO_1_7_LEN};
-static unsigned int rx_ring_num = 1;
-static unsigned int rx_ring_sz[MAX_RX_RINGS] =
-{[0 ...(MAX_RX_RINGS - 1)] = SMALL_BLK_CNT};
-static unsigned int rts_frm_len[MAX_RX_RINGS] =
-{[0 ...(MAX_RX_RINGS - 1)] = 0 };
-static unsigned int rx_ring_mode = 1;
-static unsigned int use_continuous_tx_intrs = 1;
-static unsigned int rmac_pause_time = 0x100;
-static unsigned int mc_pause_threshold_q0q3 = 187;
-static unsigned int mc_pause_threshold_q4q7 = 187;
-static unsigned int shared_splits;
-static unsigned int tmac_util_period = 5;
-static unsigned int rmac_util_period = 5;
-static unsigned int bimodal = 0;
-static unsigned int l3l4hdr_size = 128;
-#ifndef CONFIG_S2IO_NAPI
-static unsigned int indicate_max_pkts;
-#endif
+S2IO_PARM_INT(tx_fifo_num, 1);
+S2IO_PARM_INT(rx_ring_num, 1);
+
+
+S2IO_PARM_INT(rx_ring_mode, 1);
+S2IO_PARM_INT(use_continuous_tx_intrs, 1);
+S2IO_PARM_INT(rmac_pause_time, 0x100);
+S2IO_PARM_INT(mc_pause_threshold_q0q3, 187);
+S2IO_PARM_INT(mc_pause_threshold_q4q7, 187);
+S2IO_PARM_INT(shared_splits, 0);
+S2IO_PARM_INT(tmac_util_period, 5);
+S2IO_PARM_INT(rmac_util_period, 5);
+S2IO_PARM_INT(bimodal, 0);
+S2IO_PARM_INT(l3l4hdr_size, 128);
 /* Frequency of Rx desc syncs expressed as power of 2 */
-static unsigned int rxsync_frequency = 3;
+S2IO_PARM_INT(rxsync_frequency, 3);
 /* Interrupt type. Values can be 0(INTA), 1(MSI), 2(MSI_X) */
-static unsigned int intr_type = 0;
+S2IO_PARM_INT(intr_type, 0);
 /* Large receive offload feature */
-static unsigned int lro = 0;
+S2IO_PARM_INT(lro, 0);
 /* Max pkts to be aggregated by LRO at one time. If not specified,
  * aggregation happens until we hit max IP pkt size(64K)
  */
-static unsigned int lro_max_pkts = 0x;
+S2IO_PARM_INT(lro_max_pkts, 0x);
+#ifndef CONFIG_S2IO_NAPI
+S2IO_PARM_INT(indicate_max_pkts, 0);
+#endif
+
+static unsigned int tx_fifo_len[MAX_TX_FIFOS] =
+{DEFAULT_FIFO_0_LEN, [1 ...(MAX_TX_FIFOS - 1)] = DEFAULT_FIFO_1_7_LEN};
+static unsigned int rx_ring_sz[MAX_RX_RINGS] =
+{[0 ...(MAX_RX_RINGS - 1)] = SMALL_BLK_CNT};
+static unsigned int rts_frm_len[MAX_RX_RINGS] =
+{[0 ...(MAX_RX_RINGS - 1)] = 0 };
+
+module_param_array(tx_fifo_len, uint, NULL, 0);
+module_param_array(rx_ring_sz, uint, NULL, 0);
+module_param_array(rts_frm_len, uint, NULL, 0);
 
 /*
  * S2IO device table.
@@ -464,10 +476,9 @@ static int init_shared_mem(struct s2io_n
         size += config->tx_cfg[i].fifo_len;
     }
     if (size > MAX_AVAILABLE_TXDS) {
-        DBG_PRINT(ERR_DBG, "%s: Requested TxDs too high, ",
-              __FUNCTION__);
+        DBG_PRINT(ERR_DBG, "s2io: Requested TxDs too high, ");
         DBG_PRINT(ERR_DBG, "Requested: %d, max supported: 8192\n", size);
-        return FAILURE;
+        return -EINVAL;
     }
 
     lst_size = (sizeof(TxD_t) * config->max_txds);
@@ -547,6 +558,7 @@ static int init_shared_mem(struct s2io_n
     nic->ufo_in_band_v = kmalloc((sizeof(u64) * size), GFP_KERNEL);
     if (!nic->ufo_in_band_v)
         return -ENOMEM;
+    memset(nic->ufo_in_band_v, 0, size);
 
/* Allocation and initialization of RXDs in Rings */
size = 0;
@@ -1213,7 +1225,7 @@ static int init_nic(struct s2io_nic *nic
break;
}
 
-   /* Enable Tx FIFO partition 0. */
+   /* Enable all configured Tx FIFO partitions */
     val64 = readq(&bar0->tx_fifo_partition_0);
     val64 |= (TX_FIFO_PARTITION_EN);
     writeq(val64, &bar0->tx_fifo_partition_0);
@@ -1650,7 +1662,7 @@ static void en_dis_able_nic_intrs(struct
     writeq(temp64, &bar0->general_int_mask);
/*
 * If Hercules adapter enable GPIO otherwise
-* disabled all PCIX, Flash, MDIO, IIC

[patch 2/2] s2io driver bug fixes

2006-07-24 Thread Ananda Raju

Hi,
This patch contains some of the bug fixes and enhancements done to the
s2io driver. The following is a brief description of the changes:

1. code cleanup to handle gso modification better
2. Move repeated code in rx path, to a common function 
   s2io_chk_rx_buffers()
3. Bug fix in MSI interrupt 
4. clear statistics when card is down
5. Avoid linked list traversing in lro aggregation.
6. Use pci_dma_sync_single_for_cpu for buffer0 in case of 2/3
   buffer mode.
7. ethtool tso get/set functions to set/clear NETIF_F_TSO6
8. Stop LRO aggregation when we receive ECN notification

Signed-off-by: Ananda Raju [EMAIL PROTECTED]
---
diff -upNr bug_fixes_1/drivers/net/s2io.c bug_fixes_2/drivers/net/s2io.c
--- bug_fixes_1/drivers/net/s2io.c  2006-07-14 09:26:09.0 -0700
+++ bug_fixes_2/drivers/net/s2io.c  2006-07-21 05:22:19.0 -0700
@@ -76,7 +76,7 @@
 #include "s2io.h"
 #include "s2io-regs.h"
 
-#define DRV_VERSION "2.0.14.2"
+#define DRV_VERSION "2.0.15.2"
 
 /* S2io Driver name & version. */
 static char s2io_driver_name[] = "Neterion";
@@ -2383,9 +2383,14 @@ static int fill_rx_buffers(struct s2io_n
             skb->data = (void *) (unsigned long)tmp;
             skb->tail = (void *) (unsigned long)tmp;
 
-            ((RxD3_t*)rxdp)->Buffer0_ptr =
-                pci_map_single(nic->pdev, ba->ba_0, BUF0_LEN,
+            if (!(((RxD3_t*)rxdp)->Buffer0_ptr))
+                ((RxD3_t*)rxdp)->Buffer0_ptr =
+                   pci_map_single(nic->pdev, ba->ba_0, BUF0_LEN,
                        PCI_DMA_FROMDEVICE);
+            else
+                pci_dma_sync_single_for_device(nic->pdev,
+                    (dma_addr_t) ((RxD3_t*)rxdp)->Buffer0_ptr,
+                    BUF0_LEN, PCI_DMA_FROMDEVICE);
             rxdp->Control_2 = SET_BUFFER0_SIZE_3(BUF0_LEN);
             if (nic->rxd_mode == RXD_MODE_3B) {
                 /* Two buffer mode */
@@ -2398,10 +2403,13 @@ static int fill_rx_buffers(struct s2io_n
                 (nic->pdev, skb->data, dev->mtu + 4,
                         PCI_DMA_FROMDEVICE);
 
-                /* Buffer-1 will be dummy buffer not used */
-                ((RxD3_t*)rxdp)->Buffer1_ptr =
-                    pci_map_single(nic->pdev, ba->ba_1, BUF1_LEN,
-                    PCI_DMA_FROMDEVICE);
+                /* Buffer-1 will be dummy buffer. Not used */
+                if (!(((RxD3_t*)rxdp)->Buffer1_ptr)) {
+                    ((RxD3_t*)rxdp)->Buffer1_ptr =
+                        pci_map_single(nic->pdev,
+                        ba->ba_1, BUF1_LEN,
+                        PCI_DMA_FROMDEVICE);
+                }
                 rxdp->Control_2 |= SET_BUFFER1_SIZE_3(1);
                 rxdp->Control_2 |= SET_BUFFER2_SIZE_3
                                 (dev->mtu + 4);
@@ -2728,7 +2736,7 @@ static void rx_intr_handler(ring_info_t 
         /* If your are next to put index then it's FIFO full condition */
         if ((get_block == put_block) &&
             (get_info.offset + 1) == put_info.offset) {
-            DBG_PRINT(ERR_DBG, "%s: Ring Full\n",dev->name);
+            DBG_PRINT(INTR_DBG, "%s: Ring Full\n",dev->name);
             break;
         }
         skb = (struct sk_buff *) ((unsigned long)rxdp->Host_Control);
@@ -2748,18 +2756,15 @@ static void rx_intr_handler(ring_info_t 
                  HEADER_SNAP_SIZE,
                  PCI_DMA_FROMDEVICE);
         } else if (nic->rxd_mode == RXD_MODE_3B) {
-            pci_unmap_single(nic->pdev, (dma_addr_t)
+            pci_dma_sync_single_for_cpu(nic->pdev, (dma_addr_t)
                  ((RxD3_t*)rxdp)->Buffer0_ptr,
                  BUF0_LEN, PCI_DMA_FROMDEVICE);
             pci_unmap_single(nic->pdev, (dma_addr_t)
-                 ((RxD3_t*)rxdp)->Buffer1_ptr,
-                 BUF1_LEN, PCI_DMA_FROMDEVICE);
-            pci_unmap_single(nic->pdev, (dma_addr_t)
                  ((RxD3_t*)rxdp)->Buffer2_ptr,
                  dev->mtu + 4,
                  PCI_DMA_FROMDEVICE);
         } else {
-            pci_unmap_single(nic->pdev, (dma_addr_t)
+            pci_dma_sync_single_for_cpu(nic->pdev, (dma_addr_t)
                  ((RxD3_t*)rxdp)->Buffer0_ptr, BUF0_LEN,

Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen

On Tuesday 25 July 2006 01:22, David Miller wrote:
 From: Andi Kleen [EMAIL PROTECTED]
 Date: Tue, 25 Jul 2006 01:10:25 +0200

   All the original costs of route, netfilter, TCP socket lookup all
   reappear as we make VJ netchannels fit all the rules of real practical
   systems, eliminating their gains entirely.

  At least most of the optimizations from the early demux scheme could
  be probably gotten simpler by adding a fast path to iptables/conntrack/etc. 
  that checks if all rules only check SYN etc. packets and doesn't walk the
  full rules then (or more generalized a fast TCP flag mask check similar 
  to what TCP does). With that ESTABLISHED would hit TCP only with relatively
  small overhead.

 Actually, all is not lost.  Alexey has a more clever idea which
 is basically to run the netfilter hooks in the socket receive
 path.

The gain being that the target CPU does the work instead of 
the softirq one?

Some combined lookup and better handling of ESTABLISHED still
seems like a good idea.

One idea I had at some point was to separate conntrack for local
connection vs routed connections and attach the local conntrack
to the socket (and use its lookup tables). Then at least for
local connections conntrack should be nearly free.

It should also solve the issue we currently have that enabling
conntrack makes local network performance significantly worse.

 Where does state live in such a huge process?  Usually, it is
 scattered all over its address space.  Let us say that java
 application just did a lot of churning on its own data
 structure, swapping out TCP library state objects, we will
 prematurely swap that stuff back in just to spit out an ACK
 or similar.

TCP state is usually multiple cache lines, so you would have
cache misses anyways. Do you worry about the TLBs? 

  But what do you do when you have lots of different connections
  with different target CPU hash values or when this would require
  you to move multiple compute intensive processes or a single core?

 That is why we have scheduler :)

It can't do well if it gets conflicting input.

 Even in a best effort scenario, things
 will be generally better than they are currently, plus there is nothing
 precluding the flow demux MSI-X selection from getting more intelligent.

Intelligent = stateful in this case.

AFAIK the only way to do it stateless is hashes and the output
of hashes tends to be unpredictable by definition.

 For example, the demuxer could notice that TCP data transmits for
 flow X tend to happen on cpu X, and update a flow table to record that
 fact.  It could use the same flow table as the one used for LRO.

Hmm, I somewhat doubt that lower end NICs will ever have such flow tables.
Also the flow tables could always thrash (because the on-NIC RAM is necessarily
limited), or they would require the NIC to look up state in memory, which is
probably not much faster than the CPUs doing it.

Using hash functions in the hardware to select the MSI-X seems 
more elegant, cheaper and much more scalable to me.

The drawback of hashes is that for processes with multiple
connections you have to move some work back into the softirqs
that run on the MSI-X target CPUs.

So basically doing process context TCP fully will require
much more complex and stateful hardware.

Or you can keep it only as a fast path for specific situations
(single busy connection per thread) and stay with mostly-softirq
processing for the many connection cases.

-Andi

[XFRM]: Fix protocol field value for outgoing IPv6 GSO packets

2006-07-24 Thread Patrick McHardy

This appears to be a mistake, but I didn't follow the GSO stuff
very closely, so there could be some non-obvious reason.


[XFRM]: Fix protocol field value for outgoing IPv6 GSO packets

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit 8035f60a607630459e4440dbbc5a20f3cfbf97ac
tree f1a7061cfd1f923b3991ee8f449cffce86870a3e
parent 440848a8e33fc1927bab45bd73f6c8e042ea7abd
author Patrick McHardy [EMAIL PROTECTED] Tue, 25 Jul 2006 02:02:00 +0200
committer Patrick McHardy [EMAIL PROTECTED] Tue, 25 Jul 2006 02:02:00 +0200

 net/ipv6/xfrm6_output.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 0eea60e..c8c8b44 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -125,7 +125,7 @@ static int xfrm6_output_finish(struct sk
if (!skb_is_gso(skb))
return xfrm6_output_finish2(skb);
 
-   skb->protocol = htons(ETH_P_IP);
+   skb->protocol = htons(ETH_P_IPV6);
segs = skb_gso_segment(skb, 0);
kfree_skb(skb);
if (unlikely(IS_ERR(segs)))
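
A user-space sketch of why the tag matters, assuming that segmentation is dispatched on the protocol value carried by the packet. The table and `pick_segmenter()` are hypothetical stand-ins, not the kernel's data structures; only the two ethertype values are the real ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP   0x0800	/* IPv4 ethertype */
#define ETH_P_IPV6 0x86DD	/* IPv6 ethertype */

enum segmenter { SEG_NONE, SEG_IPV4, SEG_IPV6 };

/* Hypothetical stand-in for a per-protocol handler list. */
static const struct { uint16_t proto; enum segmenter seg; } seg_table[] = {
	{ ETH_P_IP,   SEG_IPV4 },
	{ ETH_P_IPV6, SEG_IPV6 },
};

/* Select the handler purely from the protocol tag, never from the payload. */
static enum segmenter pick_segmenter(uint16_t proto)
{
	for (size_t i = 0; i < sizeof(seg_table) / sizeof(seg_table[0]); i++)
		if (seg_table[i].proto == proto)
			return seg_table[i].seg;
	return SEG_NONE;
}
```

In this model an IPv6 packet tagged with `ETH_P_IP` is handed to the IPv4 handler, which is the kind of mismatch the one-line fix avoids.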

Re: [XFRM]: Fix protocol field value for outgoing IPv6 GSO packets

2006-07-24 Thread Herbert Xu

On Tue, Jul 25, 2006 at 02:09:26AM +0200, Patrick McHardy wrote:
 This appears to be a mistake, but I didn't follow the GSO stuff
 very closely, so there could be some non-obvious reason.

Yes it definitely was a mistake! Thanks for picking this up Patrick.

 [XFRM]: Fix protocol field value for outgoing IPv6 GSO packets
 
 Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

Acked-by: Herbert Xu [EMAIL PROTECTED]

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: RDMA will be reverted

2006-07-24 Thread Rick Jones

This all sounds like the discussions we had within HP-UX between 10.20 and 11.0 
concerning Inbound Packet Scheduling vs Thread Optimized Packet Scheduling.  IPS 
was done by the 10.20 stack at the handoff between the driver and netisr.  If 
the packet was not an IP datagram fragment, parts of the transport and IP 
headers would be hashed, and the result would be the netisr queue to which the 
packet would be queued for further processing.


It worked fine and dandy for stuff like aggregate netperf TCP_RR tests because 
there was a 1-1 correspondence between a connection and a process/thread.  It 
was OK for the networking to dictate where the process should run.  That feels 
rather like a NIC that would hash packets and pick the MSI-X based on that.


However, as Andi discusses, when there is a process/thread doing more than one 
connection, picking a CPU based on addressing hashing will be like TweedleDee 
and TweedleDum telling Alice to go in opposite directions.  Hence TOPS in 11.X.
This time, at the normal lookup point in the path, the stack determines the CPU
on which the application last accessed the socket, and things shift over to
that CPU.  This then is the process (well actually the scheduler) telling
networking where it should do its work.
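
The difference between the two policies can be sketched like this. It is a toy model, not HP-UX code: `struct conn`, `ips_pick_cpu()`, `tops_pick_cpu()` and `note_socket_access()` are hypothetical names, and the fallback to the hash when no access has been recorded is an assumption.

```c
#include <assert.h>

#define CPU_UNKNOWN (-1)

/* Hypothetical per-connection state. */
struct conn {
	unsigned int addr_hash;	/* hash of transport and IP header fields */
	int last_app_cpu;	/* CPU where the application last touched the socket */
};

/* IPS-style: networking dictates the CPU from addressing alone. */
static int ips_pick_cpu(const struct conn *c, int ncpus)
{
	assert(ncpus > 0);
	return (int)(c->addr_hash % (unsigned int)ncpus);
}

/* Record where the application ran when it accessed the socket. */
static void note_socket_access(struct conn *c, int cpu)
{
	c->last_app_cpu = cpu;
}

/* TOPS-style: follow the application, fall back to the hash if unknown. */
static int tops_pick_cpu(const struct conn *c, int ncpus)
{
	if (c->last_app_cpu != CPU_UNKNOWN)
		return c->last_app_cpu;
	return ips_pick_cpu(c, ncpus);
}
```

With one thread servicing two connections, the IPS-style choice can name two different CPUs while the TOPS-style choice converges on the CPU the thread last ran on.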


That addresses the multiple connections per thread/process and still works just 
as well for 1-1.  There are still issues if you have multiple threads/processes
concurrently accessing the same socket/connection, but that one is much more rare.


Nirvana I suppose would be the addition of a field in the header which could be 
used for the determination of where to process. A Transport Protocol option I 
suppose, maybe the IPv6 flow id, but knuth only knows if anyone would go for 
something along those lines.  It does though mean that the state is per-packet 
without it having to be based on addressing information.  Almost like RDMA 
arriving saying where the data goes, but this thing says where the processing 
should happen :)


rick jones

Re: RDMA will be reverted

2006-07-24 Thread David Miller

From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 24 Jul 2006 17:29:05 -0700

 Nirvana I suppose would be the addition of a field in the header
 which could be used for the determination of where to process. A
 Transport Protocol option I suppose, maybe the IPv6 flow id, but
 knuth only knows if anyone would go for something along those lines.
 It does though mean that the state is per-packet without it having
 to be based on addressing information.  Almost like RDMA arriving
 saying where the data goes, but this thing says where the processing
 should happen :)

Since the full interpretation of the TCP timestamp option field value
is largely local to the peer setting it, there is nothing wrong with
stealing a few bits for destination cpu information.

It would have to be done in such a way as to not make the PAWS
tests fail by accident.  But I think it's doable.

Re: RDMA will be reverted

2006-07-24 Thread Rick Jones

David Miller wrote:

 From: Rick Jones [EMAIL PROTECTED]
 Date: Mon, 24 Jul 2006 17:29:05 -0700

  Nirvana I suppose would be the addition of a field in the header
  which could be used for the determination of where to process. A
  Transport Protocol option I suppose, maybe the IPv6 flow id, but
  knuth only knows if anyone would go for something along those lines.
  It does though mean that the state is per-packet without it having
  to be based on addressing information.  Almost like RDMA arriving
  saying where the data goes, but this thing says where the processing
  should happen :)

 Since the full interpretation of the TCP timestamp option field value
 is largely local to the peer setting it, there is nothing wrong with
 stealing a few bits for destination cpu information.

Even enough bits for 1024 or 2048 CPUs in the single system image?  I have seen
1024 touted by SGI, and with things going so multi-core, perhaps 16384 while
sounding initially bizarre would be in the realm of theoretically possible
before too long.

 It would have to be done in such a way as to not make the PAWS
 tests fail by accident.  But I think it's doable.

That would cover TCP, are there similarly fungible fields in SCTP or other ULPs?

And if we were to want to get HW support for the thing, getting it adopted in a 
de jure standards body would probably be in order :)

rick jones

Re: RDMA will be reverted

2006-07-24 Thread Rick Jones


 It would have to be done in such a way as to not make the PAWS
 tests fail by accident.  But I think it's doable.


CPU ID and higher-order generation number such that whenever the process 
migrates to a lower-numbered CPU, the generation number is bumped to make the 
timestamp larger than before?
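
One way to read that suggestion, as a sketch only: keep the CPU ID in the low bits of the value and a generation counter above it, and bump the generation whenever the CPU number goes down. The 8-bit split, the struct and the function names are assumptions for illustration, and a real timestamp would also have to advance with the clock, which this toy ignores.

```c
#include <assert.h>
#include <stdint.h>

#define CPU_BITS 8u
#define CPU_MASK ((1u << CPU_BITS) - 1u)

/* Hypothetical per-connection sender state. */
struct ts_state {
	uint32_t gen;	/* higher-order generation number */
	uint32_t cpu;	/* CPU currently servicing the connection */
};

static uint32_t ts_encode(const struct ts_state *s)
{
	return (s->gen << CPU_BITS) | (s->cpu & CPU_MASK);
}

static uint32_t ts_cpu(uint32_t tsval)
{
	return tsval & CPU_MASK;
}

/* Moving to a lower-numbered CPU bumps the generation so the value still grows. */
static void ts_migrate(struct ts_state *s, uint32_t new_cpu)
{
	assert(new_cpu <= CPU_MASK);
	if (new_cpu < s->cpu)
		s->gen++;
	s->cpu = new_cpu;
}

/* PAWS-style check: accept a value that is not older than the last one seen. */
static int ts_not_older(uint32_t tsval, uint32_t last)
{
	return (int32_t)(tsval - last) >= 0;
}
```

Migrating from CPU 5 to CPU 2 turns the value 5 into 258 (generation 1, CPU 2), so a receiver applying the comparison above still sees it as newer.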


rick jones

softmac possible null deref [was: Complete report of Null dereference errors in kernel 2.6.17.1]

2006-07-24 Thread Daniel Drake


Tom Walter Dillig wrote:

 [109]
 452 net/ieee80211/softmac/ieee80211softmac_io.c
 Possible null dereference of variable *pkt in function call
 (include/asm/string.h:__constant_c_and_count_memset) checked at
 (453:net/ieee80211/softmac/ieee80211softmac_io.c)


Either I'm misunderstanding, or this is bogus.

when *pkt is allocated by the various child functions (e.g. 
ieee80211softmac_disassoc_deauth), it is always checked for NULL before 
being used.


Finally, line 453 does another NULL check, so that any failures 
generated above are handled appropriately.


What is the report trying to say?

Daniel


Re: softmac possible null deref [was: Complete report of Null dereference errors in kernel 2.6.17.1]

2006-07-24 Thread Stephen Hemminger

On Tue, 25 Jul 2006 01:00:54 +0100
Daniel Drake [EMAIL PROTECTED] wrote:

 Tom Walter Dillig wrote:
  [109]
  452 net/ieee80211/softmac/ieee80211softmac_io.c
  Possible null dereference of variable *pkt in function call
  (include/asm/string.h:__constant_c_and_count_memset) checked at
  (453:net/ieee80211/softmac/ieee80211softmac_io.c)
 
 Either I'm misunderstanding, or this is bogus.
 
 when *pkt is allocated by the various child functions (e.g. 
 ieee80211softmac_disassoc_deauth), it is always checked for NULL.
 
 Finally, line 453 does another NULL check.

 
 What is the report trying to say?

That the check in 453 should be removed because it is unneeded.

People who obsess about code coverage care that there are unneeded
checks. I don't think it matters.

Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen


 Even enough bits for 1024 or 2048 CPUs in the single system image? 

MSI-X supports 255 sub interrupts max, most hardware probably much less
(e.g. 8 seems to be a popular number). 

It can always be hashed to the existing CPUs.

It's a nice idea but I think standard hashing + processing in softirq 
would be worth a try first at least.

-Andi

[PATCH] via-rhine: NAPI support

2006-07-24 Thread Stephen Hemminger

Add NAPI support to the via-rhine driver so that it can handle higher speeds
and doesn't get overloaded by interrupts as easily.

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

---

 drivers/net/via-rhine.c |   75 +++
 1 files changed, 63 insertions(+), 12 deletions(-)

90258ab7e4c90d183cfa32156cd2ff48aca03974
--- skge.orig/drivers/net/via-rhine.c
+++ skge/drivers/net/via-rhine.c
@@ -30,8 +30,8 @@
 */
 
 #define DRV_NAME       "via-rhine"
-#define DRV_VERSION    "1.4.0"
-#define DRV_RELDATE    "June-27-2006"
+#define DRV_VERSION    "1.4.1"
+#define DRV_RELDATE    "July-24-2006"
 
 
 /* A few user-configurable values.
@@ -63,7 +63,11 @@ static const int multicast_filter_limit 
There are no ill effects from too-large receive rings. */
 #define TX_RING_SIZE   16
 #define TX_QUEUE_LEN   10  /* Limit ring entries actually used. */
+#ifdef CONFIG_VIA_RHINE_NAPI
+#define RX_RING_SIZE   64
+#else
 #define RX_RING_SIZE   16
+#endif
 
 
 /* Operational parameters that usually are not changed. */
@@ -396,7 +400,7 @@ static void rhine_tx_timeout(struct net_
 static int  rhine_start_tx(struct sk_buff *skb, struct net_device *dev);
 static irqreturn_t rhine_interrupt(int irq, void *dev_instance, struct pt_regs *regs);
 static void rhine_tx(struct net_device *dev);
-static void rhine_rx(struct net_device *dev);
+static int rhine_rx(struct net_device *dev, int limit);
 static void rhine_error(struct net_device *dev, int intr_status);
 static void rhine_set_rx_mode(struct net_device *dev);
 static struct net_device_stats *rhine_get_stats(struct net_device *dev);
@@ -564,6 +568,32 @@ static void rhine_poll(struct net_device
 }
 #endif
 
+#ifdef CONFIG_VIA_RHINE_NAPI
+static int rhine_napipoll(struct net_device *dev, int *budget)
+{
+   struct rhine_private *rp = netdev_priv(dev);
+   void __iomem *ioaddr = rp->base;
+   int done, limit = min(dev->quota, *budget);
+
+   done = rhine_rx(dev, limit);
+   *budget -= done;
+   dev->quota -= done;
+
+   if (done < limit) {
+           netif_rx_complete(dev);
+
+           iowrite16(IntrRxDone | IntrRxErr | IntrRxEmpty| IntrRxOverflow |
+                     IntrRxDropped | IntrRxNoBuf | IntrTxAborted |
+                     IntrTxDone | IntrTxError | IntrTxUnderrun |
+                     IntrPCIErr | IntrStatsMax | IntrLinkChange,
+                     ioaddr + IntrEnable);
+           return 0;
+   }
+   else
+           return 1;
+}
+#endif
+
 static void rhine_hw_init(struct net_device *dev, long pioaddr)
 {
struct rhine_private *rp = netdev_priv(dev);
@@ -744,6 +774,10 @@ static int __devinit rhine_init_one(stru
 #ifdef CONFIG_NET_POLL_CONTROLLER
    dev->poll_controller = rhine_poll;
 #endif
+#ifdef CONFIG_VIA_RHINE_NAPI
+   dev->poll = rhine_napipoll;
+   dev->weight = 64;
+#endif
    if (rp->quirks & rqRhineI)
            dev->features |= NETIF_F_SG|NETIF_F_HW_CSUM;
 
@@ -1165,6 +1199,7 @@ static void rhine_tx_timeout(struct net_
    dev->trans_start = jiffies;
    rp->stats.tx_errors++;
    netif_wake_queue(dev);
+   netif_poll_enable(dev);
 }
 
 static int rhine_start_tx(struct sk_buff *skb, struct net_device *dev)
@@ -1268,8 +1303,18 @@ static irqreturn_t rhine_interrupt(int i
               dev->name, intr_status);
 
    if (intr_status & (IntrRxDone | IntrRxErr | IntrRxDropped |
-                      IntrRxWakeUp | IntrRxEmpty | IntrRxNoBuf))
-           rhine_rx(dev);
+                      IntrRxWakeUp | IntrRxEmpty | IntrRxNoBuf)) {
+#ifdef CONFIG_VIA_RHINE_NAPI
+           iowrite16(IntrTxAborted |
+                     IntrTxDone | IntrTxError | IntrTxUnderrun |
+                     IntrPCIErr | IntrStatsMax | IntrLinkChange,
+                     ioaddr + IntrEnable);
+
+           netif_rx_schedule(dev);
+#else
+           rhine_rx(dev, RX_RING_SIZE);
+#endif
+   }
 
    if (intr_status & (IntrTxErrSummary | IntrTxDone)) {
            if (intr_status & IntrTxErrSummary) {
@@ -1367,13 +1412,12 @@ static void rhine_tx(struct net_device *
    spin_unlock(&rp->lock);
 }
 
-/* This routine is logically part of the interrupt handler, but isolated
-   for clarity and better register allocation. */
-static void rhine_rx(struct net_device *dev)
+/* Process up to limit frames from receive ring */
+static int rhine_rx(struct net_device *dev, int limit)
 {
struct rhine_private *rp = netdev_priv(dev);
+   int count;
    int entry = rp->cur_rx % RX_RING_SIZE;
-   int boguscnt = rp->dirty_rx + RX_RING_SIZE - rp->cur_rx;
 
    if (debug > 4) {
            printk(KERN_DEBUG "%s: rhine_rx(), entry %d status %8.8x.\n",
@@ -1382,16 +1426,18 @@ static void rhine_rx(struct net_device *
}
 
/* If EOP is set on the next entry, it's a new packet.
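
The budget accounting that rhine_napipoll() performs in the hunk above can be modelled in user space roughly as follows. This is a sketch, not driver code: `struct fake_dev`, `fake_rx()` and `fake_poll()` are hypothetical, the receive ring is reduced to a counter, and re-enabling interrupts is reduced to setting a flag.

```c
#include <assert.h>

/* Hypothetical device model: a count of pending frames stands in for the ring. */
struct fake_dev {
	int pending;		/* frames waiting in the receive ring */
	int quota;		/* per-device allowance for this poll round */
	int irq_enabled;	/* 1 once receive interrupts are unmasked again */
};

/* Process up to limit frames; return how many were handled. */
static int fake_rx(struct fake_dev *dev, int limit)
{
	int done = dev->pending < limit ? dev->pending : limit;

	dev->pending -= done;
	return done;
}

/* Same shape as the poll routine above: 0 means drained, 1 means poll again. */
static int fake_poll(struct fake_dev *dev, int *budget)
{
	int limit = dev->quota < *budget ? dev->quota : *budget;
	int done = fake_rx(dev, limit);

	*budget -= done;
	dev->quota -= done;

	if (done < limit) {
		dev->irq_enabled = 1;	/* ring empty: go back to interrupts */
		return 0;
	}
	return 1;
}
```

With 100 pending frames and a quota of 64, the first call handles 64 frames and asks to be polled again; once the quota is refilled, the second call handles the remaining 36 and switches back to interrupts.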

Re: RDMA will be reverted

2006-07-24 Thread David Miller

From: Rick Jones [EMAIL PROTECTED]
Date: Mon, 24 Jul 2006 17:55:24 -0700

 Even enough bits for 1024 or 2048 CPUs in the single system image?  I have
 seen 1024 touted by SGI, and with things going so multi-core, perhaps 16384
 while sounding initially bizarre would be in the realm of theoretically
 possible before too long.

Read the RSS NDIS documents from Microsoft.  You aren't going to want
to demux to more than, say, 256 cpus for a single network adapter even
on the largest machines.

Therefore a simple translation table and/or base cpu number is
sufficient to only need 8 bits of cpu identification.

You will be limited by the number of MSI-X vectors also,
for implementations demuxing directly to cpus using MSI-X
selection.
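
A sketch of the translation-table idea, under stated assumptions: an 8-bit identifier from the packet indexes a 256-entry table of offsets, which are added to a base cpu number. `struct steer_table`, `steer_init()` and `steer_lookup()` are made-up names, and the round-robin fill is only one possible policy.

```c
#include <assert.h>
#include <stdint.h>

#define STEER_ENTRIES 256

/* Hypothetical per-adapter table: 8-bit id -> offset from a base cpu. */
struct steer_table {
	uint16_t base_cpu;
	uint16_t cpu[STEER_ENTRIES];
};

/* Spread the 256 ids round-robin over the cpus this adapter may use. */
static void steer_init(struct steer_table *t, uint16_t base_cpu,
		       unsigned int ncpus_used)
{
	assert(ncpus_used > 0 && ncpus_used <= STEER_ENTRIES);
	t->base_cpu = base_cpu;
	for (unsigned int i = 0; i < STEER_ENTRIES; i++)
		t->cpu[i] = (uint16_t)(i % ncpus_used);
}

/* Translate the 8-bit id carried by the packet into a system-wide cpu number. */
static unsigned int steer_lookup(const struct steer_table *t, uint8_t id)
{
	return (unsigned int)t->base_cpu + t->cpu[id];
}
```

An adapter assigned cpus 512 to 515 on a large machine still needs only 8 bits per packet, because the base supplies the high part of the cpu number.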

 That would cover TCP, are there similarly fungible fields in SCTP or
 other ULPs?  And if we were to want to get HW support for the thing,
 getting it adopted in a de jure standards body would probably be in
 order :)

Microsoft never does this, neither do we.  LRO came out of our own
design, the network folks found it reasonable and thus they have
started to implement it.  The same is true for Microsoft's RSS stuff.

It's a hardware interpretation, therefore it belongs in a driver API
specification, nowhere else.

Re: Help with bugfix for bond active-backup mode + vlans

2006-07-24 Thread Jay Vosburgh

Ben Greear [EMAIL PROTECTED] wrote:

Jay Vosburgh wrote:
  Sadly, elegance remains elusive, since by the time skb_bond
 is called, the slave device the packet arrived on isn't available
 (vlan->real_dev points to 'bond0' by this point), and that information
 is needed to decide whether to drop the packet or not.
  The least grotty solution that comes to mind is to have
 __vlan_hwaccel_rx call some skb_bond_suppress_dups() function directly,
 and change skb_bond() to also call that function.

Can you use skb->input_dev?

Not as it is currently implemented.  It is set by
netif_receive_skb, not by the vlan accelerator, so input_dev ends up
being the vlan device, not the underlying actual ethernet device.  It
looks like input_dev will be inconsistently assigned with vlans over
bonding: if the slave device is vlan accelerated, input_dev will be the
vlan device; if the slave isn't accelerated, input_dev will be the
slave.

As far as I can tell, the input_dev is only used by the
NET_CLS_IND (input device classification) stuff, which has warnings
saying it might be going away.  I'm not seeing anything else right
offhand that uses it.

Anyway, the skb_bond logic really needs the enslaved interface,
which isn't necessarily the input_dev (even if input_dev was always the
device that actually had the wire plugged into it).  If the slave is
itself some kind of virtual device (a vlan, for example), then input_dev
wouldn't be the right thing.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Re: RDMA will be reverted

2006-07-24 Thread Andi Kleen

On Tuesday 25 July 2006 02:29, Rick Jones wrote:
 This all sounds like the discussions we had within HP-UX between 10.20 and
 11.0 concerning Inbound Packet Scheduling vs Thread Optimized Packet
 Scheduling.

We've also been talking about this for many years, just no code so far.
Or rather Linux so far left the job to manual tuning.

-Andi

Re: skge error; hangs w/ hardware memory hole

2006-07-24 Thread Andi Kleen

On Sunday 23 July 2006 08:32, Anthony DeRobertis wrote:
 Andreas Kleen wrote:
 
  
  You need to use iommu=soft swiotlb=force
  
  The standard IOMMU is also broken on VIA, but forced swiotlb should
  work.
 
 Didn't work :-(

swiotlb=force is unfortunately broken right now. 

But with this patch it should work. Does it?

-Andi

Test patch only: disable DMA over 4GB

Index: linux-2.6.17-work/arch/x86_64/kernel/pci-dma.c
===
--- linux-2.6.17-work.orig/arch/x86_64/kernel/pci-dma.c
+++ linux-2.6.17-work/arch/x86_64/kernel/pci-dma.c
@@ -202,7 +202,7 @@ int dma_set_mask(struct device *dev, u64
 {
    if (!dev->dma_mask || !dma_supported(dev, mask))
            return -EIO;
-   *dev->dma_mask = mask;
+   *dev->dma_mask = mask & 0xffffffff;
    return 0;
 }
 EXPORT_SYMBOL(dma_set_mask);
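
The effect of the clamp can be checked in isolation. This is a minimal user-space sketch, assuming the test patch limits the mask to 32 bits as its title says; `clamp_dma_mask()` and `dma_addr_ok()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits of whatever mask the driver asked for. */
static uint64_t clamp_dma_mask(uint64_t mask)
{
	return mask & 0xffffffffULL;
}

/* A bus address is usable if it has no bits set outside the mask. */
static int dma_addr_ok(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

A driver that asks for a full 64-bit mask ends up with a 32-bit one, so any address at or above 4GB is rejected, while a driver that asked for a narrower mask keeps it unchanged.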


[IPROUTE]: Add support for multipath route realms

2006-07-24 Thread Patrick McHardy

[IPROUTE]: Add support for multipath route realms

Routing realms exist per nexthop, but iproute currently only allows sending
a single route realm, which is refused by the kernel for multipath routes.
Add support for specifying per nexthop realms. Old kernels only return the
first realm back to userspace when dumping, so the others can't be displayed;
apart from that, it also behaves correctly on old kernels.

old kernel:

1.2.3.4 realm 1
nexthop dev dummy0 weight 1
nexthop dev dummy1 weight 1
nexthop dev dummy2 weight 1
nexthop dev dummy3 weight 1

new kernel:

1.2.3.4
nexthop realm 1 dev dummy0 weight 1
nexthop realm 2 dev dummy1 weight 1
nexthop realm 3 dev dummy2 weight 1
nexthop realm 4 dev dummy3 weight 1

route queries on both old and new kernel:

1.2.3.4 dev dummy0  src 10.0.0.1 realm 1
cache  mtu 1500 advmss 1460 metric 10 64
1.2.3.4 dev dummy1  src 10.0.0.1 realm 2
cache  mtu 1500 advmss 1460 metric 10 64
1.2.3.4 dev dummy2  src 10.0.0.1 realm 3
cache  mtu 1500 advmss 1460 metric 10 64
1.2.3.4 dev dummy3  src 10.0.0.1 realm 4
cache  mtu 1500 advmss 1460 metric 10 64

Signed-off-by: Patrick McHardy [EMAIL PROTECTED]

---
commit dbc39a8d37d658776a8959d2393b1047ea124436
tree be59669a06709aaa3b194f050529fe3986928dc8
parent 8f8a36487119a3cd1afe86a9649704aca088567b
author Patrick McHardy [EMAIL PROTECTED] Tue, 25 Jul 2006 05:55:36 +0200
committer Patrick McHardy [EMAIL PROTECTED] Tue, 25 Jul 2006 05:55:36 +0200

 ip/iproute.c |   19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index a43c09e..3544f02 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -557,6 +557,18 @@ #endif

                                            RTA_DATA(tb[RTA_GATEWAY]),
                                            abuf, sizeof(abuf)));
                        }
+                       if (tb[RTA_FLOW]) {
+                               __u32 to = *(__u32*)RTA_DATA(tb[RTA_FLOW]);
+                               __u32 from = to>>16;
+                               to &= 0xFFFF;
+                               fprintf(fp, " realm%s ", from ? "s" : "");
+                               if (from) {
+                                       fprintf(fp, "%s/",
+                                               rtnl_rtrealm_n2a(from, b1, sizeof(b1)));
+                               }
+                               fprintf(fp, "%s",
+                                       rtnl_rtrealm_n2a(to, b1, sizeof(b1)));
+                       }
                }
                if (r->rtm_flags&RTM_F_CLONED && r->rtm_type == RTN_MULTICAST) {
                        fprintf(fp, " %s", ll_index_to_name(nh->rtnh_ifindex));
@@ -606,6 +618,13 @@ int parse_one_nh(struct rtattr *rta, str
                        rtnh->rtnh_hops = w - 1;
                } else if (strcmp(*argv, "onlink") == 0) {
                        rtnh->rtnh_flags |= RTNH_F_ONLINK;
+               } else if (matches(*argv, "realms") == 0) {
+                       __u32 realm;
+                       NEXT_ARG();
+                       if (get_rt_realms(&realm, *argv))
+                               invarg("\"realm\" value is invalid\n", *argv);
+                       rta_addattr32(rta, 4096, RTA_FLOW, realm);
+                       rtnh->rtnh_len += sizeof(struct rtattr) + 4;
} else
break;
}
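
The shift and mask in the display hunk imply that a single 32-bit RTA_FLOW value carries two realms: the source ("from") realm in the upper 16 bits and the destination ("to") realm in the lower 16. A small sketch of that packing, where `realms_pack()` and `realms_unpack()` are hypothetical helpers rather than iproute functions:

```c
#include <assert.h>
#include <stdint.h>

/* Combine a from/to realm pair into one 32-bit attribute value. */
static uint32_t realms_pack(uint16_t from, uint16_t to)
{
	return ((uint32_t)from << 16) | to;
}

/* Split the attribute value the same way the display code does. */
static void realms_unpack(uint32_t value, uint16_t *from, uint16_t *to)
{
	assert(from && to);
	*from = (uint16_t)(value >> 16);
	*to = (uint16_t)(value & 0xFFFF);
}
```

A value with a zero upper half is shown as a single realm, which matches the "realm 1" lines in the example output of the commit message.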

Can we have GET_NETDEV_DEV?

2006-07-24 Thread Pavel Roskin

Hello!

gregkh-driver-network-class_device-to-device.patch, which briefly
appeared in Linux 2.6.18-rc1-mm1 broke MadWifi, which is copying the
physical device information from the master network device to the
virtual network devices:

SET_NETDEV_DEV(dev, mdev->class_dev.dev);

The same code exists in hostap.  The patch is gone from 2.6.18-rc1-mm2,
but I'd like to be prepared if it reappears.

An easy solution would be to have GET_NETDEV_DEV macro.  Then the
drivers could do this:

SET_NETDEV_DEV(dev, GET_NETDEV_DEV(mdev));

without having to worry about the internals of struct net_device.  It
should be done before class_dev is removed or at the same time.
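
A user-space mock of what such a macro could look like. The structs here are cut-down stand-ins for the kernel ones, GET_NETDEV_DEV is the proposed (not existing) macro, and `copy_parent_dev()` is a hypothetical helper showing the intended use.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for the kernel structures, for illustration only. */
struct device { const char *name; };
struct class_device { struct device *dev; };
struct net_device { struct class_device class_dev; };

#define SET_NETDEV_DEV(net, pdev)	((net)->class_dev.dev = (pdev))
/* Proposed accessor: hides where the physical device pointer is stored. */
#define GET_NETDEV_DEV(net)		((net)->class_dev.dev)

/* What a driver would do for a virtual device derived from a master device. */
static void copy_parent_dev(struct net_device *dev, struct net_device *mdev)
{
	assert(dev != NULL && mdev != NULL);
	SET_NETDEV_DEV(dev, GET_NETDEV_DEV(mdev));
}
```

If the field later moves out of class_dev, only the two macro bodies change and callers such as `copy_parent_dev()` stay as they are.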

Should I send a patch?

-- 
Regards,
Pavel Roskin



Re: [PATCH] ip multicast route bug fix

2006-07-24 Thread James Morris

On Wed, 19 Jul 2006, Stephen Hemminger wrote:

 This should fix the problem reported in 
 http://bugzilla.kernel.org/show_bug.cgi?id=6186
 where the skb is used after freed. The code in IP multicast route.
 
 Code was reusing an skb which could lead to use after free or double free.
 
 Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

Acked-by: James Morris [EMAIL PROTECTED]


-- 
James Morris
[EMAIL PROTECTED]

Re: RDMA will be reverted

2006-07-24 Thread Evgeniy Polyakov

On Mon, Jul 24, 2006 at 03:06:13PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
 Don't get too excited about VJ netchannels, more and more roadblocks
 to their practicality are being found every day.
 
 For example, my idea to allow ESTABLISHED TCP socket demux to be done
 before netfilter is flawed.  Connection tracking and NAT can change
 the packet ID and loop it back to us to hit exactly an ESTABLISHED TCP
 socket, therefore we must always hit netfilter first.

There is no problem with netfilter and process context processing - when an
skb is removed from the hardware list/array and is being processed by
netfilter in a netchannel (or in process context in general),
there is no problem if the changed skb is rerouted into a different
queue and state.

 All the original costs of route, netfilter, TCP socket lookup all
 reappear as we make VJ netchannels fit all the rules of real practical
 systems, eliminating their gains entirely.  I will also note in
 passing that papers on related ideas, such as the Exokernel stuff, are
 very careful to not address the issue of how practical 1) their demux
 engine is and 2) the negative side effects of userspace TCP
 implementations.  For an example of the latter, if you have some 1GB
 JAVA process you do not want to wake that monster up just to do some
 ACK processing or TCP window updates, yet if you don't you violate
 TCP's rules and risk spurious unnecessary retransmits.

I still plan to continue userspace implementation.

If gigantic-java-monster (tm) is going to read some data - it has been
awakened already, thus it is in memory (with the linked tcp lib), so
there is zero overhead.

 Furthermore, the VJ netchannel gains can be partially obtained from
 generic stateless facilities that we are going to get anyways.
 Networking chips supporting multiple MSI-X vectors, chosen by hashing
 the flow ID, can move TCP processing to end nodes which are cpu
 threads in this case, by having each such MSI-X vector target a
 different cpu thread.

And if that CPU is very busy?
Linux should somehow tell the NIC that some CPUs are valid and some are not
right now, but not in a second, so the scheduler must be tightly bound with
network internals.

Just my 2 coins.

-- 
Evgeniy Polyakov