Re: 2.6.23-rc8 network problem. Mem leak? ip1000a?

2007-09-30 Thread linux
 ntpd.  Sounds like pps leaking to me.

That's what I'd think, except that pps does no allocation in the normal
running state, so there's nothing to leak.  The interrupt path just
records the time in some preallocated, static buffers and wakes up
blocked readers.  The read path copies the latest data out of those
static buffers.  There's allocation when the PPS device is created,
and more when it's opened.
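(For the curious, the normal-running-state path is roughly the following; this is
a hedged sketch with made-up names, not the actual pps patch code:)

#include <linux/interrupt.h>
#include <linux/time.h>
#include <linux/wait.h>

/* Preallocated, static state: nothing is allocated per pulse, so
 * there is nothing to leak in steady state. */
static struct {
	struct timespec ts;
	unsigned long seq;
} pps_data;
static DECLARE_WAIT_QUEUE_HEAD(pps_wait);

static irqreturn_t pps_irq(int irq, void *dev_id)
{
	getnstimeofday(&pps_data.ts);		/* record the time in place */
	pps_data.seq++;
	wake_up_interruptible(&pps_wait);	/* wake any blocked readers */
	return IRQ_HANDLED;
}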

 Can anyone offer some diagnosis advice?

 CONFIG_DEBUG_SLAB_LEAK?

Ah, thank you; I've been using SLUB, which doesn't support this option.
Here's what I've extracted.  I've only presented the top few
slab_allocators and a small subset of the oom-killer messages, but I
have full copies if desired.  Unfortunately, I've discovered that the
machine doesn't live in this unhappy state forever.  Indeed, I'm not
sure if killing ntpd fixes anything; my previous observations
may have been optimistic ignorance.

(For my own personal reference looking for more oom-kill, I nuked ntpd
at 06:46:56.  And the oom-kills are continuing, with the latest at
07:43:52.)

Anyway, I have a bunch of information from the slab_allocators file, but
I'm not quite sure how to make sense of it.


With a machine in the unhappy state and firing the OOM killer, the top
20 slab_allocators are:
$ sort -rnk2 /proc/slab_allocators | head -20
skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121
size-512: 1706572 tcp_send_ack+0x23/0x102
skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121
size-2048: 148500 tcp_sendmsg+0x1b5/0xae1
sysfs_dir_cache: 5289 sysfs_new_dirent+0x4b/0xec
size-512: 2613 sock_alloc_send_skb+0x93/0x1dd
Acpi-Operand: 2014 acpi_ut_allocate_object_desc_dbg+0x34/0x6e
size-32: 1995 sysfs_new_dirent+0x29/0xec
vm_area_struct: 1679 mmap_region+0x18f/0x421
size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd
size-512: 1571 arp_create+0x4e/0x1cd
vm_area_struct: 1544 copy_process+0x9f1/0x1108
anon_vma: 1448 anon_vma_prepare+0x29/0x74
filp: 1201 get_empty_filp+0x44/0xcd
UDP: 1173 sk_alloc+0x25/0xaf
size-128: 1048 r1bio_pool_alloc+0x23/0x3b
size-128: 1024 nfsd_cache_init+0x2d/0xcf
Acpi-Namespace: 973 acpi_ns_create_node+0x2c/0x45
vm_area_struct: 717 split_vma+0x33/0xe5
dentry: 594 d_alloc+0x24/0x177

I'm not sure quite what normal numbers are, but I do wonder why there
are 1.7 million TCP acks buffered in the system.  Shouldn't they be
transmitted and deallocated pretty quickly?

This machine receives more data than it sends, so I'd expect acks to
outnumber real packets.  Could the ip1000a driver's transmit path be
leaking skbs somehow?  That would also explain the flailing of the
oom-killer; it can't associate the allocations with a process.

Here's /proc/meminfo:
MemTotal:  1035756 kB
MemFree: 43508 kB
Buffers: 72920 kB
Cached: 224056 kB
SwapCached: 344916 kB
Active: 664976 kB
Inactive:   267656 kB
SwapTotal: 4950368 kB
SwapFree:  3729384 kB
Dirty:6460 kB
Writeback:   0 kB
AnonPages:  491708 kB
Mapped:  79232 kB
Slab:41324 kB
SReclaimable:25008 kB
SUnreclaim:  16316 kB
PageTables:   8132 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:   5468244 kB
Committed_AS:  1946008 kB
VmallocTotal:   253900 kB
VmallocUsed:  2672 kB
VmallocChunk:   251228 kB

I have a lot of oom-killer messages that I have saved but am not
posting for size reasons, but here are some example backtraces.
They're not very helpful to me; do they enlighten anyone else?

02:50:20: apcupsd invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
02:50:22: 
02:50:22: Call Trace:
02:50:22:  [80246053] out_of_memory+0x71/0x1ba
02:50:22:  [8024755d] __alloc_pages+0x255/0x2d7
02:50:22:  [8025cbd6] cache_alloc_refill+0x2f4/0x60a
02:50:22:  [8040602c] hiddev_ioctl+0x579/0x919
02:50:22:  [8025d0fc] kmem_cache_alloc+0x57/0x95
02:50:22:  [8040602c] hiddev_ioctl+0x579/0x919
02:50:22:  [80262511] cp_new_stat+0xe5/0xfd
02:50:22:  [804058ff] hiddev_read+0x199/0x1f6
02:50:22:  [80222fa0] default_wake_function+0x0/0xe
02:50:22:  [80269bb5] do_ioctl+0x45/0x50
02:50:22:  [80269db9] vfs_ioctl+0x1f9/0x20b
02:50:22:  [80269e07] sys_ioctl+0x3c/0x5d
02:50:22:  [8020b43e] system_call+0x7e/0x83

02:52:18: postgres invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
02:52:18: 
02:52:18: Call Trace:
02:52:18:  [80246053] out_of_memory+0x71/0x1ba
02:52:18:  [8024755d] __alloc_pages+0x255/0x2d7
02:52:18:  [8025be8a] poison_obj+0x26/0x2f
02:52:18:  [8024761f] __get_free_pages+0x40/0x79
02:52:18:  [80224d66] copy_process+0xb0/0x1108
02:52:18:  [80233388] alloc_pid+0x1f/0x27d
02:52:18:  [80225ed6] do_fork+0xb1/0x1a7
02:52:18:  [802f0627] copy_user_generic_string+0x17/0x40
02:52:18:  [8020b43e] system_call+0x7e/0x83
02:52:18:  [8020b757] ptregscall_common+0x67/0xb0

02:52:18: kthreadd invoked 

Re: 2.6.23-rc8 network problem. Mem leak? ip1000a?

2007-09-30 Thread Andrew Morton
On 30 Sep 2007 03:59:56 -0400 [EMAIL PROTECTED] wrote:

  ntpd.  Sounds like pps leaking to me.
 
 That's what I'd think, except that pps does no allocation in the normal
 running state, so there's nothing to leak.  The interrupt path just
 records the time in some preallocated, static buffers and wakes up
 blocked readers.  The read path copies the latest data out of those
 static buffers.  There's allocation when the PPS device is created,
 and more when it's opened.

OK.  Did you try to reproduce it without the pps patch applied?

  Can anyone offer some diagnosis advice?
 
  CONFIG_DEBUG_SLAB_LEAK?
 
 Ah, thank you; I've been using SLUB, which doesn't support this option.
 Here's what I've extracted.  I've only presented the top few
 slab_allocators and a small subset of the oom-killer messages, but I
 have full copies if desired.  Unfortunately, I've discovered that the
 machine doesn't live in this unhappy state forever.  Indeed, I'm not
 sure if killing ntpd fixes anything; my previous observations
 may have been optimistic ignorance.
 
 (For my own personal reference looking for more oom-kill, I nuked ntpd
 at 06:46:56.  And the oom-kills are continuing, with the latest at
 07:43:52.)
 
 Anyway, I have a bunch of information from the slab_allocators file, but
 I'm not quite sure how to make sense of it.
 
 
 With a machine in the unhappy state and firing the OOM killer, the top
 20 slab_allocators are:
 $ sort -rnk2 /proc/slab_allocators | head -20
 skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121
 size-512: 1706572 tcp_send_ack+0x23/0x102
 skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121
 size-2048: 148500 tcp_sendmsg+0x1b5/0xae1
 sysfs_dir_cache: 5289 sysfs_new_dirent+0x4b/0xec
 size-512: 2613 sock_alloc_send_skb+0x93/0x1dd
 Acpi-Operand: 2014 acpi_ut_allocate_object_desc_dbg+0x34/0x6e
 size-32: 1995 sysfs_new_dirent+0x29/0xec
 vm_area_struct: 1679 mmap_region+0x18f/0x421
 size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd
 size-512: 1571 arp_create+0x4e/0x1cd
 vm_area_struct: 1544 copy_process+0x9f1/0x1108
 anon_vma: 1448 anon_vma_prepare+0x29/0x74
 filp: 1201 get_empty_filp+0x44/0xcd
 UDP: 1173 sk_alloc+0x25/0xaf
 size-128: 1048 r1bio_pool_alloc+0x23/0x3b
 size-128: 1024 nfsd_cache_init+0x2d/0xcf
 Acpi-Namespace: 973 acpi_ns_create_node+0x2c/0x45
 vm_area_struct: 717 split_vma+0x33/0xe5
 dentry: 594 d_alloc+0x24/0x177
 
 I'm not sure quite what normal numbers are, but I do wonder why there
 are 1.7 million TCP acks buffered in the system.  Shouldn't they be
 transmitted and deallocated pretty quickly?

Yeah, that's an skbuff leak.

 This machine receives more data than it sends, so I'd expect acks to
 outnumber real packets.  Could the ip1000a driver's transmit path be
 leaking skbs somehow?

Absolutely.  Normally a driver's transmit completion interrupt handler will
run dev_kfree_skb_irq() against the skbs which have been fully sent.
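(For reference, a minimal sketch of that pattern; the driver and field names
here are hypothetical, not taken from ipg.c:)

#include <linux/netdevice.h>

#define FOO_TX_RING_SIZE 64

struct foo_priv {
	unsigned int cur_tx, dirty_tx;
	struct sk_buff *tx_skb[FOO_TX_RING_SIZE];
};

/* A real driver would read the DMA descriptor's DONE bit here. */
static bool foo_txd_done(struct foo_priv *priv, unsigned int entry)
{
	return true;
}

static void foo_tx_complete(struct foo_priv *priv)
{
	while (priv->dirty_tx != priv->cur_tx) {
		unsigned int entry = priv->dirty_tx % FOO_TX_RING_SIZE;

		if (!foo_txd_done(priv, entry))
			break;
		/* This call is what prevents the skb leak. */
		dev_kfree_skb_irq(priv->tx_skb[entry]);
		priv->tx_skb[entry] = NULL;
		priv->dirty_tx++;
	}
}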

However it'd be darned odd if the driver was leaking only tcp acks.

I can find no occurrence of dev_kfree_skb in drivers/net/ipg.c, which is
suspicious.

Where did you get your ipg.c from, btw?  davem's tree?  rc8-mm1? rc8-mm2??

  That would also explain the flailing of the
 oom-killer; it can't associate the allocations with a process.
 
 Here's /proc/meminfo:
 MemTotal:  1035756 kB
 MemFree: 43508 kB
 Buffers: 72920 kB
 Cached: 224056 kB
 SwapCached: 344916 kB
 Active: 664976 kB
 Inactive:   267656 kB
 SwapTotal: 4950368 kB
 SwapFree:  3729384 kB
 Dirty:6460 kB
 Writeback:   0 kB
 AnonPages:  491708 kB
 Mapped:  79232 kB
 Slab:41324 kB
 SReclaimable:25008 kB
 SUnreclaim:  16316 kB
 PageTables:   8132 kB
 NFS_Unstable:0 kB
 Bounce:  0 kB
 CommitLimit:   5468244 kB
 Committed_AS:  1946008 kB
 VmallocTotal:   253900 kB
 VmallocUsed:  2672 kB
 VmallocChunk:   251228 kB

I assume that meminfo was not captured when the system was ooming?  There
isn't much slab there.



Re: 2.6.23-rc8 network problem. Mem leak? ip1000a?

2007-09-30 Thread linux
 OK.  Did you try to reproduce it without the pps patch applied?

No.  But I've yanked the ip1000a driver (using the old crufty vendor-supplied
out-of-kernel module instead) and the problems are GONE.

 This machine receives more data than it sends, so I'd expect acks to
 outnumber real packets.  Could the ip1000a driver's transmit path be
 leaking skbs somehow?

 Absolutely.  Normally a driver's transmit completion interrupt handler will
 run dev_kfree_skb_irq() against the skbs which have been fully sent.

 However it'd be darned odd if the driver was leaking only tcp acks.

It's leaking lots of things... you can see ARP packets in there and
all sorts of stuff.  But the big traffic hog is BackupPC doing inbound
rsyncs all night long, which generates a lot of acks.  Those are the
packets it sends, so those are the packets that get leaked.

 I can find no occurrence of dev_kfree_skb in drivers/net/ipg.c, which is
 suspicious.

Look for IPG_DEV_KFREE_SKB, which is a wrapper macro.  (Or just add
-i to your grep.)  It should probably be deleted (it just expands to
dev_kfree_skb), but was presumably useful to someone during development.
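In other words, presumably nothing more than:

#define IPG_DEV_KFREE_SKB(skb) dev_kfree_skb(skb)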

 Where did you get your ipg.c from, btw?  davem's tree?  rc8-mm1? rc8-mm2??

As I wrote originally, I got it from
http://marc.info/?l=linux-netdev&m=118980588419882
which was a request for mainline submission.

If there are other patches floating around, I'm happy to try them.
Now that I know what to look for, it's easy to spot the leak before OOM.

 I assume that meminfo was not captured when the system was ooming?  There
 isn't much slab there.

Oops, sorry.  I captured slabinfo but not meminfo.


Thank you very much!  Sorry to jump the gun and post a lot before I had
all the data, but if it WAS a problem in -rc8, I wanted to mention it
before -final.

Now, the rush is to get the ip1000a driver fixed before the merge
window opens.  I've added all the ip1000a developers to the Cc: list in
an attempt to speed that up.


Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-30 Thread Denis V. Lunev
Hmm, so it looks like we do not need this queue processing at all...

Regards,
Den

Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 Maybe I can save you some time: we used to do down_trylock()
 for the rtnl mutex, so senders would simply return if someone
 else was already processing the queue *or* the rtnl was locked
 for some other reason. In the first case the process already
 processing the queue would also process the new messages, but
 if the rtnl was locked for some other reason (for example
 during module registration) the message would sit in the
 queue until the next rtnetlink sendmsg call, which is why
 rtnl_unlock does queue processing. Commit 6756ae4b changed
 the down_trylock to mutex_lock, so senders will now simply wait
 until the mutex is released and then call netlink_run_queue
 themselves. This means it's not needed anymore.
 
 Sounds reasonable.
 
 I started looking through the code paths and I currently cannot
 see anything that would leave a message on a kernel rtnl socket.
 
 However I did a quick test adding a WARN_ON if there were any messages
 found in the queue during rtnl_unlock and I found this code path
 getting invoked from linkwatch_event.  So there is clearly something I
 don't understand, and it sounds just a bit at odds with your
 description.
 
 If we can remove the extra queue processing that would be great,
 as it looks like a nice way to simplify the locking and the special
 cases in the code.
 
 Eric
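(For reference, the test Eric describes presumably looked something like this
hedged sketch, with rtnl being the kernel rtnetlink socket; the exact placement
is a guess:)

void rtnl_unlock(void)
{
	/* Complain if anything is still queued when we drop the lock. */
	WARN_ON(rtnl && !skb_queue_empty(&rtnl->sk_receive_queue));
	mutex_unlock(&rtnl_mutex);
	netdev_run_todo();
}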



Re: [PKT_SCHED]: Add stateless NAT

2007-09-30 Thread Patrick McHardy
Herbert Xu wrote:
 On Sun, Sep 30, 2007 at 08:26:01AM +0800, Herbert Xu wrote:
 
Indeed.  The only other case I can think of is defragmentation.
But even there we should be able to squeeze it into the original
skb :)
 
 
 OK it won't be pretty but it's definitely doable.  We simply
 swap the contents of that skb with the head skb that would've
 otherwise been returned.
 
 If this is the only place in netfilter where we need to modify
 *pskb then I think it's worthwhile.
 
  So the question is are there any other places that we haven't
  covered which also need to change *pskb?


From a quick look it seems at least IPv4 and IPv6 don't need to
change the skb anywhere else. It seems to be necessary for
bridging though to unshare the skb. OTOH br_netfilter.c already
unshares the skb for IPv4 and IPv6, so we could simply do this
unconditionally before calling the hooks.
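(Concretely, that would amount to something like the following before invoking
the hooks; a sketch, not the actual br_netfilter code:)

	/* Unshare unconditionally before calling the hooks. */
	skb = skb_share_check(skb, GFP_ATOMIC);
	if (skb == NULL)
		return NF_DROP;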



Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware

2007-09-30 Thread Patrick McHardy
Eric W. Biederman wrote:
 Patrick McHardy [EMAIL PROTECTED] writes:
 
 
Maybe I can save you some time: we used to do down_trylock()
for the rtnl mutex, so senders would simply return if someone
else was already processing the queue *or* the rtnl was locked
for some other reason. In the first case the process already
processing the queue would also process the new messages, but
if the rtnl was locked for some other reason (for example
during module registration) the message would sit in the
queue until the next rtnetlink sendmsg call, which is why
rtnl_unlock does queue processing. Commit 6756ae4b changed
the down_trylock to mutex_lock, so senders will now simply wait
until the mutex is released and then call netlink_run_queue
themselves. This means it's not needed anymore.
 
 
 Sounds reasonable.
 
 I started looking through the code paths and I currently cannot
 see anything that would leave a message on a kernel rtnl socket.
 
 However I did a quick test adding a WARN_ON if there were any messages
 found in the queue during rtnl_unlock and I found this code path
 getting invoked from linkwatch_event.  So there is clearly something I
 don't understand, and it sounds just a bit at odds with your
 description.


That sounds like a bug. Did you place the WARN_ON before or after
the mutex_unlock()?


Re: [PATCH] rtnl: Simplify ASSERT_RTNL

2007-09-30 Thread Patrick McHardy
Herbert Xu wrote:
 On Sat, Sep 29, 2007 at 05:32:41PM +0200, Patrick McHardy wrote:
 
For unicast addresses it's not strictly necessary since they may
only be changed under the RTNL anyway. The reason why it takes
the tx_lock is for consistency with multicast address handling,
which can't rely on the RTNL since IPv6 changes them from
BH context. The idea was that the ->set_rx_mode function should
handle both secondary unicast and multicast addresses for
simplicity.
 
 
 In any case, coming back to the original question, the RTNL
 assertion is simply wrong in this case because if we're being
 called from IPv6 then the RTNL won't even be held.
 
 So I think we need to
 
 1) Move the assert into dev_set_promiscuity.
 2) Take the TX lock in dev_set_promiscuity.


In the IPv6 case we're only changing the multicast list,
so we're never calling into __dev_set_promiscuity.

I actually even added a comment about this :)

/* Unicast addresses changes may only happen under the rtnl,
 * therefore calling __dev_set_promiscuity here is safe.
 */

I would prefer to keep the ASSERT_RTNL in __dev_set_promiscuity
since it also covers the __dev_set_rx_mode path. How about
adding an ASSERT_RTNL_ATOMIC without the might_sleep or simply
open coding it?
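(Open coded, that would be something like this sketch, assuming the rtnl mutex
state is visible at the call site:)

/* The usual ASSERT_RTNL check, minus the might_sleep(), so it can be
 * used from BH context too. */
#define ASSERT_RTNL_ATOMIC() do { \
	if (unlikely(!mutex_is_locked(&rtnl_mutex))) { \
		printk(KERN_ERR "RTNL: assertion failed at %s (%d)\n", \
		       __FILE__, __LINE__); \
		dump_stack(); \
	} \
} while (0)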


ehea work queues

2007-09-30 Thread Anton Blanchard

Hi,

I booted 2.6.23-rc8 and noticed that ehea loves its workqueues:

# ps aux|grep ehea
root      3266  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3268  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3269  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3270  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3271  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3272  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3273  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3274  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_driver_wq/]
root      3275  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/0]
root      3276  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/1]
root      3278  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/2]
root      3279  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/3]
root      3280  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/4]
root      3281  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/5]
root      3282  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/6]
root      3283  0.0  0.0      0     0 ?  S    11:02   0:00 [ehea_wq/7]

(notice also that the ehea_driver_wq/XXX exceeds TASK_COMM_LEN). 

Since they are both infrequent events and not performance critical
(memory hotplug and driver reset), can we just use schedule_work?
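(I.e., roughly the following; a sketch with hypothetical ehea names:)

	/* before: a private per-driver workqueue */
	queue_work(ehea_driver_wq, &port->reset_task);

	/* after: the shared kernel workqueue is plenty for infrequent,
	 * non-performance-critical events */
	INIT_WORK(&port->reset_task, ehea_reset_port_task);
	schedule_work(&port->reset_task);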

Anton


[PATCH][E1000E] some cleanups

2007-09-30 Thread jamal
Auke,

heres part of something i promised. 
I couldnt do any packet testing on it because 82571EB is disabled in the
driver. I uncommented the code out in the table, but the best i could
get was the module loading, some probing and some sysfs renaming
failures (probably a debianism); the machine access is intermittent, so
thats as far as i could go. In any case, you probably have a good reason
for disabling that chip. So, heres the patch, the burden of testing now
falls on you ;-
Once you have 82571EB on and kicking, my next steps are to kill LLTX
then add batching on top.
BTW, since this driver is just for PCIE, would you take a similar patch
for non-PCIE e1000?

comment:
There used to be an mmiowb() call right after the dma wake which is
gone now; is this unneeded with pcie? I have restored it, look for the
XXX.
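(For reference, the sequence in question, reconstructed from the lines this
patch removes from e1000_tx_queue(); the function name is hypothetical:)

static void e1000_kick_dma(struct e1000_adapter *adapter, int i)
{
	/* Force memory writes to complete before letting h/w know there
	 * are new descriptors to fetch (weak-ordered archs such as IA-64). */
	wmb();
	writel(i, adapter->hw.hw_addr + adapter->tx_ring->tail);
	/* Needed if more than one CPU can write our tail at a time;
	 * synchronizes IO on IA64/Altix - this is the mmiowb() in question. */
	mmiowb();
}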

cheers,
jamal
[E1000E] some cleanups
This patch makes the xmit path code a lot more readable and reusable.
preps for removing LLTX.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit ad63c288ce980907f68d94d5faac08625c0b1782
tree 296a6da371b98c6488c544a8910941ea6d8c18a8
parent 7f5d0afdff875b2c4957031f8934741aefe257cc
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 13:26:00 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 13:26:00 -0400

 drivers/net/e1000e/netdev.c |  187 +++
 1 files changed, 119 insertions(+), 68 deletions(-)

diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a21d7d..5043504 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -3076,6 +3076,18 @@ link_up:
 #define E1000_TX_FLAGS_VLAN_MASK	0xffff0000
 #define E1000_TX_FLAGS_VLAN_SHIFT  16
 
+struct e1000_tx_cbdata {
+   int count;
+   unsigned int max_per_txd;
+   unsigned int nr_frags;
+   unsigned int mss;
+   unsigned int tx_flags;
+};
+
+#define E1000_SKB_CB(__skb) ((struct e1000_tx_cbdata *)&((__skb)->cb[8]))
+#define NETDEV_TX_DROPPED -5
+#define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1)
+
 static int e1000_tso(struct e1000_adapter *adapter,
 struct sk_buff *skb)
 {
@@ -3194,8 +3206,7 @@ static bool e1000_tx_csum(struct e1000_adapter *adapter, 
struct sk_buff *skb)
 
 static int e1000_tx_map(struct e1000_adapter *adapter,
struct sk_buff *skb, unsigned int first,
-   unsigned int max_per_txd, unsigned int nr_frags,
-   unsigned int mss)
+   struct e1000_tx_cbdata *cb)
 {
	struct e1000_ring *tx_ring = adapter->tx_ring;
	struct e1000_buffer *buffer_info;
@@ -3207,11 +3218,11 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 
	while (len) {
		buffer_info = &tx_ring->buffer_info[i];
-		size = min(len, max_per_txd);
+		size = min(len, cb->max_per_txd);
 
		/* Workaround for premature desc write-backs
		 * in TSO mode.  Append 4-byte sentinel desc */
-		if (mss && !nr_frags && size == len && size > 8)
+		if (cb->mss && !cb->nr_frags && size == len && size > 8)
			size -= 4;
 
		buffer_info->length = size;
@@ -3237,7 +3248,7 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
			i = 0;
		}
 
-	for (f = 0; f < nr_frags; f++) {
+	for (f = 0; f < cb->nr_frags; f++) {
		struct skb_frag_struct *frag;
 
		frag = &skb_shinfo(skb)->frags[f];
@@ -3246,10 +3257,11 @@ static int e1000_tx_map(struct e1000_adapter *adapter,
 
		while (len) {
			buffer_info = &tx_ring->buffer_info[i];
-			size = min(len, max_per_txd);
+			size = min(len, cb->max_per_txd);
			/* Workaround for premature desc write-backs
			 * in TSO mode.  Append 4-byte sentinel desc */
-			if (mss && f == (nr_frags-1) && size == len && size > 8)
+			if (cb->mss && f == (cb->nr_frags-1) &&
+			    size == len && size > 8)
				size -= 4;
 
			buffer_info->length = size;
@@ -3334,18 +3346,7 @@ static void e1000_tx_queue(struct e1000_adapter *adapter,
	}
 
	tx_desc->lower.data |= cpu_to_le32(adapter->txd_cmd);
-
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.  (Only
-	 * applicable for weak-ordered memory model archs,
-	 * such as IA-64). */
-	wmb();
-
	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tail);
-	/* we need this if more than one processor can write to our tail
-	 * at a time, it synchronizes IO on IA64/Altix systems */
-	mmiowb();
 }
 
 #define MINIMUM_DHCP_PACKET_SIZE 282
@@ -3417,45 +3418,54 @@ static int e1000_maybe_stop_tx(struct net_device 
*netdev, int size)

[PATCH][TG3]Some cleanups

2007-09-30 Thread jamal

Here are some non-batching related changes that i have in my batching
tree. Like the e1000e, they make the xmit code more readable.
I wouldnt mind if you take them over.

cheers,
jamal

[TG3] Some cleanups
These cleanups make the xmit path code better functionally organized.
Matt Carlson contributed the moving of the VLAN formatting into
_prep_frame() portion.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 260dbcc4b0195897c539c5ff79d95afdddeb3378
tree b2047b0e474abb9f05dd40c22af7f0a86369957d
parent ad63c288ce980907f68d94d5faac08625c0b1782
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:01:46 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:01:46 -0400

 drivers/net/tg3.c |  278 -
 1 files changed, 169 insertions(+), 109 deletions(-)

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index d4ac6e9..5a864bd 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3910,47 +3910,69 @@ static void tg3_set_txd(struct tg3 *tp, int entry,
	txd->vlan_tag = vlan_tag << TXD_VLAN_TAG_SHIFT;
 }
 
-/* hard_start_xmit for devices that don't have any bugs and
- * support TG3_FLG2_HW_TSO_2 only.
- */
-static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+struct tg3_tx_cbdata {
+   u32 base_flags;
+   unsigned int mss;
+};
+#define TG3_SKB_CB(__skb)   ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))
+#define NETDEV_TX_DROPPED   -5
+
+static int tg3_prep_bug_frame(struct sk_buff *skb, struct net_device *dev)
 {
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
+#if TG3_VLAN_TAG_USED
 	struct tg3 *tp = netdev_priv(dev);
-	dma_addr_t mapping;
-	u32 len, entry, base_flags, mss;
+	u32 vlantag = 0;
 
-	len = skb_headlen(skb);
+	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
+		vlantag = (TXD_FLAG_VLAN | (vlan_tx_tag_get(skb) << 16));
 
-	/* We are running in BH disabled context with netif_tx_lock
-	 * and TX reclaim runs via tp->napi.poll inside of a software
-	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
-	 * no IRQ context deadlocks to worry about either.  Rejoice!
-	 */
-	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
-		if (!netif_queue_stopped(dev)) {
-			netif_stop_queue(dev);
+	cb->base_flags = vlantag;
+#endif
 
-			/* This is a hard error, log it. */
-			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
-			       "queue awake!\n", dev->name);
+	cb->mss = skb_shinfo(skb)->gso_size;
+	if (cb->mss != 0) {
+		if (skb_header_cloned(skb) &&
+		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
+			dev_kfree_skb(skb);
+			return NETDEV_TX_DROPPED;
 		}
-		return NETDEV_TX_BUSY;
+
+		cb->base_flags |= (TXD_FLAG_CPU_PRE_DMA |
+				   TXD_FLAG_CPU_POST_DMA);
 	}
 
-	entry = tp->tx_prod;
-	base_flags = 0;
-	mss = 0;
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
+	if (skb->ip_summed == CHECKSUM_PARTIAL)
+		cb->base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+   return NETDEV_TX_OK;
+}
+
+static int tg3_prep_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
+#if TG3_VLAN_TAG_USED
+	struct tg3 *tp = netdev_priv(dev);
+	u32 vlantag = 0;
+
+	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
+		vlantag = (TXD_FLAG_VLAN | (vlan_tx_tag_get(skb) << 16));
+
+	cb->base_flags = vlantag;
+#endif
+
+	cb->mss = skb_shinfo(skb)->gso_size;
+	if (cb->mss != 0) {
		int tcp_opt_len, ip_tcp_len;
 
		if (skb_header_cloned(skb) &&
		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
dev_kfree_skb(skb);
-   goto out_unlock;
+   return NETDEV_TX_DROPPED;
}
 
		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
-			mss |= (skb_headlen(skb) - ETH_HLEN) << 9;
+			cb->mss |= (skb_headlen(skb) - ETH_HLEN) << 9;
else {
struct iphdr *iph = ip_hdr(skb);
 
@@ -3958,32 +3980,58 @@ static int tg3_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
 
			iph->check = 0;
-			iph->tot_len = htons(mss + ip_tcp_len + tcp_opt_len);
-			mss |= (ip_tcp_len + tcp_opt_len) << 9;
+			iph->tot_len = htons(cb->mss + ip_tcp_len
+					     + tcp_opt_len);
+			cb->mss |= (ip_tcp_len + tcp_opt_len) << 9;
}
 
-   base_flags |= 

Re: [PATCH][TG3]Some cleanups

2007-09-30 Thread jamal
On Sun, 2007-30-09 at 14:11 -0400, jamal wrote:
 Here are some non-batching related changes that i have in my batching
 tree. Like the e1000e, they make the xmit code more readable.
 I wouldnt mind if you take them over.

Should have mentioned: Against Dave's tree from a few hours back.

cheers,
jamal



Re: [PATCH][E1000E] some cleanups

2007-09-30 Thread Kok, Auke
jamal wrote:
 Auke,
 
 heres part of something i promised. 
 I couldnt do any packet testing on because 82571EB is disabled in the
 driver. I uncommented the code out in the table, but the best i could
 get was the module loading, some probing and some sysfs renaming
 failures (probably a debianism); the machine access is intermittent, so
 thats as far as i could go. In any case, you probably have a good reason
 for disabling that chip. So, heres the patch, the burden of testing now
 falls on you ;-

no, all the hardware that is commented should work just fine. I tested this 
driver
on 82571, 82573 and ich8/ich9 - extensively.

the reason that we disable them is that we're going to migrate devices over in
batches. At introduction we'll support ich9, afterwards we'll drop in the IDs of
the other groups of silicon.

 Once you have 82571EB on and kicking, my next steps are to kill LLTX
 then add batching on top.
 BTW, since this driver is just for PCIE, would you take a similar patch
 for non-PCIE e1000?

if it's a fix, yes.

 comment:
 There used to be an mmiowb() call right after the dma wake which is
 gone now; is this unneeded with pcie? I have restored it, look for the
 XXX.


thanks, I'll go and look at this in depth in the coming weeks.

Auke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHES] TX batching

2007-09-30 Thread jamal

Latest net-2.6.24 breaks the patches i posted last week; so this is an
update to resolve that. If you are receiving these emails and are
finding them overloading, please give me a shout and i will remove your
name.

Please provide feedback on the code and/or architecture.
Last time i posted them i received none. They are now updated to 
work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev-gso_skb

I have decided i will kill ->hard_batch_xmit() and not support any
more LLTX drivers. This is the last of patches that will have
->hard_batch_xmit() as i am supporting an e1000 that is LLTX.

Dave please let me know if this meets your desires to allow devices
which are SG and able to compute CSUM benefit just in case i
misunderstood. 
Herbert, if you can look at at least patch 3 i will appreciate it
(since it kills dev->gso_skb that you introduced).

More patches to follow later if i get some feedback - i didnt want to 
overload people by dumping too many patches. Most of these patches 
mentioned below are ready to go; some need some re-testing and others 
need a little porting from an earlier kernel: 
- tg3 driver (tested and works well, but dont want to send 
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week
as well as one that describes the architectural decisions made.

Each of these patches has been performance tested (last with DaveM's
tree from last weekend) and the results are in the logs on a per-patch 
basis.  My system under test hardware is a 2xdual core opteron with a 
couple of tg3s. I have not re-run the tests with this morning's tree
but i suspect not much difference.
My test tool generates udp traffic of different sizes for up to 60
seconds per run or a total of 30M packets. I have 4 threads each
running on a specific CPU which keep all the CPUs as busy as they can
sending packets targeted at a directly connected box's udp discard
port.
All 4 CPUs target a single tg3 to send. The receiving box has a tc rule 
which counts and drops all incoming udp packets to discard port - this
allows me to make sure that the receiver is not the bottleneck in the
testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each
packet size run is repeated 10 times to ensure that there are no
transients. The average of all 10 runs is then computed and collected.

cheers,
jamal



[PATCH 1/4] [NET_BATCH] Introduce batching interface

2007-09-30 Thread jamal

This patch introduces the netdevice interface for batching.

cheers,
jamal


[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching.

A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on,
tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work,
one should break the dev->hard_start_xmit() into those 4 functions
anyways].

With the api introduced in this patch, a driver which has all
4 parts and needing to support batching is advised to split its
dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want
to do this)
Note: There are drivers which need not support either of the two
methods (example the tun driver i patched), so the two methods are
optional.

The core will first do the packet formatting by invoking your supplied
dev->hard_prep_xmit() method. It will then pass you the packet via
your dev->hard_start_xmit() method and lastly will invoke your
dev->hard_end_xmit() when it completes passing you all the packets
queued for you. dev->hard_prep_xmit() is invoked without holding any
tx lock but the rest are under TX_LOCK().
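(Putting that together, the core sequence is roughly the following hedged
sketch; the method signatures are guesses from the description, not the patch
code itself:)

static void batch_xmit(struct net_device *dev, struct sk_buff_head *blist)
{
	struct sk_buff *skb;

	/* step (a): format every packet, outside the tx lock */
	skb_queue_walk(blist, skb)
		if (dev->hard_prep_xmit)
			dev->hard_prep_xmit(skb, dev);

	netif_tx_lock(dev);
	/* steps (b)+(c): hand the packets to the driver one by one */
	while ((skb = __skb_dequeue(blist)) != NULL)
		dev->hard_start_xmit(skb, dev);
	/* step (d): one IO kick once the whole batch is queued */
	if (dev->hard_end_xmit)
		dev->hard_end_xmit(dev);
	netif_tx_unlock(dev);
}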

LLTX presents a challenge in that we have to introduce a deviation
from the norm and introduce the ->hard_batch_xmit() method. An LLTX
driver presents us with ->hard_batch_xmit() to which we pass it a list
of packets in a dev->blist skb queue. It is then the responsibility
of the ->hard_batch_xmit() to exercise steps #b and #c for all packets
and #d when the batching is complete. Step #a is already done for you
by the time you get the packets in dev->blist.
And lastly, an xmit_win variable is introduced to ensure that when we pass
the driver a list of packets it will swallow all of them - which is
useful because we dont requeue to the qdisc (and avoids burning
unnecessary cpu cycles or introducing any strange re-ordering). The driver
tells us when it invokes netif_wake_queue how much space it has for
descriptors by setting this variable.

Some decisions i had to make:
- every driver will have a xmit_win variable and the core will set it
to 1 which means the behavior of non-batching drivers stays the same.
- the batch list, blist, is no longer a pointer; wastes a little extra
memory i plan to recoup by killing gso_skb in later patches.

Theres a lot of history and reasoning of why batching in a document
i am writing which i may submit as a patch.
Thomas Graf (who doesnt know this probably) gave me the impetus to
start looking at this back in 2004 when he invited me to the linux
conference he was organizing. Parts of what i presented in SUCON in
2004 talk about batching. Herbert Xu forced me to take a second look around
2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided
me with more motivation in May 2007 when he posted on netdev and engaged
me.
Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan,
Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and
David Miller, have contributed in one or more of {bug fixes, enhancements,
testing, lively discussion}. The Broadcom and netiron folks have been
outstanding in their help.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 624a0bfeb971c9aa58496c7372df01f0ed750def
tree c1c0ee53453392866a5241631a7502ce6569b2cc
parent 260dbcc4b0195897c539c5ff79d95afdddeb3378
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400

 include/linux/netdevice.h |   17 +++
 net/core/dev.c|  106 +
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..df1fb61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT  16
@@ -595,6 +596,15 @@ struct net_device
	void			*priv;	/* pointer to private data	*/
	int			(*hard_start_xmit) (struct sk_buff *skb,
						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX, kill it when those
+	 * disappear or better kill it now and dont support LLTX
+	 */
+	int

[PATCH 2/3][NET_BATCH] net core use batching

2007-09-30 Thread jamal

This patch adds the usage of batching within the core.

cheers,
jamal



[NET_BATCH] net core use batching

This patch adds the usage of batching within the core.
The same test methodology used in introducing txlock is used, with
the following results on different kernels:

++--+-+++
|   64B  |  128B| 256B| 512B   |1024B   |
++--+-+++
Original| 467482 | 463061   | 388267  | 216308 | 114704 |
||  | |||
txlock  | 468922 | 464060   | 388298  | 216316 | 114709 |
||  | |||
tg3nobtx| 468012 | 464079   | 388293  | 216314 | 114704 |
||  | |||
tg3btxdr| 480794 | 475102   | 388298  | 216316 | 114705 |
||  | |||
tg3btxco| 481059 | 475423   | 388285  | 216308 | 114706 |
++--+-+++

The first two columns Original and txlock were introduced in an earlier
patch and demonstrate a slight increase in performance with txlock.
tg3nobtx shows the tg3 driver with no changes to support batching.
The purpose of this test is to demonstrate the effect of introducing
the core changes to a driver that doesnt support them.
Although this patch brings down performance slightly compared to txlock
for such netdevices, it is still better compared to just the original
kernel.
tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with tg3
driver. tg3btxco demonstrates the effect of letting the core do all the
work. As can be seen the last two are not very different in performance.
The difference is ->hard_batch_xmit() introduces a new method which
is intrusive.

I have #if-0ed some of the old functions so the patch is more readable.

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
tree 053e8dda02b5d26fe7cc778823306a8a526df513
parent 624a0bfeb971c9aa58496c7372df01f0ed750def
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400

 net/sched/sch_generic.c |  127 +++
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff 
*skb,
 
return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+   return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+  struct Qdisc *q)
+{
+
+   struct sk_buff *skb;
+
+   while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+   netif_schedule(dev);
+   return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+   struct Qdisc *q)
+{
+   int ret = handle_dev_cpu_collision(dev);
+
+   if (ret) {
+   if (!skb_queue_empty(skbs))
+   skb_queue_purge(skbs);
+   return qdisc_qlen(q);
+   }
+
+   return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+   int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+   count += 1;
+   }
+   return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+  struct Qdisc *q,
+  struct sk_buff_head *pktlist)
+{
+   struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+   if (!skb)
+   break;
+

[PATCH 3/3][NET_SCHED] kill dev-gso_skb

2007-09-30 Thread jamal

This patch removes dev-gso_skb as it is no longer necessary with
batching code.

cheers,
jamal


[NET_SCHED] kill dev-gso_skb
The batching code does what gso used to batch at the drivers.
There is no more need for gso_skb. If for whatever reason the
requeueing is a bad idea we are going to leave packets in dev->blist
(and still not need dev->gso_skb)

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]

---
commit c2916c550d228472ddcdd676c2689fa6c8ecfcc0
tree 5beaf197fd08a038d83501f405017f48712d0318
parent 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400
committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400

 include/linux/netdevice.h |3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index df1fb61..cea400a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
	struct list_head	qdisc_list;
	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-   /* Partially transmitted GSO packet. */
-   struct sk_buff  *gso_skb;
-
/* ingress path synchronizer */
spinlock_t  ingress_lock;
struct Qdisc*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
struct sk_buff *skb;
	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-   count -= xmit_count_skbs(skb);
-   __skb_queue_tail(pktlist, skb);
-   }
-
	while (count > 0) {
		skb = q->dequeue(q);
if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
struct Qdisc *qdisc;
-   struct sk_buff *skb;
 
	spin_lock_bh(&dev->queue_lock);
	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 
qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
	if (!skb_queue_empty(&dev->blist))
		skb_queue_purge(&dev->blist);
	dev->xmit_win = 1;
	spin_unlock_bh(&dev->queue_lock);
 
-   kfree_skb(skb);
-
dev_watchdog_down(dev);
 
/* Wait for outstanding dev_queue_xmit calls. */


Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface

2007-09-30 Thread jamal
Fixed subject - should be 1/3 not 1/4

On Sun, 2007-30-09 at 14:51 -0400, jamal wrote:
 This patch introduces the netdevice interface for batching.
 
 cheers,
 jamal
 
 



Re: [PATCH][E1000E] some cleanups

2007-09-30 Thread jamal
On Sun, 2007-30-09 at 11:16 -0700, Kok, Auke wrote:

 no, all the hardware that is commented should work just fine. I tested this 
 driver
 on 82571, 82573 and ich8/ich9 - extensively.

Something else is wrong then. Can you just uncomment the 82571EB bits in
Dave's net-2.6.24 and just send a ping? If it works, let me know what
you did so i can test next time i get a chance.

 the reason that we disable them is that we're going to migrate devices over in
 batches. At introduction we'll support ich9, afterwards we'll drop in the IDs 
 of
 the other groups of silicon.

Turn them on if you want people to start using that driver.

  Once you have 82571EB on and kicking, my next steps are to kill LLTX
  then add batching on top.
  BTW, since this driver is just for PCIE, would you take a similar patch
  for non-PCIE e1000?
 
 if it's a fix, yes.

It just makes it easier to kill LLTX. If you consider killing LLTX
risky, then i will focus on e1000e.

cheers,
jamal



Re: [PATCHES] TX batching

2007-09-30 Thread jamal
And heres a patch that provides a sample of the usage for batching with
tg3. 
Requires patch [TG3]Some cleanups i posted earlier. 

cheers,
jamal

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 5a864bd..9aafb78 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3103,6 +3103,13 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
 		((tp->tx_prod - tp->tx_cons) & (TG3_TX_RING_SIZE - 1)));
 }
 
+static inline void tg3_set_win(struct tg3 *tp)
+{
+	tp->dev->xmit_win = tg3_tx_avail(tp) - (MAX_SKB_FRAGS + 1);
+	if (tp->dev->xmit_win < 1)
+		tp->dev->xmit_win = 1;
+}
+
 /* Tigon3 never reports partial packet sends.  So we do not
  * need special logic to handle SKBs that have not had all
  * of their frags sent yet, like SunGEM does.
@@ -3165,8 +3172,10 @@ static void tg3_tx(struct tg3 *tp)
 		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
 		netif_tx_lock(tp->dev);
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
+			tg3_set_win(tp);
 			netif_wake_queue(tp->dev);
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -4007,8 +4016,13 @@ void tg3_kick_DMA(struct net_device *dev)
 
 	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
 		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp))
+		dev->xmit_win = 1;
+		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
+			tg3_set_win(tp);
 			netif_wake_queue(dev);
+		}
+	} else {
+		tg3_set_win(tp);
 	}
 
 	mmiowb();
@@ -4085,6 +4099,7 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			tp->dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -4100,6 +4115,25 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
+static int tg3_start_bxmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);
 
 /* Use GSO to workaround a rare TSO bug that may be triggered when the
@@ -4112,9 +4146,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;
 
+		tg3_set_win(tp);
 		netif_wake_queue(tp->dev);
 	}
 
@@ -4267,6 +4303,25 @@ static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int tg3_start_bxmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue_buggy(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
@@ -4283,6 +4338,7 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -11099,15 +11155,19 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tp->tg3_flags &= ~TG3_FLAG_POLL_SERDES;
 
+	tp->dev->hard_end_xmit = tg3_kick_DMA;
 	/* All chips before 5787 can get confused if TX buffers
 	 * straddle the 4GB address boundary in some cases.
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
-		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
-		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
+		tp->dev->hard_start_xmit = tg3_start_bxmit;
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+	} else {
+		tp->dev->hard_start_xmit = tg3_start_bxmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+	}
 
 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 &&
@@ -11955,6 +12015,8 

Re: [PATCH][E1000E] some cleanups

2007-09-30 Thread Jeff Garzik

jamal wrote:

On Sun, 2007-30-09 at 11:16 -0700, Kok, Auke wrote:


no, all the hardware that is commented should work just fine. I tested this 
driver
on 82571, 82573 and ich8/ich9 - extensively.


Something else is wrong then. Can you just uncomment the 82571EB bits in
Dave's net-2.6.24 and just send a ping? If it works, let me know what
you did so i can test next time i get a chance.


the reason that we disable them is that we're going to migrate devices over in
batches. At introduction we'll support ich9, afterwards we'll drop in the IDs of
the other groups of silicon.


Turn them on if you want people to start using that driver.


Gotta wait a bit, otherwise we have confusion and a bit of breakage from 
two drivers with the same PCI IDs.


Jeff




Re: [PATCH][E1000E] some cleanups

2007-09-30 Thread jamal
On Sun, 2007-30-09 at 15:23 -0400, Jeff Garzik wrote:

 Gotta wait a bit, otherwise we have confusion and a bit of breakage from 
 two drivers with the same PCI IDs.

ah, ok ;- 
When i was testing i compiled out e1000. I am willing to totally migrate
to e1000e, since the major machine i have access to has PCIE. Auke, just let
me know what you need to do other than uncommenting the table entries
and i will leave you alone ;-

cheers,
jamal




[PATCH] IrDA: Oops fix for ksdazzle

2007-09-30 Thread Samuel Ortiz
Hi Dave,

This is the last remaining patch for IrDA, against net-2.6.24.

It fixes a kernel oops triggered by the ksdazzle SIR driver.
We need more space for input frames, and 2048 should be plenty of it.

Signed-off-by: Alex Villacís Lasso [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]

---
 drivers/net/irda/ksdazzle-sir.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: net-2.6.24-quilt/drivers/net/irda/ksdazzle-sir.c
===
--- net-2.6.24-quilt.orig/drivers/net/irda/ksdazzle-sir.c   2007-10-01 
01:53:56.0 +0300
+++ net-2.6.24-quilt/drivers/net/irda/ksdazzle-sir.c2007-10-01 
01:53:58.0 +0300
@@ -1,7 +1,7 @@
 /*
 *
 * Filename:  ksdazzle.c
-* Version:   0.1.1
+* Version:   0.1.2
 * Description:   Irda KingSun Dazzle USB Dongle
 * Status:Experimental
 * Author:Alex Villacís Lasso [EMAIL PROTECTED]
@@ -113,6 +113,7 @@
 #define KINGSUN_REQ_SEND 0x09
 
 #define KINGSUN_SND_FIFO_SIZE	2048	/* Max packet we can send */
+#define KINGSUN_RCV_MAX 2048   /* Max transfer we can receive */
 
 struct ksdazzle_speedparams {
__le32 baudrate;/* baud rate, little endian */
@@ -150,7 +151,7 @@
__u8 tx_payload[8];
 
struct urb *rx_urb;
-   __u8 rx_payload[8];
+   __u8 *rx_buf;
iobuff_t rx_unwrap_buff;
 
struct usb_ctrlrequest *speed_setuprequest;
@@ -440,7 +441,8 @@
/* Start reception. */
 	usb_fill_int_urb(kingsun->rx_urb, kingsun->usbdev,
 			 usb_rcvintpipe(kingsun->usbdev, kingsun->ep_in),
-			 kingsun->rx_payload, 8, ksdazzle_rcv_irq, kingsun, 1);
+			 kingsun->rx_buf, KINGSUN_RCV_MAX, ksdazzle_rcv_irq,
+			 kingsun, 1);
 	kingsun->rx_urb->status = 0;
 	err = usb_submit_urb(kingsun->rx_urb, GFP_KERNEL);
if (err) {
@@ -641,6 +643,7 @@
 	kingsun->tx_buf_clear_sent = 0;
 
 	kingsun->rx_urb = NULL;
+	kingsun->rx_buf = NULL;
 	kingsun->rx_unwrap_buff.in_frame = FALSE;
 	kingsun->rx_unwrap_buff.state = OUTSIDE_FRAME;
 	kingsun->rx_unwrap_buff.skb = NULL;
@@ -651,6 +654,11 @@
 	kingsun->speed_urb = NULL;
 	kingsun->speedparams.baudrate = 0;
 
+	/* Allocate input buffer */
+	kingsun->rx_buf = kmalloc(KINGSUN_RCV_MAX, GFP_KERNEL);
+	if (!kingsun->rx_buf)
+		goto free_mem;
+
 	/* Allocate output buffer */
 	kingsun->tx_buf_clear = kmalloc(KINGSUN_SND_FIFO_SIZE, GFP_KERNEL);
 	if (!kingsun->tx_buf_clear)
@@ -714,6 +722,7 @@
   free_mem:
 	kfree(kingsun->speed_setuprequest);
 	kfree(kingsun->tx_buf_clear);
+	kfree(kingsun->rx_buf);
free_netdev(net);
   err_out1:
return ret;
@@ -746,6 +755,7 @@
 
 	kfree(kingsun->speed_setuprequest);
 	kfree(kingsun->tx_buf_clear);
+	kfree(kingsun->rx_buf);
 	free_netdev(kingsun->netdev);
 
usb_set_intfdata(intf, NULL);



Re: [PATCH 2/3 Rev4] Initilize and populate age field

2007-09-30 Thread David Miller
From: Varun Chandramohan [EMAIL PROTECTED]
Date: Thu, 20 Sep 2007 20:57:51 +0530

 @@ -420,6 +421,7 @@ static int fn_hash_insert(struct fib_tab
   else
   fa = fib_find_alias(&f->fn_alias, tos, fi->fib_priority);
  
 + do_gettimeofday(&tv);
   /* Now fa, if non-NULL, points to the first fib alias
* with the same keys [prefix,tos,priority], if such key already
* exists or to the node before which we will insert new one.

gettimeofday() is expensive; we don't even use it to timestamp every
incoming packet, and we therefore should not do it for every route cache
entry we create, in order to handle DoS situations efficiently.

I honestly don't like these patches.  I literally cringe every time
you post a new revision.  It's either going to add new costs to route
management or be so inaccurate as to be useless.

I question its usefulness even if implemented efficiently and
accurately.  I really don't see people lining up asking for a route
aging metric.

We could report zero and be compliant with the RFC, we don't age our
route entries like the model of the SNMP vars seems to suggest, so
it's honestly accurate.  And this would mean no kernel changes, as the
userland program could just report zero in the absense of a kernel
provided value.


Re: SFQ qdisc crashes with limit of 2 packets

2007-09-30 Thread David Miller
From: Alexey Kuznetsov [EMAIL PROTECTED]
Date: Fri, 21 Sep 2007 19:55:22 +0400

 Hello!
 
 Remove artificial limitation for sfq queue limit.
 
 This is followup to Patrick's patch. A little optimization to enqueue
 routine allows to remove artificial limitation on queue length.
 
 Plus, testing showed that the hash function used by SFQ is too bad or even worse.
 It does not even sweep the whole range of hash values.
 Switched to Jenkins' hash.
 
 
 Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]

Applied, thanks Alexey.
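
(For reference, the kernel's Jenkins hash lives in linux/jhash.h; a minimal
usage sketch of the kind of hash SFQ switches to, not the patch itself:)

#include <linux/jhash.h>

static u32 sfq_hash_sketch(u32 saddr, u32 daddr, u32 ports, u32 perturb)
{
	/* jhash_3words() mixes three 32-bit words with an initval; the
	 * perturbation value yields a fresh hash each rehash period. */
	return jhash_3words(saddr, daddr, ports, perturb);
}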


Re: [PATCH 1/2] bnx2: factor out gzip unpacker

2007-09-30 Thread David Miller
From: Michael Chan [EMAIL PROTECTED]
Date: Fri, 21 Sep 2007 19:47:17 -0700

 On Fri, 2007-09-21 at 10:49 -0700, David Miller wrote:
  From: Denys Vlasenko [EMAIL PROTECTED]
  Date: Fri, 21 Sep 2007 18:03:55 +0100
  
   Do patches look ok to you?
  
  I'm travelling so I haven't looked closely yet :-)
  
  Michael can take a look and I'll try to do so as well
  tonight.
  
 
 I've already reviewed the earlier versions of the patch and have made
 some suggestions.  This latest one looks ok to me and tested ok.
 
 I'll follow up later with another patch to remove all the zeros in other
 firmware sections, and to remove the gzip headers completely.
 
 Acked-by: Michael Chan [EMAIL PROTECTED]

I've added these patches to net-2.6.24, thanks.


Re: [PATCH][E1000E] some cleanups

2007-09-30 Thread Kok, Auke
jamal wrote:
 On Sun, 2007-30-09 at 15:23 -0400, Jeff Garzik wrote:
 
 Gotta wait a bit, otherwise we have confusion and a bit of breakage from 
 two drivers with the same PCI IDs.
 
 ah, ok ;- 
 When i was testing i compiled out e1000. I am willing to totaly migrate
 to e1000e, since major machine i have access to has PCIE. Auke, just let
 me know what you need to do other than uncommenting the table entries
 and i will leave you alone ;-

the IDs are the only thing needed to enable all pci-e e1000 hardware.

by all means we need to have guys like you and Jeff test the commented IDs! I've
been doing this myself and the e1000e driver goes to our labs for a period of
testing from next week. Unfortunately they don't know how to break it as well
as some of you guys ;)

I'll personally try to get an 82571EB tested on monday.

Auke


Removing DAD in IPv6

2007-09-30 Thread Xia Yang
Hi all,

I am new to this list, so please forgive me if I say anything
inappropriate.

I am working on a mobility testbed based on IPv6. In order to improve
the performance, I would like to remove the DAD process during the
auto-configuration process of an IPv6 address for my testbed. 

During my experiment, I observed that when an address is configured
based on a router advertisement, the outgoing packets can be sent out
immediately. But the packets destined to the local node can only be received
after some time (about 1 sec). I found that the lost packets are discarded in
the kernel and never make it to the transport layer. I suspect it is
caused by the DAD process.

I would like to ask for help on how to remove or disable the DAD process
properly, so that the node can send, receive and forward packets
immediately after a new IPv6 address is generated. Any pointer is
appreciated. Thanks a lot in advance!

Best Regards,

Xia Yang



Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-09-30 Thread Bill Fink
On Sun, 30 Sep 2007, jamal wrote:

 This patch adds the usage of batching within the core.
 
 cheers,
 jamal



 [sep30-p2of3  text/plain (6.8KB)]
 [NET_BATCH] net core use batching
 
 This patch adds the usage of batching within the core.
 The same test methodology used in introducing txlock is used, with
 the following results on different kernels:
 
 ++--+-+++
 |   64B  |  128B| 256B| 512B   |1024B   |
 ++--+-+++
 Original| 467482 | 463061   | 388267  | 216308 | 114704 |
 ||  | |||
 txlock  | 468922 | 464060   | 388298  | 216316 | 114709 |
 ||  | |||
 tg3nobtx| 468012 | 464079   | 388293  | 216314 | 114704 |
 ||  | |||
 tg3btxdr| 480794 | 475102   | 388298  | 216316 | 114705 |
 ||  | |||
 tg3btxco| 481059 | 475423   | 388285  | 216308 | 114706 |
 ++--+-+++
 
 The first two columns Original and txlock were introduced in an earlier
 patch and demonstrate a slight increase in performance with txlock.
 tg3nobtx shows the tg3 driver with no changes to support batching.
 The purpose of this test is to demonstrate the effect of introducing
 the core changes to a driver that doesnt support them.
 Although this patch brings down performance slightly compared to txlock
 for such netdevices, it is still better compared to just the original
 kernel.
 tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with tg3
 driver. tg3btxco demonstrates the effect of letting the core do all the
 work. As can be seen the last two are not very different in performance.
 The difference is ->hard_batch_xmit() introduces a new method which
 is intrusive.

Have you done performance comparisons for the case of using 9000-byte
jumbo frames?

-Bill