Re: 2.6.23-rc8 network problem. Mem leak? ip1000a?
ntpd. Sounds like pps leaking to me.

That's what I'd think, except that pps does no allocation in the normal running state, so there's nothing to leak. The interrupt path just records the time in some preallocated, static buffers and wakes up blocked readers. The read path copies the latest data out of those static buffers. There's allocation when the PPS device is created, and more when it's opened.

Can anyone offer some diagnosis advice?

CONFIG_DEBUG_SLAB_LEAK?

Ah, thank you; I've been using SLUB, which doesn't support this option. Here's what I've extracted. I've only presented the top few slab_allocators and a small subset of the oom-killer messages, but I have full copies if desired. Unfortunately, I've discovered that the machine doesn't live in this unhappy state forever. Indeed, I'm not sure if killing ntpd fixes anything; my previous observations may have been optimistic ignorance. (For my own personal reference looking for more oom-kill, I nuked ntpd at 06:46:56. And the oom-kills are continuing, with the latest at 07:43:52.) Anyway, I have a bunch of information from the slab_allocators file, but I'm not quite sure how to make sense of it. 
With a machine in the unhappy state and firing the OOM killer, the top 20 slab_allocators are:

$ sort -rnk2 /proc/slab_allocators | head -20
skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121
size-512: 1706572 tcp_send_ack+0x23/0x102
skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121
size-2048: 148500 tcp_sendmsg+0x1b5/0xae1
sysfs_dir_cache: 5289 sysfs_new_dirent+0x4b/0xec
size-512: 2613 sock_alloc_send_skb+0x93/0x1dd
Acpi-Operand: 2014 acpi_ut_allocate_object_desc_dbg+0x34/0x6e
size-32: 1995 sysfs_new_dirent+0x29/0xec
vm_area_struct: 1679 mmap_region+0x18f/0x421
size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd
size-512: 1571 arp_create+0x4e/0x1cd
vm_area_struct: 1544 copy_process+0x9f1/0x1108
anon_vma: 1448 anon_vma_prepare+0x29/0x74
filp: 1201 get_empty_filp+0x44/0xcd
UDP: 1173 sk_alloc+0x25/0xaf
size-128: 1048 r1bio_pool_alloc+0x23/0x3b
size-128: 1024 nfsd_cache_init+0x2d/0xcf
Acpi-Namespace: 973 acpi_ns_create_node+0x2c/0x45
vm_area_struct: 717 split_vma+0x33/0xe5
dentry: 594 d_alloc+0x24/0x177

I'm not sure quite what normal numbers are, but I do wonder why there are 1.7 million TCP acks buffered in the system. Shouldn't they be transmitted and deallocated pretty quickly? This machine receives more data than it sends, so I'd expect acks to outnumber real packets. Could the ip1000a driver's transmit path be leaking skbs somehow? That would also explain the flailing of the oom-killer; it can't associate the allocations with a process. 
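For anyone who wants to crunch a saved copy of that file rather than eyeball it, the per-call-site totals can be aggregated with a short script. This is just a sketch: it parses a few sample lines copied from the listing above, and it assumes the simple `cache: count call_site` shape shown there.

```python
# Hypothetical helper for digesting a saved /proc/slab_allocators snapshot.
# SAMPLE holds a few lines copied verbatim from the report above.
SAMPLE = """\
skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121
size-512: 1706572 tcp_send_ack+0x23/0x102
size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd
size-512: 1571 arp_create+0x4e/0x1cd
skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121
"""

def top_allocators(text, n=3):
    """Return the n largest (count, cache, call_site) entries."""
    rows = []
    for line in text.splitlines():
        cache, count, site = line.split()
        rows.append((int(count), cache.rstrip(':'), site))
    return sorted(rows, reverse=True)[:n]

for count, cache, site in top_allocators(SAMPLE):
    print(count, cache, site)
```

With the full file as input this reproduces the `sort -rnk2` listing, but it is easy to extend (e.g. summing all entries per call site, which is what points the finger at __alloc_skb here).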
Here's /proc/meminfo:

MemTotal: 1035756 kB
MemFree: 43508 kB
Buffers: 72920 kB
Cached: 224056 kB
SwapCached: 344916 kB
Active: 664976 kB
Inactive: 267656 kB
SwapTotal: 4950368 kB
SwapFree: 3729384 kB
Dirty: 6460 kB
Writeback: 0 kB
AnonPages: 491708 kB
Mapped: 79232 kB
Slab: 41324 kB
SReclaimable: 25008 kB
SUnreclaim: 16316 kB
PageTables: 8132 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 5468244 kB
Committed_AS: 1946008 kB
VmallocTotal: 253900 kB
VmallocUsed: 2672 kB
VmallocChunk: 251228 kB

I have a lot of oom-killer messages, which I have saved but am not posting for size reasons, but here are some example backtraces. They're not very helpful to me; do they enlighten anyone else?

02:50:20: apcupsd invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
02:50:22:
02:50:22: Call Trace:
02:50:22: [80246053] out_of_memory+0x71/0x1ba
02:50:22: [8024755d] __alloc_pages+0x255/0x2d7
02:50:22: [8025cbd6] cache_alloc_refill+0x2f4/0x60a
02:50:22: [8040602c] hiddev_ioctl+0x579/0x919
02:50:22: [8025d0fc] kmem_cache_alloc+0x57/0x95
02:50:22: [8040602c] hiddev_ioctl+0x579/0x919
02:50:22: [80262511] cp_new_stat+0xe5/0xfd
02:50:22: [804058ff] hiddev_read+0x199/0x1f6
02:50:22: [80222fa0] default_wake_function+0x0/0xe
02:50:22: [80269bb5] do_ioctl+0x45/0x50
02:50:22: [80269db9] vfs_ioctl+0x1f9/0x20b
02:50:22: [80269e07] sys_ioctl+0x3c/0x5d
02:50:22: [8020b43e] system_call+0x7e/0x83

02:52:18: postgres invoked oom-killer: gfp_mask=0xd0, order=1, oomkilladj=0
02:52:18:
02:52:18: Call Trace:
02:52:18: [80246053] out_of_memory+0x71/0x1ba
02:52:18: [8024755d] __alloc_pages+0x255/0x2d7
02:52:18: [8025be8a] poison_obj+0x26/0x2f
02:52:18: [8024761f] __get_free_pages+0x40/0x79
02:52:18: [80224d66] copy_process+0xb0/0x1108
02:52:18: [80233388] alloc_pid+0x1f/0x27d
02:52:18: [80225ed6] do_fork+0xb1/0x1a7
02:52:18: [802f0627] copy_user_generic_string+0x17/0x40
02:52:18: [8020b43e] system_call+0x7e/0x83
02:52:18: [8020b757] ptregscall_common+0x67/0xb0

02:52:18: kthreadd invoked
Re: 2.6.23-rc8 network problem. Mem leak? ip1000a?
On 30 Sep 2007 03:59:56 -0400 [EMAIL PROTECTED] wrote:

ntpd. Sounds like pps leaking to me.

That's what I'd think, except that pps does no allocation in the normal running state, so there's nothing to leak. The interrupt path just records the time in some preallocated, static buffers and wakes up blocked readers. The read path copies the latest data out of those static buffers. There's allocation when the PPS device is created, and more when it's opened.

OK. Did you try to reproduce it without the pps patch applied?

Can anyone offer some diagnosis advice?

CONFIG_DEBUG_SLAB_LEAK?

Ah, thank you; I've been using SLUB, which doesn't support this option. Here's what I've extracted. I've only presented the top few slab_allocators and a small subset of the oom-killer messages, but I have full copies if desired. Unfortunately, I've discovered that the machine doesn't live in this unhappy state forever. Indeed, I'm not sure if killing ntpd fixes anything; my previous observations may have been optimistic ignorance. (For my own personal reference looking for more oom-kill, I nuked ntpd at 06:46:56. And the oom-kills are continuing, with the latest at 07:43:52.) Anyway, I have a bunch of information from the slab_allocators file, but I'm not quite sure how to make sense of it. 
With a machine in the unhappy state and firing the OOM killer, the top 20 slab_allocators are:

$ sort -rnk2 /proc/slab_allocators | head -20
skbuff_head_cache: 1712746 __alloc_skb+0x31/0x121
size-512: 1706572 tcp_send_ack+0x23/0x102
skbuff_fclone_cache: 149113 __alloc_skb+0x31/0x121
size-2048: 148500 tcp_sendmsg+0x1b5/0xae1
sysfs_dir_cache: 5289 sysfs_new_dirent+0x4b/0xec
size-512: 2613 sock_alloc_send_skb+0x93/0x1dd
Acpi-Operand: 2014 acpi_ut_allocate_object_desc_dbg+0x34/0x6e
size-32: 1995 sysfs_new_dirent+0x29/0xec
vm_area_struct: 1679 mmap_region+0x18f/0x421
size-512: 1618 tcp_xmit_probe_skb+0x1f/0xcd
size-512: 1571 arp_create+0x4e/0x1cd
vm_area_struct: 1544 copy_process+0x9f1/0x1108
anon_vma: 1448 anon_vma_prepare+0x29/0x74
filp: 1201 get_empty_filp+0x44/0xcd
UDP: 1173 sk_alloc+0x25/0xaf
size-128: 1048 r1bio_pool_alloc+0x23/0x3b
size-128: 1024 nfsd_cache_init+0x2d/0xcf
Acpi-Namespace: 973 acpi_ns_create_node+0x2c/0x45
vm_area_struct: 717 split_vma+0x33/0xe5
dentry: 594 d_alloc+0x24/0x177

I'm not sure quite what normal numbers are, but I do wonder why there are 1.7 million TCP acks buffered in the system. Shouldn't they be transmitted and deallocated pretty quickly?

Yeah, that's an skbuff leak.

This machine receives more data than it sends, so I'd expect acks to outnumber real packets. Could the ip1000a driver's transmit path be leaking skbs somehow?

Absolutely. Normally a driver's transmit completion interrupt handler will run dev_kfree_skb_irq() against the skbs which have been fully sent. However it'd be darned odd if the driver was leaking only tcp acks. I can find no occurrence of dev_kfree_skb in drivers/net/ipg.c, which is suspicious. Where did you get your ipg.c from, btw? davem's tree? rc8-mm1? rc8-mm2??

That would also explain the flailing of the oom-killer; it can't associate the allocations with a process. 
Here's /proc/meminfo:

MemTotal: 1035756 kB
MemFree: 43508 kB
Buffers: 72920 kB
Cached: 224056 kB
SwapCached: 344916 kB
Active: 664976 kB
Inactive: 267656 kB
SwapTotal: 4950368 kB
SwapFree: 3729384 kB
Dirty: 6460 kB
Writeback: 0 kB
AnonPages: 491708 kB
Mapped: 79232 kB
Slab: 41324 kB
SReclaimable: 25008 kB
SUnreclaim: 16316 kB
PageTables: 8132 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 5468244 kB
Committed_AS: 1946008 kB
VmallocTotal: 253900 kB
VmallocUsed: 2672 kB
VmallocChunk: 251228 kB

I assume that meminfo was not captured when the system was ooming? There isn't much slab there.

- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
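The "not much slab there" observation is easy to check mechanically from a saved snapshot. A small sketch, using a subset of the figures posted above (values in kB): Slab is only about 4% of RAM in this capture, which is why the snapshot can't be from the ooming period.

```python
# Sketch: parse a /proc/meminfo-style snapshot and compute the slab share.
# The numbers are the ones posted in this thread (kB).
MEMINFO = """\
MemTotal: 1035756 kB
MemFree: 43508 kB
Slab: 41324 kB
SReclaimable: 25008 kB
SUnreclaim: 16316 kB
"""

def parse_meminfo(text):
    fields = {}
    for line in text.splitlines():
        key, rest = line.split(':')
        fields[key] = int(rest.split()[0])   # drop the "kB" unit
    return fields

m = parse_meminfo(MEMINFO)
slab_pct = 100.0 * m['Slab'] / m['MemTotal']
print(f"Slab is {slab_pct:.1f}% of RAM")   # ~4% -- "not much slab there"
```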
Re: 2.6.23-rc8 network problem. Mem leak? ip1000a?
OK. Did you try to reproduce it without the pps patch applied?

No. But I've yanked the ip1000a driver (using the old crufty vendor-supplied out-of-kernel module) and the problems are GONE.

This machine receives more data than it sends, so I'd expect acks to outnumber real packets. Could the ip1000a driver's transmit path be leaking skbs somehow?

Absolutely. Normally a driver's transmit completion interrupt handler will run dev_kfree_skb_irq() against the skbs which have been fully sent. However it'd be darned odd if the driver was leaking only tcp acks.

It's leaking lots of things... you can see ARP packets in there and all sorts of stuff. But the big traffic hog is BackupPC doing inbound rsyncs all night long, which generates a lot of acks. Those are the packets it sends, so those are the packets that get leaked.

I can find no occurrence of dev_kfree_skb in drivers/net/ipg.c, which is suspicious.

Look for IPG_DEV_KFREE_SKB, which is a wrapper macro. (Or just add -i to your grep.) It should probably be deleted (it just expands to dev_kfree_skb), but was presumably useful to someone during development.

Where did you get your ipg.c from, btw? davem's tree? rc8-mm1? rc8-mm2??

As I wrote originally, I got it from http://marc.info/?l=linux-netdev&m=118980588419882 which was a request for mainline submission. If there are other patches floating around, I'm happy to try them. Now that I know what to look for, it's easy to spot the leak before OOM.

I assume that meminfo was not captured when the system was ooming? There isn't much slab there.

Oops, sorry. I captured slabinfo but not meminfo. Thank you very much! Sorry to jump the gun and post a lot before I had all the data, but if it WAS a problem in -rc8, I wanted to mention it before -final. Now, the rush is to get the ip1000a driver fixed before the merge window opens. I've added all the ip1000a developers to the Cc: list in an attempt to speed that up. 
Re: [Devel] Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Hmm, so it looks like we do not need this queue processing at all... Regards, Den

Eric W. Biederman wrote:

Patrick McHardy [EMAIL PROTECTED] writes: Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means it's not needed anymore.

Sounds reasonable. I started looking through the code paths and I currently cannot see anything that would leave a message on a kernel rtnl socket. However I did a quick test adding a WARN_ON if there were any messages found in the queue during rtnl_unlock and I found this code path getting invoked from linkwatch_event. So there is clearly something I don't understand, and it sounds at odds just a bit from your description. If we can remove the extra queue processing that would be great, as it looks like a nice way to simplify the locking and the special cases in the code. Eric
Re: [PKT_SCHED]: Add stateless NAT
Herbert Xu wrote: On Sun, Sep 30, 2007 at 08:26:01AM +0800, Herbert Xu wrote:

Indeed. The only other case I can think of is defragmentation. But even there we should be able to squeeze it into the original skb :)

OK it won't be pretty but it's definitely doable. We simply swap the contents of that skb with the head skb that would've otherwise been returned. If this is the only place in netfilter where we need to modify *pskb then I think it's worthwhile. So the question is are there any other places that we haven't covered which also need to change *pskb?

From a quick look it seems at least IPv4 and IPv6 don't need to change the skb anywhere else. It seems to be necessary for bridging though to unshare the skb. OTOH br_netfilter.c already unshares the skb for IPv4 and IPv6, so we could simply do this unconditionally before calling the hooks.
Re: [PATCH 2/5] net: Make rtnetlink infrastructure network namespace aware
Eric W. Biederman wrote:

Patrick McHardy [EMAIL PROTECTED] writes: Maybe I can save you some time: we used to do down_trylock() for the rtnl mutex, so senders would simply return if someone else was already processing the queue *or* the rtnl was locked for some other reason. In the first case the process already processing the queue would also process the new messages, but if the rtnl was locked for some other reason (for example during module registration) the message would sit in the queue until the next rtnetlink sendmsg call, which is why rtnl_unlock does queue processing. Commit 6756ae4b changed the down_trylock to mutex_lock, so senders will now simply wait until the mutex is released and then call netlink_run_queue themselves. This means it's not needed anymore.

Sounds reasonable. I started looking through the code paths and I currently cannot see anything that would leave a message on a kernel rtnl socket. However I did a quick test adding a WARN_ON if there were any messages found in the queue during rtnl_unlock and I found this code path getting invoked from linkwatch_event. So there is clearly something I don't understand, and it sounds at odds just a bit from your description.

That sounds like a bug. Did you place the WARN_ON before or after the mutex_unlock()?
Re: [PATCH] rtnl: Simplify ASSERT_RTNL
Herbert Xu wrote: On Sat, Sep 29, 2007 at 05:32:41PM +0200, Patrick McHardy wrote:

For unicast addresses it's not strictly necessary since they may only be changed under the RTNL anyway. The reason why it takes the tx_lock is for consistency with multicast address handling, which can't rely on the RTNL since IPv6 changes them from BH context. The idea was that the ->set_rx_mode function should handle both secondary unicast and multicast addresses for simplicity.

In any case, coming back to the original question, the RTNL assertion is simply wrong in this case because if we're being called from IPv6 then the RTNL won't even be held. So I think we need to
1) Move the assert into dev_set_promiscuity.
2) Take the TX lock in dev_set_promiscuity.

In the IPv6 case we're only changing the multicast list, so we're never calling into __dev_set_promiscuity. I actually even added a comment about this :)

/* Unicast addresses changes may only happen under the rtnl,
 * therefore calling __dev_set_promiscuity here is safe. */

I would prefer to keep the ASSERT_RTNL in __dev_set_promiscuity since it also covers the __dev_set_rx_mode path. How about adding an ASSERT_RTNL_ATOMIC without the might_sleep or simply open coding it?
ehea work queues
Hi, I booted 2.6.23-rc8 and noticed that ehea loves its workqueues:

# ps aux|grep ehea
root 3266 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3268 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3269 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3270 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3271 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3272 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3273 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3274 0.0 0.000 ?S 11:02 0:00 [ehea_driver_wq/]
root 3275 0.0 0.000 ?S 11:02 0:00 [ehea_wq/0]
root 3276 0.0 0.000 ?S 11:02 0:00 [ehea_wq/1]
root 3278 0.0 0.000 ?S 11:02 0:00 [ehea_wq/2]
root 3279 0.0 0.000 ?S 11:02 0:00 [ehea_wq/3]
root 3280 0.0 0.000 ?S 11:02 0:00 [ehea_wq/4]
root 3281 0.0 0.000 ?S 11:02 0:00 [ehea_wq/5]
root 3282 0.0 0.000 ?S 11:02 0:00 [ehea_wq/6]
root 3283 0.0 0.000 ?S 11:02 0:00 [ehea_wq/7]

(notice also that the ehea_driver_wq/XXX exceeds TASK_COMM_LEN). Since they are both infrequent events and not performance critical (memory hotplug and driver reset), can we just use schedule_work? Anton
[PATCH][E1000E] some cleanups
Auke, heres part of something i promised. I couldnt do any packet testing on because 82571EB is disabled in the driver. I uncommented the code out in the table, but the best i could get was the module loading, some probing and some sysfs renaming failures (probably a debianism); the machine access is intermittent, so thats as far as i could go. In any case, you probably have a good reason for disabling that chip. So, heres the patch, the burden of testing now falls on you ;- Once you have 82571EB on and kicking, my next steps are to kill LLTX then add batching on top. BTW, since this driver is just for PCIE, would you take a similar patch for non-PCIE e1000? comment: There used to be an mmiowb() call right after the dma wake which is gone now; is this unneeded with pcie? I have restored it, look for the XXX. cheers, jamal [E1000E] some cleanups This patch makes the xmit path code a lot more readable and reusable. preps for removing LLTX. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit ad63c288ce980907f68d94d5faac08625c0b1782 tree 296a6da371b98c6488c544a8910941ea6d8c18a8 parent 7f5d0afdff875b2c4957031f8934741aefe257cc author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 13:26:00 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 13:26:00 -0400 drivers/net/e1000e/netdev.c | 187 +++ 1 files changed, 119 insertions(+), 68 deletions(-) diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c index 4a21d7d..5043504 100644 --- a/drivers/net/e1000e/netdev.c +++ b/drivers/net/e1000e/netdev.c @@ -3076,6 +3076,18 @@ link_up: #define E1000_TX_FLAGS_VLAN_MASK 0x #define E1000_TX_FLAGS_VLAN_SHIFT 16 +struct e1000_tx_cbdata { + int count; + unsigned int max_per_txd; + unsigned int nr_frags; + unsigned int mss; + unsigned int tx_flags; +}; + +#define E1000_SKB_CB(__skb) ((struct e1000_tx_cbdata *)((__skb)-cb[8])) +#define NETDEV_TX_DROPPED -5 +#define TXD_USE_COUNT(S, X) (((S) (X)) + 1) + static int e1000_tso(struct e1000_adapter 
*adapter, struct sk_buff *skb) { @@ -3194,8 +3206,7 @@ static bool e1000_tx_csum(struct e1000_adapter *adapter, struct sk_buff *skb) static int e1000_tx_map(struct e1000_adapter *adapter, struct sk_buff *skb, unsigned int first, - unsigned int max_per_txd, unsigned int nr_frags, - unsigned int mss) + struct e1000_tx_cbdata *cb) { struct e1000_ring *tx_ring = adapter-tx_ring; struct e1000_buffer *buffer_info; @@ -3207,11 +3218,11 @@ static int e1000_tx_map(struct e1000_adapter *adapter, while (len) { buffer_info = tx_ring-buffer_info[i]; - size = min(len, max_per_txd); + size = min(len, cb-max_per_txd); /* Workaround for premature desc write-backs * in TSO mode. Append 4-byte sentinel desc */ - if (mss !nr_frags size == len size 8) + if (cb-mss !cb-nr_frags size == len size 8) size -= 4; buffer_info-length = size; @@ -3237,7 +3248,7 @@ static int e1000_tx_map(struct e1000_adapter *adapter, i = 0; } - for (f = 0; f nr_frags; f++) { + for (f = 0; f cb-nr_frags; f++) { struct skb_frag_struct *frag; frag = skb_shinfo(skb)-frags[f]; @@ -3246,10 +3257,11 @@ static int e1000_tx_map(struct e1000_adapter *adapter, while (len) { buffer_info = tx_ring-buffer_info[i]; - size = min(len, max_per_txd); + size = min(len, cb-max_per_txd); /* Workaround for premature desc write-backs * in TSO mode. Append 4-byte sentinel desc */ - if (mss f == (nr_frags-1) size == len size 8) + if (cb-mss f == (cb-nr_frags-1) + size == len size 8) size -= 4; buffer_info-length = size; @@ -3334,18 +3346,7 @@ static void e1000_tx_queue(struct e1000_adapter *adapter, } tx_desc-lower.data |= cpu_to_le32(adapter-txd_cmd); - - /* Force memory writes to complete before letting h/w -* know there are new descriptors to fetch. (Only -* applicable for weak-ordered memory model archs, -* such as IA-64). 
*/ - wmb(); - tx_ring-next_to_use = i; - writel(i, adapter-hw.hw_addr + tx_ring-tail); - /* we need this if more than one processor can write to our tail -* at a time, it synchronizes IO on IA64/Altix systems */ - mmiowb(); } #define MINIMUM_DHCP_PACKET_SIZE 282 @@ -3417,45 +3418,54 @@ static int e1000_maybe_stop_tx(struct net_device *netdev, int size)
[PATCH][TG3]Some cleanups
Here are some non-batching related changes that i have in my batching tree. Like the e1000e, they make the xmit code more readable. I wouldnt mind if you take them over. cheers, jamal [TG3] Some cleanups These cleanups make the xmit path code better functionally organized. Matt Carlson contributed the moving of the VLAN formatting into _prep_frame() portion. Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED] --- commit 260dbcc4b0195897c539c5ff79d95afdddeb3378 tree b2047b0e474abb9f05dd40c22af7f0a86369957d parent ad63c288ce980907f68d94d5faac08625c0b1782 author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:01:46 -0400 committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:01:46 -0400 drivers/net/tg3.c | 278 - 1 files changed, 169 insertions(+), 109 deletions(-) diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c index d4ac6e9..5a864bd 100644 --- a/drivers/net/tg3.c +++ b/drivers/net/tg3.c @@ -3910,47 +3910,69 @@ static void tg3_set_txd(struct tg3 *tp, int entry, txd-vlan_tag = vlan_tag TXD_VLAN_TAG_SHIFT; } -/* hard_start_xmit for devices that don't have any bugs and - * support TG3_FLG2_HW_TSO_2 only. - */ -static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev) +struct tg3_tx_cbdata { + u32 base_flags; + unsigned int mss; +}; +#define TG3_SKB_CB(__skb) ((struct tg3_tx_cbdata *)((__skb)-cb[0])) +#define NETDEV_TX_DROPPED -5 + +static int tg3_prep_bug_frame(struct sk_buff *skb, struct net_device *dev) { + struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb); +#if TG3_VLAN_TAG_USED struct tg3 *tp = netdev_priv(dev); - dma_addr_t mapping; - u32 len, entry, base_flags, mss; + u32 vlantag = 0; - len = skb_headlen(skb); + if (tp-vlgrp != NULL vlan_tx_tag_present(skb)) + vlantag = (TXD_FLAG_VLAN | (vlan_tx_tag_get(skb) 16)); - /* We are running in BH disabled context with netif_tx_lock -* and TX reclaim runs via tp-napi.poll inside of a software -* interrupt. 
Furthermore, IRQ processing runs lockless so we have -* no IRQ context deadlocks to worry about either. Rejoice! -*/ - if (unlikely(tg3_tx_avail(tp) = (skb_shinfo(skb)-nr_frags + 1))) { - if (!netif_queue_stopped(dev)) { - netif_stop_queue(dev); + cb-base_flags = vlantag; +#endif - /* This is a hard error, log it. */ - printk(KERN_ERR PFX %s: BUG! Tx Ring full when - queue awake!\n, dev-name); + cb-mss = skb_shinfo(skb)-gso_size; + if (cb-mss != 0) { + if (skb_header_cloned(skb) + pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) { + dev_kfree_skb(skb); + return NETDEV_TX_DROPPED; } - return NETDEV_TX_BUSY; + + cb-base_flags |= (TXD_FLAG_CPU_PRE_DMA | + TXD_FLAG_CPU_POST_DMA); } - entry = tp-tx_prod; - base_flags = 0; - mss = 0; - if ((mss = skb_shinfo(skb)-gso_size) != 0) { + if (skb-ip_summed == CHECKSUM_PARTIAL) + cb-base_flags |= TXD_FLAG_TCPUDP_CSUM; + + return NETDEV_TX_OK; +} + +static int tg3_prep_frame(struct sk_buff *skb, struct net_device *dev) +{ + struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb); +#if TG3_VLAN_TAG_USED + struct tg3 *tp = netdev_priv(dev); + u32 vlantag = 0; + + if (tp-vlgrp != NULL vlan_tx_tag_present(skb)) + vlantag = (TXD_FLAG_VLAN | (vlan_tx_tag_get(skb) 16)); + + cb-base_flags = vlantag; +#endif + + cb-mss = skb_shinfo(skb)-gso_size; + if (cb-mss != 0) { int tcp_opt_len, ip_tcp_len; if (skb_header_cloned(skb) pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) { dev_kfree_skb(skb); - goto out_unlock; + return NETDEV_TX_DROPPED; } if (skb_shinfo(skb)-gso_type SKB_GSO_TCPV6) - mss |= (skb_headlen(skb) - ETH_HLEN) 9; + cb-mss |= (skb_headlen(skb) - ETH_HLEN) 9; else { struct iphdr *iph = ip_hdr(skb); @@ -3958,32 +3980,58 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev) ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr); iph-check = 0; - iph-tot_len = htons(mss + ip_tcp_len + tcp_opt_len); - mss |= (ip_tcp_len + tcp_opt_len) 9; + iph-tot_len = htons(cb-mss + ip_tcp_len ++ tcp_opt_len); + cb-mss |= (ip_tcp_len + tcp_opt_len) 9; } - 
base_flags |=
Re: [PATCH][TG3]Some cleanups
On Sun, 2007-30-09 at 14:11 -0400, jamal wrote: Here are some non-batching related changes that i have in my batching tree. Like the e1000e, they make the xmit code more readable. I wouldnt mind if you take them over. Should have mentioned: Against Dave's tree from a few hours back. cheers, jamal
Re: [PATCH][E1000E] some cleanups
jamal wrote: Auke, heres part of something i promised. I couldnt do any packet testing on because 82571EB is disabled in the driver. I uncommented the code out in the table, but the best i could get was the module loading, some probing and some sysfs renaming failures (probably a debianism); the machine access is intermittent, so thats as far as i could go. In any case, you probably have a good reason for disabling that chip. So, heres the patch, the burden of testing now falls on you ;-

no, all the hardware that is commented should work just fine. I tested this driver on 82571, 82573 and ich8/ich9 - extensively. the reason that we disable them is that we're going to migrate devices over in batches. At introduction we'll support ich9, afterwards we'll drop in the IDs of the other groups of silicon.

Once you have 82571EB on and kicking, my next steps are to kill LLTX then add batching on top. BTW, since this driver is just for PCIE, would you take a similar patch for non-PCIE e1000?

if it's a fix, yes.

comment: There used to be an mmiowb() call right after the dma wake which is gone now; is this unneeded with pcie? I have restored it, look for the XXX.

thanks, I'll go and look at this in depth in the coming weeks. Auke
[PATCHES] TX batching
Latest net-2.6.24 breaks the patches i posted last week; so this is an update to resolve that. If you are receiving these emails and are finding them overloading, please give me a shout and i will remove your name. Please provide feedback on the code and/or architecture. Last time i posted them i received none. They are now updated to work with the latest net-2.6.24 from a few hours ago.

Patch 1: Introduces batching interface
Patch 2: Core uses batching interface
Patch 3: get rid of dev->gso_skb

I have decided i will kill ->hard_batch_xmit() and not support any more LLTX drivers. This is the last of patches that will have ->hard_batch_xmit() as i am supporting an e1000 that is LLTX. Dave please let me know if this meets your desires to allow devices which are SG and able to compute CSUM benefit just in case i misunderstood. Herbert, if you can look at at least patch 3 i will appreciate it (since it kills dev->gso_skb that you introduced). More patches to follow later if i get some feedback - i didnt want to overload people by dumping too many patches. Most of these patches mentioned below are ready to go; some need some re-testing and others need a little porting from an earlier kernel:
- tg3 driver (tested and works well, but dont want to send
- tun driver
- pktgen
- netiron driver
- e1000 driver (LLTX)
- e1000e driver (non-LLTX)
- ethtool interface
- There is at least one other driver promised to me

Theres also a driver-howto i wrote that was posted on netdev last week as well as one that describes the architectural decisions made. Each of these patches has been performance tested (last with DaveM's tree from last weekend) and the results are in the logs on a per-patch basis. My system under test hardware is a 2xdual core opteron with a couple of tg3s. I have not re-run the tests with this morning's tree but i suspect not much difference. My test tool generates udp traffic of different sizes for upto 60 seconds per run or a total of 30M packets. 
I have 4 threads each running on a specific CPU which keep all the CPUs as busy as they can sending packets targeted at a directly connected box's udp discard port. All 4 CPUs target a single tg3 to send. The receiving box has a tc rule which counts and drops all incoming udp packets to discard port - this allows me to make sure that the receiver is not the bottleneck in the testing. Packet sizes sent are {64B, 128B, 256B, 512B, 1024B}. Each packet size run is repeated 10 times to ensure that there are no transients. The average of all 10 runs is then computed and collected. cheers, jamal
[PATCH 1/4] [NET_BATCH] Introduce batching interface
This patch introduces the netdevice interface for batching. cheers, jamal

[NET_BATCH] Introduce batching interface

This patch introduces the netdevice interface for batching. A typical driver dev->hard_start_xmit() has 4 parts:
a) packet formatting (example vlan, mss, descriptor counting etc)
b) chip specific formatting
c) enqueueing the packet on a DMA ring
d) IO operations to complete packet transmit, tell DMA engine to chew on, tx completion interrupts etc

[For code cleanliness/readability sake, regardless of this work, one should break the dev->hard_start_xmit() into those 4 functions anyways]. With the api introduced in this patch, a driver which has all 4 parts and needs to support batching is advised to split its dev->hard_start_xmit() in the following manner:
1) use its dev->hard_prep_xmit() method to achieve #a
2) use its dev->hard_end_xmit() method to achieve #d
3) #b and #c can stay in ->hard_start_xmit() (or whichever way you want to do this)

Note: There are drivers which may need not support any of the two methods (example the tun driver i patched) so the two methods are optional.

The core will first do the packet formatting by invoking your supplied dev->hard_prep_xmit() method. It will then pass you the packet via your dev->hard_start_xmit() method and lastly will invoke your dev->hard_end_xmit() when it completes passing you all the packets queued for you. dev->hard_prep_xmit() is invoked without holding any tx lock but the rest are under TX_LOCK().

LLTX presents a challenge in that we have to introduce a deviation from the norm and introduce the ->hard_batch_xmit() method. An LLTX driver presents us with ->hard_batch_xmit() to which we pass it a list of packets in a dev->blist skb queue. It is then the responsibility of the ->hard_batch_xmit() to exercise steps #b and #c for all packets and #d when the batching is complete. Step #a is already done for you by the time you get the packets in dev->blist. 
And lastly, an xmit_win variable is introduced to ensure that when we pass the driver a list of packets it will swallow all of them - which is useful because we don't requeue to the qdisc (and avoids burning unnecessary cpu cycles or introducing any strange re-ordering). The driver tells us, when it invokes netif_wake_queue, how much space it has for descriptors by setting this variable.

Some decisions i had to make: - every driver will have an xmit_win variable, and the core will set it to 1, which means the behavior of non-batching drivers stays the same. - the batch list, blist, is no longer a pointer; this wastes a little extra memory which i plan to recoup by killing gso_skb in later patches.

There's a lot of history and reasoning on why batching in a document i am writing which i may submit as a patch. Thomas Graf (who probably doesn't know this) gave me the impetus to start looking at this back in 2004 when he invited me to the linux conference he was organizing. Parts of what i presented at SUCON in 2004 talk about batching. Herbert Xu forced me to take a second look around 2.6.18 - refer to my netconf 2006 presentation. Krishna Kumar provided me with more motivation in May 2007 when he posted on netdev and engaged me. Sridhar Samudrala, Krishna Kumar, Matt Carlson, Michael Chan, Jeremy Ethridge, Evgeniy Polyakov, Sivakumar Subramani, and David Miller have contributed one or more of {bug fixes, enhancements, testing, lively discussion}. The Broadcom and Neterion folks have been outstanding in their help. 
Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
 commit 624a0bfeb971c9aa58496c7372df01f0ed750def
 tree c1c0ee53453392866a5241631a7502ce6569b2cc
 parent 260dbcc4b0195897c539c5ff79d95afdddeb3378
 author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400
 committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:23:31 -0400

 include/linux/netdevice.h |   17 +++
 net/core/dev.c            |  106 +
 2 files changed, 123 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91cd3f3..df1fb61 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -467,6 +467,7 @@ struct net_device
 #define NETIF_F_NETNS_LOCAL	8192	/* Does not change network namespaces */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 #define NETIF_F_LRO		32768	/* large receive offload */
+#define NETIF_F_BTX		65536	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -595,6 +596,15 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	/* hard_batch_xmit is needed for LLTX; kill it when those
+	 * disappear, or better kill it now and don't support LLTX
+	 */
+	int
[PATCH 2/3][NET_BATCH] net core use batching
This patch adds the usage of batching within the core. cheers, jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core. The same test methodology used in introducing txlock is used, with the following results on different kernels:

          +--------+--------+--------+--------+--------+
          |  64B   |  128B  |  256B  |  512B  | 1024B  |
+---------+--------+--------+--------+--------+--------+
| Original| 467482 | 463061 | 388267 | 216308 | 114704 |
| txlock  | 468922 | 464060 | 388298 | 216316 | 114709 |
| tg3nobtx| 468012 | 464079 | 388293 | 216314 | 114704 |
| tg3btxdr| 480794 | 475102 | 388298 | 216316 | 114705 |
| tg3btxco| 481059 | 475423 | 388285 | 216308 | 114706 |
+---------+--------+--------+--------+--------+--------+

The first two rows, Original and txlock, were introduced in an earlier patch and demonstrate a slight increase in performance with txlock. tg3nobtx shows the tg3 driver with no changes to support batching. The purpose of this test is to demonstrate the effect of introducing the core changes to a driver that doesn't support them. Although this patch brings down performance slightly compared to txlock for such netdevices, it is still better than just the original kernel. tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the tg3 driver. tg3btxco demonstrates the effect of letting the core do all the work. As can be seen, the last two are not very different in performance. The difference is that ->hard_batch_xmit() introduces a new method, which is intrusive. I have #if-0ed some of the old functions so the patch is more readable. 
Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
 commit 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
 tree 053e8dda02b5d26fe7cc778823306a8a526df513
 parent 624a0bfeb971c9aa58496c7372df01f0ed750def
 author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400
 committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:11 -0400

 net/sched/sch_generic.c |  127 +++
 1 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 95ae119..86a3f9d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -56,6 +56,7 @@ static inline int qdisc_qlen(struct Qdisc *q)
 	return q->q.qlen;
 }
 
+#if 0
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
@@ -110,6 +111,97 @@ static inline int handle_dev_cpu_collision(struct sk_buff *skb,
 	return ret;
 }
+#endif
+
+static inline int handle_dev_cpu_collision(struct net_device *dev)
+{
+	if (unlikely(dev->xmit_lock_owner == smp_processor_id())) {
+		if (net_ratelimit())
+			printk(KERN_WARNING
+			       "Dead loop on netdevice %s, fix it urgently!\n",
+			       dev->name);
+		return 1;
+	}
+	__get_cpu_var(netdev_rx_stat).cpu_collision++;
+	return 0;
+}
+
+static inline int
+dev_requeue_skbs(struct sk_buff_head *skbs, struct net_device *dev,
+		 struct Qdisc *q)
+{
+
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skbs)) != NULL)
+		q->ops->requeue(skb, q);
+
+	netif_schedule(dev);
+	return 0;
+}
+
+static inline int
+xmit_islocked(struct sk_buff_head *skbs, struct net_device *dev,
+	      struct Qdisc *q)
+{
+	int ret = handle_dev_cpu_collision(dev);
+
+	if (ret) {
+		if (!skb_queue_empty(skbs))
+			skb_queue_purge(skbs);
+		return qdisc_qlen(q);
+	}
+
+	return dev_requeue_skbs(skbs, dev, q);
+}
+
+static int xmit_count_skbs(struct sk_buff *skb)
+{
+	int count = 0;
+	for (; skb; skb = skb->next) {
+		count += skb_shinfo(skb)->nr_frags;
+		count += 1;
+	}
+	return count;
+}
+
+static int xmit_get_pkts(struct net_device *dev,
+			 struct Qdisc *q,
+			 struct sk_buff_head *pktlist)
+{
+	struct sk_buff *skb;
+	int count = dev->xmit_win;
+
+	if (count && dev->gso_skb) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		count -= xmit_count_skbs(skb);
+		__skb_queue_tail(pktlist, skb);
+	}
+
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+
[PATCH 3/3][NET_SCHED] kill dev->gso_skb
This patch removes dev->gso_skb as it is no longer necessary with the batching code. cheers, jamal

[NET_SCHED] kill dev->gso_skb

The batching code now does what gso_skb used to do: batch at the drivers. There is no more need for gso_skb. If for whatever reason the requeueing is a bad idea, we are going to leave packets in dev->blist (and still not need dev->gso_skb).

Signed-off-by: Jamal Hadi Salim [EMAIL PROTECTED]
---
 commit c2916c550d228472ddcdd676c2689fa6c8ecfcc0
 tree 5beaf197fd08a038d83501f405017f48712d0318
 parent 9b4a8fb190278d388c0a622fb5529d184ac8c7dc
 author Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400
 committer Jamal Hadi Salim [EMAIL PROTECTED] Sun, 30 Sep 2007 14:38:58 -0400

 include/linux/netdevice.h |    3 ---
 net/sched/sch_generic.c   |   12 
 2 files changed, 0 insertions(+), 15 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index df1fb61..cea400a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -577,9 +577,6 @@ struct net_device
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
 
-	/* Partially transmitted GSO packet. */
-	struct sk_buff		*gso_skb;
-
 	/* ingress path synchronizer */
 	spinlock_t		ingress_lock;
 	struct Qdisc		*qdisc_ingress;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 86a3f9d..b4e1607 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -172,13 +172,6 @@ static int xmit_get_pkts(struct net_device *dev,
 	struct sk_buff *skb;
 	int count = dev->xmit_win;
 
-	if (count && dev->gso_skb) {
-		skb = dev->gso_skb;
-		dev->gso_skb = NULL;
-		count -= xmit_count_skbs(skb);
-		__skb_queue_tail(pktlist, skb);
-	}
-
 	while (count > 0) {
 		skb = q->dequeue(q);
 		if (!skb)
@@ -654,7 +647,6 @@ void dev_activate(struct net_device *dev)
 void dev_deactivate(struct net_device *dev)
 {
 	struct Qdisc *qdisc;
-	struct sk_buff *skb;
 
 	spin_lock_bh(&dev->queue_lock);
 	qdisc = dev->qdisc;
@@ -662,15 +654,11 @@ void dev_deactivate(struct net_device *dev)
 	qdisc_reset(qdisc);
 
-	skb = dev->gso_skb;
-	dev->gso_skb = NULL;
 	if (!skb_queue_empty(&dev->blist))
 		skb_queue_purge(&dev->blist);
 	dev->xmit_win = 1;
 	spin_unlock_bh(&dev->queue_lock);
 
-	kfree_skb(skb);
-
 	dev_watchdog_down(dev);
 
 	/* Wait for outstanding dev_queue_xmit calls. */
Re: [PATCH 1/3] [NET_BATCH] Introduce batching interface
Fixed subject - should be 1/3 not 1/4. On Sun, 2007-30-09 at 14:51 -0400, jamal wrote: This patch introduces the netdevice interface for batching. cheers, jamal
Re: [PATCH][E1000E] some cleanups
On Sun, 2007-30-09 at 11:16 -0700, Kok, Auke wrote: no, all the hardware that is commented should work just fine. I tested this driver on 82571, 82573 and ich8/ich9 - extensively. Something else is wrong then. Can you just uncomment the 82571EB bits in Dave's net-2.6.24 and just send a ping? If it works, let me know what you did so i can test next time i get a chance. the reason that we disable them is that we're going to migrate devices over in batches. At introduction we'll support ich9, afterwards we'll drop in the IDs of the other groups of silicon. Turn them on if you want people to start using that driver. Once you have 82571EB on and kicking, my next steps are to kill LLTX then add batching on top. BTW, since this driver is just for PCIE, would you take a similar patch for non-PCIE e1000? if it's a fix, yes. It just makes it easier to kill LLTX. If you consider killing LLTX risky, then i will focus on e1000e. cheers, jamal
Re: [PATCHES] TX batching
And here's a patch that provides a sample of the usage for batching with tg3. Requires the patch "[TG3] Some cleanups" i posted earlier. cheers, jamal

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 5a864bd..9aafb78 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -3103,6 +3103,13 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
 		((tp->tx_prod - tp->tx_cons) & (TG3_TX_RING_SIZE - 1)));
 }
 
+static inline void tg3_set_win(struct tg3 *tp)
+{
+	tp->dev->xmit_win = tg3_tx_avail(tp) - (MAX_SKB_FRAGS + 1);
+	if (tp->dev->xmit_win < 1)
+		tp->dev->xmit_win = 1;
+}
+
 /* Tigon3 never reports partial packet sends.  So we do not
  * need special logic to handle SKBs that have not had all
  * of their frags sent yet, like SunGEM does.
@@ -3165,8 +3172,10 @@ static void tg3_tx(struct tg3 *tp)
 	    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
 		netif_tx_lock(tp->dev);
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))) {
+			tg3_set_win(tp);
 			netif_wake_queue(tp->dev);
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -4007,8 +4016,13 @@ void tg3_kick_DMA(struct net_device *dev)
 
 	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
 		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp))
+		dev->xmit_win = 1;
+		if (tg3_tx_avail(tp) >= TG3_TX_WAKEUP_THRESH(tp)) {
+			tg3_set_win(tp);
 			netif_wake_queue(dev);
+		}
+	} else {
+		tg3_set_win(tp);
 	}
 
 	mmiowb();
@@ -4085,6 +4099,7 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			tp->dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -4100,6 +4115,25 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	return ret;
 }
 
+static int tg3_start_bxmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);
 
 /* Use GSO to workaround a rare TSO bug that may be triggered when the
@@ -4112,9 +4146,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;
 
+		tg3_set_win(tp);
 		netif_wake_queue(tp->dev);
 	}
@@ -4267,6 +4303,25 @@ static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int tg3_start_bxmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+			dev->xmit_win = 1;
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
+
+	return tg3_enqueue_buggy(skb, dev);
+}
+
 static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
@@ -4283,6 +4338,7 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
 		if (!netif_queue_stopped(dev)) {
 			netif_stop_queue(dev);
+			dev->xmit_win = 1;
 
 			/* This is a hard error, log it. */
 			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
@@ -11099,15 +11155,19 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tp->tg3_flags &= ~TG3_FLAG_POLL_SERDES;
 
+	tp->dev->hard_end_xmit = tg3_kick_DMA;
 	/* All chips before 5787 can get confused if TX buffers
 	 * straddle the 4GB address boundary in some cases.
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
-		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
-		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
+		tp->dev->hard_start_xmit = tg3_start_bxmit;
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+	} else {
+		tp->dev->hard_start_xmit = tg3_start_bxmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+	}
 
 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701
@@ -11955,6 +12015,8
Re: [PATCH][E1000E] some cleanups
jamal wrote: On Sun, 2007-30-09 at 11:16 -0700, Kok, Auke wrote: no, all the hardware that is commented should work just fine. I tested this driver on 82571, 82573 and ich8/ich9 - extensively. Something else is wrong then. Can you just uncomment the 82571EB bits in Dave's net-2.6.24 and just send a ping? If it works, let me know what you did so i can test next time i get a chance. the reason that we disable them is that we're going to migrate devices over in batches. At introduction we'll support ich9, afterwards we'll drop in the IDs of the other groups of silicon. Turn them on if you want people to start using that driver. Gotta wait a bit, otherwise we have confusion and a bit of breakage from two drivers with the same PCI IDs. Jeff
Re: [PATCH][E1000E] some cleanups
On Sun, 2007-30-09 at 15:23 -0400, Jeff Garzik wrote: Gotta wait a bit, otherwise we have confusion and a bit of breakage from two drivers with the same PCI IDs. ah, ok ;-> When i was testing i compiled out e1000. I am willing to totally migrate to e1000e, since the major machine i have access to has PCIE. Auke, just let me know what you need done other than uncommenting the table entries and i will leave you alone ;-> cheers, jamal
[PATCH] IrDA: Oops fix for ksdazzle
Hi Dave, This is the last remaining patch for IrDA, against net-2.6.24. It fixes a kernel oops triggered by the ksdazzle SIR driver. We need more space for input frames, and 2048 should be plenty.

Signed-off-by: Alex Villacís Lasso [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]
---
 drivers/net/irda/ksdazzle-sir.c |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

Index: net-2.6.24-quilt/drivers/net/irda/ksdazzle-sir.c
===================================================================
--- net-2.6.24-quilt.orig/drivers/net/irda/ksdazzle-sir.c	2007-10-01 01:53:56.0 +0300
+++ net-2.6.24-quilt/drivers/net/irda/ksdazzle-sir.c	2007-10-01 01:53:58.0 +0300
@@ -1,7 +1,7 @@
 /*
  *
  * Filename:      ksdazzle.c
- * Version:       0.1.1
+ * Version:       0.1.2
  * Description:   Irda KingSun Dazzle USB Dongle
  * Status:        Experimental
  * Author:        Alex Villacís Lasso [EMAIL PROTECTED]
@@ -113,6 +113,7 @@
 #define KINGSUN_REQ_SEND 0x09
 
 #define KINGSUN_SND_FIFO_SIZE	2048	/* Max packet we can send */
+#define KINGSUN_RCV_MAX		2048	/* Max transfer we can receive */
 
 struct ksdazzle_speedparams {
 	__le32 baudrate;	/* baud rate, little endian */
@@ -150,7 +151,7 @@
 	__u8 tx_payload[8];
 
 	struct urb *rx_urb;
-	__u8 rx_payload[8];
+	__u8 *rx_buf;
 	iobuff_t rx_unwrap_buff;
 
 	struct usb_ctrlrequest *speed_setuprequest;
@@ -440,7 +441,8 @@
 	/* Start reception. */
 	usb_fill_int_urb(kingsun->rx_urb, kingsun->usbdev,
			 usb_rcvintpipe(kingsun->usbdev, kingsun->ep_in),
-			 kingsun->rx_payload, 8, ksdazzle_rcv_irq, kingsun, 1);
+			 kingsun->rx_buf, KINGSUN_RCV_MAX, ksdazzle_rcv_irq,
+			 kingsun, 1);
 	kingsun->rx_urb->status = 0;
 	err = usb_submit_urb(kingsun->rx_urb, GFP_KERNEL);
 	if (err) {
@@ -641,6 +643,7 @@
 	kingsun->tx_buf_clear_sent = 0;
 
 	kingsun->rx_urb = NULL;
+	kingsun->rx_buf = NULL;
 	kingsun->rx_unwrap_buff.in_frame = FALSE;
 	kingsun->rx_unwrap_buff.state = OUTSIDE_FRAME;
 	kingsun->rx_unwrap_buff.skb = NULL;
@@ -651,6 +654,11 @@
 	kingsun->speed_urb = NULL;
 	kingsun->speedparams.baudrate = 0;
 
+	/* Allocate input buffer */
+	kingsun->rx_buf = kmalloc(KINGSUN_RCV_MAX, GFP_KERNEL);
+	if (!kingsun->rx_buf)
+		goto free_mem;
+
 	/* Allocate output buffer */
 	kingsun->tx_buf_clear = kmalloc(KINGSUN_SND_FIFO_SIZE, GFP_KERNEL);
 	if (!kingsun->tx_buf_clear)
@@ -714,6 +722,7 @@
       free_mem:
 	kfree(kingsun->speed_setuprequest);
 	kfree(kingsun->tx_buf_clear);
+	kfree(kingsun->rx_buf);
 	free_netdev(net);
       err_out1:
 	return ret;
@@ -746,6 +755,7 @@
 
 	kfree(kingsun->speed_setuprequest);
 	kfree(kingsun->tx_buf_clear);
+	kfree(kingsun->rx_buf);
 	free_netdev(kingsun->netdev);
 	usb_set_intfdata(intf, NULL);
Re: [PATCH 2/3 Rev4] Initilize and populate age field
From: Varun Chandramohan [EMAIL PROTECTED] Date: Thu, 20 Sep 2007 20:57:51 +0530

@@ -420,6 +421,7 @@ static int fn_hash_insert(struct fib_tab
 	else
 		fa = fib_find_alias(&f->fn_alias, tos, fi->fib_priority);
 
+	do_gettimeofday(&tv);
 	/* Now fa, if non-NULL, points to the first fib alias
 	 * with the same keys [prefix,tos,priority], if such key already
 	 * exists or to the node before which we will insert new one.

gettimeofday() is expensive; we don't even use it to timestamp every incoming packet, and we therefore should not do it for every route cache entry we create, in order to handle DoS situations efficiently. I honestly don't like these patches. I literally cringe every time you post a new revision. It's either going to add new costs to route management or be so inaccurate as to be useless. I question its usefulness even if implemented efficiently and accurately. I really don't see people lining up asking for a route aging metric. We could report zero and be compliant with the RFC; we don't age our route entries the way the model of the SNMP vars seems to suggest, so that's honestly accurate. And this would mean no kernel changes, as the userland program could just report zero in the absence of a kernel-provided value.
Re: SFQ qdisc crashes with limit of 2 packets
From: Alexey Kuznetsov [EMAIL PROTECTED] Date: Fri, 21 Sep 2007 19:55:22 +0400 Hello! Remove the artificial limitation on the sfq queue limit. This is a followup to Patrick's patch. A little optimization to the enqueue routine allows us to remove the artificial limitation on queue length. Plus, testing showed that the hash function used by SFQ is quite bad, or even worse: it does not even sweep the whole range of hash values. Switched to Jenkins' hash. Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED] Applied, thanks Alexey.
Re: [PATCH 1/2] bnx2: factor out gzip unpacker
From: Michael Chan [EMAIL PROTECTED] Date: Fri, 21 Sep 2007 19:47:17 -0700 On Fri, 2007-09-21 at 10:49 -0700, David Miller wrote: From: Denys Vlasenko [EMAIL PROTECTED] Date: Fri, 21 Sep 2007 18:03:55 +0100 Do patches look ok to you? I'm travelling so I haven't looked closely yet :-) Michael can take a look and I'll try to do so as well tonight. I've already reviewed the earlier versions of the patch and have made some suggestions. This latest one looks ok to me and tested ok. I'll follow up later with another patch to remove all the zeros in other firmware sections, and to remove the gzip headers completely. Acked-by: Michael Chan [EMAIL PROTECTED] I've added these patches to net-2.6.24, thanks.
Re: [PATCH][E1000E] some cleanups
jamal wrote: On Sun, 2007-30-09 at 15:23 -0400, Jeff Garzik wrote: Gotta wait a bit, otherwise we have confusion and a bit of breakage from two drivers with the same PCI IDs. ah, ok ;-> When i was testing i compiled out e1000. I am willing to totally migrate to e1000e, since the major machine i have access to has PCIE. Auke, just let me know what you need done other than uncommenting the table entries and i will leave you alone ;-> the IDs are the only thing needed to enable all pci-e e1000 hardware. by all means we need to have guys like you and Jeff test the commented IDs! I've been doing this myself, and the e1000e driver goes to our labs for a period of testing from next week. Unfortunately they don't know how to break it as well as some of you guys ;) I'll personally try to get an 82571EB tested on monday. Auke
Removing DAD in IPv6
Hi all, I am new to this list, so please forgive me if I say anything inappropriate. I am working on a mobility testbed based on IPv6. In order to improve performance, I would like to remove the DAD process from the auto-configuration of an IPv6 address for my testbed. During my experiments, I observed that when an address is configured based on a router advertisement, outgoing packets can be sent out immediately, but packets destined to the local node can only be received after some time (about 1 sec). I found the lost packets are discarded in the kernel and never make it to the transport layer. I suspect this is caused by the DAD process. I would like to ask for help on how to remove or disable the DAD process properly, so that the node can send, receive and forward packets immediately after a new IPv6 address is generated. Any pointer is appreciated. Thanks a lot in advance! Best Regards, Xia Yang
Re: [PATCH 2/3][NET_BATCH] net core use batching
On Sun, 30 Sep 2007, jamal wrote: This patch adds the usage of batching within the core. cheers, jamal

[NET_BATCH] net core use batching

This patch adds the usage of batching within the core. The same test methodology used in introducing txlock is used, with the following results on different kernels:

          +--------+--------+--------+--------+--------+
          |  64B   |  128B  |  256B  |  512B  | 1024B  |
+---------+--------+--------+--------+--------+--------+
| Original| 467482 | 463061 | 388267 | 216308 | 114704 |
| txlock  | 468922 | 464060 | 388298 | 216316 | 114709 |
| tg3nobtx| 468012 | 464079 | 388293 | 216314 | 114704 |
| tg3btxdr| 480794 | 475102 | 388298 | 216316 | 114705 |
| tg3btxco| 481059 | 475423 | 388285 | 216308 | 114706 |
+---------+--------+--------+--------+--------+--------+

The first two rows, Original and txlock, were introduced in an earlier patch and demonstrate a slight increase in performance with txlock. tg3nobtx shows the tg3 driver with no changes to support batching. The purpose of this test is to demonstrate the effect of introducing the core changes to a driver that doesn't support them. Although this patch brings down performance slightly compared to txlock for such netdevices, it is still better than just the original kernel. tg3btxdr demonstrates the effect of using ->hard_batch_xmit() with the tg3 driver. tg3btxco demonstrates the effect of letting the core do all the work. As can be seen, the last two are not very different in performance. The difference is that ->hard_batch_xmit() introduces a new method, which is intrusive.

Have you done performance comparisons for the case of using 9000-byte jumbo frames? -Bill