Re: [PATCH 00/10] Implement batching skb API
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/21/2007 06:48:41 PM:

- Use a single qdisc interface to avoid code duplication and reduce maintenance effort (sch_generic.c size reduces by ~9%).
- Has a per-device configurable parameter to turn batching on/off.
- qdisc_restart gets slightly modified while staying simple, without any checks for batching vs regular code (in fact only two lines have changed: 1. instead of dev_dequeue_skb, a new batch-aware function is called; and 2. an extra call to hard_start_xmit_batch).
- No change in __qdisc_run other than a new argument (from DM's idea).
- Applies to the latest net-2.6.23 compared to 2.6.22-rc4 code.

> All the above are cosmetic differences. To me the highest priority is
> making sure that batching is useful and what the limitations are. At
> some point, when all looks good, i don't mind adding an ethtool
> interface to turn batching off/on, merging with the new qdisc restart
> path instead of having a parallel path, soliciting feedback on naming,
> where to allocate structs, etc. All that is low prio if batching across
> a variety of hardware and applications doesn't prove useful. At the
> moment, i am unsure there's consistency to justify pushing batching in.

Batching need not be useful for every piece of hardware. If there is hardware that can usefully exploit batching (IPoIB is clearly a good candidate, as both the TX and the TX-completion paths can handle multiple-skb processing; I haven't looked at other drivers to see whether any of them can do something similar), then IMHO it makes sense to enable batching for that hardware. It is up to the other drivers to determine whether converting to the batching API makes sense or not. And as indicated, the total size increase for adding the kernel support is also insignificant: 0.03%, or 1164 bytes (using the 'size' command).
Having said that, below are the main architectural differences we have, which is what we really need to discuss to see what proves useful:

- Batching algo/processing is different (e.g. if qdisc_restart() finds one skb in the batch list, it will try to batch more (up to a limit) instead of sending that one out and batching the rest in the next call).

> This sounds a little more aggressive but maybe useful. I have
> experimented with setting upper-bound limits (the current patches have a
> pktgen interface to set the max to send) and have concluded that it is
> unneeded. Probing by letting the driver tell you what space is available
> has proven to be the best approach. I have been meaning to remove the
> code in pktgen which allows these limits.

I don't quite agree with that approach: e.g. if the blist is empty and the driver tells you there is space for one packet, you will add one packet, the driver sends it out, and the device is stopped (with potentially a lot of skbs on the dev queue). Then no packets are added till the queue is re-enabled, at which time a flood of skbs will be processed, increasing latency and holding the lock for a single longer duration. My approach will mitigate holding the lock for longer times and instead send skbs to the device as long as we are within the limits. In fact in my rev2 patch (being sent today or tomorrow, after handling Patrick's and Stephen's comments), I am even removing the driver-specific xmit_slots, as I find it adds bloat and requires more cycles than calculating the value each time xmit is done (of course, in your approach it is required since the stack uses it).

- Jamal's code has a separate hw prep handler called from the stack, and results are accessed in the driver during xmit later.

> I have explained the reasoning to this a few times.
> A recent response to Michael Chan is here:
> http://marc.info/?l=linux-netdev&m=118346921316657&w=2

Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find it), I feel having prep will not help, as no other cpu can execute the queue/xmit code anyway (E1000 is also a LLTX driver). Other drivers that hold the tx lock could get an improvement, however.

> And here's a response to you that i haven't heard back on:
> http://marc.info/?l=linux-netdev&m=118355539503924&w=2

That is because it answered my query :) It is what I was expecting, but thanks for the explanation.

> My tests so far indicate this interface is useful. It doesn't apply well

I wonder if you tried enabling/disabling 'prep' on E1000 to see how the performance is affected. If it helps, I guess you could send me a patch to add that and I can also test it to see what the effect is. I didn't add it since IPoIB wouldn't be able to exploit it (unless someone is kind enough to show me how to).

> So if i was to sum up this (it would be useful discussion to have on
> these), the real difference is:
> a) you have an extra check on refilling the skb list when you find that
>    it has a single skb. I tagged this as being potentially useful.

It is very useful since extra processing is not required for the one-skb case -
Re: Races in net_rx_action vs netpoll?
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 19 Jul 2007 17:27:47 +0100

> Please revisit the requirements that netconsole needs and redesign it
> from scratch. The existing code is causing too much breakage.

Can it be done without breaking the semantics of network devices, or should we rewrite the driver interface to have a different interface like netdev_sync_send_skb() that is slow, synchronous and non-interrupt (i.e. polls for completion)? Of course, then people will complain that netconsole traffic slows the machine down.

I couldn't agree more. Since netpoll runs outside of all of the normal netdevice locking rules, only the people using netpoll hit all the bugs. That means most of us do not test these code paths, which is bad. So, if anything, a more integrated implementation is essential.

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH]: Resurrect napi_poll patch.
From: David Miller [EMAIL PROTECTED]
Date: Sat, 21 Jul 2007 20:54:25 -0700 (PDT)

> Ok, I'll put the napi_struct into the driver private and respin the
> patch.

To facilitate discussion I converted the core and tg3, and include just that part of the patch below.

Consequences:

1) rtnetlink and sysfs control of napi->weight had to be removed; we only have a device to deal with, and dev --> napi is not a one-to-one relationship any longer.

2) netpoll is totally broken; it wants to go from a netdevice to a singular napi_struct, and that simply is not possible any longer. I made the netpoll-disabled case compile for now. Note that netpoll would be broken in this regard even without Rusty's suggestion that we are implementing here; it simply made the issue un-ignorable, which I think is fantastic.

Netpoll definitely needs to be dealt with, but as discussed with Patrick, the rtnetlink and sysfs knob drop isn't critical. At worst, we can at least preserve the sysfs case by creating some kind of napi_create() that makes a sysfs directory and a weight file for a driver, so that the weight can still be set and observed from there.

Rusty, how does it look otherwise?
diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 887b9a5..ee6c69c 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -574,7 +574,7 @@ static void tg3_restart_ints(struct tg3 *tp)
 static inline void tg3_netif_stop(struct tg3 *tp)
 {
 	tp->dev->trans_start = jiffies;	/* prevent tx timeout */
-	netif_poll_disable(tp->dev);
+	napi_disable(&tp->napi);
 	netif_tx_disable(tp->dev);
 }

@@ -585,7 +585,7 @@ static inline void tg3_netif_start(struct tg3 *tp)
 	 * so long as all callers are assured to have free tx slots
 	 * (such as after tg3_init_hw)
 	 */
-	netif_poll_enable(tp->dev);
+	napi_enable(&tp->napi);
 	tp->hw_status->status |= SD_STATUS_UPDATED;
 	tg3_enable_ints(tp);
 }
@@ -3471,11 +3471,12 @@ next_pkt_nopost:
 	return received;
 }

-static int tg3_poll(struct net_device *netdev, int *budget)
+static int tg3_poll(struct napi_struct *napi, int budget)
 {
-	struct tg3 *tp = netdev_priv(netdev);
+	struct tg3 *tp = container_of(napi, struct tg3, napi);
+	struct net_device *netdev = tp->dev;
 	struct tg3_hw_status *sblk = tp->hw_status;
-	int done;
+	int work_done = 0;

 	/* handle link change and other phy events */
 	if (!(tp->tg3_flags &
@@ -3494,7 +3495,7 @@ static int tg3_poll(struct net_device *netdev, int *budget)
 	if (sblk->idx[0].tx_consumer != tp->tx_cons) {
 		tg3_tx(tp);
 		if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING)) {
-			netif_rx_complete(netdev);
+			netif_rx_complete(netdev, &tp->napi);
 			schedule_work(&tp->reset_task);
 			return 0;
 		}
@@ -3502,20 +3503,10 @@ static int tg3_poll(struct net_device *netdev, int *budget)
 	/* run RX thread, within the bounds set by NAPI.
 	 * All RX locking is done by ensuring outside
-	 * code synchronizes with dev->poll()
+	 * code synchronizes with tg3->napi.poll()
 	 */
-	if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr) {
-		int orig_budget = *budget;
-		int work_done;
-
-		if (orig_budget > netdev->quota)
-			orig_budget = netdev->quota;
-
-		work_done = tg3_rx(tp, orig_budget);
-
-		*budget -= work_done;
-		netdev->quota -= work_done;
-	}
+	if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr)
+		work_done = tg3_rx(tp, budget);

 	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
 		tp->last_tag = sblk->status_tag;
@@ -3524,13 +3515,12 @@ static int tg3_poll(struct net_device *netdev, int *budget)
 		sblk->status &= ~SD_STATUS_UPDATED;

 	/* if no more work, tell net stack and NIC we're done */
-	done = !tg3_has_work(tp);
-	if (done) {
-		netif_rx_complete(netdev);
+	if (!tg3_has_work(tp)) {
+		netif_rx_complete(netdev, &tp->napi);
 		tg3_restart_ints(tp);
 	}

-	return (done ? 0 : 1);
+	return work_done;
 }

 static void tg3_irq_quiesce(struct tg3 *tp)
@@ -3577,7 +3567,7 @@ static irqreturn_t tg3_msi_1shot(int irq, void *dev_id)
 	prefetch(&tp->rx_rcb[tp->rx_rcb_ptr]);

 	if (likely(!tg3_irq_sync(tp)))
-		netif_rx_schedule(dev);		/* schedule NAPI poll */
+		netif_rx_schedule(dev, &tp->napi);

 	return IRQ_HANDLED;
 }
@@ -3602,7 +3592,7 @@ static irqreturn_t tg3_msi(int irq, void *dev_id)
 	 */
 	tw32_mailbox(MAILBOX_INTERRUPT_0 + TG3_64BIT_REG_LOW, 0x0001);

 	if (likely(!tg3_irq_sync(tp)))
-		netif_rx_schedule(dev);		/* schedule NAPI poll */
+		netif_rx_schedule(dev, &tp->napi);
Re: [PATCH]: Resurrect napi_poll patch.
From: David Miller [EMAIL PROTECTED]
Date: Sun, 22 Jul 2007 00:18:18 -0700 (PDT)

> To facilitate discussion I converted the core and tg3, and include
> just that part of the patch below.

And for those for whom it makes it easier to play around with these changes, here is the updated e1000 conversion below.

Making changes to e1000 is like visiting a crime scene. :-/

diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 16a6edf..781ed99 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -300,6 +300,7 @@ struct e1000_adapter {
 					  int cleaned_count);
 	struct e1000_rx_ring *rx_ring;      /* One per active queue */
 #ifdef CONFIG_E1000_NAPI
+	struct napi_struct napi;
 	struct net_device *polling_netdev;  /* One per active queue */
 #endif
 	int num_tx_queues;
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index f48b659..13c877c 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -162,7 +162,7 @@ static irqreturn_t e1000_intr_msi(int irq, void *data);
 static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter,
 				    struct e1000_tx_ring *tx_ring);
 #ifdef CONFIG_E1000_NAPI
-static int e1000_clean(struct net_device *poll_dev, int *budget);
+static int e1000_clean(struct napi_struct *napi, int budget);
 static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter,
 				    struct e1000_rx_ring *rx_ring,
 				    int *work_done, int work_to_do);
@@ -541,7 +541,7 @@ int e1000_up(struct e1000_adapter *adapter)
 	clear_bit(__E1000_DOWN, &adapter->flags);

 #ifdef CONFIG_E1000_NAPI
-	netif_poll_enable(adapter->netdev);
+	napi_enable(&adapter->napi);
 #endif
 	e1000_irq_enable(adapter);
@@ -630,7 +630,7 @@ e1000_down(struct e1000_adapter *adapter)
 	set_bit(__E1000_DOWN, &adapter->flags);

 #ifdef CONFIG_E1000_NAPI
-	netif_poll_disable(netdev);
+	napi_disable(&adapter->napi);
 #endif
 	e1000_irq_disable(adapter);
@@ -932,8 +932,8 @@ e1000_probe(struct pci_dev *pdev,
 	netdev->tx_timeout = e1000_tx_timeout;
 	netdev->watchdog_timeo = 5 * HZ;
 #ifdef CONFIG_E1000_NAPI
-	netdev->poll = e1000_clean;
-	netdev->weight = 64;
+	adapter->napi.poll = e1000_clean;
+	adapter->napi.weight = 64;
 #endif
 	netdev->vlan_rx_register = e1000_vlan_rx_register;
 	netdev->vlan_rx_add_vid = e1000_vlan_rx_add_vid;
@@ -1146,7 +1146,7 @@ e1000_probe(struct pci_dev *pdev,
 	netif_carrier_off(netdev);
 	netif_stop_queue(netdev);
 #ifdef CONFIG_E1000_NAPI
-	netif_poll_disable(netdev);
+	napi_disable(&adapter->napi);
 #endif

 	strcpy(netdev->name, "eth%d");
@@ -1319,8 +1319,6 @@ e1000_sw_init(struct e1000_adapter *adapter)
 #ifdef CONFIG_E1000_NAPI
 	for (i = 0; i < adapter->num_rx_queues; i++) {
 		adapter->polling_netdev[i].priv = adapter;
-		adapter->polling_netdev[i].poll = e1000_clean;
-		adapter->polling_netdev[i].weight = 64;
 		dev_hold(&adapter->polling_netdev[i]);
 		set_bit(__LINK_STATE_START, &adapter->polling_netdev[i].state);
 	}
@@ -1437,7 +1435,7 @@ e1000_open(struct net_device *netdev)
 	clear_bit(__E1000_DOWN, &adapter->flags);

 #ifdef CONFIG_E1000_NAPI
-	netif_poll_enable(netdev);
+	napi_enable(&adapter->napi);
 #endif
 	e1000_irq_enable(adapter);
@@ -1476,6 +1474,10 @@ e1000_close(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);

+#ifdef CONFIG_E1000_NAPI
+	napi_disable(&adapter->napi);
+#endif
+
 	WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
 	e1000_down(adapter);
 	e1000_power_down_phy(adapter);
@@ -3780,12 +3782,12 @@ e1000_intr_msi(int irq, void *data)
 	}
 #ifdef CONFIG_E1000_NAPI
-	if (likely(netif_rx_schedule_prep(netdev))) {
+	if (likely(netif_rx_schedule_prep(netdev, &adapter->napi))) {
 		adapter->total_tx_bytes = 0;
 		adapter->total_tx_packets = 0;
 		adapter->total_rx_bytes = 0;
 		adapter->total_rx_packets = 0;
-		__netif_rx_schedule(netdev);
+		__netif_rx_schedule(netdev, &adapter->napi);
 	} else
 		e1000_irq_enable(adapter);
 #else
@@ -3865,12 +3867,12 @@ e1000_intr(int irq, void *data)
 		E1000_WRITE_REG(hw, IMC, ~0);
 		E1000_WRITE_FLUSH(hw);
 	}
-	if (likely(netif_rx_schedule_prep(netdev))) {
+	if (likely(netif_rx_schedule_prep(netdev, &adapter->napi))) {
 		adapter->total_tx_bytes = 0;
 		adapter->total_tx_packets = 0;
 		adapter->total_rx_bytes = 0;
 		adapter->total_rx_packets = 0;
-		__netif_rx_schedule(netdev);
+		__netif_rx_schedule(netdev, &adapter->napi);
Re: [PATCH]: Resurrect napi_poll patch.
On Sun, 2007-07-22 at 00:18 -0700, David Miller wrote:
> Rusty, how does it look otherwise?

I like it. For a start, the simplification of the NAPI API was long overdue, and the damage done by separating the napi_struct is really minimal. Overall the tg3 driver just looks a little nicer now, and that's sweet. Unfortunately my complete ignorance of netpoll prevents me from making sensible comment there.

This seems to have slipped in tho:

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -220,7 +220,8 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
  *	Device drivers call our routines to queue packets here. We empty the
  *	queue in the local softnet handler.
  */
-DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL };
+
+DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL, };

 #ifdef CONFIG_SYSFS
 extern int netdev_sysfs_init(void);

ISTR that no one is using the buggy compilers which required per-cpu initializations now, so this can simply be dropped. (The new scheduler code doesn't initialize per-cpu data, so if this is a problem it should be noticed.)

Thanks!
Rusty.
Re: [PATCH]: Resurrect napi_poll patch.
From: Rusty Russell [EMAIL PROTECTED]
Date: Sun, 22 Jul 2007 17:51:52 +1000

> On Sun, 2007-07-22 at 00:18 -0700, David Miller wrote:
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -220,7 +220,8 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
> >  *	Device drivers call our routines to queue packets here. We empty the
> >  *	queue in the local softnet handler.
> >  */
> > -DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL };
> > +
> > +DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL, };
> >
> >  #ifdef CONFIG_SYSFS
> >  extern int netdev_sysfs_init(void);
>
> ISTR that no one is using the buggy compilers which required per-cpu
> initializations now, so this can simply be dropped. (The new scheduler
> code doesn't initialize per-cpu data, so if this is a problem it should
> be noticed.)

Good catch, and I've removed it from my tree; this will show up in subsequent patch revisions.

Thanks Rusty.
[PATCH 00/12 -Rev2] Implement batching skb API
This set of patches implements the batching API, and makes the following changes resulting from the review of the first set:

Changes:

1. Changed skb_blist from a pointer to static allocation, as the pointer saves only 12 bytes (i386) but bloats the code.
2. Removed the requirement for the driver to set the NETIF_F_BATCH_SKBS feature in register_netdev to enable batching, as it is redundant. Changed this flag to NETIF_F_BATCH_ON; it is set by register_netdev, and other user-changeable calls can modify this bit to enable/disable batching.
3. Added ethtool support to enable/disable batching (not tested).
4. Added rtnetlink support to enable/disable batching (not tested).
5. Removed MIN_QUEUE_LEN_BATCH for batching, as high-performance drivers should not have a small queue anyway (adding bloat).
6. skbs are purged from dev_deactivate instead of from unregister_netdev, to drop all references to the device.
7. Removed the changelog in the source code in sch_generic.c, and unrelated renames from sch_generic.c (lockless, comments).
8. Removed xmit_slots entirely, as it was adding bloat (code and header) and not adding value (it is calculated and set twice, in the internal send routine and in work-completion handling, and referenced once in batch xmit; it can instead be calculated once in xmit).

Issues:

1. Remove sysfs support completely?
2. Is rtnetlink support required, given that GSO has only ethtool?

Patches are described as:

Mail  0/12 : This mail.
Mail  1/12 : HOWTO documentation.
Mail  2/12 : Changes to netdevice.h
Mail  3/12 : dev.c changes.
Mail  4/12 : Ethtool changes.
Mail  5/12 : sysfs changes.
Mail  6/12 : rtnetlink changes.
Mail  7/12 : Change in qdisc_run & qdisc_restart API, modify callers to use this API.
Mail  8/12 : IPoIB include file changes.
Mail  9/12 : IPoIB verbs changes.
Mail 10/12 : IPoIB multicast, CM changes.
Mail 11/12 : IPoIB xmit API addition.
Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c).

I have started a 10-run test for various buffer sizes and process counts, and will post the results on Monday.
Please review and provide feedback/ideas; and consider for inclusion.

Thanks,

- KK
[PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB.
diff -ruNp org/Documentation/networking/Batching_skb_API.txt rev2/Documentation/networking/Batching_skb_API.txt
--- org/Documentation/networking/Batching_skb_API.txt	1970-01-01 05:30:00.0 +0530
+++ rev2/Documentation/networking/Batching_skb_API.txt	2007-07-20 16:09:45.0 +0530
@@ -0,0 +1,91 @@
+		HOWTO for batching skb API support
+		----------------------------------
+
+Section 1: What is batching skb API ?
+Section 2: How batching API works vs the original API ?
+Section 3: How drivers can support this API ?
+Section 4: How users can work with this API ?
+
+
+Introduction: Kernel support for batching skb
+---------------------------------------------
+
+An extended API is supported in the netdevice layer, which is very similar
+to the existing hard_start_xmit() API. Drivers which wish to take advantage
+of this new API should implement this routine similar to how the
+hard_start_xmit handler is written. The difference between these APIs is
+that while the existing hard_start_xmit processes one skb, the new API can
+process multiple skbs (or even one) in a single call. It is also possible
+for the driver writer to re-use most of the code from the existing API in
+the new API without having code duplication.
+
+
+Section 1: What is batching skb API ?
+-------------------------------------
+
+	This is a new API that is optionally exported by a driver. The pre-
+	requisite for a driver to use this API is that it should have a
+	reasonably sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching API works vs the original API ?
+-------------------------------------------------------
+
+	The networking stack normally gets called from upper layer protocols
+	with a single skb to xmit. This skb is first enqueue'd and an
+	attempt is next made to transmit it immediately (via qdisc_run).
+	However, events like driver lock contention, queue stopped, etc, can
+	result in the skb not getting sent out, and it remains in the queue.
+	When a new xmit is called or when the queue is re-enabled, qdisc_run
+	could potentially find multiple packets in the queue, and have to
+	send them all out one by one iteratively.
+
+	The batching skb API was added to exploit this situation, where
+	if there are multiple skbs, all of them can be sent to the device in
+	one shot. This reduces driver processing, locking at the driver (or
+	in the stack for ~LLTX drivers) gets amortized over multiple skbs,
+	and in the case of specific drivers where every xmit results in
+	completion processing (like IPoIB), optimizations could be made in
+	the driver to get a completion for only the last skb that was sent,
+	which will result in saving interrupts for every (but the last) skb
+	that was sent in the same batch.
+
+	This batching can result in significant performance gains for
+	systems that have multiple data stream paths over the same network
+	interface card.
+
+
+Section 3: How drivers can support this API ?
+---------------------------------------------
+
+	The new API - dev->hard_start_xmit_batch(struct net_device *dev),
+	simplistically, can be written almost identically to the regular
+	xmit API (hard_start_xmit), except that all skbs on dev->skb_blist
+	should be processed by the driver instead of just one skb. The new
+	API doesn't get any skb as an argument to process; instead it picks
+	up all the skbs from dev->skb_blist, where they were added by the
+	stack, and tries to send them out.
+
+	Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
+	dev->features, and dev->hard_start_xmit_batch should point to the
+	new API implemented for that driver.
+
+
+Section 4: How users can work with this API ?
+---------------------------------------------
+
+	Batching could be disabled for a particular device, e.g. on desktop
+	systems if only one stream of network activity for that device is
+	taking place, since performance could be slightly affected due to
+	the extra processing that batching adds. Batching can be enabled if
+	more than one stream of network activity per device is being done,
+	e.g. on servers, or even desktop usage with multiple browser, chat,
+	file transfer sessions, etc.
+
+	Per device batching can be enabled/disabled using:
+
+		echo 1 > /sys/class/net/<device-name>/tx_batch_skbs (enable)
+		echo 0 > /sys/class/net/<device-name>/tx_batch_skbs (disable)
+
+	E.g. to enable batching on eth0, run:
+		echo 1 > /sys/class/net/eth0/tx_batch_skbs
[PATCH 02/12 -Rev2] Changes to netdevice.h
diff -ruNp org/include/linux/netdevice.h rev2/include/linux/netdevice.h
--- org/include/linux/netdevice.h	2007-07-20 07:49:28.0 +0530
+++ rev2/include/linux/netdevice.h	2007-07-22 13:20:16.0 +0530
@@ -340,6 +340,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BATCH_ON	8192	/* Batching skbs xmit API is enabled */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */

 /* Segmentation offload features */
@@ -452,6 +453,7 @@ struct net_device
 	struct Qdisc		*qdisc_sleeping;
 	struct list_head	qdisc_list;
 	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
+	struct sk_buff_head	skb_blist;	/* List of batch skbs */

 	/* Partially transmitted GSO packet. */
 	struct sk_buff		*gso_skb;
@@ -472,6 +474,9 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
@@ -582,6 +587,8 @@ struct net_device
 #define	NETDEV_ALIGN		32
 #define	NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)

+#define BATCHING_ON(dev)	((dev->features & NETIF_F_BATCH_ON) != 0)
+
 static inline void *netdev_priv(const struct net_device *dev)
 {
 	return dev->priv;
@@ -832,6 +839,8 @@ extern int	dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);

 extern void		dev_init(void);
@@ -1104,6 +1113,8 @@ extern void	dev_set_promiscuity(struct net_device *dev, int inc);
 extern void		dev_set_allmulti(struct net_device *dev, int inc);
 extern void		netdev_state_change(struct net_device *dev);
 extern void		netdev_features_change(struct net_device *dev);
+extern int		dev_change_tx_batching(struct net_device *dev,
+					       unsigned long new_batch_skb);
 /* Load a device via the kmod */
 extern void		dev_load(const char *name);
 extern void		dev_mcast_init(void);
[PATCH 03/12 -Rev2] dev.c changes.
diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
--- org/net/core/dev.c	2007-07-20 07:49:28.0 +0530
+++ rev2/net/core/dev.c	2007-07-21 23:08:33.0 +0530
@@ -875,6 +875,48 @@ void netdev_state_change(struct net_device *dev)
 	}
 }

+/*
+ * dev_change_tx_batching - Enable or disable batching for a driver that
+ * supports batching.
+ */
+int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
+{
+	int ret;
+
+	if (!dev->hard_start_xmit_batch) {
+		/* Driver doesn't support skb batching */
+		ret = -ENOTSUPP;
+		goto out;
+	}
+
+	/* Handle invalid argument */
+	if (new_batch_skb < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = 0;
+
+	/* Check if new value is same as the current */
+	if (!!(dev->features & NETIF_F_BATCH_ON) == !!new_batch_skb)
+		goto out;
+
+	spin_lock(&dev->queue_lock);
+	if (new_batch_skb) {
+		dev->features |= NETIF_F_BATCH_ON;
+		dev->tx_queue_len >>= 1;
+	} else {
+		if (!skb_queue_empty(&dev->skb_blist))
+			skb_queue_purge(&dev->skb_blist);
+		dev->features &= ~NETIF_F_BATCH_ON;
+		dev->tx_queue_len <<= 1;
+	}
+	spin_unlock(&dev->queue_lock);
+
+out:
+	return ret;
+}
+
 /**
  *	dev_load	- load a network module
  *	@name: name of interface
@@ -1414,6 +1456,45 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }

+/*
+ * Add skb (skbs in case segmentation is required) to dev->skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree(skb);
+			return 0;
+		}
+
+		if (skb->next) {
+			int count = 0;
+
+			do {
+				struct sk_buff *nskb = skb->next;
+
+				skb->next = nskb->next;
+				__skb_queue_tail(&dev->skb_blist, nskb);
+				count++;
+			} while (skb->next);
+
+			skb->destructor = DEV_GSO_CB(skb)->destructor;
+			kfree_skb(skb);
+			return count;
+		}
+	}
+	__skb_queue_tail(&dev->skb_blist, skb);
+	return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (likely(!skb->next)) {
@@ -3397,6 +3483,12 @@ int register_netdevice(struct net_device *dev)
 		}
 	}

+	if (dev->hard_start_xmit_batch) {
+		dev->features |= NETIF_F_BATCH_ON;
+		skb_queue_head_init(&dev->skb_blist);
+		dev->tx_queue_len >>= 1;
+	}
+
 	/*
 	 *	nil rebuild_header routine,
 	 *	that should be never called and used as just bug trap.
[PATCH 04/12 -Rev2] Ethtool changes
diff -ruNp org/include/linux/ethtool.h rev2/include/linux/ethtool.h
--- org/include/linux/ethtool.h	2007-07-21 13:39:50.0 +0530
+++ rev2/include/linux/ethtool.h	2007-07-21 13:40:57.0 +0530
@@ -414,6 +414,8 @@ struct ethtool_ops {
 #define ETHTOOL_SUFO	0x0022	/* Set UFO enable (ethtool_value) */
 #define ETHTOOL_GGSO	0x0023	/* Get GSO enable (ethtool_value) */
 #define ETHTOOL_SGSO	0x0024	/* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GBTX	0x0025	/* Get Batching (ethtool_value) */
+#define ETHTOOL_SBTX	0x0026	/* Set Batching (ethtool_value) */

 /* compatibility with older code */
 #define SPARC_ETH_GSET	ETHTOOL_GSET
diff -ruNp org/net/core/ethtool.c rev2/net/core/ethtool.c
--- org/net/core/ethtool.c	2007-07-21 13:37:17.0 +0530
+++ rev2/net/core/ethtool.c	2007-07-21 22:55:38.0 +0530
@@ -648,6 +648,26 @@ static int ethtool_set_gso(struct net_device *dev, char __user *useraddr)
 	return 0;
 }

+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata = { ETHTOOL_GBTX };
+
+	edata.data = BATCHING_ON(dev);
+	if (copy_to_user(useraddr, &edata, sizeof(edata)))
+		return -EFAULT;
+	return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+	struct ethtool_value edata;
+
+	if (copy_from_user(&edata, useraddr, sizeof(edata)))
+		return -EFAULT;
+
+	return dev_change_tx_batching(dev, edata.data);
+}
+
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
 	struct ethtool_test test;
@@ -959,6 +979,12 @@ int dev_ethtool(struct ifreq *ifr)
 	case ETHTOOL_SGSO:
 		rc = ethtool_set_gso(dev, useraddr);
 		break;
+	case ETHTOOL_GBTX:
+		rc = ethtool_get_batch(dev, useraddr);
+		break;
+	case ETHTOOL_SBTX:
+		rc = ethtool_set_batch(dev, useraddr);
+		break;
 	default:
 		rc = -EOPNOTSUPP;
 	}
[PATCH 05/12 -Rev2] sysfs changes.
diff -ruNp org/net/core/net-sysfs.c rev2/net/core/net-sysfs.c
--- org/net/core/net-sysfs.c	2007-07-20 07:49:28.0 +0530
+++ rev2/net/core/net-sysfs.c	2007-07-21 22:56:32.0 +0530
@@ -230,6 +230,21 @@ static ssize_t store_weight(struct device *dev,
 	return netdev_store(dev, attr, buf, len, change_weight);
 }

+static ssize_t show_tx_batch_skb(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct net_device *netdev = to_net_dev(dev);
+
+	return sprintf(buf, fmt_dec, BATCHING_ON(netdev));
+}
+
+static ssize_t store_tx_batch_skb(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t len)
+{
+	return netdev_store(dev, attr, buf, len, dev_change_tx_batching);
+}
+
 static struct device_attribute net_class_attributes[] = {
 	__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
 	__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -246,6 +261,8 @@ static struct device_attribute net_class_attributes[] = {
 	__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
 	__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
 	       store_tx_queue_len),
+	__ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skb,
+	       store_tx_batch_skb),
 	__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
 	{}
 };
[PATCH 06/12 -Rev2] rtnetlink changes.
diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
--- org/include/linux/if_link.h	2007-07-20 16:33:35.0 +0530
+++ rev2/include/linux/if_link.h	2007-07-20 16:35:08.0 +0530
@@ -78,6 +78,8 @@ enum
 	IFLA_LINKMODE,
 	IFLA_LINKINFO,
 #define IFLA_LINKINFO IFLA_LINKINFO
+	IFLA_TXBTHSKB,		/* Driver support for Batch'd skbs */
+#define IFLA_TXBTHSKB IFLA_TXBTHSKB
 	__IFLA_MAX
 };
diff -ruNp org/net/core/rtnetlink.c rev2/net/core/rtnetlink.c
--- org/net/core/rtnetlink.c	2007-07-20 16:31:59.0 +0530
+++ rev2/net/core/rtnetlink.c	2007-07-21 22:27:10.0 +0530
@@ -634,6 +634,7 @@ static int rtnl_fill_ifinfo(struct sk_bu
 	NLA_PUT_STRING(skb, IFLA_IFNAME, dev->name);
 	NLA_PUT_U32(skb, IFLA_TXQLEN, dev->tx_queue_len);
+	NLA_PUT_U32(skb, IFLA_TXBTHSKB, BATCHING_ON(dev));
 	NLA_PUT_U32(skb, IFLA_WEIGHT, dev->weight);
 	NLA_PUT_U8(skb, IFLA_OPERSTATE,
 		   netif_running(dev) ? dev->operstate : IF_OPER_DOWN);
@@ -833,7 +834,8 @@ static int do_setlink(struct net_device
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
-
+	if (tb[IFLA_TXBTHSKB])
+		dev_change_tx_batching(dev, nla_get_u32(tb[IFLA_TXBTHSKB]));
 	if (tb[IFLA_WEIGHT])
 		dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
@@ -1072,6 +1074,9 @@ replay:
 			       nla_len(tb[IFLA_BROADCAST]));
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+	if (tb[IFLA_TXBTHSKB])
+		dev_change_tx_batching(dev,
+				       nla_get_u32(tb[IFLA_TXBTHSKB]));
 	if (tb[IFLA_WEIGHT])
 		dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
 	if (tb[IFLA_OPERSTATE])
[PATCH 07/12 -Rev2] Change qdisc_run & qdisc_restart API, callers
diff -ruNp org/include/net/pkt_sched.h rev2/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h	2007-07-20 07:49:28.0 +0530
+++ rev2/include/net/pkt_sched.h	2007-07-20 16:09:45.0 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
 		struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/net/sched/sch_generic.c rev2/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c	2007-07-20 07:49:28.0 +0530
+++ rev2/net/sched/sch_generic.c	2007-07-22 12:11:10.0 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
 				  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 
 	netif_schedule(dev);
 	return 0;
@@ -91,18 +93,23 @@ static inline int handle_dev_cpu_collisi
 		/*
 		 * Same CPU holding the lock. It may be a transient
 		 * configuration error, when hard_start_xmit() recurses. We
-		 * detect it by checking xmit owner and drop the packet when
-		 * deadloop is detected. Return OK to try the next skb.
+		 * detect it by checking xmit owner and drop skb (or all
+		 * skbs in batching case) when deadloop is detected. Return
+		 * OK to try the next skb.
 		 */
-		kfree_skb(skb);
+		if (likely(skb))
+			kfree_skb(skb);
+		else if (!skb_queue_empty(dev->skb_blist))
+			skb_queue_purge(dev->skb_blist);
+
 		if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, fix it urgently!\n",
 			       dev->name);
 		ret = qdisc_qlen(q);
 	} else {
 		/*
-		 * Another cpu is holding lock, requeue delay xmits for
-		 * some time.
+		 * Another cpu is holding lock. Requeue skb and delay xmits
+		 * for some time.
 		 */
 		__get_cpu_var(netdev_rx_stat).cpu_collision++;
 		ret = dev_requeue_skb(skb, dev, q);
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
 }
 
 /*
+ * Algorithm to get skb(s) is:
+ *	- Non batching drivers, or if the batch list is empty and there is
+ *	  atmost one skb in the queue - dequeue skb and put it in *skbp to
+ *	  tell the caller to use the single xmit API.
+ *	- Batching drivers where the batch list already contains atleast one
+ *	  skb or if there are multiple skbs in the queue: keep dequeue'ing
+ *	  skb's upto a limit and set *skbp to NULL to tell the caller to use
+ *	  the multiple xmit API.
+ *
+ * Returns:
+ *	1 - atleast one skb is to be sent out, *skbp contains skb or NULL
+ *	    (in case > 1 skbs are present in blist for batching)
+ *	0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+			  struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have atleast one skb in blist */
+	}
+}
+
+/*
  * NOTE: Called under dev->queue_lock with locally disabled BH.
  *
  * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
  *	0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+				struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct
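The dequeue policy in get_skb() above can be modeled outside the kernel. The sketch below is a hypothetical userspace model of that decision, not the kernel code: the names (`xmit_path`, `get_skb_model`) and the plain counters standing in for skb queues are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical userspace model of the get_skb() policy in the patch:
 * choose between the single-skb xmit path and the batching path, and
 * bound how many skbs move to the batch list. Illustrative names only. */
enum xmit_path { XMIT_NONE, XMIT_SINGLE, XMIT_BATCH };

struct model {
	int batching;     /* driver supports the batch API (blist != NULL) */
	int qdisc_len;    /* skbs queued in the qdisc */
	int blist_len;    /* skbs already on dev->skb_blist */
	int tx_queue_len; /* upper bound used by the patch */
};

/* Non-batching drivers, or an empty blist with at most one queued skb,
 * use the single xmit API; otherwise drain up to
 * (tx_queue_len - blist_len) skbs into the batch list. */
static enum xmit_path get_skb_model(struct model *m, int *moved)
{
	*moved = 0;
	if (!m->batching || (m->blist_len == 0 && m->qdisc_len <= 1)) {
		if (m->qdisc_len == 0)
			return XMIT_NONE;
		m->qdisc_len--;		/* *skbp = dev_dequeue_skb() */
		return XMIT_SINGLE;
	}
	int max = m->tx_queue_len - m->blist_len;
	while (max > 0 && m->qdisc_len > 0) {	/* dev_add_skb_to_blist() */
		m->qdisc_len--;
		m->blist_len++;
		(*moved)++;
		max--;
	}
	return XMIT_BATCH;	/* *skbp = NULL, caller uses batch xmit */
}
```

This makes the trade-off in the surrounding discussion concrete: the batch size is bounded by `tx_queue_len - blist_len` rather than by a fixed knob.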
[PATCH 09/12 -Rev2] IPoIB verbs changes
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c	2007-07-20 16:09:45.0 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
 			.max_send_sge = 1,
 			.max_recv_sge = 1
 		},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
 		.qp_type     = IB_QPT_UD
 	};
-
-	int ret, size;
+	struct ib_send_wr *next_wr = NULL;
+	int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey	= priv->mr->lkey;
-
-	priv->tx_wr.opcode	= IB_WR_SEND;
-	priv->tx_wr.sg_list	= &priv->tx_sge;
-	priv->tx_wr.num_sge	= 1;
-	priv->tx_wr.send_flags	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey	= priv->mr->lkey;
+		priv->tx_wr[i].opcode	= IB_WR_SEND;
+		priv->tx_wr[i].sg_list	= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge	= 1;
+		priv->tx_wr[i].send_flags = 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next	= next_wr;
+		next_wr = &priv->tx_wr[i];
+	}
 
 	return 0;
[PATCH 10/12 -Rev2] IPoIB multicast, CM changes
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2007-07-20 16:09:45.0 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
 			    unsigned int wr_id,
 			    u64 addr, int len)
 {
+	int ret;
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr	= addr;
-	priv->tx_sge.length	= len;
+	priv->tx_sge[0].addr	= addr;
+	priv->tx_sge[0].length	= len;
+
+	priv->tx_wr[0].wr_id	= wr_id;
 
-	priv->tx_wr.wr_id	= wr_id;
+	priv->tx_wr[0].next	= NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next	= &priv->tx_wr[1];
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2007-07-20 16:09:45.0 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
 			}
 		}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
 	}
 
 unlock:
[PATCH 11/12 -Rev2] IPoIB xmit API addition
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-20 07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c	2007-07-22 00:08:37.0 +0530
@@ -242,8 +242,9 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i = 0, num_completions;
+	int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 	unsigned int wr_id = wc->wr_id;
-	struct ipoib_tx_buf *tx_req;
 	unsigned long flags;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
@@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	num_completions = wr_id - tx_ring_index + 1;
+	if (num_completions <= 0)
+		num_completions += ipoib_sendq_size;
 
-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+	/*
+	 * Handle skbs completion from tx_tail to wr_id. It is possible to
+	 * handle WC's from earlier post_sends (possible multiple) in this
+	 * iteration as we move from tx_tail to wr_id, since if the last
+	 * WR (which is the one which had a completion request) failed to be
+	 * sent for any of those earlier request(s), no completion
+	 * notification is generated for successful WR's of those earlier
+	 * request(s).
+	 */
+	while (1) {
+		/*
+		 * Could use while (i < num_completions), but it is costly
+		 * since in most cases there is 1 completion, and we end up
+		 * doing an extra index = (index+1) & (ipoib_sendq_size-1)
+		 */
+		struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index];
+
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);
 
-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;
 
-	dev_kfree_skb_any(tx_req->skb);
+			dev_kfree_skb_any(tx_req->skb);
+		}
+		/*
+		 * else this skb failed synchronously when posted and was
+		 * freed immediately.
+		 */
+
+		if (++i == num_completions)
+			break;
+
+		/* More WC's to handle */
+		tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1);
+	}
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+
+	priv->tx_tail += num_completions;
+
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		netif_wake_queue(dev);
 	}
+
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS
@@ -340,78 +375,178 @@ void ipoib_ib_completion(struct ib_cq *c
 	netif_rx_schedule(dev_ptr);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-			    unsigned int wr_id,
-			    struct ib_ah *address, u32 qpn,
-			    u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, 'start_index' is the first slot in
+ * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a
+ * previous post_send returned error and we are trying to send the untried
+ * WR's, in which case start_index will point to the first untried WR.
+ *
+ * We also break the WR link before posting so that the driver knows how
+ * many WR's to process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+			    int start_index, int num_skbs,
+			    struct ib_send_wr **bad_wr)
 {
-	struct ib_send_wr *bad_wr;
+	int ret;
+	struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[start_index + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;
 
-	priv->tx_sge.addr	= addr;
-	priv->tx_sge.length	= len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;
 
-	priv->tx_wr.wr_id	= wr_id;
-	priv->tx_wr.wr.ud.remote_qpn = qpn;
-	priv->tx_wr.wr.ud.ah	= address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);
 
-	return
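The completion-coalescing arithmetic in ipoib_ib_handle_tx_wc() above (one signalled WC covering a run of unsignalled WRs on a power-of-two ring) can be checked in isolation. This is a hypothetical userspace restatement of that calculation, with illustrative names; `sendq_size` is assumed to be a power of two, as `ipoib_sendq_size` is.

```c
#include <assert.h>

/* Sketch of the arithmetic in the patch's ipoib_ib_handle_tx_wc():
 * given the current tail and the ring slot of the signalled WC (wr_id),
 * how many WRs does this one completion retire? Handles wraparound on a
 * power-of-two ring. Illustrative model, not the driver code. */
static int coalesced_completions(unsigned tx_tail, unsigned wr_id,
				 unsigned sendq_size)
{
	unsigned tail_index = tx_tail & (sendq_size - 1);
	int num = (int)wr_id - (int)tail_index + 1;

	if (num <= 0)		/* wr_id wrapped past the end of the ring */
		num += sendq_size;
	return num;
}
```

For example, with a ring of 64 slots, a tail at slot 62 and a signalled WC at slot 1 retires the four WRs in slots 62, 63, 0 and 1 at once.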
[PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-20 07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c	2007-07-22 00:08:28.0 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
 				goto err_drop;
 			}
 		} else
-			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+			ipoib_send(dev, skb, path->ah,
+				   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 	} else {
 		neigh->ah  = NULL;
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
			  be16_to_cpu(path->pathrec.dlid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
 			goto out;
 		}
 
-		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+		ipoib_send(dev, skb, neigh->ah,
+			   IPOIB_QPN(skb->dst->neighbour->ha), 1);
 		goto out;
 	}
@@ -753,6 +755,175 @@ out:
 	return NETDEV_TX_OK;
 }
 
+#define	XMIT_QUEUED_SKBS()						\
+	do {								\
+		if (num_skbs) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   num_skbs);				\
+			num_skbs = 0;					\
+		}							\
+	} while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit to use the same code and have a
+ * transparent wrapper caller to xmit's, etc.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	struct sk_buff_head *blist;
+	int max_skbs, num_skbs = 0, tx_ring_index = -1;
+	u32 qpn, old_qpn = 0;
+	struct ipoib_neigh *neigh, *old_neigh = NULL;
+	unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+		return NETDEV_TX_LOCKED;
+
+	blist = dev->skb_blist;
+
+	/*
+	 * Send atmost 'max_skbs' skbs. This also prevents the device getting
+	 * full.
+	 */
+	max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+	while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+		/*
+		 * From here on, ipoib_send() cannot stop the queue as it
+		 * uses the same initialization as 'max_skbs'. So we can
+		 * optimize to not check for queue stopped for every skb.
+		 */
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+				XMIT_QUEUED_SKBS();
+				ipoib_path_lookup(skb, dev);
+				continue;
+			}
+
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+			if (ipoib_cm_get(neigh)) {
+				if (ipoib_cm_up(neigh)) {
+					XMIT_QUEUED_SKBS();
+					ipoib_cm_send(dev, skb,
+						      ipoib_cm_get(neigh));
+					continue;
+				}
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(&neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+					/*
+					 * It's safe to call ipoib_put_ah()
+					 * inside priv->lock here, because we
+					 * know that path->ah will always hold
+					 * one more reference, so ipoib_put_ah()
+					 * will never do more than decrement
+					 * the ref count.
+
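The XMIT_QUEUED_SKBS() pattern above, where consecutive skbs to the same neighbour are accumulated and flushed as one multi-WR send when the destination changes, can be sketched as a small model. This is an illustrative userspace restatement, not the driver code; `count_flushes` and the integer destination IDs are assumptions standing in for neighbours.

```c
#include <assert.h>

/* Model of the flush-on-destination-change batching in
 * ipoib_start_xmit_frames(): consecutive skbs to one neighbour are queued
 * as linked WRs and sent with a single ipoib_send() call; a new destination
 * (or any slow path such as a path lookup) flushes the pending run first.
 * Returns how many ipoib_send() calls (doorbells) the batch needed. */
static int count_flushes(const int *dest, int n)
{
	int flushes = 0, num_skbs = 0, old_dest = -1;

	for (int i = 0; i < n; i++) {
		if (dest[i] != old_dest && num_skbs) {
			flushes++;	/* XMIT_QUEUED_SKBS() */
			num_skbs = 0;
		}
		old_dest = dest[i];
		num_skbs++;		/* link one more WR for this skb */
	}
	if (num_skbs)
		flushes++;		/* final flush after the loop */
	return flushes;
}
```

The point of the design: a burst of skbs to a single peer, the common case for bulk transfers over IPoIB, collapses into one doorbell instead of one per packet.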
[PATCH 08/12 -Rev2] IPoIB include file changes.
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h rev2/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib.h	2007-07-20 16:09:45.0 +0530
@@ -269,8 +269,8 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned	     tx_head;
 	unsigned	     tx_tail;
-	struct ib_sge	     tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge	     *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
 	struct ib_wc	     ibwc[IPOIB_NUM_WC];
@@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+		      struct ipoib_dev_priv *priv, int snum, int tx_index,
+		      struct ipoib_ah *address, u32 qpn);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-		struct ipoib_ah *address, u32 qpn);
+		struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 void ipoib_flush_paths(struct net_device *dev);
Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
+	/*
+	 * Handle skbs completion from tx_tail to wr_id. It is possible to
+	 * handle WC's from earlier post_sends (possible multiple) in this
+	 * iteration as we move from tx_tail to wr_id, since if the last
+	 * WR (which is the one which had a completion request) failed to be
+	 * sent for any of those earlier request(s), no completion
+	 * notification is generated for successful WR's of those earlier
+	 * request(s).
+	 */

AFAIK a signalled WR will always generate a completion. What am I missing?

+			/*
+			 * Better error handling can be done here, like free
+			 * all untried skbs if err == -ENOMEM. However at this
+			 * time, we re-try all the skbs, all of which will
+			 * likely fail anyway (unless device finished sending
+			 * some out in the meantime). This is not a regression
+			 * since the earlier code is not doing this either.
+			 */

Are you retrying posting skbs? Why is this a good idea? AFAIK, the earlier code did not retry posting WRs at all. The comment seems to imply that post_send fails as a result of SQ overflow - do you see SQ overflow errors in your testing? AFAIK, IPoIB should never overflow the SQ.

-- 
MST
Re: [PATCH] CONFIG_NET=n - lots of link time errors
On Jul 21 2007 19:12, David Miller wrote:
>> Enabling drivers from Devices > Networking (in menuconfig), for example
>> SLIP and/or PLIP, throws link time errors when CONFIG_NET itself is =n.
>> Have CONFIG_NETDEVICES depend on CONFIG_NET.
>>
>> Signed-off-by: Jan Engelhardt [EMAIL PROTECTED]
>
> This is the second time I've seen this change in the past few days, and
> people seem to hit it quite readily with randconfig.
>
> It seems reasonable and I'll apply it, thanks Jan.

While randconfig is nice and good, this error actually came up by manually tuning the .config for a specific goal (namely: make a small kernel for an ancient i386 without modifying any code). Turning off CONFIG_NET was then just an idea away, because I knew that TCP was quite big.

Jan
[NEIGH]: Combine neighbour cleanup and release
Introduces neigh_cleanup_and_release() to be used after a neighbour has been removed from its neighbour table. Serves as preparation to add event notifications.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6/net/core/neighbour.c
===
--- net-2.6.orig/net/core/neighbour.c	2007-07-22 11:41:46.0 +0200
+++ net-2.6/net/core/neighbour.c	2007-07-22 11:42:02.0 +0200
@@ -104,6 +104,14 @@ static int neigh_blackhole(struct sk_buf
 	return -ENETDOWN;
 }
 
+static void neigh_cleanup_and_release(struct neighbour *neigh)
+{
+	if (neigh->parms->neigh_cleanup)
+		neigh->parms->neigh_cleanup(neigh);
+
+	neigh_release(neigh);
+}
+
 /*
  * It is random distribution in the interval (1/2)*base...(3/2)*base.
  * It corresponds to default IPv6 settings and is not overridable,
@@ -140,9 +148,7 @@ static int neigh_forced_gc(struct neigh_
 				n->dead = 1;
 				shrunk	= 1;
 				write_unlock(&n->lock);
-				if (n->parms->neigh_cleanup)
-					n->parms->neigh_cleanup(n);
-				neigh_release(n);
+				neigh_cleanup_and_release(n);
 				continue;
 			}
 			write_unlock(&n->lock);
@@ -213,9 +219,7 @@ static void neigh_flush_dev(struct neigh
 				NEIGH_PRINTK2("neigh %p is stray.\n", n);
 			}
 			write_unlock(&n->lock);
-			if (n->parms->neigh_cleanup)
-				n->parms->neigh_cleanup(n);
-			neigh_release(n);
+			neigh_cleanup_and_release(n);
 		}
 	}
 }
@@ -676,9 +680,7 @@ static void neigh_periodic_timer(unsigne
 			*np = n->next;
 			n->dead = 1;
 			write_unlock(&n->lock);
-			if (n->parms->neigh_cleanup)
-				n->parms->neigh_cleanup(n);
-			neigh_release(n);
+			neigh_cleanup_and_release(n);
 			continue;
 		}
 		write_unlock(&n->lock);
@@ -2094,11 +2096,8 @@ void __neigh_for_each_release(struct nei
 			} else
 				np = &n->next;
 			write_unlock(&n->lock);
-			if (release) {
-				if (n->parms->neigh_cleanup)
-					n->parms->neigh_cleanup(n);
-				neigh_release(n);
-			}
+			if (release)
+				neigh_cleanup_and_release(n);
 		}
 	}
 }
[NEIGH]: Netlink notifications
Currently neighbour event notifications are limited to update notifications and only sent if the ARP daemon is enabled. This patch extends the existing notification code by also reporting neighbours being removed due to gc or administratively, and removes the dependency on the ARP daemon. This allows keeping track of neighbour states without periodically fetching the complete neighbour table.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6/net/core/neighbour.c
===
--- net-2.6.orig/net/core/neighbour.c	2007-07-22 11:42:02.0 +0200
+++ net-2.6/net/core/neighbour.c	2007-07-22 11:49:15.0 +0200
@@ -54,9 +54,8 @@
 #define PNEIGH_HASHMASK		0xF
 
 static void neigh_timer_handler(unsigned long arg);
-#ifdef CONFIG_ARPD
-static void neigh_app_notify(struct neighbour *n);
-#endif
+static void __neigh_notify(struct neighbour *n, int type, int flags);
+static void neigh_update_notify(struct neighbour *neigh);
 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
@@ -109,6 +108,7 @@ static void neigh_cleanup_and_release(st
 	if (neigh->parms->neigh_cleanup)
 		neigh->parms->neigh_cleanup(neigh);
 
+	__neigh_notify(neigh, RTM_DELNEIGH, 0);
 	neigh_release(neigh);
 }
 
@@ -829,13 +829,10 @@ static void neigh_timer_handler(unsigned
 out:
 	write_unlock(&neigh->lock);
 }
+
 	if (notify)
-		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
+		neigh_update_notify(neigh);
 
-#ifdef CONFIG_ARPD
-	if (notify && neigh->parms->app_probes)
-		neigh_app_notify(neigh);
-#endif
 	neigh_release(neigh);
 }
@@ -1064,11 +1061,8 @@ out:
 	write_unlock_bh(&neigh->lock);
 
 	if (notify)
-		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
-#ifdef CONFIG_ARPD
-	if (notify && neigh->parms->app_probes)
-		neigh_app_notify(neigh);
-#endif
+		neigh_update_notify(neigh);
+
 	return err;
 }
@@ -2001,6 +1995,11 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+static void neigh_update_notify(struct neighbour *neigh)
+{
+	call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
+	__neigh_notify(neigh, RTM_NEWNEIGH, 0);
+}
 
 static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
 			    struct netlink_callback *cb)
@@ -2420,7 +2419,6 @@ static const struct file_operations neig
 #endif /* CONFIG_PROC_FS */
 
-#ifdef CONFIG_ARPD
 static inline size_t neigh_nlmsg_size(void)
 {
 	return NLMSG_ALIGN(sizeof(struct ndmsg))
@@ -2452,16 +2450,11 @@ errout:
 		rtnl_set_sk_err(RTNLGRP_NEIGH, err);
 }
 
+#ifdef CONFIG_ARPD
 void neigh_app_ns(struct neighbour *n)
 {
 	__neigh_notify(n, RTM_GETNEIGH, NLM_F_REQUEST);
 }
-
-static void neigh_app_notify(struct neighbour *n)
-{
-	__neigh_notify(n, RTM_NEWNEIGH, 0);
-}
-
 #endif /* CONFIG_ARPD */
 
 #ifdef CONFIG_SYSCTL
[RTNETLINK]: Fix warning if !CONFIG_KMOD
replay label is unused otherwise.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6/net/core/rtnetlink.c
===
--- net-2.6.orig/net/core/rtnetlink.c	2007-07-22 11:41:46.0 +0200
+++ net-2.6/net/core/rtnetlink.c	2007-07-22 12:04:27.0 +0200
@@ -952,7 +952,9 @@ static int rtnl_newlink(struct sk_buff *
 	struct nlattr *linkinfo[IFLA_INFO_MAX+1];
 	int err;
 
+#ifdef CONFIG_KMOD
 replay:
+#endif
 	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
 	if (err < 0)
 		return err;
Re: tg3 issues
patric wrote:
> Hi,
>
> Think I got something working, for me at least, and the fix is quite
> minimal; the only downside I could see is that you might get a small
> delay when bringing up the interface, but that's probably better than
> getting a non-functional interface that reports that it's up.
>
> The fix seems to be quite simple, with just a random sleep at the end of
> tg3_setup_fiber_by_hand():
>
> 		tw32_f(MAC_MODE, tp->mac_mode);
> 		udelay(40);
> 	}
>
> out:
> 	udelay(net_random() % 400);
> 	return current_link_up;
> }
>
> Not sure that this is a good fix or if it might break on other systems,
> but maybe you could have a quick look at that?
>
> Regards,
> Patric

Last update on this. udelay(100 + net_random() % 300) seems to work much better, and I have not had a single problem getting the link up within 10 seconds of a cold or warm boot. Most often the link comes up directly without any delay, unlike before, when it could hang for 30 seconds before getting a link, if you got a link at all.

/Patric
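The effect of the proposed fix is just a bounded random delay so that both ends of the fiber link stop restarting autonegotiation in lockstep. A minimal userspace sketch of the second variant's bound, using `rand()` to stand in for the kernel's `net_random()` (an assumption for illustration):

```c
#include <stdlib.h>

/* Model of the delay picked by udelay(100 + net_random() % 300):
 * always in the range [100, 399] microseconds, so the two link partners
 * desynchronize without adding a noticeable bring-up cost. */
static unsigned pick_delay_us(void)
{
	return 100 + ((unsigned)rand() % 300);
}
```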
Re: [PATCH 00/10] Implement batching skb API
KK,

On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote:
> Batching need not be useful for every hardware.

My concern is there is no consistency in results. I see improvements on something which you say you don't. You see improvement in something that Evgeniy doesn't, etc. There are many knobs and we need, at a minimum, to find out how those play.

> I don't quite agree with that approach, eg, if the blist is empty and
> the driver tells there is space for one packet, you will add one packet
> and the driver sends it out and the device is stopped (with potentially
> a lot of skbs on dev->q). Then no packets are added till the queue is
> enabled, at which time a flood of skbs will be processed, increasing
> latency and holding the lock for a single longer duration. My approach
> will mitigate holding the lock for longer times and instead send skbs
> to the device as long as we are within the limits.

Just as a side note, _I do have this feature_ in the pktgen piece. In fact, you can tell pktgen what that bound is, as opposed to hard coding it (look at the pktgen batchl parameter). I have not found it to be useful experimentally; actually, I should say I could not zone in on a useful value by experimenting, and it was better to turn it off. I never tried adding it to the qdisc path - but this is something I could try, and as I said it may prove useful.

> Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't
> find it), I feel having prep will not help as no other cpu can execute
> the queue/xmit code anyway (E1000 is also a LLTX driver).

My experiments show it is useful (in a very visible way, using pktgen) for e1000 to have the prep() interface. Other drivers that hold the tx lock could see improvement as well. So you do see the value then with non-LLTX drivers, right? ;-> The value is also there in LLTX drivers, even if just in formatting an skb ready for transmit. If this is not clear I could do a much longer writeup on my thought evolution towards adding prep().
> I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
> performance is affected.

Absolutely. And regardless of whether it's beneficial or not for e1000, there's clear benefit in the tg3, for example.

> If it helps, I guess you could send me a patch to add that and I can
> also test it to see what the effect is. I didn't add it since IPoIB
> wouldn't be able to exploit it (unless someone is kind enough to show
> me how to).

Such core code should not just be focussed on IPoIB.

> I think the code I have is ready and stable,

I am not sure how to interpret that - are you saying all-is-good and we should just push your code in? It sounds disingenuous but I may have misread you.

cheers,
jamal
Re: checksum of lsrr
Hello,

From: Ken-ichirou MATSUZAWA [EMAIL PROTECTED]
Subject: checksum of lsrr
Date: Wed, 04 Jul 2007 22:28:24 +0900 (JST)
> if we change destination address, we need recompute not only ip
> checksum but also tcp's, don't we?

I was ashamed of myself for being much confused. My simple story is:

    rt1 - rt2 - mybox

rt1 is far from rt2, and when I changed the bandwidth between rt1 and rt2, I wanted to confirm whether it was changed correctly. An idea that occurred to me was using lsrr. Then I tried with nc, but it did not work. To accept a packet from rt2 to mybox whose src and dst address are both mybox, I changed:

--- net/ipv4/fib_frontend.c.orig	2007-07-23 00:01:17.0 +0900
+++ net/ipv4/fib_frontend.c	2007-07-23 00:03:11.0 +0900
@@ -206,7 +206,7 @@
 	if (fib_lookup(&fl, &res))
 		goto last_resort;
-	if (res.type != RTN_UNICAST)
+	if (res.type != RTN_UNICAST && likely(res.type != RTN_LOCAL))
 		goto e_inval_res;
 	*spec_dst = FIB_RES_PREFSRC(res);
 	fib_combine_itag(itag, &res);

Would you tell me if this is acceptable? I'm afraid of a security issue.

thanks,
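The quoted point is why rewriting the destination address under LSRR invalidates the TCP checksum too: TCP's checksum covers a pseudo-header that includes the destination address, so both checksums must be recomputed (or incrementally updated). A minimal RFC 1071-style one's-complement sum, shown here only to demonstrate that changing one 16-bit word of the covered data changes the checksum; `csum16` is an illustrative helper, not kernel code:

```c
#include <stdint.h>
#include <stddef.h>

/* RFC 1071 one's-complement checksum over 16-bit words. Changing any
 * covered word (e.g. half of the destination address in the TCP
 * pseudo-header) yields a different checksum, so a rewritten daddr
 * requires recomputing the TCP checksum as well as the IP one. */
static uint16_t csum16(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++) {
		sum += words[i];
		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries */
	}
	return (uint16_t)~sum;
}
```

In-kernel code would normally use the incremental update of RFC 1624 instead of a full recompute, but the invariant is the same.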
Re: Question: how to detect if a qdisc is root or not?
Waskiewicz Jr, Peter P wrote:
>> I don't think I understand. What's the problem with setting sch->parent
>> on initialization instead of on grafting, as I did in my example patch?
>> Please explain the problems arising on unload in detail.
>
> sch->parent is getting set on initialization, and for the root and
> ingress qdiscs, it's left at zero. If I change that value, when the root
> qdisc is unloaded and pfifo_fast is put back into place, the
> qdisc_destroy() walks the tree and attempts to free memory from the
> handle pointed at by sch->parent.

First of all, qdisc destruction never propagates up, only down the tree. Secondly, neither qdisc_destroy nor pfifo nor prio even look at sch->parent. So this is completely wrong; the only place where sch->parent is used for walking through the tree is qdisc_tree_decrease_qlen.

> It stops when sch->parent is NULL,

Where are you getting this? sch->parent is an *integer*.

> so sch->parent is actually being set as intended. The only thing that
> confused me is that nowhere in the qdisc is TC_H_ROOT included
> explicitly; rather, the root qdisc is where sch->parent is NULL. So I
> misunderstood what was actually wrong. The qdisc code is ok as-is, it's
> just that the top-level qdiscs (root and ingress) have a sch->parent of
> NULL, which is being set correctly today. Hope that clarifies.

Not at all :)
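The point that `sch->parent` is an integer, not a pointer, follows from how tc classids are encoded: a 32-bit handle packing a 16-bit major (qdisc) and 16-bit minor (class). These are the `TC_H_*` macros from the kernel's `<linux/pkt_sched.h>`, reproduced here so the encoding can be checked standalone:

```c
#include <stdint.h>

/* Classid encoding as in <linux/pkt_sched.h>: high 16 bits are the qdisc
 * major handle, low 16 bits the class minor. TC_H_ROOT is a sentinel
 * value, not a pointer; "parent == 0" just means "no parent recorded". */
#define TC_H_MAJ_MASK (0xFFFF0000U)
#define TC_H_MIN_MASK (0x0000FFFFU)
#define TC_H_MAJ(h)   ((h) & TC_H_MAJ_MASK)
#define TC_H_MIN(h)   ((h) & TC_H_MIN_MASK)
#define TC_H_MAKE(maj, min) (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))
#define TC_H_ROOT     (0xFFFFFFFFU)
```

So a check like `sch->parent != NULL` compares a handle against zero; the root is distinguished by the `TC_H_ROOT` sentinel (or by a zero parent handle), which is exactly the confusion in the thread.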
Re: [PATCH] NET: Fix sch_prio to detect the root qdisc loading
PJ Waskiewicz wrote: The sch->parent handle will be NULL for the scheduler that is TC_H_ROOT. Change this check in prio_tune() so that only the root qdisc can be multiqueue-enabled. Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED] ---
 net/sched/sch_prio.c | 6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index 2d8c084..271051e 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -239,11 +239,13 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
 	/* If we're multiqueue, make sure the number of incoming bands
 	 * matches the number of queues on the device we're associating with.
 	 * If the number of bands requested is zero, then set q->bands to
-	 * dev->egress_subqueue_count.
+	 * dev->egress_subqueue_count. Also, the root qdisc must be the
+	 * only one that is enabled for multiqueue, since it's the only one
+	 * that interacts with the underlying device.
 	 */
 	q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
 	if (q->mq) {
-		if (sch->handle != TC_H_ROOT)
+		if (sch->parent != NULL)

As explained in my last mail, sch->parent is an integer. And it is set when grafting the qdisc, not on initialization, so it is always 0 here when coming from prio_init. This untested patch should make sure it's always set correctly.
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 13c09bc..0975547 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -380,6 +380,10 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n)
 		return;
 	while ((parentid = sch->parent)) {
 		sch = qdisc_lookup(sch->dev, TC_H_MAJ(parentid));
+		if (sch == NULL) {
+			WARN_ON(parentid != TC_H_ROOT);
+			return;
+		}
 		cops = sch->ops->cl_ops;
 		if (cops->qlen_notify) {
 			cl = cops->get(sch, parentid);
@@ -420,8 +424,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 			unsigned long cl = cops->get(parent, classid);
 			if (cl) {
 				err = cops->graft(parent, cl, new, old);
-				if (new)
-					new->parent = classid;
 				cops->put(parent, cl);
 			}
 		}
@@ -436,7 +438,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
  */

 static struct Qdisc *
-qdisc_create(struct net_device *dev, u32 handle, struct rtattr **tca, int *errp)
+qdisc_create(struct net_device *dev, u32 parent, u32 handle,
+	     struct rtattr **tca, int *errp)
 {
 	int err;
 	struct rtattr *kind = tca[TCA_KIND-1];
@@ -482,6 +485,8 @@ qdisc_create(struct net_device *dev, u32 handle, struct rtattr **tca, int *errp)
 		goto err_out2;
 	}

+	new->parent = parent;
+
 	if (handle == TC_H_INGRESS) {
 		sch->flags |= TCQ_F_INGRESS;
 		sch->stats_lock = &dev->ingress_lock;
@@ -758,9 +763,11 @@ create_n_graft:
 	if (!(n->nlmsg_flags & NLM_F_CREATE))
 		return -ENOENT;
 	if (clid == TC_H_INGRESS)
-		q = qdisc_create(dev, tcm->tcm_parent, tca, &err);
+		q = qdisc_create(dev, tcm->tcm_parent, tcm->tcm_parent,
+				 tca, &err);
 	else
-		q = qdisc_create(dev, tcm->tcm_handle, tca, &err);
+		q = qdisc_create(dev, tcm->tcm_parent, tcm->tcm_handle,
+				 tca, &err);
 	if (q == NULL) {
 		if (err == -EAGAIN)
 			goto replay;
[PATCH] Fix irlan_eth.c - unused variable 'in_dev' warning
Hi, I got this warning on current git: ... net/irda/irlan/irlan_eth.c: In function 'irlan_eth_send_gratuitous_arp': net/irda/irlan/irlan_eth.c:301: warning: unused variable 'in_dev' ... This is because CONFIG_INET is not set in my config. Signed-off-by: Gabriel Craciunescu [EMAIL PROTECTED] ---

diff --git a/net/irda/irlan/irlan_eth.c b/net/irda/irlan/irlan_eth.c
index c421521..ea0b21a 100644
--- a/net/irda/irlan/irlan_eth.c
+++ b/net/irda/irlan/irlan_eth.c
@@ -298,6 +298,7 @@ void irlan_eth_flow_indication(void *instance, void *sap, LOCAL_FLOW flow)
  */
 void irlan_eth_send_gratuitous_arp(struct net_device *dev)
 {
+#ifdef CONFIG_INET
 	struct in_device *in_dev;

 	/*
@@ -305,7 +306,6 @@ void irlan_eth_send_gratuitous_arp(struct net_device *dev)
 	 * is useful if we have changed access points on the same
 	 * subnet.
 	 */
-#ifdef CONFIG_INET
 	IRDA_DEBUG(4, "IrLAN: Sending gratuitous ARP\n");
 	rcu_read_lock();
 	in_dev = __in_dev_get_rcu(dev);
Re: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.
Krishna Kumar2 wrote: Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 11:46:36 PM: The check for tx_queue_len is wrong though, it's only a default which can be overridden, and some qdiscs don't care for it at all. I think it should not matter whether qdiscs use this or not, or even if it is modified (unless it is made zero, in which case this breaks). The intention behind this check is to make sure that not more than tx_queue_len skbs are in all queues put together (q->qdisc + dev->skb_blist); otherwise the blist can become too large and break the idea of tx_queue_len. Is that a good justification? It's a good justification, but on second thought the entire idea of a single queue after the qdisc that is refilled independently of transmission times etc. makes me worry a bit. By changing the timing you're effectively changing the qdisc's behaviour, at least in some cases. SFQ is a good example, but I believe it affects most work-conserving qdiscs. Think of this situation: 100 packets of flow 1 arrive; 50 packets of flow 1 are sent; 100 packets for flow 2 arrive; the remaining packets are sent. On the wire you'll first see 50 packets of flow 1, then 100 packets alternating between flow 1 and flow 2, then 50 packets of flow 2. With your additional queue, all packets of flow 1 are pulled out of the qdisc immediately and put in the fifo. When the 100 packets of the second flow arrive, they also get pulled out immediately and are put in the fifo behind the remaining 50 packets of flow 1. So what you get on the wire is: 100 packets of flow 1, then 100 packets of flow 2. So SFQ is without any effect. This is not completely avoidable of course, but you can and should limit the damage by only pulling out as many packets as the driver can take and having the driver stop the queue afterwards.
Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
Krishna Kumar wrote:

@@ -472,6 +474,9 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+

Is this function really needed? Can't you just call hard_start_xmit with a NULL skb and have the driver use dev->blist?

 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
@@ -582,6 +587,8 @@ struct net_device
 #define NETDEV_ALIGN		32
 #define NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)

+#define BATCHING_ON(dev)	((dev->features & NETIF_F_BATCH_ON) != 0)
+
 static inline void *netdev_priv(const struct net_device *dev)
 {
 	return dev->priv;
@@ -832,6 +839,8 @@ extern int	dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);

Again, function signatures should be introduced in the same patch that contains the function. Splitting by file doesn't make sense.
Re: [PATCH 06/12 -Rev2] rtnetlink changes.
Krishna Kumar wrote: diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h --- org/include/linux/if_link.h 2007-07-20 16:33:35.0 +0530 +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.0 +0530 @@ -78,6 +78,8 @@ enum IFLA_LINKMODE, IFLA_LINKINFO, #define IFLA_LINKINFO IFLA_LINKINFO + IFLA_TXBTHSKB, /* Driver support for Batch'd skbs */ +#define IFLA_TXBTHSKB IFLA_TXBTHSKB Ughh what a name :) I prefer pronounceable names since they are much easier to remember and don't need comments explaining what they mean. But I actually think offering just an ethtool interface would be better, at least for now.
Re: NFS mount gives ENETDOWN in -git15
On Sat, 2007-07-21 at 15:31 +0200, Andi Kleen wrote: I tried to mount another nfs mount on a system running with nfsroot. But I get # mount basil:/home /basil/home/ mount: Network is down The network is not down of course, the system is happily running with nfs root from that server. Userland is older SUSE 10.0 Does Al's patch help in any way? Cheers Trond ---BeginMessage--- Obviously broken on little-endian; fortunately, the option is not frequently used... Signed-off-by: Al Viro [EMAIL PROTECTED] ---

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index b34b7a7..b2a851c 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -732,7 +732,7 @@ static int nfs_parse_mount_options(char *raw,
 				return 0;
 			if (option < 0 || option > 65535)
 				return 0;
-			mnt->nfs_server.address.sin_port = htonl(option);
+			mnt->nfs_server.address.sin_port = htons(option);
 			break;
 		case Opt_rsize:
 			if (match_int(args, &mnt->rsize))
---End Message---
netlink messages on 'ip l s down' event
hello, I noticed that the ipv4 subsystem sends an RTM_NEWLINK message with new flags on an 'ip l s down' event, but the ipv6 subsystem sends an RTM_DELLINK message. Is this an inconsistency or a feature? If it is an inconsistency, which approach will be used in the future? (tested kernels 2.6.20, 21, 22) Thanks for any reply. regards, -- Milan Kocian [EMAIL PROTECTED]
bug in tcp/ip stack
I tracked down something that appears to be a small bug in the networking code. The way in which I can reproduce it is a complex one, but it works 100%, so here come the details: I noticed strange packets on my fw coming from a mail server with RST/ACK flags set, coming from a source port with no one listening on it and no connection attempts made to it from outside. There are a few messages on forums describing the same problem and calling them alien ACK/RST packets. The Postfix mail server gives this behavior if for some reason the client resets the connection but some packets from the client arrive after the RST; the server box responds with RST and then with RST/ACK (with the wrong source port number). Here is the packet dump:

 1  0.00      10.0.0.254  10.0.0.68   TCP   5 > smtp [SYN] Seq=0 Len=0
 2  0.001036  10.0.0.68   10.0.0.254  TCP   smtp > 5 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
 3  0.001096  10.0.0.254  10.0.0.68   TCP   5 > smtp [ACK] Seq=1 Ack=1 Win=1500 Len=0
 4  0.001125  10.0.0.254  10.0.0.68   SMTP  Command: EHLO localhost
 5  0.001150  10.0.0.254  10.0.0.68   TCP   5 > smtp [RST] Seq=17 Len=0
 6  0.001175  10.0.0.254  10.0.0.68   TCP   5 > smtp [FIN, ACK] Seq=17 Ack=1 Win=1500 Len=0
 7  0.001251  10.0.0.68   10.0.0.254  TCP   smtp > 5 [ACK] Seq=1 Ack=17 Win=5840 Len=0
 8  0.001284  10.0.0.68   10.0.0.254  TCP   smtp > 5 [RST] Seq=1 Len=0
!!! 9  0.218427  10.0.0.68  10.0.0.254  TCP  32768 > 5 [RST, ACK] Seq=0 Ack=0 Win=5840 Len=0

It is not a Postfix bug; it is present in the current 2.6.x and 2.4.x kernel versions but not in the 2.2.x tree. After investigation I found it was introduced in 2.4.0-test9-pre3 back in year 2000 and survived for 7 years, WOW :) The whole 2.4.0-test9-pre3 diff is pretty big, but I managed to find the lines responsible for this; they are located in include/net/tcp.h in function tcp_enter_cwr:

	if (sk->prev && !(sk->userlocks & SOCK_BINDPORT_LOCK))
		tcp_put_port(sk);

It is not a big problem, but under some setups the fw's conntrack table can get filled pretty quickly, because the wrong port number changes every time. Can you please check this out?
Evalds
RE: [PATCH] NET: Fix sch_prio to detect the root qdisc loading
As explained in my last mail, sch->parent is an integer. And it is set when grafting the qdisc, not on initialization, so it is always 0 here when coming from prio_init. This untested patch should make sure it's always set correctly. Yes, I'm using NULL and 0 interchangeably here, since in the sch_api code, qdisc_graft(), sch->parent is referenced using NULL and not 0. I know it's a u32, and the value is getting set to the proper handle when the qdisc is not the root qdisc. When it's the root qdisc, it's left to be 0. This patch was tested as well, but looking at the history now, I didn't set bands correctly, so that's why the multiqueue case failed. My mistake, and I'll keep working on this. Sorry for the extra noise, -PJ
[PATCH] [IrDA] KS959 USB IrDA dongle support
Hi Dave, Last IrDA patch for 2.6.23-rc1, if it's not too late... This patch adds support for the KingSun KS-959 USB IrDA dongle. This dongle does not follow the usb-irda specification, so it needs its own special driver. First, it uses control URBs for data transfer, instead of bulk or interrupt transfers; the only interrupt endpoint exposed seems to be a dummy to prevent the interface from being rejected. Second, it uses obfuscation and padding at the USB traffic level, for no apparent reason other than to make reverse engineering harder (full details on obfuscation in comments at beginning of source). Although it is advertised as a 4 Mbps FIR dongle, it apparently loses packets at speeds greater than 57600 bps. On plugin, this dongle reports vendor and device IDs 0x07d0:0x4959. From: Alex Villacís Lasso [EMAIL PROTECTED] Signed-off-by: Alex Villacís Lasso [EMAIL PROTECTED] Signed-off-by: Samuel Ortiz [EMAIL PROTECTED] ---
 drivers/net/irda/Kconfig     |  14 +
 drivers/net/irda/Makefile    |   1 +
 drivers/net/irda/ks959-sir.c | 919 ++++++++++++++++++++++++
 3 files changed, 934 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/irda/ks959-sir.c

diff --git a/drivers/net/irda/Kconfig b/drivers/net/irda/Kconfig
index 2098d0a..35321dc 100644
--- a/drivers/net/irda/Kconfig
+++ b/drivers/net/irda/Kconfig
@@ -164,6 +164,20 @@ config EP7211_DONGLE

+config KS959_DONGLE
+	tristate "KingSun KS-959 IrDA-USB dongle"
+	depends on IRDA && USB && EXPERIMENTAL
+	help
+	  Say Y or M here if you want to build support for the KingSun KS-959
+	  IrDA-USB bridge device driver.
+
+	  This USB bridge does not conform to the IrDA-USB device class
+	  specification, and therefore needs its own specific driver.
+	  This dongle supports SIR speeds only (9600 through 57600 bps).
+
+	  To compile it as a module, choose M here: the module will be called
+	  ks959-sir.
+ comment Old SIR device drivers config IRPORT_SIR diff --git a/drivers/net/irda/Makefile b/drivers/net/irda/Makefile index 2808ef5..9212c15 100644 --- a/drivers/net/irda/Makefile +++ b/drivers/net/irda/Makefile @@ -47,6 +47,7 @@ obj-$(CONFIG_MA600_DONGLE)+= ma600-sir.o obj-$(CONFIG_TOIM3232_DONGLE) += toim3232-sir.o obj-$(CONFIG_EP7211_DONGLE)+= ep7211-sir.o obj-$(CONFIG_KINGSUN_DONGLE) += kingsun-sir.o +obj-$(CONFIG_KS959_DONGLE) += ks959-sir.o # The SIR helper module sir-dev-objs := sir_dev.o sir_dongle.o diff --git a/drivers/net/irda/ks959-sir.c b/drivers/net/irda/ks959-sir.c new file mode 100644 index 000..12e3435 --- /dev/null +++ b/drivers/net/irda/ks959-sir.c @@ -0,0 +1,919 @@ +/* +* +* Filename: ks959-sir.c +* Version: 0.1.2 +* Description: Irda KingSun KS-959 USB Dongle +* Status:Experimental +* Author:Alex Villacís Lasso [EMAIL PROTECTED] +* with help from Domen Puncer [EMAIL PROTECTED] +* +*Based on stir4200, mcs7780, kingsun-sir drivers. +* +*This program is free software; you can redistribute it and/or modify +*it under the terms of the GNU General Public License as published by +*the Free Software Foundation; either version 2 of the License. +* +*This program is distributed in the hope that it will be useful, +*but WITHOUT ANY WARRANTY; without even the implied warranty of +*MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +*GNU General Public License for more details. +* +*You should have received a copy of the GNU General Public License +*along with this program; if not, write to the Free Software +*Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. +* +*/ + +/* + * Following is my most current (2007-07-17) understanding of how the Kingsun + * KS-959 dongle is supposed to work. This information was deduced by + * reverse-engineering and examining the USB traffic captured with USBSnoopy + * from the WinXP driver. Feel free to update here as more of the dongle is + * known. 
+ * + * My most sincere thanks must go to Domen Puncer [EMAIL PROTECTED] for + * invaluable help in cracking the obfuscation and padding required for this + * dongle. + * + * General: This dongle exposes one interface with one interrupt IN endpoint. + * However, the interrupt endpoint is NOT used at all for this dongle. Instead, + * this dongle uses control transfers for everything, including sending and + * receiving the IrDA frame data. Apparently the interrupt endpoint is just a + * dummy to ensure the dongle has a valid interface to present to the PC. And I + * thought the DonShine dongle was weird... In addition, this dongle uses + * obfuscation (?!?!?), applied at the USB level, to hide the traffic, both sent + * and received, from the dongle. I call it obfuscation because the XOR keying + * and
Re: NFS mount gives ENETDOWN in -git15
On 7/21/07, Andi Kleen [EMAIL PROTECTED] wrote: I tried to mount another nfs mount on a system running with nfsroot. But I get # mount basil:/home /basil/home/ mount: Network is down The network is not down of course, the system is happily running with nfs root from that server. Userland is older SUSE 10.0 Excerpt from strace mount: -Andi What is your CONFIG_NETDEVICES_MULTIQUEUE in .config? YH
drivers/net/xen-netfront.c: bogus code
The Coverity checker spotted the following bogus code in drivers/net/xen-netfront.c:

-- snip --
...
static void xennet_alloc_rx_buffers(struct net_device *dev)
{
...
	for (nr_flips = i = 0; ; i++) {
		skb = __skb_dequeue(&np->rx_batch);
		if (skb == NULL)
			break;
		skb->dev = dev;
		id = xennet_rxidx(req_prod + i);
		BUG_ON(np->rx_skbs[id]);
		np->rx_skbs[id] = skb;
		ref = gnttab_claim_grant_reference(&np->gref_rx_head);
		BUG_ON((signed short)ref < 0);
		np->grant_rx_ref[id] = ref;
		pfn = page_to_pfn(skb_shinfo(skb)->frags[0].page);
		vaddr = page_address(skb_shinfo(skb)->frags[0].page);
		req = RING_GET_REQUEST(&np->rx, req_prod + i);
		gnttab_grant_foreign_access_ref(ref,
						np->xbdev->otherend_id,
						pfn_to_mfn(pfn),
						0);
		req->id = id;
		req->gref = ref;
	}
	if (nr_flips != 0) {
...
-- snip --

Note that nr_flips is always 0 in the last line. cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed
net/9p/mux.c: use-after-free
The Coverity checker spotted the following use-after-free in net/9p/mux.c:

-- snip --
...
struct p9_conn *p9_conn_create(struct p9_transport *trans, int msize,
			       unsigned char *extended)
{
...
	if (!m->tagpool) {
		kfree(m);
		return ERR_PTR(PTR_ERR(m->tagpool));
	}
...
-- snip --

cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed
net/bluetooth/rfcomm/tty.c: use-after-free
Commit 8de0a15483b357d0f0b821330ec84d1660cadc4e added the following use-after-free in net/bluetooth/rfcomm/tty.c:

-- snip --
...
static int rfcomm_dev_add(struct rfcomm_dev_req *req, struct rfcomm_dlc *dlc)
{
...
	if (IS_ERR(dev->tty_dev)) {
		list_del(&dev->list);
		kfree(dev);
		return PTR_ERR(dev->tty_dev);
	}
...
-- snip --

Spotted by the Coverity checker. cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed
Re: [PATCH] [IrDA] KS959 USB IrDA dongle support
From: Samuel Ortiz [EMAIL PROTECTED] Date: Mon, 23 Jul 2007 00:32:17 +0300 Last IrDA patch for 2.6.23-rc1, if it's not too late... Too late, Linus just closed the merge window, you had two weeks to submit this :-)
Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition
Hi Michael, Michael S. Tsirkin [EMAIL PROTECTED] wrote on 07/22/2007 03:11:36 PM:

+	/*
+	 * Handle skbs completion from tx_tail to wr_id. It is possible to
+	 * handle WC's from earlier post_sends (possibly multiple) in this
+	 * iteration as we move from tx_tail to wr_id, since if the last
+	 * WR (which is the one which had a completion request) failed to be
+	 * sent for any of those earlier request(s), no completion
+	 * notification is generated for successful WR's of those earlier
+	 * request(s).
+	 */

AFAIK a signalled WR will always generate a completion. What am I missing? Yes, a signalled WR will generate a completion. I am trying to catch the case where, say, I send 64 skbs and set signalling for only the last skb, with the others set to NO signalling. Now if the driver found the last WR was bad for some reason, it will synchronously fail the send for that WR (which happens to be the only one that is signalled). So after skbs 1 to 63 are finished, there will be no completion called. That was my understanding of how this works, and I coded it that way so that the next post will clean up the previous one's completion.

+	/*
+	 * Better error handling can be done here, like free
+	 * all untried skbs if err == -ENOMEM. However at this
+	 * time, we re-try all the skbs, all of which will
+	 * likely fail anyway (unless device finished sending
+	 * some out in the meantime). This is not a regression
+	 * since the earlier code is not doing this either.
+	 */

Are you retrying posting skbs? Why is this a good idea? AFAIK, earlier code did not retry posting WRs at all. Not exactly. If I send 64 skbs to the device and the provider returned a bad WR at skb #50, then I will have to try skbs #51-64 again, since the provider has not attempted to send those out as it bails out at the first failure. The provider of course has already sent out skbs #1-49 before returning failure at skb #50. So it is not strictly a retry, just xmit of the next skbs, which is what the current code also does.
I tested this part out by simulating errors in mthca_post_send and verified that the next iteration clears up the remaining skbs. The comment seems to imply that post send fails as a result of SQ overflow - Correct. Do you see SQ overflow errors in your testing? No. AFAIK, IPoIB should never overflow the SQ. Correct. It should never happen unless IPoIB has a bug :) I guess the comment should be removed? Thanks, - KK
Re: [PATCH 06/12 -Rev2] rtnetlink changes.
Hi Patrick, Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:40:37 PM: Krishna Kumar wrote: diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h --- org/include/linux/if_link.h 2007-07-20 16:33:35.0 +0530 +++ rev2/include/linux/if_link.h 2007-07-20 16:35:08.0 +0530 @@ -78,6 +78,8 @@ enum IFLA_LINKMODE, IFLA_LINKINFO, #define IFLA_LINKINFO IFLA_LINKINFO + IFLA_TXBTHSKB, /* Driver support for Batch'd skbs */ +#define IFLA_TXBTHSKB IFLA_TXBTHSKB Ughh what a name :) I prefer pronounceable names since they are much easier to remember and don't need comments explaining what they mean. But I actually think offering just an ethtool interface would be better, at least for now. Great, I will remove /sys and rtnetlink and keep the Ethtool i/f. Thanks, - KK
Re: [PATCH 02/12 -Rev2] Changes to netdevice.h
Hi Patrick, Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:36:51 PM: Krishna Kumar wrote:

@@ -472,6 +474,9 @@ struct net_device
 	void			*priv;	/* pointer to private data */
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_start_xmit_batch) (struct net_device
+							  *dev);
+

Is this function really needed? Can't you just call hard_start_xmit with a NULL skb and have the driver use dev->blist? Probably not. I will see how to do it this way and get back to you.

 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
@@ -582,6 +587,8 @@ struct net_device
 #define NETDEV_ALIGN		32
 #define NETDEV_ALIGN_CONST	(NETDEV_ALIGN - 1)

+#define BATCHING_ON(dev)	((dev->features & NETIF_F_BATCH_ON) != 0)
+
 static inline void *netdev_priv(const struct net_device *dev)
 {
 	return dev->priv;
@@ -832,6 +839,8 @@ extern int	dev_set_mac_address(struct n
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
+extern int		dev_add_skb_to_blist(struct sk_buff *skb,
+					     struct net_device *dev);

Again, function signatures should be introduced in the same patch that contains the function. Splitting by file doesn't make sense. Right. I did it for some but missed this. Sorry, will redo. thanks, - KK
Re: [PATCH 00/10] Implement batching skb API
Hi Evgeniy, Evgeniy Polyakov [EMAIL PROTECTED] wrote on 07/20/2007 06:24:23 PM: Hi Krishna. On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar ([EMAIL PROTECTED]) wrote: After fine-tuning qdisc and other changes, I modified IPoIB to use this API, and now get good gains. Summary for TCP No Delay: 1 process improves for all cases from 1.4% to 49.5%; 4 process has almost identical improvements from -1.7% to 59.1%; 16 process case also improves in the range of -1.2% to 33.4%; while 64 process doesn't have much improvement (-3.3% to 12.4%). UDP was tested with 1 process netperf with small increase in BW but big improvement in Service Demand. Netperf latency tests show small drop in transaction rate (results in separate attachment). What about round-robin tcp time and latency test? In theory such batching mode should not change those timings, but practice can show new aspects. The TCP RR results show a slight impact; however, the service demand shows good improvement. The results are (I did TCP RR - 1 process, 1, 8, 32, 128, 512 buffer sizes; and UDP RR - 1 process, 1 byte buffer size):

Results for TCP RR (1 process) ORG code:
Size   R-R        CPU%  S.Demand
1      521346.02  5.48  1346.145
8      129463.14  6.74  418.370
32     128899.73  7.51  467.106
128    127230.15  5.42  340.876
512    119605.68  6.48  435.650

Results for TCP RR (1 process) NEW code (and change%):
Size   R-R                 CPU%  S.Demand
1      516596.62 (-0.91%)  5.74  1423.819 (5.77%)
8      129184.46 (-0.22%)  5.43  336.747 (-19.51%)
32     128238.35 (-0.51%)  5.43  339.213 (-27.38%)
128    126545.79 (-0.54%)  5.36  339.188 (-0.50%)
512    119297.49 (-0.26%)  5.16  346.185 (-20.54%)

Results for UDP RR (1 process) ORG & NEW code:
Code  Size  R-R                CPU%  S.Demand
ORG   1     539327.86          5.68  1348.985
NEW   1     540669.33 (0.25%)  6.05  1434.180 (6.32%)

I will review code later this week (likely tomorrow) and if there will be some issues return back. Thanks! I had just submitted Rev2 on Sunday, please let me know what you find.
Regards, - KK
[PATCH 0/4] Age Entry For IPv4 IPv6 Route Table
Hi, According to RFC 4292 (IP Forwarding Table MIB) there is a need for an age entry for all the routes in the routing table. The entry in the RFC is inetCidrRouteAge and its oid is inetCidrRouteAge.1.10. Many snmp applications require this age entry. So I am adding the age field in the routing table for ipv4 and ipv6 and providing the interface to this value via netlink. Signed-off-by: Varun Chandramohan [EMAIL PROTECTED] ---
[PATCH 1/4] New attribute RTA_AGE
A new attribute RTA_AGE is added for the age value to be exported to userlevel using netlink. Signed-off-by: Varun Chandramohan [EMAIL PROTECTED] ---
 include/linux/rtnetlink.h | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6127858..884f507 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -263,6 +263,7 @@ enum rtattr_type_t
 	RTA_SESSION,
 	RTA_MP_ALGO, /* no longer used */
 	RTA_TABLE,
+	RTA_AGE,
 	__RTA_MAX
 };
-- 
1.4.3.4
[PATCH 2/4] Add new timeval_to_sec function
A new function for converting timeval to time_t is added in time.h. It is a common function used in different places. Signed-off-by: Varun Chandramohan [EMAIL PROTECTED] ---
 include/linux/time.h | 11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index dda9be6..c81baa6 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -147,6 +147,17 @@ static inline s64 timeval_to_ns(const st
 }

 /**
+ * timeval_to_sec - Convert timeval to seconds
+ * @tv: pointer to the timeval variable to be converted
+ *
+ * Returns the seconds representation of timeval parameter.
+ */
+static inline time_t timeval_to_sec(const struct timeval *tv)
+{
+	return (tv->tv_sec + (tv->tv_usec + 500000)/1000000);
+}
+
+/**
  * ns_to_timespec - Convert nanoseconds to timespec
  * @nsec: the nanoseconds value to be converted
  *
-- 
1.4.3.4
[PATCH 3/4] Initilize and populate age field
The age field is filled with the current time at the time of creation of the route. When the routes are dumped, the age value stored in the route structure is subtracted from the current time value, and the difference is the age expressed in secs. Signed-off-by: Varun Chandramohan [EMAIL PROTECTED] ---
 net/ipv4/fib_hash.c      |  3 +++
 net/ipv4/fib_lookup.h    |  3 ++-
 net/ipv4/fib_semantics.c | 16 +++++++++++++---
 net/ipv4/fib_trie.c      |  1 +
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 07e843a..faa7364 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -448,6 +448,7 @@ static int fn_hash_insert(struct fib_tab
 			fa->fa_info = fi;
 			fa->fa_type = cfg->fc_type;
 			fa->fa_scope = cfg->fc_scope;
+			fa->fa_age = 0;
 			state = fa->fa_state;
 			fa->fa_state &= ~FA_S_ACCESSED;
 			fib_hash_genid++;
@@ -507,6 +508,7 @@ static int fn_hash_insert(struct fib_tab
 	new_fa->fa_type = cfg->fc_type;
 	new_fa->fa_scope = cfg->fc_scope;
 	new_fa->fa_state = 0;
+	new_fa->fa_age = 0;
 	/*
 	 * Insert new entry to the list.
@@ -697,6 +699,7 @@ fn_hash_dump_bucket(struct sk_buff *skb,
 				  f->fn_key,
 				  fz->fz_order,
 				  fa->fa_tos,
+				  &fa->fa_age,
 				  fa->fa_info,
 				  NLM_F_MULTI) < 0) {
 			cb->args[4] = i;

diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index eef9eec..c9145b5 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -13,6 +13,7 @@ struct fib_alias {
 	u8	fa_type;
 	u8	fa_scope;
 	u8	fa_state;
+	time_t	fa_age;
 };

 #define FA_S_ACCESSED	0x01
@@ -27,7 +28,7 @@ extern struct fib_info *fib_create_info(
 extern int fib_nh_match(struct fib_config *cfg, struct fib_info *fi);
 extern int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
 			 u32 tb_id, u8 type, u8 scope, __be32 dst,
-			 int dst_len, u8 tos, struct fib_info *fi,
+			 int dst_len, u8 tos, time_t *age, struct fib_info *fi,
 			 unsigned int);
 extern void rtmsg_fib(int event, __be32 key, struct fib_alias *fa,
 		      int dst_len, u32 tb_id, struct nl_info *info,

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c434119..1822d92 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -278,7 +278,8 @@ static inline size_t fib_nlmsg_size(stru
 	       + nla_total_size(4) /* RTA_TABLE */
 	       + nla_total_size(4) /* RTA_DST */
 	       + nla_total_size(4) /* RTA_PRIORITY */
-	       + nla_total_size(4); /* RTA_PREFSRC */
+	       + nla_total_size(4) /* RTA_PREFSRC */
+	       + nla_total_size(4); /* RTA_AGE */

 	/* space for nested metrics */
 	payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
@@ -313,7 +314,7 @@ void rtmsg_fib(int event, __be32 key, st
 	err = fib_dump_info(skb, info->pid, seq, event, tb_id,
 			    fa->fa_type, fa->fa_scope, key, dst_len,
-			    fa->fa_tos, fa->fa_info, nlm_flags);
+			    fa->fa_tos, &fa->fa_age, fa->fa_info, nlm_flags);
 	if (err < 0) {
 		/* -EMSGSIZE implies BUG in fib_nlmsg_size() */
 		WARN_ON(err == -EMSGSIZE);
@@ -940,11 +941,12 @@ __be32 __fib_res_prefsrc(struct fib_resu
 }

 int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
-		  u32 tb_id, u8 type, u8 scope, __be32 dst, int dst_len, u8 tos,
+		  u32 tb_id, u8 type, u8 scope, __be32 dst, int dst_len, u8 tos, time_t *age,
 		  struct fib_info *fi, unsigned int flags)
 {
 	struct nlmsghdr *nlh;
 	struct rtmsg *rtm;
+	struct timeval tv;

 	nlh = nlmsg_put(skb, pid, seq, event, sizeof(*rtm), flags);
 	if (nlh == NULL)
@@ -985,6 +987,14 @@ int fib_dump_info(struct sk_buff *skb, u
 		NLA_PUT_U32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid);
 #endif
 	}
+
+	do_gettimeofday(&tv);
+	if (!*age) {
+		*age = timeval_to_sec(&tv);
+		NLA_PUT_U32(skb, RTA_AGE, *age);
+	} else {
+		NLA_PUT_U32(skb, RTA_AGE, timeval_to_sec(&tv) - *age);
+	}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (fi->fib_nhs > 1) {
 		struct rtnexthop *rtnh;

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 30e332a..be2d1d6
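The dump-side bookkeeping in fib_dump_info() can be sketched on its own (a hypothetical standalone version; time_t values stand in for do_gettimeofday() plus timeval_to_sec()). Note the behavior the patch implements: a stored age of zero means "never dumped", so the first dump both records the current time and emits that raw stamp, while later dumps emit the elapsed seconds.

```c
#include <time.h>

/* Sketch of the RTA_AGE logic in fib_dump_info(): a stored age of 0
 * means the entry has not been dumped yet, so stamp it with the
 * current time and emit the stamp; on later dumps emit the elapsed
 * seconds since that stamp. */
static time_t fib_report_age(time_t *age, time_t now)
{
	if (!*age) {
		*age = now;	/* first dump: record the timestamp */
		return *age;	/* the patch emits the raw stamp here */
	}
	return now - *age;	/* subsequent dumps: age in seconds */
}
```

This mirrors why fn_hash_insert() initializes fa_age to 0 rather than to the creation time: the stamp is filled in lazily on the first dump.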
[PATCH 4/4] Initialize and fill IPv6 route age
The age field of the IPv6 route structures is initialized with the current timeval at the time of route creation. When the route dump is called, the age value stored in the structure is subtracted from the present timeval and the difference is passed on as the route age.

Signed-off-by: Varun Chandramohan [EMAIL PROTECTED]
---
 include/net/ip6_fib.h   |  1 +
 include/net/ip6_route.h |  3 +++
 net/ipv6/addrconf.c     |  5 +
 net/ipv6/route.c        | 23 +++
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c48ea87..e30a1cf 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -98,6 +98,7 @@ struct rt6_info
 	u32				rt6i_flags;
 	u32				rt6i_metric;
+	time_t				rt6i_age;
 	atomic_t			rt6i_ref;

 	struct fib6_table		*rt6i_table;

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 5456fdd..fc9716c 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -36,6 +36,9 @@ struct route_info {
 #define RT6_LOOKUP_F_REACHABLE	0x2
 #define RT6_LOOKUP_F_HAS_SADDR	0x4

+#define RT6_SET_ROUTE_INFO	0x0
+#define RT6_GET_ROUTE_INFO	0x1
+
 extern struct rt6_info	ip6_null_entry;

 #ifdef CONFIG_IPV6_MULTIPLE_TABLES

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5a5f8bd..715c766 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4187,6 +4187,7 @@ EXPORT_SYMBOL(unregister_inet6addr_notif

 int __init addrconf_init(void)
 {
+	struct timeval tv;
 	int err = 0;

 	/* The addrconf netdev notifier requires that loopback_dev
@@ -4214,10 +4215,14 @@ int __init addrconf_init(void)
 	if (err)
 		return err;

+	do_gettimeofday(&tv);
 	ip6_null_entry.rt6i_idev = in6_dev_get(loopback_dev);
+	ip6_null_entry.rt6i_age = timeval_to_sec(&tv);
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
 	ip6_prohibit_entry.rt6i_idev = in6_dev_get(loopback_dev);
+	ip6_prohibit_entry.rt6i_age = timeval_to_sec(&tv);
 	ip6_blk_hole_entry.rt6i_idev = in6_dev_get(loopback_dev);
+	ip6_blk_hole_entry.rt6i_age = timeval_to_sec(&tv);
 #endif
 	register_netdevice_notifier(&ipv6_dev_notf);

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index fe8d983..9386c05 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -600,7 +600,14 @@ static int __ip6_ins_rt(struct rt6_info
 {
 	int err;
 	struct fib6_table *table;
+	struct timeval tv;

+	do_gettimeofday(&tv);
+	/* Update the timeval for new routes
+	 * We add it here to make it common irrespective
+	 * of how the new route is added.
+	 */
+	rt->rt6i_age = timeval_to_sec(&tv);
 	table = rt->rt6i_table;
 	write_lock_bh(&table->tb6_lock);
 	err = fib6_add(&table->tb6_root, rt, info);
@@ -2111,6 +2118,7 @@ static inline size_t rt6_nlmsg_size(void
 	       + nla_total_size(4) /* RTA_IIF */
 	       + nla_total_size(4) /* RTA_OIF */
 	       + nla_total_size(4) /* RTA_PRIORITY */
+	       + nla_total_size(4) /* RTA_AGE */
 	       + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
 	       + nla_total_size(sizeof(struct rta_cacheinfo));
 }
@@ -2118,10 +2126,11 @@ static inline size_t rt6_nlmsg_size(void
 static int rt6_fill_node(struct sk_buff *skb, struct rt6_info *rt,
 			 struct in6_addr *dst, struct in6_addr *src,
 			 int iif, int type, u32 pid, u32 seq,
-			 int prefix, unsigned int flags)
+			 int prefix, unsigned int flags, int dumpflg)
 {
 	struct rtmsg *rtm;
 	struct nlmsghdr *nlh;
+	struct timeval tv;
 	long expires;
 	u32 table;
@@ -2185,6 +2194,12 @@ static int rt6_fill_node(struct sk_buff
 		if (ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf) == 0)
 			NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
 	}
+
+	do_gettimeofday(&tv);
+	if (dumpflg)
+		NLA_PUT_U32(skb, RTA_AGE, timeval_to_sec(&tv) - rt->rt6i_age);
+	else
+		NLA_PUT_U32(skb, RTA_AGE, rt->rt6i_age);

 	if (rtnetlink_put_metrics(skb, rt->u.dst.metrics) < 0)
 		goto nla_put_failure;
@@ -,7 +2237,7 @@ int rt6_dump_route(struct rt6_info *rt,
 	return rt6_fill_node(arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
 		     NETLINK_CB(arg->cb->skb).pid, arg->cb->nlh->nlmsg_seq,
-		     prefix, NLM_F_MULTI);
+		     prefix, NLM_F_MULTI, RT6_GET_ROUTE_INFO);
 }

 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void *arg)
@@ -2287,7 +2302,7 @@ static int inet6_rtm_getroute(struct sk_
 	err = rt6_fill_node(skb, rt, &fl.fl6_dst, &fl.fl6_src, iif, RTM_NEWROUTE,
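The dumpflg handling in rt6_fill_node() reduces to a small helper (a hypothetical standalone sketch using the patch's RT6_SET/GET flags; rt6i_age and the current time are plain time_t values here):

```c
#include <time.h>

#define RT6_SET_ROUTE_INFO 0x0
#define RT6_GET_ROUTE_INFO 0x1

/* Sketch of the RTA_AGE branch in rt6_fill_node(): dump callers pass
 * RT6_GET_ROUTE_INFO and receive the elapsed seconds, while the set
 * path emits the stored creation stamp unchanged. */
static time_t rt6_age_attr(time_t rt6i_age, time_t now, int dumpflg)
{
	if (dumpflg == RT6_GET_ROUTE_INFO)
		return now - rt6i_age;	/* age in seconds */
	return rt6i_age;		/* raw creation stamp */
}
```

Unlike the IPv4 patch, the IPv6 side stamps rt6i_age eagerly in __ip6_ins_rt(), so no lazy first-dump initialization is needed.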
Re: [PATCH 00/10] Implement batching skb API
Hi Jamal, J Hadi Salim [EMAIL PROTECTED] wrote on 07/22/2007 06:21:09 PM: My concern is there is no consistency in results. I see improvements on something which you say don't. You see improvement in something that Evgeniy doesn't, etc. Hmmm? Evgeniy has not even tested my code to find some regression :) And you may possibly not find much improvement in E1000 when you run iperf (which is what I do) compared to pktgen. I can re-run and confirm this since my last E1000 run was quite some time back. My point is that batching not being viable for E1000 (or tg3) need not be the sole criteria for inclusion. If IPoIB or other drivers can take advantage of it and get better results, then batching can be considered. Maybe E1000 too can get improvements if someone with more expertise tries to add this API (not judging your driver writing capabilities - just stating that driver writers will know more knobs to exploit a complex device like E1000). Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find it), I feel having prep will not help, as no other CPU can execute the queue/xmit code anyway (E1000 is also an LLTX driver). My experiments show it is useful (in a very visible way using pktgen) for e1000 to have the prep() interface. I meant: have you compared results of batching with prep on vs prep off, and what is the difference in BW? Other drivers that hold the tx lock could get improvement however. So you do see the value then with non-LLTX drivers, right? ;-> No. I see value only in non-LLTX drivers which also take the same TX lock in the RX path. If different locks are taken by TX/RX, then since you are holding queue_lock before calling 'prep', this excludes other TX from running at the same time. In that case, taking the tx_lock earlier to do the 'prep' will not cause any degradation (since no other tx can run anyway, while rx can run as it gets a different lock). 
The value is also there in LLTX drivers, even if just in formatting an skb ready for transmit. If this is not clear I could do a much longer writeup on my thought evolution towards adding prep(). In LLTX drivers, the driver does the 'prep' without holding the tx_lock in any case, so there should be no improvement. Could you send the write-up, since I really don't see the value in prep unless the driver is non-LLTX *and* TX/RX hold the same TX lock. I think that is the sole criteria, right? If it helps, I guess you could send me a patch to add that and I can also test it to see what the effect is. I didn't add it since IPoIB wouldn't be able to exploit it (unless someone is kind enough to show me how to). Such core code should not just be focused on IPoIB. There is *nothing* IPoIB specific or IPoIB focused in my code. I said adding prep doesn't work for IPoIB, and so it is pointless to add bloat to the code until some code can actually take advantage of this feature (I am sure you will agree). Which is why I also mentioned to please send me a patch if you find it useful for any driver, rather than rejecting this idea. I think the code I have is ready and stable. I am not sure how to interpret that - are you saying all is good and we should just push your code in? I am only too well aware that Dave will not accept any code (having experienced this with Mobile IPv6 a long time back, when he said to move most of it to userspace and he was absolutely correct :). What I meant to say is that there isn't much point in saying that your code is not ready, or you are using an old code base, or it has multiple restart functions, or is not tested enough, etc., and then saying let's re-do/rethink the whole implementation when my code is already working and giving good results. Unless you have some design issues with it, or the code is written badly, is not maintainable, is not Linux style compliant, is buggy, or will not handle some case/workload - those types of issues. 
OTOH, if you find some cases that are better handled with: 1. a prep handler, 2. xmit_win (which I don't have now), then please send me patches and I will also test them out and incorporate them. It sounds disingenuous but I may have misread you. (lacking in frankness, candor, or sincerity; falsely or hypocritically ingenuous; insincere) Sorry, no response to personal comments, else we will have a flame-war :) Thanks, - KK
Re: 2.6.20-2.6.21 - networking dies after random time
Ok, I've bisected this problem and found that this patch broke my NIC:

76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
commit 76d2160147f43f982dfe881404cfde9fd0a9da21
Author: Ingo Molnar [EMAIL PROTECTED]
Date: Fri Feb 16 01:28:24 2007 -0800

    [PATCH] genirq: do not mask interrupts by default

    Never mask interrupts immediately upon request. Disabling interrupts
    in high-performance codepaths is rare, and on the other hand this
    change could recover lost edges (or even other types of lost
    interrupts) by conservatively only masking interrupts after they
    happen. (NOTE: with this change the highlevel irq-disable code still
    soft-disables this IRQ line - and if such an interrupt happens then
    the IRQ flow handler keeps the IRQ masked.)

    Mark i8259A controllers as 'never loses an edge'.

    Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
    Cc: Thomas Gleixner [EMAIL PROTECTED]
    Signed-off-by: Andrew Morton [EMAIL PROTECTED]
    Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=76d2160147f43f982dfe881404cfde9fd0a9da21

After reverting it on top of 2.6.21.3 (together with d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2 - [PATCH] genirq: remove IRQ_DISABLED, which meant removing IRQ_DELAYED_DISABLE), the problem didn't show up :)
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2)

So I cooked up the patch below and everything is working fine (so far).

Fix the default_disable interrupt function (broken by [PATCH] genirq: do not mask interrupts by default) - revert the removal of the codepath which was invoked when the removed flag (IRQ_DELAYED_DISABLE) was NOT set.

Signed-off-by: Marcin Slusarz [EMAIL PROTECTED]
---
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 76a9106..0bb23cd 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
  */
 static void default_disable(unsigned int irq)
 {
+	struct irq_desc *desc = irq_desc + irq;
+	desc->chip->mask(irq);
 }

 /*

(Sorry for whitespace damage, but I have to send it from webmail :|)
(I'm a kernel noob, so don't kill me if my patch is wrong ;)

ps: Here is the beginning of this thread: http://lkml.org/lkml/2007/6/16/182

Regards,
Marcin Slusarz