Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread Krishna Kumar2
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/21/2007 06:48:41 PM:

 - Use a single qdisc interface to avoid code duplication and improve
   maintainability (sch_generic.c size reduces by ~9%).
 - Has a per-device configurable parameter to turn batching on/off.
 - qdisc_restart gets slightly modified while looking simple, without
   any checks for batching vs regular code (in fact only two lines have
   changed - 1. instead of dev_dequeue_skb, a new batch-aware function
   is called; and 2. an extra call to hard_start_xmit_batch).

 - No change in __qdisc_run other than a new argument (from DM's idea).
 - Applies to latest net-2.6.23 compared to 2.6.22-rc4 code.

 All the above are cosmetic differences. To me the highest priority
 is making sure that batching is useful and what the limitations are.
 At some point, when all looks good - i don't mind adding an ethtool
 interface to turn batching on/off, merging with the new qdisc restart
 path instead of having a parallel path, soliciting feedback on naming,
 where to allocate structs, etc. All that is low priority if batching
 across a variety of hardware and applications doesn't prove useful. At
 the moment, i am unsure there's enough consistency to justify pushing
 batching in.

Batching need not be useful for every piece of hardware. If there is
hardware that can usefully exploit batching (IPoIB is clearly a good
candidate, as both the TX and the TX completion paths can handle
multiple-skb processing; I haven't looked at other drivers to see if
any of them can do something similar), then IMHO it makes sense to
enable batching for that hardware. It is up to the other drivers to
determine whether converting to the batching API makes sense or not.
And as indicated, the total size increase for adding the kernel support
is also insignificant - 0.03%, or 1164 bytes (using the 'size' command).

 Having said that, below are the main architectural differences we have,
 which is what we really need to discuss and see what proves useful:

  - Batching algo/processing is different (e.g. if qdisc_restart()
    finds one skb in the batch list, it will try to batch more (up to
    a limit) instead of sending that out and batching the rest in the
    next call).

 This sounds a little more aggressive but maybe useful.
 I have experimented with setting upper bound limits (current patches
 have a pktgen interface to set the max to send) and have concluded that
 it is unneeded. Probing by letting the driver tell you what space is
 available has proven to be the best approach. I have been meaning to
 remove the code in pktgen which allows these limits.

I don't quite agree with that approach. E.g., if the blist is empty and
the driver says there is space for one packet, you will add one packet,
the driver sends it out, and the device is stopped (with potentially a
lot of skbs on dev->q). Then no packets are added till the queue is
re-enabled, at which time a flood of skbs will be processed, increasing
latency and holding the lock for a single longer duration. My approach
will mitigate holding the lock for longer times and instead send skbs
to the device as long as we are within the limits.

In fact in my rev2 patch (being sent today or tomorrow, after handling
Patrick's and Stephen's comments), I am even removing the driver-specific
xmit_slots, as I find it adds bloat and requires more cycles than
calculating the value each time xmit is done (of course in your approach
it is required since the stack uses it).

 - Jamal's code has a separate hw prep handler called from the stack,
   and results are accessed in driver during xmit later.

 I have explained the reasoning to this a few times. A recent response to
 Michael Chan is here:
 http://marc.info/?l=linux-netdev&m=118346921316657&w=2

Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't
find it), I feel having prep will not help, as no other cpu can execute
the queue/xmit code anyway (E1000 is also a LLTX driver). Other drivers
that hold the tx lock could see an improvement, however.

 And here's a response to you that i haven't heard back on:
 http://marc.info/?l=linux-netdev&m=118355539503924&w=2

That is because it answered my query :) It is what I was expecting, but
thanks for the explanation.

 My tests so far indicate this interface is useful. It doesn't apply well

I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
performance is affected. If it helps, I guess you could send me a patch to
add that and I can also test it to see what the effect is. I didn't add it
since IPoIB wouldn't be able to exploit it (unless someone is kind enough
to show me how to).

 So if i was to sum up this (it would be useful to have a discussion on
 these), the real difference is:

 a) you have an extra check on refilling the skb list when you find that
 it has a single skb. I tagged this as being potentially useful.

It is very useful since extra processing is not required for the one-skb
case -

Re: Races in net_rx_action vs netpoll?

2007-07-22 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Thu, 19 Jul 2007 17:27:47 +0100

 Please revisit the requirements that netconsole needs and redesign
 it from scratch.  The existing code is causing too much breakage.

 Can it be done without breaking the semantics of network devices, or
 should we rewrite the driver interface to have a different interface
 like netdev_sync_send_skb() that is slow, synchronous and
 non-interrupt (ie polls for completion).  Of course, then people
 will complain that netconsole traffic slows the machine down.

I couldn't agree more.

Since netpoll runs outside of all of the normal netdevice locking
rules, only the people using netpoll hit all the bugs.  That means
most of us do not test these code paths, which is bad.

So, if anything, a more integrated implementation is essential.

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH]: Resurrect napi_poll patch.

2007-07-22 Thread David Miller
From: David Miller [EMAIL PROTECTED]
Date: Sat, 21 Jul 2007 20:54:25 -0700 (PDT)

 Ok, I'll put the napi_struct into the driver private and respin
 the patch.

To facilitate discussion I converted the core and tg3, and include
just that part of the patch below.

Consequences:

1) rtnetlink and sysfs control of napi->weight had to be removed;
   we only have a device to deal with, and dev -> napi is not a
   one-to-one relationship any longer

2) netpoll is totally broken, it wants to go from a netdevice to
   a singular napi_struct and that is simply not possible any longer

   I made the netpoll disabled case compile for now.

   Note that netpoll would be broken in this regard even without
   Rusty's suggestion that we are implementing here, it simply
   made the issue un-ignorable which I think is fantastic.

Netpoll definitely needs to be dealt with, but as discussed with
Patrick the rtnetlink and sysfs knob drop isn't critical.  At worst, we
can at least preserve the sysfs case by creating some kind of
napi_create() that makes a sysfs directory and a weight file for a
driver so that the weight can still be set and observed from there.

Rusty, how does it look otherwise?

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 887b9a5..ee6c69c 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -574,7 +574,7 @@ static void tg3_restart_ints(struct tg3 *tp)
 static inline void tg3_netif_stop(struct tg3 *tp)
 {
	tp->dev->trans_start = jiffies; /* prevent tx timeout */
-   netif_poll_disable(tp->dev);
+   napi_disable(&tp->napi);
	netif_tx_disable(tp->dev);
 }
 
@@ -585,7 +585,7 @@ static inline void tg3_netif_start(struct tg3 *tp)
 * so long as all callers are assured to have free tx slots
 * (such as after tg3_init_hw)
 */
-   netif_poll_enable(tp->dev);
+   napi_enable(&tp->napi);
	tp->hw_status->status |= SD_STATUS_UPDATED;
	tg3_enable_ints(tp);
 }
@@ -3471,11 +3471,12 @@ next_pkt_nopost:
return received;
 }
 
-static int tg3_poll(struct net_device *netdev, int *budget)
+static int tg3_poll(struct napi_struct *napi, int budget)
 {
-   struct tg3 *tp = netdev_priv(netdev);
+   struct tg3 *tp = container_of(napi, struct tg3, napi);
+   struct net_device *netdev = tp->dev;
	struct tg3_hw_status *sblk = tp->hw_status;
-   int done;
+   int work_done = 0;
 
	/* handle link change and other phy events */
	if (!(tp->tg3_flags &
@@ -3494,7 +3495,7 @@ static int tg3_poll(struct net_device *netdev, int *budget)
	if (sblk->idx[0].tx_consumer != tp->tx_cons) {
	tg3_tx(tp);
	if (unlikely(tp->tg3_flags & TG3_FLAG_TX_RECOVERY_PENDING)) {
-   netif_rx_complete(netdev);
+   netif_rx_complete(netdev, &tp->napi);
	schedule_work(&tp->reset_task);
	return 0;
	}
@@ -3502,20 +3503,10 @@ static int tg3_poll(struct net_device *netdev, int *budget)
 
	/* run RX thread, within the bounds set by NAPI.
	 * All RX locking is done by ensuring outside
-* code synchronizes with dev->poll()
+* code synchronizes with tg3->napi.poll()
	 */
-   if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr) {
-   int orig_budget = *budget;
-   int work_done;
-
-   if (orig_budget > netdev->quota)
-   orig_budget = netdev->quota;
-
-   work_done = tg3_rx(tp, orig_budget);
-
-   *budget -= work_done;
-   netdev->quota -= work_done;
-   }
+   if (sblk->idx[0].rx_producer != tp->rx_rcb_ptr)
+   work_done = tg3_rx(tp, budget);
 
	if (tp->tg3_flags & TG3_FLAG_TAGGED_STATUS) {
	tp->last_tag = sblk->status_tag;
@@ -3524,13 +3515,12 @@ static int tg3_poll(struct net_device *netdev, int *budget)
	sblk->status &= ~SD_STATUS_UPDATED;
 
	/* if no more work, tell net stack and NIC we're done */
-   done = !tg3_has_work(tp);
-   if (done) {
-   netif_rx_complete(netdev);
+   if (!tg3_has_work(tp)) {
+   netif_rx_complete(netdev, &tp->napi);
	tg3_restart_ints(tp);
	}
 
-   return (done ? 0 : 1);
+   return work_done;
 }
 
 static void tg3_irq_quiesce(struct tg3 *tp)
@@ -3577,7 +3567,7 @@ static irqreturn_t tg3_msi_1shot(int irq, void *dev_id)
	prefetch(&tp->rx_rcb[tp->rx_rcb_ptr]);
 
	if (likely(!tg3_irq_sync(tp)))
-   netif_rx_schedule(dev); /* schedule NAPI poll */
+   netif_rx_schedule(dev, &tp->napi);
 
	return IRQ_HANDLED;
 }
@@ -3602,7 +3592,7 @@ static irqreturn_t tg3_msi(int irq, void *dev_id)
 */
	tw32_mailbox(MAILBOX_INTERRUPT_0 + TG3_64BIT_REG_LOW, 0x0001);
	if (likely(!tg3_irq_sync(tp)))
-   netif_rx_schedule(dev); /* schedule NAPI poll */
+   netif_rx_schedule(dev, &tp->napi);
 
 

Re: [PATCH]: Resurrect napi_poll patch.

2007-07-22 Thread David Miller
From: David Miller [EMAIL PROTECTED]
Date: Sun, 22 Jul 2007 00:18:18 -0700 (PDT)

 To facilitate discussion I converted the core and tg3, and include
 just that part of the patch below.

And for those for whom it makes it easier to play around with these
changes, here is the updated e1000 conversion below.

Making changes to e1000 is like visiting a crime scene. :-/

diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index 16a6edf..781ed99 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -300,6 +300,7 @@ struct e1000_adapter {
int cleaned_count);
struct e1000_rx_ring *rx_ring;  /* One per active queue */
 #ifdef CONFIG_E1000_NAPI
+   struct napi_struct napi;
struct net_device *polling_netdev;  /* One per active queue */
 #endif
int num_tx_queues;
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index f48b659..13c877c 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -162,7 +162,7 @@ static irqreturn_t e1000_intr_msi(int irq, void *data);
 static boolean_t e1000_clean_tx_irq(struct e1000_adapter *adapter,
 struct e1000_tx_ring *tx_ring);
 #ifdef CONFIG_E1000_NAPI
-static int e1000_clean(struct net_device *poll_dev, int *budget);
+static int e1000_clean(struct napi_struct *napi, int budget);
 static boolean_t e1000_clean_rx_irq(struct e1000_adapter *adapter,
 struct e1000_rx_ring *rx_ring,
 int *work_done, int work_to_do);
@@ -541,7 +541,7 @@ int e1000_up(struct e1000_adapter *adapter)
	clear_bit(__E1000_DOWN, &adapter->flags);
 
 #ifdef CONFIG_E1000_NAPI
-   netif_poll_enable(adapter->netdev);
+   napi_enable(&adapter->napi);
 #endif
	e1000_irq_enable(adapter);
 
@@ -630,7 +630,7 @@ e1000_down(struct e1000_adapter *adapter)
	set_bit(__E1000_DOWN, &adapter->flags);
 
 #ifdef CONFIG_E1000_NAPI
-   netif_poll_disable(netdev);
+   napi_disable(&adapter->napi);
 #endif
	e1000_irq_disable(adapter);
 
@@ -932,8 +932,8 @@ e1000_probe(struct pci_dev *pdev,
	netdev->tx_timeout = e1000_tx_timeout;
	netdev->watchdog_timeo = 5 * HZ;
 #ifdef CONFIG_E1000_NAPI
-   netdev->poll = e1000_clean;
-   netdev->weight = 64;
+   adapter->napi.poll = e1000_clean;
+   adapter->napi.weight = 64;
 #endif
	netdev->vlan_rx_register = e1000_vlan_rx_register;
	netdev->vlan_rx_add_vid = e1000_vlan_rx_add_vid;
@@ -1146,7 +1146,7 @@ e1000_probe(struct pci_dev *pdev,
netif_carrier_off(netdev);
netif_stop_queue(netdev);
 #ifdef CONFIG_E1000_NAPI
-   netif_poll_disable(netdev);
+   napi_disable(&adapter->napi);
 #endif
 
	strcpy(netdev->name, "eth%d");
@@ -1319,8 +1319,6 @@ e1000_sw_init(struct e1000_adapter *adapter)
 #ifdef CONFIG_E1000_NAPI
	for (i = 0; i < adapter->num_rx_queues; i++) {
	adapter->polling_netdev[i].priv = adapter;
-   adapter->polling_netdev[i].poll = e1000_clean;
-   adapter->polling_netdev[i].weight = 64;
	dev_hold(&adapter->polling_netdev[i]);
	set_bit(__LINK_STATE_START, &adapter->polling_netdev[i].state);
}
@@ -1437,7 +1435,7 @@ e1000_open(struct net_device *netdev)
	clear_bit(__E1000_DOWN, &adapter->flags);
 
 #ifdef CONFIG_E1000_NAPI
-   netif_poll_enable(netdev);
+   napi_enable(&adapter->napi);
 #endif
 
e1000_irq_enable(adapter);
@@ -1476,6 +1474,10 @@ e1000_close(struct net_device *netdev)
 {
struct e1000_adapter *adapter = netdev_priv(netdev);
 
+#ifdef CONFIG_E1000_NAPI
+   napi_disable(&adapter->napi);
+#endif
+
	WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
e1000_down(adapter);
e1000_power_down_phy(adapter);
@@ -3780,12 +3782,12 @@ e1000_intr_msi(int irq, void *data)
}
 
 #ifdef CONFIG_E1000_NAPI
-   if (likely(netif_rx_schedule_prep(netdev))) {
+   if (likely(netif_rx_schedule_prep(netdev, &adapter->napi))) {
	adapter->total_tx_bytes = 0;
	adapter->total_tx_packets = 0;
	adapter->total_rx_bytes = 0;
	adapter->total_rx_packets = 0;
-   __netif_rx_schedule(netdev);
+   __netif_rx_schedule(netdev, &adapter->napi);
	} else
	e1000_irq_enable(adapter);
 #else
@@ -3865,12 +3867,12 @@ e1000_intr(int irq, void *data)
	E1000_WRITE_REG(hw, IMC, ~0);
	E1000_WRITE_FLUSH(hw);
	}
-   if (likely(netif_rx_schedule_prep(netdev))) {
+   if (likely(netif_rx_schedule_prep(netdev, &adapter->napi))) {
	adapter->total_tx_bytes = 0;
	adapter->total_tx_packets = 0;
	adapter->total_rx_bytes = 0;
	adapter->total_rx_packets = 0;
-   __netif_rx_schedule(netdev);
+   __netif_rx_schedule(netdev, &adapter->napi);
 

Re: [PATCH]: Resurrect napi_poll patch.

2007-07-22 Thread Rusty Russell
On Sun, 2007-07-22 at 00:18 -0700, David Miller wrote:
 Rusty, how does it look otherwise?

I like it.  For a start, the simplification of the NAPI API was long
overdue, and the damage done by separating the napi_struct is really
minimal.  Overall the tg3 driver just looks a little nicer now, and
that's sweet.

Unfortunately my complete ignorance of netpoll prevents me from making
sensible comment there.  This seems to have slipped in tho:

 --- a/net/core/dev.c
 +++ b/net/core/dev.c
 @@ -220,7 +220,8 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
   *   Device drivers call our routines to queue packets here. We empty the
   *   queue in the local softnet handler.
   */
 -DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL };
 +
 +DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL, };
  
  #ifdef CONFIG_SYSFS
  extern int netdev_sysfs_init(void);

ISTR that no one is using buggy compilers which required per-cpu
initializations now, so this can simply be dropped.

(The new scheduler code doesn't initialize per-cpu, so if this is a
problem it should be noticed).

Thanks!
Rusty.




Re: [PATCH]: Resurrect napi_poll patch.

2007-07-22 Thread David Miller
From: Rusty Russell [EMAIL PROTECTED]
Date: Sun, 22 Jul 2007 17:51:52 +1000

 On Sun, 2007-07-22 at 00:18 -0700, David Miller wrote:
  --- a/net/core/dev.c
  +++ b/net/core/dev.c
  @@ -220,7 +220,8 @@ static RAW_NOTIFIER_HEAD(netdev_chain);
* Device drivers call our routines to queue packets here. We empty the
* queue in the local softnet handler.
*/
  -DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL };
  +
  +DEFINE_PER_CPU(struct softnet_data, softnet_data) = { NULL, };
   
   #ifdef CONFIG_SYSFS
   extern int netdev_sysfs_init(void);
 
 ISTR that no one is using buggy compilers which required per-cpu
 initializations now, so this can simply be dropped.
 
 (The new scheduler code doesn't initialize per-cpu, so if this is a
 problem it should be noticed).

Good catch. I've removed it from my tree; this will show
up in subsequent patch revisions.

Thanks Rusty.


[PATCH 00/12 -Rev2] Implement batching skb API

2007-07-22 Thread Krishna Kumar
This set of patches implements the batching API, and makes the following
changes resulting from the review of the first set:

Changes :
-
1.  Changed skb_blist from a pointer to static, as the pointer saves only
    12 bytes (i386) but bloats the code.
2.  Removed requirement for driver to set features & NETIF_F_BATCH_SKBS
    in register_netdev to enable batching, as it is redundant. Changed
    this flag to NETIF_F_BATCH_ON; it is set by register_netdev, and
    other user-changeable calls can modify this bit to enable/disable
    batching.
3.  Added ethtool support to enable/disable batching (not tested).
4.  Added rtnetlink support to enable/disable batching (not tested).
5.  Removed MIN_QUEUE_LEN_BATCH for batching, as high-performance drivers
    should not have a small queue anyway (adding bloat).
6.  skbs are purged from dev_deactivate instead of from unregister_netdev
    to drop all references to the device.
7.  Removed changelog in source code in sch_generic.c, and unrelated
    renames from sch_generic.c (lockless, comments).
8.  Removed xmit_slots entirely, as it was adding bloat (code and header)
    without adding value (it was calculated and set twice - in the
    internal send routine and in work-completion handling - and
    referenced once in batch xmit; it can instead be calculated once
    in xmit).

Issues :

1. Remove sysfs support completely?
2. Is rtnetlink support required, given that GSO has only ethtool?

Patches are described as:
Mail 0/12  : This mail.
Mail 1/12  : HOWTO documentation.
Mail 2/12  : Changes to netdevice.h
Mail 3/12  : dev.c changes.
Mail 4/12  : Ethtool changes.
Mail 5/12  : sysfs changes.
Mail 6/12  : rtnetlink changes.
Mail 7/12  : Change in qdisc_run & qdisc_restart API, modify callers
 to use this API.
Mail 8/12  : IPoIB include file changes.
Mail 9/12  : IPoIB verbs changes
Mail 10/12 : IPoIB multicast, CM changes
Mail 11/12 : IPoIB xmit API addition
Mail 12/12 : IPoIB xmit internals changes (ipoib_ib.c)

I have started a 10 run test for various buffer sizes and processes, and
will post the results on Monday.

Please review and provide feedback/ideas; and consider for inclusion.

Thanks,

- KK


[PATCH 01/12 -Rev2] HOWTO documentation for Batching SKB.

2007-07-22 Thread Krishna Kumar
diff -ruNp org/Documentation/networking/Batching_skb_API.txt rev2/Documentation/networking/Batching_skb_API.txt
--- org/Documentation/networking/Batching_skb_API.txt   1970-01-01 05:30:00.0 +0530
+++ rev2/Documentation/networking/Batching_skb_API.txt  2007-07-20 16:09:45.0 +0530
@@ -0,0 +1,91 @@
+HOWTO for batching skb API support
+----------------------------------
+
+Section 1: What is batching skb API ?
+Section 2: How batching API works vs the original API ?
+Section 3: How drivers can support this API ?
+Section 4: How users can work with this API ?
+
+
+Introduction: Kernel support for batching skb
+---
+
+An extended API is supported in the netdevice layer, which is very similar
+to the existing hard_start_xmit() API. Drivers which wish to take advantage
+of this new API should implement this routine similar to how the
+hard_start_xmit handler is written. The difference between these APIs is
+that while the existing hard_start_xmit processes one skb, the new API can
+process multiple skbs (or even one) in a single call. It is also possible
+for the driver writer to re-use most of the code from the existing API in
+the new API without duplicating code.
+
+
+Section 1: What is batching skb API ?
+-
+
+   This is a new API that is optionally exported by a driver. The
+   prerequisite for a driver to use this API is that it should have a
+   reasonably sized hardware queue that can process multiple skbs.
+
+
+Section 2: How batching API works vs the original API ?
+---
+
+   The networking stack normally gets called from upper layer protocols
+   with a single skb to xmit. This skb is first enqueued and an
+   attempt is next made to transmit it immediately (via qdisc_run).
+   However, events like driver lock contention, queue stopped, etc, can
+   result in the skb not getting sent out, and it remains in the queue.
+   When a new xmit is called or when the queue is re-enabled, qdisc_run
+   could potentially find multiple packets in the queue, and have to
+   send them all out one by one iteratively.
+
+   The batching skb API was added to exploit this situation: if there
+   are multiple skbs, all of them can be sent to the device in one
+   shot. This reduces driver processing, locking at the driver (or
+   in the stack for ~LLTX drivers) gets amortized over multiple skbs,
+   and for specific drivers where every xmit results in completion
+   processing (like IPoIB), the driver can be optimized to request a
+   completion only for the last skb that was sent, saving an interrupt
+   for every skb but the last one in the batch.
+
+   This batching can result in significant performance gains for
+   systems that have multiple data stream paths over the same network
+   interface card.
+
+
+Section 3: How drivers can support this API ?
+-
+
+   The new API - dev->hard_start_xmit_batch(struct net_device *dev) -
+   simplistically, can be written almost identically to the regular
+   xmit API (hard_start_xmit), except that all skbs on dev->skb_blist
+   should be processed by the driver instead of just one skb. The new
+   API doesn't get any skb as argument to process; instead it picks up
+   all the skbs from dev->skb_blist, where they were added by the stack,
+   and tries to send them out.
+
+   Batching requires the driver to set the NETIF_F_BATCH_SKBS bit in
+   dev->features, and dev->hard_start_xmit_batch should point to the
+   new API implemented for that driver.
+
+
+Section 4: How users can work with this API ?
+-
+
+   Batching could be disabled for a particular device, e.g. on desktop
+   systems if only one stream of network activity for that device is
+   taking place, since performance could be slightly affected due to
+   extra processing that batching adds. Batching can be enabled if
+   more than one stream of network activity per device is being done,
+   e.g. on servers, or even desktop usage with multiple browser, chat,
+   file transfer sessions, etc.
+
+   Per device batching can be enabled/disabled using:
+
+   echo 1 > /sys/class/net/device-name/tx_batch_skbs (enable)
+   echo 0 > /sys/class/net/device-name/tx_batch_skbs (disable)
+
+   E.g. to enable batching on eth0, run:
+   echo 1 > /sys/class/net/eth0/tx_batch_skbs


[PATCH 02/12 -Rev2] Changes to netdevice.h

2007-07-22 Thread Krishna Kumar
diff -ruNp org/include/linux/netdevice.h rev2/include/linux/netdevice.h
--- org/include/linux/netdevice.h   2007-07-20 07:49:28.0 +0530
+++ rev2/include/linux/netdevice.h  2007-07-22 13:20:16.0 +0530
@@ -340,6 +340,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BATCH_ON	8192	/* Batching skbs xmit API is enabled */
 #define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
/* Segmentation offload features */
@@ -452,6 +453,7 @@ struct net_device
struct Qdisc*qdisc_sleeping;
struct list_headqdisc_list;
	unsigned long		tx_queue_len;	/* Max frames per queue allowed */
+   struct sk_buff_head skb_blist;  /* List of batch skbs */
 
/* Partially transmitted GSO packet. */
struct sk_buff  *gso_skb;
@@ -472,6 +474,9 @@ struct net_device
void*priv;  /* pointer to private data  */
int (*hard_start_xmit) (struct sk_buff *skb,
struct net_device *dev);
+   int (*hard_start_xmit_batch) (struct net_device
+ *dev);
+
/* These may be needed for future network-power-down code. */
	unsigned long		trans_start;	/* Time (in jiffies) of last Tx */
 
@@ -582,6 +587,8 @@ struct net_device
 #defineNETDEV_ALIGN32
 #defineNETDEV_ALIGN_CONST  (NETDEV_ALIGN - 1)
 
+#define BATCHING_ON(dev)   ((dev->features & NETIF_F_BATCH_ON) != 0)
+
 static inline void *netdev_priv(const struct net_device *dev)
 {
	return dev->priv;
@@ -832,6 +839,8 @@ extern int  dev_set_mac_address(struct n
struct sockaddr *);
 extern int dev_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev);
+extern int dev_add_skb_to_blist(struct sk_buff *skb,
+struct net_device *dev);
 
 extern voiddev_init(void);
 
@@ -1104,6 +1113,8 @@ extern void   dev_set_promiscuity(struct 
 extern voiddev_set_allmulti(struct net_device *dev, int inc);
 extern voidnetdev_state_change(struct net_device *dev);
 extern voidnetdev_features_change(struct net_device *dev);
+extern int dev_change_tx_batching(struct net_device *dev,
+  unsigned long new_batch_skb);
 /* Load a device via the kmod */
 extern voiddev_load(const char *name);
 extern voiddev_mcast_init(void);


[PATCH 03/12 -Rev2] dev.c changes.

2007-07-22 Thread Krishna Kumar
diff -ruNp org/net/core/dev.c rev2/net/core/dev.c
--- org/net/core/dev.c  2007-07-20 07:49:28.0 +0530
+++ rev2/net/core/dev.c 2007-07-21 23:08:33.0 +0530
@@ -875,6 +875,48 @@ void netdev_state_change(struct net_devi
}
 }
 
+/*
+ * dev_change_tx_batching - Enable or disable batching for a driver that
+ * supports batching.
+ */
+int dev_change_tx_batching(struct net_device *dev, unsigned long new_batch_skb)
+{
+   int ret;
+
+   if (!dev->hard_start_xmit_batch) {
+   /* Driver doesn't support skb batching */
+   ret = -ENOTSUPP;
+   goto out;
+   }
+
+   /* Handle invalid argument */
+   if (new_batch_skb < 0) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   ret = 0;
+
+   /* Check if new value is same as the current */
+   if (!!(dev-features  NETIF_F_BATCH_ON) == !!new_batch_skb)
+   goto out;
+
+   spin_lock(dev-queue_lock);
+   if (new_batch_skb) {
+   dev-features |= NETIF_F_BATCH_ON;
+   dev-tx_queue_len = 1;
+   } else {
+   if (!skb_queue_empty(dev-skb_blist))
+   skb_queue_purge(dev-skb_blist);
+   dev-features = ~NETIF_F_BATCH_ON;
+   dev-tx_queue_len = 1;
+   }
+   spin_unlock(dev-queue_lock);
+
+out:
+   return ret;
+}
+
 /**
  * dev_load- load a network module
  * @name: name of interface
@@ -1414,6 +1456,45 @@ static int dev_gso_segment(struct sk_buf
return 0;
 }
 
+/*
+ * Add skb (skbs in case segmentation is required) to dev-skb_blist. We are
+ * holding QDISC RUNNING bit, so no one else can add to this list. Also, skbs
+ * are dequeued from this list when we call the driver, so the list is safe
+ * from simultaneous deletes too.
+ *
+ * Returns count of successful skb(s) added to skb_blist.
+ */
+int dev_add_skb_to_blist(struct sk_buff *skb, struct net_device *dev)
+{
+   if (!list_empty(&ptype_all))
+   dev_queue_xmit_nit(skb, dev);
+
+   if (netif_needs_gso(dev, skb)) {
+   if (unlikely(dev_gso_segment(skb))) {
+   kfree_skb(skb);
+   return 0;
+   }
+
+   if (skb->next) {
+   int count = 0;
+
+   do {
+   struct sk_buff *nskb = skb->next;
+
+   skb->next = nskb->next;
+   __skb_queue_tail(&dev->skb_blist, nskb);
+   count++;
+   } while (skb->next);
+
+   skb->destructor = DEV_GSO_CB(skb)->destructor;
+   kfree_skb(skb);
+   return count;
+   }
+   }
+   __skb_queue_tail(&dev->skb_blist, skb);
+   return 1;
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
	if (likely(!skb->next)) {
@@ -3397,6 +3483,12 @@ int register_netdevice(struct net_device
}
}
 
+   if (dev->hard_start_xmit_batch) {
+   dev->features |= NETIF_F_BATCH_ON;
+   skb_queue_head_init(&dev->skb_blist);
+   dev->tx_queue_len >>= 1;
+   }
+
/*
 *  nil rebuild_header routine,
 *  that should be never called and used as just bug trap.


[PATCH 04/12 -Rev2] Ethtool changes

2007-07-22 Thread Krishna Kumar
diff -ruNp org/include/linux/ethtool.h rev2/include/linux/ethtool.h
--- org/include/linux/ethtool.h 2007-07-21 13:39:50.0 +0530
+++ rev2/include/linux/ethtool.h    2007-07-21 13:40:57.0 +0530
@@ -414,6 +414,8 @@ struct ethtool_ops {
 #define ETHTOOL_SUFO   0x0022 /* Set UFO enable (ethtool_value) */
 #define ETHTOOL_GGSO   0x0023 /* Get GSO enable (ethtool_value) */
 #define ETHTOOL_SGSO   0x0024 /* Set GSO enable (ethtool_value) */
+#define ETHTOOL_GBTX   0x0025 /* Get Batching (ethtool_value) */
+#define ETHTOOL_SBTX   0x0026 /* Set Batching (ethtool_value) */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
diff -ruNp org/net/core/ethtool.c rev2/net/core/ethtool.c
--- org/net/core/ethtool.c  2007-07-21 13:37:17.0 +0530
+++ rev2/net/core/ethtool.c 2007-07-21 22:55:38.0 +0530
@@ -648,6 +648,26 @@ static int ethtool_set_gso(struct net_de
return 0;
 }
 
+static int ethtool_get_batch(struct net_device *dev, char __user *useraddr)
+{
+   struct ethtool_value edata = { ETHTOOL_GBTX };
+
+   edata.data = BATCHING_ON(dev);
+   if (copy_to_user(useraddr, &edata, sizeof(edata)))
+   return -EFAULT;
+   return 0;
+}
+
+static int ethtool_set_batch(struct net_device *dev, char __user *useraddr)
+{
+   struct ethtool_value edata;
+
+   if (copy_from_user(&edata, useraddr, sizeof(edata)))
+   return -EFAULT;
+
+   return dev_change_tx_batching(dev, edata.data);
+}
+
 static int ethtool_self_test(struct net_device *dev, char __user *useraddr)
 {
struct ethtool_test test;
@@ -959,6 +979,12 @@ int dev_ethtool(struct ifreq *ifr)
case ETHTOOL_SGSO:
rc = ethtool_set_gso(dev, useraddr);
break;
+   case ETHTOOL_GBTX:
+   rc = ethtool_get_batch(dev, useraddr);
+   break;
+   case ETHTOOL_SBTX:
+   rc = ethtool_set_batch(dev, useraddr);
+   break;
default:
rc =  -EOPNOTSUPP;
}


[PATCH 05/12 -Rev2] sysfs changes.

2007-07-22 Thread Krishna Kumar
diff -ruNp org/net/core/net-sysfs.c rev2/net/core/net-sysfs.c
--- org/net/core/net-sysfs.c    2007-07-20 07:49:28.0 +0530
+++ rev2/net/core/net-sysfs.c   2007-07-21 22:56:32.0 +0530
@@ -230,6 +230,21 @@ static ssize_t store_weight(struct devic
return netdev_store(dev, attr, buf, len, change_weight);
 }
 
+static ssize_t show_tx_batch_skb(struct device *dev,
+struct device_attribute *attr, char *buf)
+{
+   struct net_device *netdev = to_net_dev(dev);
+
+   return sprintf(buf, fmt_dec, BATCHING_ON(netdev));
+}
+
+static ssize_t store_tx_batch_skb(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+   return netdev_store(dev, attr, buf, len, dev_change_tx_batching);
+}
+
 static struct device_attribute net_class_attributes[] = {
__ATTR(addr_len, S_IRUGO, show_addr_len, NULL),
__ATTR(iflink, S_IRUGO, show_iflink, NULL),
@@ -246,6 +261,8 @@ static struct device_attribute net_class
__ATTR(flags, S_IRUGO | S_IWUSR, show_flags, store_flags),
__ATTR(tx_queue_len, S_IRUGO | S_IWUSR, show_tx_queue_len,
   store_tx_queue_len),
+   __ATTR(tx_batch_skbs, S_IRUGO | S_IWUSR, show_tx_batch_skb,
+  store_tx_batch_skb),
__ATTR(weight, S_IRUGO | S_IWUSR, show_weight, store_weight),
{}
 };


[PATCH 06/12 -Rev2] rtnetlink changes.

2007-07-22 Thread Krishna Kumar
diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
--- org/include/linux/if_link.h 2007-07-20 16:33:35.0 +0530
+++ rev2/include/linux/if_link.h2007-07-20 16:35:08.0 +0530
@@ -78,6 +78,8 @@ enum
IFLA_LINKMODE,
IFLA_LINKINFO,
 #define IFLA_LINKINFO IFLA_LINKINFO
+   IFLA_TXBTHSKB,  /* Driver support for Batch'd skbs */
+#define IFLA_TXBTHSKB IFLA_TXBTHSKB
__IFLA_MAX
 };
 
diff -ruNp org/net/core/rtnetlink.c rev2/net/core/rtnetlink.c
--- org/net/core/rtnetlink.c2007-07-20 16:31:59.0 +0530
+++ rev2/net/core/rtnetlink.c   2007-07-21 22:27:10.0 +0530
@@ -634,6 +634,7 @@ static int rtnl_fill_ifinfo(struct sk_bu
 
 	NLA_PUT_STRING(skb, IFLA_IFNAME, dev->name);
 	NLA_PUT_U32(skb, IFLA_TXQLEN, dev->tx_queue_len);
+	NLA_PUT_U32(skb, IFLA_TXBTHSKB, BATCHING_ON(dev));
 	NLA_PUT_U32(skb, IFLA_WEIGHT, dev->weight);
 	NLA_PUT_U8(skb, IFLA_OPERSTATE,
 		   netif_running(dev) ? dev->operstate : IF_OPER_DOWN);
@@ -833,7 +834,8 @@ static int do_setlink(struct net_device 
 
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
-
+	if (tb[IFLA_TXBTHSKB])
+		dev_change_tx_batching(dev, nla_get_u32(tb[IFLA_TXBTHSKB]));
 	if (tb[IFLA_WEIGHT])
 		dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
 
@@ -1072,6 +1074,9 @@ replay:
   nla_len(tb[IFLA_BROADCAST]));
 	if (tb[IFLA_TXQLEN])
 		dev->tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+	if (tb[IFLA_TXBTHSKB])
+		dev_change_tx_batching(dev,
+				       nla_get_u32(tb[IFLA_TXBTHSKB]));
 	if (tb[IFLA_WEIGHT])
 		dev->weight = nla_get_u32(tb[IFLA_WEIGHT]);
if (tb[IFLA_OPERSTATE])


[PATCH 07/12 -Rev2] Change qdisc_run qdisc_restart API, callers

2007-07-22 Thread Krishna Kumar
diff -ruNp org/include/net/pkt_sched.h rev2/include/net/pkt_sched.h
--- org/include/net/pkt_sched.h 2007-07-20 07:49:28.0 +0530
+++ rev2/include/net/pkt_sched.h2007-07-20 16:09:45.0 +0530
@@ -80,13 +80,13 @@ extern struct qdisc_rate_table *qdisc_ge
struct rtattr *tab);
 extern void qdisc_put_rtab(struct qdisc_rate_table *tab);
 
-extern void __qdisc_run(struct net_device *dev);
+extern void __qdisc_run(struct net_device *dev, struct sk_buff_head *blist);
 
-static inline void qdisc_run(struct net_device *dev)
+static inline void qdisc_run(struct net_device *dev, struct sk_buff_head *blist)
 {
 	if (!netif_queue_stopped(dev) &&
 	    !test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
-		__qdisc_run(dev);
+		__qdisc_run(dev, blist);
 }
 
 extern int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
diff -ruNp org/net/sched/sch_generic.c rev2/net/sched/sch_generic.c
--- org/net/sched/sch_generic.c 2007-07-20 07:49:28.0 +0530
+++ rev2/net/sched/sch_generic.c2007-07-22 12:11:10.0 +0530
@@ -59,10 +59,12 @@ static inline int qdisc_qlen(struct Qdis
 static inline int dev_requeue_skb(struct sk_buff *skb, struct net_device *dev,
  struct Qdisc *q)
 {
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (likely(skb)) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+   }
 
netif_schedule(dev);
return 0;
@@ -91,18 +93,23 @@ static inline int handle_dev_cpu_collisi
/*
 * Same CPU holding the lock. It may be a transient
 * configuration error, when hard_start_xmit() recurses. We
-* detect it by checking xmit owner and drop the packet when
-* deadloop is detected. Return OK to try the next skb.
+* detect it by checking xmit owner and drop skb (or all
+* skbs in batching case) when deadloop is detected. Return
+* OK to try the next skb.
 */
-   kfree_skb(skb);
+   if (likely(skb))
+   kfree_skb(skb);
+	else if (!skb_queue_empty(dev->skb_blist))
+		skb_queue_purge(dev->skb_blist);
+
if (net_ratelimit())
 			printk(KERN_WARNING "Dead loop on netdevice %s, "
 			       "fix it urgently!\n", dev->name);
ret = qdisc_qlen(q);
} else {
/*
-* Another cpu is holding lock, requeue  delay xmits for
-* some time.
+* Another cpu is holding lock. Requeue skb and delay xmits
+* for some time.
 */
__get_cpu_var(netdev_rx_stat).cpu_collision++;
ret = dev_requeue_skb(skb, dev, q);
@@ -112,6 +119,38 @@ static inline int handle_dev_cpu_collisi
 }
 
 /*
+ * Algorithm to get skb(s) is:
+ * - Non-batching drivers, or if the batch list is empty and there is
+ *   at most one skb in the queue - dequeue skb and put it in *skbp to
+ *   tell the caller to use the single xmit API.
+ * - Batching drivers where the batch list already contains at least one
+ *   skb, or if there are multiple skbs in the queue: keep dequeueing
+ *   skbs up to a limit and set *skbp to NULL to tell the caller to use
+ *   the multiple xmit API.
+ *
+ * Returns:
+ * 1 - at least one skb is to be sent out, *skbp contains skb or NULL
+ * (in case > 1 skbs are present in blist for batching)
+ * 0 - no skbs to be sent.
+ */
+static inline int get_skb(struct net_device *dev, struct Qdisc *q,
+ struct sk_buff_head *blist, struct sk_buff **skbp)
+{
+	if (likely(!blist) || (!skb_queue_len(blist) && qdisc_qlen(q) <= 1)) {
+		return likely((*skbp = dev_dequeue_skb(dev, q)) != NULL);
+	} else {
+		int max = dev->tx_queue_len - skb_queue_len(blist);
+		struct sk_buff *skb;
+
+		while (max > 0 && (skb = dev_dequeue_skb(dev, q)) != NULL)
+			max -= dev_add_skb_to_blist(skb, dev);
+
+		*skbp = NULL;
+		return 1;	/* we have at least one skb in blist */
+   }
+}
+
+/*
  * NOTE: Called under dev-queue_lock with locally disabled BH.
  *
  * __LINK_STATE_QDISC_RUNNING guarantees only one CPU can process this
@@ -130,7 +169,8 @@ static inline int handle_dev_cpu_collisi
  * 0 - queue is not empty.
  *
  */
-static inline int qdisc_restart(struct net_device *dev)
+static inline int qdisc_restart(struct net_device *dev,
+   struct sk_buff_head *blist)
 {
 	struct Qdisc *q = dev->qdisc;
struct 

[PATCH 09/12 -Rev2] IPoIB verbs changes

2007-07-22 Thread Krishna Kumar
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 
rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_verbs.c  2007-07-20 
07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 2007-07-20 
16:09:45.0 +0530
@@ -152,11 +152,11 @@ int ipoib_transport_dev_init(struct net_
.max_send_sge = 1,
.max_recv_sge = 1
},
-		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.sq_sig_type = IB_SIGNAL_REQ_WR,	/* 11.2.4.1 */
.qp_type = IB_QPT_UD
};
-
-   int ret, size;
+   struct ib_send_wr *next_wr = NULL;
+   int i, ret, size;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -197,12 +197,17 @@ int ipoib_transport_dev_init(struct net_
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	priv->tx_sge.lkey	= priv->mr->lkey;
-
-	priv->tx_wr.opcode	= IB_WR_SEND;
-	priv->tx_wr.sg_list	= &priv->tx_sge;
-	priv->tx_wr.num_sge	= 1;
-	priv->tx_wr.send_flags	= IB_SEND_SIGNALED;
+	for (i = ipoib_sendq_size - 1; i >= 0; i--) {
+		priv->tx_sge[i].lkey		= priv->mr->lkey;
+		priv->tx_wr[i].opcode		= IB_WR_SEND;
+		priv->tx_wr[i].sg_list		= &priv->tx_sge[i];
+		priv->tx_wr[i].num_sge		= 1;
+		priv->tx_wr[i].send_flags	= 0;
+
+		/* Link the list properly for provider to use */
+		priv->tx_wr[i].next		= next_wr;
+		next_wr = &priv->tx_wr[i];
+	}
 
return 0;
 


[PATCH 10/12 -Rev2] IPoIB multicast, CM changes

2007-07-22 Thread Krishna Kumar
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_cm.c 2007-07-20 07:49:28.0 
+0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_cm.c2007-07-20 
16:09:45.0 +0530
@@ -493,14 +493,19 @@ static inline int post_send(struct ipoib
unsigned int wr_id,
u64 addr, int len)
 {
+   int ret;
struct ib_send_wr *bad_wr;
 
-	priv->tx_sge.addr	  = addr;
-	priv->tx_sge.length	  = len;
+	priv->tx_sge[0].addr	  = addr;
+	priv->tx_sge[0].length	  = len;
+
+	priv->tx_wr[0].wr_id	  = wr_id;

-	priv->tx_wr.wr_id	  = wr_id;
+	priv->tx_wr[0].next = NULL;
+	ret = ib_post_send(tx->qp, priv->tx_wr, &bad_wr);
+	priv->tx_wr[0].next = &priv->tx_wr[1];

-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ret;
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct 
ipoib_cm_tx *tx)
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_multicast.c  2007-07-20 
07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 2007-07-20 
16:09:45.0 +0530
@@ -217,7 +217,7 @@ static int ipoib_mcast_join_finish(struc
 	if (!memcmp(mcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		priv->tx_wr[0].wr.ud.remote_qkey = priv->qkey;
}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -736,7 +736,7 @@ out:
}
}
 
-		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN);
+		ipoib_send(dev, skb, mcast->ah, IB_MULTICAST_QPN, 1);
}
 
 unlock:


[PATCH 11/12 -Rev2] IPoIB xmit API addition

2007-07-22 Thread Krishna Kumar
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_ib.c 2007-07-20 07:49:28.0 
+0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_ib.c2007-07-22 
00:08:37.0 +0530
@@ -242,8 +242,9 @@ repost:
 static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
+   int i = 0, num_completions;
+	int tx_ring_index = priv->tx_tail & (ipoib_sendq_size - 1);
 	unsigned int wr_id = wc->wr_id;
-   struct ipoib_tx_buf *tx_req;
unsigned long flags;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
@@ -255,23 +256,57 @@ static void ipoib_ib_handle_tx_wc(struct
return;
}
 
-	tx_req = &priv->tx_ring[wr_id];
+	num_completions = wr_id - tx_ring_index + 1;
+	if (num_completions <= 0)
+		num_completions += ipoib_sendq_size;

-	ib_dma_unmap_single(priv->ca, tx_req->mapping,
-			    tx_req->skb->len, DMA_TO_DEVICE);
+   /*
+* Handle skbs completion from tx_tail to wr_id. It is possible to
+* handle WC's from earlier post_sends (possible multiple) in this
+* iteration as we move from tx_tail to wr_id, since if the last
+* WR (which is the one which had a completion request) failed to be
+* sent for any of those earlier request(s), no completion
+* notification is generated for successful WR's of those earlier
+* request(s).
+*/
+   while (1) {
+   /*
+* Could use while (i  num_completions), but it is costly
+* since in most cases there is 1 completion, and we end up
+		 * doing an extra index = (index + 1) & (ipoib_sendq_size - 1)
+*/
+		struct ipoib_tx_buf *tx_req = &priv->tx_ring[tx_ring_index];
+
+		if (likely(tx_req->skb)) {
+			ib_dma_unmap_single(priv->ca, tx_req->mapping,
+					    tx_req->skb->len, DMA_TO_DEVICE);

-	++priv->stats.tx_packets;
-	priv->stats.tx_bytes += tx_req->skb->len;
+			++priv->stats.tx_packets;
+			priv->stats.tx_bytes += tx_req->skb->len;

-	dev_kfree_skb_any(tx_req->skb);
+			dev_kfree_skb_any(tx_req->skb);
+   }
+   /*
+* else this skb failed synchronously when posted and was
+* freed immediately.
+*/
+
+   if (++i == num_completions)
+   break;
+
+   /* More WC's to handle */
+		tx_ring_index = (tx_ring_index + 1) & (ipoib_sendq_size - 1);
+   }
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
-	++priv->tx_tail;
+
+	priv->tx_tail += num_completions;
 	if (unlikely(test_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) &&
 	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
netif_wake_queue(dev);
}
+
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -340,78 +375,178 @@ void ipoib_ib_completion(struct ib_cq *c
netif_rx_schedule(dev_ptr);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
-   unsigned int wr_id,
-   struct ib_ah *address, u32 qpn,
-   u64 addr, int len)
+/*
+ * post_send : Post WR(s) to the device.
+ *
+ * num_skbs is the number of WR's, 'start_index' is the first slot in
+ * tx_wr[] or tx_sge[]. Note: 'start_index' is normally zero, unless a
+ * previous post_send returned error and we are trying to send the untried
+ * WR's, in which case start_index will point to the first untried WR.
+ *
+ * We also break the WR link before posting so that the driver knows how
+ * many WR's to process, and this is set back after the post.
+ */
+static inline int post_send(struct ipoib_dev_priv *priv, u32 qpn,
+   int start_index, int num_skbs,
+   struct ib_send_wr **bad_wr)
 {
-   struct ib_send_wr *bad_wr;
+   int ret;
+   struct ib_send_wr *last_wr, *next_wr;
+
+	last_wr = &priv->tx_wr[start_index + num_skbs - 1];
+
+	/* Set Completion Notification for last WR */
+	last_wr->send_flags = IB_SEND_SIGNALED;

-	priv->tx_sge.addr	  = addr;
-	priv->tx_sge.length	  = len;
+	/* Terminate the last WR */
+	next_wr = last_wr->next;
+	last_wr->next = NULL;

-	priv->tx_wr.wr_id		  = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn	  = qpn;
-	priv->tx_wr.wr.ud.ah		  = address;
+	/* Send all the WR's in one doorbell */
+	ret = ib_post_send(priv->qp, &priv->tx_wr[start_index], bad_wr);
 
-   return 

[PATCH 12/12 -Rev2] IPoIB xmit internals changes (ipoib_ib.c)

2007-07-22 Thread Krishna Kumar
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib_main.c 
rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c
--- org/drivers/infiniband/ulp/ipoib/ipoib_main.c   2007-07-20 
07:49:28.0 +0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib_main.c  2007-07-22 
00:08:28.0 +0530
@@ -558,7 +558,8 @@ static void neigh_add_path(struct sk_buf
goto err_drop;
}
} else
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+		ipoib_send(dev, skb, path->ah,
+			   IPOIB_QPN(skb->dst->neighbour->ha), 1);
} else {
neigh-ah  = NULL;
 
@@ -638,7 +639,7 @@ static void unicast_arp_send(struct sk_b
 		ipoib_dbg(priv, "Send unicast ARP to %04x\n",
 			  be16_to_cpu(path->pathrec.dlid));

-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr));
+		ipoib_send(dev, skb, path->ah, IPOIB_QPN(phdr->hwaddr), 1);
 	} else if ((path->query || !path_rec_start(dev, path)) &&
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
/* put pseudoheader back on for next time */
@@ -704,7 +705,8 @@ static int ipoib_start_xmit(struct sk_bu
goto out;
}
 
-		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+		ipoib_send(dev, skb, neigh->ah,
+			   IPOIB_QPN(skb->dst->neighbour->ha), 1);
goto out;
}
 
@@ -753,6 +755,175 @@ out:
return NETDEV_TX_OK;
 }
 
+#define XMIT_QUEUED_SKBS()						\
+	do {								\
+		if (num_skbs) {						\
+			ipoib_send(dev, NULL, old_neigh->ah, old_qpn,	\
+				   num_skbs);				\
+			num_skbs = 0;					\
+		}							\
+	} while (0)
+
+/*
+ * TODO: Merge with ipoib_start_xmit to use the same code and have a
+ * transparent wrapper caller to xmit's, etc.
+ */
+static int ipoib_start_xmit_frames(struct net_device *dev)
+{
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
+   struct sk_buff *skb;
+   struct sk_buff_head *blist;
+   int max_skbs, num_skbs = 0, tx_ring_index = -1;
+   u32 qpn, old_qpn = 0;
+   struct ipoib_neigh *neigh, *old_neigh = NULL;
+   unsigned long flags;
+
+	if (unlikely(!spin_trylock_irqsave(&priv->tx_lock, flags)))
+   return NETDEV_TX_LOCKED;
+
+	blist = dev->skb_blist;
+
+   /*
+	 * Send at most 'max_skbs' skbs. This also prevents the device from
+	 * getting full.
+*/
+	max_skbs = ipoib_sendq_size - (priv->tx_head - priv->tx_tail);
+	while (max_skbs-- > 0 && (skb = __skb_dequeue(blist)) != NULL) {
+   /*
+* From here on, ipoib_send() cannot stop the queue as it
+* uses the same initialization as 'max_skbs'. So we can
+* optimize to not check for queue stopped for every skb.
+*/
+		if (likely(skb->dst && skb->dst->neighbour)) {
+			if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
+   XMIT_QUEUED_SKBS();
+   ipoib_path_lookup(skb, dev);
+   continue;
+   }
+
+			neigh = *to_ipoib_neigh(skb->dst->neighbour);
+
+   if (ipoib_cm_get(neigh)) {
+   if (ipoib_cm_up(neigh)) {
+   XMIT_QUEUED_SKBS();
+   ipoib_cm_send(dev, skb,
+ ipoib_cm_get(neigh));
+   continue;
+   }
+			} else if (neigh->ah) {
+				if (unlikely(memcmp(neigh->dgid.raw,
+						    skb->dst->neighbour->ha + 4,
+						    sizeof(union ib_gid)))) {
+					spin_lock(&priv->lock);
+   /*
+* It's safe to call ipoib_put_ah()
+* inside priv-lock here, because we
+					 * know that path->ah will always hold
+* one more reference, so ipoib_put_ah()
+* will never do more than decrement
+* the ref count.
+  

[PATCH 08/12 -Rev2] IPoIB include file changes.

2007-07-22 Thread Krishna Kumar
diff -ruNp org/drivers/infiniband/ulp/ipoib/ipoib.h 
rev2/drivers/infiniband/ulp/ipoib/ipoib.h
--- org/drivers/infiniband/ulp/ipoib/ipoib.h2007-07-20 07:49:28.0 
+0530
+++ rev2/drivers/infiniband/ulp/ipoib/ipoib.h   2007-07-20 16:09:45.0 
+0530
@@ -269,8 +269,8 @@ struct ipoib_dev_priv {
struct ipoib_tx_buf *tx_ring;
unsigned tx_head;
unsigned tx_tail;
-	struct ib_sge	     tx_sge;
-	struct ib_send_wr    tx_wr;
+	struct ib_sge	     *tx_sge;
+	struct ib_send_wr    *tx_wr;
 
struct ib_wc ibwc[IPOIB_NUM_WC];
 
@@ -365,8 +365,11 @@ static inline void ipoib_put_ah(struct i
 int ipoib_open(struct net_device *dev);
 int ipoib_add_pkey_attr(struct net_device *dev);
 
+int ipoib_process_skb(struct net_device *dev, struct sk_buff *skb,
+ struct ipoib_dev_priv *priv, int snum, int tx_index,
+ struct ipoib_ah *address, u32 qpn);
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
-   struct ipoib_ah *address, u32 qpn);
+   struct ipoib_ah *address, u32 qpn, int num_skbs);
 void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_flush_paths(struct net_device *dev);


Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition

2007-07-22 Thread Michael S. Tsirkin
 + /*
 +  * Handle skbs completion from tx_tail to wr_id. It is possible to
 +  * handle WC's from earlier post_sends (possible multiple) in this
 +  * iteration as we move from tx_tail to wr_id, since if the last
 +  * WR (which is the one which had a completion request) failed to be
 +  * sent for any of those earlier request(s), no completion
 +  * notification is generated for successful WR's of those earlier
 +  * request(s).
 +  */

AFAIK a signalled WR will always generate a completion.
What am I missing?

 
 + /*
 +  * Better error handling can be done here, like free
 +  * all untried skbs if err == -ENOMEM. However at this
 +  * time, we re-try all the skbs, all of which will
 +  * likely fail anyway (unless device finished sending
 +  * some out in the meantime). This is not a regression
 +  * since the earlier code is not doing this either.
 +  */

Are you retrying posting skbs? Why is this a good idea?
AFAIK, earlier code did not retry posting WRs at all.
The comment seems to imply that post send fails as a result of SQ overflow -
do you see SQ overflow errors in your testing?
AFAIK, IPoIB should never overflow the SQ.

-- 
MST


Re: [PATCH] CONFIG_NET=n - lots of link time errors

2007-07-22 Thread Jan Engelhardt

On Jul 21 2007 19:12, David Miller wrote:
 Enabling drivers from Devices > Networking (in menuconfig), for 
 example SLIP and/or PLIP, throws link time errors when CONFIG_NET itself 
 is =n. Have CONFIG_NETDEVICES depend on CONFIG_NET.
 
 Signed-off-by: Jan Engelhardt [EMAIL PROTECTED]

This is the second time I've seen this change in the past few
days, and people seem to hit it quite readily with randconfig.

It seems reasonable and I'll apply it, thanks Jan.

While randconfig is nice and good, this error actually came by manually
tuning the .config for a specific goal (namely: make a small kernel for an
ancient i386 without modifying any code). Turning off CONFIG_NET
was then just an idea away, because I knew that TCP was quite big.


Jan
-- 


[NEIGH]: Combine neighbour cleanup and release

2007-07-22 Thread Thomas Graf
Introduces neigh_cleanup_and_release() to be used after a
neighbour has been removed from its neighbour table. Serves
as preparation to add event notifications.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6/net/core/neighbour.c
===
--- net-2.6.orig/net/core/neighbour.c   2007-07-22 11:41:46.0 +0200
+++ net-2.6/net/core/neighbour.c2007-07-22 11:42:02.0 +0200
@@ -104,6 +104,14 @@ static int neigh_blackhole(struct sk_buf
return -ENETDOWN;
 }
 
+static void neigh_cleanup_and_release(struct neighbour *neigh)
+{
+	if (neigh->parms->neigh_cleanup)
+		neigh->parms->neigh_cleanup(neigh);
+
+	neigh_release(neigh);
+}
+
 /*
  * It is random distribution in the interval (1/2)*base...(3/2)*base.
  * It corresponds to default IPv6 settings and is not overridable,
@@ -140,9 +148,7 @@ static int neigh_forced_gc(struct neigh_
 			n->dead = 1;
 			shrunk	= 1;
 			write_unlock(&n->lock);
-			if (n->parms->neigh_cleanup)
-				n->parms->neigh_cleanup(n);
-			neigh_release(n);
+			neigh_cleanup_and_release(n);
 			continue;
 		}
 		write_unlock(&n->lock);
@@ -213,9 +219,7 @@ static void neigh_flush_dev(struct neigh
 			NEIGH_PRINTK2("neigh %p is stray.\n", n);
 		}
 		write_unlock(&n->lock);
-		if (n->parms->neigh_cleanup)
-			n->parms->neigh_cleanup(n);
-		neigh_release(n);
+		neigh_cleanup_and_release(n);
}
}
 }
@@ -676,9 +680,7 @@ static void neigh_periodic_timer(unsigne
 			*np = n->next;
 			n->dead = 1;
 			write_unlock(&n->lock);
-			if (n->parms->neigh_cleanup)
-				n->parms->neigh_cleanup(n);
-			neigh_release(n);
+			neigh_cleanup_and_release(n);
continue;
}
 		write_unlock(&n->lock);
@@ -2094,11 +2096,8 @@ void __neigh_for_each_release(struct nei
 		} else
 			np = n->next;
 		write_unlock(&n->lock);
-		if (release) {
-			if (n->parms->neigh_cleanup)
-				n->parms->neigh_cleanup(n);
-			neigh_release(n);
-		}
+		if (release)
+			neigh_cleanup_and_release(n);
}
}
 }


[NEIGH]: Netlink notifications

2007-07-22 Thread Thomas Graf
Currently neighbour event notifications are limited to update
notifications and only sent if the ARP daemon is enabled. This
patch extends the existing notification code by also reporting
neighbours being removed due to gc or administratively and
removes the dependency on the ARP daemon. This allows to keep
track of neighbour states without periodically fetching the
complete neighbour table.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6/net/core/neighbour.c
===
--- net-2.6.orig/net/core/neighbour.c   2007-07-22 11:42:02.0 +0200
+++ net-2.6/net/core/neighbour.c2007-07-22 11:49:15.0 +0200
@@ -54,9 +54,8 @@
 #define PNEIGH_HASHMASK0xF
 
 static void neigh_timer_handler(unsigned long arg);
-#ifdef CONFIG_ARPD
-static void neigh_app_notify(struct neighbour *n);
-#endif
+static void __neigh_notify(struct neighbour *n, int type, int flags);
+static void neigh_update_notify(struct neighbour *neigh);
 static int pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev);
 void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev);
 
@@ -109,6 +108,7 @@ static void neigh_cleanup_and_release(st
 	if (neigh->parms->neigh_cleanup)
 		neigh->parms->neigh_cleanup(neigh);
 
+   __neigh_notify(neigh, RTM_DELNEIGH, 0);
neigh_release(neigh);
 }
 
@@ -829,13 +829,10 @@ static void neigh_timer_handler(unsigned
 out:
 	write_unlock(&neigh->lock);
}
+
if (notify)
-   call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
+   neigh_update_notify(neigh);
 
-#ifdef CONFIG_ARPD
-	if (notify && neigh->parms->app_probes)
-   neigh_app_notify(neigh);
-#endif
neigh_release(neigh);
 }
 
@@ -1064,11 +1061,8 @@ out:
 	write_unlock_bh(&neigh->lock);
 
if (notify)
-   call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
-#ifdef CONFIG_ARPD
-	if (notify && neigh->parms->app_probes)
-   neigh_app_notify(neigh);
-#endif
+   neigh_update_notify(neigh);
+
return err;
 }
 
@@ -2001,6 +1995,11 @@ nla_put_failure:
return -EMSGSIZE;
 }
 
+static void neigh_update_notify(struct neighbour *neigh)
+{
+   call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh);
+   __neigh_notify(neigh, RTM_NEWNEIGH, 0);
+}
 
 static int neigh_dump_table(struct neigh_table *tbl, struct sk_buff *skb,
struct netlink_callback *cb)
@@ -2420,7 +2419,6 @@ static const struct file_operations neig
 
 #endif /* CONFIG_PROC_FS */
 
-#ifdef CONFIG_ARPD
 static inline size_t neigh_nlmsg_size(void)
 {
return NLMSG_ALIGN(sizeof(struct ndmsg))
@@ -2452,16 +2450,11 @@ errout:
rtnl_set_sk_err(RTNLGRP_NEIGH, err);
 }
 
+#ifdef CONFIG_ARPD
 void neigh_app_ns(struct neighbour *n)
 {
__neigh_notify(n, RTM_GETNEIGH, NLM_F_REQUEST);
 }
-
-static void neigh_app_notify(struct neighbour *n)
-{
-   __neigh_notify(n, RTM_NEWNEIGH, 0);
-}
-
 #endif /* CONFIG_ARPD */
 
 #ifdef CONFIG_SYSCTL


[RTNETLINK]: Fix warning if !CONFIG_KMOD

2007-07-22 Thread Thomas Graf
replay label is unused otherwise.

Signed-off-by: Thomas Graf [EMAIL PROTECTED]

Index: net-2.6/net/core/rtnetlink.c
===
--- net-2.6.orig/net/core/rtnetlink.c   2007-07-22 11:41:46.0 +0200
+++ net-2.6/net/core/rtnetlink.c2007-07-22 12:04:27.0 +0200
@@ -952,7 +952,9 @@ static int rtnl_newlink(struct sk_buff *
struct nlattr *linkinfo[IFLA_INFO_MAX+1];
int err;
 
+#ifdef CONFIG_KMOD
 replay:
+#endif
err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
if (err  0)
return err;


Re: tg3 issues

2007-07-22 Thread patric

patric wrote:


Hi,

Think I got something working for me at least, and the fix is quite 
minimal; the only downside that I could see from it was that you might 
get a small delay when bringing up the interface, but that's probably 
better than getting a non-functional interface that reports that it's up.


The fix seems to be quite simple with just a random sleep at the end 
of  tg3_setup_fiber_by_hand():


    tw32_f(MAC_MODE, tp->mac_mode);
   udelay(40);
   }

out:
   udelay( net_random() % 400 );
   return current_link_up;
}

Not sure that this is a good fix or if it might break on other 
systems, but maybe you could have a quick look at that?


Regards,
Patric



Last update on this.

udelay( 100 + net_random() % 300 ) seems to work much better and I have 
not had a single problem getting the link up within 10 seconds of a cold 
or warm-boot, and most often the link comes up directly without any sort 
of delay instead like before when it could hang for 30 seconds before 
getting a link, if you even got a link.


/Patric



Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread jamal
KK,

On Sun, 2007-22-07 at 11:57 +0530, Krishna Kumar2 wrote:

 Batching need not be useful for every hardware. 

My concern is there is no consistency in results. I see improvements on
something which you say you don't. You see improvement in something that
Evgeniy doesn't, etc. 
There are many knobs and we need in the minimal to find out how those
play.

 I don't quite agree with that approach, eg, if the blist is empty and the
 driver tells there is space for one packet, you will add one packet and
 the driver sends it out and the device is stopped (with potentially lot of
 skbs on dev-q). Then no packets are added till the queue is enabled, at
 which time a flood of skbs will be processed increasing latency and holding
 lock for a single longer duration. My approach will mitigate holding lock
 for longer times and instead send skbs to the device as long as we are
 within the limits.

Just as a side note, _I do have this feature_ in the pktgen piece.
In fact, you can tell pktgen what that bound is as opposed to the hard
coding (look at the pktgen batchl parameter). I have not found it to be
useful experimentally; actually, I should say I could not zone in on a
useful value by experimenting and it was better to turn it off.
I never tried adding it to the qdisc path - but this is something i
could try and as i said it may prove useful.

 Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't find
 it),
 I feel having prep will not help as no other cpu can execute the queue/xmit
 code anyway (E1000 is also a LLTX driver).

My experiments show it is useful (in a very visible way using pktgen)
for e1000 to have the prep() interface.

  Other drivers that hold the tx lock could get an improvement, however.

So you do see the value then with non-LLTX drivers, right? ;-> 
The value is also there in LLTX drivers, even if just in formatting a skb
ready for transmit. If this is not clear I could do a much longer
writeup on my thought evolution towards adding prep().

 I wonder if you tried enabling/disabling 'prep' on E1000 to see how the
 performance is affected. 

Absolutely. And regardless of whether its beneficial or not for e1000,
theres clear benefit in the tg3 for example.

 If it helps, I guess you could send me a patch to
 add that and I can also test it to see what the effect is. I didn't add it
 since IPoIB wouldn't be able to exploit it (unless someone is kind enough
 to show me how to).

Such core code should not just be focussed on IPOIB.

 I think the code I have is ready and stable, 

I am not sure how to interpret that - are you saying all-is-good and we
should just push your code in? It sounds disingenuous but I may have
misread you. 

cheers,
jamal



Re: checksum of lsrr

2007-07-22 Thread Ken-ichirou MATSUZAWA
 Hello, 

From: Ken-ichirou MATSUZAWA [EMAIL PROTECTED]
Subject: checksum of lsrr
Date: Wed, 04 Jul 2007 22:28:24 +0900 (JST)

 if we change destination address, we need recompute not only ip
 checksum but also tcp's, don't we? 

I was ashamed of myself for being so confused. My simple story is:

rt1 --- rt2 --- mybox

rt1 is far from rt2, and when I changed the bandwidth between rt1 and
rt2, I wanted to confirm whether it was changed correctly. An idea that
occurred to me was using lsrr. I tried it with nc, but it did not work.
To accept packets from rt2 to mybox whose src and dst addresses are
mybox, I changed:

--- net/ipv4/fib_frontend.c.orig2007-07-23 00:01:17.0 +0900
+++ net/ipv4/fib_frontend.c 2007-07-23 00:03:11.0 +0900
@@ -206,7 +206,7 @@
 
if (fib_lookup(fl, res))
goto last_resort;
-   if (res.type != RTN_UNICAST)
+   if (res.type != RTN_UNICAST && likely(res.type != RTN_LOCAL))
goto e_inval_res;
*spec_dst = FIB_RES_PREFSRC(res);
fib_combine_itag(itag, res);

Could you tell me whether this is acceptable? I'm afraid of security issues.
thanks,


Re: Question: how to detect if a qdisc is root or not?

2007-07-22 Thread Patrick McHardy
Waskiewicz Jr, Peter P wrote:
I don't think I understand. What's the problem with setting 
sch->parent on initialization instead of on grafting as I did in 
my example patch?
Please explain the problems arising on unload in detail.
 
 
 sch->parent is getting set on initialization, and for the root and
 ingress qdiscs, it's left at zero.  If I change that value, when the
 root qdisc is unloaded and pfifo_fast is put back into place, the
 qdisc_destroy() walks the tree and attempts to free memory from the
 handle pointed at by sch->parent.


First of all, qdisc destruction never propagates up, only down
the tree. Secondly, neither qdisc_destroy nor pfifo nor prio
even looks at sch->parent. So this is completely wrong; the only
place where sch->parent is used for walking through the tree
is qdisc_tree_decrease_qlen.

 It stops when sch->parent is NULL,

Where are you getting this? sch->parent is an *integer*.

 so
 sch->parent is actually being set as intended.  The only thing that
 confused me is that nowhere in the qdisc is TC_H_ROOT included
 explicitly, rather, the root qdisc is where sch->parent is NULL.
 
 So I misunderstood what was actually wrong.  The qdisc code is ok as-is,
 it's just that the top-level qdiscs (root and ingress) have a sch->parent
 of NULL, which is being set correctly today.
 
 Hope that clarifies.

Not at all :)


Re: [PATCH] NET: Fix sch_prio to detect the root qdisc loading

2007-07-22 Thread Patrick McHardy
PJ Waskiewicz wrote:
 The sch->parent handle will be NULL for the scheduler that is TC_H_ROOT.
 Change this check in prio_tune() so that only the root qdisc can be
 multiqueue-enabled.
 
 Signed-off-by: Peter P Waskiewicz Jr [EMAIL PROTECTED]
 ---
 
  net/sched/sch_prio.c |6 --
  1 files changed, 4 insertions(+), 2 deletions(-)
 
 diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
 index 2d8c084..271051e 100644
 --- a/net/sched/sch_prio.c
 +++ b/net/sched/sch_prio.c
 @@ -239,11 +239,13 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
   /* If we're multiqueue, make sure the number of incoming bands
    * matches the number of queues on the device we're associating with.
    * If the number of bands requested is zero, then set q->bands to
 - * dev->egress_subqueue_count.
 + * dev->egress_subqueue_count.  Also, the root qdisc must be the
 + * only one that is enabled for multiqueue, since it's the only one
 + * that interacts with the underlying device.
    */
   q->mq = RTA_GET_FLAG(tb[TCA_PRIO_MQ - 1]);
   if (q->mq) {
 - if (sch->handle != TC_H_ROOT)
 + if (sch->parent != NULL)


As explained in my last mail, sch->parent is an integer. And it is
set when grafting the qdisc, not on initialization, so it is always
0 here when coming from prio_init.

This untested patch should make sure it's always set correctly.
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 13c09bc..0975547 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -380,6 +380,10 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n)
                return;
        while ((parentid = sch->parent)) {
                sch = qdisc_lookup(sch->dev, TC_H_MAJ(parentid));
+               if (sch == NULL) {
+                       WARN_ON(parentid != TC_H_ROOT);
+                       return;
+               }
                cops = sch->ops->cl_ops;
                if (cops->qlen_notify) {
                        cl = cops->get(sch, parentid);
@@ -420,8 +424,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                unsigned long cl = cops->get(parent, classid);
                if (cl) {
                        err = cops->graft(parent, cl, new, old);
-                       if (new)
-                               new->parent = classid;
                        cops->put(parent, cl);
                }
        }
}
@@ -436,7 +438,8 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
  */
 
 static struct Qdisc *
-qdisc_create(struct net_device *dev, u32 handle, struct rtattr **tca, int *errp)
+qdisc_create(struct net_device *dev, u32 parent, u32 handle,
+            struct rtattr **tca, int *errp)
 {
        int err;
        struct rtattr *kind = tca[TCA_KIND-1];
@@ -482,6 +485,8 @@ qdisc_create(struct net_device *dev, u32 handle, struct rtattr **tca, int *errp)
                goto err_out2;
        }
 
+       new->parent = parent;
+
        if (handle == TC_H_INGRESS) {
                sch->flags |= TCQ_F_INGRESS;
                sch->stats_lock = &dev->ingress_lock;
@@ -758,9 +763,11 @@ create_n_graft:
        if (!(n->nlmsg_flags & NLM_F_CREATE))
                return -ENOENT;
        if (clid == TC_H_INGRESS)
-               q = qdisc_create(dev, tcm->tcm_parent, tca, &err);
+               q = qdisc_create(dev, tcm->tcm_parent, tcm->tcm_parent,
+                                tca, &err);
        else
-               q = qdisc_create(dev, tcm->tcm_handle, tca, &err);
+               q = qdisc_create(dev, tcm->tcm_parent, tcm->tcm_handle,
+                                tca, &err);
if (q == NULL) {
if (err == -EAGAIN)
goto replay;


[PATCH] Fix irlan_eth.c - unused variable 'in_dev' warning

2007-07-22 Thread Gabriel C
Hi,

I got this warning on current git:

...

net/irda/irlan/irlan_eth.c: In function 'irlan_eth_send_gratuitous_arp':
net/irda/irlan/irlan_eth.c:301: warning: unused variable 'in_dev'

...

This is because CONFIG_INET is not set in my config.


Signed-off-by: Gabriel Craciunescu [EMAIL PROTECTED]

---

diff --git a/net/irda/irlan/irlan_eth.c b/net/irda/irlan/irlan_eth.c
index c421521..ea0b21a 100644
--- a/net/irda/irlan/irlan_eth.c
+++ b/net/irda/irlan/irlan_eth.c
@@ -298,6 +298,7 @@ void irlan_eth_flow_indication(void *instance, void *sap, 
LOCAL_FLOW flow)
  */
 void irlan_eth_send_gratuitous_arp(struct net_device *dev)
 {
+#ifdef CONFIG_INET
struct in_device *in_dev;
 
/*
@@ -305,7 +306,6 @@ void irlan_eth_send_gratuitous_arp(struct net_device *dev)
 * is useful if we have changed access points on the same
 * subnet.
 */
-#ifdef CONFIG_INET
	IRDA_DEBUG(4, "IrLAN: Sending gratuitous ARP\n");
rcu_read_lock();
in_dev = __in_dev_get_rcu(dev);


Re: [ofa-general] Re: [PATCH 05/10] sch_generic.c changes.

2007-07-22 Thread Patrick McHardy
Krishna Kumar2 wrote:
 Patrick McHardy [EMAIL PROTECTED] wrote on 07/20/2007 11:46:36 PM:
 
 The check for tx_queue_len is wrong though;
 it's only a default which can be overridden and some qdiscs don't
 care for it at all.
 
 
 I think it should not matter whether qdiscs use this or not, or even if it
 is modified (unless it is made zero, in which case this breaks). The
 intention behind this check is to make sure that not more than tx_queue_len
 skbs are in all queues put together (q->qdisc + dev->skb_blist), otherwise
 the blist can become too large and break the idea of tx_queue_len. Is that
 a good justification?


It's a good justification, but on second thought the entire idea of
a single queue after the qdisc that is refilled independently of
transmission times etc. makes me worry a bit. By changing the
timing you're effectively changing the qdisc's behaviour, at least
in some cases. SFQ is a good example, but I believe it affects most
work-conserving qdiscs. Think of this situation:

100 packets of flow 1 arrive
50 packets of flow 1 are sent
100 packets for flow 2 arrive
remaining packets are sent

On the wire you'll first see 50 packets of flow 1, then 100 packets
alternating between flow 1 and 2, then 50 packets of flow 2.

With your additional queue all packets of flow 1 are pulled out of
the qdisc immediately and put in the fifo. When the 100 packets of
the second flow arrive they will also get pulled out immediately
and are put in the fifo behind the remaining 50 packets of flow 1.
So what you get on the wire is:

100 packets of flow 1
100 packets of flow 2

So SFQ is without any effect. This is not completely avoidable of
course, but you can and should limit the damage by only pulling
out as many packets as the driver can take and having the driver
stop the queue afterwards.



Re: [PATCH 02/12 -Rev2] Changes to netdevice.h

2007-07-22 Thread Patrick McHardy
Krishna Kumar wrote:
 @@ -472,6 +474,9 @@ struct net_device
  void *priv;  /* pointer to private data  */
   int (*hard_start_xmit) (struct sk_buff *skb,
   struct net_device *dev);
 + int (*hard_start_xmit_batch) (struct net_device
 +   *dev);
 +


Is this function really needed? Can't you just call hard_start_xmit with
a NULL skb and have the driver use dev->blist?

   /* These may be needed for future network-power-down code. */
   unsigned long   trans_start;/* Time (in jiffies) of last Tx 
 */
  
 @@ -582,6 +587,8 @@ struct net_device
  #define  NETDEV_ALIGN32
  #define  NETDEV_ALIGN_CONST  (NETDEV_ALIGN - 1)
  
  +#define BATCHING_ON(dev) ((dev->features & NETIF_F_BATCH_ON) != 0)
  +
   static inline void *netdev_priv(const struct net_device *dev)
   {
    return dev->priv;
 @@ -832,6 +839,8 @@ extern intdev_set_mac_address(struct n
   struct sockaddr *);
  extern int   dev_hard_start_xmit(struct sk_buff *skb,
   struct net_device *dev);
 +extern int   dev_add_skb_to_blist(struct sk_buff *skb,
 +  struct net_device *dev);


Again, function signatures should be introduced in the same patch
that contains the function. Splitting by file doesn't make sense.




Re: [PATCH 06/12 -Rev2] rtnetlink changes.

2007-07-22 Thread Patrick McHardy
Krishna Kumar wrote:
 diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
 --- org/include/linux/if_link.h   2007-07-20 16:33:35.0 +0530
 +++ rev2/include/linux/if_link.h  2007-07-20 16:35:08.0 +0530
 @@ -78,6 +78,8 @@ enum
   IFLA_LINKMODE,
   IFLA_LINKINFO,
  #define IFLA_LINKINFO IFLA_LINKINFO
 + IFLA_TXBTHSKB,  /* Driver support for Batch'd skbs */
 +#define IFLA_TXBTHSKB IFLA_TXBTHSKB


Ughh what a name :) I prefer pronounceable names since they are
much easier to remember and don't need comments explaining
what they mean.

But I actually think offering just an ethtool interface would
be better, at least for now.



Re: NFS mount gives ENETDOWN in -git15

2007-07-22 Thread Trond Myklebust
On Sat, 2007-07-21 at 15:31 +0200, Andi Kleen wrote:
 I tried to mount another nfs mount on a system running with nfsroot.
 But I get
 
 # mount basil:/home /basil/home/
 mount: Network is down
 
 The network is not down of course, the system is happily running with nfs 
 root from that
 server. Userland is older SUSE 10.0

Does Al's patch help in any way?

Cheers
  Trond

---BeginMessage---
Obviously broken on little-endian; fortunately, the option is
not frequently used...

Signed-off-by: Al Viro [EMAIL PROTECTED]
---
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index b34b7a7..b2a851c 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -732,7 +732,7 @@ static int nfs_parse_mount_options(char *raw,
			return 0;
		if (option < 0 || option > 65535)
			return 0;
-		mnt->nfs_server.address.sin_port = htonl(option);
+		mnt->nfs_server.address.sin_port = htons(option);
		break;
	case Opt_rsize:
		if (match_int(args, &mnt->rsize))
---End Message---


netlink messages on 'ip l s down' event

2007-07-22 Thread Milan Kocian
hello,

I noticed that the ipv4 subsystem sends an RTM_NEWLINK message with new
flags on an 'ip l s down' event, but the ipv6 subsystem sends an
RTM_DELLINK message. Is this an inconsistency or a feature? If it is an
inconsistency, which approach will be used in future? (tested kernels
2.6.20, 21, 22)

Thanks for the reply.

regards,

-- 
Milan Kocian [EMAIL PROTECTED]



bug in tcp/ip stack

2007-07-22 Thread john

I tracked down something that appears to be a small bug in the networking
code. The way in which I can reproduce it is a complex one, but it works
100%, so here come the details:

I noticed strange packets on my fw coming from the mail server with the
RST/ACK flags set, coming from a source port with no one listening on it
and no connection attempts made to it from outside.
There are a few messages on forums describing the same problem and
calling them alien ACK/RST packets.

The Postfix mail server gives this behavior if for some reason the
client resets the connection but some packets from the client arrive
after the RST; the server box then responds with RST and then with
RST/ACK (with the wrong source port number).


Here is packet dump

1    0.00      10.0.0.254  10.0.0.68   TCP   5 > smtp [SYN] Seq=0 Len=0
2    0.001036  10.0.0.68   10.0.0.254  TCP   smtp > 5 [SYN, ACK] Seq=0 Ack=1 Win=5840 Len=0 MSS=1460
3    0.001096  10.0.0.254  10.0.0.68   TCP   5 > smtp [ACK] Seq=1 Ack=1 Win=1500 Len=0

4    0.001125  10.0.0.254  10.0.0.68   SMTP  Command: EHLO localhost
5    0.001150  10.0.0.254  10.0.0.68   TCP   5 > smtp [RST] Seq=17 Len=0
6    0.001175  10.0.0.254  10.0.0.68   TCP   5 > smtp [FIN, ACK] Seq=17 Ack=1 Win=1500 Len=0
7    0.001251  10.0.0.68   10.0.0.254  TCP   smtp > 5 [ACK] Seq=1 Ack=17 Win=5840 Len=0
8    0.001284  10.0.0.68   10.0.0.254  TCP   smtp > 5 [RST] Seq=1 Len=0
!!!9 0.218427  10.0.0.68   10.0.0.254  TCP   32768 > 5 [RST, ACK] Seq=0 Ack=0 Win=5840 Len=0


It is not a Postfix bug; it is present in current 2.6.x and 2.4.x
kernel versions but not in the 2.2.x tree. After investigation I found
it was introduced in 2.4.0-test9-pre3 back in the year 2000 and has
survived for 7 years. WOW :)



The whole 2.4.0-test9-pre3 diff is pretty big, but I managed to find
the lines responsible for this.

They are located in include/net/tcp.h, in the function tcp_enter_cwr:

if (sk->prev && !(sk->userlocks & SOCK_BINDPORT_LOCK))
        tcp_put_port(sk);



It is not a big problem, but under some setups the fw's conntrack table
can get filled pretty quickly, because the wrong port number changes
every time.

Can you please check this out?

Evalds


RE: [PATCH] NET: Fix sch_prio to detect the root qdisc loading

2007-07-22 Thread Waskiewicz Jr, Peter P

 As explained in my last mail, sch->parent is an integer. And 
 it is set when grafting the qdisc, not on initialization, so 
 it is always 0 here when coming from prio_init.
 
 This untested patch should make sure it's always set correctly.

Yes, I'm using NULL and 0 interchangeably here, since in the sch_api
code, qdisc_graft(), sch->parent is referenced using NULL and not 0.  I
know it's a u32, and the value is getting set to the proper handle when
the qdisc is not the root qdisc.  When it's the root qdisc, it's left to
be 0.

This patch was tested as well, but looking at the history now, I didn't
set bands correctly, so that's why the multiqueue case failed.  My
mistake, and I'll keep working on this.


Sorry for the extra noise,
-PJ


[PATCH] [IrDA] KS959 USB IrDA dongle support

2007-07-22 Thread Samuel Ortiz
Hi Dave,

Last IrDA patch for 2.6.23-rc1, if it's not too late...

This patch adds support for the KingSun KS-959 USB IrDA dongle.

This dongle does not follow the usb-irda specification, so it needs its own
special driver. First, it uses control URBs for data transfer, instead of
bulk or interrupt transfers; the only interrupt endpoint exposed seems to
be a dummy to prevent the interface from being rejected. Second, it uses
obfuscation and padding at the USB traffic level, for no apparent reason
other than to make reverse engineering harder (full details on obfuscation
in comments at beginning of source). Although it is advertised as a 4 Mbps
FIR dongle, it apparently loses packets at speeds greater than 57600 bps.

On plugin, this dongle reports vendor and device IDs: 0x07d0:0x4959 .

From: Alex Villacís Lasso [EMAIL PROTECTED]

Signed-off-by: Alex Villacís Lasso [EMAIL PROTECTED]
Signed-off-by: Samuel Ortiz [EMAIL PROTECTED]
---
 drivers/net/irda/Kconfig |   14 +
 drivers/net/irda/Makefile|1 +
 drivers/net/irda/ks959-sir.c |  919 ++
 3 files changed, 934 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/irda/ks959-sir.c

diff --git a/drivers/net/irda/Kconfig b/drivers/net/irda/Kconfig
index 2098d0a..35321dc 100644
--- a/drivers/net/irda/Kconfig
+++ b/drivers/net/irda/Kconfig
@@ -164,6 +164,20 @@ config EP7211_DONGLE
 
 
 
+config KS959_DONGLE
+	tristate "KingSun KS-959 IrDA-USB dongle"
+	depends on IRDA && USB && EXPERIMENTAL
+   help
+ Say Y or M here if you want to build support for the KingSun KS-959
+ IrDA-USB bridge device driver.
+
+ This USB bridge does not conform to the IrDA-USB device class
+ specification, and therefore needs its own specific driver.
+ This dongle supports SIR speeds only (9600 through 57600 bps).
+
+ To compile it as a module, choose M here: the module will be called
+ ks959-sir.
+
 comment "Old SIR device drivers"
 
 config IRPORT_SIR
diff --git a/drivers/net/irda/Makefile b/drivers/net/irda/Makefile
index 2808ef5..9212c15 100644
--- a/drivers/net/irda/Makefile
+++ b/drivers/net/irda/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_MA600_DONGLE)+= ma600-sir.o
 obj-$(CONFIG_TOIM3232_DONGLE)  += toim3232-sir.o
 obj-$(CONFIG_EP7211_DONGLE)+= ep7211-sir.o
 obj-$(CONFIG_KINGSUN_DONGLE)   += kingsun-sir.o
+obj-$(CONFIG_KS959_DONGLE) += ks959-sir.o
 
 # The SIR helper module
 sir-dev-objs := sir_dev.o sir_dongle.o
diff --git a/drivers/net/irda/ks959-sir.c b/drivers/net/irda/ks959-sir.c
new file mode 100644
index 000..12e3435
--- /dev/null
+++ b/drivers/net/irda/ks959-sir.c
@@ -0,0 +1,919 @@
+/*
+*
+* Filename:  ks959-sir.c
+* Version:   0.1.2
+* Description:   Irda KingSun KS-959 USB Dongle
+* Status:Experimental
+* Author:Alex Villacís Lasso [EMAIL PROTECTED]
+* with help from Domen Puncer [EMAIL PROTECTED]
+*
+*Based on stir4200, mcs7780, kingsun-sir drivers.
+*
+*This program is free software; you can redistribute it and/or modify
+*it under the terms of the GNU General Public License as published by
+*the Free Software Foundation; either version 2 of the License.
+*
+*This program is distributed in the hope that it will be useful,
+*but WITHOUT ANY WARRANTY; without even the implied warranty of
+*MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+*GNU General Public License for more details.
+*
+*You should have received a copy of the GNU General Public License
+*along with this program; if not, write to the Free Software
+*Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+*
+*/
+
+/*
+ * Following is my most current (2007-07-17) understanding of how the Kingsun
+ * KS-959 dongle is supposed to work. This information was deduced by
+ * reverse-engineering and examining the USB traffic captured with USBSnoopy
+ * from the WinXP driver. Feel free to update here as more of the dongle is
+ * known.
+ *
+ * My most sincere thanks must go to Domen Puncer [EMAIL PROTECTED] for
+ * invaluable help in cracking the obfuscation and padding required for this
+ * dongle.
+ *
+ * General: This dongle exposes one interface with one interrupt IN endpoint.
+ * However, the interrupt endpoint is NOT used at all for this dongle. Instead,
+ * this dongle uses control transfers for everything, including sending and
+ * receiving the IrDA frame data. Apparently the interrupt endpoint is just a
+ * dummy to ensure the dongle has a valid interface to present to the PC. And I
+ * thought the DonShine dongle was weird... In addition, this dongle uses
+ * obfuscation (?!?!?), applied at the USB level, to hide the traffic, both sent
+ * and received, from the dongle. I call it obfuscation because the XOR keying
+ * and 

Re: NFS mount gives ENETDOWN in -git15

2007-07-22 Thread Yinghai Lu

On 7/21/07, Andi Kleen [EMAIL PROTECTED] wrote:


I tried to mount another nfs mount on a system running with nfsroot.
But I get

# mount basil:/home /basil/home/
mount: Network is down

The network is not down of course, the system is happily running with nfs root 
from that
server. Userland is older SUSE 10.0

Excerpt from strace mount:

-Andi



What is your CONFIG_NETDEVICES_MULTIQUEUE in .config?

YH


drivers/net/xen-netfront.c: bogus code

2007-07-22 Thread Adrian Bunk
The Coverity checker spotted the following bogus code
in drivers/net/xen-netfront.c:

--  snip  --

...
static void xennet_alloc_rx_buffers(struct net_device *dev)
{
...
	for (nr_flips = i = 0; ; i++) {
		skb = __skb_dequeue(&np->rx_batch);
		if (skb == NULL)
			break;

		skb->dev = dev;

		id = xennet_rxidx(req_prod + i);

		BUG_ON(np->rx_skbs[id]);
		np->rx_skbs[id] = skb;

		ref = gnttab_claim_grant_reference(&np->gref_rx_head);
		BUG_ON((signed short)ref < 0);
		np->grant_rx_ref[id] = ref;

		pfn = page_to_pfn(skb_shinfo(skb)->frags[0].page);
		vaddr = page_address(skb_shinfo(skb)->frags[0].page);

		req = RING_GET_REQUEST(&np->rx, req_prod + i);
		gnttab_grant_foreign_access_ref(ref,
						np->xbdev->otherend_id,
						pfn_to_mfn(pfn),
						0);

		req->id = id;
		req->gref = ref;
	}

	if (nr_flips != 0) {
...

--  snip  --

Note that nr_flips is always 0 in the last line.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



net/9p/mux.c: use-after-free

2007-07-22 Thread Adrian Bunk
The Coverity checker spotted the following use-after-free
in net/9p/mux.c:

--  snip  --

...
struct p9_conn *p9_conn_create(struct p9_transport *trans, int msize,
unsigned char *extended)
{
...
	if (!m->tagpool) {
		kfree(m);
		return ERR_PTR(PTR_ERR(m->tagpool));
	}
...

--  snip  --


cu
Adrian




net/bluetooth/rfcomm/tty.c: use-after-free

2007-07-22 Thread Adrian Bunk
Commit 8de0a15483b357d0f0b821330ec84d1660cadc4e added the following 
use-after-free in net/bluetooth/rfcomm/tty.c:

--  snip  --

...
static int rfcomm_dev_add(struct rfcomm_dev_req *req, struct rfcomm_dlc *dlc)
{
...
	if (IS_ERR(dev->tty_dev)) {
		list_del(&dev->list);
		kfree(dev);
		return PTR_ERR(dev->tty_dev);
	}
...

--  snip  --

Spotted by the Coverity checker.

cu
Adrian




Re: [PATCH] [IrDA] KS959 USB IrDA dongle support

2007-07-22 Thread David Miller
From: Samuel Ortiz [EMAIL PROTECTED]
Date: Mon, 23 Jul 2007 00:32:17 +0300

 Last IrDA patch for 2.6.23-rc1, if it's not too late...

Too late, Linus just closed the merge window, you had two
weeks to submit this :-)


Re: [PATCH 11/12 -Rev2] IPoIB xmit API addition

2007-07-22 Thread Krishna Kumar2
Hi Michael,

Michael S. Tsirkin [EMAIL PROTECTED] wrote on 07/22/2007 03:11:36
PM:

  +   /*
  +* Handle skbs completion from tx_tail to wr_id. It is possible to
  +* handle WC's from earlier post_sends (possible multiple) in this
  +* iteration as we move from tx_tail to wr_id, since if the last
  +* WR (which is the one which had a completion request) failed to
be
  +* sent for any of those earlier request(s), no completion
  +* notification is generated for successful WR's of those earlier
  +* request(s).
  +*/

 AFAIK a signalled WR will always generate a completion.
 What am I missing?

Yes, a signalled WR will generate a completion. I am trying to catch the
case where, say, I send 64 skbs and set signalling for only the last skb,
with the others set to NO signalling. Now if the driver found the last WR
was bad for some reason, it will synchronously fail the send for that WR
(which happens to be the only one that is signalled). So after the 1 to
63 skbs are finished, there will be no completion called. That was my
understanding of how this works, and I coded it that way so that the next
post will clean up the previous one's completions.

 
  + /*
  +  * Better error handling can be done here, like free
  +  * all untried skbs if err == -ENOMEM. However at this
  +  * time, we re-try all the skbs, all of which will
  +  * likely fail anyway (unless device finished sending
  +  * some out in the meantime). This is not a regression
  +  * since the earlier code is not doing this either.
  +  */

 Are you retrying posting skbs? Why is this a good idea?
 AFAIK, earlier code did not retry posting WRs at all.

Not exactly. If I send 64 skbs to the device and the provider returned a
bad WR at skb #50, then I will have to try skb #51-64 again, since the
provider has not attempted to send those out as it bails out at the first
failure. The provider of course has already sent out skb #1-49 before
returning failure at skb #50. So it is not strictly a retry, just xmit of
the next skbs, which is what the current code also does. I tested this
part out by simulating errors in mthca_post_send and verified that the
next iteration clears up the remaining skbs.

 The comment seems to imply that post send fails as a result of SQ
overflow -

Correct.

 do you see SQ overflow errors in your testing?

No.

 AFAIK, IPoIB should never overflow the SQ.

Correct. It should never happen unless IPoIB has a bug :) I guess the
comment should be removed?

Thanks,

- KK



Re: [PATCH 06/12 -Rev2] rtnetlink changes.

2007-07-22 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:40:37 PM:

 Krishna Kumar wrote:
  diff -ruNp org/include/linux/if_link.h rev2/include/linux/if_link.h
  --- org/include/linux/if_link.h   2007-07-20 16:33:35.0 +0530
  +++ rev2/include/linux/if_link.h   2007-07-20 16:35:08.0 +0530
  @@ -78,6 +78,8 @@ enum
  IFLA_LINKMODE,
  IFLA_LINKINFO,
   #define IFLA_LINKINFO IFLA_LINKINFO
  +   IFLA_TXBTHSKB,  /* Driver support for Batch'd skbs */
  +#define IFLA_TXBTHSKB IFLA_TXBTHSKB


 Ughh what a name :) I prefer pronounceable names since they are
 much easier to remember and don't need comments explaining
 what they mean.

 But I actually think offering just an ethtool interface would
 be better, at least for now.

Great, I will remove /sys and rtnetlink and keep the Ethtool i/f.

Thanks,

- KK



Re: [PATCH 02/12 -Rev2] Changes to netdevice.h

2007-07-22 Thread Krishna Kumar2
Hi Patrick,

Patrick McHardy [EMAIL PROTECTED] wrote on 07/22/2007 10:36:51 PM:

 Krishna Kumar wrote:
  @@ -472,6 +474,9 @@ struct net_device
  void *priv;   /* pointer to private data   */
  int (*hard_start_xmit) (struct sk_buff *skb,
 struct net_device *dev);
  +   int (*hard_start_xmit_batch) (struct net_device
  +   *dev);
  +


 Is this function really needed? Can't you just call hard_start_xmit with
 a NULL skb and have the driver use dev->blist?

Probably not. I will see how to do it this way and get back to you.

  /* These may be needed for future network-power-down code. */
  unsigned long  trans_start;   /* Time (in jiffies) of last Tx
*/
 
  @@ -582,6 +587,8 @@ struct net_device
   #define   NETDEV_ALIGN  32
   #define   NETDEV_ALIGN_CONST   (NETDEV_ALIGN - 1)
 
  +#define BATCHING_ON(dev)   ((dev->features & NETIF_F_BATCH_ON) != 0)
  +
   static inline void *netdev_priv(const struct net_device *dev)
   {
    return dev->priv;
  @@ -832,6 +839,8 @@ extern int  dev_set_mac_address(struct n
  struct sockaddr *);
   extern int  dev_hard_start_xmit(struct sk_buff *skb,
  struct net_device *dev);
  +extern int  dev_add_skb_to_blist(struct sk_buff *skb,
  +struct net_device *dev);


 Again, function signatures should be introduced in the same patch
 that contains the function. Splitting by file doesn't make sense.

Right. I did it for some but missed this. Sorry, will redo.

thanks,

- KK



Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread Krishna Kumar2
Hi Evgeniy,

Evgeniy Polyakov [EMAIL PROTECTED] wrote on 07/20/2007 06:24:23 PM:

 Hi Krishna.

 On Fri, Jul 20, 2007 at 12:01:49PM +0530, Krishna Kumar
([EMAIL PROTECTED]) wrote:
  After fine-tuning qdisc and other changes, I modified IPoIB to use this
  API, and now get good gains. Summary for TCP & No Delay: 1 process
  improves for all cases from 1.4% to 49.5%; 4 process has almost
  identical improvements from -1.7% to 59.1%; 16 process case also
  improves in the range of -1.2% to 33.4%; while 64 process doesn't have
  much improvement (-3.3% to 12.4%). UDP was tested with 1 process
  netperf with a small increase in BW but a big improvement in Service
  Demand. Netperf latency tests show a small drop in transaction rate
  (results in separate attachment).

 What about round-robin tcp time and latency test? In theory such batching
 mode should not change that timings, but practice can show new aspects.

The TCP RR results show a slight impact; however, the service demand
shows good improvement. The results are (I did TCP RR - 1 process, with
1, 8, 32, 128, 512 buffer sizes; and UDP RR - 1 process, 1 byte buffer
size):

Results for TCP RR (1 process) ORG code:
Size    R-R         CPU%    S.Demand
------------------------------------
1       521346.02   5.48    1346.145
8       129463.14   6.74    418.370
32      128899.73   7.51    467.106
128     127230.15   5.42    340.876
512     119605.68   6.48    435.650


Results for TCP RR (1 process) NEW code (and change%):
Size    R-R                  CPU%    S.Demand
---------------------------------------------
1       516596.62 (-0.91%)   5.74    1423.819 (5.77%)
8       129184.46 (-0.22%)   5.43    336.747 (-19.51%)
32      128238.35 (-0.51%)   5.43    339.213 (-27.38%)
128     126545.79 (-0.54%)   5.36    339.188 (-0.50%)
512     119297.49 (-0.26%)   5.16    346.185 (-20.54%)


Results for UDP RR (1 process) ORG & NEW code:
Code   Size  R-R                 CPU%  S.Demand
-----------------------------------------------
ORG    1     539327.86           5.68  1348.985
NEW    1     540669.33 (0.25%)   6.05  1434.180 (6.32%)


 I will review code later this week (likely tomorrow) and if there will
 be some issues return back.

Thanks! I had just submitted Rev2 on Sunday, please let me know what you
find.

Regards,

- KK



[PATCH 0/4] Age Entry For IPv4 IPv6 Route Table

2007-07-22 Thread Varun Chandramohan
Hi,

According to RFC 4292 (IP Forwarding Table MIB) there is a need for an
age entry for all the routes in the routing table. The entry in the RFC
is inetCidrRouteAge and its OID is inetCidrRouteAge.1.10. Many SNMP
applications require this age entry, so I am adding the age field to the
routing table for IPv4 and IPv6 and providing an interface to this value
via netlink.

Signed-off-by: Varun Chandramohan [EMAIL PROTECTED]
---



[PATCH 1/4] New attribute RTA_AGE

2007-07-22 Thread Varun Chandramohan
A new attribute RTA_AGE is added so that the age value can be exported to
userspace via netlink.

Signed-off-by: Varun Chandramohan [EMAIL PROTECTED]
---
 include/linux/rtnetlink.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 6127858..884f507 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -263,6 +263,7 @@ enum rtattr_type_t
RTA_SESSION,
RTA_MP_ALGO, /* no longer used */
RTA_TABLE,
+   RTA_AGE,
__RTA_MAX
 };
 
-- 
1.4.3.4



[PATCH 2/4] Add new timeval_to_sec function

2007-07-22 Thread Varun Chandramohan
A new function for converting a timeval to time_t is added in time.h. It
is a common conversion used in several places.

Signed-off-by: Varun Chandramohan [EMAIL PROTECTED]
---
 include/linux/time.h |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/time.h b/include/linux/time.h
index dda9be6..c81baa6 100644
--- a/include/linux/time.h
+++ b/include/linux/time.h
@@ -147,6 +147,17 @@ static inline s64 timeval_to_ns(const st
 }
 
 /**
+ * timeval_to_sec - Convert timeval to seconds
+ * @tv: pointer to the timeval variable to be converted
+ *
+ * Returns the seconds representation of timeval parameter.
+ */
+static inline time_t timeval_to_sec(const struct timeval *tv)
+{
+   return (tv->tv_sec + (tv->tv_usec + 500000)/1000000);
+}
+
+/**
  * ns_to_timespec - Convert nanoseconds to timespec
  * @nsec:  the nanoseconds value to be converted
  *
-- 
1.4.3.4



[PATCH 3/4] Initialize and populate age field

2007-07-22 Thread Varun Chandramohan
The age field is filled with the current time at the time of creation of
the route. When the routes are dumped, the age value stored in the route
structure is subtracted from the current time value and the difference is
the age expressed in seconds.

Signed-off-by: Varun Chandramohan [EMAIL PROTECTED]
---
 net/ipv4/fib_hash.c  |3 +++
 net/ipv4/fib_lookup.h|3 ++-
 net/ipv4/fib_semantics.c |   16 +---
 net/ipv4/fib_trie.c  |1 +
 4 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 07e843a..faa7364 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -448,6 +448,7 @@ static int fn_hash_insert(struct fib_tab
	fa->fa_info = fi;
	fa->fa_type = cfg->fc_type;
	fa->fa_scope = cfg->fc_scope;
+	fa->fa_age = 0;
	state = fa->fa_state;
	fa->fa_state &= ~FA_S_ACCESSED;
fib_hash_genid++;
@@ -507,6 +508,7 @@ static int fn_hash_insert(struct fib_tab
	new_fa->fa_type = cfg->fc_type;
	new_fa->fa_scope = cfg->fc_scope;
	new_fa->fa_state = 0;
+	new_fa->fa_age = 0;
 
/*
 * Insert new entry to the list.
@@ -697,6 +699,7 @@ fn_hash_dump_bucket(struct sk_buff *skb,
			  f->fn_key,
			  fz->fz_order,
			  fa->fa_tos,
+			  fa->fa_age,
			  fa->fa_info,
			  NLM_F_MULTI) < 0) {
			cb->args[4] = i;
diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index eef9eec..c9145b5 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -13,6 +13,7 @@ struct fib_alias {
u8  fa_type;
u8  fa_scope;
u8  fa_state;
+   time_t  fa_age;
 };
 
 #define FA_S_ACCESSED  0x01
@@ -27,7 +28,7 @@ extern struct fib_info *fib_create_info(
 extern int fib_nh_match(struct fib_config *cfg, struct fib_info *fi);
 extern int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
 u32 tb_id, u8 type, u8 scope, __be32 dst,
-int dst_len, u8 tos, struct fib_info *fi,
+int dst_len, u8 tos, time_t *age, struct fib_info *fi,
 unsigned int);
 extern void rtmsg_fib(int event, __be32 key, struct fib_alias *fa,
  int dst_len, u32 tb_id, struct nl_info *info,
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c434119..1822d92 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -278,7 +278,8 @@ static inline size_t fib_nlmsg_size(stru
 + nla_total_size(4) /* RTA_TABLE */
 + nla_total_size(4) /* RTA_DST */
 + nla_total_size(4) /* RTA_PRIORITY */
-+ nla_total_size(4); /* RTA_PREFSRC */
++ nla_total_size(4) /* RTA_PREFSRC */
++ nla_total_size(4); /* RTA_AGE */
 
/* space for nested metrics */
payload += nla_total_size((RTAX_MAX * nla_total_size(4)));
@@ -313,7 +314,7 @@ void rtmsg_fib(int event, __be32 key, st
 
	err = fib_dump_info(skb, info->pid, seq, event, tb_id,
			    fa->fa_type, fa->fa_scope, key, dst_len,
-			    fa->fa_tos, fa->fa_info, nlm_flags);
+			    fa->fa_tos, &fa->fa_age, fa->fa_info, nlm_flags);
	if (err < 0) {
/* -EMSGSIZE implies BUG in fib_nlmsg_size() */
WARN_ON(err == -EMSGSIZE);
@@ -940,11 +941,12 @@ __be32 __fib_res_prefsrc(struct fib_resu
 }
 
 int fib_dump_info(struct sk_buff *skb, u32 pid, u32 seq, int event,
- u32 tb_id, u8 type, u8 scope, __be32 dst, int dst_len, u8 tos,
+		  u32 tb_id, u8 type, u8 scope, __be32 dst, int dst_len, u8 tos, time_t *age,
  struct fib_info *fi, unsigned int flags)
 {
struct nlmsghdr *nlh;
struct rtmsg *rtm;
+   struct timeval tv;
 
nlh = nlmsg_put(skb, pid, seq, event, sizeof(*rtm), flags);
if (nlh == NULL)
@@ -985,6 +987,14 @@ int fib_dump_info(struct sk_buff *skb, u
		NLA_PUT_U32(skb, RTA_FLOW, fi->fib_nh[0].nh_tclassid);
 #endif
}
+
+	do_gettimeofday(&tv);
+	if (!*age) {
+		*age = timeval_to_sec(&tv);
+		NLA_PUT_U32(skb, RTA_AGE, *age);
+	} else {
+		NLA_PUT_U32(skb, RTA_AGE, timeval_to_sec(&tv) - *age);
+	}
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
	if (fi->fib_nhs > 1) {
struct rtnexthop *rtnh;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 30e332a..be2d1d6 

[PATCH 4/4] Initialize and fill IPv6 route age

2007-07-22 Thread Varun Chandramohan
The age fields of the IPv6 route structures are initialized with the
current timeval at the time of route creation. When the route dump is
called, the age value stored in the structure is subtracted from the
present timeval and the difference is passed on as the route age.

Signed-off-by: Varun Chandramohan [EMAIL PROTECTED]
---
 include/net/ip6_fib.h   |1 +
 include/net/ip6_route.h |3 +++
 net/ipv6/addrconf.c |5 +
 net/ipv6/route.c|   23 +++
 4 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index c48ea87..e30a1cf 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -98,6 +98,7 @@ struct rt6_info

u32 rt6i_flags;
u32 rt6i_metric;
+   time_t  rt6i_age;
atomic_trt6i_ref;
struct fib6_table   *rt6i_table;
 
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 5456fdd..fc9716c 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -36,6 +36,9 @@ struct route_info {
 #define RT6_LOOKUP_F_REACHABLE 0x2
 #define RT6_LOOKUP_F_HAS_SADDR 0x4
 
+#define RT6_SET_ROUTE_INFO 0x0
+#define RT6_GET_ROUTE_INFO 0x1
+
 extern struct rt6_info ip6_null_entry;
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 5a5f8bd..715c766 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4187,6 +4187,7 @@ EXPORT_SYMBOL(unregister_inet6addr_notif
 
 int __init addrconf_init(void)
 {
+   struct timeval tv;
int err = 0;
 
/* The addrconf netdev notifier requires that loopback_dev
@@ -4214,10 +4215,14 @@ int __init addrconf_init(void)
if (err)
return err;
 
+	do_gettimeofday(&tv);
	ip6_null_entry.rt6i_idev = in6_dev_get(&loopback_dev);
+	ip6_null_entry.rt6i_age = timeval_to_sec(&tv);
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
	ip6_prohibit_entry.rt6i_idev = in6_dev_get(&loopback_dev);
+	ip6_prohibit_entry.rt6i_age = timeval_to_sec(&tv);
	ip6_blk_hole_entry.rt6i_idev = in6_dev_get(&loopback_dev);
+	ip6_blk_hole_entry.rt6i_age = timeval_to_sec(&tv);
 #endif
 
	register_netdevice_notifier(&ipv6_dev_notf);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index fe8d983..9386c05 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -600,7 +600,14 @@ static int __ip6_ins_rt(struct rt6_info
 {
int err;
struct fib6_table *table;
+   struct timeval tv;
 
+	do_gettimeofday(&tv);
+	/* Update the timeval for new routes.
+	 * We add it here to make it common irrespective
+	 * of how the new route is added.
+	 */
+	rt->rt6i_age = timeval_to_sec(&tv);
	table = rt->rt6i_table;
	write_lock_bh(&table->tb6_lock);
	err = fib6_add(&table->tb6_root, rt, info);
@@ -2111,6 +2118,7 @@ static inline size_t rt6_nlmsg_size(void
   + nla_total_size(4) /* RTA_IIF */
   + nla_total_size(4) /* RTA_OIF */
   + nla_total_size(4) /* RTA_PRIORITY */
+  + nla_total_size(4) /* RTA_AGE */
   + RTAX_MAX * nla_total_size(4) /* RTA_METRICS */
   + nla_total_size(sizeof(struct rta_cacheinfo));
 }
@@ -2118,10 +2126,11 @@ static inline size_t rt6_nlmsg_size(void
 static int rt6_fill_node(struct sk_buff *skb, struct rt6_info *rt,
 struct in6_addr *dst, struct in6_addr *src,
 int iif, int type, u32 pid, u32 seq,
-int prefix, unsigned int flags)
+int prefix, unsigned int flags, int dumpflg)
 {
struct rtmsg *rtm;
struct nlmsghdr *nlh;
+   struct timeval tv;
long expires;
u32 table;
 
@@ -2185,6 +2194,12 @@ static int rt6_fill_node(struct sk_buff
		if (ipv6_get_saddr(&rt->u.dst, dst, &saddr_buf) == 0)
			NLA_PUT(skb, RTA_PREFSRC, 16, &saddr_buf);
}
+   
+	do_gettimeofday(&tv);
+	if (dumpflg)
+		NLA_PUT_U32(skb, RTA_AGE, timeval_to_sec(&tv) - rt->rt6i_age);
+	else
+		NLA_PUT_U32(skb, RTA_AGE, rt->rt6i_age);
 
	if (rtnetlink_put_metrics(skb, rt->u.dst.metrics) < 0)
goto nla_put_failure;
@@ -,7 +2237,7 @@ int rt6_dump_route(struct rt6_info *rt,
 
	return rt6_fill_node(arg->skb, rt, NULL, NULL, 0, RTM_NEWROUTE,
		     NETLINK_CB(arg->cb->skb).pid, arg->cb->nlh->nlmsg_seq,
-		     prefix, NLM_F_MULTI);
+		     prefix, NLM_F_MULTI, RT6_GET_ROUTE_INFO);
 }
 
 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr* nlh, void *arg)
@@ -2287,7 +2302,7 @@ static int inet6_rtm_getroute(struct sk_
 
	err = rt6_fill_node(skb, rt, &fl.fl6_dst, &fl.fl6_src, iif,
RTM_NEWROUTE, 

Re: [PATCH 00/10] Implement batching skb API

2007-07-22 Thread Krishna Kumar2
Hi Jamal,

J Hadi Salim [EMAIL PROTECTED] wrote on 07/22/2007 06:21:09 PM:

 My concern is there is no consistency in results. I see improvements on
 something which you say dont. You see improvement in something that
 Evgeniy doesnt etc.

Hmmm? Evgeniy has not even tested my code to find some regression :) And
you may possibly not find much improvement in E1000 when you run iperf
(which is what I do) compared to pktgen. I can re-run and confirm this
since my last E1000 run was quite some time back.

My point is that batching not being viable for E1000 (or tg3) need not be
the sole criterion for inclusion. If IPoIB or other drivers can take
advantage of it and get better results, then batching can be considered.
Maybe E1000 too can get improvements if someone with more expertise tries
to add this API (not judging your driver writing capabilities - just
stating that driver writers will know more knobs to exploit a complex
device like E1000).

  Since E1000 doesn't seem to use the TX lock on RX (at least I couldn't
  find it), I feel having prep will not help as no other cpu can execute
  the queue/xmit code anyway (E1000 is also a LLTX driver).

 My experiments show it is useful (in a very visible way using pktgen)
 for e1000 to have the prep() interface.

I meant: have you compared results of batching with prep on vs prep off,
and what is the difference in BW?

   Other driver that hold tx lock could get improvement however.

 So you do see the value then with non-LLTX drivers, right? ;->

No. I see value only in non-LLTX drivers which also get the same TX lock
in the RX path. If TX and RX take different locks, then since you are
holding queue_lock before calling 'prep', this excludes other TX from
running at the same time. In that case, moving the tx_lock acquisition
earlier to do the 'prep' will not cause any degradation (since no other
tx can run anyway, while rx can run as it gets a different lock).

 The value is also there in LLTX drivers even if in just formating a skb
 ready for transmit. If this is not clear i could do a much longer
 writeup on my thought evolution towards adding prep().

In LLTX drivers, the driver does the 'prep' without holding the tx_lock
in any case, so there should be no improvement. Could you send the
write-up, since I really don't see the value in prep unless the driver is
non-LLTX *and* TX/RX hold the same TX lock. I think that is the sole
criterion, right?

  If it helps, I guess you could send me a patch to add that and I can
  also test it to see what the effect is. I didn't add it since IPoIB
  wouldn't be able to exploit it (unless someone is kind enough to show
  me how to).

 Such core code should not just be focussed on IPOIB.

There is *nothing* IPoIB-specific about my code. I said adding prep
doesn't work for IPoIB and so it is pointless to add bloat to the code
until some code can actually take advantage of this feature (I am sure
you will agree). Which is why I also mentioned to please send me a patch
if you find it useful for any driver, rather than rejecting this idea.

  I think the code I have is ready and stable,

 I am not sure how to intepret that - are you saying all-is-good and we
 should just push your code in?

I am only too well aware that Dave will not accept any code (having
experienced this with Mobile IPv6 a long time back, when he said to move
most of it to userspace and he was absolutely correct :). What I meant to
say is that there isn't much point in saying that your code is not ready,
or uses an old code base, or has multiple restart functions, or is not
tested enough, etc, and then saying let's re-do/rethink the whole
implementation when my code is already working and giving good results.
Unless you have some design issues with it, or the code is written badly,
is not maintainable, not linux style compliant, is buggy, or will not
handle some case/workload - those types of issues.

OTOH, if you find some cases that are better handled with:
  1. prep handler
  2. xmit_win (which I don't have now),
then please send me patches and I will also test out and incorporate.

 It sounds disingenuous but i may have misread you.

(lacking in frankness, candor, or sincerity; falsely or hypocritically
ingenuous; insincere)  Sorry, no response to personal comments - I don't
want to have a flame-war :)

Thanks,

- KK



Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-22 Thread Marcin Ślusarz

Ok, I've bisected this problem and found that this patch broke my NIC:

76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
commit 76d2160147f43f982dfe881404cfde9fd0a9da21
Author: Ingo Molnar [EMAIL PROTECTED]
Date:   Fri Feb 16 01:28:24 2007 -0800

   [PATCH] genirq: do not mask interrupts by default

   Never mask interrupts immediately upon request.  Disabling interrupts in
   high-performance codepaths is rare, and on the other hand this change
   could recover lost edges (or even other types of lost interrupts) by
   conservatively only masking interrupts after they happen.  (NOTE: with
   this change the highlevel irq-disable code still soft-disables this IRQ
   line - and if such an interrupt happens then the IRQ flow handler keeps
   the IRQ masked.)

   Mark i8259A controllers as 'never loses an edge'.

   Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
   Cc: Thomas Gleixner [EMAIL PROTECTED]
   Signed-off-by: Andrew Morton [EMAIL PROTECTED]
   Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=76d2160147f43f982dfe881404cfde9fd0a9da21

After reverting it on top of 2.6.21.3 (with
d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2 - [PATCH] genirq: remove
IRQ_DISABLED (which meant removing IRQ_DELAYED_DISABLE)), the problem
didn't show up :)
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2)

So I cooked patch like below and everything is working fine (so far)

Fix the default_disable interrupt function (broken by [PATCH] genirq: do
not mask interrupts by default) - revert removal of the codepath which
was invoked when the removed flag (IRQ_DELAYED_DISABLE) was NOT set.

Signed-off-by: Marcin Slusarz [EMAIL PROTECTED]
---
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 76a9106..0bb23cd 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
 */
static void default_disable(unsigned int irq)
{
+   struct irq_desc *desc = irq_desc + irq;
+   desc->chip->mask(irq);
}

/*

(Sorry for whitespace damage, but I have to send it from webmail :|)
(I'm a kernel noob, so don't kill me if my patch is wrong ;)
ps: Here is the beginning of this thread: http://lkml.org/lkml/2007/6/16/182


Regards,
Marcin Slusarz