Re: [RFC] net: napi fix

2007-12-13 Thread Andrew Gallatin

Joonwoo Park wrote:
 2007/12/13, Kok, Auke [EMAIL PROTECTED]:
 David Miller wrote:
 From: Andrew Gallatin [EMAIL PROTECTED]
 Date: Wed, 12 Dec 2007 12:29:23 -0500

 Is the netif_running() check even required?
 No, it is not.

 When a device is brought down, one of the first things
 that happens is that we wait for all pending NAPI polls
 to complete, then block any new polls from starting.
 I think this was previously (pre-2.6.24) not the case, which is why 
e1000 et al

 has this check as well and that's exactly what is causing most of the
 net_rx_action oopses in the first place. Without the netif_running() 
check
 previously the drivers were just unusable with NAPI and prone to 
many races with
 down (i.e. touching some ethtool ioctl which wants to do a reset 
while routing
 small packets at high numbers). that's why we added the 
netif_running() check in

 the first place :)

 There might be more drivers lurking that need this change...

 Auke


 Also in my case, without netif_running() check, I cannot do ifconfig 
down.

 It stucked if packet generator was sending packets.

If the netif_running() check is indeed required to make a device break
out of napi polling and respond to an ifconfig down, then I think the
netif_running() check should be moved up into net_rx_action() to avoid
potential for driver complexity and bugs like the ones you found.

Drew
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] net: napi fix

2007-12-13 Thread Andrew Gallatin

Stephen Hemminger wrote:

On Thu, 13 Dec 2007 06:19:38 -0800 (PST)
David Miller [EMAIL PROTECTED] wrote:


From: Andrew Gallatin [EMAIL PROTECTED]
Date: Thu, 13 Dec 2007 09:13:54 -0500


If the netif_running() check is indeed required to make a device break
out of napi polling and respond to an ifconfig down, then I think the
netif_running() check should be moved up into net_rx_action() to avoid
potential for driver complexity and bugs like the ones you found.

That, or something like it, definitely sounds reasonable and much
better than putting the check into every driver :-)
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


It is not possible to do netif_running() check in generic code as currently
written because of the case of devices where a single NAPI object is
being used to handle two devices. The association between napi and netdevice
is M to N.  There are cases like niu that have multiple NAPI's and one
netdevice; and devices like sky2 that can have one NAPI and 2 netdevice's.


Ah, now I see.  I forgot that not every device has a 1:1::napi:netdev
relationship.

Could we make an optional *dev_state field in the napi structure.
It would be initialized to __LINK_STATE_START.  Devices which have
a 1:1 NAPI:netdevice relationship would set it to netdev-state.
The generic code would then do a test_bit(__LINK_STATE_START, 
napi-dev_state), and 1:1 drivers could remove this check.

M:N drivers would pay for a useless (to them) test_bit, and would
have to provide their own netif_running check to get termination
under heavy load.

Just an idea, perhaps there is a better way which is less hacky.

Or perhaps we should just leave things as is.

Drew
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] net: napi fix

2007-12-12 Thread Andrew Gallatin

David Miller wrote:

From: Andrew Gallatin [EMAIL PROTECTED]
Date: Wed, 12 Dec 2007 12:29:23 -0500


Is the netif_running() check even required?


No, it is not.

When a device is brought down, one of the first things
that happens is that we wait for all pending NAPI polls
to complete, then block any new polls from starting.


Great, thanks.  I will submit a patch to remove the bogus
check.  This should fix myri10ge properly.


Thank you,

Drew
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] net: napi fix

2007-12-12 Thread Andrew Gallatin

[I apologize for loosing threading, I'm replying from the archives]

 The problem is that the driver is doing a NAPI completion and
 re-enabling chip interrupts with work_done == weight, and that is
 illegal.

The only time at least myri10ge will do this is due to
the !netif_running(netdev) check.   Eg, from myri10ge's poll:

work_done = myri10ge_clean_rx_done(mgp, budget);

if (work_done  budget || !netif_running(netdev)) {
netif_rx_complete(netdev, napi);
put_be32(htonl(3), mgp-irq_claim);
}

Is the netif_running() check even required? Is this just
a bad way to solve a race with running NAPI at down() time
that would be better solved by putting a napi_synchronize()
in the driver's down() routine?

I'd rather fix this right than add another check to a
questionable code path.

Thanks,

Drew
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH][NETDEV]: remove netif_running() check from myri10ge_poll()

2007-12-12 Thread Andrew Gallatin

Remove the bogus netif_running() check from myri10ge_poll().

This eliminates any chance that myri10ge_poll() can trigger
an oops by calling netif_rx_complete() and returning
with work_done == budget.

Signed-off-by: Andrew Gallatin [EMAIL PROTECTED]
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 8def865..c90958f 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1239,7 +1239,7 @@ static int myri10ge_poll(struct napi_str
/* process as many rx events as NAPI will allow */
work_done = myri10ge_clean_rx_done(mgp, budget);
 
-   if (work_done  budget || !netif_running(netdev)) {
+   if (work_done  budget) {
netif_rx_complete(netdev, napi);
put_be32(htonl(3), mgp-irq_claim);
}


[PATCH (resubmit)]: fix lro_gen_skb() alignment

2007-12-04 Thread Andrew Gallatin


Add a field to the lro_mgr struct so that drivers can specify how much
padding is required to align layer 3 headers when a packet is copied
into a freshly allocated skb by inet_lro.c:lro_gen_skb().  Without
padding, skbs generated by LRO will cause alignment warnings on
architectures which require strict alignment (seen on sparc64).

Myri10GE is updated to use this field.

Signed off by: Andrew Gallatin [EMAIL PROTECTED]

diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 0f306dd..8def865 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1979,6 +1979,7 @@ static int myri10ge_open(struct net_devi
lro_mgr-lro_arr = mgp-rx_done.lro_desc;
lro_mgr-get_frag_header = myri10ge_get_frag_header;
lro_mgr-max_aggr = myri10ge_lro_max_pkts;
+   lro_mgr-frag_align_pad = 2;
if (lro_mgr-max_aggr  MAX_SKB_FRAGS)
lro_mgr-max_aggr = MAX_SKB_FRAGS;
 
diff --git a/include/linux/inet_lro.h b/include/linux/inet_lro.h
index 1246d46..80335b7 100644
--- a/include/linux/inet_lro.h
+++ b/include/linux/inet_lro.h
@@ -91,6 +91,9 @@ #define LRO_F_EXTRACT_VLAN_ID 2  /* Set 
int max_desc; /* Max number of LRO descriptors  */
int max_aggr; /* Max number of LRO packets to be aggregated */
 
+   int frag_align_pad; /* Padding required to properly align layer 3
+* headers in generated skb when using frags */
+
struct net_lro_desc *lro_arr; /* Array of LRO descriptors */
 
/*
diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..9a96c27 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc
int data_len = len;
int hdr_len = min(len, hlen);
 
-   skb = netdev_alloc_skb(lro_mgr-dev, hlen);
+   skb = netdev_alloc_skb(lro_mgr-dev, hlen + lro_mgr-frag_align_pad);
if (!skb)
return NULL;
 
+   skb_reserve(skb, lro_mgr-frag_align_pad);
skb-len = len;
skb-data_len = len - hdr_len;
skb-truesize += true_size;


Re: [PATCH]: fix lro_gen_skb() alignment

2007-12-02 Thread Andrew Gallatin

Herbert Xu wrote:

On Fri, Nov 30, 2007 at 02:35:43PM -0500, Andrew Gallatin wrote:

Isn't the value of 2 ethernet-specific (to round the 14-byte header up
to 16)?  Given that the rest of the lro code is fairly careful to
calculate mac_hdr_len etc it seems as if it would be cleaner to make
this independent of the specific L2 being used.

(And I plan on using the LRO module for IP-over-InfiniBand so this is
not completely theoretical)


Good point!

We really should rename NET_IP_ALIGN so that both Ethernet and DMA
occur in it somehow :)


Good point.  I tend to think all the world is ethernet.
Perhaps the better way would be to simply add an alignment pad
field to lro_mgr?  When the driver initializes it, it specifies
any padding needed.  Ethernet drivers would specify 2.


Just pass in the mac_hdr_len, and calculate the padding as


That was my first thought as well, but it turns out that
when lro_gen_skb() is called via the out1 label, mac_hdr_len
may not be known.  It seemed simplest and cleanest to just
make it a field in lro_mgr.

Drew
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH]: fix lro_gen_skb() alignment

2007-11-30 Thread Andrew Gallatin

Roland Dreier wrote:

   -   skb = netdev_alloc_skb(lro_mgr-dev, hlen);
   +   skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN);
   NET_IP_ALIGN should only be used if you're DMAing into the skb head.
   Otherwise you should say 2.  It would be nice to have another macro
   for that I suppose.
  
  It is certainly simple enough to say 2.  Thank you for pointing

  this out.  I have attached a patch to do that..
  
  Signed off by: Andrew Gallatin [EMAIL PROTECTED]


Isn't the value of 2 ethernet-specific (to round the 14-byte header up
to 16)?  Given that the rest of the lro code is fairly careful to
calculate mac_hdr_len etc it seems as if it would be cleaner to make
this independent of the specific L2 being used.

(And I plan on using the LRO module for IP-over-InfiniBand so this is
not completely theoretical)


Good point.  I tend to think all the world is ethernet.
Perhaps the better way would be to simply add an alignment pad
field to lro_mgr?  When the driver initializes it, it specifies
any padding needed.  Ethernet drivers would specify 2.

Is this acceptable?

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH]: fix lro_gen_skb() alignment

2007-11-30 Thread Andrew Gallatin

Herbert Xu wrote:

Andrew Gallatin [EMAIL PROTECTED] wrote:

diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..91e9371 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc
   int data_len = len;
   int hdr_len = min(len, hlen);

-   skb = netdev_alloc_skb(lro_mgr-dev, hlen);
+   skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN);


NET_IP_ALIGN should only be used if you're DMAing into the skb head.
Otherwise you should say 2.  It would be nice to have another macro
for that I suppose.


It is certainly simple enough to say 2.  Thank you for pointing
this out.  I have attached a patch to do that..

Signed off by: Andrew Gallatin [EMAIL PROTECTED]

Drew
diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..35b816e 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc
int data_len = len;
int hdr_len = min(len, hlen);
 
-   skb = netdev_alloc_skb(lro_mgr-dev, hlen);
+   skb = netdev_alloc_skb(lro_mgr-dev, hlen + 2);
if (!skb)
return NULL;
 
+   skb_reserve(skb, 2);
skb-len = len;
skb-data_len = len - hdr_len;
skb-truesize += true_size;


[PATCH]: fix lro_gen_skb() alignment

2007-11-27 Thread Andrew Gallatin

The inet_lro.c:lro_gen_skb() function fails to include
NET_IP_ALIGN padding at the front of the sk_buffs it creates,
leading to alignment warnings on architectures which require
strict alignment (seen on sparc64).  The attached patch
adds NET_IP_ALIGN padding.

Signed off by: Andrew Gallatin [EMAIL PROTECTED]

Drew

diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..91e9371 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -401,10 +401,11 @@ static struct sk_buff *lro_gen_skb(struc
int data_len = len;
int hdr_len = min(len, hlen);
 
-   skb = netdev_alloc_skb(lro_mgr-dev, hlen);
+   skb = netdev_alloc_skb(lro_mgr-dev, hlen + NET_IP_ALIGN);
if (!skb)
return NULL;
 
+   skb_reserve(skb, NET_IP_ALIGN);
skb-len = len;
skb-data_len = len - hdr_len;
skb-truesize += true_size;


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Andrew Gallatin

David Miller wrote:
 From: Herbert Xu [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 14:09:18 +0800

 David Miller [EMAIL PROTECTED] wrote:
 Fundamentally, I really don't like this change, it batches to the
 point where it begins to erode the natural ACK clocking of TCP, and I
 therefore am very likely to revert it before merging to Linus.
 Perhaps make it a tunable that defaults to off?

 That's one idea.

I'd certainly prefer the option to have a tunable to having our
customers see performance regressions when they switch to
the kernel's LRO.

 But if it's there the risk it to have it end up being turned on
 always by distribution vendors, making our off-default pointless
 and the internet gets crapped up with misbehaving Linux TCP
 stacks anyways.

If vendors are going to pick this up, there is the risk of them just
applying this patch (which currently has no tunable to disable it),
leaving their users stuck with it enabled.  At least with a tunable,
it would be easy for them to turn it off.  And the comments surrounding
it could make it clear why it should default to off.

FWIW, we've seen TCP perform well in a WAN setting using our NICs and
our LRO which does this ack aggregation.  For example, the last 2
Supercomputing Bandwidth Challenges (making the most of a 10Gb/s
WAN connection) were won by teams using our NICs, with drivers that
did this sort of ack aggregation.

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] LRO ack aggregation

2007-11-20 Thread Andrew Gallatin

David Miller wrote:
 From: Andrew Gallatin [EMAIL PROTECTED]
 Date: Tue, 20 Nov 2007 06:47:57 -0500

 David Miller wrote:
   From: Herbert Xu [EMAIL PROTECTED]
   Date: Tue, 20 Nov 2007 14:09:18 +0800
  
   David Miller [EMAIL PROTECTED] wrote:
   Fundamentally, I really don't like this change, it batches to the
   point where it begins to erode the natural ACK clocking of TCP, 
and I

   therefore am very likely to revert it before merging to Linus.
   Perhaps make it a tunable that defaults to off?
  
   That's one idea.

 I'd certainly prefer the option to have a tunable to having our
 customers see performance regressions when they switch to
 the kernel's LRO.

 Please qualify this because by itself it's an inaccurate statement.

 It would cause a performance regression in situations where the is
 nearly no packet loss, no packet reordering, and the receiver has
 strong enough cpu power.

Yes, a regression of nearly 1Gb/s in some cases as I mentioned
when I submitted the patch.



 Show me something over real backbones, talking to hundres or thousands
 of clients scattered all over the world.  That's what people will be
 using these high end NICs for front facing services, and that's where
 loss happens and stretch ACKs hurt performance.


I can't.  I think most 10GbE on endstations is used either in the
sever room, or on dedicated links.  My experience with 10GbE users is
limited to my interactions with people using our NICs who contact our
support.  Of those, I can recall only a tiny handful who were using
10GbE on a normal internet facing connection (and the ones I dealt
with were actually running a different OS).  The vast majority were in
a well controlled, lossless environment.  It is quite ironic.  The
very fact that I cannot provide you with examples of internet facing
people using LRO (w/ack aggr) in more normal applications tends to
support my point that most 10GbE users seem to be in lossless
environments.

 ACK stretching is bad bad bad for everything outside of some well
 controlled test network bubble.

I just want those in the bubble to continue have the best performance
possible in their situation.  If it is a tunable the defaults to off,
that is great.

Hmm.. rather than a global tunable, what if it was a
network driver managed tunable which toggled a flag in the
lro_mgr features?  Would that be better?

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH]: Fix myri10ge NAPI oops warnings

2007-10-31 Thread Andrew Gallatin


When testing the myri10ge driver with 2.6.24-rc1, I found
that the machine crashed under heavy load:

Unable to handle kernel paging request at 00100108 RIP:
 [803cc8dd] net_rx_action+0x11b/0x184

The address corresponds to the list_move_tail() in
netif_rx_complete():
if (unlikely(work == weight))
list_move_tail(n-poll_list, list);

Eventually, I traced the crashes to calling netif_rx_complete() with
work_done == budget.  From looking at other drivers, it appears that
one should only call netif_rx_complete() when work_done  budget.

To fix it, I changed the test in myri10ge_poll() so that it refers
to to work_done rather than looking at the rx ring status.  If
work_done is  budget, then that implies we have no more packets to
process. Any races will be resolved by the NIC when the write to
irq_claim is made.

In myri10ge_clean_rx_done(), if we ever exceeded our budget, it would
report a work_done one larger than was acutally done.  This is because
the increment was done in the conditional, so work_done would be
incremented regardless of whether or not the test passed or failed.
This would lead to the WARN_ON_ONCE(work  weight); warning in
net_rx_action triggering.  I've moved the increment of work_done
inside the loop.  Note that this would only be a problem when we had
exceeded our budget.

Signed off by: Andrew Gallatin [EMAIL PROTECTED]

Andrew Gallatin Myricom Inc


diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
index 366e62a..0f306dd 100644
--- a/drivers/net/myri10ge/myri10ge.c
+++ b/drivers/net/myri10ge/myri10ge.c
@@ -1151,7 +1151,7 @@ static inline int myri10ge_clean_rx_done
u16 length;
__wsum checksum;
 
-   while (rx_done-entry[idx].length != 0  work_done++  budget) {
+   while (rx_done-entry[idx].length != 0  work_done  budget) {
length = ntohs(rx_done-entry[idx].length);
rx_done-entry[idx].length = 0;
checksum = csum_unfold(rx_done-entry[idx].checksum);
@@ -1167,6 +1167,7 @@ static inline int myri10ge_clean_rx_done
rx_bytes += rx_ok * (unsigned long)length;
cnt++;
idx = cnt  (myri10ge_max_intr_slots - 1);
+   work_done++;
}
rx_done-idx = idx;
rx_done-cnt = cnt;
@@ -1233,13 +1234,12 @@ static int myri10ge_poll(struct napi_str
struct myri10ge_priv *mgp =
container_of(napi, struct myri10ge_priv, napi);
struct net_device *netdev = mgp-dev;
-   struct myri10ge_rx_done *rx_done = mgp-rx_done;
int work_done;
 
/* process as many rx events as NAPI will allow */
work_done = myri10ge_clean_rx_done(mgp, budget);
 
-   if (rx_done-entry[rx_done-idx].length == 0 || !netif_running(netdev)) 
{
+   if (work_done  budget || !netif_running(netdev)) {
netif_rx_complete(netdev, napi);
put_be32(htonl(3), mgp-irq_claim);
}


Re: [PATCH]: Fix myri10ge NAPI oops warnings

2007-10-31 Thread Andrew Gallatin

Stephen Hemminger wrote:

On Wed, 31 Oct 2007 17:40:06 -0400
Andrew Gallatin [EMAIL PROTECTED] wrote:


When testing the myri10ge driver with 2.6.24-rc1, I found
that the machine crashed under heavy load:

Unable to handle kernel paging request at 00100108 RIP:
  [803cc8dd] net_rx_action+0x11b/0x184

The address corresponds to the list_move_tail() in
netif_rx_complete():
 if (unlikely(work == weight))
 list_move_tail(n-poll_list, list);

Eventually, I traced the crashes to calling netif_rx_complete() with
work_done == budget.  From looking at other drivers, it appears that
one should only call netif_rx_complete() when work_done  budget.

To fix it, I changed the test in myri10ge_poll() so that it refers
to to work_done rather than looking at the rx ring status.  If
work_done is  budget, then that implies we have no more packets to
process. Any races will be resolved by the NIC when the write to
irq_claim is made.

In myri10ge_clean_rx_done(), if we ever exceeded our budget, it would
report a work_done one larger than was acutally done.  This is because
the increment was done in the conditional, so work_done would be
incremented regardless of whether or not the test passed or failed.
This would lead to the WARN_ON_ONCE(work  weight); warning in
net_rx_action triggering.  I've moved the increment of work_done
inside the loop.  Note that this would only be a problem when we had
exceeded our budget.

Signed off by: Andrew Gallatin [EMAIL PROTECTED]

Andrew Gallatin Myricom Inc




Yes, this looks right.
How could the check in netif_rx_complete be changed to catch this better?


I'm sorry, but I don't really know.  I'm afraid that I am
rather clueless about the new NAPI, and found this by
stumbling around in the dark. Naively, it seems like some
global option to override all napi weights (eg, to 1) would
be helpful so that if a driver had this problem, it could
be easily reproduced by the first packet received.  At least
I found setting our weight to one made the bug rather
obvious to me.

Drew


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] LRO ack aggregation

2007-10-23 Thread Andrew Gallatin

Hi,

We recently did some performance comparisons between the new inet_lro
LRO support in the kernel, and our Myri10GE in-driver LRO.
For receive, we found they were nearly identical.  However, for
transmit, we found that Myri10GE's LRO shows much lower CPU
utilization.  We traced the CPU utilization difference to our driver
LRO aggregating TCP acks, and the inet_lro module not doing so.

I've attached a patch which adds support to inet_lro for aggregating
pure acks.  Aggregating pure acks (segments with TCP_PAYLOAD_LENGTH ==
0) entails freeing the skb (or putting the page in the frags case).
The patch also handles trimming (typical for 54-byte pure ack frames
which have been padded to the ethernet minimum 60 byte frame size).
In the frags case, I tried to keep things simple by only doing the
trim when the entire frame fits in the first frag.  To be safe, I
ensure that the padding is all 0 (or, more exactly, was some pattern
whose checksum is -0) so that it doesn't impact hardware checksums.

This patch also fixes a small bug in the skb LRO path dealing with
vlans that I found when doing my own testing.  Specifically, in the
desc-active case, the existing code always fails the
lro_tcp_ip_check() for NICs without LRO_F_EXTRACT_VLAN_ID, because it
 fails to subtract the vlan_hdr_len from the skb-len.

Jan-Bernd Themann ([EMAIL PROTECTED]) has tested the patch using
the eHEA driver (skb codepath), and I have tested it using Myri10GE
(both frags and skb codepath).

Using a pair of identical low-end 2.0GHz Athlon 64 x2 3800+ with
Myri10GE 10GbE NICs, I ran 10 iterations of netperf TCP_SENDFILE
tests, taking the median run for comparison purposes.  The receiver
was running Myri10GE + patched inet_lro:

TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to rome-my
(192.168.1.16) port 0 AF_INET : cpu bind
Recv   SendSend  Utilization   Service
Demand
Socket Socket  Message  Elapsed  Send Recv Send Recv
Size   SizeSize Time Throughput  localremote   local
remote
bytes  bytes   bytessecs.10^6bits/s  % S  % S  us/KB us/KB

Myri10GE driver-specific LRO:
  87380  65536  6553660.02  9442.65   16.2469.310.282 1.203
Myri10GE + unpatched inet_lro:
  87380  65536  6553660.02  9442.88   20.1069.110.349 1.199
Myri10GE + patched inet_lro:
  87380  65536  6553660.02  9443.30   16.9568.970.294 1.197

The important bit here is the sender's CPU utilization, and service
demand (cost per byte).  As you can see, without aggregating ack's,
the overhead on the sender is roughly 20% higher, even when sending to
a receiver which uses LRO.  The differences are even more dramatic
when sending to a receiver which does not use LRO (and hence sends
more frequent acks).

Below is the same benchmark, run between a pair of 4-way 3.0GHz Xeon
5160 machines (Dell 2950) with Myri10GE NICs.  The receiver is running
Solaris 10U4, which does not do LRO, and is acking at approximately
8:1 (or ~100K acks/sec):

Myri10GE driver-specific LRO:
196712  65536  6553660.01  9280.09   7.14 45.370.252   1.602
Myri10GE + unpatched inet_lro:
196712  65536  6553660.01  8530.80   10.5144.600.404   1.713
Myri10GE + patched inet_lro:
196712  65536  6553660.00  9249.65   7.21 45.900.255   1.626

Signed off by: Andrew Gallatin [EMAIL PROTECTED]


Andrew Gallatin
Myricom Inc.

diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..eba145b 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -58,9 +58,6 @@ static int lro_tcp_ip_check(struct iphdr
if (ntohs(iph-tot_len) != len)
return -1;
 
-   if (TCP_PAYLOAD_LENGTH(iph, tcph) == 0)
-   return -1;
-
if (iph-ihl != IPH_LEN_WO_OPTIONS)
return -1;
 
@@ -223,6 +220,11 @@ static void lro_add_packet(struct net_lr
 
lro_add_common(lro_desc, iph, tcph, tcp_data_len);
 
+   if (tcp_data_len == 0) {
+   dev_kfree_skb_any(skb);
+   return;
+   }
+
skb_pull(skb, (skb-len - tcp_data_len));
parent-truesize += skb-truesize;
 
@@ -244,6 +246,11 @@ static void lro_add_frags(struct net_lro
 
lro_add_common(lro_desc, iph, tcph, tcp_data_len);
 
+   if (tcp_data_len == 0) {
+   put_page(skb_frags[0].page);
+   return;
+   }
+
skb-truesize += truesize;
 
skb_frags[0].page_offset += hlen;
@@ -338,6 +345,8 @@ static int __lro_proc_skb(struct net_lro
struct tcphdr *tcph;
u64 flags;
int vlan_hdr_len = 0;
+   int pkt_len;
+   int trim;
 
if (!lro_mgr-get_skb_header
|| lro_mgr-get_skb_header(skb, (void *)iph, (void *)tcph,
@@ -355,6 +364,17 @@ static int __lro_proc_skb(struct net_lro
 !test_bit(LRO_F_EXTRACT_VLAN_ID, lro_mgr-features))
vlan_hdr_len = VLAN_HLEN;
 
+   /* strip padding

Re: myri10ge net-2.6.24 build fix

2007-08-09 Thread Andrew Gallatin

David Miller wrote:
 I had to add the following patch to fix the build after
 the LRO changes, I have no idea how you could have compile
 tested that patch let alone done any real testing on it :-/

Whoops.  I'm very sorry about that.  Future patches will be submitted
by our Linux guy, who knows the correct procedures. :)

FWIW, the patch I sent came from a script which filters our internal
driver (which I always work with), and that line was erroneously
filtered by the script we use to remove all the stuff you guys frown
on (like our LRO, compat shims for older kernels, optional support to
receive into skbs rather than pages, etc).  My testing was done with
our un-filtered upstream driver.

Since our driver specific LRO is now gone (hurray!), I removed the LRO
filtering, re-ran the filter script, and arrived at the same patch
you committed.

Thanks for fixing my mistake!

Drew


-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] lro: myri10ge example how to use LRO

2007-08-03 Thread Andrew Gallatin

To follow up on Jan-Bernd Themann's LRO patch earlier today,
this patch shows how the generic LRO interface can be used for
page based drivers.

Again, many thanks to Jan-Bernd Themann for leading this effort.

Drew

Singed off by: Andrew Gallatin [EMAIL PROTECTED]
diff -urNp a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
--- a/drivers/net/myri10ge/myri10ge.c   2007-07-24 15:57:12.0 -0400
+++ b/drivers/net/myri10ge/myri10ge.c   2007-08-03 13:07:48.0 -0400
@@ -48,6 +48,7 @@
 #include linux/etherdevice.h
 #include linux/if_ether.h
 #include linux/if_vlan.h
+#include linux/inet_lro.h
 #include linux/ip.h
 #include linux/inet.h
 #include linux/in.h
@@ -62,6 +63,8 @@
 #include linux/io.h
 #include linux/log2.h
 #include net/checksum.h
+#include net/ip.h
+#include net/tcp.h
 #include asm/byteorder.h
 #include asm/io.h
 #include asm/processor.h
@@ -89,6 +92,7 @@ MODULE_LICENSE(Dual BSD/GPL);
 
 #define MYRI10GE_EEPROM_STRINGS_SIZE 256
 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
+#define MYRI10GE_MAX_LRO_DESCRIPTORS 8
 
 #define MYRI10GE_NO_CONFIRM_DATA htonl(0x)
 #define MYRI10GE_NO_RESPONSE_RESULT 0x
@@ -151,6 +155,8 @@ struct myri10ge_rx_done {
dma_addr_t bus;
int cnt;
int idx;
+   struct net_lro_mgr lro_mgr;
+   struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
 };
 
 struct myri10ge_priv {
@@ -276,6 +282,14 @@ static int myri10ge_debug = -1;/* defau
 module_param(myri10ge_debug, int, 0);
 MODULE_PARM_DESC(myri10ge_debug, Debug level (0=none,...,16=all));
 
+static int myri10ge_lro = 1;
+module_param(myri10ge_lro, int, S_IRUGO);
+MODULE_PARM_DESC(myri10ge_lro, Enable large receive offload\n);
+
+static int myri10ge_lro_max_pkts = MYRI10GE_LRO_MAX_PKTS;
+module_param(myri10ge_lro_max_pkts, int, S_IRUGO);
+MODULE_PARM_DESC(myri10ge_lro, Number of LRO packets to be aggregated\n);
+
 static int myri10ge_fill_thresh = 256;
 module_param(myri10ge_fill_thresh, int, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(myri10ge_fill_thresh, Number of empty rx slots allowed\n);
@@ -1019,6 +1033,15 @@ myri10ge_rx_done(struct myri10ge_priv *m
remainder -= MYRI10GE_ALLOC_SIZE;
}
 
+   if (mgp-csum_flag  myri10ge_lro) {
+   rx_frags[0].page_offset += MXGEFW_PAD;
+   rx_frags[0].size -= MXGEFW_PAD;
+   len -= MXGEFW_PAD;
+   lro_receive_frags(mgp-rx_done.lro_mgr, rx_frags,
+ len, len, (void *)(unsigned long)csum, csum);
+   return 1;
+   }
+
hlen = MYRI10GE_HLEN  len ? len : MYRI10GE_HLEN;
 
/* allocate an skb to attach the page(s) to. */
@@ -1137,6 +1160,9 @@ static inline void myri10ge_clean_rx_don
mgp-stats.rx_packets += rx_packets;
mgp-stats.rx_bytes += rx_bytes;
 
+   if (myri10ge_lro)
+   lro_flush_all(rx_done-lro_mgr);
+
/* restock receive rings if needed */
if (mgp-rx_small.fill_cnt - mgp-rx_small.cnt  myri10ge_fill_thresh)
myri10ge_alloc_rx_pages(mgp, mgp-rx_small,
@@ -1378,7 +1404,8 @@ static const char myri10ge_gstrings_stat
dropped_pause, dropped_bad_phy, dropped_bad_crc32,
dropped_unicast_filtered, dropped_multicast_filtered,
dropped_runt, dropped_overrun, dropped_no_small_buffer,
-   dropped_no_big_buffer
+   dropped_no_big_buffer, LRO aggregated, LRO flushed,
+   LRO avg aggr, LRO no_desc
 };
 
 #define MYRI10GE_NET_STATS_LEN  21
@@ -1444,6 +1471,14 @@ myri10ge_get_ethtool_stats(struct net_de
data[i++] = (unsigned int)ntohl(mgp-fw_stats-dropped_overrun);
data[i++] = (unsigned int)ntohl(mgp-fw_stats-dropped_no_small_buffer);
data[i++] = (unsigned int)ntohl(mgp-fw_stats-dropped_no_big_buffer);
+   data[i++] = mgp-rx_done.lro_mgr.stats.aggregated;
+   data[i++] = mgp-rx_done.lro_mgr.stats.flushed;
+   if (mgp-rx_done.lro_mgr.stats.flushed)
+   data[i++] = mgp-rx_done.lro_mgr.stats.aggregated /
+   mgp-rx_done.lro_mgr.stats.flushed;
+   else
+   data[i++] = 0;
+   data[i++] = mgp-rx_done.lro_mgr.stats.no_desc;
 }
 
 static void myri10ge_set_msglevel(struct net_device *netdev, u32 value)
@@ -1717,10 +1752,69 @@ static void myri10ge_free_irq(struct myr
pci_disable_msi(pdev);
 }
 
+static int
+myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
+void **ip_hdr, void **tcpudp_hdr,
+u64 * hdr_flags, void *priv)
+{
+   struct ethhdr *eh;
+   struct vlan_ethhdr *veh;
+   struct iphdr *iph;
+   u8 *va = page_address(frag-page) + frag-page_offset;
+   unsigned long ll_hlen;
+   __wsum csum = (__wsum) (unsigned long)priv;
+
+   /* find the mac header, aborting if not IPv4 */
+
+   eh = (struct ethhdr *)va;
+   *mac_hdr = eh;
+   ll_hlen = ETH_HLEN;
+   if (eh-h_proto != htons

Re: [PATCH] lro: myri10ge example how to use LRO

2007-08-03 Thread Andrew Gallatin

Kok, Auke wrote:

Andrew Gallatin wrote:

To follow up on Jan-Bernd Themann's LRO patch earlier today,
this patch shows how the generic LRO interface can be used for
page based drivers.

Again, many thanks to Jan-Bernd Themann for leading this effort.

Drew

Singed off by: Andrew Gallatin [EMAIL PROTECTED]




please take a look at my lro patch for ethtool and see if it works for 
you, instead of adding another generic module parameter that doesn't 
need to be there.


That looks very nice, and will indeed work for me.

Thanks,

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-31 Thread Andrew Gallatin

Jan-Bernd Themann wrote:

On Monday 30 July 2007 22:32, Andrew Gallatin wrote:



Second, you still need to set skb-ip_summed = CHECKSUM_UNNECESSARY
when modified packets are flushed, else the stack will see bad
checksums for packets from CHECKSUM_COMPLETE drivers using the
skb interface.  Fixed in the attached patch.


I thought about it... As we do update the TCP checksum for aggregated
packets we could add a second ip_summed field in the net_lro_mgr struct 
used for aggregated packets to support HW that does not have any checksum helper
functionality. These drivers could set this ip_summed field to CHECKSUM_NONE, 
and thus leave the checksum check to the stack. I'm not sure if these old devices benefit

a lot from LRO. So what do you think?


This might be handy, and it would also fix the problem with
CHECKSUM_PARTIAL drivers using the skb interface by allowing them
to set it to CHECKSUM_UNNECESSARY.



Fourth, I did some traffic sniffing to try to figure out what's going
on above, and saw tcpdump complain about bad checksums.  Have you tried
running tcpdump -s 65535 -vvv?  Have you also seen bad checksums?
I seem to see this for both page- and skb-based versions of the driver.



Hmmm, can't confirm that. For our skb-based version I see
correct checksums for aggregated packets and for the page-based version as well.
I used: (tcpdump -i ethX -s 0 -w dump.bin) in combination with ethereal.
Don't see problems as well with your tcpdump command.


I'm still trying to get a handle on this.  It happens both with
page based and skb based receive for me..  I would not be
surprised if I was doing something wrong in myri10ge.  But
I don't see it without LRO, or with my LRO.  I'll let you
know when I figure it out..

In the meantime, in case you have any insight, I've left a
capture of a small netcat transfer of a 64KB file full
of zeros at http://www.myri.com/staff/gallatin/lro/netcat_dump.bz2

Drew

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Andrew Gallatin


Here is a quick reply before something more official can
be written up:

Linas Vepstas wrote:

 -- what is LRO?

Large Receive Offload

 -- Basic principles of operation?

LRO is analogous to a receive side version of TSO.  The NIC (or
driver) merges several consecutive segments from the same connection,
fixing up checksums, etc.  Rather than up to 45 separate 1500 byte
frames (meaning up to 45 trips through the network stack), the driver
merges them into one 65212 byte frame.  It currently works only
with TCP over IPv4.

LRO was, AFAIK, first though of by Neterion.  They had a paper about
it at OLS2005.
http://www.linuxinsight.com/files/ols2005/grossman-reprint.pdf

 -- Can I use it in my driver?

Yes, it can be used in any driver.

 -- Does my hardware have to have some special feature before I can 
use it?


No.

 -- What sort of performance improvement does it provide? Throughput?
Latency? CPU usage? How does it affect DMA allocation? Does it
improve only a certain type of traffic (large/small packets, etc.)

The benefit is directly proportional to the packet rate.

See my reply to the previous RFC for performance information.  The
executive summary is that for the myri10ge 10GbE driver on low end
hardware with 1500b frames, I've seen it increase throughput by a
factor of nearly 2.5x, while at the same time reducing CPU utilization
by 17%.  The affect for jumbo frames is less dramatic, but still
impressive (1.10x, 14% CPU reduction)

You can achieve better speedups if your driver receives into
high-order pages.

 -- Example code? What's the API? How should my driver use it?

The 3/4 in this patch showed an example of converting a driver
to use LRO for skb based receive buffers.   I'm working on
a patch for myri10ge that shows how you would use it in a driver
which receives into pages.

Cheers,

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4][RFC] lro: Generic Large Receive Offload for TCP traffic

2007-07-30 Thread Andrew Gallatin

Jan-Bernd Themann wrote:

Hi,

this patch set contains the latest generic LRO code, a Kconfig / Makefile
and an eHEA patch demonstrating how the aggregate SKB interface has to
to be used.
Drew, could you provide a patch for the myri10ge driver to show how the
receive in page interface works?


Hi,

Thank you for the many fixes, and especially for the counters!

I was working on testing the myri10ge patch, and I ran into a few
problems.  I've attached a patch to inet_lro.c to fix some of them,
and a patch to myri10ge.c to show how to use the page based
interface.  Both patches are signed off by Andrew Gallatin
[EMAIL PROTECTED]

First, the LRO_MAX_PG_HLEN is still a problem.  Minimally sized 60
byte frames still cause problems in lro_gen_skb due to skb-len
going negative.  Fixed in the attached patch.  It may be simpler
to just drop LRO_MAX_PG_HLEN to ETH_ZLEN, but I'm not sure if
that is enough.  Are there smart NICs which might chop off padding
themselves?

Second, you still need to set skb-ip_summed = CHECKSUM_UNNECESSARY
when modified packets are flushed, else the stack will see bad
checksums for packets from CHECKSUM_COMPLETE drivers using the
skb interface.  Fixed in the attached patch.

Third, for some reason I have yet to figure out, this version of the
patch takes a while to ramp up, but only when the receiver MTU
is 9000 and the sender MTU is 1500.  If the receiver MTU is also
1500, even a 10 second netperf easily shows line rate.  I don't
see this with our LRO, and I'm so far at a loss to explain it.

Here is the first 3 seconds of  output from a simple program
which diffs the interface counters once / sec.
   Ipkts   IBytesOpkts   Obytes

rx MTU=9000:
   0000
  7299092   14 1102
 105   1589707  420
  750324   113598708417804  1068240
  814427   123304247818529  740



rx MTU=1500
   0000
  44134466818016813132   788182
  815938   123532865618716  1122960
  817698   123799477218628  1117680
.

Once it has ramped up, the bandwidth is fine, and there
doesn't seem to be much difference between setting the
receiver MTU to 1500 or 9000.  But it takes a very long
netperf run to overcome the ramp up period.

Fourth, I did some traffic sniffing to try to figure out what's going
on above, and saw tcpdump complain about bad checksums.  Have you tried
running tcpdump -s 65535 -vvv?  Have you also seen bad checksums?
I seem to see this for both page- and skb-based versions of the driver.

Last, the pointer to the TCP header in __lro_proc_segment() in the
unaccelerated vlan hdr case needs to also be offset by vlan_hdr_len
from skb-data.   I've fixed this in the attached patch.



Thanks,

Drew
--- linux-2.6.23-rc1.orig/net/ipv4/inet_lro.c   2007-07-30 13:10:49.0 
-0400
+++ linux-2.6.23-rc1/net/ipv4/inet_lro.c2007-07-30 15:17:51.0 
-0400
@@ -126,6 +126,7 @@ static void lro_update_tcp_ip_header(str
htons(lro_desc-ip_tot_len) -
IP_HDR_LEN(iph), IPPROTO_TCP,
lro_desc-data_csum);
+   lro_desc-parent-ip_summed = CHECKSUM_UNNECESSARY;
 }
 
 static __wsum lro_tcp_data_csum(struct iphdr *iph, struct tcphdr *tcph, int 
len)
@@ -396,6 +397,9 @@ static struct sk_buff *lro_gen_skb(struc
if (!skb)
return NULL;
 
+   if (hlen  len)
+   hlen = len;
+
skb-len = len;
skb-data_len = len - hlen;
skb-truesize += true_size;
diff -urp a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c
--- a/drivers/net/myri10ge/myri10ge.c   2007-07-24 15:57:12.0 -0400
+++ b/drivers/net/myri10ge/myri10ge.c   2007-07-30 16:12:06.0 -0400
@@ -48,6 +48,7 @@
 #include linux/etherdevice.h
 #include linux/if_ether.h
 #include linux/if_vlan.h
+#include linux/inet_lro.h
 #include linux/ip.h
 #include linux/inet.h
 #include linux/in.h
@@ -62,6 +63,8 @@
 #include linux/io.h
 #include linux/log2.h
 #include net/checksum.h
+#include net/ip.h
+#include net/tcp.h
 #include asm/byteorder.h
 #include asm/io.h
 #include asm/processor.h
@@ -89,6 +92,7 @@ MODULE_LICENSE(Dual BSD/GPL);
 
 #define MYRI10GE_EEPROM_STRINGS_SIZE 256
 #define MYRI10GE_MAX_SEND_DESC_TSO ((65536 / 2048) * 2)
+#define MYRI10GE_MAX_LRO_DESCRIPTORS 8
 
 #define MYRI10GE_NO_CONFIRM_DATA htonl(0x)
 #define MYRI10GE_NO_RESPONSE_RESULT 0x
@@ -151,6 +155,8 @@ struct myri10ge_rx_done {
dma_addr_t bus;
int cnt;
int idx;
+   struct net_lro_mgr lro_mgr;
+   struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
 };
 
 struct myri10ge_priv {
@@ -276,6 +282,14 @@ static int myri10ge_debug = -1;/* defau
 module_param(myri10ge_debug, int, 0

Re: [RFC 0/1] lro: Generic Large Receive Offload for TCP traffic

2007-07-27 Thread Andrew Gallatin

Jan-Bernd Themann wrote:

 On Wednesday 25 July 2007 19:17, Andrew Gallatin wrote:

 3) Padded frames.

 I may be missing something, but I don't see where you
 either strip padding from frames or reject padded frames.
 (see the pskb_trim_rcsum() in net/ipv4/ip_input.c:ip_rcv()

 I think I missed something :-) Will fix that.
 In lro_tcp_ip_check we check for the SKB aggregation interface
 the skb-len against ip-tot_len. This catches padded frames as
 eth_type_trans(skb, dev) reduces the length of the SKB.
 However, the possible VLAN header is not taken into account.
 And for the receive in pages interface a wrong length is passed
 as argument as well.

 I'd suggest to reject padded frames for aggregation as we do now
 (when bugs are fixed) and thus keep the code simple.
 I guess that padded frames don't occur too often in high load
 situations. If we detect a real performance issue we can still
 change that later.

The one case where performance may be at issue is in aggregating Acks
on connections w/o TCP timestamps where a frame size of 54 bytes is
padded out to 60.  I did some very quick measurements using netperf
-tTCP_SENDFILE on the same athlons mentioned earlier using our 1.3.1
driver.  I see a roughly 8% CPU increase (~17% - ~25%) when I disable
LRO (and hence Ack aggregation) on the sender.  This works out to an
increase in service demand from ~0.3 to ~0.44 according to netperf.
With LRO enabled, our counters show we're aggregating acks at a
roughly 3:1 ratio.  This is probably an optimization that can be done
later, but it is helpful.

This reminds me.. what would you think about adding some sort of
counters, ideally per-interface, to expose how well LRO is working?
At the simplest level, you could add them to the lro mgr struct and
let drivers export them via ethtool.  I think a central approach might
be more appropriate.  At any rate, I'd prefer the final
version to at least have counters to indicate how many packets were
aggregated, how many packets were flushed, and how many times we
failed to aggregate something due to insufficient net_lro_desc
descriptors.

Thanks again for taking the lead on this,

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/1] lro: Generic Large Receive Offload for TCP traffic

2007-07-25 Thread Andrew Gallatin

Hi,

I've ported myri10ge to use the new LRO interface.  I have attached a
preliminary patch to myri10ge.  I'm very pleased to note that the
performance is on-par with my own LRO used by our out-of-tree driver.
(except when using mixed MTUS, see performance data below).

As I expected, actually porting our driver to use the LRO interface
gave me a far better understanding of the interface, and allowed for
better feedback.  I have attached a patch to the LRO code which
addresses some of the issues I mention below.

Please find below a performance summary, as well as my comments
on the LRO code, and patches to myri10ge and inet_lro. Both patches
are Signed-off-by: Andrew J. Gallatin [EMAIL PROTECTED]


Cheers,

Drew

===
Performance:
===

Here is a performance summary taken on my very low-end 2.0GHz AMD
Athlon(tm) 64 X2 Dual Core Processor 3800+ running 2.6.23-rc1 and
receiving a netperf TCP_SENDFILE test from an identical sender (which
was running 2.6.22 and our 1.3.1 out of tree driver).  The netserver
process was bound to a different core than the interrupt handler.  The
data reported is from the median of 5 60 second netperf tests.  The
following settings were in /etc/sysctl.conf on both machines:

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_timestamps = 0


RX Performance for Sender MTU=1500, Receiver MTU=1500 expressed as
x Gb/s, y %CPU receiver utilization:

rxbuf(1) 1.3.1(2)  inet_lro   no LRO
----   ---
4K pg8.9 78%   8.8 77%  3.7 89%
8K pg9.2 77%   9.1 77%  3.7 89%
16Kpg9.4 73%   9.4 73%  3.8 89%
32Kpg9.4 72%   9.4 72%  3.9 89%
skb  N/A N/A   5.5 90%  4.1 92%

RX Performance for Sender MTU=1500, Receiver MTU=9000 expressed as
x Gb/s, y %CPU receiver utilization:

rxbuf(1) 1.3.1(2)  inet_lro   no LRO
----   ---
4K pg8.9 78%   7.3 79%  3.7 89%
8K pg9.2 77%   7.6 79%  3.7 89%
16Kpg9.4 73%   8.0 78%  3.8 89%
32Kpg9.4 72%   8.2 79%  3.9 89%
skb  N/A N/A   4.9 92%  4.1 92%

RX Performance for Sender MTU=9000, Receiver MTU=9000 expressed as
x Gb/s, y %CPU receiver utilization:

rxbuf(1) 1.3.1(2)  inet_lro   no LRO
----   ---
4K pg9.9 63%   9.6 66%  8.3 71%
8K pg9.9 60%   9.9 63%  8.4 72%
16Kpg9.9 55%   9.9 55%  8.7 70%
32Kpg9.9 53%   9.9 53%  8.9 67%
skb  N/A N/A   9.9 68%  8.7 72%

(1) xK pg means the driver was configured to adjust the receive page
size using MYRI10GE_ALLOC_ORDER.  skb means an internal variant
of our driver which receives into skbs rather than pages was used.

(2) 1.3.1 is our latest out of tree driver which uses the myri10ge
specific frags-based LRO code previously submitted and rejected.

===
Code review / comments:
===

1) Checksum information for CHECKSUM_COMPLETE drivers.

Our NIC passes partial checksums to our driver.  In the current code,
it seems impossible for page based CHECKSUM_COMPLETE drivers to behave
correctly in the case of rejected frames.  Eg, there is no way
to pass the partial checksum to the LRO module so that it gets
set in the skb header and passed up the stack.

Further, there seems to be no (easy) way to use CHECKSUM_COMPLETE
on an aggregated packet at LRO flush time.  By the time a packet
is aggregated, the partial checksum from the first segment is
out of date.

I think it would make sense to require that a CHECKSUM_COMPLETE style
driver verify the checksum in its get_frag_header / get_skb_header
callback.  This allows the LRO code to unconditionally set
CHECKSUM_UNNECESSARY.

The attached a patch modifies the code to do this.


2) Non-accelerated VLAN tags

Our firmware currently does not do vlan tag insertion
and removal.  This causes a problem in __lro_proc_segment()
where the tcp and ip headers are setup to point into the
newly created skb.  A frame containing an unstripped vlan
tag will cause the headers to be garbage.

The attached patch modifies __lro_proc_segment() to offset
those pointers by VLAN_HLEN when required.

3) Padded frames.

I may be missing something, but I don't see where you
either strip padding from frames or reject padded frames.
(see the pskb_trim_rcsum() in net/ipv4/ip_input.c:ip_rcv()

I did not add such a feature as I was confused about the intended
use of len/true_size.

Also, trimming is a pain when dealing with pure frags (without a
containing skb).  We have code in our out-of-kernel driver to deal
with it which you are welcome to use.


4) LRO_MIN_PG_HLEN (== 80)

This confuses me.  Can you please explain what you're trying to do?
Because of this, I kept getting crashes in the skb_pull() done by
eth_type_trans() because I was passing segments which were 60 bytes
and the skb-data_len of the skb constructed by lro_gen_skb() 

Re: [RFC 0/1] lro: Generic Large Receive Offload for TCP traffic

2007-07-21 Thread Andrew Gallatin

On 7/20/07, Jan-Bernd Themann [EMAIL PROTECTED] wrote:

Hi,

Thanks a lot for your comments so far.
This generic LRO patch differs from the last one in several points.
A new interface for a receive in pages mode has been added and tested
with an eHEA prototype. Seems to work well.

Does this extended interface seem to be sufficient?


Thank you for this!

At least for me, I find it is best to try to use an interface rather
than simply reading a diff.  So I will port Myri10GE to use the new
interface so that I can give better feedback,  I'll try my best to do
this  by early next week.

Thank you again,

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC 0/3] lro: Generic Large Receive Offload for TCP traffic

2007-07-12 Thread Andrew Gallatin

On 7/11/07, Jan-Bernd Themann [EMAIL PROTECTED] wrote:

Generic Large Receive Offload proposal


I'm very glad that somebody is stepping up to take responsibility
for this!

I'm the primary author of the Myricom Myri10GE driver, and its LRO mechanism
(which has been rejected several times when posted here).  I don't subscribe
to the LKML or netdev lists (the b/w is way too much for me).
Thankfully, my colleague Brice (who maintains our driver in the kernel)
forwarded me this message.

I looked at your patch, and I believe that we can improve the
performance further by using a slightly different approach, or at
least making that approach optional.

When I first did our LRO implementation, it aggregated packets
essentially the same way you are doing it -- by appending packets to
the frag_list.  I did quite extensive profiling, and the most expensive
operations seemed to be the allocation and freeing of memory.  A
colleague of mine (Loic, CC'ed) came up with the brilliant idea of
receiving into pages.  When I implemented his suggestion, it turned
out to be much, much more efficient to receive into pages, and to
accumulate the pages to the frags array.

The benchmarks I did on very low end machines in my lab (2GHz
amd64 x2 3800+) showed that the receiver was saturated at roughly
4.2Gb/s without LRO, 5.7Gb/s with frag_list based LRO, 8.6Gb/s with
frags array based LRO, and somewhat idle at line rate with frags array
based LRO and 32KB order=3 pages. (This is 1500b frames, BTW).

The savings comes from being able to do fewer allocations.  For
example, 2 1500b packets fit in a single page.  So, for a full frag
array, we have 16 1/2 4KB pages and a single skb holding them.
This works out to be 9 allocations for roughly 23KB of payload,
rather than 16. Using order 3 (32KB) pages, it gets even better,
and we have just 2 allocations per full skb frag list.

So... It would be wonderful if your patch could also deal with
data residing in pages, rather than in skbs.  I can understand
how you might not want to modify your driver to do this, which
is why I'm asking about making it optional.  However,
your driver would probably benefit from receiving into
pages, and I encourage you to investigate that.

I'm picturing an additional path into lro, such as:

int lro_maybe_receive_frags(struct net_lro_mgr *lro_mgr,
   struct skb_frag_struct *frags,
   int len,  void *priv);


This would return 0 if the packet was accepted, and an
error if it was not.  It would then call a modified
__lro_proc_skb() (perhaps better named __lro_proc_segment())
which would have the length as an argument (so as to avoid
the need to pass the skb to lro_tcp_ip_check()) in addition
to the *frags.

The only real additional work would be in having an alternate path
inside lro_init_desc() which allocated an skb to hang the pages from
if the passed skb was null, and in having an alternate
lro_add_packet() path which added the frag(s), rather than chaining an
skb.

Also, your patch does not seem to maintain TCP checksums.
To be fair, we did not maintain checksums in our earlier LRO
work either, and it was much simpler that way! :) However,
not maintaining the TCP checksum can lead to user complaints
(invalid csum reported by traffic sniffers), as well as problems
when various filtering software is used which attempts to
re-write the TCP header.

I would very much appreciate it if you could look at my LRO
implementation in our Myri10GE driver which has been
posted (and rejected) several times in the past by Brice.
The source code is also available at:
http://www.myri.com/ftp/pub/Myri10GE/myri10ge-linux.1.3.0.tgz.
It uses a page based approach, and it maintains correct TCP
checksums.

This driver is Dual BSD/GPL licensed, and you are free to take
whatever you like.

Thank you again for stepping up with a generic implementation!

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: vlan net_device features flag is 0?

2006-04-06 Thread Andrew Gallatin
On 4/5/06, Ben Greear [EMAIL PROTECTED] wrote:
 Andrew Gallatin wrote:
  I'm working on a driver for a 10GbE nic.  I've just gotten to the point 
  where I
  am verifying that 802.1q vlans work without hardware vlan offload.  It
  seems like
  the netdev features flags (NETIF_F_SG|NETIF_F_IP_CSUM|NETIF_F_TSO) are not
  being inherited by the vlan device.   This leads to very high CPU 
  utilization,
  especially  when running applications which use sendfile, since it forces 
  data
  copies.
 
  I have verified this is the problem by printing the vlan device's features 
  from
  the end of register_vlan_device().  If copy real_dev's features to new_dev,
  then the performance problems disappear.

 This seems logical to me.  We might need some dynamic logic to deal
 with vlans on top of bridge devices (I believe bridges adjust *their*
 capabilities dynamically as devices are added to and removed from the bridge.)

So you are sayiing that you'd be willing to copy real_dev's
features flags when creating a vlan?  Great!  Unfortunately,
I know nothing about bridges.  How would this dynamic logic
work?

We would  also need to catch ethtool updates to
real_dev's features flags, so that ethtook -K would work as expected
on vlan devices.   Or perhaps the dynamic logic to deal with
vlans on top of bridges would be able to deal with it.

  This is especially confusing, since it appears ethtool's -K/-k commands
  fall through to the real device.  So it appears like the ethX.VLAN dev
  has scatter-gather and checksum offload enabled, when actually it
  does not.
 
  Am I hitting a weird corner case in having a device which does
  checksum/sg/tso and not vlan offload? Will this magically go away if I
  implement some vlan hardware assist feature?

 I doubt that will matter, but it's possible.  DaveM and others know more about
 the hardware assist path that I do...

Yes,  I think you are correct, and it should not matter.  i was just grasping
at straws.

Thank you,

Drew
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html