Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Stephen Hemminger
On Fri, 12 Oct 2007 09:08:58 -0700
Brandeburg, Jesse [EMAIL PROTECTED] wrote:

 Andi Kleen wrote:
  When the hw TX queue gains space, the driver self-batches packets
  from the sw queue to the hw queue.
  
  I don't really see the advantage over the qdisc in that scheme.
  It's certainly not simpler and probably more code and would likely
  also not require less locks (e.g. a currently lockless driver
  would need a new lock for its sw queue). Also it is unclear to me
  it would be really any faster.
 
 related to this comment, does Linux have a lockless (using atomics)
 singly linked list element?  That would be very useful in a driver hot
 path.

Use RCU? or write a generic version and get it reviewed.  You really
want someone with knowledge of all the possible barrier impacts to
review it.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Andi Kleen
 Use RCU? or write a generic version and get it reviewed.  You really
 want someone with knowledge of all the possible barrier impacts to
 review it.

I guess he was thinking of using cmpxchg; but we don't support this
in portable code.

RCU is not really suitable for this because it assumes
writing is relatively rare, which is definitely not the case for a qdisc.
Also, general list management with RCU is quite expensive anyway --
updating an element requires a full copy (that is the 'C' in RCU, which
Linux generally doesn't use at all).

-Andi
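
For illustration only (not part of the original thread): a minimal kernel-style sketch of what updating one element of an RCU-protected list involves, showing the copy referred to above. The struct and helper names are hypothetical; the list/RCU primitives are the standard ones, under the assumption of a write path already serialized by its own lock.

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct qent {
            struct list_head list;
            struct rcu_head  rcu;
            int              val;
    };

    static void qent_free_rcu(struct rcu_head *head)
    {
            kfree(container_of(head, struct qent, rcu));
    }

    /* Readers traverse without taking any lock. */
    static int qent_sum(struct list_head *q)
    {
            struct qent *e;
            int sum = 0;

            rcu_read_lock();
            list_for_each_entry_rcu(e, q, list)
                    sum += e->val;
            rcu_read_unlock();
            return sum;
    }

    /* Writers are serialized by the caller's own lock, as usual with RCU. */
    static void qent_update(struct qent *old, int newval)
    {
            struct qent *new_e = kmalloc(sizeof(*new_e), GFP_ATOMIC);

            if (!new_e)
                    return;
            *new_e = *old;
            new_e->val = newval;
            /* Publish the copy, then free the original only after a grace
             * period; one copy per update is what makes a write-heavy
             * structure like a qdisc a poor fit for RCU. */
            list_replace_rcu(&old->list, &new_e->list);
            call_rcu(&old->rcu, qent_free_rcu);
    }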


RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Brandeburg, Jesse
Andi Kleen wrote:
 When the hw TX queue gains space, the driver self-batches packets
 from the sw queue to the hw queue.
 
 I don't really see the advantage over the qdisc in that scheme.
 It's certainly not simpler and probably more code and would likely
 also not require less locks (e.g. a currently lockless driver
 would need a new lock for its sw queue). Also it is unclear to me
 it would be really any faster.

related to this comment, does Linux have a lockless (using atomics)
singly linked list element?  That would be very useful in a driver hot
path.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-12 Thread Andi Kleen
 related to this comment, does Linux have a lockless (using atomics)
 singly linked list element?  That would be very useful in a driver hot
 path.

No; it doesn't. At least not a portable one.
Besides, they tend not to be faster anyway, because e.g. cmpxchg tends
to be as slow as an explicit spinlock.

-Andi
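
For illustration only (nothing like this was in the tree at the time): a userspace sketch of the kind of cmpxchg-based singly linked list being asked about, written with the GCC __sync atomic builtins. Push and detach-all avoid the ABA problem; popping single nodes would not, which is one reason a generic version needs the careful barrier review mentioned earlier.

    #include <stddef.h>

    struct lnode {
            struct lnode *next;
    };

    struct llist_head {
            struct lnode *first;
    };

    /* Push: retry until the head we sampled is still the head when we
     * swing it to the new node. */
    static void llist_push(struct llist_head *h, struct lnode *n)
    {
            struct lnode *old;

            do {
                    old = h->first;
                    n->next = old;
            } while (!__sync_bool_compare_and_swap(&h->first, old, n));
    }

    /* Detach the whole chain in one shot; the caller walks it privately.
     * Popping a single node this way would be exposed to ABA and would
     * need tagged pointers or deferred freeing, so it is not shown. */
    static struct lnode *llist_del_all(struct llist_head *h)
    {
            struct lnode *old;

            do {
                    old = h->first;
            } while (old && !__sync_bool_compare_and_swap(&h->first, old, NULL));
            return old;
    }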



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-11 Thread Krishna Kumar2
Hi Dave,

David Miller wrote on 10/10/2007 02:13:31 AM:

  Hopefully that new qdisc will just use the TX rings of the hardware
  directly. They are typically large enough these days. That might avoid
  some locking in this critical path.

 Indeed, I also realized last night that for the default qdiscs
 we do a lot of stupid useless work.  If the queue is a FIFO
 and the device can take packets, we should send it directly
 and not stick it into the qdisc at all.

Since you are talking of how it should be done in the *current* code,
I feel LLTX drivers will not work nicely with this.

Actually I was trying this change a couple of weeks back, but felt
that doing so would result in out-of-order packets (skbs present in
the qdisc which were not sent out due to an LLTX failure would be sent
out only at the next net_tx_action, while other skbs are sent ahead of
them).

One option is to first call qdisc_run() and then process this skb,
but that is ugly (requeue handling).

However I guess this can be done cleanly once LLTX is removed.

Thanks,

- KK
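
To make the ordering hazard concrete, here is a rough sketch of the bypass being discussed; qdisc_is_plain_fifo(), driver_xmit() and requeue_at_head() are made-up names for illustration, not actual kernel interfaces.

    /* Sketch of a dev_queue_xmit()-like fast path with a qdisc bypass. */
    static int xmit_or_queue(struct net_device *dev, struct Qdisc *q,
                             struct sk_buff *skb)
    {
            if (qdisc_is_plain_fifo(q) && skb_queue_empty(&q->q) &&
                !netif_queue_stopped(dev)) {
                    /* FIFO, nothing queued, device willing: bypass the qdisc. */
                    if (driver_xmit(dev, skb) == NETDEV_TX_OK)
                            return NETDEV_TX_OK;
                    /*
                     * An LLTX driver can still fail here (it takes its own
                     * lock and may find the ring full).  The skb then has to
                     * be requeued and waits for the next net_tx_action;
                     * packets taking the bypass in the meantime can overtake
                     * it, which is the out-of-order problem (and the ugly
                     * requeue handling) described above.
                     */
                    requeue_at_head(q, skb);
                    return NETDEV_TX_OK;
            }
            /* Otherwise enqueue into the qdisc and run it as today. */
            q->enqueue(skb, q);
            qdisc_run(dev);
            return NETDEV_TX_OK;
    }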



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen
 A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you

With TSO really? 

 increase the size much more performance starts to go down due to L2
 cache thrashing.

Another possibility would be to consider using cache avoidance
instructions while updating the TX ring (e.g. write combining 
on x86) 

-Andi
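
As a purely illustrative, userspace-syntax sketch of what the cache-avoidance idea might look like on x86, using the SSE2 non-temporal store intrinsics; a real driver would go through the kernel's own accessors, and the 16-byte descriptor layout here is made up.

    #include <emmintrin.h>          /* SSE2: _mm_stream_si128, _mm_sfence */

    /* Hypothetical 16-byte TX descriptor; the ring slot is assumed to be
     * 16-byte aligned, as descriptor rings normally are. */
    struct tx_desc {
            unsigned long long buf_addr;
            unsigned int       len;
            unsigned int       cmd_flags;
    };

    static void write_desc_nt(struct tx_desc *slot, const struct tx_desc *d)
    {
            /* Non-temporal store: the descriptor reaches the ring without
             * allocating the line in the CPU cache. */
            _mm_stream_si128((__m128i *)slot,
                             _mm_loadu_si128((const __m128i *)d));
            /* Order the store before the tail/doorbell write so the NIC
             * never sees a half-written descriptor. */
            _mm_sfence();
    }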



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 11:16:44 +0200

  A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
 
 With TSO really? 

Yes.

  increase the size much more performance starts to go down due to L2
  cache thrashing.
 
 Another possibility would be to consider using cache avoidance
 instructions while updating the TX ring (e.g. write combining 
 on x86) 

The chip I was working with at the time (UltraSPARC-IIi) compressed
all the linear stores into 64-byte full cacheline transactions via
the store buffer.

It's true that it would allocate in the L2 cache on a miss, which
is different from your suggestion.

In fact, such a thing might not pan out well, because most of the time
you write a single descriptor or two, and that isn't a full cacheline,
which means a read/modify/write is the only coherent way to make such
a write to RAM.

Sure you could batch, but I'd rather give the chip work to do unless
I unequivocally knew I'd have enough pending to fill a cacheline's
worth of descriptors.  And since you suggest we shouldn't queue in
software... :-)


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Herbert Xu
On Wed, Oct 10, 2007 at 11:16:44AM +0200, Andi Kleen wrote:
  A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
 
 With TSO really? 

Hardware queues are generally per-page rather than per-skb, so
they'd fill up quicker than a software queue even with TSO.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen
On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
 The chip I was working with at the time (UltraSPARC-IIi) compressed
 all the linear stores into 64-byte full cacheline transactions via
 the store buffer.

That's a pretty old CPU. Conclusions on more modern ones might be different.

 In fact, such a thing might not pan out well, because most of the time
 you write a single descriptor or two, and that isn't a full cacheline,
 which means a read/modify/write is the only coherent way to make such
 a write to RAM.

x86 WC does R-M-W and is coherent, of course. The main difference is
just that the result is not cached.  When the hardware accesses the cache
line, the cache should also be invalidated.

 Sure you could batch, but I'd rather give the chip work to do unless
 I unequivocably knew I'd have enough pending to fill a cacheline's
 worth of descriptors.  And since you suggest we shouldn't queue in
 software... :-)

Hmm, you're right, it probably would need to be coupled with batched
submission when multiple packets are available. Probably not worth doing
explicit queueing though.

I suppose it would be an interesting experiment at least.

-Andi


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 12:23:31 +0200

 On Wed, Oct 10, 2007 at 02:25:50AM -0700, David Miller wrote:
  The chip I was working with at the time (UltraSPARC-IIi) compressed
  all the linear stores into 64-byte full cacheline transactions via
  the store buffer.
 
 That's a pretty old CPU. Conclusions on more modern ones might be different.

Cache matters, just scale the numbers.

 I suppose it would be an interesting experiment at least.

Absolutely.

I've always gotten very poor results when increasing the TX queue a
lot; for example, with NIU the point of diminishing returns seems to
be in the range of 256-512 TX descriptor entries, and this was with
1.6GHz CPUs.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Andi Kleen
 We've done similar testing with ixgbe to push maximum descriptor counts,
 and we lost performance very quickly in the same range you're quoting on
 NIU.

Did you try it with WC writes to the ring or CLFLUSH?

-Andi


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Bill Fink
On Tue, 09 Oct 2007, David Miller wrote:

 From: jamal [EMAIL PROTECTED]
 Date: Tue, 09 Oct 2007 17:56:46 -0400
 
  if the h/ware queues are full because of link pressure etc, you drop. We
  drop today when the s/ware queues are full. The driver txmit lock takes
  place of the qdisc queue lock etc. I am assuming there is still need for
  that locking. The filter/classification scheme still works as is and
  select classes which map to rings. tc still works as is etc.
 
 I understand your suggestion.
 
 We have to keep in mind, however, that the sw queue right now is 1000
 packets.  I heavily discourage any driver author to try and use any
 single TX queue of that size.  Which means that just dropping on back
 pressure might not work so well.
 
 Or it might be perfect and signal TCP to backoff, who knows! :-)

I can't remember the details anymore, but for 10-GigE, I have encountered
cases where I was able to significantly increase TCP performance by
increasing the txqueuelen to 1, which is the setting I now use for
any 10-GigE testing.

-Bill


RE: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread Waskiewicz Jr, Peter P
 -Original Message-
 From: Andi Kleen [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, October 10, 2007 9:02 AM
 To: Waskiewicz Jr, Peter P
 Cc: David Miller; netdev@vger.kernel.org; [further recipients trimmed]
 Subject: Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net 
 core use batching
 
  We've done similar testing with ixgbe to push maximum descriptor 
  counts, and we lost performance very quickly in the same 
 range you're 
  quoting on NIU.
 
 Did you try it with WC writes to the ring or CLFLUSH?
 
 -Andi

Hmm, I think it might be slightly different, but it still shows queue
depth vs. performance.  I was actually referring to how many descriptors
we can represent a packet with before it becomes a problem wrt
performance.  This morning I tried to actually push my ixgbe NIC hard
enough to come close to filling the ring with packets (384-byte
packets), and even on my 8-core Xeon I can't do it.  My system can't
generate enough I/O to fill the hardware queues before CPUs max out.

-PJ Waskiewicz


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 09:08:48 -0400

 On Wed, 2007-10-10 at 03:44 -0700, David Miller wrote:
 
  I've always gotten very poor results when increasing the TX queue a
  lot, for example with NIU the point of diminishing returns seems to
  be in the range of 256-512 TX descriptor entries and this was with
  1.6Ghz cpus.
 
 Is it interrupt per packet? From my experience, you may find interesting
 results varying tx interrupt mitigation parameters in addition to the
 ring parameters.
 Unfortunately when you do that, optimal parameters also depend on
 packet size, so what may work for 64B won't work well for 1400B.

No, it was not interrupt per-packet, I was telling the chip to
interrupt me every 1/4 of the ring.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-10 Thread David Miller
From: Bill Fink [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 12:02:15 -0400

 On Tue, 09 Oct 2007, David Miller wrote:
 
  We have to keep in mind, however, that the sw queue right now is 1000
  packets.  I heavily discourage any driver author to try and use any
  single TX queue of that size.  Which means that just dropping on back
  pressure might not work so well.
  
  Or it might be perfect and signal TCP to backoff, who knows! :-)
 
 I can't remember the details anymore, but for 10-GigE, I have encountered
 cases where I was able to significantly increase TCP performance by
 increasing the txqueuelen to 1, which is the setting I now use for
 any 10-GigE testing.

For some reason this does not surprise me.

We bumped the ethernet default up to 1000 for gigabit.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Stephen Hemminger
On 09 Oct 2007 18:51:51 +0200
Andi Kleen [EMAIL PROTECTED] wrote:

 David Miller [EMAIL PROTECTED] writes:
  
  2) Switch the default qdisc away from pfifo_fast to a new DRR fifo
 with load balancing using the code in #1.  I think this is kind
 of in the territory of what Peter said he is working on.
 
 Hopefully that new qdisc will just use the TX rings of the hardware
 directly. They are typically large enough these days. That might avoid
 some locking in this critical path.
 
 I know this is controversial, but realistically I doubt users
 benefit at all from the prioritization that pfifo provides.
 
 I agree. For most interfaces the priority is probably dubious.
 Even for DSL the prioritization will likely be done in a router
 these days.

 Also, for the fast interfaces where we do TSO, priority doesn't work
 very well anyway -- with large packets there is not much
 to prioritize.
 
  3) Work on discovering a way to make the locking on transmit as
 localized to the current thread of execution as possible.  Things
 like RCU and statistic replication, techniques we use widely
 elsewhere in the stack, begin to come to mind.
 
 If the data is just passed on to the hardware queue, why is any 
 locking needed at all? (except for the driver locking of course)
 
 -Andi

I wonder about the whole idea of queueing in general at such high speeds.
Given the normal bi-modal distribution of packets and the predominance
of the 1500-byte MTU, does it make sense to even have any queueing in
software at all?


-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Andi Kleen
 I wonder about the whole idea of queueing in general at such high speeds.
 Given the normal bi-modal distribution of packets, and the predominance
 of 1500 byte MTU; does it make sense to even have any queueing in software
 at all?

Yes that is my point -- it should just pass it through directly
and the driver can then put it into the different per CPU (or per
whatever) queues managed by the hardware.

The only thing the qdisc needs to do is to set some bit that says
it is ok to put this into different queues; strict ordering is not needed.

Otherwise if the drivers did that unconditionally they might cause
problems with other qdiscs.

This would also require that the driver exports some hint 
to the upper layer on how large its internal queues are. A device
with a short queue would still require pfifo_fast. Long queue
devices could just pass through. That again could be a single flag.

-Andi
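
A rough sketch of the kind of single-flag hint described here; every name below is hypothetical (no such flags existed), it only illustrates the shape of the idea.

    /* Hypothetical capability bits a driver could advertise. */
    #define HWQ_HINT_DEEP       0x1  /* hw rings deep enough, sw queue adds nothing */
    #define HWQ_HINT_UNORDERED  0x2  /* ok to spread skbs over per-CPU/per-ring queues */

    /* Hypothetical mirror of the relevant net_device state. */
    struct netdev_sketch {
            unsigned int hwq_hints;
    };

    enum default_qdisc { QDISC_PFIFO_FAST, QDISC_PASSTHROUGH };

    static enum default_qdisc pick_default_qdisc(const struct netdev_sketch *dev)
    {
            if (dev->hwq_hints & HWQ_HINT_DEEP)
                    return QDISC_PASSTHROUGH;  /* long hw queue: hand skbs straight down */
            return QDISC_PFIFO_FAST;           /* short hw queue: keep the sw queue */
    }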


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: 09 Oct 2007 18:51:51 +0200

 Hopefully that new qdisc will just use the TX rings of the hardware
 directly. They are typically large enough these days. That might avoid
 some locking in this critical path.

Indeed, I also realized last night that for the default qdiscs
we do a lot of stupid useless work.  If the queue is a FIFO
and the device can take packets, we should send it directly
and not stick it into the qdisc at all.

 If the data is just passed on to the hardware queue, why is any 
 locking needed at all? (except for the driver locking of course)

Absolutely.

Our packet scheduler subsystem is great, but by default it should just
get out of the way.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Stephen Hemminger
On Tue, 09 Oct 2007 13:43:31 -0700 (PDT)
David Miller [EMAIL PROTECTED] wrote:

 From: Andi Kleen [EMAIL PROTECTED]
 Date: 09 Oct 2007 18:51:51 +0200
 
  Hopefully that new qdisc will just use the TX rings of the hardware
  directly. They are typically large enough these days. That might avoid
  some locking in this critical path.
 
 Indeed, I also realized last night that for the default qdiscs
 we do a lot of stupid useless work.  If the queue is a FIFO
 and the device can take packets, we should send it directly
 and not stick it into the qdisc at all.
 
  If the data is just passed on to the hardware queue, why is any 
  locking needed at all? (except for the driver locking of course)
 
 Absolutely.
 
 Our packet scheduler subsystem is great, but by default it should just
 get out of the way.

I was thinking why not have a default transmit queue len of 0 like
the virtual devices.

-- 
Stephen Hemminger [EMAIL PROTECTED]


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Stephen Hemminger [EMAIL PROTECTED]
Date: Tue, 9 Oct 2007 13:53:40 -0700

 I was thinking why not have a default transmit queue len of 0 like
 the virtual devices.

I'm not so sure.

Even if the device has huge queues I still think we need a software
queue for when the hardware one backs up.

It is even beneficial to stick with reasonably sized TX queues because
it keeps the total resident state accessed by the CPU within the
bounds of the L2 cache.  If you go past that, making the TX queue
larger actually hurts instead of helping, even if it means you never
hit back pressure.


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread jamal

On Tue, 2007-09-10 at 14:22 -0700, David Miller wrote:

 Even if the device has huge queues I still think we need a software
 queue for when the hardware one backs up.

It should be fine to just pretend the qdisc exists despite it sitting
in the driver, and not have s/ware queues at all, to avoid all the
challenges that qdiscs bring;
if the h/ware queues are full because of link pressure etc, you drop. We
drop today when the s/ware queues are full. The driver txmit lock takes
the place of the qdisc queue lock etc; I am assuming there is still need
for that locking. The filter/classification scheme still works as is and
selects classes which map to rings. tc still works as is etc.

cheers,
jamal



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: jamal [EMAIL PROTECTED]
Date: Tue, 09 Oct 2007 17:56:46 -0400

 if the h/ware queues are full because of link pressure etc, you drop. We
 drop today when the s/ware queues are full. The driver txmit lock takes
 place of the qdisc queue lock etc. I am assuming there is still need for
 that locking. The filter/classification scheme still works as is and
 select classes which map to rings. tc still works as is etc.

I understand your suggestion.

We have to keep in mind, however, that the sw queue right now is 1000
packets.  I heavily discourage any driver author from trying to use any
single TX queue of that size.  Which means that just dropping on back
pressure might not work so well.

Or it might be perfect and signal TCP to backoff, who knows! :-)

While working out this issue in my mind, it occurred to me that we
can put the sw queue into the driver as well.

The idea is that the network stack, as in the pure hw queue scheme,
unconditionally always submits new packets to the driver.  Therefore
even if the hw TX queue is full, the driver can still queue to an
internal sw queue with some limit (say 1000 for ethernet, as is used
now).

When the hw TX queue gains space, the driver self-batches packets
from the sw queue to the hw queue.

It sort of obviates the need for mid-level queue batching in the
generic networking.  Compared to letting the driver self-batch,
the mid-level batching approach is pure overhead.

We seem to be sort of all mentioning similar ideas.  For example, you
can get the above kind of scheme today by using a mid-level queue
length of zero, and I believe this idea was mentioned by Stephen
Hemminger earlier.

I may experiment with this in the NIU driver.
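
A rough driver-side sketch of that scheme, for illustration only: my_hw_ring_space(), my_hw_post() and my_hw_reclaim() are invented stand-ins for the hardware-specific parts, while the skb queue primitives are the ordinary kernel ones (swq is assumed to be set up with skb_queue_head_init() at probe time).

    struct my_tx {
            spinlock_t          lock;   /* guards hw enqueue and reclaim */
            struct sk_buff_head swq;    /* internal sw queue, capped below */
            /* ... hw ring bookkeeping ... */
    };

    #define MY_SWQ_LIMIT 1000           /* same order as today's txqueuelen */

    static int my_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            struct my_tx *tx = netdev_priv(dev);

            spin_lock(&tx->lock);
            if (skb_queue_empty(&tx->swq) && my_hw_ring_space(tx))
                    my_hw_post(tx, skb);            /* fast path: straight to hw */
            else if (skb_queue_len(&tx->swq) < MY_SWQ_LIMIT)
                    skb_queue_tail(&tx->swq, skb);  /* park in the sw queue */
            else
                    dev_kfree_skb_any(skb);         /* both full: drop on back pressure */
            spin_unlock(&tx->lock);

            return NETDEV_TX_OK;
    }

    /* TX completion: reclaim ring slots, then self-batch from sw to hw. */
    static void my_tx_clean(struct net_device *dev)
    {
            struct my_tx *tx = netdev_priv(dev);
            struct sk_buff *skb;

            spin_lock(&tx->lock);
            my_hw_reclaim(tx);
            while (my_hw_ring_space(tx) &&
                   (skb = skb_dequeue(&tx->swq)) != NULL)
                    my_hw_post(tx, skb);
            spin_unlock(&tx->lock);
    }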


Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread Andi Kleen
On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
 We have to keep in mind, however, that the sw queue right now is 1000
 packets.  I heavily discourage any driver author to try and use any
 single TX queue of that size.  

Why would you discourage them? 

If 1000 is ok for a software queue why would it not be ok
for a hardware queue?

 Which means that just dropping on back
 pressure might not work so well.
 
 Or it might be perfect and signal TCP to backoff, who knows! :-)

1000 packets is a lot. I don't have hard data, but my gut feeling
is that less would also do.

And if the hw queues are not enough, a better scheme might be to just
manage this in the sockets, in sendmsg: e.g. provide a wait queue that
drivers can wake up, and let senders block until there is more queue space.
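
That socket-level scheme could look roughly like the fragment below; my_hw_ring_space() is a hypothetical stand-in, the wait queue primitives are the standard kernel ones.

    /* Hypothetical per-device wait queue that senders sleep on. */
    static DECLARE_WAIT_QUEUE_HEAD(tx_space_wq);

    /* In the sendmsg() path: block until the driver has ring space
     * (a non-blocking socket would get -EAGAIN instead). */
    wait_event_interruptible(tx_space_wq, my_hw_ring_space(tx));

    /* In the driver's TX-completion handler, after reclaiming slots: */
    wake_up_interruptible(&tx_space_wq);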

 The idea is that the network stack, as in the pure hw queue scheme,
 unconditionally always submits new packets to the driver.  Therefore
 even if the hw TX queue is full, the driver can still queue to an
 internal sw queue with some limit (say 1000 for ethernet, as is used
 now).

 
 When the hw TX queue gains space, the driver self-batches packets
 from the sw queue to the hw queue.

I don't really see the advantage over the qdisc in that scheme.
It's certainly not simpler, probably more code, and would likely
not require fewer locks (e.g. a currently lockless driver
would need a new lock for its sw queue). Also it is unclear to me
that it would really be any faster.

-Andi



Re: [ofa-general] Re: [PATCH 2/3][NET_BATCH] net core use batching

2007-10-09 Thread David Miller
From: Andi Kleen [EMAIL PROTECTED]
Date: Wed, 10 Oct 2007 02:37:16 +0200

 On Tue, Oct 09, 2007 at 05:04:35PM -0700, David Miller wrote:
  We have to keep in mind, however, that the sw queue right now is 1000
  packets.  I heavily discourage any driver author to try and use any
  single TX queue of that size.  
 
 Why would you discourage them? 
 
 If 1000 is ok for a software queue why would it not be ok
 for a hardware queue?

Because with the software queue, you aren't accessing 1000 slots
shared with the hardware device which does shared-ownership
transactions on those L2 cache lines with the cpu.

Long ago I did a test on gigabit on a cpu with only 256K of
L2 cache.  Using a smaller TX queue make things go faster,
and it's exactly because of these L2 cache effects.

 1000 packets is a lot. I don't have hard data, but gut feeling 
 is less would also do.

I'll try to see how backlogged my 10Gb tests get when a strong
sender is sending to a weak receiver.

 And if the hw queues are not enough a better scheme might be to
 just manage this in the sockets in sendmsg. e.g. provide a wait queue that
 drivers can wake up and let them block on more queue.

TCP does this already, but it operates in a lossy manner.

 I don't really see the advantage over the qdisc in that scheme.
 It's certainly not simpler and probably more code and would likely
 also not require less locks (e.g. a currently lockless driver
 would need a new lock for its sw queue). Also it is unclear to me
 it would be really any faster.

You still need a lock to guard hw TX enqueue from hw TX reclaim.

A 256 entry TX hw queue fills up trivially on 1GB and 10GB, but if you
increase the size much more performance starts to go down due to L2
cache thrashing.