Re: Why do soft interrupt coelescing?

2001-10-17 Thread Kenneth D. Merry

On Mon, Oct 15, 2001 at 11:35:51 -0700, Terry Lambert wrote:
 Kenneth D. Merry wrote:
[ ... ]
   This is actually very bad: you want to drop packets before you
   insert them into the queue, rather than after they are in the
   queue.  This is because you want the probability of the drop
   (assuming the queue is not maxed out: otherwise, the probability
   should be 100%) to be proportional to the exponential moving
   average of the queue depth, after that depth exceeds a drop
   threshold.  In other words, you want to use RED.
  
  Which queue?  The packets are dropped before they get to ether_input().
 
 The easy answer is any queue, since what you are becoming
 concerned with is pool retention time: you want to throw
 away packets before a queue overflow condition makes it so
 you are getting more than you can actually process.

Ahh.

[ ...RED... ]

   Maybe I'm being harsh in calling it spam'ming.  It does the
   wrong thing, by dropping the oldest unprocessed packets first.
   A FIFO drop is absolutely the wrong thing to do in an attack
   or overload case, when you want to shed load.  I consider the
   packet that is being dropped to have been spam'med by the
   card replacing it with another packet, rather than dropping
   the replacement packet instead.
  
   The real place for this drop is before it gets to card memory,
   not after it is in host memory; Floyd, Jacobson, Mogul, etc.,
   all agree on that.
  
  As I mentioned above, how would you do that without some sort of traffic
  shaper on the wire?
 
 The easiest answer is to RED queue in the card firmware.

Ahh, but then the packets are likely going to be in card memory already.
A card with a reasonable amount of cache (e.g. the Tigon II) and onboard
firmware will probably dump the packet into an already-set-up spot in
memory.

You'd probably need a special mode of interaction between the firmware and
the packet receiving hardware to tell it when to drop packets.

  My focus with gigabit ethernet was to get maximal throughput out of a small
  number of connections.  Dealing with a large number of connections is a
  different problem, and I'm sure it uncovers lots of nifty bugs.
 
 8-).  I guess that you are more interested in intermediate hops
 and site to site VPN, while I'm more interested in connection
 termination (big servers, SSL termination, and single client VPN).

Actually, the specific application I worked on for my former employer was
moving large amounts (many gigabytes) of video at high speed via FTP
between FreeBSD-based video servers.  (And between the video servers and
video editor PCs, and data backup servers.)

It actually worked fairly well, and is in production at a number of TV
stations now. :)

   I'd actually prefer to avoid the other DMA; I'd also like
   to avoid the packet receipt order change that results from
   DMA'ing over the previous contents, in the case that an mbuf
   can't be allocated.  I'd rather just let good packets in with
   a low (but non-zero) statistical probability, relative to a
   slew of bad packets, rather than letting a lot of bad packets
   from a persistent attacker push my good data out with the bad.
  
  Easier said than done -- dumping random packets would be difficult with a
  ring-based structure.  Probably what you'd have to do is have an extra pool
  of mbufs lying around that would get thrown in at random times when mbufs
  run out to allow some packets to get through.
  
  The problem is, once you exhaust that pool, you're back to the same old
  problem if you're completely out of mbufs.
  
  You could probably also start shrinking the number of buffers in the ring,
  but as I said before, you'd need a method for the kernel to notify the
  driver that more mbufs are available.
 
 You'd be better off shrinking the window size across all
 the connections, I think.
 
 As to difficult to do, I actually have RED queue code, which
 I adapted from the formula in a paper.  I have no problem
 giving that code out.
 
 The real issue is that the BSD queue macros involved in the
 queues really need to be modified to maintain a count of the
 elements on the queue, for the calculation of the moving average.
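
A minimal sketch of what such a counted queue might look like (this is
not the RED code mentioned in the thread; the cq_* names, the 1/8
averaging weight, and the fixed-point scaling are all invented for
illustration, since the kernel has no floating point):

#include <sys/queue.h>

#define CQ_EWMA_SHIFT   3       /* averaging weight = 1/8 */

struct cq_head {
        TAILQ_HEAD(, cq_entry) cq_list;
        int     cq_len;         /* instantaneous element count */
        int     cq_avg_fp;      /* EWMA of cq_len, scaled by 2^CQ_EWMA_SHIFT */
};

struct cq_entry {
        TAILQ_ENTRY(cq_entry) cq_link;
};

/* Insert at the tail, bump the count, and update the moving average
 * in integer arithmetic: avg += (len - avg) / 8, in scaled units. */
#define CQ_INSERT_TAIL(head, elm) do {                                  \
        TAILQ_INSERT_TAIL(&(head)->cq_list, (elm), cq_link);            \
        (head)->cq_len++;                                               \
        (head)->cq_avg_fp += (((head)->cq_len << CQ_EWMA_SHIFT) -       \
            (head)->cq_avg_fp) >> CQ_EWMA_SHIFT;                        \
} while (0)

/* Remove from the head and decrement the count. */
#define CQ_REMOVE_HEAD(head, elm) do {                                  \
        (elm) = TAILQ_FIRST(&(head)->cq_list);                          \
        if ((elm) != NULL) {                                            \
                TAILQ_REMOVE(&(head)->cq_list, (elm), cq_link);         \
                (head)->cq_len--;                                       \
        }                                                               \
} while (0)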

[   ]

   OK, I will rediff and generate context diffs; expect them to
   be sent in 24 hours or so from now.
  
  It's been longer than that...
 
 Sorry; I've been doing a lot this weekend.  I will redo them
 at work today, and resend them tonight... definitely.
 
 
Generally the ICMP response tells you how big the maximum MTU is, so you
don't have to guess.
  
   Maybe it's the ICMP response; I still haven't had a chance to
   hold Michael down and drag the information out of him.  8-).
  
  Maybe what's the ICMP response?
 
 The difference between working and not working.

Yes, with TCP, the ICMP response is required in order for the path MTU (or
rather MSS) to be autonegotiated properly.  It'll work without the ICMP
response assuming that the minimum of the MTUs on either end is less 

Re: Why do soft interrupt coelescing?

2001-10-15 Thread Terry Lambert

Kenneth D. Merry wrote:
 Dropping packets before they get into card memory would only be possible
 with some sort of traffic shaper/dropping mechanism on the wire to drop
 things before they get to the card at all.

Actually, DEC had a congestion control mechanism that worked by
marking all packets over a certain level of congestion (this
was sometimes called the DECbit approach).  You can do the
same thing with any intermediate hop router, so long as it is
better at moving packets than your destination host.

It turns out that even if the intermediate hop and the host at
the destination are running the same hardware and OS, the cost
is going to be higher to do the terminal processing than it is
to do the network processing, so you are able to use the tagging
to indicate to the terminal hop which flows to drop packets out
of before processing.

Cisco routers can do this (using the CMU firmware) on a per
flow basis, leaving policy up to the end node.  Very neat.
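
Roughly, the marking decision looks like this (a self-contained
illustration, not DEC's or Cisco's actual implementation; the struct
and field names, and the one-packet threshold, are invented for the
example):

struct fwd_queue {
        int     q_len;          /* instantaneous queue length */
        int     q_avg_x8;       /* smoothed queue length, scaled by 8 */
};

struct fwd_pkt {
        int     ce_bit;         /* congestion-experienced mark */
        /* ... headers, payload ... */
};

static void
decbit_mark(struct fwd_queue *q, struct fwd_pkt *p)
{
        /* EWMA of the queue length, weight 1/8, integers only. */
        q->q_avg_x8 += (q->q_len * 8 - q->q_avg_x8) / 8;

        /* Average depth of one packet or more: set the bit rather
         * than dropping, and let the end node decide what to do. */
        if (q->q_avg_x8 >= 8)
                p->ce_bit = 1;
}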


[ ... per connection overhead, and overcommit ... ]

 You could always just put 4G of RAM in the machine, since memory is so
 cheap now. :)
 
 At some point you'll hit a limit in the number of connections the processor
 can actually handle.

In practice, particularly for HTTP or FTP flows, you can
halve the amount of memory expected to be used.  This is
because the vast majority of the data is generally pushed
in only one direction.

For HTTP 1.1 persistent connections, you can, for the most
part, also assume that the connections are bursty -- that
is, that there is a human attached to the other end, who
will spend some time examining the content before making
another request (you can assume the same thing for 1.0, but
that doesn't count against persistent connection count,
unless you also include time spent in TIME_WAIT).

So overcommit turns out to be O.K. -- which is what I was
trying to say in a back-handed way, in the last post.

If you include window control (i.e. you care about overall
throughput, and not about individual connections), then
you can safely service 1,000,000 connections with 4G on a
FreeBSD box.


  This is actually very bad: you want to drop packets before you
  insert them into the queue, rather than after they are in the
  queue.  This is because you want the probability of the drop
  (assuming the queue is not maxed out: otherwise, the probability
  should be 100%) to be proportional to the exponential moving
  average of the queue depth, after that depth exceeds a drop
  threshold.  In other words, you want to use RED.
 
 Which queue?  The packets are dropped before they get to ether_input().

The easy answer is any queue, since what you are becoming
concerned with is pool retention time: you want to throw
away packets before a queue overflow condition makes it so
you are getting more than you can actually process.


 Dropping random packets would be difficult.

The R in RED is Random, as in Random Early Detection (or
Random Early Drop, in a minority of the literature), true.

But the randomness involved is whether you drop vs. insert a
given packet, not whether you drop a random packet from the
queue contents.  Really dropping random queue elements would
be incredibly bad.

The problem is that, during an attack, the number of packets
you get is proportionally huge, compared to the non-attack
packets (the ones you want to let through).  A RED approach
will prevent new packets being enqueued: it protects the host
system's ability to continue degraded processing, by making
the link appear lossy -- the closer to queue full, the
more lossy the link.

If you were to drop random packets already in the queue,
then the proportional probability of dumping good packets is
equal to the queue depth times the number of bad packets
divided by the number of total packets.  In other words, a
continuous attack will almost certainly push all good packets
out of the queue before they reach the head.

Dropping packets prior to insertion maintains the ratio of
bad and good packets, so it doesn't inflate the danger to
the good packets by the relative queue depth: thus dropping
before insertion is a significantly better strategy than
dropping after insertion, for any queue depth over 1.
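
The drop decision itself is simple; something like the following (a
minimal sketch of the idea, not an actual RED implementation -- the
threshold names, the percentage scaling, and the use of the kernel's
random() are just for illustration):

/*
 * Decide drop vs. insert for one arriving packet, given the current
 * moving average of the queue depth.  The drop probability rises
 * linearly from 0 at min_th to max_p_percent at max_th; past max_th
 * every packet is dropped.
 */
static int
red_should_drop(int avg_depth, int min_th, int max_th, int max_p_percent)
{
        int p;

        if (avg_depth < min_th)
                return (0);             /* plenty of room: always enqueue */
        if (avg_depth >= max_th)
                return (1);             /* effectively full: always drop */

        p = max_p_percent * (avg_depth - min_th) / (max_th - min_th);
        return ((random() % 100) < p);
}

The caller applies this before insertion: if red_should_drop() says so,
free the mbuf and return; only otherwise enqueue it.  That way the
good/bad packet ratio in the queue is preserved, as argued above.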

  Maybe I'm being harsh in calling it spam'ming.  It does the
  wrong thing, by dropping the oldest unprocessed packets first.
  A FIFO drop is absolutely the wrong thing to do in an attack
  or overload case, when you want to shed load.  I consider the
  packet that is being dropped to have been spam'med by the
  card replacing it with another packet, rather than dropping
  the replacement packet instead.
 
  The real place for this drop is before it gets to card memory,
  not after it is in host memory; Floyd, Jacobson, Mogul, etc.,
  all agree on that.
 
 As I mentioned above, how would you do that without some sort of traffic
 shaper on the wire?

The easiest answer is to RED queue in the card firmware.


 My focus with gigabit ethernet was to get 

Re: Why do soft interrupt coelescing?

2001-10-14 Thread Kenneth D. Merry

On Thu, Oct 11, 2001 at 01:02:09 -0700, Terry Lambert wrote:
 Kenneth D. Merry wrote:
  If the receive ring for that packet size is full, it will hold off on
  DMAs.  If all receive rings are full, there's no reason to send more
  interrupts.
 
 I think that this does nothing, in the FreeBSD case, since the
 data from the card will generally be drained much faster than
 it accrues, into the input queue.  Whether it gets processed
 out of there before you run out of mbufs is another matter.
 
 [ ... ]
 
  Anyway, if all three rings fill up, then yes, there won't be a reason to
  send receive interrupts.
 
 I think this can't really happen, since interrupt processing
 has the highest priority, compared to stack processing or
 application level processing.  8-(.

Yep, it doesn't happen very often in the default case.

   OK, assuming you meant that the copies would stall, and the
   data not be copied (which is technically the right thing to
   do, assuming a source quench style livelock avoidance, which
   doesn't currently exist)...
  
  The data isn't copied, it's DMAed from the card to host memory.  The card
  will save incoming packets to a point, but once it runs out of memory to
  store them it starts dropping packets altogether.
 
 I think that the DMA will not be stalled, at least as the driver
 currently exists; you and I agreed on that already (see below).
 My concern in this case is that, if the card is using the bus to
 copy packets from card memory to the receive ring, then the bus
 isn't available for other work, which is bad.  It's better to
 drop the packets before putting them in card memory (FIFO drop
 fails to avoid the case where a continuous attack pushes all
 good packets out).

Dropping packets before they get into card memory would only be possible
with some sort of traffic shaper/dropping mechanism on the wire to drop
things before they get to the card at all.

   The problem is still that you end up doing interrupt processing
   until you run out of mbufs, and then you have the problem of
   not being able to transmit responses, for lack of mbufs.
  
  In theory you would have configured your system with enough mbufs
  to handle the situation, and the slowness of the system would cause
  the windows on the sender to fill up, so they'll stop sending data
  until the receiver starts responding again.  That's the whole purpose
  of backoff and slow start -- to find a happy medium for the
  transmitter and receiver so that data flows at a constant rate.
 
 In practice, mbuf memory is just as overcommitted as all other
 memory, and given a connection count target, you are talking a
 full transmit and full receive window worth of data at 16k a
 pop -- 32k per connection.
 
 Even a modest maximum connection count of ~30,000 connections --
 something even an unpatched 4.3 FreeBSD could handle -- means
 that you need 1G of RAM for the connections alone, if you disallow
 overcommit.  In practice, that would mean ~20,000 connections,
 when you count page table entries, open file table entries, vnodes,
 inpcb's, tcpcb's, etc.  And that's a generous estimate, which
 assumes that you tweak your kernel properly.
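
Spelling out the arithmetic behind that estimate: 30,000 connections
x (16k send + 16k receive) = 30,000 x 32k, or roughly 0.92G -- call it
1G.  If the flows are essentially one-directional (the HTTP/FTP case
mentioned elsewhere in the thread), only one window's worth is really
in play, which is where the "halve the memory" figure comes from; and
4G/32k is only about 131,000 connections, so the much larger connection
counts discussed elsewhere in the thread assume window control shrinks
the effective per-connection buffering well below 32k.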

You could always just put 4G of RAM in the machine, since memory is so
cheap now. :)

At some point you'll hit a limit in the number of connections the processor
can actually handle.

 One approach to this is to control the window sizes based on
 the amount of free reserve you have available, but this will
 actually damage overall throughput, particularly on links
 with a higher latency.

Yep.

   In the ti driver case, the inability to get another mbuf to
   replace the one that will be taken out of the ring means that
   the mbuf gets reused for more data -- NOT that the data flow
   in the form of DMA from the card ends up being halted until
   mbufs become available.
  
  True.
 
 This is actually very bad: you want to drop packets before you
 insert them into the queue, rather than after they are in the
 queue.  This is because you want the probability of the drop
 (assuming the queue is not maxed out: otherwise, the probability
 should be 100%) to be proportional to the exponential moving
 average of the queue depth, after that depth exceeds a drop
 threshold.  In other words, you want to use RED.

Which queue?  The packets are dropped before they get to ether_input().

Dropping random packets would be difficult.

   Please look at what happens in the case of an allocation
   failure, for any driver that does not allow shrinking the
   ring of receive mbufs (the ti is one example).
  
  It doesn't spam things, which is what you were suggesting before, but
  as you pointed out, it will effectively drop packets if it can't get new
  mbufs.
 
 Maybe I'm being harsh in calling it spam'ming.  It does the
 wrong thing, by dropping the oldest unprocessed packets first.
 A FIFO drop is absolutely the wrong thing to do in an attack
 or overload case, when you want to shed load.  I consider 

Re: Why do soft interrupt coelescing?

2001-10-11 Thread Terry Lambert

Kenneth D. Merry wrote:
 If the receive ring for that packet size is full, it will hold off on
 DMAs.  If all receive rings are full, there's no reason to send more
 interrupts.

I think that this does nothing, in the FreeBSD case, since the
data from the card will generally be drained much faster than
it accrues, into the input queue.  Whether it gets processed
out of there before you run out of mbufs is another matter.

[ ... ]

 Anyway, if all three rings fill up, then yes, there won't be a reason to
 send receive interrupts.

I think this can't really happen, since interrupt processing
has the highest priority, compared to stack processing or
application level processing.  8-(.


  OK, assuming you meant that the copies would stall, and the
  data not be copied (which is technically the right thing to
  do, assuming a source quench style livelock avoidance, which
  doesn't currently exist)...
 
 The data isn't copied, it's DMAed from the card to host memory.  The card
 will save incoming packets to a point, but once it runs out of memory to
 store them it starts dropping packets altogether.

I think that the DMA will not be stalled, at least as the driver
currently exists; you and I agreed on that already (see below).
My concern in this case is that, if the card is using the bus to
copy packets from card memory to the receive ring, then the bus
isn't available for other work, which is bad.  It's better to
drop the packets before putting them in card memory (FIFO drop
fails to avoid the case where a continuous attack pushes all
good packets out).


  The problem is still that you end up doing interrupt processing
  until you run out of mbufs, and then you have the problem of
  not being able to transmit responses, for lack of mbufs.
 
 In theory you would have configured your system with enough mbufs
 to handle the situation, and the slowness of the system would cause
 the windows on the sender to fill up, so they'll stop sending data
 until the receiver starts responding again.  That's the whole purpose
 of backoff and slow start -- to find a happy medium for the
 transmitter and receiver so that data flows at a constant rate.

In practice, mbuf memory is just as overcommitted as all other
memory, and given a connection count target, you are talking a
full transmit and full receive window worth of data at 16k a
pop -- 32k per connection.

Even a modest maximum connection count of ~30,000 connections --
something even an unpatched 4.3 FreeBSD could handle -- means
that you need 1G of RAM for the connections alone, if you disallow
overcommit.  In practice, that would mean ~20,000 connections,
when you count page table entries, open file table entries, vnodes,
inpcb's, tcpcb's, etc.  And that's a generous estimate, which
assumes that you tweak your kernel properly.

One approach to this is to control the window sizes based on
the amount of free reserve you have available, but this will
actually damage overall throughput, particularly on links
with a higher latency.


  In the ti driver case, the inability to get another mbuf to
  replace the one that will be taken out of the ring means that
  the mbuf gets reused for more data -- NOT that the data flow
  in the form of DMA from the card ends up being halted until
  mbufs become available.
 
 True.

This is actually very bad: you want to drop packets before you
insert them into the queue, rather than after they are in the
queue.  This is because you want the probability of the drop
(assuming the queue is not maxed out: otherwise, the probability
should be 100%) to be proportional to the exponential moving
average of the queue depth, after that depth exceeds a drop
threshold.  In other words, you want to use RED.


  Please look at what happens in the case of an allocation
  failure, for any driver that does not allow shrinking the
  ring of receive mbufs (the ti is one example).
 
 It doesn't spam things, which is what you were suggesting before, but
 as you pointed out, it will effectively drop packets if it can't get new
 mbufs.

Maybe I'm being harsh in calling it spam'ming.  It does the
wrong thing, by dropping the oldest unprocessed packets first.
A FIFO drop is absolutely the wrong thing to do in an attack
or overload case, when you want to shed load.  I consider the
packet that is being dropped to have been spam'med by the
card replacing it with another packet, rather than dropping
the replacement packet instead.

The real place for this drop is before it gets to card memory,
not after it is in host memory; Floyd, Jacobson, Mogul, etc.,
all agree on that.


 Yes, it could shrink the pool, by just not replacing those mbufs in the
 ring (and therefore not notifying the card that that slot is available
 again), but then it would likely need some mechanism to allow it to be
 notified that another buffer is available for it, so it can then allocate
 receive buffers again.
 
 In practice I haven't found the number of mbufs in the system to be a
 problem 

Re: Why do soft interrupt coelescing?

2001-10-10 Thread Terry Lambert

Kenneth D. Merry wrote:
 eh?  The card won't write past the point that has been acked by the kernel.
 If the kernel hasn't acked the packets and one of the receive rings fills
 up, the card will hold off on sending packets up to the kernel.

Uh, eh?

You mean the card will hold off on DMA and interrupts?  This
has not been my experience.  Is this with firmware other than
the default and/or the 4.3-RELEASE FreeBSD driver?


 I agree that you can end up spending large portions of your time doing
 interrupt processing, but I haven't seen instances of buffer overrun, at
 least not in the kernel.  The case where you'll see a buffer overrun, at
 least with the ti(4) driver, is when you have a sender that's faster than
 the receiver.
 
 So the receiver can't process the data in time and the card just drops
 packets.

OK, assuming you meant that the copies would stall, and the
data not be copied (which is technically the right thing to
do, assuming a source quench style livelock avoidance, which
doesn't currently exist)...

The problem is still that you end up doing interrupt processing
until you run out of mbufs, and then you have the problem of
not being able to transmit responses, for lack of mbufs.

In the ti driver case, the inability to get another mbuf to
replace the one that will be taken out of the ring means that
the mbuf gets reused for more data -- NOT that the data flow
in the form of DMA from the card ends up being halted until
mbufs become available.

The real problem here is that most received packets want a
response; for things like web servers, where the average
request is ~.5k and the average response is ~11k, this means
that you would need to establish use-based watermarking, to
separate the mbuf pool into transmit and receive resources;
in practice, this doesn't really work, if you are getting
your content from a separate server (e.g. an NFS server that
provides content for a web farm, etc.).


 That's a different situation from the card spamming the receive
 ring over and over again, which is what you're describing.  I've
 never seen that happen, and if it does actually happen, I'd be
 interested in seeing evidence.

Please look at what happens in the case of an allocation
failure, for any driver that does not allow shrinking the
ring of receive mbufs (the ti is one example).


  Without hacking firmware, the best you can do is to ensure
  that you process as much of all the traffic as you possibly
  can, and that means avoiding livelock.
 
 Uhh, the Tigon firmware *does* drop packets when there is no
 more room in the proper receive ring on the host side.  It
 doesn't spam things.
 
 What gives you that idea?  You've really got some strange ideas
 about what goes on with that board.  Why would someone design
 firmware so obviously broken?

The driver does it on purpose, by not giving away the mbuf
in the receive ring, until it has an mbuf to replace it.
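
In code terms, the receive path does something along these lines (a
paraphrase of the pattern, not the literal ti_rxeof() source; the
helper names ti_newbuf_reuse() and ti_newbuf_install() are invented
here, and the real driver folds this into its ti_newbuf_*() routines):

struct mbuf *m_new = NULL;

/* Try to get a replacement mbuf + cluster for the ring slot. */
MGETHDR(m_new, M_DONTWAIT, MT_DATA);
if (m_new != NULL) {
        MCLGET(m_new, M_DONTWAIT);
        if ((m_new->m_flags & M_EXT) == 0) {
                m_freem(m_new);
                m_new = NULL;
        }
}

if (m_new == NULL) {
        /* No replacement: leave the old mbuf in the slot and hand it
         * back to the card, i.e. the packet that just arrived is
         * dropped rather than the ring being shrunk. */
        ifp->if_ierrors++;
        ti_newbuf_reuse(sc, idx, m_old);        /* invented name */
} else {
        /* Replacement secured: the full mbuf goes up the stack and
         * the fresh one takes its place in the ring. */
        ti_newbuf_install(sc, idx, m_new);      /* invented name */
        ether_input(ifp, eh, m_old);
}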

Maybe this should be rewritten to not mark the packet as
received, and thus allow the card to overwrite it.  There
are two problems with that approach however.  The first is
what happens when you reach mbuf exhaustion, and the only
way you can clear out received mbufs is to process the data
in a user space application which never gets to run, and when
it does get to run, can't write a response for a request and
response protocol, such that it can't free up any mbufs?  The
second is that, in the face of a denial of service attack,
the correct approach (according to Van Jacobson) is to do a
random drop, and rely on the fact that the attack packets,
being proportionally more of the queue contents, get dropped
with a higher probability... so while you _can_ do this, it
is really a bad idea, if you are trying to make your stack
robust against attacks.

The other thing that you appear to be missing is that the
most common case is that you have plenty of mbufs, and you
keep getting interrupts, replacing the mbufs in the receive
ring, and pushing the data into the ether input by giving
away the full mbufs.

The problem occurs when you are receiving at such a high rate
that you don't have any free cycles to run NETISR, and thus
you can not empty the queue from which ipintr is called with
data.

In other words, it's not really the card's fault that the OS
didn't run the stack at hardware interrupt.

  What this means is that you get more benefit in the soft
  interrupt coalescing than you otherwise would get, when
  you are doing LRP.
 
  But, you do get *some* benefit from doing it anyway, even
  if your ether input processing is light: so long as it is
  non-zero, you get benefit.
 
  Note that LRP itself is not a panacea for livelock, since
  it just moves the scheduling problem from the IRQ-NETISR
  scheduling into the NETISR-process scheduling.  You end
  up needing to implement weighted fair share or other code
  to ensure that the user space process is permitted to run,
  so you end up monitoring queue depth or something else,
  and deciding 

Re: Why do soft interrupt coelescing?

2001-10-10 Thread Kenneth D. Merry

On Wed, Oct 10, 2001 at 01:59:48 -0700, Terry Lambert wrote:
 Kenneth D. Merry wrote:
  eh?  The card won't write past the point that has been acked by the kernel.
  If the kernel hasn't acked the packets and one of the receive rings fills
  up, the card will hold off on sending packets up to the kernel.
 
 Uh, eh?
 
 You mean the card will hold off on DMA and interrupts?  This
 has not been my experience.  Is this with firmware other than
 the default and/or the 4.3-RELEASE FreeBSD driver?

If the receive ring for that packet size is full, it will hold off on
DMAs.  If all receive rings are full, there's no reason to send more
interrupts.

Keep in mind that there are three receive rings by default on the Tigon
boards -- mini, standard and jumbo.  The size of the buffers in each ring
is configurable, but basically all packets smaller than a certain size will
get routed into the mini ring.  All packets larger than a certain size will
get routed into the jumbo ring.  All packets in between will get routed
into the standard ring.

If there isn't enough space in the mini or jumbo rings for a packet, it'll
get routed into the standard ring if there is space there.  (In the case of
a jumbo packet, it may take up multiple buffers on the ring.)
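
In pseudocode, the selection policy amounts to something like this (the
decision is actually made by the firmware on the card; the cutoffs and
names below are made up purely to illustrate the policy described
above, not taken from the firmware):

enum rx_ring { RING_NONE, RING_MINI, RING_STD, RING_JUMBO };

#define MINI_CUTOFF     192     /* example only */
#define STD_CUTOFF      1536    /* example only */

static enum rx_ring
pick_rx_ring(int pktlen, int mini_free, int std_free, int jumbo_free)
{
        if (pktlen <= MINI_CUTOFF && mini_free > 0)
                return (RING_MINI);
        if (pktlen > STD_CUTOFF && jumbo_free > 0)
                return (RING_JUMBO);
        /* Everything in between, plus overflow from the mini and
         * jumbo cases, lands in the standard ring if it has room. */
        if (std_free > 0)
                return (RING_STD);
        return (RING_NONE);     /* all full: hold off / drop */
}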

Anyway, if all three rings fill up, then yes, there won't be a reason to
send receive interrupts.

  I agree that you can end up spending large portions of your time doing
  interrupt processing, but I haven't seen instances of buffer overrun, at
  least not in the kernel.  The case where you'll see a buffer overrun, at
  least with the ti(4) driver, is when you have a sender that's faster than
  the receiver.
  
  So the receiver can't process the data in time and the card just drops
  packets.
 
 OK, assuming you meant that the copies would stall, and the
 data not be copied (which is technically the right thing to
 do, assuming a source quench style livelock avoidance, which
 doesn't currently exist)...

The data isn't copied, it's DMAed from the card to host memory.  The card
will save incoming packets to a point, but once it runs out of memory to
store them it starts dropping packets altogether.

 The problem is still that you end up doing interrupt processing
 until you run out of mbufs, and then you have the problem of
 not being able to transmit responses, for lack of mbufs.

In theory you would have configured your system with enough mbufs to handle
the situation, and the slowness of the system would cause the windows on
the sender to fill up, so they'll stop sending data until the receiver
starts responding again.  That's the whole purpose of backoff and slow
start -- to find a happy medium for the transmitter and receiver so that
data flows at a constant rate.

 In the ti driver case, the inability to get another mbuf to
 replace the one that will be taken out of the ring means that
 the mbuf gets reused for more data -- NOT that the data flow
 in the form of DMA from the card ends up being halted until
 mbufs become available.

True.

 The real problem here is that most received packets want a
 response; for things like web servers, where the average
 request is ~.5k and the average response is ~11k, this means
 that you would need to establish use-based watermarking, to
 separate the mbuf pool into transmit and receive resources;
 in practice, this doesn't really work, if you are getting
 your content from a separate server (e.g. an NFS server that
 provides content for a web farm, etc.).
 
 
  That's a different situation from the card spamming the receive
  ring over and over again, which is what you're describing.  I've
  never seen that happen, and if it does actually happen, I'd be
  interested in seeing evidence.
 
 Please look at what happens in the case of an allocation
 failure, for any driver that does not allow shrinking the
 ring of receive mbufs (the ti is one example).

It doesn't spam things, which is what you were suggesting before, but
as you pointed out, it will effectively drop packets if it can't get new
mbufs.

Yes, it could shrink the pool, by just not replacing those mbufs in the
ring (and therefore not notifying the card that that slot is available
again), but then it would likely need some mechanism to allow it to be
notified that another buffer is available for it, so it can then allocate
receive buffers again.

In practice I haven't found the number of mbufs in the system to be a
problem that would really make this a big issue.  I generally configure the
number of mbufs to be high enough that it isn't a problem in the first
place.
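
For reference, on 4.x-era FreeBSD that tuning is typically done with
something like the following (exact knob availability varies by
release, so treat these as examples rather than exact recipes):

    # kernel config file: reserve more mbuf clusters at build time
    options         NMBCLUSTERS=32768

or, on releases that support the loader tunable:

    # /boot/loader.conf
    kern.ipc.nmbclusters="32768"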

   Without hacking firmware, the best you can do is to ensure
   that you process as much of all the traffic as you possibly
   can, and that means avoiding livelock.
  
  Uhh, the Tigon firmware *does* drop packets when there is no
  more room in the proper receive ring on the host side.  It
  doesn't spam things.
  
  What gives you that idea?  You've really got some strange ideas
  about what goes on 

Re: Why do soft interrupt coelescing?

2001-10-09 Thread Terry Lambert

Kenneth D. Merry wrote:
[ ... soft interrupt coalescing ... ]

 As you say above, this is actually a good thing.  I don't see how this ties
 into the patch to introduce some sort of interrupt coalescing into the
 ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
 on the board to do what you want.

I have tweaked all tunables on the board, and I have not gotten
anywhere near the increased performance.

The limit on how far you can push this is based on how much
RAM you can have on the card, and the limits to coalescing.

Here's the reason: when you receive packets to the board, they
get DMA'ed into the ring.  No matter how large the ring, it
won't matter, if the ring is not being emptied asynchronously
relative to it being filled.

In the case of a full-on receiver livelock situation, the ring
contents will be continuously overwritten.  This is actually
what happens when you put a ti card into a machine with a
slower processor, and hit it hard.

In the case of interrupt processing, where you jam the data up
through ether input at interrupt time, the buffer will be able
to potentially overrun, as well.  Admittedly, you can spend a
huge percentage of your CPU time in interrupt processing, and
if your CPU is fast enough, unload the queue very quickly.

But if you then look at doing this for multiple gigabit cards
at the same time, you quickly reach the limits... and you
spend so much of your time in interrupt processing, that you
spend none running NETISR.

So you have moved your livelock up one layer.


In any case, doing the coalescing on the board delays the
packet processing until that number of packets has been
received, or a timer expires.  The timer latency must be
increased proportionally to the maximum number of packets
that you coalesce into a single interrupt.

In other words, you do not interleave your I/O when you
do this, and the bursty conditions that result in your
coalescing window ending up full or close to full are
the conditions under which you should be attempting the
maximum concurrency you can possibly attain.

Basically, in any case where the load is high enough to
trigger the hardware coalescing, the ring would need to
be the next power of two larger to ensure that the end
does not overwrite the beginning of the ring.

In practice, the firmware on the card does not support
this, so what you do instead is push a couple of packets
that may have been corrupted through DMA occurring during
the fact -- in other words, you drop packets.

This is arguably correct, in that it permits you to shed
load, _but_ the DMAs still occur into your rings; it would
be much better if the load were shed by the card firmware,
based on some knowledge of ring depth instead (RED Queueing),
since this would leave the bus clear for other traffic (e.g.
communication with main memory to provide network content for
the cards for, e.g., an NFS server, etc.).

Without hacking firmware, the best you can do is to ensure
that you process as much of all the traffic as you possibly
can, and that means avoiding livelock.


[ ... LRP ... ]

 That sounds cool, but I still don't see how this ties into the patch you
 sent out.

OK.  LRP removes NETISR entirely.

This is the approach Van Jacobson stated he used in his
mythical TCP/IP stack, which we may never see.

What this does is push the stack processing down to the
interrupt time for the hardware interrupt.  This is a
good idea, in that it avoids the livelock for the NETISR
never running because you are too busy taking hardware
interrupts to be able to do any stack processing.

The way this ties into the patch is that doing the stack
processing at interrupt time increases the per-ether-input
processing cycle overhead.

What this means is that you get more benefit in the soft
interrupt coalescing than you otherwise would get, when
you are doing LRP.

But, you do get *some* benefit from doing it anyway, even
if your ether input processing is light: so long as it is
non-zero, you get benefit.

Note that LRP itself is not a panacea for livelock, since
it just moves the scheduling problem from the IRQ-NETISR
scheduling into the NETISR-process scheduling.  You end
up needing to implement weighted fair share or other code
to ensure that the user space process is permitted to run,
so you end up monitoring queue depth or something else,
and deciding not to reenable interrupts on the card until
you hit a low water mark, indicating processing has taken
place (see the papers by Druschel et al. and Floyd et al.).
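
Conceptually, that low-water-mark gate looks something like this (not
code from the LRP papers or from any FreeBSD driver; every name here
is invented, and schedule_deferred_poll() stands in for whatever
deferred mechanism an implementation would actually use):

static void
rx_service_done(struct softc *sc)
{
        if (sc->proto_q_len > sc->rx_lowat) {
                /* Consumer still has a backlog: keep the card's
                 * interrupts masked and arrange to poll again later,
                 * so user-level processing gets some cycles. */
                sc->intr_masked = 1;
                schedule_deferred_poll(sc);     /* invented */
        } else {
                /* Backlog drained below the low-water mark: safe to
                 * let the hardware interrupt us again. */
                sc->intr_masked = 0;
                hw_unmask_intr(sc);             /* invented */
        }
}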


   It isn't terribly clear what you're doing in the patch, since it isn't a
   context diff.
 
  It's a cvs diff output.  You could always check out a sys
  tree, apply it, and then cvs diff -c (or -u or whatever your
  favorite option is) to get a diff more to your tastes.
 
 As Peter Wemm pointed out, we can't use non-context diffs safely without
 the exact time, date and branch of the source files.  This introduces an
 additional burden 

Re: Why do soft interrupt coelescing?

2001-10-09 Thread Kenneth D. Merry

On Tue, Oct 09, 2001 at 12:28:02 -0700, Terry Lambert wrote:
 Kenneth D. Merry wrote:
 [ ... soft interrupt coalescing ... ]
 
  As you say above, this is actually a good thing.  I don't see how this ties
  into the patch to introduce some sort of interrupt coalescing into the
  ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
  on the board to do what you want.
 
 I have tweaked all tunables on the board, and I have not gotten
 anywhere near the increased performance.
 
 The limit on how far you can push this is based on how much
 RAM you can have on the card, and the limits to coalescing.
 
 Here's the reason: when you receive packets to the board, they
 get DMA'ed into the ring.  No matter how large the ring, it
 won't matter, if the ring is not being emptied asynchronously
 relative to it being filled.
 
 In the case of a full-on receiver livelock situation, the ring
 contents will be continuously overwritten.  This is actually
 what happens when you put a ti card into a machine with a
 slower processor, and hit it hard.

eh?  The card won't write past the point that has been acked by the kernel.
If the kernel hasn't acked the packets and one of the receive rings fills
up, the card will hold off on sending packets up to the kernel.

 In the case of interrupt processing, where you jam the data up
 through ether input at interrupt time, the buffer will be able
 to potentially overrun, as well.  Admittedly, you can spend a
 huge percentage of your CPU time in interrupt processing, and
 if your CPU is fast enough, unload the queue very quickly.
 
 But if you then look at doing this for multiple gigabit cards
 at the same time, you quickly reach the limits... and you
 spend so much of your time in interrupt processing, that you
 spend none running NETISR.

I agree that you can end up spending large portions of your time doing
interrupt processing, but I haven't seen instances of buffer overrun, at
least not in the kernel.  The case where you'll see a buffer overrun, at
least with the ti(4) driver, is when you have a sender that's faster than
the receiver.

So the receiver can't process the data in time and the card just drops
packets.

That's a different situation from the card spamming the receive ring over
and over again, which is what you're describing.  I've never seen that
happen, and if it does actually happen, I'd be interested in seeing
evidence.

 So you have moved your livelock up one layer.
 
 
 In any case, doing the coalescing on the board delays the
 packet processing until that number of packets has been
 received, or a timer expires.  The timer latency must be
 increased proportionally to the maximum number of packets
 that you coalesce into a single interrupt.
 
 In other words, you do not interleave your I/O when you
 do this, and the bursty conditions that result in your
 coalescing window ending up full or close to full are
 the conditions under which you should be attempting the
 maximum concurrency you can possibly attain.
 
 Basically, in any case where the load is high enough to
 trigger the hardware coalescing, the ring would need to
 be the next power of two larger to ensure that the end
 does not overwrite the beginning of the ring.
 
 In practice, the firmware on the card does not support
 this, so what you do instead is push a couple of packets
 that may have been corrupted through DMA occurring during
 the fact -- in other words, you drop packets.
 
 This is arguably correct, in that it permits you to shed
 load, _but_ the DMAs still occur into your rings; it would
 be much better if the load were shed by the card firmware,
 based on some knowledge of ring depth instead (RED Queueing),
 since this would leave the bus clear for other traffic (e.g.
 communication with main memory to provide network content for
 the cards for, e.g., an NFS server, etc.).
 
 Without hacking firmware, the best you can do is to ensure
 that you process as much of all the traffic as you possibly
 can, and that means avoiding livelock.

Uhh, the Tigon firmware *does* drop packets when there is no more room in
the proper receive ring on the host side.  It doesn't spam things.

What gives you that idea?  You've really got some strange ideas about what
goes on with that board.  Why would someone design firmware so obviously
broken?

 [ ... LRP ... ]
 
  That sounds cool, but I still don't see how this ties into the patch you
  sent out.
 
 OK.  LRP removes NETISR entirely.
 
 This is the approach Van Jacobson stated he used in his
 mythical TCP/IP stack, which we may never see.
 
 What this does is push the stack processing down to the
 interrupt time for the hardware interrupt.  This is a
 good idea, in that it avoids the livelock for the NETISR
 never running because you are too busy taking hardware
 interrupts to be able to do any stack processing.
 
 The way this ties into the patch is that doing the stack
 processing at interrupt time increases the per ether
 input 

Re: Why do soft interrupt coelescing?

2001-10-08 Thread Kenneth D. Merry

On Sun, Oct 07, 2001 at 00:56:44 -0700, Terry Lambert wrote:
 Kenneth D. Merry wrote:
  [ I don't particularly want to get involved in this thread...but... ]
  
  Can you explain why the ti(4) driver needs a coalescing patch?  It already
  has in-firmware coalescing parameters that are tuneable by the user.  It
  also already processes all outstanding BDs in ti_rxeof() and ti_txeof().
 
 
 The answer to your question is that the card will continue to DMA
 into the ring buffer, even though you are in the middle of the
 interrupt service routine, and that the amount of time taken in
 ether input is long enough that you can have more packets come in
 while you are processing (this is actually a good thing).
 
 This is even *more* likely with hardware interrupt coalescing,
 since the default setting is to coalesce 32 packets into a
 single interrupt, meaning that you have up to 32 iterations of
 ether input to call, and thus the amount of time spent processing
 them actually affords *more* time for additional packets to come
 in.

As you say above, this is actually a good thing.  I don't see how this ties
into the patch to introduce some sort of interrupt coalescing into the
ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
on the board to do what you want.

 In my own personal situation, I have also implemented Lazy
 Receiver Processing (per the research done by Rice University
 and in the Click Router project; no relation to ClickArray),
 which does all stack processing at the hardware interrupt, rather
 than queueing between the hardware interrupt and NETISR, so my
 processing path is actually longer; I get more benefit from the
 change than you would, but on a heavily loaded system, you would
 also get some benefit, if you were able to load the wire heavily
 enough.
 
 The LRP implementation should be considered by FreeBSD as well,
 since it takes the connection rate from ~7,000/second up to
 ~23,000/second, by avoiding the NetISR.  Rice University did
 an implementation in 2.2.x, and then another one (using resource
 containers -- I recommend against this one, not only because of
 license issues with the second implementation) for 4.2; both
 sets of research were done in FreeBSD.  Unfortunately, neither
 implementation was production quality (among other things, they
 broke RFC 1323, and they have to run a complete duplicate stack
 as a different protocol family because some of their assumptions
 make it non-interoperable with other protocol stacks).

That sounds cool, but I still don't see how this ties into the patch you
sent out.

  It isn't terribly clear what you're doing in the patch, since it isn't a
  context diff.
 
 It's a cvs diff output.  You could always check out a sys
 tree, apply it, and then cvs diff -c (or -u or whatever your
 favorite option is) to get a diff more to your tastes.

As Peter Wemm pointed out, we can't use non-context diffs safely without
the exact time, date and branch of the source files.  This introduces an
additional burden for no real reason other than you neglected to use -c or
-u with cvs diff.

  You also never gave any details behind your statement last week:
  Because at the time the Tigon II was released, the jumbogram
  wire format had not solidified.  Therefore cards built during
  that time used different wire data for the jumbogram framing.
  
  I asked, in response:
  
  Can you give more details?  Did someone decide on a different ethertype
  than 0x8870 or something?
  
  That's really the only thing that's different between a standard ethernet
  frame and a jumbo frame.  (other than the size)
 
 I believe it was the implementation of the length field.  I
 would have to get more information from the person who did
 the interoperability testing for the autonegotiation (which
 failed between the Tigon II and the Intel Gigabit cards).  I
 can assure you anecdotally, however, that autonegotiation
 _did_ fail.

I would believe that autonegotiation (i.e. 10/100/1000) might fail,
especially if you're using 1000BaseT Tigon II boards.  However, I would
like more details on the failure.  It's entirely possible that it could be
fixed in the firmware, probably without too much trouble.

I find it somewhat hard to believe that Intel would ship a gigabit board
that didn't interoperate with the board that, until recently, was probably
the predominant gigabit board out there.

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]

Re: Why do soft interrupt coelescing?

2001-10-08 Thread Alfred Perlstein

* Kenneth D. Merry [EMAIL PROTECTED] [011009 00:11] wrote:
 
 As you say above, this is actually a good thing.  I don't see how this ties
 into the patch to introduce some sort of interrupt coalescing into the
 ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
 on the board to do what you want.

No matter how hard you tweak the board, an interrupt may still
trigger while you process a hardware interrupt; this causes an
additional poll, which can cause additional coalescing.

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology,
start asking why software is ignoring 30 years of accumulated wisdom.'

Re: Why do soft interrupt coelescing?

2001-10-08 Thread Mike Smith

 * Kenneth D. Merry [EMAIL PROTECTED] [011009 00:11] wrote:
  
  As you say above, this is actually a good thing.  I don't see how this ties
  into the patch to introduce some sort of interrupt coalescing into the
  ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
  on the board to do what you want.
 
 No matter how hard you tweak the board, an interrupt may still
 trigger while you process a hardware interrupt; this causes an
 additional poll, which can cause additional coalescing.

I don't think I understand what sort of crack you are smoking.

If an interrupt-worthy condition is asserted on the board, you aren't 
going to leave your typical interrupt handler anyway; this sort of 
coalescing already happens without any help.

-- 
... every activity meets with opposition, everyone who acts has his
rivals and unfortunately opponents also.  But not because people want
to be opponents, rather because the tasks and relationships force
people to take different points of view.  [Dr. Fritz Todt]
   V I C T O R Y   N O T   V E N G E A N C E



Re: Why do soft interrupt coelescing?

2001-10-08 Thread Kenneth D. Merry

On Tue, Oct 09, 2001 at 00:18:57 -0500, Alfred Perlstein wrote:
 * Kenneth D. Merry [EMAIL PROTECTED] [011009 00:11] wrote:
  
  As you say above, this is actually a good thing.  I don't see how this ties
  into the patch to introduce some sort of interrupt coalescing into the
  ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
  on the board to do what you want.
 
 No matter how hard you tweak the board, an interrupt may still
 trigger while you process a hardware interrupt; this causes an
 additional poll, which can cause additional coalescing.

At least in the case of the Tigon, it won't interrupt while there is a '1'
written in mailbox 0.  (This happens in ti_intr().)
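
For reference, the shape of that code is roughly the following
(paraphrased from memory, so check if_ti.c / if_tireg.h for the exact
register and function names rather than trusting this sketch):

static void
ti_intr_sketch(void *xsc)
{
        struct ti_softc *sc = xsc;

        /* Writing 1 to mailbox 0 tells the Tigon to hold off on
         * further host interrupts while we work. */
        CSR_WRITE_4(sc, TI_MB_HOSTINTR, 1);

        ti_rxeof(sc);           /* drain the RX return ring */
        ti_txeof(sc);           /* reclaim completed TX descriptors */

        /* Clearing the mailbox re-enables interrupts. */
        CSR_WRITE_4(sc, TI_MB_HOSTINTR, 0);
}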

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]

Re: Why do soft interrupt coelescing?

2001-10-08 Thread Alfred Perlstein

* Mike Smith [EMAIL PROTECTED] [011009 00:25] wrote:
  * Kenneth D. Merry [EMAIL PROTECTED] [011009 00:11] wrote:
   
   As you say above, this is actually a good thing.  I don't see how this ties
   into the patch to introduce some sort of interrupt coalescing into the
   ti(4) driver.   IMO, you should be able to tweak the coalescing parameters
   on the board to do what you want.
  
  No matter how hard you tweak the board, an interrupt may still
  trigger while you process a hardware interrupt; this causes an
  additional poll, which can cause additional coalescing.
 
 I don't think I understand what sort of crack you are smoking.
 
 If an interrupt-worthy condition is asserted on the board, you aren't 
 going to leave your typical interrupt handler anyway; this sort of 
 coalescing already happens without any help.

After talking to you on IRC it's become obvious that this doesn't
exactly happen without help.  It's more of a side effect of the
way _some_ of the drivers are written.

What I understand from talking to you:
  Smarter or higher-performance drivers will check whether the
  tx/rx rings have been modified by the hardware and will consume
  those packets as well.

However, most drivers have code like this:

if (ifp->if_flags & IFF_RUNNING) {
        /* Check RX return ring producer/consumer */
        ti_rxeof(sc);

        /* Check TX ring producer/consumer */
        ti_txeof(sc);
}

Now if more packets come in while in ti_txeof() it seems that
you'll need to take an additional interrupt to get at them.
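
My reading of what the soft coalescing change amounts to is a bounded
re-poll before returning from the handler, something like this (a
sketch only -- the loop bound and the more_rx_work() test are
illustrative, not lifted from Terry's patch):

int pass = 0;

CSR_WRITE_4(sc, TI_MB_HOSTINTR, 1);     /* mask further interrupts */
do {
        if (ifp->if_flags & IFF_RUNNING) {
                ti_rxeof(sc);           /* RX return ring */
                ti_txeof(sc);           /* TX ring */
        }
} while (more_rx_work(sc) && ++pass < MAX_SOFT_COALESCE_PASSES);
CSR_WRITE_4(sc, TI_MB_HOSTINTR, 0);     /* unmask */

That way packets that arrive while ti_txeof() is running get picked up
on the next pass instead of costing another interrupt.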

So Terry's code isn't wrong, but it's not as amazing as one
would initially think; it just avoids a race that can happen
while transmitting packets and more arrive, or while
receiving packets and the transmit queue drains.

Now, when one is doing a lot more work in the interrupt context
(or perhaps just running on a slower host processor), Terry's
patches make a lot more sense as there's a much larger window
available for this race.

The fact that receiving is done before transmitting (at least in
the 'ti' driver) makes it an even smaller race as you're less likely
to be performing a lengthy operation inside the tx routine than if
you were doing some magic in the rx routine with incoming packets.

Or at least that's how it seems to me.

Either way, no need to get your latex in a bunch Mike. :-)

-- 
-Alfred Perlstein [[EMAIL PROTECTED]]
'Instead of asking why a piece of software is using 1970s technology,
start asking why software is ignoring 30 years of accumulated wisdom.'

Re: Why do soft interrupt coelescing?

2001-10-07 Thread Terry Lambert

Kenneth D. Merry wrote:
 [ I don't particularly want to get involved in this thread...but... ]
 
 Can you explain why the ti(4) driver needs a coalescing patch?  It already
 has in-firmware coalescing parameters that are tuneable by the user.  It
 also already processes all outstanding BDs in ti_rxeof() and ti_txeof().


The answer to your question is that the card will continue to DMA
into the ring buffer, even though you are in the middle of the
interrupt service routine, and that the amount of time taken in
ether input is long enough that you can have more packets come in
while you are processing (this is actually a good thing).

This is even *more* likely with hardware interrupt coalescing,
since the default setting is to coalesce 32 packets into a
single interrupt, meaning that you have up to 32 iterations of
ether input to call, and thus the amount of time spent processing
them actually affords *more* time for additional packets to come
in.

In my own personal situation, I have also implemented Lazy
Receiver Processing (per the research done by Rice University
and in the Click Router project; no relation to ClickArray),
which does all stack processing at the hardware interrupt, rather
than queueing between the hardware interrupt and NETISR, so my
processing path is actually longer; I get more benefit from the
change than you would, but on a heavily loaded system, you would
also get some benefit, if you were able to load the wire heavily
enough.

The LRP implementation should be considered by FreeBSD as well,
since it takes the connection rate from ~7,000/second up to
~23,000/second, by avoiding the NetISR.  Rice University did
an implementation in 2.2.x, and then another one (using resource
containers -- I recommend against this one, not only because of
license issues with the second implementation) for 4.2; both
sets of research were done in FreeBSD.  Unfortunately, neither
implementation was production quality (among other things, they
broke RFC 1323, and they have to run a complete duplicate stack
as a different protocol family because some of their assumptions
make it non-interoperable with other protocol stacks).


 It isn't terribly clear what you're doing in the patch, since it isn't a
 context diff.

It's a cvs diff output.  You could always check out a sys
tree, apply it, and then cvs diff -c (or -u or whatever your
favorite option is) to get a diff more to your tastes.


 You also never gave any details behind your statement last week:
 Because at the time the Tigon II was released, the jumbogram
 wire format had not solidified.  Therefore cards built during
 that time used different wire data for the jumbogram framing.
 
 I asked, in response:
 
 Can you give more details?  Did someone decide on a different ethertype
 than 0x8870 or something?
 
 That's really the only thing that's different between a standard ethernet
 frame and a jumbo frame.  (other than the size)

I believe it was the implementation of the length field.  I
would have to get more information from the person who did
the interoperability testing for the autonegotiation (which
failed between the Tigon II and the Intel Gigabit cards).  I
can assure you anecdotally, however, that autonegotiation
_did_ fail.

-- Terry
